Friday, July 3, 2020

How to read CSV file using pandas

How to read CSV files using pandas?


The DataFrame in pandas is used to handle two-dimensional data arranged in the tabular data structure. Pandas is a tool to analyze and manipulate the data. Large datasets can be easily handled with pandas. It is a flexible, efficient, and high performance, well suited for homogenous or heterogeneous datasets.

Pandas provide us the power to work with data from comprehensive types of resources like .csv, .tsv, excel sheets, and webpage.

Importing CSV file


The easiest way to import a CSV file is the read_csv() method. For example

# importing pandas
import pandas as pd

# creating data frame from .csv file
dataframe = pd.read_csv("C:/Users/my/Desktop/record.csv")
#Read first 5 rows of dataframe
print(dataframe.head())

#Output
   Unnamed: 0   NAME  AGE GRADE  MARKS
0           1   ALEX   16     A     88
1           2  STEVE   16     C     34
2           3   JHON   17     B     66
3           4  WILEY   16     B     75
4           5  SMITH   18     A     82

read_csv() method


This method reads a comma-separated values (.csv) file into DataFrame. The read_csv() method supports iterating or breaking of the file into chunks optionally.  The syntax of this method is,

pandas.read_csv(filepath_or_buffer: Union[str, pathlib.Path, IO[~ AnyStr]],
                  sep=',', delimiter=None, header='infer', names=None, 
                  index_col=None, usecols=None, squeeze=False, prefix=None, 
                  mangle_dupe_cols=True, dtype=None, engine=None, converters=None, 
                  true_values=None, false_values=None, skipinitialspace=False, 
                  skiprows=None, skipfooter=0, nrows=None, na_values=None, 
                  keep_default_na=True, na_filter=True, verbose=False, 
                  skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, 
                  keep_date_col=False, date_parser=None, dayfirst=False, 
                  cache_dates=True, iterator=False, chunksize=None, compression='infer', 
                  thousands=None, decimal: str = '.', lineterminator=None, quotechar='"',
                  quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, 
                  dialect=None, error_bad_lines=True, warn_bad_lines=True, 
                  delim_whitespace=False, low_memory=True, memory_map=False, 
                  float_precision=None)


Important options available in read_csv method are,

filepath_or_buffer


This is a filename or the path to the file,

dataframe = pd.read_csv("C:/Users/my/Desktop/record.csv") #filepath


sep


We can sep field also, by default separator is ‘,’. We can customize it as,

dataframe = pd.read_csv("C:/Users/my/Desktop/record.csv", sep='\t')

usecols


We can also import the dataset in the form of a column, using the usecols field, in the read_csv() method.

# importing pandas
import pandas as pd

# creating data frame from .csv file, ONLY selected columns
dataframe = pd.read_csv("C:/Users/my/Desktop/record.csv", usecols=['NAME','AGE'])
#Read first 5 rows of dataframe
print(dataframe.head())

#Output
    NAME  AGE
0   ALEX   16
1  STEVE   16
2   JHON   17
3  WILEY   16
4  SMITH   18

header 


We can specify the row that should be used as column names for the produced DataFrame. By default, the header field value is set to 0. The first row of the CSV file will specify the header. A DataFrame without header can be formed by simply setting header=None.

dataframe = pd.read_csv("C:/Users/my/Desktop/record.csv", header=None) #Without header

index_col


This field is used to specify the index column of the DataFrame. By default, it is set to None. It can be set as a column name or column index.

dataframe = pd.read_csv("C:/Users/my/Desktop/record.csv",index_col='Name') 
# Use 'Name' column as index

nrows


The number of rows starting from first those must be selected. It is int value.

# importing pandas
import pandas as pd
dataframe = pd.read_csv("C:/Users/my/Desktop/record.csv", nrows=5)
print(dataframe)

#Output
   Unnamed: 0   NAME  AGE GRADE  MARKS
0           1   ALEX   16     A     88
1           2  STEVE   16     C     34
2           3   JHON   17     B     66
3           4  WILEY   16     B     75
4           5  SMITH   18     A     82

na_values 


The missing values are specified as NaN. If other strings are required to be considered as NaN. The expected input is a list of strings.

dataframe = pd.read_csv("C:/Users/my/Desktop/record.csv", na_values=['x','y']) 
# x and y values will be represented as NaN after importing into dataframe