How to read CSV files using pandas?
The DataFrame in pandas is used to handle two-dimensional data arranged in the tabular data structure. Pandas is a tool to analyze and manipulate the data. Large datasets can be easily handled with pandas. It is a flexible, efficient, and high performance, well suited for homogenous or heterogeneous datasets.
Pandas provide us the power to work with data from comprehensive types of resources like .csv, .tsv, excel sheets, and webpage.
Importing CSV file
The easiest way to import a CSV file is the read_csv() method. For example
# importing pandas import pandas as pd # creating data frame from .csv file dataframe = pd.read_csv("C:/Users/my/Desktop/record.csv") #Read first 5 rows of dataframe print(dataframe.head()) #Output Unnamed: 0 NAME AGE GRADE MARKS 0 1 ALEX 16 A 88 1 2 STEVE 16 C 34 2 3 JHON 17 B 66 3 4 WILEY 16 B 75 4 5 SMITH 18 A 82
read_csv() method
pandas.read_csv(filepath_or_buffer: Union[str, pathlib.Path, IO[~ AnyStr]], sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression='infer', thousands=None, decimal: str = '.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, dialect=None, error_bad_lines=True, warn_bad_lines=True, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None)
Important options available in read_csv method are,
filepath_or_buffer
This is a filename or the path to the file,
dataframe = pd.read_csv("C:/Users/my/Desktop/record.csv") #filepath
sep
We can sep field also, by default separator is ‘,’. We can customize it as,
dataframe = pd.read_csv("C:/Users/my/Desktop/record.csv", sep='\t')
usecols
We can also import the dataset in the form of a column, using the usecols field, in the read_csv() method.
# importing pandas import pandas as pd # creating data frame from .csv file, ONLY selected columns dataframe = pd.read_csv("C:/Users/my/Desktop/record.csv", usecols=['NAME','AGE']) #Read first 5 rows of dataframe print(dataframe.head()) #Output NAME AGE 0 ALEX 16 1 STEVE 16 2 JHON 17 3 WILEY 16 4 SMITH 18
header
We can specify the row that should be used as column names for the produced DataFrame. By default, the header field value is set to 0. The first row of the CSV file will specify the header. A DataFrame without header can be formed by simply setting header=None.
dataframe = pd.read_csv("C:/Users/my/Desktop/record.csv", header=None) #Without header
index_col
This field is used to specify the index column of the DataFrame. By default, it is set to None. It can be set as a column name or column index.
dataframe = pd.read_csv("C:/Users/my/Desktop/record.csv",index_col='Name') # Use 'Name' column as index
nrows
The number of rows starting from first those must be selected. It is int value.
# importing pandas import pandas as pd dataframe = pd.read_csv("C:/Users/my/Desktop/record.csv", nrows=5) print(dataframe) #Output Unnamed: 0 NAME AGE GRADE MARKS 0 1 ALEX 16 A 88 1 2 STEVE 16 C 34 2 3 JHON 17 B 66 3 4 WILEY 16 B 75 4 5 SMITH 18 A 82
na_values
The missing values are specified as NaN. If other strings are required to be considered as NaN. The expected input is a list of strings.
dataframe = pd.read_csv("C:/Users/my/Desktop/record.csv", na_values=['x','y']) # x and y values will be represented as NaN after importing into dataframe