Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.
In this article we cover:
Introduction to pandas
Installation
Handling large data files
The pandas library is mainly used for data manipulation and analysis. What's cool about pandas is that it takes data (such as a CSV or TSV file, or a SQL database) and creates a Python object with rows and columns, called a DataFrame, that looks very similar to a table. This is much easier to work with than processing lists and/or dictionaries through for loops or list comprehensions.
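As a small illustration (the column names here are made up), a DataFrame can be built directly from a plain dictionary, and columns can be operated on without writing loops:

```python
import pandas as pd

# Hypothetical data: two columns, two rows
data = {"name": ["Alice", "Bob"], "quantity": [3, 5]}
df = pd.DataFrame(data)
print(df)
#     name  quantity
# 0  Alice         3
# 1    Bob         5

# Column arithmetic without a for loop
print(df["quantity"].sum())  # 8
```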
Installation
python3.6 -m venv env
source env/bin/activate
pip install pandas
pip install numpy
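To confirm that the installation worked, you can import both packages and print their versions:

```shell
# Verify the installation (prints the installed versions)
python -c "import pandas; print(pandas.__version__)"
python -c "import numpy; print(numpy.__version__)"
```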
Read csv files
import pandas as pd
df = pd.read_csv('file_name.csv')
print(df)
The code above works for small CSV files, but a file with millions of rows can't be loaded into memory all at once; that's why you need chunk-based reading.
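Before switching to chunked reads, it helps to check how much memory a DataFrame actually occupies. A sketch, using a made-up single-column frame:

```python
import pandas as pd

# Hypothetical frame: one int64 column with a million rows
df = pd.DataFrame({"Quantity": range(1_000_000)})

# deep=True counts the actual bytes, including object columns
mem_mb = df.memory_usage(deep=True).sum() / 1024 ** 2
print(f"{mem_mb:.1f} MB")
```

A million int64 values alone take about 8 MB; real files with many text columns grow far faster.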
Chunk-based reading
To reduce memory usage further, convert the column datatypes using the NumPy library (for example, int64 down to int32):
import pandas as pd
import numpy as np
def data_process(path):
    chunksize = 10 ** 5
    dtype = {
        'BuyerId': np.int32,
        'BuyerZipcode': np.int32,
        'PurchaseRequestNumber': np.int32,
        'Quantity': np.int32,
        'ShippingZipcode': np.int32,
        'SupplierId': np.int32,
    }
    print("chunk size:", chunksize)
    # Read the file in chunks of `chunksize` rows instead of all at once
    for chunk in pd.read_csv(path, chunksize=chunksize, dtype=dtype):
        print(chunk)
The chunk size is always in your hands: it controls how many rows are loaded at once. Here, each iteration loads 100,000 rows, which you can process and then append to a data frame.
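Putting the last step together, a minimal sketch of appending processed chunks into one smaller DataFrame (the in-memory CSV and the `Quantity` filter are assumptions for illustration; in practice you would pass a file path to read_csv):

```python
import io
import pandas as pd

# Stand-in for a large file on disk
csv_data = io.StringIO("Quantity\n" + "\n".join(str(i) for i in range(10)))

chunks = []
for chunk in pd.read_csv(csv_data, chunksize=4):
    # Keep only the rows you need, so memory stays bounded
    chunks.append(chunk[chunk["Quantity"] > 5])

# Combine the filtered chunks into a single DataFrame
result = pd.concat(chunks, ignore_index=True)
print(len(result))  # 4
```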
Thank you . . .