Reading large CSV files in Pandas

Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

In this article, we cover:

  1. Introduction to pandas.

  2. Installation.

  3. Handling large data files.

The Pandas library is mainly used for data manipulation and analysis. What's cool about Pandas is that it takes data (like a CSV or TSV file, or a SQL database) and creates a Python object with rows and columns called a DataFrame, which looks very similar to a table. It is much easier to work with than lists and/or dictionaries processed through for loops or list comprehensions, as the sketch below shows.
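
For example, here is a minimal sketch (the order records are made up for illustration) of how a filter that would need a loop over a list of dictionaries becomes a single DataFrame expression:

import pandas as pd

# Made-up records for illustration
orders = [
    {"buyer": "A", "quantity": 3},
    {"buyer": "B", "quantity": 7},
    {"buyer": "A", "quantity": 2},
]

# With plain lists you would write a loop or comprehension;
# a DataFrame expresses the same filter in one line
df = pd.DataFrame(orders)
print(df[df["quantity"] > 2])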

Installation

python3.6 -m venv env
source env/bin/activate
pip install pandas
pip install numpy
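
After installing, you can quickly confirm that both packages import correctly and check their versions:

import pandas as pd
import numpy as np

# Print the installed versions to confirm the setup works
print(pd.__version__)
print(np.__version__)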

Read CSV files

import pandas as pd
df = pd.read_csv('file_name.csv')  # loads the entire file into memory
print(df)

For small CSV files the above code works fine, but for data files with millions of rows, the whole file can't be loaded into memory at once. That's why you need chunk-based reading. You can see the problem coming by inspecting a sample of the file first, as in the sketch below.
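
A rough sketch of that check: read only the first few thousand rows and look at the memory footprint, then extrapolate to the full row count.

import pandas as pd

# Read just the first 1,000 rows as a cheap sample
sample = pd.read_csv('file_name.csv', nrows=1000)

# 'deep' counts the real memory used by object (string) columns
sample.info(memory_usage='deep')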

Chunk-based reading

To reduce memory usage, you can also tell Pandas the datatype of each column up front using NumPy types, so it doesn't default to wider 64-bit types:

import pandas as pd
import numpy as np


def data_process(path):
    chunksize = 10 ** 5  # 100,000 rows per chunk
    dtype = {
        'BuyerId': np.int32,
        'BuyerZipcode': np.int32,
        'PurchaseRequestNumber': np.int32,
        'Quantity': np.int32,
        'ShippingZipcode': np.int32,
        'SupplierId': np.int32,
    }
    print("chunk size:", chunksize)
    # Each iteration yields one DataFrame of up to `chunksize` rows
    for chunk in pd.read_csv(path, chunksize=chunksize, dtype=dtype):
        print(chunk)
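
To see what the dtype map buys you, a rough sketch like this compares a default read against a typed read on a sample of the same file (the column names assume the dtype map above):

import pandas as pd
import numpy as np

dtype = {'BuyerId': np.int32, 'Quantity': np.int32, 'SupplierId': np.int32}

default = pd.read_csv('file_name.csv', nrows=100000)
typed = pd.read_csv('file_name.csv', nrows=100000, dtype=dtype)

# Total bytes used by each version; the typed read should be smaller
print(default.memory_usage(deep=True).sum())
print(typed.memory_usage(deep=True).sum())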

The chunk size is entirely in your hands: it controls how many rows are loaded into memory at once before you process them and move on to the next chunk.

Each iteration loads 100,000 rows (10 ** 5) into a DataFrame chunk that you can process and then combine with the results from earlier chunks, as the sketch below shows.
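
Putting it together, a common pattern (sketched here with the Quantity and SupplierId columns from the dtype map above) is to reduce each chunk to a small result and combine those results at the end, so the full file never sits in memory:

import pandas as pd

chunksize = 10 ** 5
totals = []

for chunk in pd.read_csv('file_name.csv', chunksize=chunksize):
    # Reduce each 100,000-row chunk to a small per-supplier total
    totals.append(chunk.groupby('SupplierId')['Quantity'].sum())

# Combine the per-chunk totals into one final Series
result = pd.concat(totals).groupby(level=0).sum()
print(result)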

Thank you . . .