
How to perform a check on a dataframe at the time of importing it using read_csv?

I am trying to import a .csv file in Python using pandas.read_csv. I have a requirement to check each row in the dataframe and collect the values of two specific columns into an array. As my dataframe has almost 3 million rows (~1 GB), doing this iteratively after the import takes a long time. Can I do the check while importing the file itself? Is it a good idea to modify the read_csv library function to accommodate this?

import pandas as pd

df = pd.read_csv("file.csv")

def get():
    for a in list_A:  # this list has ~2,300 items
        for b in list_B:  # this list has ~12,000 items
            # "col_a"/"col_b" stand in for the two columns of interest
            if ((df["col_a"] == a) & (df["col_b"] == b)).any():
                ...  # do something

Due to the very large size of the lists, this function runs slowly, and querying such a big dataframe also slows down execution. Any suggestions/solutions to improve the performance?
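For reference, pandas itself can run the check while the file is being imported by reading it in chunks with `read_csv(chunksize=...)` and filtering each chunk with vectorized `isin` tests. This is only a sketch: the column names `A` and `B`, the lookup sets, and the in-memory sample file are assumptions standing in for the real data.

```python
import io
import pandas as pd

# tiny in-memory stand-in for the real 1 GB file;
# replace with the path "file.csv" in practice
csv_data = io.StringIO("A,B,C\nx1,y1,1\nx9,y1,2\nx2,y2,3\n")

# hypothetical lookup values; sets give O(1) membership tests
set_A = {"x1", "x2"}
set_B = {"y1", "y2"}

pairs = []
# chunksize makes read_csv yield the file piece by piece,
# so the check happens during the import instead of after it
for chunk in pd.read_csv(csv_data, chunksize=2):
    # keep only rows whose A value and B value both hit the sets
    hits = chunk[chunk["A"].isin(set_A) & chunk["B"].isin(set_B)]
    pairs.extend(zip(hits["A"], hits["B"]))

print(pairs)  # [('x1', 'y1'), ('x2', 'y2')]
```

A realistic `chunksize` for a 3-million-row file would be on the order of 100,000 rows, which keeps memory bounded without paying per-row Python overhead.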

Python's built-in csv module reads the file line by line instead of loading it fully into memory.

Code would look something like this:

import csv

# convert the lookup lists to sets first: a membership test on a
# 2,300- or 12,000-item list is O(n), but on a set it is O(1)
set_A = set(list_A)
set_B = set(list_B)

with open('file.csv') as csvfile:
    csvreader = csv.reader(csvfile)
    for row in csvreader:
        # column indices 1 and 3 are examples; adjust to your file
        if row[1] in set_A and row[3] in set_B:
            pass  # do something with the row
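Another way to avoid the nested loops entirely is to make a single pass over the file, collect every (A, B) pair that actually occurs into a set, and then test each (a, b) combination against that set in O(1). The column indices and the sample data below are assumptions for illustration.

```python
import csv
import io

# tiny in-memory stand-in for the real file;
# replace with open("file.csv") in practice
csvfile = io.StringIO("id,A,x,B\n1,a1,9,b1\n2,a2,9,b9\n")

reader = csv.reader(csvfile)
next(reader)  # skip the header row

# one pass over the file records every (A, B) pair that occurs;
# indices 1 and 3 are placeholders for the two columns of interest
seen = set()
for row in reader:
    seen.add((row[1], row[3]))

# hypothetical list contents; each lookup is now O(1),
# so the 2,300 x 12,000 loop no longer touches the dataframe
list_A = ["a1", "a2"]
list_B = ["b1", "b2"]
found = [(a, b) for a in list_A for b in list_B if (a, b) in seen]
print(found)  # [('a1', 'b1')]
```

This turns roughly 2,300 × 12,000 dataframe scans into one file pass plus 27.6 million cheap set lookups.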
