
How to perform a check on a dataframe at the time of importing it using read_csv?

I am trying to import a .csv file in Python using pandas.read_csv. I have a requirement to check each row in the dataframe and collect the values of two specific columns into an array. As my dataframe has almost 3 million rows (~1 GB), doing this iteratively after the import takes a long time. Can I do the check while importing the file itself? Is it a good idea to modify the read_csv library function to accommodate this?

import pandas as pd

df = pd.read_csv("file.csv")

def get():
    for a in list_A:  # this list has ~2,300 items
        for b in list_B:  # this list has ~12,000 items
            # "col_a"/"col_b" stand in for the two columns of interest
            if ((df["col_a"] == a) & (df["col_b"] == b)).any():
                ...  # do something

Due to the very large size of the lists, this function runs slowly, and querying such a big dataframe also slows down execution. Any suggestions/solutions to improve the performance?
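For reference, pandas itself can run the check while the file is being imported by reading it in chunks with `read_csv(chunksize=...)` and filtering each chunk with vectorized `isin` tests. This is only a sketch: the column names `A` and `B`, the lookup sets, and the in-memory sample file are assumptions standing in for the real data.

```python
import io
import pandas as pd

# tiny in-memory stand-in for the real 1 GB file;
# replace with the path "file.csv" in practice
csv_data = io.StringIO("A,B,C\nx1,y1,1\nx9,y1,2\nx2,y2,3\n")

# hypothetical lookup values; sets give O(1) membership tests
set_A = {"x1", "x2"}
set_B = {"y1", "y2"}

pairs = []
# chunksize makes read_csv yield the file piece by piece,
# so the check happens during the import instead of after it
for chunk in pd.read_csv(csv_data, chunksize=2):
    # keep only rows whose A value and B value both hit the sets
    hits = chunk[chunk["A"].isin(set_A) & chunk["B"].isin(set_B)]
    pairs.extend(zip(hits["A"], hits["B"]))

print(pairs)  # [('x1', 'y1'), ('x2', 'y2')]
```

A realistic `chunksize` for a 3-million-row file would be on the order of 100,000 rows, which keeps memory bounded without paying per-row Python overhead.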

Python's built-in csv module reads the file line by line instead of loading it fully into memory.

Code would look something like this:

import csv

# convert the lookup lists to sets first: a membership test on a
# 2,300- or 12,000-item list is O(n), but on a set it is O(1)
set_A = set(list_A)
set_B = set(list_B)

with open('file.csv') as csvfile:
    csvreader = csv.reader(csvfile)
    for row in csvreader:
        # column indices 1 and 3 are examples; adjust to your file
        if row[1] in set_A and row[3] in set_B:
            pass  # do something with the row
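Another way to avoid the nested loops entirely is to make a single pass over the file, collect every (A, B) pair that actually occurs into a set, and then test each (a, b) combination against that set in O(1). The column indices and the sample data below are assumptions for illustration.

```python
import csv
import io

# tiny in-memory stand-in for the real file;
# replace with open("file.csv") in practice
csvfile = io.StringIO("id,A,x,B\n1,a1,9,b1\n2,a2,9,b9\n")

reader = csv.reader(csvfile)
next(reader)  # skip the header row

# one pass over the file records every (A, B) pair that occurs;
# indices 1 and 3 are placeholders for the two columns of interest
seen = set()
for row in reader:
    seen.add((row[1], row[3]))

# hypothetical list contents; each lookup is now O(1),
# so the 2,300 x 12,000 loop no longer touches the dataframe
list_A = ["a1", "a2"]
list_B = ["b1", "b2"]
found = [(a, b) for a in list_A for b in list_B if (a, b) in seen]
print(found)  # [('a1', 'b1')]
```

This turns roughly 2,300 × 12,000 dataframe scans into one file pass plus 27.6 million cheap set lookups.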
