
How to read only a slice of data stored in a big csv file in python

I can't read the data from a CSV file into memory because it is too large, i.e. loading it with pandas.read_csv won't work.

I only want to extract the rows where certain column values match, and those rows should fit into memory. With a pandas DataFrame df that hypothetically held the full CSV, I would do

df.loc[df['column_name'] == 1]

The CSV file does contain a header row, and the columns are in a known order, so I don't strictly need column_name; I could use that column's position instead if I had to.

How can I achieve this? I have read a bit about pyspark, but I don't know whether it would be useful here.
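As an aside, the header row mentioned above can be read on its own to map a column position back to its name, without loading any of the data. A minimal sketch using the standard csv module; the file name sample.csv and its contents are hypothetical stand-ins for the large file:

```python
import csv

# Write a tiny CSV stand-in for the large file (hypothetical data).
with open('sample.csv', 'w', newline='') as f:
    f.write('column_name,value\n1,a\n2,b\n')

# Read only the header row; the body of the file is never loaded.
with open('sample.csv', newline='') as f:
    header = next(csv.reader(f))

column_name = header[0]  # first column, selected by position
```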

You can read the CSV file chunk by chunk and keep only the rows you want:

import pandas as pd

# error_bad_lines=False is deprecated since pandas 1.3; use on_bad_lines='skip' instead
iter_csv = pd.read_csv('sample.csv', iterator=True, chunksize=10000)
data = pd.concat([chunk.loc[chunk['column_name'] == 1] for chunk in iter_csv])
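To make the chunked approach concrete, here is a self-contained sketch that builds a small stand-in file and filters it in chunks; the file name sample.csv, the data, and the tiny chunksize are all hypothetical, chosen only to demonstrate the technique:

```python
import pandas as pd

# Build a small CSV to stand in for the large file (hypothetical data).
pd.DataFrame({'column_name': [1, 2, 1, 3],
              'value': ['a', 'b', 'c', 'd']}).to_csv('sample.csv', index=False)

# Stream the file in chunks and keep only the matching rows; peak memory
# use is bounded by chunksize plus the size of the filtered result.
iter_csv = pd.read_csv('sample.csv', iterator=True, chunksize=2)
data = pd.concat([chunk.loc[chunk['column_name'] == 1] for chunk in iter_csv])
```

With the sample data above, data contains the two rows whose column_name equals 1.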
