I can't read the data from a CSV file into memory because it is too large, i.e. doing pandas.read_csv on the whole file won't work.
I only want to extract rows based on some column values, and those rows should fit into memory. With a pandas DataFrame df that hypothetically contained the full CSV, I would do

df.loc[df['column_name'] == 1]

The CSV file does have a header, and the columns are ordered, so I don't strictly need column_name; I could use that column's position instead if I have to.
How can I achieve this? I have read a bit about pyspark, but I don't know whether this is something it would be useful for.
You can read the CSV file chunk by chunk and retain only the rows you want:
import pandas as pd

# Read the file in chunks of 10,000 rows; only one chunk is in memory at a time.
# Note: error_bad_lines was deprecated and removed in pandas 2.0; use on_bad_lines instead.
iter_csv = pd.read_csv('sample.csv', iterator=True, chunksize=10000,
                       on_bad_lines='skip')
data = pd.concat([chunk.loc[chunk['column_name'] == 1] for chunk in iter_csv])
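If you only know the column's position rather than its name, the same chunked approach works with iloc. Below is a minimal, self-contained sketch; the inline CSV text, column names, and the position col_pos stand in for your real file and are assumptions:

```python
import io
import pandas as pd

# Hypothetical stand-in for the large sample.csv (assumption)
csv_text = "a,b,flag\n10,x,1\n20,y,0\n30,z,1\n"

col_pos = 2  # zero-based position of the column to filter on (assumption)

chunks = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    col = chunk.iloc[:, col_pos]        # select the filter column by position
    chunks.append(chunk.loc[col == 1])  # keep only the matching rows

data = pd.concat(chunks, ignore_index=True)
print(data)
```

For a real file you would pass the path instead of the StringIO object and use a much larger chunksize; only the retained rows accumulate in memory.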