I can't read the data from a CSV file into memory because it is too large, i.e. doing pandas.read_csv on the whole file won't work.
I only want to extract rows based on some column values, and those rows should fit into memory. With a pandas DataFrame df that hypothetically contained the full CSV, I would do

df.loc[df['column_name'] == 1]

The CSV file does have a header, and the columns are ordered, so I don't strictly need column_name; I could use that column's position instead if I have to.
How can I achieve this? I have read a bit about pyspark, but I don't know whether this is something it would be useful for.
You can read the CSV file chunk by chunk and retain only the rows you want:
import pandas as pd

# Read the file in chunks of 10,000 rows; only one chunk is in memory at a time.
# Note: error_bad_lines was deprecated and removed in pandas 2.0; use on_bad_lines instead.
iter_csv = pd.read_csv('sample.csv', iterator=True, chunksize=10000,
                       on_bad_lines='skip')
data = pd.concat([chunk.loc[chunk['column_name'] == 1] for chunk in iter_csv])
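If you only know the column's position rather than its name, the same chunked approach works with iloc. Below is a minimal, self-contained sketch; the inline CSV text, column names, and the position col_pos stand in for your real file and are assumptions:

```python
import io
import pandas as pd

# Hypothetical stand-in for the large sample.csv (assumption)
csv_text = "a,b,flag\n10,x,1\n20,y,0\n30,z,1\n"

col_pos = 2  # zero-based position of the column to filter on (assumption)

chunks = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    col = chunk.iloc[:, col_pos]        # select the filter column by position
    chunks.append(chunk.loc[col == 1])  # keep only the matching rows

data = pd.concat(chunks, ignore_index=True)
print(data)
```

For a real file you would pass the path instead of the StringIO object and use a much larger chunksize; only the retained rows accumulate in memory.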