I'm trying to load a large bz2 file in chunks and aggregate it into a pandas DataFrame, but Python keeps crashing. The methodology I'm using is below; I've had success with it on smaller datasets. What is a more efficient way to aggregate larger-than-memory files into pandas?
Data is line delimited json compressed to bz2, taken from https://files.pushshift.io/reddit/comments/ (all publicly available reddit comments).
import pandas as pd

reader = pd.read_json('RC_2017-09.bz2', compression='bz2', lines=True, chunksize=100000)
df = pd.DataFrame()
for chunk in reader:
    # Count of comments in each subreddit
    count = chunk.groupby('subreddit').size()
    df = pd.concat([df, count], axis=0)
    df = df.groupby(df.index).sum()
reader.close()
EDIT: Python crashed when I used a chunksize of 1e5. The script worked when I increased the chunksize to 1e6.
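For what it's worth, one way to keep memory bounded with this kind of chunked count is to keep a single running Series and add each chunk's counts into it, rather than concatenating every chunk and re-grouping. This is a minimal sketch of that pattern; the synthetic in-memory chunks stand in for what `pd.read_json(..., lines=True, chunksize=...)` would yield:

```python
import pandas as pd

# Synthetic stand-ins for the chunks a JSON reader would yield.
chunks = [
    pd.DataFrame({'subreddit': ['askreddit', 'python', 'askreddit']}),
    pd.DataFrame({'subreddit': ['python', 'news']}),
]

totals = pd.Series(dtype='int64')
for chunk in chunks:
    counts = chunk.groupby('subreddit').size()
    # add() aligns on the index; fill_value=0 covers subreddits not seen yet
    totals = totals.add(counts, fill_value=0)

totals = totals.astype('int64')
print(totals.sort_index())
```

Only one Series the size of the number of distinct subreddits is ever held between chunks, so peak memory stays flat no matter how many chunks the file has.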
I used this iterator method, which worked for me without memory errors. You can try it:
import pandas as pd

chunksize = 10 ** 6
cols = ['a', 'b', 'c', 'd']
iter_csv = pd.read_csv('filename.bz2', compression='bz2', delimiter='\t',
                       usecols=cols, low_memory=False, iterator=True,
                       chunksize=chunksize, encoding='utf-8')
# replace the filter below with the work for your group by
df = pd.concat([chunk[chunk['b'] == 1012] for chunk in iter_csv])