
What is an efficient way to load and aggregate a large .bz2 file into pandas?

I'm trying to load a large bz2 file in chunks and aggregate it into a pandas DataFrame, but Python keeps crashing. The approach I'm using is below; it has worked for me on smaller datasets. What is a more efficient way to aggregate larger-than-memory files into pandas?

The data is line-delimited JSON compressed with bz2, taken from https://files.pushshift.io/reddit/comments/ (all publicly available Reddit comments).
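For anyone unfamiliar with the format: each line of the decompressed file is one complete JSON object (one comment). A quick sketch to peek at the first few records, assuming the same file name as below; this is purely for inspection, not aggregation.

import bz2
import json

# Print a couple of fields from the first five comments to confirm the
# line-delimited JSON layout.
with bz2.open('RC_2017-09.bz2', 'rt', encoding='utf-8') as f:
    for i, line in enumerate(f):
        comment = json.loads(line)
        print(comment.get('subreddit'), comment.get('author'))
        if i >= 4:
            break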

import pandas as pd
reader = pd.read_json('RC_2017-09.bz2', compression='bz2', lines=True, chunksize=100000)
df = pd.DataFrame()
for chunk in reader:
    # Count of comments in each subreddit
    count = chunk.groupby('subreddit').size()
    df = pd.concat([df, count], axis=0)
    df = df.groupby(df.index).sum()
reader.close()
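For reference, a minimal sketch of a lower-memory variant of the same aggregation: instead of concatenating and re-grouping the running DataFrame on every iteration, it keeps one small Series of per-subreddit totals and adds each chunk's counts into it (same file and column as above; not tested against the full dump).

import pandas as pd

# Keep running per-subreddit totals; each chunk is reduced to a small
# count Series before it is combined with the totals.
reader = pd.read_json('RC_2017-09.bz2', compression='bz2',
                      lines=True, chunksize=100000)
totals = pd.Series(dtype='int64')
for chunk in reader:
    counts = chunk.groupby('subreddit').size()
    totals = totals.add(counts, fill_value=0)
reader.close()

totals = totals.astype('int64').sort_values(ascending=False)
print(totals.head())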

EDIT: Python crashed when I used a chunksize of 1e5. The script worked when I increased the chunksize to 1e6.

I used this iterator method, which worked for me without memory errors. You can try it.

import pandas as pd

chunksize = 10 ** 6
cols = ['a', 'b', 'c', 'd']
iter_csv = pd.read_csv('filename.bz2', compression='bz2', delimiter='\t',
                       usecols=cols, low_memory=False, iterator=True,
                       chunksize=chunksize, encoding='utf-8')
# Do your groupby work here instead of the filter below
df = pd.concat([chunk[chunk['b'] == 1012] for chunk in iter_csv])
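As the comment above says, the filter line can be replaced with the group-by work from the question. A possible sketch, assuming a 'subreddit' column in place of the example columns a-d: each chunk is reduced to its own small count Series, and the per-chunk results are combined once at the end.

import pandas as pd

chunksize = 10 ** 6
# Hypothetical adaptation: read only the grouping column, aggregate each
# chunk immediately, then combine the small per-chunk Series in one pass.
iter_csv = pd.read_csv('filename.bz2', compression='bz2', delimiter='\t',
                       usecols=['subreddit'], iterator=True,
                       chunksize=chunksize, encoding='utf-8')
per_chunk = [chunk.groupby('subreddit').size() for chunk in iter_csv]
counts = pd.concat(per_chunk).groupby(level=0).sum()
print(counts.sort_values(ascending=False).head())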
