
What is an efficient way to load and aggregate a large .bz2 file into pandas?

I'm trying to load a large bz2 file in chunks and aggregate it into a pandas DataFrame, but Python keeps crashing. The approach I'm using is below; it has worked for me on smaller datasets. What is a more efficient way to aggregate larger-than-memory files into pandas?

The data is line-delimited JSON compressed with bz2, taken from https://files.pushshift.io/reddit/comments/ (all publicly available Reddit comments).
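For anyone unfamiliar with the format: each line of the decompressed file is one complete JSON object (one comment). A quick sketch to peek at the first few records, assuming the same file name as below; this is purely for inspection, not aggregation.

import bz2
import json

# Print a couple of fields from the first five comments to confirm the
# line-delimited JSON layout.
with bz2.open('RC_2017-09.bz2', 'rt', encoding='utf-8') as f:
    for i, line in enumerate(f):
        comment = json.loads(line)
        print(comment.get('subreddit'), comment.get('author'))
        if i >= 4:
            break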

import pandas as pd
reader = pd.read_json('RC_2017-09.bz2', compression='bz2', lines=True, chunksize=100000)
df = pd.DataFrame()
for chunk in reader:
    # Count of comments in each subreddit
    count = chunk.groupby('subreddit').size()
    df = pd.concat([df, count], axis=0)
    df = df.groupby(df.index).sum()
reader.close()
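For reference, a minimal sketch of a lower-memory variant of the same aggregation: instead of concatenating and re-grouping the running DataFrame on every iteration, it keeps one small Series of per-subreddit totals and adds each chunk's counts into it (same file and column as above; not tested against the full dump).

import pandas as pd

# Keep running per-subreddit totals; each chunk is reduced to a small
# count Series before it is combined with the totals.
reader = pd.read_json('RC_2017-09.bz2', compression='bz2',
                      lines=True, chunksize=100000)
totals = pd.Series(dtype='int64')
for chunk in reader:
    counts = chunk.groupby('subreddit').size()
    totals = totals.add(counts, fill_value=0)
reader.close()

totals = totals.astype('int64').sort_values(ascending=False)
print(totals.head())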

EDIT: Python crashed when I used a chunksize of 1e5. The script worked when I increased the chunksize to 1e6.

I used this iterator method, which worked for me without memory errors. You can try it.

import pandas as pd

chunksize = 10 ** 6
cols = ['a', 'b', 'c', 'd']
iter_csv = pd.read_csv('filename.bz2', compression='bz2', delimiter='\t',
                       usecols=cols, low_memory=False, iterator=True,
                       chunksize=chunksize, encoding='utf-8')
# Do your groupby work here instead of the filter below
df = pd.concat([chunk[chunk['b'] == 1012] for chunk in iter_csv])
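As the comment above says, the filter line can be replaced with the group-by work from the question. A possible sketch, assuming a 'subreddit' column in place of the example columns a-d: each chunk is reduced to its own small count Series, and the per-chunk results are combined once at the end.

import pandas as pd

chunksize = 10 ** 6
# Hypothetical adaptation: read only the grouping column, aggregate each
# chunk immediately, then combine the small per-chunk Series in one pass.
iter_csv = pd.read_csv('filename.bz2', compression='bz2', delimiter='\t',
                       usecols=['subreddit'], iterator=True,
                       chunksize=chunksize, encoding='utf-8')
per_chunk = [chunk.groupby('subreddit').size() for chunk in iter_csv]
counts = pd.concat(per_chunk).groupby(level=0).sum()
print(counts.sort_values(ascending=False).head())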
