
Pandas dataframe groupby duplicates memory use

I read data from a CSV file. It takes roughly 5 GB of RAM (judging by the Jupyter notebook memory usage figure and by Linux htop).

import pandas as pd

df = pd.read_csv(r'~/data/a.txt', usecols=[0, 1, 5, 15, 16])
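For reference, the frame's actual in-memory size (including object/string columns) can be checked with something like this, using the df loaded above:

print(df.memory_usage(deep=True).sum() / 1024**3)   # total size of df in GiB; deep=True counts string data too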

Then I group it, modify the resulting dataframes, and delete df.

df = df.set_index('Date')        # set_index returns a new frame, so keep the result
y = df.groupby('Date')           # 'Date' is now the index name, so this groups by that level

days = [(key, value) for key, value in y]   # this materialises a copy of every group

del df

for day in days:
    # set_index returns a new frame unless inplace=True is used; replacing the
    # Date index with Time also discards the Date values, so no separate del is needed
    day[1].set_index('Time', inplace=True)

At this point I would expect groupby to roughly double the memory use, and del df to then release half of it again. But in fact it is using 9 GB.

How can I split the dataframe by date without duplicating the memory use?
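One way to check where the extra memory lives: iterating a groupby yields copies of the data, not views, so something along these lines (using the days list built above) should come out close to the size of the original df:

total = sum(frame.memory_usage(deep=True).sum() for _, frame in days)
print(total / 1024**3)   # combined size of the per-day copies, in GiB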

EDIT: since it appears that Python does not release memory back to the OS, I had to use the Python memory_profiler package to find out the actual memory use:

print(memory_profiler.memory_usage()[0])

407 << memory use (MiB) before loading

df = pd.read_csv

4362 << after reading the csv

groupby and create the days list

6351 << after building days

df = None
gc.collect()

6351 << unchanged
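For anyone reproducing these checkpoints, here is a minimal sketch (the rss_mib helper is just a name made up for this sketch; the file and column selection are the ones from the question):

import gc

import memory_profiler
import pandas as pd

def rss_mib():
    # resident memory of the current process, in MiB (single sample)
    return memory_profiler.memory_usage()[0]

print(rss_mib())                                              # baseline
df = pd.read_csv(r'~/data/a.txt', usecols=[0, 1, 5, 15, 16])
print(rss_mib())                                              # after reading the csv
days = [(key, value) for key, value in df.groupby('Date')]
print(rss_mib())                                              # after materialising the groups
df = None
gc.collect()
print(rss_mib())                                              # unchanged: the copies in days keep the data alive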

Try this: instead of grouping by date, you can create a separate df for every date:

unique_date = df["Date"].unique()
days = []
for date in unique_date:
    # boolean-mask selection returns an independent copy for each date
    days.append(df[df["Date"] == date].set_index("Time"))
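If the goal is to also reclaim df afterwards, a possible follow-up to the snippet above (each mask selection already produced an independent copy, so df is no longer needed):

import gc

del df
gc.collect()   # the per-date frames in days are copies, so df's blocks become collectable

Note that while the loop runs, df and the copies coexist, so peak usage is still roughly double the size of the data; the saving only shows up once df is dropped.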
