
Pandas dataframe groupby duplicates memory use

I read data from a CSV. It takes roughly 5 GB of RAM (judging by the Jupyter notebook memory-usage figure and by Linux htop).

df = pd.read_csv(r'~/data/a.txt',  usecols=[0, 1, 5, 15, 16])

Then I group it, modify the resulting dataframes, and delete df:

y = df.groupby('Date')

days = [(key, value) for key, value in y]

del df

for i, (key, day) in enumerate(days):
    # set_index returns a new frame unless inplace=True, so rebind it;
    # tuples are immutable, so store the modified frame back in the list
    day = day.set_index('Time')
    del day['Date']
    days[i] = (key, day)

At this point I would expect groupby to double the memory, but then del df to release half of it. In fact it is using 9 GB.

How can I split a dataframe by date without doubling memory use?
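For reference, one common way to avoid holding every group's copy at once is to iterate the groupby lazily rather than materializing it into a list. A minimal sketch on toy data (column names are from the question; the values are invented):

```python
import pandas as pd

# Toy data standing in for the CSV (column names from the question, values invented).
df = pd.DataFrame({
    "Date": ["2020-01-01", "2020-01-01", "2020-01-02"],
    "Time": ["09:00", "09:01", "09:00"],
    "Price": [1.0, 2.0, 3.0],
})

# Iterating the groupby lazily keeps only one day's copy alive at a time,
# instead of building a list that duplicates the whole frame up front.
for date, day in df.groupby("Date"):
    day = day.set_index("Time")
    del day["Date"]
    # ... process `day` here; its copy can be garbage-collected
    # before the next iteration begins
```

This trades random access to the groups for a much smaller peak footprint, which only helps if each day can be processed independently.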

EDIT: since it appears that Python does not release memory back to the OS, I had to use python memory_profiler to find the actual memory use:

print(memory_profiler.memory_usage()[0])

407 << mem use (MiB)

df = pd.read_csv

4362 <<

groupby and create days list

6351 <<

df = None
gc.collect()

6351 <<
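The effect noted in the EDIT, where a process-level figure like htop's RSS stays flat even after objects are freed, can also be reproduced with the stdlib tracemalloc module, which tracks interpreter-level allocations instead of process memory. A minimal sketch (this swaps in tracemalloc for the memory_profiler used above, and a plain list stands in for the DataFrame):

```python
import gc
import tracemalloc

tracemalloc.start()

# Allocate a large object, standing in for the DataFrame from the question.
data = [0] * 10_000_000
before, _ = tracemalloc.get_traced_memory()  # current traced bytes

del data
gc.collect()
after, _ = tracemalloc.get_traced_memory()

# Interpreter-level accounting shows the release even when
# RSS-based figures (htop) do not shrink.
print(before, after)
```

Because tracemalloc counts what the interpreter has actually allocated and freed, it distinguishes "Python still holds this" from "the allocator has not returned pages to the OS".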

Try this: instead of grouping by date, you can create a df for every date:

unique_date = df["Date"].unique()
days = []
for date in unique_date:
    days.append(df[df["Date"] == date].set_index("Time"))
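A quick check of this approach on toy data (column names are from the question; the values are invented). Each boolean-mask selection produces an independent copy, so the original frame can be deleted afterwards:

```python
import pandas as pd

# Toy data standing in for the CSV (column names from the question, values invented).
df = pd.DataFrame({
    "Date": ["2020-01-01", "2020-01-02", "2020-01-02"],
    "Time": ["09:00", "09:00", "09:01"],
    "Price": [1.0, 2.0, 3.0],
})

unique_date = df["Date"].unique()
days = []
for date in unique_date:
    days.append(df[df["Date"] == date].set_index("Time"))

del df  # the per-date frames are independent copies, so df can go
```

Note that, unlike the groupby version in the question, these frames still carry the Date column; drop it with del days[i]["Date"] if it is not needed.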

