Pandas: How to efficiently diff() after a groupby() operation?
I have a large dataset (around 8 million rows x 25 columns) in Pandas, and I am struggling to apply the diff() function efficiently to a subset of the data.
Here is what my dataset looks like:
                   prec type
location_id hours
135         78     12.0    A
            79     14.0    A
            80     14.3    A
            81     15.0    A
            82     15.0    A
            83     15.0    A
            84     15.5    A
What I want to do is apply the diff() function to the prec column for each location. The original dataset accumulates the prec values; by applying diff() I will get the actual prec value for each hour.

import numpy as np

# Filter the data first
df_filtered = df_data[df_data.type == "A"] # only work on locations with 'A' type
df_filtered = df_filtered.query('hours > 0 & hours <= 120') # only work on certain hours
# Apply the diff() one location at a time and write the result back into the main frame
for location_id, data_of_location in df_filtered.groupby(level="location_id"):
    # The first row of each location has no predecessor, so its NaN is replaced by 0.0
    df_data.loc[data_of_location.index, "prec"] = data_of_location.prec.diff().replace(np.nan, 0.0)
del df_filtered
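
To make the intended result concrete, here is a minimal sketch on a toy frame. The values for location 135 are taken from the sample above; location 136 and its numbers are made up purely for illustration. It shows that a grouped diff() turns the accumulated prec values into per-hour values and restarts at each location.

```python
import pandas as pd

# Toy data: location 135 uses values from the sample above,
# location 136 is a hypothetical second location added for illustration.
index = pd.MultiIndex.from_tuples(
    [(135, 78), (135, 79), (135, 80), (136, 78), (136, 79)],
    names=["location_id", "hours"],
)
df = pd.DataFrame({"prec": [12.0, 14.0, 14.3, 5.0, 7.5]}, index=index)

# diff() within each location; the first row of every location has no
# predecessor and becomes NaN, which is replaced by 0.0 as in the question.
hourly = df.groupby(level="location_id")["prec"].diff().fillna(0.0)
print(hourly)
```

The result keeps the original index and row order, so it can be assigned back to the source frame directly.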
The memory consumed by the main df_data data frame doesn't change, but the overall process memory consumption rises.
With the input from @Quang Hoang and @Ben.T, I figured out a solution that is pretty fast but still consumes a lot of memory.
# Filter the data first
df_filtered = df_data[df_data.type == "A"] # only work on locations with 'A' type
df_filtered = df_filtered.query('hours > 0 & hours <= 120') # only work on certain hours
# Apply the diff() to all locations at once, on the filtered rows only
df_diffed = df_filtered.groupby(level="location_id").prec.diff().replace(np.nan, 0.0)
# Write the result back with .loc so the rows are selected by index label
df_data.loc[df_diffed.index, "prec"] = df_diffed
del df_diffed
del df_filtered
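
The same grouped diff can also be written against a boolean mask, so that only the prec column of the selected rows is materialised instead of a filtered copy of every column. This is a minimal sketch, not a measured fix: the function name diff_prec_in_mask and the toy data are hypothetical, and the column and index names are assumed to match the question.

```python
import pandas as pd

def diff_prec_in_mask(df_data):
    """Replace accumulated prec by per-hour prec for type 'A' rows with 0 < hours <= 120.

    Hypothetical helper; modifies df_data in place and returns it.
    """
    hours = df_data.index.get_level_values("hours")
    # Boolean mask over all rows; one byte per row, no column data is copied here
    mask = (df_data["type"] == "A").to_numpy() & (hours > 0) & (hours <= 120)
    # Only the prec values of the selected rows are copied
    prec = df_data.loc[mask, "prec"]
    diffed = prec.groupby(level="location_id").diff().fillna(0.0)
    # The grouped diff keeps the row order of prec, so positional assignment is safe
    df_data.loc[mask, "prec"] = diffed.to_numpy()
    return df_data

# Toy data, made up for illustration only
toy_index = pd.MultiIndex.from_tuples(
    [(135, 0), (135, 1), (135, 2), (135, 121), (136, 1), (136, 2), (137, 1), (137, 2)],
    names=["location_id", "hours"],
)
toy = pd.DataFrame(
    {
        "prec": [1.0, 3.0, 6.0, 10.0, 2.0, 5.0, 4.0, 4.5],
        "type": ["A", "A", "A", "A", "B", "B", "A", "A"],
    },
    index=toy_index,
)
diff_prec_in_mask(toy)
print(toy)
```

Rows outside the mask (type B, hour 0, hour 121) are left untouched. Whether this lowers peak memory on the real 8-million-row frame would have to be measured.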
I am guessing two things are driving the memory usage:
1. df_filtered seems to be a copy of the data; that should increase the memory usage a lot.
2. df_diffed is also a copy. The memory usage is very intensive while computing these two variables. I am not sure if there is any in-place way to execute such operations.
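
One possible direction for the in-place question is to skip groupby entirely and compute the differences on NumPy arrays. The sketch below swaps in a different technique from the code above and rests on an assumption: the rows must already be sorted so that each location_id forms one contiguous block. The helper name grouped_diff_sorted is hypothetical.

```python
import numpy as np

def grouped_diff_sorted(values, group_ids):
    """Difference between consecutive values, reset to 0.0 at each group start.

    Assumes rows are ordered so that every group is one contiguous run.
    """
    values = np.asarray(values, dtype=float)
    group_ids = np.asarray(group_ids)
    out = np.zeros_like(values)
    if len(values) > 1:
        # Write consecutive differences straight into the output buffer
        np.subtract(values[1:], values[:-1], out=out[1:])
        # Wherever the group id changes, the row starts a new group: reset to 0.0
        out[1:][group_ids[1:] != group_ids[:-1]] = 0.0
    return out

# Toy input, made up for illustration
result = grouped_diff_sorted(
    [12.0, 14.0, 14.3, 5.0, 7.5],
    [135, 135, 135, 136, 136],
)
print(result)
```

With a MultiIndex frame, the group ids could be taken from df_data.index.get_level_values("location_id") and the values from df_data["prec"].to_numpy(). This still allocates one output array of the same length as the input, so it reduces rather than eliminates the extra memory.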