
Pandas: How to efficiently diff() after a groupby() operation?

I have a large dataset (around 8 million rows x 25 columns) in Pandas, and I am struggling to apply the diff() function in a performant manner on a subset of the data.

Here is what my dataset looks like:

                   prec   type
location_id hours             
135         78     12.0      A
            79     14.0      A
            80     14.3      A
            81     15.0      A
            82     15.0      A
            83     15.0      A
            84     15.5      A
  • I have a multi-index on [location_id, hours]. There are around 60k locations with 140 hours each (making up the ~8 million rows).
  • The rest of the data is numeric (float) or categorical. I have only included 2 columns here; normally there are around 20 columns.
  • What I want to do is apply the diff() function to the prec column for each location. The original dataset accumulates the prec numbers; by applying diff() I get the appropriate prec value for each hour (see the toy sketch further below).
  • With this in mind, I have implemented the following algorithm in Pandas:
import numpy as np

# Filter the data first
df_filtered = df_data[df_data.type == "A"]  # only work on locations with 'A' type
df_filtered = df_filtered.query('hours > 0 & hours <= 120')  # only work on certain hours

# Apply the diff() per location and write the result back into df_data
for location_id, data_of_location in df_filtered.groupby(level="location_id"):
    df_data.loc[data_of_location.index, "prec"] = data_of_location.prec.diff().replace(np.nan, 0.0)
del df_filtered

  • This works well functionally, but the performance and memory consumption are horrible. It takes around 30 minutes on my dataset, which is currently not acceptable. The existence of the for loop is an indicator that this could be handled better.
  • Is there a better/faster way to implement this?
  • Also, the overall memory consumption of the Python script skyrockets during this operation; it grows by around 300%! The memory consumed by the main df_data data frame doesn't change, but the overall process memory consumption rises.
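
To make the intended transformation concrete, here is a minimal toy sketch (a single location with made-up numbers) of how a per-location diff() turns the accumulated prec values into per-hour values:

import pandas as pd

# Toy frame shaped like the real data: accumulated precipitation per
# (location_id, hours). The numbers are invented for illustration only.
idx = pd.MultiIndex.from_product([[135], [78, 79, 80, 81]],
                                 names=["location_id", "hours"])
toy = pd.DataFrame({"prec": [12.0, 14.0, 14.3, 15.0]}, index=idx)

# diff() within each location; the first hour of each location becomes NaN
# and is filled with 0.0, matching the replace(np.nan, 0.0) above.
toy["prec"] = toy.groupby(level="location_id")["prec"].diff().fillna(0.0)
# prec is now 0.0, 2.0, 0.3, 0.7 -- per-hour amounts instead of running totals.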

With the input from @Quang Hoang and @Ben.T, I figured out a solution that is pretty fast but still consumes a lot of memory.

# Filter the data first
df_filtered = df_data[df_data.type == "A"] # only work on locations with 'A' type
df_filtered = df_filtered.query('hours > 0 & hours <= 120') # only work on certain hours

# Apply the diff()
df_diffed = df_filtered.groupby(level="location_id").prec.diff().replace(np.nan, 0.0)
df_data.loc[df_diffed.index, "prec"] = df_diffed
del df_diffed
del df_filtered

I am guessing two things could be done to improve memory usage:

  • df_filtered seems like a copy of the data; that should increase memory usage a lot.
  • df_diffed is also a copy.

The memory usage is very intensive while computing these two variables. I am not sure if there is any in-place way to execute such operations.
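
One possible way to cut down on those copies (a sketch only, not benchmarked against the real 8-million-row frame; the mask and diffed names are purely illustrative) is to build a single boolean mask and select only the prec column, so that only one column is ever materialised instead of a filtered copy of the whole frame:

hours = df_data.index.get_level_values("hours")
mask = (df_data["type"] == "A") & (hours > 0) & (hours <= 120)

# diff() per location on just the masked prec column, then write it back.
# Only one column is copied here, rather than all ~20 columns of df_filtered,
# which should lower the peak memory of the operation.
diffed = (
    df_data.loc[mask, "prec"]
    .groupby(level="location_id")
    .diff()
    .fillna(0.0)
)
df_data.loc[mask, "prec"] = diffed

Whether this actually removes the memory spike would still need to be checked with a memory profiler.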
