
Optimal method to speed up Pandas Dataframe groupby aggregation

I have a large data frame price_d like this:

+---------------------------------------------------+
| date          monthEndDate  stock  volume  logRet |
+---------------------------------------------------+
| 1990-01-01    1990-01-31    A      1       NA     |
| 1990-01-02    1990-01-31    A      2       0.2    |
| 1990-02-01    1990-02-28    A      3       0.3    |
| 1990-02-02    1990-02-28    A      4       0.4    |
| ...           ...                                 |
| 1990-01-01    1990-01-31    B      1       NA     |
| 1990-01-02    1990-01-31    B      2       0.08   |
| ...           ...                                 |
| 1990-02-01    1990-02-28    B      0       0.3    |
| 1990-02-02    1990-02-28    B      3       0.4    |
| ...           ...                                 |
+---------------------------------------------------+

The length of this dataframe would be in the millions, with hundreds of distinct values in monthEndDate and thousands of distinct values in stock.

I did a groupby aggregation on volume and logRet with three self-defined functions:

import numpy as np
import pandas as pd

def varLogRet(_s):
    # variance of log returns, excluding zero entries
    return pd.Series({'varLogRet': np.var(_s.iloc[_s.to_numpy().nonzero()])})

def TotRet1M(_s):
    # total monthly return: exp(sum of log returns) - 1
    return pd.Series({'TotRet1M': np.exp(np.sum(_s)) - 1})

def avgVolume(_s):
    # average volume, excluding zero entries
    return pd.Series({'avgVolume': np.mean(_s.iloc[_s.to_numpy().nonzero()])})

return_m = price_d.groupby(['monthEndDate', 'stock']).agg({'logRet': [varLogRet, TotRet1M],
                                                           'volume': avgVolume})

The groupby aggregation takes several minutes. In my case, what's the optimal way to speed up this process? Would multiprocessing work?

You really don't need .agg when pandas built-in (and possibly optimized) functions are directly available. NaNs are ignored by default. Just compute the columns you need separately and make use of them later on.

Benchmark: 8 million rows took less than 3s to complete on my ordinary Core i5-8250U (4C8T) laptop running 64-bit Debian 10. The data is a simple repetition of what you provided.

from datetime import datetime

import numpy as np
import pandas as pd

# make a dataset of 8 million rows by repeating the sample
df = pd.read_clipboard(sep=r"\s{2,}")
df2 = df.loc[df.index.repeat(1000000)].reset_index(drop=True)

# set 0's in logRet to nan's as requested, so they are skipped by the stats below
df2.loc[df2["logRet"] == 0, "logRet"] = np.nan

t0 = datetime.now()

dfgp = df2.groupby(['monthEndDate', 'stock'])  # groupby object
# what you want
tot = np.exp(dfgp["logRet"].sum()) - 1  # exp(sum of log returns) - 1
var = dfgp["logRet"].var()  # ddof=1 by default in pandas 1.1.3
vol = dfgp["volume"].mean()

print(f"{(datetime.now() - t0).total_seconds():.2f}s elapsed...")
# 2.89s elapsed...

Then you can use these datasets as you wish, e.g. use pd.concat([tot, var, vol], axis=1) to combine them together.
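For instance, a minimal sketch of that combination, assuming the tot, var and vol series computed above; the keys are chosen here only to mirror the column names of the original aggregation:

# stitch the three per-group series into one dataframe;
# the keys become the column names of the result
result = pd.concat([tot, var, vol], axis=1,
                   keys=['TotRet1M', 'varLogRet', 'avgVolume'])
print(result.head())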

tot
Out[6]: 
monthEndDate  stock
1990-01-31    A        inf   <- will not happen with real data
              B        inf
1990-02-28    A        inf
              B        inf
Name: logRet, dtype: float64

var
Out[7]: 
monthEndDate  stock
1990-01-31    A        0.0000
              B        0.0000
1990-02-28    A        0.0025
              B        0.0025
Name: logRet, dtype: float64

vol
Out[8]: 
monthEndDate  stock
1990-01-31    A        1.5
              B        1.5
1990-02-28    A        3.5
              B        1.5
Name: volume, dtype: float64

NB Overflow on the tot part happened simply because of repetitive increment. This will not happen in real data.
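To put a number on that: after the 1,000,000-fold repetition, each (monthEndDate, stock) group sums to several hundred thousand in logRet, while float64 np.exp already overflows to inf just above exp(709.78):

import numpy as np

np.exp(709.0)   # about 8.2e307, still representable as float64
np.exp(710.0)   # inf -- exceeds the largest float64 (about 1.8e308)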
