
Optimal method to speed up Pandas Dataframe groupby aggregation

I have a large data frame price_d like this:

+---------------------------------------------------+
| date          monthEndDate  stock  volume  logRet |
+---------------------------------------------------+
| 1990-01-01    1990-01-31    A      1       NA     |
| 1990-01-02    1990-01-31    A      2       0.2    |
| 1990-02-01    1990-02-28    A      3       0.3    |
| 1990-02-02    1990-02-28    A      4       0.4    |
| ...           ...                                 |
| 1990-01-01    1990-01-31    B      1       NA     |
| 1990-01-02    1990-01-31    B      2       0.08   |
| ...           ...                                 |
| 1990-02-01    1990-02-28    B      0       0.3    |
| 1990-02-02    1990-02-28    B      3       0.4    |
| ...           ...                                 |
+---------------------------------------------------+

The length of this dataframe would be in the millions, with hundreds of distinct values in monthEndDate and thousands of distinct values in stock.

I did a groupby aggregation on volume and logRet with three self-defined functions:

import numpy as np
import pandas as pd

def varLogRet(_s):
    # variance of log returns, excluding zero entries
    return pd.Series({'varLogRet': np.var(_s.iloc[_s.to_numpy().nonzero()])})

def TotRet1M(_s):
    # total monthly return: exp(sum of log returns) - 1
    return pd.Series({'TotRet1M': np.exp(np.sum(_s)) - 1})

def avgVolume(_s):
    # average volume, excluding zero entries
    return pd.Series({'avgVolume': np.mean(_s.iloc[_s.to_numpy().nonzero()])})

return_m = price_d.groupby(['monthEndDate', 'stock']).agg({'logRet': [varLogRet, TotRet1M],
                                                           'volume': avgVolume})

The groupby aggregation takes several minutes. In my case, what's the optimal way to speed up this process? Would multiprocessing work?

You really don't need .agg when pandas built-in (and possibly optimized) functions are directly available. NaNs are ignored by default. Just compute the columns you need separately and make use of them later on.

Benchmark: 8 million rows took less than 3s to complete on my ordinary Core i5-8250U (4C8T) laptop running 64-bit Debian 10. The data is a simple repetition of what you provided.

from datetime import datetime

import numpy as np
import pandas as pd

# make a dataset of 8 million rows by repeating the sample
df = pd.read_clipboard(sep=r"\s{2,}")
df2 = df.loc[df.index.repeat(1000000)].reset_index(drop=True)

# set 0's in logRet to nan's as requested, so they are skipped by the stats below
df2.loc[df2["logRet"] == 0, "logRet"] = np.nan

t0 = datetime.now()

dfgp = df2.groupby(['monthEndDate', 'stock'])  # groupby object
# what you want
tot = np.exp(dfgp["logRet"].sum()) - 1  # exp(sum of log returns) - 1
var = dfgp["logRet"].var()  # ddof=1 by default in pandas 1.1.3
vol = dfgp["volume"].mean()

print(f"{(datetime.now() - t0).total_seconds():.2f}s elapsed...")
# 2.89s elapsed...

Then you can use these datasets as you wish, e.g. use pd.concat([tot, var, vol], axis=1) to combine them together.
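For instance, a minimal sketch of that combination, assuming the tot, var and vol series computed above; the keys are chosen here only to mirror the column names of the original aggregation:

# stitch the three per-group series into one dataframe;
# the keys become the column names of the result
result = pd.concat([tot, var, vol], axis=1,
                   keys=['TotRet1M', 'varLogRet', 'avgVolume'])
print(result.head())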

tot
Out[6]: 
monthEndDate  stock
1990-01-31    A        inf   <- will not happen with real data
              B        inf
1990-02-28    A        inf
              B        inf
Name: logRet, dtype: float64

var
Out[7]: 
monthEndDate  stock
1990-01-31    A        0.0000
              B        0.0000
1990-02-28    A        0.0025
              B        0.0025
Name: logRet, dtype: float64

vol
Out[8]: 
monthEndDate  stock
1990-01-31    A        1.5
              B        1.5
1990-02-28    A        3.5
              B        1.5
Name: volume, dtype: float64

NB Overflow on the tot part happened simply because of repetitive increment. This will not happen in real data.
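To put a number on that: after the 1,000,000-fold repetition, each (monthEndDate, stock) group sums to several hundred thousand in logRet, while float64 np.exp already overflows to inf just above exp(709.78):

import numpy as np

np.exp(709.0)   # about 8.2e307, still representable as float64
np.exp(710.0)   # inf -- exceeds the largest float64 (about 1.8e308)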
