
Optimal method to speed up Pandas Dataframe groupby aggregation

I have a large data frame price_d like this:

+---------------------------------------------------+
| date          monthEndDate  stock  volume  logRet |
+---------------------------------------------------+
| 1990-01-01    1990-01-31    A      1       NA     |
| 1990-01-02    1990-01-31    A      2       0.2    |
| 1990-02-01    1990-02-28    A      3       0.3    |
| 1990-02-02    1990-02-28    A      4       0.4    |
| ...           ...                                 |
| 1990-01-01    1990-01-31    B      1       NA     |
| 1990-01-02    1990-01-31    B      2       0.08   |
| ...           ...                                 |
| 1990-02-01    1990-02-28    B      0       0.3    |
| 1990-02-02    1990-02-28    B      3       0.4    |
| ...           ...                                 |
+---------------------------------------------------+

This dataframe has millions of rows, with hundreds of distinct values in monthEndDate and thousands of distinct values in stock.

I did a groupby aggregation on volume and logRet with three self-defined functions:

import numpy as np
import pandas as pd

def varLogRet(_s):
    # variance of the non-zero log returns within the group
    return pd.Series({'varLogRet': np.var(_s.iloc[_s.to_numpy().nonzero()])})

def TotRet1M(_s):
    # total monthly return from the summed log returns
    return pd.Series({'TotRet1M': np.exp(np.sum(_s)) - 1})

def avgVolume(_s):
    # mean of the non-zero volumes within the group
    return pd.Series({'avgVolume': np.mean(_s.iloc[_s.to_numpy().nonzero()])})

return_m = price_d.groupby(['monthEndDate', 'stock']).agg({'logRet': [varLogRet, TotRet1M],
                                                           'volume': avgVolume})

The groupby aggregation takes several minutes. What's the optimal way to speed this up in my case? Would multiprocessing work?

You really don't need .agg when pandas has built-in (and likely optimized) functions for these aggregations, and NaNs are ignored by them by default. Just compute the columns you need separately and make use of them later on.

Benchmark: 8 million rows took less than 3s to complete on my ordinary Core i5-8250U (4C8T) laptop running 64-bit Debian 10. The data is simply the sample you provided, repeated.

import numpy as np
import pandas as pd
from datetime import datetime

# make a dataset of 8 million rows by repeating the sample
df = pd.read_clipboard(sep=r"\s{2,}")
df2 = df.loc[df.index.repeat(1000000)].reset_index(drop=True)

# set 0's to nan's as requested...
df2[df2["logRet"] == 0] = np.nan

t0 = datetime.now()

dfgp = df2.groupby(['monthEndDate', 'stock'])  # groupby object
# what you want
tot = np.exp(dfgp["logRet"].sum()) - 1  # TotRet1M
var = dfgp["logRet"].var()              # varLogRet; ddof=1 by default in pandas 1.1.3
vol = dfgp["volume"].mean()             # avgVolume

print(f"{(datetime.now() - t0).total_seconds():.2f}s elapsed...")
# 2.89s elapsed...
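
Note that your avgVolume also excludes zero volumes, and the row-wise assignment above blanks entire rows where logRet is 0. If you want to mirror the nonzero() filtering per column instead, a minimal sketch (assuming zeros should simply be treated as missing within each column):

import numpy as np

# blank out zeros per column so the group keys stay intact;
# NaNs are then skipped by var()/mean(), and a zero log return
# contributes nothing to the sum anyway, so TotRet1M is unaffected
df2["logRet"] = df2["logRet"].replace(0, np.nan)
df2["volume"] = df2["volume"].replace(0, np.nan)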

Then you can use these Series however you wish, e.g. pd.concat([tot, var, vol], axis=1) to combine them into a single frame (see the sketch at the end of this answer).

tot
Out[6]: 
monthEndDate  stock
1990-01-31    A        inf   <- will not happen with real data
              B        inf
1990-02-28    A        inf
              B        inf
Name: logRet, dtype: float64

var
Out[7]: 
monthEndDate  stock
1990-01-31    A        0.0000
              B        0.0000
1990-02-28    A        0.0025
              B        0.0025
Name: logRet, dtype: float64

vol
Out[8]: 
monthEndDate  stock
1990-01-31    A        1.5
              B        1.5
1990-02-28    A        3.5
              B        1.5
Name: volume, dtype: float64

NB The overflow (inf) in tot happens only because the sample rows were repeated millions of times, so the summed log returns become huge. This will not happen with real data.
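
As mentioned above, pd.concat can then combine the three Series into one frame; a minimal sketch (the renames simply restore the column names used in the question's .agg call):

import pandas as pd

# combine the per-group results into one dataframe,
# giving each column the name used in the question
return_m = pd.concat(
    [tot.rename('TotRet1M'), var.rename('varLogRet'), vol.rename('avgVolume')],
    axis=1,
)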
