Pandas rolling apply using multiple columns

Question

I am trying to apply a rolling function over multiple columns using pandas.DataFrame.rolling.apply(). The Python version is 3.7 and pandas is 1.0.2.

import pandas as pd

#function to calculate
def masscenter(x):
    print(x); # for debug purposes
    return 0;

#simple DF creation routine
df = pd.DataFrame( [['02:59:47.000282', 87.60, 739],
                    ['03:00:01.042391', 87.51, 10],
                    ['03:00:01.630182', 87.51, 10],
                    ['03:00:01.635150', 88.00, 792],
                    ['03:00:01.914104', 88.00, 10]], 
                   columns=['stamp', 'price','nQty'])
df['stamp'] = pd.to_datetime(df['stamp'], format='%H:%M:%S.%f')
df.set_index('stamp', inplace=True, drop=True)

'stamp' is monotonic and unique, 'price' is a double and contains no NaNs, 'nQty' is an integer and also contains no NaNs.

So, I need to calculate the rolling "center of mass", i.e. sum(price*nQty)/sum(nQty).
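For this particular formula, a ratio of sums can be split into two independent rolling sums, one over price*nQty and one over nQty, so a sketch that stays within the built-in rolling machinery is possible. The example below assumes the DataFrame from the question; the '1s' offset window and the column name 'masscenter' are chosen only for illustration. It does not generalise to functions that cannot be decomposed this way.

```python
import pandas as pd

df = pd.DataFrame(
    [['02:59:47.000282', 87.60, 739],
     ['03:00:01.042391', 87.51, 10],
     ['03:00:01.630182', 87.51, 10],
     ['03:00:01.635150', 88.00, 792],
     ['03:00:01.914104', 88.00, 10]],
    columns=['stamp', 'price', 'nQty'])
df['stamp'] = pd.to_datetime(df['stamp'], format='%H:%M:%S.%f')
df = df.set_index('stamp')

window = '1s'  # illustrative offset window, not taken from the question

# Numerator and denominator are rolled separately, then divided row by row.
weighted = (df['price'] * df['nQty']).rolling(window).sum()
total = df['nQty'].rolling(window).sum()
df['masscenter'] = weighted / total
print(df)
```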

What I have tried so far:

df.apply(masscenter, axis = 1)

masscenter is called 5 times, once per row, and the output looks like

price     87.6
nQty     739.0
Name: 1900-01-01 02:59:47.000282, dtype: float64

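This is the ideal input for masscenter, because I can easily access price and nQty using x[0], x[1]. However, I am stuck with rolling.apply(). Reading the docs for DataFrame.rolling() and rolling.apply(), I assumed that using 'axis' in rolling() and 'raw' in apply() would achieve similar behaviour. A naive approach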

rol = df.rolling(window=2)
rol.apply(masscenter)

prints the data row by row (increasing the number of rows up to the window size)

stamp
1900-01-01 02:59:47.000282    87.60
1900-01-01 03:00:01.042391    87.51
dtype: float64

then

stamp
1900-01-01 02:59:47.000282    739.0
1900-01-01 03:00:01.042391     10.0
dtype: float64

So the columns are passed to masscenter separately (as expected).

Sadly, the docs say almost nothing about 'axis'. However, the next variant was, obviously,

rol = df.rolling(window=2, axis = 1)
rol.apply(masscenter)

which never calls masscenter and raises a ValueError in rol.apply(..)

> Length of passed values is 1, index implies 5

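I admit that I am not sure about the 'axis' parameter and how it works, due to the lack of documentation. This is the first part of the question: what is going on here? How do I use 'axis' properly? What is it designed for?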
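For context, a small sketch of what rolling along axis=1 means: the window slides across the columns of each row rather than down the rows of each column, so it does not hand a multi-column row to the function. The example below reproduces that behaviour with a transpose instead of the axis argument, since, as far as I know, the axis keyword of rolling() is deprecated in later pandas releases. The frame `wide` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical frame: three columns, two rows.
wide = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# Rolling over the transposed frame slides a window of 2 across the
# original columns (a, b, c) separately for each original row.
across = wide.T.rolling(window=2).sum().T
print(across)
# Column 'a' has no predecessor, so it is NaN; 'b' holds a+b, 'c' holds b+c.
```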

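Of course, there were answers previously, namely:

How to apply a function to two columns of Pandas dataframe
It works for the whole DataFrame, not for rolling.

How to invoke pandas.rolling.apply with parameters from multiple column?
The answer suggests writing my own rolling function, but the catch for me is the same as the one raised in the comments: what if one needs to use an offset window size (e.g. '1T') for non-uniform timestamps?
I don't like the idea of reinventing the wheel from scratch. Also, I'd like to use pandas for everything, to prevent inconsistency between the sets obtained from pandas and from a "self-made roll". There is another answer to that question, suggesting to populate the dataframe separately and calculate whatever I need, but it will not work: the size of the stored data would be enormous. The same idea is presented here:
Apply-rolling-function-on-pandas-dataframe-with-multiple-arguments

Another Q&A posted here
Pandas-using-rolling-on-multiple-columns
It is good and the closest to my problem, but again, there is no possibility to use offset window sizes (window = '1T').

Some of the answers were given before pandas 1.0 came out, and given that the docs could be much better, I hope it is now possible to roll over multiple columns simultaneously.

The second part of the question is: is there any way to roll over multiple columns simultaneously using pandas 1.0.x with an offset window size?

Thank you very much.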

Answer 1

How about this:

def masscenter(ser):
    print(df.loc[ser.index])
    return 0

rol = df.price.rolling(window=2)
rol.apply(masscenter, raw=False)

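It uses the rolling logic to get subsets from an arbitrary column. The raw=False option provides you with the index values for those subsets (which are given to you as Series), then you use those index values to get multi-column slices from your original DataFrame.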
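The same trick can be sketched for the actual "center of mass" with an offset window. The example below assumes the DataFrame from the question with a unique index; the '1s' window and the variable name `mc` are illustrative only. Looking rows up by label on every call is convenient but not fast.

```python
import pandas as pd

df = pd.DataFrame(
    [['02:59:47.000282', 87.60, 739],
     ['03:00:01.042391', 87.51, 10],
     ['03:00:01.630182', 87.51, 10],
     ['03:00:01.635150', 88.00, 792],
     ['03:00:01.914104', 88.00, 10]],
    columns=['stamp', 'price', 'nQty'])
df['stamp'] = pd.to_datetime(df['stamp'], format='%H:%M:%S.%f')
df = df.set_index('stamp')

def masscenter(ser):
    # ser is one window of the 'price' column; only its index is used,
    # to pull the matching rows (all columns) from the full DataFrame.
    window = df.loc[ser.index]
    return (window['price'] * window['nQty']).sum() / window['nQty'].sum()

# raw=False is required so that masscenter receives a Series with an index.
mc = df['price'].rolling('1s').apply(masscenter, raw=False)
print(mc)
```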

Answer 2

You can use the rolling_apply function from the numpy_ext module:

import numpy as np
import pandas as pd
from numpy_ext import rolling_apply


def masscenter(price, nQty):
    return np.sum(price * nQty) / np.sum(nQty)


df = pd.DataFrame( [['02:59:47.000282', 87.60, 739],
                    ['03:00:01.042391', 87.51, 10],
                    ['03:00:01.630182', 87.51, 10],
                    ['03:00:01.635150', 88.00, 792],
                    ['03:00:01.914104', 88.00, 10]], 
                   columns=['stamp', 'price','nQty'])
df['stamp'] = pd.to_datetime(df['stamp'], format='%H:%M:%S.%f')
df.set_index('stamp', inplace=True, drop=True)

window = 2
df['y'] = rolling_apply(masscenter, window, df.price.values, df.nQty.values)
print(df)

                            price  nQty          y
stamp                                             
1900-01-01 02:59:47.000282  87.60   739        NaN
1900-01-01 03:00:01.042391  87.51    10  87.598798
1900-01-01 03:00:01.630182  87.51    10  87.510000
1900-01-01 03:00:01.635150  88.00   792  87.993890
1900-01-01 03:00:01.914104  88.00    10  88.000000
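If installing numpy_ext is not an option, a fixed-size window over several arrays can be sketched with numpy.lib.stride_tricks.sliding_window_view (available in NumPy 1.20 and later). This is a swapped-in technique, not what numpy_ext does internally; it reuses the price and nQty values from above and only handles integer window sizes.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

price = np.array([87.60, 87.51, 87.51, 88.00, 88.00])
qty = np.array([739, 10, 10, 792, 10], dtype=float)
window = 2

# Each row of p and q is one window of `window` consecutive values.
p = sliding_window_view(price, window)
q = sliding_window_view(qty, window)

# The first window-1 positions have no full window, so they stay NaN.
y = np.full(price.shape, np.nan)
y[window - 1:] = (p * q).sum(axis=1) / q.sum(axis=1)
print(y)
```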

Answer 3

So I found no way to roll over two columns with built-in pandas functions, and wrote my own instead. The code is listed below.

import numpy as np
import pandas as pd
from pandas.tseries.frequencies import to_offset

# function to find an index corresponding
# to current value minus offset value
def prevInd(series, offset, date):
    offset = to_offset(offset)
    end_date = date - offset
    end = series.index.searchsorted(end_date, side="left")
    return end

# function to find an index corresponding
# to the first value greater than current
# it is useful when one has timeseries with non-unique
# but monotonically increasing values
def nextInd(series, date):
    end = series.index.searchsorted(date, side="right")
    return end

def twoColumnsRoll(dFrame, offset, usecols, fn, columnName = 'twoColRol'):
    # find all unique indices
    uniqueIndices = dFrame.index.unique()
    numOfPoints = len(uniqueIndices)
    # prepare an output array
    moving = np.zeros(numOfPoints)
    # nameholders
    price = dFrame[usecols[0]]
    qty   = dFrame[usecols[1]]

    # iterate over unique indices
    for ii in range(numOfPoints):
        # nameholder
        pp = uniqueIndices[ii]
        # right index - value greater than current
        rInd = nextInd(dFrame,pp)
        # left index - the least value that
        # is bigger or equal than (pp - offset)
        lInd = prevInd(dFrame,offset,pp)
        # call the actual calculating function over two arrays
        # (positional slices, hence .iloc)
        moving[ii] = fn(price.iloc[lInd:rInd], qty.iloc[lInd:rInd])
    # construct and return DataFrame
    return pd.DataFrame(data=moving,index=uniqueIndices,columns=[columnName])

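This code works, but it is relatively slow and inefficient. I suppose one could use numpy.lib.stride_tricks from "How to invoke pandas.rolling.apply with parameters from multiple column?" to speed things up. However, go big or go home: I ended up writing a function in C++ and a wrapper for it.
I would rather not post this as an answer, since it is a workaround and it answers neither part of my question, but it is too long for a comment.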

Answer 4

Credit to @saninstein's excellent answer.

Install numpy_ext from: https://pypi.org/project/numpy-ext/

import numpy as np
import pandas as pd
from numpy_ext import rolling_apply as rolling_apply_ext

def box_sum(a,b):
    return np.sum(a) + np.sum(b)

df = pd.DataFrame({"x": [1,2,3,4], "y": [1,2,3,4]})

window = 2
df["sum"] = rolling_apply_ext(box_sum, window , df.x.values, df.y.values)

Output:

print(df.to_string(index=False))
 x  y  sum
 1  1  NaN
 2  2  6.0
 3  3 10.0
 4  4 14.0

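Notes

The rolling function is timeseries friendly. It always looks backwards by default, so the 6 is the sum of the current and past values in the arrays.
In the sample above, rolling_apply is imported as rolling_apply_ext so that it cannot possibly interfere with any existing calls to Pandas rolling_apply (thanks to a commenter for pointing this out).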
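The backward-looking behaviour can be checked without numpy_ext. The sketch below rebuilds the same sums from explicit trailing slices and compares them with two pandas rolling sums; the names `manual` and `builtin` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [1, 2, 3, 4]})
window = 2

# Row i sees rows i-window+1 .. i: the current row and past rows only.
manual = []
for i in range(len(df)):
    if i + 1 < window:
        manual.append(float("nan"))  # not enough history yet
    else:
        chunk = df.iloc[i + 1 - window: i + 1]
        manual.append(float(chunk["x"].sum() + chunk["y"].sum()))

# box_sum is a plain sum, so it equals the sum of two rolling sums.
builtin = df["x"].rolling(window).sum() + df["y"].rolling(window).sum()
print(manual)
print(builtin.tolist())
```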

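As a side note, I gave up trying to use Pandas. It is fundamentally broken: it handles single-column aggregate and apply with few problems, but it is an overly complex Rube Goldberg machine when you try to get it to work with two or more columns.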

Answer 5

How about this?

import numpy as np
import pandas as pd

ggg = pd.DataFrame({"a":[1,2,3,4,5,6,7], "b":[7,6,5,4,3,2,1]})

def my_rolling_apply2(df, fun, window):
    prepend = [None] * (window - 1)
    end = len(df) - window
    mid = map(lambda start: fun(df[start:start + window]), np.arange(0,end))
    last =  fun(df[end:])
    return [*prepend, *mid, last]

my_rolling_apply2(ggg, lambda df: (df["a"].max(), df["b"].min()), 3)

The result is:

[None, None, (3, 5), (4, 4), (5, 3), (6, 2), (7, 1)]

Answer 6

To perform a rolling window operation with access to all columns of the dataframe, you can pass method='table' to rolling(). Example:

import pandas as pd
import numpy as np
from numba import jit

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6], 'b': [1, 3, 5, 7, 9, 11]})

@jit
def f(w):
    # we have access to both columns of the dataframe here
    return np.max(w), np.min(w)

df.rolling(3, method='table').apply(f, raw=True, engine='numba')

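Note that method='table' requires the numba engine (pip install numba). The @jit part in the example is not mandatory, but it helps with performance. The result of the example code above will be: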

a	b
NaN	NaN
NaN	NaN
5.0	1.0
7.0	2.0
9.0	3.0
11.0	4.0

6 answers

Answer 1: score 19, accepted, 2020-03-29 17:27:27
Answer 2: score 15, 2020-03-18 16:11:32
Answer 3: score 1, 2020-03-24 16:11:49
Answer 4: score 1, 2021-04-24 16:30:49
Answer 5: score 0, 2022-06-30 09:21:04
Answer 6: score 0, 2022-08-21 17:03:09
