Pandas: how to vectorize a calculation that depends on previous rows

Question

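I'm new to pandas and am trying to port an indicator from Pine Script to Python. One calculation needs the previous row's value, which is itself computed on the fly, to get the current row's value. The only way I've managed to do this is with a for loop, and I haven't found a good way to do it with numpy or dataframe.apply. The problem is that the calculation runs far too slowly for my purposes: 14 seconds for just 21951 rows.

Does anyone know how to do this more efficiently in pandas? Figuring this out would definitely help me as I build other indicators, since most of them depend on previous row values.

The dataframe looks like: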


"""
//
// @author LazyBear 
// List of all my indicators: 
// https://docs.google.com/document/d/15AGCufJZ8CIUvwFJ9W-IKns88gkWOKBCvByMEvm5MLo/edit?usp=sharing
// 
study(title="Coral Trend Indicator [LazyBear]", shorttitle="CTI_LB", overlay=true)
src=close
sm =input(21, title="Smoothing Period")
cd = input(0.4, title="Constant D")
ebc=input(false, title="Color Bars")
ribm=input(false, title="Ribbon Mode")
"""

# @jit(nopython=True) -- Tried this but was getting an error ==> argument 0: Cannot determine Numba type of <class 'pandas.core.frame.DataFrame'>
def coral_trend_filter(df, sm = 21, cd = 0.4):
  new_df = df.copy()

  di = (sm - 1.0) / 2.0 + 1.0
  c1 = 2 / (di + 1.0)
  c2 = 1 - c1
  c3 = 3.0 * (cd * cd + cd * cd * cd)
  c4 = -3.0 * (2.0 * cd * cd + cd + cd * cd * cd)
  c5 = 3.0 * cd + 1.0 + cd * cd * cd + 3.0 * cd * cd

  new_df['i1'] = 0
  new_df['i2'] = 0
  new_df['i3'] = 0
  new_df['i4'] = 0
  new_df['i5'] = 0
  new_df['i6'] = 0

  for i in range(1, len(new_df)):
    new_df.loc[i, 'i1'] = c1*new_df.loc[i, 'close'] + c2*new_df.loc[i - 1, 'i1']
    new_df.loc[i, 'i2'] = c1*new_df.loc[i, 'i1'] + c2*new_df.loc[i - 1, 'i2']
    new_df.loc[i, 'i3'] = c1*new_df.loc[i, 'i2'] + c2*new_df.loc[i - 1, 'i3']
    new_df.loc[i, 'i4'] = c1*new_df.loc[i, 'i3'] + c2*new_df.loc[i - 1, 'i4']
    new_df.loc[i, 'i5'] = c1*new_df.loc[i, 'i4'] + c2*new_df.loc[i - 1, 'i5']
    new_df.loc[i, 'i6'] = c1*new_df.loc[i, 'i5'] + c2*new_df.loc[i - 1, 'i6']

  new_df['cif'] = -cd*cd*cd*new_df['i6'] + c3*new_df['i5'] + c4*new_df['i4'] + c5*new_df['i3']
  new_df.dropna(inplace=True)
  
  # trend direction
  new_df['cifd'] = 0

  # trend direction color
  new_df['cifd'] = 'blue'
  
  new_df['cifd'] = np.where(new_df['cif'] < new_df['cif'].shift(-1), 1, -1)
  new_df['cifc'] = np.where(new_df['cifd'] == 1, 'green', 'red')


  new_df.drop(columns=['i1', 'i2', 'i3', 'i4', 'i5', 'i6'], inplace=True)

  return new_df

df = coral_trend_filter(data_frame)

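Reply to comments: one suggestion was to use shift. That does not work, because each row's value is updated on every iteration of the calculation. shift stores the initial values and never updates the shifted column, so the computed values are wrong. See this screenshot, whose cif column does not match the original. Also note that I left shift_i1 in to show that the column stays at 0, which is incorrect for the calculation.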
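The mismatch described above can be reproduced on a tiny example. This is only a sketch: the sample prices and the coefficients `c1` and `c2` are made up for illustration and are not taken from the indicator.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data and coefficients, chosen only for illustration
close = pd.Series([10.0, 11.0, 12.0, 13.0])
c1, c2 = 0.5, 0.5

# Row-by-row recurrence: each value uses the previous *computed* value
x = close.to_numpy()
loop = np.zeros(len(x))
for i in range(1, len(x)):
    loop[i] = c1 * x[i] + c2 * loop[i - 1]

# shift-based attempt: the shifted column is a snapshot of the initial zeros,
# so the c2 term never sees the freshly computed values
i1 = pd.Series(np.zeros(len(x)))
shifted = c1 * close + c2 * i1.shift(1)

print(loop)                # recurrence result
print(shifted.to_numpy())  # differs from the loop from the third row onward
```

The two results agree only where the previous value happens to still be zero; after that the shift version drops the accumulated history.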

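Update: by switching to .at instead of .loc I got noticeably better performance. My problem may have been that I was using the wrong accessor for this kind of processing.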
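For reference, a minimal sketch of what the .at variant of one recurrence column might look like. The sample frame and the value of `c1` are assumptions for illustration; the column is created as float so the computed values are stored without truncation.

```python
import pandas as pd

# Hypothetical sample frame and coefficient, for illustration only
df = pd.DataFrame({'close': [10.0, 11.0, 12.0, 13.0]})
c1 = 0.5
c2 = 1 - c1

df['i1'] = 0.0  # float column so the results are not truncated
for i in range(1, len(df)):
    # .at reads and writes a single cell by row label and column name
    df.at[i, 'i1'] = c1 * df.at[i, 'close'] + c2 * df.at[i - 1, 'i1']

print(df)
```

This is still a Python-level loop over rows, so it only reduces the per-cell overhead rather than removing the loop.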

Answer 1

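Edit: because of the sequential nature of the problem, it looks like this approach won't work. Leaving it here for posterity.

Iterating over a dataframe with a for loop is never a good idea. Pandas is ultimately just a wrapper around NumPy, so it is best to work out how to express the calculation as vectorized array operations. There is almost always a way.

For your case, I would look at using pd.DataFrame.shift to bring your i - 1 value onto the same row, and then use apply (or not, it may not actually be needed) with that new value.

Something like this (for your first few points):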

new_df["shifted_i1"] = new_df["i1"].shift(periods=1)
new_df["i1"] = c1 * new_df["close"] + c2 * new_df["shifted_i1"]

new_df["shifted_i2"] = new_df["i2"].shift(periods=1)
new_df["i2"] = c1 * new_df["i1"] + c2 * new_df["shifted_i2"]

new_df["shifted_i3"] = new_df["i3"].shift(periods=1)
new_df["i3"] = c1 * new_df["i2"] + c2 * new_df["shifted_i3"]

...

Once that is done, you can drop the shifted columns from the dataframe: new_df.drop(columns=["shifted_i1", "shifted_i2", "shifted_i3"], inplace=True)

Answer 2

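Per @hpaulj's comment, it looks like vectorization is only useful when the calculation can be split up and processed in parallel. I solved the speed problem by converting the columns to arrays and running the loop over the arrays, then saving the results back to the DataFrame. Here is the code; I hope it helps others.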

"""
//
// @author LazyBear 
// List of all my indicators: 
// https://docs.google.com/document/d/15AGCufJZ8CIUvwFJ9W-IKns88gkWOKBCvByMEvm5MLo/edit?usp=sharing
// 
study(title="Coral Trend Indicator [LazyBear]", shorttitle="CTI_LB", overlay=true)
src=close
sm =input(21, title="Smoothing Period")
cd = input(0.4, title="Constant D")
ebc=input(false, title="Color Bars")
ribm=input(false, title="Ribbon Mode")
"""
import numpy as np

def coral_trend_filter(df, sm = 25, cd = 0.4):
  new_df = df.copy()

  di = (sm - 1.0) / 2.0 + 1.0
  c1 = 2 / (di + 1.0)
  c2 = 1 - c1
  c3 = 3.0 * (cd * cd + cd * cd * cd)
  c4 = -3.0 * (2.0 * cd * cd + cd + cd * cd * cd)
  c5 = 3.0 * cd + 1.0 + cd * cd * cd + 3.0 * cd * cd

  # Work on plain NumPy arrays; indexing them is far cheaper than .loc
  close = new_df['close'].to_numpy(dtype=float)
  n = len(close)

  # Float buffers: arrays taken from integer-zero columns would truncate
  # the computed values on assignment
  i1 = np.zeros(n)
  i2 = np.zeros(n)
  i3 = np.zeros(n)
  i4 = np.zeros(n)
  i5 = np.zeros(n)
  i6 = np.zeros(n)

  for i in range(1, len(close)):
    i1[i] = c1*close[i] + c2*i1[i-1]
    i2[i] = c1*i1[i] + c2*i2[i-1]
    i3[i] = c1*i2[i] + c2*i3[i-1]
    i4[i] = c1*i3[i] + c2*i4[i-1]
    i5[i] = c1*i4[i] + c2*i5[i-1]
    i6[i] = c1*i5[i] + c2*i6[i-1]

  new_df['i1'] = i1
  new_df['i2'] = i2
  new_df['i3'] = i3
  new_df['i4'] = i4
  new_df['i5'] = i5
  new_df['i6'] = i6

  new_df['cif'] = -cd*cd*cd*new_df['i6'] + c3*new_df['i5'] + c4*new_df['i4'] + c5*new_df['i3']
  new_df.dropna(inplace=True)
  
  new_df['cifd'] = 0
  new_df['cifd'] = np.where(new_df['cif'] < new_df['cif'].shift(), 1, -1)
  new_df['cifc'] = np.where(new_df['cifd'] == 1, 'green', 'red')

  new_df.drop(columns=['i1', 'i2', 'i3', 'i4', 'i5', 'i6'], inplace=True)

  return new_df
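As a side note, because c2 = 1 - c1, each of the six recurrences has the same form as the recursive exponential moving average that pandas computes with ewm(alpha=..., adjust=False). The sketch below relies on that observation; the helper name `cascaded_ema` is hypothetical, and zeroing the first input value is an assumption made so that the result matches the loop above, which starts at row 1 and leaves row 0 at zero.

```python
import numpy as np
import pandas as pd

def cascaded_ema(close, c1, levels=6):
    """Return [i1, ..., i_levels] for y[i] = c1*x[i] + (1 - c1)*y[i-1], y[0] = 0."""
    # np.array copies the input, so the caller's data is left untouched
    x = pd.Series(np.array(close, dtype=float))
    x.iloc[0] = 0.0  # the loop never reads close[0] and keeps row 0 at zero
    out = []
    for _ in range(levels):
        # adjust=False: y[0] = x[0], y[t] = (1 - alpha)*y[t-1] + alpha*x[t]
        x = x.ewm(alpha=c1, adjust=False).mean()
        out.append(x.to_numpy())
    return out

# Hypothetical usage with made-up prices
rng = np.random.default_rng(0)
prices = rng.uniform(90, 110, 50)
i1, i2, i3, i4, i5, i6 = cascaded_ema(prices, c1=2 / 13)
```

With these arrays, cif could be formed from i3 to i6 using the same c3, c4 and c5 expression as in the function above, without an explicit Python loop over rows.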

Answer 3

You could try replacing the iteration over the dataframe rows with the following:

import pandas as pd
import numpy as np

# sample dataframe
rng = np.random.default_rng(0)
new_df = pd.DataFrame({'close': rng.integers(1, 10, 10)})
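new_df['i1'] = 0.0  # float columns, so non-integer coefficients are stored as-is
new_df['i2'] = 0.0

c1 = 3
c2 = 2

N = len(new_df)
# Unrolling y[i] = c1*x[i] + c2*y[i-1] with y[0] = 0 gives
# y[i] = c1 * sum(c2**k * x[i-k]), a convolution of x with powers of c2
exps = c2**np.r_[:N - 1]
f = lambda x: c1 * np.convolve(new_df.loc[1:, x], exps, mode='full')[:N - 1]
new_df.loc[1:, 'i1'] = f('close')
new_df.loc[1:, 'i2'] = f('i1')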

You can compute the values of columns 'i3', 'i4', and so on by repeating the last line with the new column names.
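To check that the convolution form reproduces the row-by-row recurrence, here is a small comparison sketch. The function names and the float coefficients are assumptions for illustration. Note that the kernel is as long as the data, so the amount of work grows roughly quadratically with the number of rows.

```python
import numpy as np

def recurrence_by_convolution(x, c1, c2):
    """y[0] = 0, y[i] = c1*x[i] + c2*y[i-1], computed as a convolution."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    y = np.zeros(n)
    if n < 2:
        return y
    weights = c2 ** np.arange(n - 1)  # c2**0, c2**1, ..., c2**(n-2)
    y[1:] = c1 * np.convolve(x[1:], weights)[:n - 1]
    return y

def recurrence_by_loop(x, c1, c2):
    """Reference implementation: the explicit loop over rows."""
    x = np.asarray(x, dtype=float)
    y = np.zeros(len(x))
    for i in range(1, len(x)):
        y[i] = c1 * x[i] + c2 * y[i - 1]
    return y

# Hypothetical sample data and coefficients
rng = np.random.default_rng(0)
sample = rng.uniform(1, 10, 20)
conv_result = recurrence_by_convolution(sample, 0.25, 0.75)
loop_result = recurrence_by_loop(sample, 0.25, 0.75)
print(np.allclose(conv_result, loop_result))
```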
