Pandas apply, rolling, groupby with multiple input & multiple output columns

Question

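I've been struggling for the past week trying to use apply to run functions over an entire pandas dataframe, including rolling windows, groupby, and especially multiple input columns and multiple output columns. I found a large number of questions on SO about this topic, along with many old and outdated answers. So I started building a notebook covering every possible combination of x inputs and outputs, rolling, and rolling combined with groupby, with a focus on performance as well. Since I'm not the only one struggling with these questions, I thought I'd share my solutions here with working examples, hoping it helps any existing/future pandas user.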

Answer 1

Important notes

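1. The combination of apply & rolling in pandas has a very strict output requirement. You have to return one single value. You cannot return a pd.Series, a list, an array, or an array within an array; it must be one value, e.g. one integer. This requirement makes it hard to find a working solution when trying to return multiple outputs for multiple columns. I don't understand why 'apply & rolling' has this requirement, because 'apply' without rolling does not. It must be due to some internal pandas functions.
2. The combination of 'apply & rolling' with multiple input columns simply does not work! Imagine a dataframe with 2 columns and 6 rows, and you want to apply a custom function with a rolling window of 2. Your function should get an input array with 2x2 values: 2 values for each of the 2 columns. But it seems pandas can't handle rolling and multiple input columns at the same time. I tried to use the axis parameter to get it working, but:
   - Axis = 0 calls your function per column. In the dataframe described above, it will call your function 10 times (not 12, because rolling = 2), and since it's per column, it only provides the 2 rolling values of that column...
   - Axis = 1 calls your function per row. This is probably what you want, but pandas will not provide a 2x2 input. It actually ignores the rolling completely and only provides the values of one row across the 2 columns...
3. When using 'apply' with multiple input columns, you can provide a parameter called raw (boolean). It's False by default, which means the input will be a pd.Series and thus includes indexes next to the values. If you don't need the indexes, you can set raw to True to get a NumPy array, which often achieves much better performance.
4. Combining 'rolling & groupby' returns a multi-indexed series, which can't easily serve as input for a new column. The easiest solution is to append reset_index(drop=True), as answered and commented here (Python - rolling functions for GroupBy object).
5. You might ask: when would you ever want to use a rolling, groupby custom function with multiple outputs? Answer: I recently had to do a Fourier transform with sliding windows (rolling) over a dataset of 5 million records (speed/performance matters) that contains different batches (groupby). And I needed to save both the power and phase of the Fourier transform in different columns (multiple outputs). Most people probably only need some of the basic examples below, but I believe that the more complex ones can be useful, especially in machine learning/data science.
6. If you have a better, clearer, or faster way to do any of the solutions below, please let me know. I'll update my answer and we can all benefit!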
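Notes 1 and 2 can be checked with a small experiment. The sketch below is only an illustration: the frame `demo` and the helpers `two_values` and `record_shape` are made-up names, and the exact exception raised for a non-scalar return may differ between pandas versions, so it is caught broadly.

```python
import numpy as np
import pandas as pd

# Illustrative frame: 6 rows, 2 columns (values are arbitrary).
demo = pd.DataFrame({"a": [2, 4, 0, 0, 3, 3], "b": [2, 1, 4, 2, 2, 0]})

# Note 1: the function passed to rolling().apply() must return one number.
def two_values(x):
    return [x.min(), x.max()]  # a list, not a scalar

try:
    demo["b"].rolling(2).apply(two_values, raw=True)
    scalar_required = False
except Exception as exc:  # exact exception type may vary by version
    scalar_required = True
    print(type(exc).__name__, exc)

# Note 2: on a DataFrame, the function is called per column and per full
# window, and each call only sees the values of a single column.
window_shapes = []

def record_shape(x):
    window_shapes.append(x.shape)
    return 0.0

demo.rolling(2).apply(record_shape, raw=True)
print(len(window_shapes), set(window_shapes))
```

With 6 rows and a window of 2 there are 5 full windows per column, which is where the 10 calls come from.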

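Code examples

Let's first create a dataframe that will be used in all the examples below, including a group column for the groupby examples. For the rolling window and the multiple input/output columns, I just use 2 in all the code examples below, but obviously this could be any number > 1.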

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 5, size=(6, 2)), columns=list('ab'))
df['group'] = [0, 0, 0, 1, 1, 1]
df = df[['group', 'a', 'b']]

It will look something like this (the values are random):

   group  a  b
0      0  2  2
1      0  4  1
2      0  0  4
3      1  0  2
4      1  3  2
5      1  3  0

Input 1 column, output 1 column

Basic

def func_i1_o1(x):    
    return x+1

df['c'] = df['b'].apply(func_i1_o1)

Rolling

def func_i1_o1_rolling(x):
    return (x[0] + x[1])

df['d'] = df['c'].rolling(2).apply(func_i1_o1_rolling, raw=True)
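As a side note on performance: when the custom function happens to match a built-in rolling aggregation, the built-in can be used instead of apply. The sketch below uses illustrative values for column c and checks that, for a window of 2, the function above gives the same result as rolling(2).sum(). This is an observation about this particular function, not a general replacement for custom functions.

```python
import numpy as np
import pandas as pd

s = pd.Series([3, 2, 5, 3, 3, 1], name="c")  # illustrative values

def func_i1_o1_rolling(x):
    return x[0] + x[1]

# Custom function through apply (raw=True passes a NumPy array per window).
via_apply = s.rolling(2).apply(func_i1_o1_rolling, raw=True)

# Built-in aggregation that computes the same thing for window=2.
via_builtin = s.rolling(2).sum()

print(via_apply.equals(via_builtin))
```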

Rolling & groupby

Add the reset_index solution (see notes above) to the rolling function.

df['e'] = df.groupby('group')['c'].rolling(2).apply(func_i1_o1_rolling, raw=True).reset_index(drop=True)
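One thing to be aware of: reset_index(drop=True) throws away both index levels, so the result is assigned back by position. That is fine for the example dataframe, where the groups are sorted and contiguous. The sketch below is a variant of my own, not part of the original answer. It uses deliberately interleaved groups and drops only the group level with reset_index(level=0, drop=True), so that the original row labels remain and the assignment aligns on them. The frame `demo` and the column names are illustrative.

```python
import numpy as np
import pandas as pd

# Interleaved groups, to make the difference visible.
demo = pd.DataFrame({"group": [0, 1, 0, 1, 0, 1], "c": [3, 2, 5, 3, 3, 1]})

rolled = demo.groupby("group")["c"].rolling(2).sum()
print(rolled.index.nlevels)  # 2 levels: group key and original row label

# Positional: rows come back in group order, not in original order.
demo["e_positional"] = rolled.reset_index(drop=True)

# Label-aligned: keep the original row labels, drop only the group level.
demo["e_aligned"] = rolled.reset_index(level=0, drop=True)

print(demo)
```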

Input 2 columns, output 1 column

Basic

def func_i2_o1(x):
    return np.sum(x)

df['f'] = df[['b', 'c']].apply(func_i2_o1, axis=1, raw=True)

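Rolling

As explained in point 2 of the notes above, there is no 'normal' solution for 2 inputs. The workaround below uses raw=False to ensure the input is a pd.Series, which means we also get the indexes next to the values. This enables us to get the values from other columns at the correct indexes.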

def func_i2_o1_rolling(x):
    values_b = x
    values_c = df.loc[x.index, 'c'].to_numpy()
    return np.sum(values_b) + np.sum(values_c)

df['g'] = df['b'].rolling(2).apply(func_i2_o1_rolling, raw=False)
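A different route to the 2x2 input is to build the windows with NumPy instead of pandas. The sketch below is an alternative I am adding, not the original workaround. It assumes NumPy 1.20 or newer for sliding_window_view, and the names `demo` and `func_i2_o1_window` are illustrative.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

demo = pd.DataFrame({"b": [2, 1, 4, 2, 2, 0], "c": [3, 2, 5, 3, 3, 1]})
window = 2

# Shape (n_windows, n_columns, window); here (5, 2, 2).
windows = sliding_window_view(demo[["b", "c"]].to_numpy(), window, axis=0)

def func_i2_o1_window(block):
    # block is one 2x2 array: row 0 holds the 'b' values, row 1 the 'c' values.
    return block[0].sum() + block[1].sum()

# The first window-1 rows have no full window, so they stay NaN.
result = np.full(len(demo), np.nan)
result[window - 1:] = [func_i2_o1_window(w) for w in windows]
demo["g"] = result
print(demo)
```

The function receives both columns at once, so no lookup through the index is needed.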

Rolling & groupby

Add the reset_index solution (see notes above) to the rolling function.

df['h'] = df.groupby('group')['b'].rolling(2).apply(func_i2_o1_rolling, raw=False).reset_index(drop=True)

Input 1 column, output 2 columns

Basic

You can use the 'normal' solution by returning pd.Series:

def func_i1_o2(x):
    return pd.Series((x+1, x+2))

df[['i', 'j']] = df['b'].apply(func_i1_o2)

Or you can use the zip/tuple combination, which is about 8 times faster!

def func_i1_o2_fast(x):
    return x+1, x+2

df['k'], df['l'] = zip(*df['b'].apply(func_i1_o2_fast))
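To check the speed difference on your own machine, something like the following sketch can be used. The data size and repeat count are arbitrary choices of mine, and the measured ratio will depend on the pandas version and hardware, so no particular number is claimed here.

```python
import timeit
import pandas as pd

def func_i1_o2(x):
    return pd.Series((x + 1, x + 2))

def func_i1_o2_fast(x):
    return x + 1, x + 2

demo = pd.DataFrame({"b": range(1000)})  # illustrative size

def via_series():
    # Returns a DataFrame with columns 0 and 1.
    return demo["b"].apply(func_i1_o2)

def via_zip():
    # Returns two tuples, one per output.
    return tuple(zip(*demo["b"].apply(func_i1_o2_fast)))

t_series = timeit.timeit(via_series, number=5)
t_zip = timeit.timeit(via_zip, number=5)
print(f"pd.Series: {t_series:.3f}s, zip/tuple: {t_zip:.3f}s")

# Both variants produce the same values.
wide = via_series()
k, l = via_zip()
same = list(wide[0]) == list(k) and list(wide[1]) == list(l)
print(same)
```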

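Rolling

As explained in point 1 of the notes above, we need a workaround if we want to return more than 1 value when using rolling & apply combined. I found 2 working solutions.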

Solution 1

def func_i1_o2_rolling_solution1(x):
    output_1 = np.max(x)
    output_2 = np.min(x)
    # Last index is where to place the final values: x.index[-1]
    # .loc is used because .at only supports a single scalar row/column label
    df.loc[x.index[-1], ['m', 'n']] = output_1, output_2
    return 0

df['m'], df['n'] = (np.nan, np.nan)
df['b'].rolling(2).apply(func_i1_o2_rolling_solution1, raw=False)

Pros: Everything is done within 1 function.
Cons: You have to create the columns first, and it is slower since it doesn't use raw input.

Solution 2

rolling_w = 2
nan_prefix = (rolling_w - 1) * [np.nan]
output_list_1 = nan_prefix.copy()
output_list_2 = nan_prefix.copy()

def func_i1_o2_rolling_solution2(x):
    output_list_1.append(np.max(x))
    output_list_2.append(np.min(x))
    return 0

df['b'].rolling(rolling_w).apply(func_i1_o2_rolling_solution2, raw=True)
df['o'] = output_list_1
df['p'] = output_list_2

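Pros: It uses raw input, which makes it about twice as fast. And since it doesn't use the index to set the output values, the code looks a bit cleaner (to me at least).
Cons: You have to create the nan prefix yourself, and it takes a few more lines of code.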
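One practical caveat with module-level lists is that re-running the apply line without resetting them keeps appending values. A possible way around this is to wrap the pattern in a function so the lists are created fresh on every call. The helper below, `rolling_apply_multi`, is a hypothetical name of mine and only a sketch. It assumes the series contains no NaN, so that the function is called exactly once for every full window.

```python
import numpy as np
import pandas as pd

def rolling_apply_multi(series, window, func, n_outputs):
    """Hypothetical helper: collect n_outputs values per rolling window."""
    # One list per output, pre-filled with the NaN prefix.
    collected = [[np.nan] * (window - 1) for _ in range(n_outputs)]

    def wrapper(x):
        for bucket, value in zip(collected, func(x)):
            bucket.append(value)
        return 0.0  # rolling().apply() still needs a single number back

    series.rolling(window).apply(wrapper, raw=True)
    return collected

def max_and_min(x):
    return np.max(x), np.min(x)

demo = pd.DataFrame({"b": [2, 1, 4, 2, 2, 0]})  # illustrative values
demo["o"], demo["p"] = rolling_apply_multi(demo["b"], 2, max_and_min, 2)
print(demo)
```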

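Rolling & groupby

Normally, I would use the faster 2nd solution above. However, since we're combining groups and rolling, this means you'd have to manually set NaNs/zeros (depending on the number of groups) at the right indexes somewhere in the middle of the dataset. To me it seems that when combining rolling, groupby, and multiple output columns, the first solution is easier and handles the NaNs/grouping automatically. Once again, I use the reset_index solution at the end.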

def func_i1_o2_rolling_groupby(x):
    output_1 = np.max(x)
    output_2 = np.min(x)
    # Last index is where to place the final values: x.index[-1]
    # .loc is used because .at only supports a single scalar row/column label
    df.loc[x.index[-1], ['q', 'r']] = output_1, output_2
    return 0

df['q'], df['r'] = (np.nan, np.nan)
df.groupby('group')['b'].rolling(2).apply(func_i1_o2_rolling_groupby, raw=False).reset_index(drop=True)
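For comparison, here is a sketch of another way to handle the per-group NaN placement: loop over the groups explicitly and write each group's results at that group's own row labels. This is an alternative I am adding, not the original approach. It assumes NumPy 1.20 or newer for sliding_window_view, uses max/min as stand-ins for the two outputs, and the names `demo` and `out` are illustrative.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

demo = pd.DataFrame({"group": [0, 0, 0, 1, 1, 1], "b": [2, 1, 4, 2, 2, 0]})
window = 2

# Output frame starts as all NaN, so rows without a full window stay NaN.
out = pd.DataFrame(np.nan, index=demo.index, columns=["q", "r"])

for _, grp in demo.groupby("group"):
    values = grp["b"].to_numpy()
    if len(values) < window:
        continue  # group too short for even one window
    windows = sliding_window_view(values, window)  # (n_windows, window)
    target = grp.index[window - 1:]  # labels of rows that end a full window
    out.loc[target, "q"] = windows.max(axis=1)
    out.loc[target, "r"] = windows.min(axis=1)

demo = demo.join(out)
print(demo)
```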

Input 2 columns, output 2 columns

Basic

I suggest using the same 'fast' way as for i1_o2, with the only difference that you get 2 input values to use.

def func_i2_o2(x):
    return np.mean(x), np.median(x)

df['s'], df['t'] = zip(*df[['b', 'c']].apply(func_i2_o2, axis=1))

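Rolling

As I use one workaround for applying rolling with multiple inputs and another workaround for rolling with multiple outputs, you can guess I need to combine them for this one:
1. Use indexes to get values from other columns (see func_i2_o1_rolling)
2. Set the final multiple outputs on the correct index (see func_i1_o2_rolling_solution1)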

def func_i2_o2_rolling(x):
    values_b = x.to_numpy()
    values_c = df.loc[x.index, 'c'].to_numpy()
    output_1 = np.min([np.sum(values_b), np.sum(values_c)])
    output_2 = np.max([np.sum(values_b), np.sum(values_c)])    
    # Last index is where to place the final values: x.index[-1]
    # .loc is used because .at only supports a single scalar row/column label
    df.loc[x.index[-1], ['u', 'v']] = output_1, output_2
    return 0

df['u'], df['v'] = (np.nan, np.nan)
df['b'].rolling(2).apply(func_i2_o2_rolling, raw=False)

Rolling & groupby

Add the reset_index solution (see notes above) to the rolling function.

def func_i2_o2_rolling_groupby(x):
    values_b = x.to_numpy()
    values_c = df.loc[x.index, 'c'].to_numpy()
    output_1 = np.min([np.sum(values_b), np.sum(values_c)])
    output_2 = np.max([np.sum(values_b), np.sum(values_c)])    
    # Last index is where to place the final values: x.index[-1]
    # .loc is used because .at only supports a single scalar row/column label
    df.loc[x.index[-1], ['w', 'x']] = output_1, output_2
    return 0

df['w'], df['x'] = (np.nan, np.nan)
df.groupby('group')['b'].rolling(2).apply(func_i2_o2_rolling_groupby, raw=False).reset_index(drop=True)

Answer 1: score 5, accepted, 2020-05-05 15:47:10

重要筆記

代碼示例

輸入1列，output 1列

輸入2列，output 1列

輸入1列，output 2列

輸入2列，output 2列

Pandas 應用、滾動、groupby 多輸入和多 output 列

問題描述

1 個解決方案

解決方案1 5 已采納 2020-05-05 15:47:10

重要筆記

代碼示例

輸入1列，output 1列

輸入2列，output 1列

輸入1列，output 2列

輸入2列，output 2列

解決方案1
5 已采納 2020-05-05 15:47:10