Pandas: efficiently applying a function that depends on index values

Question

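I have a Pandas DataFrame with a MultiIndex d, y, b, r, a, and I need to apply a function that subtracts elements of the DataFrame from one another depending on the index values.

To simplify, I will consider only three index levels: d, y, r.

Index d can take two values, value0 and value1. When d == value0, index r can only be "O". When d == value1, r is instead an integer between 0 and 999. The remaining index levels are common to value0 and value1.

The DataFrame can be constructed as follows: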

import numpy as np
import pandas as pd
d_index = [*["value0", "value0"], *["value1" for _ in range(2000)]]

y_index = [*[0, 1], *[0 for _ in range(1000)], *[1 for _ in range(1000)]]

r_index = [*["O", "O"], *[i for i in range(1000)], *[i for i in range(1000)]]

rng = np.random.default_rng(12345)

results00 = rng.uniform(0, 2, 1000).tolist()
results01 = rng.uniform(0, 2, 1000).tolist()

results20 = rng.uniform(0, 6, 1000).tolist()
results21 = rng.uniform(0, 6, 1000).tolist()


variable0 = [1, 2, *results00, *results01]

variable2 = [
    2.5,
    3.5,
    *results20,
    *results21,
]

df = pd.DataFrame(
    {
        "d": d_index,
        "y": y_index,
        "r": r_index,
        "string0": variable0,
        "string2": variable2,
    }
)

df.set_index(["d", "y", "r"], inplace=True)

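I need to compute the difference for some of the columns between rows whose index values are all the same apart from value0 and value1. The result can be obtained with: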

df.loc[("value1", 0), "dstring0"] = (df.loc[("value0", 0), "string0"]).to_numpy() - (
    df.loc[("value1", 0), "string0"]
).to_numpy()

df.loc[("value1", 1), "dstring0"] = (df.loc[("value0", 1), "string0"]).to_numpy() - (
    df.loc[("value1", 1), "string0"]
).to_numpy()

df.loc[("value1", 0), "dstring2"] = (df.loc[("value0", 0), "string2"]).to_numpy() - (
    df.loc[("value1", 0), "string2"]
).to_numpy()

df.loc[("value1", 1), "dstring2"] = (df.loc[("value0", 1), "string2"]).to_numpy() - (
    df.loc[("value1", 1), "string2"]
).to_numpy()

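I could handle this transformation by looping over the y, b, a index levels and performing the subtraction above, but given the large number of observations (about 8 million) that would not be efficient.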
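For reference, the loop described above might look like the sketch below for the simplified three-level index. The small stand-in frame (three value1 rows per y instead of 1000) and the boolean masks are assumptions made to keep the example short and self-contained; they are not part of the original code.

```python
import numpy as np
import pandas as pd

# Small stand-in for the frame above: one "value0" row and n "value1" rows per y.
n = 3
df = pd.DataFrame(
    {
        "d": ["value0"] * 2 + ["value1"] * (2 * n),
        "y": [0, 1] + [0] * n + [1] * n,
        "r": ["O", "O"] + list(range(n)) * 2,
        "string0": [1.0, 2.0, 0.5, 1.5, 0.25, 1.0, 0.75, 1.25],
        "string2": [2.5, 3.5, 1.0, 4.0, 5.5, 0.5, 3.0, 6.0],
    }
).set_index(["d", "y", "r"])

variables = ["string0", "string2"]
for var in variables:
    df["d" + var] = np.nan  # create the output columns up front

d_vals = df.index.get_level_values("d")
y_vals = df.index.get_level_values("y")

# One pass per y value: subtract the "value1" rows from the single "value0" row.
for y in y_vals.unique():
    ref_mask = (d_vals == "value0") & (y_vals == y)
    cur_mask = (d_vals == "value1") & (y_vals == y)
    for var in variables:
        ref = df.loc[ref_mask, var].to_numpy()  # shape (1,), broadcasts below
        cur = df.loc[cur_mask, var].to_numpy()  # shape (n,)
        df.loc[cur_mask, "d" + var] = ref - cur
```

With the full index the outer loop would run over every (y, b, a) combination, which is where the cost comes from.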

How can I handle this operation efficiently?

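Edit: added an example DataFrame and the expected output. I also realized that my previous function does not work. I am leaving it below for reference.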

Incorrect function

def deviation(df_slice, df, variables):
    d, y, b, r, a = df_slice.name
    dvars = ["d" + var for var in variables]
    if d == "value1":
        df.loc[(d, y, b, r, a), dvars] = (
            df.loc[("value0", y, b, "O", a), variables].to_numpy()
            - df_slice[variables].to_numpy()
        )

df.apply(deviation, axis=1, variables=['string0','string2'],df=df)

Answer 1

The easy answer is swifter. However, it will only improve your performance slightly (2x):

import swifter

df.swifter.apply(deviation, axis=1, variables=['string0','string2'],df=df)

Have a look at this article from towardsdatascience.com for better vectorization concepts in pandas.
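As an illustration of what a vectorized version could look like, here is a minimal sketch that avoids apply altogether. It is not taken from the answer or the linked article. It assumes that each combination of the shared index levels has exactly one value0 row, and it uses a small stand-in frame with the same layout as the one in the question. The names ref, target, shared and diff are made up for the example.

```python
import numpy as np
import pandas as pd

# Small stand-in for the question's frame: one "value0" row and n "value1" rows per y.
n = 3
df = pd.DataFrame(
    {
        "d": ["value0"] * 2 + ["value1"] * (2 * n),
        "y": [0, 1] + [0] * n + [1] * n,
        "r": ["O", "O"] + list(range(n)) * 2,
        "string0": [1.0, 2.0, 0.5, 1.5, 0.25, 1.0, 0.75, 1.25],
        "string2": [2.5, 3.5, 1.0, 4.0, 5.5, 0.5, 3.0, 6.0],
    }
).set_index(["d", "y", "r"])

variables = ["string0", "string2"]
dvars = ["d" + var for var in variables]

d_vals = df.index.get_level_values("d")
is_v0 = d_vals == "value0"
is_v1 = d_vals == "value1"

# Reference rows, indexed only by the shared levels (here just y).
# Assumption: this reduced index is unique, i.e. one "value0" row per combination.
ref = df.loc[is_v0, variables].droplevel(["d", "r"])

# Rows that receive a deviation, plus their shared-level keys.
target = df.loc[is_v1, variables]
shared = target.index.droplevel(["d", "r"])

# Repeat each reference row once per matching target row, then subtract in NumPy.
diff = ref.reindex(shared).to_numpy() - target.to_numpy()

# Write the results back; "value0" rows stay NaN.
for j, dvar in enumerate(dvars):
    col = np.full(len(df), np.nan)
    col[is_v1] = diff[:, j]
    df[dvar] = col
```

For the full five-level index, the same idea would drop ["d", "r"] and keep y, b, a as the shared key. The subtraction then happens once over whole arrays rather than once per row.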

Solution 1
0 2021-11-04 15:21:30
