Pandas efficiently applying a function that depends on index value

I have a pandas DataFrame with a MultiIndex d, y, b, r, a, and I need to apply a function that subtracts elements of the DataFrame from one another depending on the index value.

To simplify things I will consider only three index levels: d, y, r.

Index d can take two values, value0 and value1. When d == value0, the index r can only be "O". When d == value1, r is instead an integer between 0 and 999. The rest of the index levels are common between value0 and value1.

The DataFrame can be constructed as:

import numpy as np
import pandas as pd
d_index = [*["value0", "value0"], *["value1" for _ in range(2000)]]

y_index = [*[0, 1], *[0 for _ in range(1000)], *[1 for _ in range(1000)]]

r_index = [*["O", "O"], *[i for i in range(1000)], *[i for i in range(1000)]]

rng = np.random.default_rng(12345)

results00 = rng.uniform(0, 2, 1000).tolist()
results01 = rng.uniform(0, 2, 1000).tolist()

results20 = rng.uniform(0, 6, 1000).tolist()
results21 = rng.uniform(0, 6, 1000).tolist()


variable0 = [1, 2, *results00, *results01]

variable2 = [
    2.5,
    3.5,
    *results20,
    *results21,
]

df = pd.DataFrame(
    {
        "d": d_index,
        "y": y_index,
        "r": r_index,
        "string0": variable0,
        "string2": variable2,
    }
)

df.set_index(["d", "y", "r"], inplace=True)

I need to compute the difference between the value0 row and the value1 rows for some columns, matching rows whose index values are all the same except for d (and therefore r). The results can be obtained through:

df.loc[("value1", 0), "dstring0"] = (df.loc[("value0", 0), "string0"]).to_numpy() - (
    df.loc[("value1", 0), "string0"]
).to_numpy()

df.loc[("value1", 1), "dstring0"] = (df.loc[("value0", 1), "string0"]).to_numpy() - (
    df.loc[("value1", 1), "string0"]
).to_numpy()

df.loc[("value1", 0), "dstring2"] = (df.loc[("value0", 0), "string2"]).to_numpy() - (
    df.loc[("value1", 0), "string2"]
).to_numpy()

df.loc[("value1", 1), "dstring2"] = (df.loc[("value0", 1), "string2"]).to_numpy() - (
    df.loc[("value1", 1), "string2"]
).to_numpy()

I can handle this transformation by looping over the y, b, a index levels and performing the subtractions above, but that would not be efficient given the large number of observations (around 8 million).

How can I perform these operations efficiently?

Edit: added a sample DataFrame and the expected output. I also realised that the previous function did not work. I kept it below as a reference.

Wrong function

def deviation(df_slice, df, variables):
    d, y, b, r, a = df_slice.name
    dvars = ["d" + var for var in variables]
    if d == "value1":
        df.loc[(d, y, b, r, a), dvars] = (
            df.loc[("value0", y, b, "O", a), variables].to_numpy()
            - df_slice[variables].to_numpy()
        )

df.apply(deviation, axis=1, variables=['string0','string2'],df=df)

The easy answer is swifter. However, it will only improve your performance a little (about 2 times):

import swifter

df.swifter.apply(deviation, axis=1, variables=['string0','string2'],df=df)

Check out this article from towardsdatascience.com about better pandas vectorization concepts.
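As a rough illustration of what a vectorized version could look like, here is a minimal sketch. It is not taken from the linked article, and it assumes the simplified three-level index (d, y, r) from the question with exactly one value0 row per y. The names `is_value1`, `baseline`, `keys`, `aligned` and `result` are made up for this example, and the frame is a small stand-in for the real data. The idea is to drop the d and r levels from the value0 rows, align them to the value1 rows on the remaining level(s) with `reindex`, and subtract whole arrays at once.

```python
import numpy as np
import pandas as pd

# Small stand-in for the question's frame: n draws per y instead of 1000.
rng = np.random.default_rng(0)
n = 3
df = pd.DataFrame(
    {
        "d": ["value0", "value0"] + ["value1"] * (2 * n),
        "y": [0, 1] + [0] * n + [1] * n,
        "r": ["O", "O"] + list(range(n)) * 2,
        "string0": [1.0, 2.0, *rng.uniform(0, 2, 2 * n)],
        "string2": [2.5, 3.5, *rng.uniform(0, 6, 2 * n)],
    }
).set_index(["d", "y", "r"])

variables = ["string0", "string2"]
dvars = ["d" + var for var in variables]

# Boolean mask selecting the value1 rows.
is_value1 = df.index.get_level_values("d") == "value1"

# Baseline rows (value0), keyed only by the levels shared with value1 (here: y).
baseline = df.loc[~is_value1, variables].droplevel(["d", "r"])

# For every value1 row, look up its baseline row via the shared level(s).
keys = df.index[is_value1].droplevel(["d", "r"])
aligned = baseline.reindex(keys).to_numpy()

# value0 - value1 for all value1 rows at once; value0 rows stay NaN.
out = np.full((len(df), len(variables)), np.nan)
out[is_value1] = aligned - df.loc[is_value1, variables].to_numpy()

result = df.assign(**dict(zip(dvars, out.T)))
print(result)
```

With the full index, the same pattern should carry over by leaving y, b, a as the shared levels, as long as each (y, b, a) combination has exactly one value0 row. `reindex` raises an error if the baseline index contains duplicates.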
