Pandas efficiently applying a function that depends on index value
I have a pandas dataframe with a MultiIndex (d, y, b, r, a), and I need to apply a function that subtracts elements of the dataframe from one another depending on the index value.
To simplify things I will consider only three index levels: d, y, r.
Index d can take two values, value0 and value1. When d == value0, the index r can only be "O". When d == value1, r is an integer between 0 and 999. The rest of the index levels are common to value0 and value1.
The dataframe can be constructed as:
import numpy as np
import pandas as pd
d_index = [*["value0", "value0"], *["value1" for _ in range(2000)]]
y_index = [*[0, 1], *[0 for _ in range(1000)], *[1 for _ in range(1000)]]
r_index = [*["O", "O"], *[i for i in range(1000)], *[i for i in range(1000)]]
rng = np.random.default_rng(12345)
results00 = rng.uniform(0, 2, 1000).tolist()
results01 = rng.uniform(0, 2, 1000).tolist()
results20 = rng.uniform(0, 6, 1000).tolist()
results21 = rng.uniform(0, 6, 1000).tolist()
variable0 = [1, 2, *results00, *results01]
variable2 = [
    2.5,
    3.5,
    *results20,
    *results21,
]
df = pd.DataFrame(
    {
        "d": d_index,
        "y": y_index,
        "r": r_index,
        "string0": variable0,
        "string2": variable2,
    }
)
df.set_index(["d", "y", "r"], inplace=True)
For some columns, I need to compute the difference between the value0 row and the value1 rows that share all other index values, i.e. rows that differ only in d (and therefore in r). The expected result can be obtained with:
df.loc[("value1", 0), "dstring0"] = (df.loc[("value0", 0), "string0"]).to_numpy() - (
    df.loc[("value1", 0), "string0"]
).to_numpy()
df.loc[("value1", 1), "dstring0"] = (df.loc[("value0", 1), "string0"]).to_numpy() - (
    df.loc[("value1", 1), "string0"]
).to_numpy()
df.loc[("value1", 0), "dstring2"] = (df.loc[("value0", 0), "string2"]).to_numpy() - (
    df.loc[("value1", 0), "string2"]
).to_numpy()
df.loc[("value1", 1), "dstring2"] = (df.loc[("value0", 1), "string2"]).to_numpy() - (
    df.loc[("value1", 1), "string2"]
).to_numpy()
I can handle this transformation by looping over the y, b, a indices and performing the subtractions above, but that would not be efficient given the large number of observations (around 8 million).
How can I perform these operations efficiently?
Edit: added a sample dataframe and the expected output. I also realised that the previous function did not work. I kept it below as a reference.
Wrong function:
def deviation(df_slice, df, variables):
    d, y, b, r, a = df_slice.name
    dvars = ["d" + var for var in variables]
    if d == "value1":
        df.loc[(d, y, b, r, a), dvars] = (
            df.loc[("value0", y, b, "O", a), variables].to_numpy()
            - df_slice[variables].to_numpy()
        )

df.apply(deviation, axis=1, variables=['string0','string2'],df=df)
The easy answer is the swifter package. However, it will only improve your performance a little (about 2x):
import swifter
df.swifter.apply(deviation, axis=1, variables=['string0','string2'],df=df)
Check out this article from towardsdatascience.com about better pandas vectorization concepts.
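To make the vectorization idea concrete, here is a minimal sketch that avoids row-wise apply entirely. It is one possible approach under stated assumptions, not a tested solution for the full 8-million-row data: it assumes there is exactly one value0 row per combination of the shared levels, and it uses a small stand-in frame (3 draws per y instead of 1000). The names shared, base, keys, aligned and diff are introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Small stand-in for the question's frame: 2 baseline rows, n draws per y.
rng = np.random.default_rng(12345)
n = 3
df = pd.DataFrame(
    {
        "d": ["value0"] * 2 + ["value1"] * (2 * n),
        "y": [0, 1] + [0] * n + [1] * n,
        "r": ["O", "O"] + list(range(n)) * 2,
        "string0": [1.0, 2.0, *rng.uniform(0, 2, 2 * n)],
        "string2": [2.5, 3.5, *rng.uniform(0, 6, 2 * n)],
    }
).set_index(["d", "y", "r"])

variables = ["string0", "string2"]
# Levels that differ between the two groups; every other level is shared.
# With the full index (d, y, b, r, a) the shared levels would be y, b, a.
differing = ["d", "r"]

d = df.index.get_level_values("d")
is_v1 = d == "value1"  # boolean NumPy mask over the rows

# Baseline rows, indexed only by the shared levels.
# Assumption: this index is unique (one value0 row per shared key).
base = df.loc[d == "value0", variables].droplevel(differing)

# Shared-level key of every value1 row, in row order.
keys = df.index[is_v1].droplevel(differing)

# Look up the matching baseline row for each value1 row in one shot.
aligned = base.reindex(keys).to_numpy()

# value0 minus value1; rows with d == value0 stay NaN.
diff = np.full((len(df), len(variables)), np.nan)
diff[is_v1] = aligned - df.loc[is_v1, variables].to_numpy()
for j, var in enumerate(variables):
    df["d" + var] = diff[:, j]
```

The subtraction happens once on whole NumPy arrays, so the cost should scale with the number of rows rather than with the number of Python-level function calls. If the baseline keys are not unique, reindex will raise, which is a useful early warning that the one-row-per-key assumption does not hold.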