Pandas apply, rolling, groupby with multiple input & multiple output columns
I've been struggling for the past week trying to use apply to run functions over an entire pandas dataframe, including rolling windows, groupby, and especially multiple input columns and multiple output columns. I found a large number of questions on SO about this topic, along with many old and outdated answers. So I started to create a notebook for every possible combination of x inputs & outputs, rolling, and rolling & groupby combined, and I focused on performance as well. Since I'm not the only one struggling with these questions, I thought I'd provide my solutions here with working examples, hoping they help existing and future pandas users.
Let's first create a dataframe that will be used in all the examples below, including a group column for the groupby examples. For the rolling window and the multiple input/output columns I use 2 in all code examples below, but this could be any number > 1.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 5, size=(6, 2)), columns=list('ab'))
df['group'] = [0, 0, 0, 1, 1, 1]
df = df[['group', 'a', 'b']]
It will look like this (the values are random, so yours will differ):
   group  a  b
0      0  2  2
1      0  4  1
2      0  0  4
3      1  0  2
4      1  3  2
5      1  3  0
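Because the frame above is built from random integers, every run produces different values. As a minimal sketch (the literal construction is my own addition, not part of the original notebook), the exact frame printed above can be rebuilt from fixed values so that the results of the later examples are reproducible:

```python
import pandas as pd

# Literal copy of the sample output shown above, so every run gives the same data.
df = pd.DataFrame(
    {
        "group": [0, 0, 0, 1, 1, 1],
        "a": [2, 4, 0, 0, 3, 3],
        "b": [2, 1, 4, 2, 2, 0],
    }
)
print(df)
```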
Basic (1 input column, 1 output column)
def func_i1_o1(x):
    return x + 1
df['c'] = df['b'].apply(func_i1_o1)
Rolling
def func_i1_o1_rolling(x):
    # With raw=True, x is a NumPy array holding the values of the current window
    return x[0] + x[1]
df['d'] = df['c'].rolling(2).apply(func_i1_o1_rolling, raw=True)
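Rolling & Groupby
Add the reset_index solution (see notes above) to the rolling call. Note that reset_index(drop=True) only lines up with df when df has a default integer index and the rows of each group are contiguous and in group order, as in the example dataframe.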
df['e'] = df.groupby('group')['c'].rolling(2).apply(func_i1_o1_rolling, raw=True).reset_index(drop=True)
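The result of groupby followed by rolling carries a MultiIndex of (group, original row label), which is why the index has to be adjusted before assigning back. As a hedged sketch, using a hypothetical frame `df_mixed` with interleaved groups and a helper `add_pair` that are not in the original notebook, dropping the group level with `droplevel(0)` keeps the original row labels, whereas `reset_index(drop=True)` renumbers the rows in group order:

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose groups are interleaved rather than contiguous.
df_mixed = pd.DataFrame({"group": [0, 1, 0, 1, 0, 1], "c": [3, 3, 2, 3, 5, 1]})

def add_pair(x):
    # x is a NumPy array holding the two values of the current window
    return x[0] + x[1]

rolled = df_mixed.groupby("group")["c"].rolling(2).apply(add_pair, raw=True)

# Dropping the group level keeps the original row labels,
# so the assignment aligns every value with the row it was computed for.
df_mixed["e_aligned"] = rolled.droplevel(0)

# reset_index(drop=True) renumbers the rows in group order instead, which
# only matches the frame when the groups are contiguous and sorted.
df_mixed["e_reset"] = rolled.reset_index(drop=True)

print(df_mixed)
```

On the example dataframe used in this post both variants give the same column, because its groups are already contiguous and sorted.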
Basic (2 input columns, 1 output column)
def func_i2_o1(x):
    return np.sum(x)
df['f'] = df[['b', 'c']].apply(func_i2_o1, axis=1, raw=True)
Rolling
As explained in point 2 in the notes above, there isn't a 'normal' solution for 2 inputs. The workaround below uses raw=False to ensure the input is a pd.Series, which means we get the index alongside the values. This lets us look up the values of other columns at the same index labels.
def func_i2_o1_rolling(x):
    values_b = x
    # x.index holds the row labels of the current window; use them to read column 'c'
    values_c = df.loc[x.index, 'c'].to_numpy()
    return np.sum(values_b) + np.sum(values_c)
df['g'] = df['b'].rolling(2).apply(func_i2_o1_rolling, raw=False)
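One way to gain confidence in the index-lookup workaround is to compare it against a result computed another way. The sketch below assumes fixed sample values for column b and relies on the fact that a plain sum can be computed per column with the built-in rolling sum; that shortcut applies only to this particular function, not to arbitrary two-input functions.

```python
import numpy as np
import pandas as pd

# Assumed sample values; 'c' is derived as b + 1, as in the basic example.
df = pd.DataFrame({"b": [2, 1, 4, 2, 2, 0]})
df["c"] = df["b"] + 1

def func_i2_o1_rolling(x):
    # x is a pd.Series (raw=False); its index tells us which rows of 'c' to read
    values_c = df.loc[x.index, "c"].to_numpy()
    return np.sum(x) + np.sum(values_c)

df["g"] = df["b"].rolling(2).apply(func_i2_o1_rolling, raw=False)

# For a plain sum, the same numbers are available from the built-in rolling sum.
expected = df["b"].rolling(2).sum() + df["c"].rolling(2).sum()
pd.testing.assert_series_equal(df["g"], expected, check_names=False)
print(df)
```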
Rolling & Groupby
Add the reset_index solution (see notes above) to the rolling function.
df['h'] = df.groupby('group')['b'].rolling(2).apply(func_i2_o1_rolling, raw=False).reset_index(drop=True)
Basic (1 input column, 2 output columns)
You could use a 'normal' solution by returning a pd.Series:
def func_i1_o2(x):
    return pd.Series((x + 1, x + 2))
df[['i', 'j']] = df['b'].apply(func_i1_o2)
Or you could use the zip/tuple combination, which is about 8 times faster!
def func_i1_o2_fast(x):
    return x + 1, x + 2
df['k'], df['l'] = zip(*df['b'].apply(func_i1_o2_fast))
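Before switching to the faster variant, it can be worth checking that both approaches produce the same columns and measuring the difference on your own machine. The sketch below uses assumed sample values and a hypothetical test series `big`; the timings it prints depend on your pandas version and hardware, so no particular ratio is implied.

```python
import timeit

import pandas as pd

df = pd.DataFrame({"b": [2, 1, 4, 2, 2, 0]})

def func_i1_o2(x):
    return pd.Series((x + 1, x + 2))

def func_i1_o2_fast(x):
    return x + 1, x + 2

df[["i", "j"]] = df["b"].apply(func_i1_o2)
df["k"], df["l"] = zip(*df["b"].apply(func_i1_o2_fast))

# Both approaches yield the same values.
assert df["i"].tolist() == df["k"].tolist()
assert df["j"].tolist() == df["l"].tolist()

# Rough timing on a larger, hypothetical series.
big = pd.Series(range(2_000))
t_series = timeit.timeit(lambda: big.apply(func_i1_o2), number=3)
t_tuple = timeit.timeit(lambda: list(zip(*big.apply(func_i1_o2_fast))), number=3)
print(f"pd.Series: {t_series:.3f}s, zip/tuple: {t_tuple:.3f}s")
```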
Rolling
As explained in point 1 in the notes above, we need a workaround if we want to return more than 1 value when combining rolling and apply. I found 2 working solutions.
Solution 1
def func_i1_o2_rolling_solution1(x):
    output_1 = np.max(x)
    output_2 = np.min(x)
    # Last index is where to place the final values: x.index[-1]
    # .loc is used because .at is meant for a single scalar cell
    df.loc[x.index[-1], ['m', 'n']] = output_1, output_2
    return 0
df['m'], df['n'] = (np.nan, np.nan)
df['b'].rolling(2).apply(func_i1_o2_rolling_solution1, raw=False)
Pros: Everything is done within 1 function.
Cons: You have to create the columns first, and it is slower since it doesn't use the raw input.
Solution 2
rolling_w = 2
# The first (rolling_w - 1) rows have no full window, so pad them with NaN
nan_prefix = (rolling_w - 1) * [np.nan]
output_list_1 = nan_prefix.copy()
output_list_2 = nan_prefix.copy()
def func_i1_o2_rolling_solution2(x):
    output_list_1.append(np.max(x))
    output_list_2.append(np.min(x))
    return 0
df['b'].rolling(rolling_w).apply(func_i1_o2_rolling_solution2, raw=True)
df['o'] = output_list_1
df['p'] = output_list_2
Pros: It uses the raw input, which makes it about twice as fast. And since it doesn't use indexes to set the output values, the code looks a bit clearer (to me at least).
Cons: You have to create the NaN prefix yourself, and it takes a few more lines of code.
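The second solution relies on module-level lists, which have to be reset before every run. A minimal sketch of the same idea wrapped in a function, so the lists stay local, might look like this. The helper name `rolling_max_min` is hypothetical, and the sketch assumes the column contains no NaN values, since windows with missing values are skipped by rolling and the lists would then come out too short.

```python
import numpy as np
import pandas as pd

def rolling_max_min(series, window):
    """Hypothetical helper: solution 2 with the output lists kept local."""
    maxes = [np.nan] * (window - 1)  # NaN prefix for rows without a full window
    mins = [np.nan] * (window - 1)

    def collect(x):
        # x is a NumPy array (raw=True) with the values of the current window
        maxes.append(np.max(x))
        mins.append(np.min(x))
        return 0  # rolling.apply expects a single numeric return value

    series.rolling(window).apply(collect, raw=True)
    return maxes, mins

df = pd.DataFrame({"b": [2, 1, 4, 2, 2, 0]})
df["o"], df["p"] = rolling_max_min(df["b"], 2)
print(df)
```

For max and min specifically, the built-in `rolling(2).max()` and `rolling(2).min()` give the same columns; the pattern is only needed when the outputs come from one custom function.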
Rolling & Groupby
Normally, I would use the faster 2nd solution above. However, since we're combining groupby and rolling, you'd have to manually insert NaNs/zeros (depending on the number of groups) at the right positions somewhere in the middle of the output. To me it seems that when combining rolling, groupby and multiple output columns, the first solution is easier because it handles the NaNs and the grouping automatically. Once again, I use the reset_index solution at the end.
def func_i1_o2_rolling_groupby(x):
    output_1 = np.max(x)
    output_2 = np.min(x)
    # Last index is where to place the final values: x.index[-1]
    # .loc is used because .at is meant for a single scalar cell
    df.loc[x.index[-1], ['q', 'r']] = output_1, output_2
    return 0
df['q'], df['r'] = (np.nan, np.nan)
df.groupby('group')['b'].rolling(2).apply(func_i1_o2_rolling_groupby, raw=False).reset_index(drop=True)
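If the speed of the second solution matters, one possible alternative is to run it once per group, so that each group gets its own NaN prefix. This is a sketch under assumptions, not the approach of the original notebook: the helper `rolling_max_min` is hypothetical, the sample values are fixed for reproducibility, and the column is assumed to contain no NaN values.

```python
import numpy as np
import pandas as pd

def rolling_max_min(series, window):
    """Hypothetical helper: collect two outputs per window in local lists."""
    maxes = [np.nan] * (window - 1)
    mins = [np.nan] * (window - 1)

    def collect(x):
        maxes.append(np.max(x))
        mins.append(np.min(x))
        return 0

    series.rolling(window).apply(collect, raw=True)
    return maxes, mins

df = pd.DataFrame({"group": [0, 0, 0, 1, 1, 1], "b": [2, 1, 4, 2, 2, 0]})
df["q"], df["r"] = (np.nan, np.nan)

# Process every group separately and write the results back by row label.
for _, b_part in df.groupby("group")["b"]:
    maxes, mins = rolling_max_min(b_part, 2)
    df.loc[b_part.index, "q"] = maxes
    df.loc[b_part.index, "r"] = mins

print(df)
```

Because the results are written back through the row labels of each group, no reset_index step is needed here.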
Basic (2 input columns, 2 output columns)
I suggest using the same 'fast' way as for i1_o2, with the only difference that you get 2 input values to use.
def func_i2_o2(x):
    return np.mean(x), np.median(x)
df['s'], df['t'] = zip(*df[['b', 'c']].apply(func_i2_o2, axis=1))
Rolling
Since I use one workaround for rolling with multiple inputs and another workaround for rolling with multiple outputs, you can guess that I need to combine them for this one:
1. Get values from other columns using indexes (see func_i2_o1_rolling)
2. Set the final multiple outputs on the correct index (see func_i1_o2_rolling_solution1)
def func_i2_o2_rolling(x):
    values_b = x.to_numpy()
    values_c = df.loc[x.index, 'c'].to_numpy()
    output_1 = np.min([np.sum(values_b), np.sum(values_c)])
    output_2 = np.max([np.sum(values_b), np.sum(values_c)])
    # Last index is where to place the final values: x.index[-1]
    # .loc is used because .at is meant for a single scalar cell
    df.loc[x.index[-1], ['u', 'v']] = output_1, output_2
    return 0
df['u'], df['v'] = (np.nan, np.nan)
df['b'].rolling(2).apply(func_i2_o2_rolling, raw=False)
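For comparison, the same two outputs can also be computed without rolling.apply by building the windows directly in NumPy. This is an alternative technique, not part of the original notebook. It assumes NumPy 1.20 or newer (for `sliding_window_view`), fixed sample values, and a function that can be expressed with array operations.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

# Assumed sample values; 'c' is derived as b + 1, as in the basic example.
df = pd.DataFrame({"b": [2, 1, 4, 2, 2, 0]})
df["c"] = df["b"] + 1
window = 2

# Shape (n - window + 1, 2, window): one (columns x window) block per full window.
blocks = sliding_window_view(df[["b", "c"]].to_numpy(), window, axis=0)

# Per-window sum of b and of c, shape (n - window + 1, 2).
sums = blocks.sum(axis=2)

# Pad the rows that have no full window with NaN, as rolling would.
pad = np.full(window - 1, np.nan)
df["u"] = np.concatenate([pad, sums.min(axis=1)])
df["v"] = np.concatenate([pad, sums.max(axis=1)])
print(df)
```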
Rolling & Groupby
Add the reset_index solution (see notes above) to the rolling function.
def func_i2_o2_rolling_groupby(x):
    values_b = x.to_numpy()
    values_c = df.loc[x.index, 'c'].to_numpy()
    output_1 = np.min([np.sum(values_b), np.sum(values_c)])
    output_2 = np.max([np.sum(values_b), np.sum(values_c)])
    # Last index is where to place the final values: x.index[-1]
    # .loc is used because .at is meant for a single scalar cell
    df.loc[x.index[-1], ['w', 'x']] = output_1, output_2
    return 0
df['w'], df['x'] = (np.nan, np.nan)
df.groupby('group')['b'].rolling(2).apply(func_i2_o2_rolling_groupby, raw=False).reset_index(drop=True)