Pandas apply, rolling, groupby with multiple input & multiple output columns

I've been struggling the past week trying to use apply to run functions over an entire pandas dataframe, including rolling windows, groupby, and especially multiple input columns and multiple output columns. I found a large number of questions on SO about this topic and many old & outdated answers. So I started to create a notebook for every possible combination of x inputs & outputs, rolling, and rolling & groupby combined, and I focused on performance as well. Since I'm not the only one struggling with these questions, I thought I'd provide my solutions here with working examples, hoping it helps any existing/future pandas users.

Important notes

  1. The combination of apply & rolling in pandas has a very strong output requirement: you have to return one single value. You cannot return a pd.Series, a list, an array, or an array hidden within an array, just one value, e.g. one integer. This requirement makes it hard to get a working solution when trying to return multiple outputs for multiple columns. I don't understand why 'apply & rolling' has this requirement, because 'apply' without rolling doesn't have it. It must be due to some internal pandas functions.
  2. The combination of 'apply & rolling' with multiple input columns simply does not work. Imagine a dataframe with 2 columns and 6 rows, and you want to apply a custom function with a rolling window of 2. Your function should get an input array with 2x2 values: 2 values of each column for 2 rows. But it seems pandas can't handle rolling and multiple input columns at the same time. I tried to use the axis parameter to get it working, but:
    • Axis = 0 will call your function per column. In the dataframe described above, it will call your function 10 times (not 12, because with a window of 2 each column yields only 5 full windows), and since it's per column, it only provides the 2 rolling values of that column...
    • Axis = 1 will call your function per row. This is what you probably want, but pandas will not provide a 2x2 input. It actually completely ignores the rolling and only provides one row with the values of the 2 columns...
  3. When using 'apply' with multiple input columns, you can provide a parameter called raw (boolean). It's False by default, which means the input will be a pd.Series and thus includes the indexes next to the values. If you don't need the indexes, you can set raw to True to get a NumPy array, which often achieves much better performance.
  4. When combining 'rolling & groupby', the result is a multi-indexed series which can't easily serve as the input for a new column. The easiest solution is to append a reset_index(drop=True), as answered & commented here (Python - rolling functions for GroupBy object).
  5. You might ask me: when would you ever want to use a rolling, groupby custom function with multiple outputs? Answer: I recently had to do a Fourier transform with sliding windows (rolling) over a dataset of 5 million records (speed/performance is important) with different batches within the dataset (groupby). And I needed to save both the power & phase of the Fourier transform in different columns (multiple outputs). Most people probably only need some of the basic examples below, but I believe that especially in the Machine Learning/Data Science sectors the more complex examples can be useful.
  6. Please let me know if you have even better, clearer or faster ways to perform any of the solutions below. I'll update my answer and we can all benefit!
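Notes 1 and 3 can be checked with a small experiment. The sketch below is only an illustration; the series `s` and the two helper functions are made-up names, not part of the examples that follow. It records which type rolling & apply hands to the function for each setting of raw, while the function itself returns a single number per window as required.

```python
import numpy as np
import pandas as pd

# Made-up demo data, not the dataframe used in the rest of this answer.
s = pd.Series([2.0, 1.0, 4.0, 2.0])
seen = {"raw": [], "series": []}

def window_sum_raw(x):
    # With raw=True the window arrives as a NumPy array (no index).
    seen["raw"].append(type(x).__name__)
    return x.sum()  # a single scalar, as rolling & apply requires

def window_sum_series(x):
    # With raw=False the window arrives as a pd.Series (values plus index).
    seen["series"].append(type(x).__name__)
    return x.sum()

raw_result = s.rolling(2).apply(window_sum_raw, raw=True)
series_result = s.rolling(2).apply(window_sum_series, raw=False)

print(set(seen["raw"]), set(seen["series"]))
print(raw_result.equals(series_result))
```

Both calls give the same rolling sums; only the type passed to the function differs, which is where the speed difference between the two settings comes from.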


Code examples

Let's first create a dataframe that will be used in all the examples below, including a group column for the groupby examples. For the rolling window and the multiple input/output columns I just use 2 in all code examples below, but obviously this could be any number > 1.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0,5,size=(6, 2)), columns=list('ab'))
df['group'] = [0, 0, 0, 1, 1, 1]
df = df[['group', 'a', 'b']]

It will look like this:

   group  a  b
0      0  2  2
1      0  4  1
2      0  0  4
3      1  0  2
4      1  3  2
5      1  3  0
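Because the frame is filled with random integers, a rerun will generally not reproduce the printed table. A minimal sketch, assuming you want exactly the values shown above, is to build the same frame from literal values:

```python
import pandas as pd

# Same layout as the printed table, with the values written out literally
# so that every run starts from identical data.
df = pd.DataFrame({
    "group": [0, 0, 0, 1, 1, 1],
    "a": [2, 4, 0, 0, 3, 3],
    "b": [2, 1, 4, 2, 2, 0],
})
print(df)
```

Seeding NumPy's random generator would also make the run repeatable, but it would not necessarily yield these particular numbers.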


Input 1 column, output 1 column

Basic

def func_i1_o1(x):    
    return x+1

df['c'] = df['b'].apply(func_i1_o1)


Rolling

def func_i1_o1_rolling(x):
    return (x[0] + x[1])

df['d'] = df['c'].rolling(2).apply(func_i1_o1_rolling, raw=True)


Rolling & Groupby

Add the reset_index solution (see notes above) to the rolling function.

df['e'] = df.groupby('group')['c'].rolling(2).apply(func_i1_o1_rolling, raw=True).reset_index(drop=True)
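The reset_index(drop=True) trick relies on the rows already being sorted by group with a default 0..n-1 index, as they are in the example dataframe. The sketch below uses a made-up frame `demo` whose groups are interleaved to show the multi-index that groupby & rolling returns. It also shows a variant, reset_index(level=0, drop=True), that drops only the group level so the values are matched back by row label. This is an alternative I am suggesting, not part of the original solution.

```python
import numpy as np
import pandas as pd

# Made-up frame where the two groups are interleaved instead of sorted.
demo = pd.DataFrame({
    "group": [0, 1, 0, 1, 0, 1],
    "c": [1.0, 10.0, 2.0, 20.0, 3.0, 30.0],
})

rolled = demo.groupby("group")["c"].rolling(2).apply(np.sum, raw=True)
print(rolled.index.nlevels)  # two levels: (group, original row label)

# Drop only the group level: the remaining index is the original row label,
# so the assignment aligns each value with the row it was computed for.
demo["by_label"] = rolled.reset_index(level=0, drop=True)

# Drop everything: the values are assigned in group-sorted order, which
# matches the rows only if the frame is already sorted by group.
demo["by_position"] = rolled.reset_index(drop=True)
print(demo)
```

For the sorted example dataframe both variants give the same column; they differ only when the group rows are not contiguous.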




Input 2 columns, output 1 column

Basic

def func_i2_o1(x):
    return np.sum(x)

df['f'] = df[['b', 'c']].apply(func_i2_o1, axis=1, raw=True)


Rolling

As explained in point 2 of the notes above, there isn't a 'normal' solution for 2 inputs. The workaround below uses raw=False to ensure the input is a pd.Series, which means we also get the indexes next to the values. This lets us look up the values of the other columns at the matching indexes.

def func_i2_o1_rolling(x):
    values_b = x
    values_c = df.loc[x.index, 'c'].to_numpy()
    return np.sum(values_b) + np.sum(values_c)

df['g'] = df['b'].rolling(2).apply(func_i2_o1_rolling, raw=False)
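If you would rather have the real 2x2 block from note 2 than look up the second column by index, one possible alternative (a different technique from the workaround above, and an assumption on my part that NumPy 1.20 or newer is available) is numpy.lib.stride_tricks.sliding_window_view. The frame `demo` and its values are made up for illustration; column c is b + 1 as in the earlier examples.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

# Made-up data mirroring columns b and c of the examples above.
demo = pd.DataFrame({"b": [2, 1, 4, 2, 2, 0]})
demo["c"] = demo["b"] + 1

window = 2
# Shape (n_windows, n_columns, window): blocks[i] holds the values of b and c
# for rows i .. i + window - 1, i.e. one 2x2 block per window.
blocks = sliding_window_view(demo[["b", "c"]].to_numpy(), window, axis=0)

# Same computation as func_i2_o1_rolling: sum of both columns over the window.
sums = blocks.sum(axis=(1, 2)).astype(float)

# Pad the first window - 1 rows with NaN, like rolling does.
demo["g"] = np.concatenate([np.full(window - 1, np.nan), sums])
print(demo)
```

This sidesteps rolling & apply entirely, so it does not need the global dataframe inside the function, at the cost of handling the NaN padding yourself.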


Rolling & Groupby

Add the reset_index solution (see notes above) to the rolling function.

df['h'] = df.groupby('group')['b'].rolling(2).apply(func_i2_o1_rolling, raw=False).reset_index(drop=True)




Input 1 column, output 2 columns

Basic

You could use a 'normal' solution by returning a pd.Series:

def func_i1_o2(x):
    return pd.Series((x+1, x+2))

df[['i', 'j']] = df['b'].apply(func_i1_o2)

Or you could use the zip/tuple combination, which is about 8 times faster!

def func_i1_o2_fast(x):
    return x+1, x+2

df['k'], df['l'] = zip(*df['b'].apply(func_i1_o2_fast))
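The speed-up depends on the data size and the pandas version, so it is worth measuring on your own machine. A rough timing sketch, using a made-up 2,000-row frame and helper names of my own choosing, could look like this:

```python
import timeit

import numpy as np
import pandas as pd

# Made-up data: large enough for the difference to show, small enough to run fast.
demo = pd.DataFrame({"b": np.arange(2_000)})

def as_series(x):
    return pd.Series((x + 1, x + 2))

def as_tuple(x):
    return x + 1, x + 2

# Both approaches should produce the same two columns.
slow = demo["b"].apply(as_series)
fast_1, fast_2 = zip(*demo["b"].apply(as_tuple))

t_series = timeit.timeit(lambda: demo["b"].apply(as_series), number=3)
t_tuple = timeit.timeit(lambda: list(zip(*demo["b"].apply(as_tuple))), number=3)
print(f"pd.Series return: {t_series:.3f}s, tuple + zip: {t_tuple:.3f}s")
```

The printed timings are whatever your environment produces; no particular ratio is guaranteed.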


Rolling

As explained in point 1 of the notes above, we need a workaround if we want to return more than 1 value when using rolling & apply combined. I found 2 working solutions.

Solution 1

def func_i1_o2_rolling_solution1(x):
    output_1 = np.max(x)
    output_2 = np.min(x)
    # Last index is where to place the final values: x.index[-1]
    # .loc is used because .at only accesses a single row/column label pair.
    df.loc[x.index[-1], ['m', 'n']] = output_1, output_2
    return 0

df['m'], df['n'] = (np.nan, np.nan)
df['b'].rolling(2).apply(func_i1_o2_rolling_solution1, raw=False)

Pros: Everything is done within 1 function.
Cons: You have to create the columns first and it is slower since it doesn't use the raw input.

Solution 2

rolling_w = 2
nan_prefix = (rolling_w - 1) * [np.nan]
output_list_1 = nan_prefix.copy()
output_list_2 = nan_prefix.copy()

def func_i1_o2_rolling_solution2(x):
    output_list_1.append(np.max(x))
    output_list_2.append(np.min(x))
    return 0

df['b'].rolling(rolling_w).apply(func_i1_o2_rolling_solution2, raw=True)
df['o'] = output_list_1
df['p'] = output_list_2

Pros: It uses the raw input, which makes it about twice as fast. And since it doesn't use indexes to set the output values, the code looks a bit clearer (to me at least).
Cons: You have to create the NaN prefix yourself and it takes a few more lines of code.
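The module-level lists of the second solution can be wrapped in a function so that each call starts with fresh lists. The helper name `rolling_two_outputs` and its output column names are hypothetical, and the sketch assumes the series contains no NaN values (windows containing NaN are skipped by rolling, which would leave the lists too short).

```python
import numpy as np
import pandas as pd

def rolling_two_outputs(series, window, func):
    """Hypothetical helper: func maps a raw window (ndarray) to a 2-tuple."""
    # NaN prefix for the rows that do not yet have a full window.
    first = [np.nan] * (window - 1)
    second = [np.nan] * (window - 1)

    def collect(x):
        out_1, out_2 = func(x)
        first.append(out_1)
        second.append(out_2)
        return 0.0  # rolling & apply still needs a single scalar back

    series.rolling(window).apply(collect, raw=True)
    return pd.DataFrame({"first": first, "second": second}, index=series.index)

# Usage with made-up values for column b.
b = pd.Series([2, 1, 4, 2, 2, 0])
out = rolling_two_outputs(b, 2, lambda x: (np.max(x), np.min(x)))
print(out)
```

The collecting function shares one computation per window between both outputs, which is the situation (such as the Fourier transform case) where two separate rolling passes would be wasteful.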


Rolling & Groupby

Normally, I would use the faster 2nd solution above. However, since we're combining groups and rolling, you'd have to manually set NaNs/zeros (depending on the number of groups) at the right indexes somewhere in the middle of the dataset. To me it seems that when combining rolling, groupby and multiple output columns, the first solution is easier and handles the NaNs and the grouping automatically. Once again, I use the reset_index solution at the end.

def func_i1_o2_rolling_groupby(x):
    output_1 = np.max(x)
    output_2 = np.min(x)
    # Last index is where to place the final values: x.index[-1]
    # .loc is used because .at only accesses a single row/column label pair.
    df.loc[x.index[-1], ['q', 'r']] = output_1, output_2
    return 0

df['q'], df['r'] = (np.nan, np.nan)
df.groupby('group')['b'].rolling(2).apply(func_i1_o2_rolling_groupby, raw=False).reset_index(drop=True)
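When the two outputs are independent built-in aggregations, as max and min are here, another option is to split the frame by group, roll each part, and concatenate the pieces again. This is a sketch of that alternative with made-up data in `demo`. It does not cover custom functions that need to share one computation between both outputs.

```python
import pandas as pd

# Made-up data mirroring the group and b columns of the example table.
demo = pd.DataFrame({
    "group": [0, 0, 0, 1, 1, 1],
    "b": [2, 1, 4, 2, 2, 0],
})

pieces = []
for _, part in demo.groupby("group"):
    windows = part["b"].rolling(2)
    # Each piece keeps the original row labels of its group, and the first
    # row of every group is NaN because its window is incomplete.
    pieces.append(pd.DataFrame({"q": windows.max(), "r": windows.min()}))

# join aligns on the row labels, so the group order does not matter.
result = demo.join(pd.concat(pieces))
print(result)
```

The NaN rows at the start of each group appear without any manual bookkeeping, since every group is rolled on its own.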




Input 2 columns, output 2 columns

Basic

I suggest using the same 'fast' way as for i1_o2, with the only difference that you get 2 input values to use.

def func_i2_o2(x):
    return np.mean(x), np.median(x)

df['s'], df['t'] = zip(*df[['b', 'c']].apply(func_i2_o2, axis=1))


Rolling

As I use one workaround for rolling with multiple inputs and another workaround for rolling with multiple outputs, you can guess I need to combine them for this one:
1. Get values from other columns using indexes (see func_i2_o1_rolling)
2. Set the final multiple outputs on the correct index (see func_i1_o2_rolling_solution1)

def func_i2_o2_rolling(x):
    values_b = x.to_numpy()
    values_c = df.loc[x.index, 'c'].to_numpy()
    output_1 = np.min([np.sum(values_b), np.sum(values_c)])
    output_2 = np.max([np.sum(values_b), np.sum(values_c)])    
    # Last index is where to place the final values: x.index[-1]
    # .loc is used because .at only accesses a single row/column label pair.
    df.loc[x.index[-1], ['u', 'v']] = output_1, output_2
    return 0

df['u'], df['v'] = (np.nan, np.nan)
df['b'].rolling(2).apply(func_i2_o2_rolling, raw=False)


Rolling & Groupby

Add the reset_index solution (see notes above) to the rolling function.

def func_i2_o2_rolling_groupby(x):
    values_b = x.to_numpy()
    values_c = df.loc[x.index, 'c'].to_numpy()
    output_1 = np.min([np.sum(values_b), np.sum(values_c)])
    output_2 = np.max([np.sum(values_b), np.sum(values_c)])    
    # Last index is where to place the final values: x.index[-1]
    # .loc is used because .at only accesses a single row/column label pair.
    df.loc[x.index[-1], ['w', 'x']] = output_1, output_2
    return 0

df['w'], df['x'] = (np.nan, np.nan)
df.groupby('group')['b'].rolling(2).apply(func_i2_o2_rolling_groupby, raw=False).reset_index(drop=True)
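For comparison, here is a sketch that handles multiple inputs, multiple outputs and groups without writing into the global dataframe from inside the function. It swaps rolling & apply for numpy.lib.stride_tricks.sliding_window_view (assuming NumPy 1.20 or newer). The helper `rolling_multi`, the function `min_max_of_sums` and the frame `demo` are hypothetical names introduced only for this illustration.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

def rolling_multi(frame, columns, window, func, out_names):
    """Hypothetical helper: call func on every (len(columns), window) block."""
    out = pd.DataFrame(np.nan, index=frame.index, columns=out_names)
    if len(frame) >= window:
        values = frame[columns].to_numpy(dtype=float)
        # blocks[i] has shape (len(columns), window), covering rows i .. i + window - 1.
        blocks = sliding_window_view(values, window, axis=0)
        results = np.array([func(block) for block in blocks], dtype=float)
        # Rows before the first full window stay NaN.
        out.iloc[window - 1:] = results
    return out

def min_max_of_sums(block):
    sums = block.sum(axis=1)  # one window sum per input column
    return sums.min(), sums.max()

# Made-up data: group and b as in the example table, c = b + 1.
demo = pd.DataFrame({"group": [0, 0, 0, 1, 1, 1], "b": [2, 1, 4, 2, 2, 0]})
demo["c"] = demo["b"] + 1

pieces = [
    rolling_multi(part, ["b", "c"], 2, min_max_of_sums, ["w", "x"])
    for _, part in demo.groupby("group")
]
result = demo.join(pd.concat(pieces))
print(result)
```

Whether this is faster than the apply-based workaround on large data is something to measure rather than assume; the list comprehension still calls the Python function once per window.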
