Pandas grouped aggregation with multiple columns as inputs to a user-defined function

Question

I am still trying to learn pandas. I have a custom user-defined function that requires two columns as input. It is an aggregation function, so it needs to be applied by group.

This is my question: How can I get a grouped aggregation with multiple columns as inputs to a user-defined function?

Here is my reproducible example along with a couple of things I tried.

import pandas as pd
import numpy as np


def first_b_over_avg_c(b,c):
    first_b = b.first()
    avg_c = np.mean(c)
    return first_b / avg_c


np.random.seed(42)
df = pd.DataFrame(
        {
            "a": ["one", "one", "one", "one", "two", "two", "two", "two"],
            "b": np.random.uniform(0,1,8),
            "c": np.random.uniform(0,1,8)
            }
        )
print(df)

df.groupby(['a'],as_index = False).assign(d = lambda df: first_b_over_avg_c(df['b'],df['c']))
df.groupby(['a'],as_index = False).apply(first_b_over_avg_c, b=('b'), c=('c'))

Here is the output:

     a         b         c
0  one  0.374540  0.601115
1  one  0.950714  0.708073
2  one  0.731994  0.020584
3  one  0.598658  0.969910
4  two  0.156019  0.832443
5  two  0.155995  0.212339
6  two  0.058084  0.181825
7  two  0.866176  0.183405

And the error:

Traceback (most recent call last):
  File "reprex.py", line 21, in <module>
    df.groupby(['a'],as_index = False).assign(d = lambda df: first_b_over_avg_c(df['b'],df['c']))
  File "/home/ubuntu/.local/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 703, in __getattr__
    raise AttributeError(
AttributeError: 'DataFrameGroupBy' object has no attribute 'assign'

Answer 1

I think this does what you are looking for:

import pandas as pd
import numpy as np


def first_b_over_avg_c(group):
    first_b = group['b'].iloc[0]
    avg_c = np.mean(group['c'])
    return first_b / avg_c


np.random.seed(42)
df = pd.DataFrame(
        {
            "a": ["one", "one", "one", "one", "two", "two", "two", "two"],
            "b": np.random.uniform(0,1,8),
            "c": np.random.uniform(0,1,8)
            }
        )

df.groupby(['a'],as_index = False).apply(first_b_over_avg_c)

If I am reading this correctly, all you want to do is access multiple columns from within a user-defined function.

In this example the entire group (a DataFrame holding that group's rows) is passed into the function. If you print out group inside the function, the first call shows:

     a         b         c
0  one  0.374540  0.601115
1  one  0.950714  0.708073
2  one  0.731994  0.020584
3  one  0.598658  0.969910

You can see that the groups 'one' and 'two' get passed separately.
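To see the per-group pieces without writing a function at all, one option is to iterate over the groupby object. The sketch below rebuilds the DataFrame from the question; the dictionary name `seen` is only illustrative. Iteration always includes the grouping column, whereas whether apply passes it along depends on the pandas version.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame(
    {
        "a": ["one", "one", "one", "one", "two", "two", "two", "two"],
        "b": np.random.uniform(0, 1, 8),
        "c": np.random.uniform(0, 1, 8),
    }
)

# Iterating a groupby yields (key, sub-DataFrame) pairs, one per group.
# apply() calls the function once per group with a similar sub-DataFrame.
seen = {}
for key, group in df.groupby("a"):
    print(key)
    print(group)
    seen[key] = group.shape  # (rows, columns) of this group's piece
```

Each piece is an ordinary DataFrame, so any column can be read from it inside the function.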

Had the object not been grouped, apply with axis=1 would pass each row separately (the default, axis=0, passes each column).
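For comparison, here is a small sketch of apply on an ungrouped DataFrame. The frame `toy` and its values are made up for illustration and are not the data from the question.

```python
import pandas as pd

# Hypothetical toy data, chosen so the results are easy to check by eye.
toy = pd.DataFrame({"b": [1.0, 2.0, 3.0], "c": [2.0, 4.0, 6.0]})

# axis=1: the function receives one row at a time,
# as a Series indexed by column name.
per_row = toy.apply(lambda row: row["b"] / row["c"], axis=1)

# Default axis=0: the function receives one column at a time.
per_column = toy.apply(lambda col: col.iloc[0])

print(per_row)
print(per_column)
```

Neither call knows about groups, which is why the groupby step is needed for a per-group result.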

I do not see a point in assigning and passing the columns in separately unless there is a specific reason for it.
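If there is such a reason, for example the two-argument function is reused elsewhere, a thin lambda can unpack the columns for it. This is a sketch rather than the only way to do it: it assumes the two-argument function uses `b.iloc[0]` for the first value, selects `b` and `c` before calling apply so the function never depends on the grouping column, and cross-checks the result against built-in groupby aggregations. The names `via_apply` and `via_builtin` are illustrative.

```python
import numpy as np
import pandas as pd


def first_b_over_avg_c(b, c):
    # b and c are the two columns of a single group, passed as Series.
    return b.iloc[0] / c.mean()


np.random.seed(42)
df = pd.DataFrame(
    {
        "a": ["one", "one", "one", "one", "two", "two", "two", "two"],
        "b": np.random.uniform(0, 1, 8),
        "c": np.random.uniform(0, 1, 8),
    }
)

grouped = df.groupby("a")

# Option 1: the lambda receives each group's sub-DataFrame and
# forwards the two columns to the two-argument function.
via_apply = grouped[["b", "c"]].apply(
    lambda g: first_b_over_avg_c(g["b"], g["c"])
)

# Option 2: for this particular formula, built-in aggregations
# give the same numbers without a Python-level function per group.
via_builtin = grouped["b"].first() / grouped["c"].mean()

print(via_apply)
print(via_builtin)
```

Both results are Series indexed by the values of `a`; calling `.reset_index(name="d")` on either one turns it into a two-column DataFrame.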

Answer 1
0 · Accepted · 2021-07-23 18:57:26
