
Pandas grouped aggregation with multiple columns as inputs to a user defined function

I am still trying to learn pandas. I have a custom user-defined function that requires two columns as input. It is an aggregation function, so it needs to be applied per group.

This is my question: How can I get a grouped aggregation with multiple columns as inputs to a user-defined function?

Here is my reproducible example, along with a couple of things I tried.

import pandas as pd
import numpy as np


def first_b_over_avg_c(b,c):
    first_b = b.first()
    avg_c = np.mean(c)
    return first_b / avg_c


np.random.seed(42)
df = pd.DataFrame(
        {
            "a": ["one", "one", "one", "one", "two", "two", "two", "two"],
            "b": np.random.uniform(0,1,8),
            "c": np.random.uniform(0,1,8)
            }
        )
print(df)

df.groupby(['a'],as_index = False).assign(d = lambda df: first_b_over_avg_c(df['b'],df['c']))
df.groupby(['a'],as_index = False).apply(first_b_over_avg_c, b=('b'), c=('c'))

Here is the output:

     a         b         c
0  one  0.374540  0.601115
1  one  0.950714  0.708073
2  one  0.731994  0.020584
3  one  0.598658  0.969910
4  two  0.156019  0.832443
5  two  0.155995  0.212339
6  two  0.058084  0.181825
7  two  0.866176  0.183405

And the error:

Traceback (most recent call last):
  File "reprex.py", line 21, in <module>
    df.groupby(['a'],as_index = False).assign(d = lambda df: first_b_over_avg_c(df['b'],df['c']))
  File "/home/ubuntu/.local/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 703, in __getattr__
    raise AttributeError(
AttributeError: 'DataFrameGroupBy' object has no attribute 'assign'

I think this does what you are looking for:

import pandas as pd
import numpy as np


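def first_b_over_avg_c(group):
    # group is the sub-DataFrame holding all rows of one group
    first_b = group['b'].iloc[0]   # first value of b within the group
    avg_c = np.mean(group['c'])    # mean of c within the group
    return first_b / avg_c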


np.random.seed(42)
df = pd.DataFrame(
        {
            "a": ["one", "one", "one", "one", "two", "two", "two", "two"],
            "b": np.random.uniform(0,1,8),
            "c": np.random.uniform(0,1,8)
            }
        )

df.groupby(['a'],as_index = False).apply(first_b_over_avg_c)

If I am reading correctly, all you want is to be able to access multiple columns from within a user-defined function.

In this example, the entire group (a sub-DataFrame with all of its columns) is passed into the function. If you print out group in the function:

     a         b         c
0  one  0.374540  0.601115
1  one  0.950714  0.708073
2  one  0.731994  0.020584
3  one  0.598658  0.969910

you can see that the groups 'one' and 'two' get passed separately.
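To inspect the groups without editing the function, one option is to iterate over the GroupBy object directly. This is a minimal sketch that rebuilds the question's data; the `seen` dictionary is only an illustrative helper and not part of the original answer.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame(
    {
        "a": ["one"] * 4 + ["two"] * 4,
        "b": np.random.uniform(0, 1, 8),
        "c": np.random.uniform(0, 1, 8),
    }
)

# Iterating over a GroupBy yields (key, sub-DataFrame) pairs, one per group.
seen = {}  # illustrative helper: records the shape of each group
for key, group in df.groupby("a"):
    seen[key] = group.shape
    print(key)
    print(group)
```

Each `group` here is an ordinary DataFrame, so every column is available by name inside a function that receives it.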

Had the object not been grouped, DataFrame.apply would instead pass each column separately by default, or each row separately with axis=1.
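For comparison, here is a small sketch of what an ungrouped `DataFrame.apply` hands to the function. It uses a tiny made-up frame rather than the question's data, and the names `by_column` and `by_row` are illustrative.

```python
import pandas as pd

# Tiny made-up frame, only to show what apply passes to the function.
small = pd.DataFrame({"b": [1.0, 2.0], "c": [4.0, 8.0]})

# Default axis=0: the function receives one column (a Series) at a time.
by_column = small.apply(lambda col: col.name)

# axis=1: the function receives one row (a Series indexed by column name) at a time.
by_row = small.apply(lambda row: row["b"] / row["c"], axis=1)

print(by_column)
print(by_row)
```

With `axis=1` a row-wise function can read several columns at once, but it only ever sees a single row, so it cannot compute a per-group aggregate such as a mean.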

I do not see a point in assigning the columns and passing them in separately unless there is a specific reason for it.
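If there is such a reason, for example because the two-argument function is reused elsewhere, a thin lambda can unpack the columns for it. The following is a sketch under that assumption, not the answer's own method; the output column name `d` is borrowed from the question's `assign` attempt, and `b.first()` is replaced by `b.iloc[0]`, since `Series.first` does not return the first element.

```python
import numpy as np
import pandas as pd


def first_b_over_avg_c(b, c):
    # Two-argument version: b and c are the Series of one group.
    return b.iloc[0] / c.mean()


np.random.seed(42)
df = pd.DataFrame(
    {
        "a": ["one"] * 4 + ["two"] * 4,
        "b": np.random.uniform(0, 1, 8),
        "c": np.random.uniform(0, 1, 8),
    }
)

result = (
    df.groupby("a")[["b", "c"]]  # keep only the columns the function needs
    .apply(lambda g: first_b_over_avg_c(g["b"], g["c"]))
    .rename("d")  # name the resulting Series
    .reset_index()  # turn the group key back into a column
)
print(result)
```

The result has one row per group, with the group key in column `a` and the aggregated value in column `d`.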
