
Pandas group double observations by aggregating column

I have a dataframe like this:

+----------+---------+
| username | role    |
+----------+---------+
| foo      | user    |
+----------+---------+
| foo      | analyst |
+----------+---------+
| bar      | admin   |
+----------+---------+

and I would like to collapse the users that appear twice or more into a single row by aggregating the role column, so as to obtain the following dataframe:

+----------+---------------+
| username | role          |
+----------+---------------+
| foo      | user, analyst |
+----------+---------------+
| bar      | admin         |
+----------+---------------+

So far I have tried using a pivot table in this way:

table = pd.pivot_table(df, index='username', columns='role')

and also the groupby function, but neither seems to be the right way to do it. What is the right way to deal with this?

What you want to do is group the rows by username, so the groupby function is one way to go. Usually when you use groupby you apply an aggregation function to the remaining columns, for example sum, mean, min or similar. But you can also define your own aggregation function and pass it to agg.

def merge_strings(series):
    # This function receives a Series holding all the values of a column for one group.
    # For example, for foo the Series will contain ['user', 'analyst'].
    # Series.str.cat() concatenates the strings of a Series into a single string.
    return series.str.cat(sep=', ')

Then we simply call groupby and tell it to aggregate the role column using our custom function:

df.groupby('username').agg({'role': merge_strings})
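Put together, the approach above can be sketched as a self-contained script. The DataFrame is rebuilt here from the table in the question, and the variable name result is only illustrative:

```python
import pandas as pd

# Sample data reconstructed from the table in the question.
df = pd.DataFrame({
    "username": ["foo", "foo", "bar"],
    "role": ["user", "analyst", "admin"],
})

def merge_strings(series):
    # Join all string values of one group into a single comma-separated string.
    return series.str.cat(sep=", ")

# Group by username and aggregate the role column with the custom function.
# The result is indexed by username; call .reset_index() to get it back as a column.
result = df.groupby("username").agg({"role": merge_strings})
print(result)
```

Note that groupby sorts the group keys by default, so bar comes before foo in the result.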

You can collect the roles into either a list or a comma-separated string. For a list:

df.groupby('username')['role'].agg(list).reset_index()

Output:

  username             role
0      bar          [admin]
1      foo  [user, analyst]

Or, for a comma-separated string:

df.groupby('username')['role'].agg(lambda x: ', '.join(x)).reset_index()

Output:

  username           role
0      bar          admin
1      foo  user, analyst
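Both outputs list bar before foo because groupby sorts the group keys by default, whereas the desired table in the question keeps foo first. A minimal sketch that keeps the order of first appearance, assuming the sample data from the question and using the sort=False option of groupby (the name out is only illustrative):

```python
import pandas as pd

# Sample data reconstructed from the table in the question.
df = pd.DataFrame({
    "username": ["foo", "foo", "bar"],
    "role": ["user", "analyst", "admin"],
})

# sort=False keeps the groups in the order in which each username first appears.
out = (
    df.groupby("username", sort=False)["role"]
      .agg(lambda roles: ", ".join(roles))
      .reset_index()
)
print(out)
```

Whether the original order matters depends on how the result is used; if it does not, the default sorted output is equally valid.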
