Pandas group double observations by aggregating column
I have a dataframe like this:
+----------+---------+
| username | role |
+----------+---------+
| foo | user |
+----------+---------+
| foo | analyst |
+----------+---------+
| bar | admin |
+----------+---------+
and I would like to collapse the users that appear twice or more into a single row by aggregating the role column, to obtain the following dataframe:
+----------+---------------+
| username | role |
+----------+---------------+
| foo | user, analyst |
+----------+---------------+
| bar | admin |
+----------+---------------+
So far I have tried using a pivot table in this way:
table = pd.pivot_table(df, index='username', columns='role')
and also the groupby function, but this is not the right way to do it. What is the right way to deal with this?
What you want to do is group the rows based on username, so the groupby function is one way to go. Usually when you use groupby you apply an aggregation function to the rest of the columns, for example sum, mean, min or similar. But you can also define your own aggregation function and use that in agg.
def merge_strings(series):
    # This function receives a Series of all the values in a column for one group.
    # For example, for foo the Series will be ['user', 'analyst'].
    # We can use the built-in method str.cat() to concatenate a Series of strings.
    return series.str.cat(sep=', ')
Then we simply call groupby and tell it that we want to aggregate the role column using our custom function:
df.groupby('username').agg({'role': merge_strings})
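For anyone who wants to run the above end to end, here is a minimal sketch. The sample dataframe is rebuilt by hand from the table in the question, and the variable name result is only illustrative.

```python
import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "username": ["foo", "foo", "bar"],
    "role": ["user", "analyst", "admin"],
})

def merge_strings(series):
    # Join all values of one group into a single comma-separated string.
    return series.str.cat(sep=", ")

# Group by username and aggregate the role column with the custom function.
# The result is indexed by username; call .reset_index() for a flat dataframe.
result = df.groupby("username").agg({"role": merge_strings})
print(result)
```

Note that groupby sorts the group keys by default, so bar comes before foo in the result.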
You can create a list or a comma-separated string using the following:
df.groupby('username')['role'].agg(list).reset_index()
Output:
username role
0 bar [admin]
1 foo [user, analyst]
OR
df.groupby('username')['role'].agg(lambda x: ', '.join(x)).reset_index()
Output:
username role
0 bar admin
1 foo user, analyst
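Both outputs above list bar before foo because groupby sorts the group keys by default, while the desired table in the question has foo first. If that order matters, one option is to pass sort=False, which keeps groups in order of first appearance. A small sketch, with the sample data rebuilt from the question and ordered as an illustrative name:

```python
import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "username": ["foo", "foo", "bar"],
    "role": ["user", "analyst", "admin"],
})

# sort=False keeps the groups in the order they first appear in df.
ordered = (
    df.groupby("username", sort=False)["role"]
      .agg(lambda x: ", ".join(x))
      .reset_index()
)
print(ordered)
```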