Pandas create new column with count from groupby
I have a df that looks like the following:
id item color
01 truck red
02 truck red
03 car black
04 truck blue
05 car black
I am trying to create a df that looks like this:
item color count
truck red 2
truck blue 1
car black 2
I have tried:
df["count"] = df.groupby("item")["color"].transform('count')
But it is not quite what I am searching for.
Any guidance is appreciated.
That's not a new column, that's a new DataFrame:
In [11]: df.groupby(["item", "color"]).count()
Out[11]:
id
item color
car black 2
truck blue 1
red 2
To get the result you want, use reset_index:
In [12]: df.groupby(["item", "color"])["id"].count().reset_index(name="count")
Out[12]:
item color count
0 car black 2
1 truck blue 1
2 truck red 2
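One detail worth knowing when choosing the aggregation: count() only counts non-null values in the selected column, while size() counts rows per group. The sketch below assumes a variant of the question's data in which one id is missing (that missing value is an illustration, not part of the original question) to show where the two differ.

```python
import pandas as pd

# Same shape as the question's data, but with one id deliberately missing
# (an assumption made here purely to show the difference).
df = pd.DataFrame({
    "id": ["01", None, "03", "04", "05"],
    "item": ["truck", "truck", "car", "truck", "car"],
    "color": ["red", "red", "black", "blue", "black"],
})

# count() skips the missing id, so the truck/red group reports 1.
by_count = df.groupby(["item", "color"])["id"].count().reset_index(name="count")

# size() counts rows regardless of missing values, so truck/red reports 2.
by_size = df.groupby(["item", "color"]).size().reset_index(name="count")

print(by_count)
print(by_size)
```

If the column being counted can never be null, the two give the same result.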
To get a "new column" you could use transform:
In [13]: df.groupby(["item", "color"])["id"].transform("count")
Out[13]:
0 2
1 2
2 2
3 1
4 2
dtype: int64
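As a minimal sketch of how that transform result could be attached to the original frame, the example below rebuilds the question's data (the construction of df is assumed from the table in the question) and assigns the per-group count as a column. Because transform returns one value per original row, the result aligns with df row for row.

```python
import pandas as pd

# Data reconstructed from the table in the question.
df = pd.DataFrame({
    "id": ["01", "02", "03", "04", "05"],
    "item": ["truck", "truck", "car", "truck", "car"],
    "color": ["red", "red", "black", "blue", "black"],
})

# Each row receives the size of its (item, color) group.
df["count"] = df.groupby(["item", "color"])["id"].transform("count")
print(df)

# Optionally collapse to one row per group afterwards.
summary = df[["item", "color", "count"]].drop_duplicates()
print(summary)
```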
I recommend reading the split-apply-combine section of the docs.
Another possible way to achieve the desired output is to use Named Aggregation, which allows you to specify the name and respective aggregation function for the desired output columns.
Named aggregation
(New in version 0.25.0.)
To support column-specific aggregation with control over the output column names, pandas accepts the special syntax in GroupBy.agg(), known as “named aggregation”, where:
The keywords are the output column names.
The values are tuples whose first element is the column to select and the second element is the aggregation to apply to that column. Pandas provides the pandas.NamedAgg named tuple with the fields ['column', 'aggfunc'] to make it clearer what the arguments are. As usual, the aggregation can be a callable or a string alias.
So to get the desired output, you could try something like:
import pandas as pd

# Setup
df = pd.DataFrame([
    {
        "item": "truck",
        "color": "red"
    },
    {
        "item": "truck",
        "color": "red"
    },
    {
        "item": "car",
        "color": "black"
    },
    {
        "item": "truck",
        "color": "blue"
    },
    {
        "item": "car",
        "color": "black"
    }
])

# Group by both columns and name the aggregated column explicitly
df_grouped = df.groupby(["item", "color"]).agg(
    count_col=pd.NamedAgg(column="color", aggfunc="count")
)
print(df_grouped)
Which produces the following output:
count_col
item color
car black 2
truck blue 1
red 2
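The result above keeps item and color in the index. If flat columns are wanted, as in the question's target layout, one option is to call reset_index() on the aggregated frame. The sketch below is self-contained and assumes an id column like the one in the question, counting that column rather than a grouping key.

```python
import pandas as pd

# Data reconstructed from the question, including the id column.
df = pd.DataFrame({
    "id": ["01", "02", "03", "04", "05"],
    "item": ["truck", "truck", "car", "truck", "car"],
    "color": ["red", "red", "black", "blue", "black"],
})

df_grouped = df.groupby(["item", "color"]).agg(
    count_col=pd.NamedAgg(column="id", aggfunc="count")
)

# Move the group keys out of the index and back into ordinary columns.
flat = df_grouped.reset_index()
print(flat)
```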
You can use value_counts and name the column with reset_index:
In [3]: df[['item', 'color']].value_counts().reset_index(name='counts')
Out[3]:
item color counts
0 car black 2
1 truck red 2
2 truck blue 1
Here is another option:
import numpy as np
df['Counts'] = np.zeros(len(df))
grp_df = df.groupby(['item', 'color']).count()
which results in:
Counts
item color
car black 2
truck blue 1
red 2
An option that is more literal than the accepted answer:
df.groupby(["item", "color"], as_index=False).agg(count=("item", "count"))
Any column name can be used in place of "item" in the aggregation.
"as_index=False" prevents the grouped columns from becoming the index.