Pandas groupby(): compare two columns and count matches

Question

I have the following pandas DataFrame:

name1   name2
A       B
A       A
A       C
A       A
B       B
B       A

I want to add a column named new that counts, for each group of name1, how many rows have name1 equal to name2.

The expected output is the following DataFrame:

name1   name2   new
A       B       2       
A       A       2
A       C       2
A       A       2
B       B       1
B       A       1

I tried the following, but it raises an error:

df['new'] = df.groupby('name1').apply(lambda x: (x[x['name1'] == x['name2']].fillna(False).sum()))

TypeError: incompatible index of inserted column with frame index

Answer 1

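You can compare name1 with name2 element-wise, then group the resulting booleans by name1 and sum the True values. Using transform broadcasts each group's count back to every row of that group, so the result aligns with the original index: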

df['new'] = df.name2.eq(df.name1).astype(int).groupby(df.name1).transform('sum')

df
#  name1 name2  new
#0     A     B    2
#1     A     A    2
#2     A     C    2
#3     A     A    2
#4     B     B    1
#5     B     A    1
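For readers who want to run this end to end, here is a minimal self-contained sketch of the same transform approach. The DataFrame is rebuilt from the table in the question; the intermediate variable name same is only illustrative.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    "name1": ["A", "A", "A", "A", "B", "B"],
    "name2": ["B", "A", "C", "A", "B", "A"],
})

# Row-wise comparison: True where name1 equals name2.
same = df["name1"].eq(df["name2"])

# Sum the matches per name1 group; transform returns one value per
# original row, so the result can be assigned directly as a column.
df["new"] = same.astype(int).groupby(df["name1"]).transform("sum")

print(df)
```

Because transform keeps the original index, no merge or map step is needed afterwards.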

Alternatively, if you prefer apply, first aggregate the counts per group, then use map to build the new column:

cnt = df.groupby('name1').apply(lambda g: (g.name1 == g.name2).sum())
df['new'] = df.name1.map(cnt)
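The aggregate-then-map idea can also be sketched without apply, which may be preferable on recent pandas releases where applying a function that reads the grouping column emits a deprecation warning. This is a variant of the answer's code, not the answer's exact method; the name cnt follows the snippet above.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    "name1": ["A", "A", "A", "A", "B", "B"],
    "name2": ["B", "A", "C", "A", "B", "A"],
})

# One count per group: a Series indexed by the distinct name1 values.
cnt = df["name1"].eq(df["name2"]).groupby(df["name1"]).sum()

# Look up each row's group count by its name1 value.
df["new"] = df["name1"].map(cnt)

print(cnt)
print(df)
```

Here cnt has one entry per group, which is why it cannot be assigned to df['new'] directly and has to go through map.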

Timing:

df = pd.concat([df]*10000)

%timeit df['new'] = df.name2.eq(df.name1).astype(int).groupby(df.name1).transform('sum')
# 100 loops, best of 3: 4.85 ms per loop

%%timeit
cnt = df.groupby('name1').apply(lambda g: (g.name1 == g.name2).sum())
df['new'] = df.name1.map(cnt)
# 10 loops, best of 3: 22.1 ms per loop

Answer 1 (4, accepted), 2017-10-29 17:13:00
