Count the number of unique values per group

I have 2 columns: _a and _b.

import numpy as np 
import pandas as pd
df = pd.DataFrame({'_a':[1,1,1,2,2,3,3],'_b':[3,4,5,3,3,3,9], 'a_b_3':[3,3,3,1,1,2,2]})
df

    _a  _b  a_b_3   
0   1   3   3
1   1   4   3
2   1   5   3
3   2   3   1
4   2   3   1
5   3   3   2
6   3   9   2

I need to create column a_b_3 (the count of unique values in column '_b' within each '_a' group) using pandas groupby. Thank you in advance.

Looks like you want transform + nunique:

df['a_b_3'] = df.groupby('_a')['_b'].transform('nunique')        
df
   _a  _b  a_b_3
0   1   3      3
1   1   4      3
2   1   5      3
3   2   3      1
4   2   3      1
5   3   3      2
6   3   9      2

This is effectively groupby + nunique + map:

v = df.groupby('_a')['_b'].nunique()
df['a_b_3'] = df['_a'].map(v)

df
   _a  _b  a_b_3
0   1   3      3
1   1   4      3
2   1   5      3
3   2   3      1
4   2   3      1
5   3   3      2
6   3   9      2
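
To see why the two forms agree, it can help to look at the intermediate Series. The following is a minimal sketch that rebuilds the question's frame without the target column; the names `counts`, `via_map` and `via_transform` are illustrative and do not come from the answer.

```python
import pandas as pd

# Rebuild the question's data without the target column.
df = pd.DataFrame({'_a': [1, 1, 1, 2, 2, 3, 3],
                   '_b': [3, 4, 5, 3, 3, 3, 9]})

# One entry per group: the index is the '_a' value,
# the value is the number of distinct '_b' values in that group.
counts = df.groupby('_a')['_b'].nunique()
print(counts.to_dict())  # {1: 3, 2: 1, 3: 2}

# map looks up each row's '_a' in that Series, broadcasting back to 7 rows.
via_map = df['_a'].map(counts)

# transform performs the same aggregation and broadcast in a single call.
via_transform = df.groupby('_a')['_b'].transform('nunique')

print(via_map.tolist())  # [3, 3, 3, 1, 1, 2, 2]
```

Both `via_map` and `via_transform` hold the same per-row counts, so either can be assigned to `df['a_b_3']`.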

Use -

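# Count distinct _b values per _a group, keeping _a as a regular column
df2 = df.groupby(['_a'])['_b'].nunique().reset_index()
# Both frames have a _b column, so merge suffixes them: _b_x (original), _b_y (count)
df['a_b_3'] = df.merge(df2, how='left', on='_a')['_b_y']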

Output

   _a  _b  a_b_3
0   1   3      3
1   1   4      3
2   1   5      3
3   2   3      1
4   2   3      1
5   3   3      2
6   3   9      2

If I understand you correctly, you want to group by column _a, count the number of unique values in column _b within each group, and then append this count to the original dataframe using _a as the key. The following code should achieve that.

df.merge(pd.DataFrame(df.groupby('_a')._b.nunique()), left_on='_a', right_index=True)

Breaking it down: the first step is to group by _a and count the unique values in column _b, which is what df.groupby('_a')._b.nunique() does. The result is then merged with the original dataframe using _a as the key. The groupby returns a Series, so it is converted to a DataFrame before merging, hence the pd.DataFrame call.
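
The two steps can also be sketched separately. This is one possible variant, not the answer's exact code: it assumes the frame does not yet contain the target column, and it renames the counts to `a_b_3` before merging. Without the rename the counts column is also called `_b`, so the merge would disambiguate the two with the suffixes `_b_x` and `_b_y`.

```python
import pandas as pd

df = pd.DataFrame({'_a': [1, 1, 1, 2, 2, 3, 3],
                   '_b': [3, 4, 5, 3, 3, 3, 9]})

# Step 1: distinct '_b' count per '_a' group, indexed by '_a'.
# Renaming the Series gives the merged column its final name up front.
counts = df.groupby('_a')['_b'].nunique().rename('a_b_3')

# Step 2: join each row's '_a' value against the index of the counts.
# how='left' keeps every original row in its original order.
merged = df.merge(counts.to_frame(), how='left',
                  left_on='_a', right_index=True)

print(merged.columns.tolist())   # ['_a', '_b', 'a_b_3']
print(merged['a_b_3'].tolist())  # [3, 3, 3, 1, 1, 2, 2]
```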

EDIT

@COLDSPEED's answer above is much more efficient than this one. To give an idea of the speed difference, I ran a timeit, which shows a speedup of roughly 2x on this small dataframe; on larger dataframes the gap would probably be even greater.

Using merge:

%timeit df.merge(pd.DataFrame(df.groupby('_a')._b.nunique()), left_on='_a', right_index=True)
1.43 ms ± 74.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Using transform:

%timeit df.groupby('_a')['_b'].transform('nunique')
750 µs ± 32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
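
To check how the gap behaves on a larger frame, something like the following sketch could be run outside IPython with the standard `timeit` module. The synthetic data (100,000 rows, up to 1,000 groups) and the helper names `with_merge` and `with_transform` are assumptions for illustration, not part of the original measurements. No timings are quoted because they depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data (an assumption, not from the post): 100,000 rows.
rng = np.random.default_rng(0)
big = pd.DataFrame({'_a': rng.integers(0, 1000, size=100_000),
                    '_b': rng.integers(0, 50, size=100_000)})


def with_merge(frame):
    # Aggregate per group, then join the counts back onto every row.
    counts = frame.groupby('_a')['_b'].nunique().rename('a_b_3')
    merged = frame.merge(counts.to_frame(), how='left',
                         left_on='_a', right_index=True)
    return merged['a_b_3']


def with_transform(frame):
    # Aggregate and broadcast back to the original rows in one call.
    return frame.groupby('_a')['_b'].transform('nunique')


# Both approaches should give the same per-row counts.
same = with_merge(big).tolist() == with_transform(big).tolist()

# Total seconds for 5 calls of each approach.
t_merge = timeit.timeit(lambda: with_merge(big), number=5)
t_transform = timeit.timeit(lambda: with_transform(big), number=5)
print(f"merge: {t_merge:.3f}s, transform: {t_transform:.3f}s, same: {same}")
```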
