
Add column with number of duplicated values within groups and number of unique values within groups in pandas

I have a dataframe such as:

Groups Names
G1     A
G1     A
G1     B
G1     B
G1     C
G1     C
G1     C
G1     D
G2     A
G2     B
G2     C
G3     A
G3     A
G4     F
G4     F
G4     E

For each group in `Groups`, I would like to count the number of values within the `Names` column that are duplicated (appear at least 2 times), and add this information in a new column called `Nb_duplicated`. I would also like to add another column called `Number_unique_names`, which will be the number of unique `Names` values within each group.

I should then get:

Groups Names Nb_duplicated Number_unique_names
G1     A     3             4
G1     A     3             4
G1     B     3             4
G1     B     3             4
G1     C     3             4
G1     C     3             4
G1     C     3             4
G1     D     3             4
G2     A     0             3
G2     B     0             3
G2     C     0             3
G3     A     1             1
G3     A     1             1
G4     F     1             2
G4     F     1             2
G4     E     1             2
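For reference, the input dataframe used in the answers below can be reconstructed from the table above (a minimal sketch; the column order matches the question):

```python
import pandas as pd

# Rebuild the sample dataframe from the question
df = pd.DataFrame({
    'Groups': ['G1'] * 8 + ['G2'] * 3 + ['G3'] * 2 + ['G4'] * 3,
    'Names':  list('AABBCCCD') + list('ABC') + list('AA') + list('FFE'),
})
print(df)
```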

You can compute the number of unique names and the number of non-duplicated names (with GroupBy.transform), then subtract the two to get the number of duplicated ones:

# set up group
g = df.groupby('Groups')
# get unique values
df['unique'] = g['Names'].transform('nunique')
# get non-duplicates
non_dup = g['Names'].transform(lambda x: (~x.duplicated(False)).sum())
# duplicates = unique - non-duplicates
df['duplicated'] = df['unique'] - non_dup

NB. I used an intermediate variable `non_dup` here for clarity, but you can use a one-liner.
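As a sketch, the one-liner version folds the intermediate variable into a single expression (sample data rebuilt inline so the snippet is self-contained):

```python
import pandas as pd

df = pd.DataFrame({'Groups': ['G1'] * 8 + ['G2'] * 3 + ['G3'] * 2 + ['G4'] * 3,
                   'Names': list('AABBCCCD') + list('ABC') + list('AA') + list('FFE')})

g = df.groupby('Groups')['Names']
# duplicated = unique - non-duplicated, in one expression
df['duplicated'] = (g.transform('nunique')
                    - g.transform(lambda x: (~x.duplicated(keep=False)).sum()))
print(df['duplicated'].tolist())
```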

output (with the intermediate `non_dup` shown for clarity):

   Groups Names  unique  duplicated           non_dup
0      G1     A       4           3                 1
1      G1     A       4           3                 1
2      G1     B       4           3                 1
3      G1     B       4           3                 1
4      G1     C       4           3                 1
5      G1     C       4           3                 1
6      G1     C       4           3                 1
7      G1     D       4           3                 1
8      G2     A       3           0                 3
9      G2     B       3           0                 3
10     G2     C       3           0                 3
11     G3     A       1           1                 0
12     G3     A       1           1                 0
13     G4     F       2           1                 1
14     G4     F       2           1                 1
15     G4     E       2           1                 1

Build masks by chaining DataFrame.duplicated: the inverted mask `m` flags the first occurrence of each (`Groups`, `Names`) pair, and `m1` (with `keep=False`) flags every row whose pair appears more than once, so `m & m1` is True exactly once per duplicated pair. Then count the `True`s per group with `sum` in GroupBy.transform:

m = ~df.duplicated(['Groups','Names'])
m1 = df.duplicated(['Groups','Names'], keep=False)
df['Nb_duplicated'] = (m & m1).groupby(df['Groups']).transform('sum')
df['Number_unique_names'] = m.groupby(df['Groups']).transform('sum')
print (df)
   Groups Names  Nb_duplicated  Number_unique_names
0      G1     A              3                    4
1      G1     A              3                    4
2      G1     B              3                    4
3      G1     B              3                    4
4      G1     C              3                    4
5      G1     C              3                    4
6      G1     C              3                    4
7      G1     D              3                    4
8      G2     A              0                    3
9      G2     B              0                    3
10     G2     C              0                    3
11     G3     A              1                    1
12     G3     A              1                    1
13     G4     F              1                    2
14     G4     F              1                    2
15     G4     E              1                    2

Performance is better on this sample data; please test on your real data as well:

import numpy as np
import pandas as pd

np.random.seed(2022)

df = pd.DataFrame({'Groups': np.random.randint(1000, size=10000),
                   'Names': np.random.choice(list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'), size=10000)})


In [104]: %%timeit
     ...: m = ~df.duplicated(['Groups','Names'])
     ...: m1 = df.duplicated(['Groups','Names'], keep=False)
     ...: df['Nb_duplicated'] = (m & m1).groupby(df['Groups']).transform('sum')
     ...: df['Number_unique_names'] = m.groupby(df['Groups']).transform('sum')
     ...: 
6.29 ms ± 50.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [105]: %%timeit
     ...: # set up group
     ...: g = df.groupby('Groups')
     ...: # get unique values
     ...: df['unique'] = g['Names'].transform('nunique')
     ...: # get non-duplicates
     ...: non_dup = g['Names'].transform(lambda x: (~x.duplicated(False)).sum())
     ...: # duplicates = unique - non-duplicates
     ...: df['duplicated'] = df['unique'] - non_dup
     ...: 
344 ms ± 8.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
