Add column with number of duplicated values within groups and number of unique values within groups in pandas
I have a dataframe such as
Groups Names
G1 A
G1 A
G1 B
G1 B
G1 C
G1 C
G1 C
G1 D
G2 A
G2 B
G2 C
G3 A
G3 A
G4 F
G4 F
G4 E
For each Groups value I would like to count the number of duplicated Names values (those appearing at least twice) and add this information in a new column called Nb_duplicated. I would also like to add another column called Number_unique_names, holding the number of unique Names values within each Groups.
I should then get:
Groups Names Nb_duplicated Number_unique_names
G1 A 3 4
G1 A 3 4
G1 B 3 4
G1 B 3 4
G1 C 3 4
G1 C 3 4
G1 C 3 4
G1 D 3 4
G2 A 0 3
G2 B 0 3
G2 C 0 3
G3 A 1 1
G3 A 1 1
G4 F 1 2
G4 F 1 2
G4 E 1 2
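For reference, the example frame above can be reconstructed like this (a sketch built from the values shown in the question):

```python
import pandas as pd

# Rebuild the sample data from the question
df = pd.DataFrame({
    'Groups': ['G1'] * 8 + ['G2'] * 3 + ['G3'] * 2 + ['G4'] * 3,
    'Names': list('AABBCCCD') + list('ABC') + list('AA') + list('FFE'),
})
```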
You can compute the number of unique names and the number of non-duplicated names (with GroupBy.transform), then subtract the two to get the number of duplicated names:
# set up group
g = df.groupby('Groups')
# get unique values
df['unique'] = g['Names'].transform('nunique')
# get non-duplicates
non_dup = g['Names'].transform(lambda x: (~x.duplicated(False)).sum())
# duplicates = unique - non-duplicates
df['duplicated'] = df['unique'] - non_dup
NB. I used an intermediate variable non_dup here for clarity, but you can use a one-liner.
Output (with the intermediate non_dup shown as a column for clarity):
Groups Names unique duplicated non_dup
0 G1 A 4 3 1
1 G1 A 4 3 1
2 G1 B 4 3 1
3 G1 B 4 3 1
4 G1 C 4 3 1
5 G1 C 4 3 1
6 G1 C 4 3 1
7 G1 D 4 3 1
8 G2 A 3 0 3
9 G2 B 3 0 3
10 G2 C 3 0 3
11 G3 A 1 1 0
12 G3 A 1 1 0
13 G4 F 2 1 1
14 G4 F 2 1 1
15 G4 E 2 1 1
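The one-liner mentioned in the NB could look like this (a sketch; the grouper and column names follow the snippet above):

```python
import pandas as pd

# Sample data as in the question
df = pd.DataFrame({
    'Groups': ['G1'] * 8 + ['G2'] * 3 + ['G3'] * 2 + ['G4'] * 3,
    'Names': list('AABBCCCD') + list('ABC') + list('AA') + list('FFE'),
})

g = df.groupby('Groups')['Names']
df['unique'] = g.transform('nunique')
# duplicated = unique - non-duplicated, folded into one expression
df['duplicated'] = df['unique'] - g.transform(lambda x: (~x.duplicated(False)).sum())
```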
Create a mask of first occurrences by inverting DataFrame.duplicated, and a mask of all duplicated rows with keep=False. Combining the two masks marks each name that appears at least twice exactly once, so summing the True values per group with GroupBy.transform gives the counts:
m = ~df.duplicated(['Groups','Names'])
m1 = df.duplicated(['Groups','Names'], keep=False)
df['Nb_duplicated'] = (m & m1).groupby(df['Groups']).transform('sum')
df['Number_unique_names'] = m.groupby(df['Groups']).transform('sum')
print (df)
Groups Names Nb_duplicated Number_unique_names
0 G1 A 3 4
1 G1 A 3 4
2 G1 B 3 4
3 G1 B 3 4
4 G1 C 3 4
5 G1 C 3 4
6 G1 C 3 4
7 G1 D 3 4
8 G2 A 0 3
9 G2 B 0 3
10 G2 C 0 3
11 G3 A 1 1
12 G3 A 1 1
13 G4 F 1 2
14 G4 F 1 2
15 G4 E 1 2
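To see how the two masks combine, here is a small illustration on a reduced frame (a sketch, not part of the original answer):

```python
import pandas as pd

# Small frame to illustrate the two masks
df = pd.DataFrame({'Groups': ['G1', 'G1', 'G1', 'G2', 'G2'],
                   'Names':  ['A',  'A',  'B',  'A',  'B']})

m = ~df.duplicated(['Groups', 'Names'])              # True on the first occurrence of each pair
m1 = df.duplicated(['Groups', 'Names'], keep=False)  # True on every row of a duplicated pair
# m & m1 is True exactly once per (Groups, Names) pair occurring 2+ times
print((m & m1).tolist())  # → [True, False, False, False, False]
```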
Performance is better on this sample data, but please test on your real data too:
np.random.seed(2022)
df = pd.DataFrame({'Groups':np.random.randint(1000, size=10000),
'Names':np.random.choice(list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'), size=10000)})
In [104]: %%timeit
...: m = ~df.duplicated(['Groups','Names'])
...: m1 = df.duplicated(['Groups','Names'], keep=False)
...: df['Nb_duplicated'] = (m & m1).groupby(df['Groups']).transform('sum')
...: df['Number_unique_names'] = m.groupby(df['Groups']).transform('sum')
...:
6.29 ms ± 50.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [105]: %%timeit
...: # set up group
...: g = df.groupby('Groups')
...: # get unique values
...: df['unique'] = g['Names'].transform('nunique')
...: # get non-duplicates
...: non_dup = g['Names'].transform(lambda x: (~x.duplicated(False)).sum())
...: # duplicates = unique - non-duplicates
...: df['duplicated'] = df['unique'] - non_dup
...:
344 ms ± 8.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
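Most of the gap comes from the lambda in the first approach, which calls back into Python once per group. As a sketch, the non-duplicated count can also be computed with the same duplicated-mask trick, keeping the first approach's column names but avoiding the per-group lambda (an assumption-labeled variant, not from either original answer):

```python
import pandas as pd

# Sample data as in the question
df = pd.DataFrame({
    'Groups': ['G1'] * 8 + ['G2'] * 3 + ['G3'] * 2 + ['G4'] * 3,
    'Names': list('AABBCCCD') + list('ABC') + list('AA') + list('FFE'),
})

df['unique'] = df.groupby('Groups')['Names'].transform('nunique')
# names occurring exactly once in their group: invert the keep=False duplicated mask
non_dup = (~df.duplicated(['Groups', 'Names'], keep=False)).groupby(df['Groups']).transform('sum')
df['duplicated'] = df['unique'] - non_dup
```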