Add column with number of duplicated values within groups and number of unique values within groups in pandas
I have a dataframe like:
Groups Names
G1 A
G1 A
G1 B
G1 B
G1 C
G1 C
G1 C
G1 D
G2 A
G2 B
G2 C
G3 A
G3 A
G4 F
G4 F
G4 E
For each group in Groups, I want to count the number of values in the Names column that are duplicated (i.e. appear at least twice), and add this information in a new column called Nb_duplicated. I would also like to add another column called Number_unique_names, which would be the number of unique Names values within each group.
I should then obtain:
Groups Names Nb_duplicated Number_unique_names
G1 A 3 4
G1 A 3 4
G1 B 3 4
G1 B 3 4
G1 C 3 4
G1 C 3 4
G1 C 3 4
G1 D 3 4
G2 A 0 3
G2 B 0 3
G2 C 0 3
G3 A 1 1
G3 A 1 1
G4 F 1 2
G4 F 1 2
G4 E 1 2
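For reference, the sample frame can be reconstructed like this (a minimal sketch; the values are copied from the table above):

```python
import pandas as pd

# The example frame, copied from the table in the question
df = pd.DataFrame({
    'Groups': ['G1'] * 8 + ['G2'] * 3 + ['G3'] * 2 + ['G4'] * 3,
    'Names':  list('AABBCCCD') + list('ABC') + list('AA') + list('FFE'),
})
print(df)
```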
You can compute the number of unique names and the number of non-duplicated names (with GroupBy.transform), then subtract the two to get the number of duplicates:
# set up group
g = df.groupby('Groups')
# get unique values
df['unique'] = g['Names'].transform('nunique')
# get non-duplicates
non_dup = g['Names'].transform(lambda x: (~x.duplicated(False)).sum())
# duplicates = unique - non-duplicates
df['duplicated'] = df['unique'] - non_dup
NB. For clarity, I used an intermediate variable non_dup here, but you could write it as a one-liner.
Output (with the intermediate non_dup kept for clarity):
Groups Names unique duplicated non_dup
0 G1 A 4 3 1
1 G1 A 4 3 1
2 G1 B 4 3 1
3 G1 B 4 3 1
4 G1 C 4 3 1
5 G1 C 4 3 1
6 G1 C 4 3 1
7 G1 D 4 3 1
8 G2 A 3 0 3
9 G2 B 3 0 3
10 G2 C 3 0 3
11 G3 A 1 1 0
12 G3 A 1 1 0
13 G4 F 2 1 1
14 G4 F 2 1 1
15 G4 E 2 1 1
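The one-liner alluded to above could look like this (a sketch assuming the same sample frame; it combines the two transforms into a single expression):

```python
import pandas as pd

# Rebuild the sample frame from the question
df = pd.DataFrame({
    'Groups': ['G1'] * 8 + ['G2'] * 3 + ['G3'] * 2 + ['G4'] * 3,
    'Names':  list('AABBCCCD') + list('ABC') + list('AA') + list('FFE'),
})

g = df.groupby('Groups')['Names']
# duplicated names = unique names - non-duplicated names, in one expression
df['duplicated'] = g.transform('nunique') - g.transform(
    lambda x: (~x.duplicated(False)).sum()
)
print(df)
```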
Get the duplicated values by chaining DataFrame.duplicated with an inverted mask (to keep only the first occurrence of each pair) and a keep=False mask (to remove non-duplicated rows), then count the True values with sum in GroupBy.transform:
m = ~df.duplicated(['Groups','Names'])
m1 = df.duplicated(['Groups','Names'], keep=False)
df['Nb_duplicated'] = (m & m1).groupby(df['Groups']).transform('sum')
df['Number_unique_names'] = m.groupby(df['Groups']).transform('sum')
print(df)
Groups Names Nb_duplicated Number_unique_names
0 G1 A 3 4
1 G1 A 3 4
2 G1 B 3 4
3 G1 B 3 4
4 G1 C 3 4
5 G1 C 3 4
6 G1 C 3 4
7 G1 D 3 4
8 G2 A 0 3
9 G2 B 0 3
10 G2 C 0 3
11 G3 A 1 1
12 G3 A 1 1
13 G4 F 1 2
14 G4 F 1 2
15 G4 E 1 2
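To make the two masks concrete, this sketch (assuming the sample frame from the question) shows what each per-group sum counts:

```python
import pandas as pd

# Rebuild the sample frame from the question
df = pd.DataFrame({
    'Groups': ['G1'] * 8 + ['G2'] * 3 + ['G3'] * 2 + ['G4'] * 3,
    'Names':  list('AABBCCCD') + list('ABC') + list('AA') + list('FFE'),
})

# m: True on the first occurrence of each (Groups, Names) pair
m = ~df.duplicated(['Groups', 'Names'])
# m1: True on every row whose (Groups, Names) pair occurs more than once
m1 = df.duplicated(['Groups', 'Names'], keep=False)

# Summing m per group counts the distinct names; summing m & m1 counts
# the distinct names that are duplicated within the group
print(m.groupby(df['Groups']).sum())
print((m & m1).groupby(df['Groups']).sum())
```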
Performance is better with this approach on sample data, but please test on your real data:
import numpy as np
import pandas as pd

np.random.seed(2022)
df = pd.DataFrame({'Groups': np.random.randint(1000, size=10000),
                   'Names': np.random.choice(list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'), size=10000)})
In [104]: %%timeit
...: m = ~df.duplicated(['Groups','Names'])
...: m1 = df.duplicated(['Groups','Names'], keep=False)
...: df['Nb_duplicated'] = (m & m1).groupby(df['Groups']).transform('sum')
...: df['Number_unique_names'] = m.groupby(df['Groups']).transform('sum')
...:
6.29 ms ± 50.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [105]: %%timeit
...: # set up group
...: g = df.groupby('Groups')
...: # get unique values
...: df['unique'] = g['Names'].transform('nunique')
...: # get non-duplicates
...: non_dup = g['Names'].transform(lambda x: (~x.duplicated(False)).sum())
...: # duplicates = unique - non-duplicates
...: df['duplicated'] = df['unique'] - non_dup
...:
344 ms ± 8.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)