Create a group column based on duplicated elements within 3 columns in pandas
I have a dataframe such as:
COL1 COL2 COL3
G1 SP1 A
G1 SP1 A
G1 SP2 B
G2 SP1 C
G2 SP2 C
G3 SP1 D
G3 SP1 D
G3 SP1 D
I would simply like to add a new Groups column that labels each group of duplicated COL1, COL2 and COL3 values, and a Nb_dup column with the number of duplicates in each group, such as:
COL1 COL2 COL3 Groups Nb_dup
G1 SP1 A Group1 2
G1 SP1 A Group1 2
G1 SP2 B Group2 1
G2 SP1 C Group3 1
G2 SP2 C Group4 1
G3 SP1 D Group5 3
G3 SP1 D Group5 3
G3 SP1 D Group5 3
So far I tried:
key_set = set(df[['COL1','COL2','COL3']])
df_a = pd.DataFrame(list(key_set))
df_a['Groups'] = df_a.index
result = pd.merge(tab,df_a,left_on=['COL1','COL2','COL3'],right_on=0,how='left')
Here is the df in dict format, if it helps:
{'COL1': {0: 'G1', 1: 'G1', 2: 'G1', 3: 'G2', 4: 'G2', 5: 'G3', 6: 'G3', 7: 'G3'}, 'COL2': {0: 'SP1', 1: 'SP1', 2: 'SP2', 3: 'SP1', 4: 'SP2', 5: 'SP1', 6: 'SP1', 7: 'SP1'}, 'COL3': {0: 'A', 1: 'A', 2: 'B', 3: 'C', 4: 'C', 5: 'D', 6: 'D', 7: 'D'}}
Let's try:
cols = ['COL1', 'COL2', 'COL3']
df['Groups'] = 'Group' + df.groupby(cols).ngroup().add(1).astype(str)
df['Nb_dup'] = df.groupby('Groups')['Groups'].transform('count')
print(df)
COL1 COL2 COL3 Groups Nb_dup
0 G1 SP1 A Group1 2
1 G1 SP1 A Group1 2
2 G1 SP2 B Group2 1
3 G2 SP1 C Group3 1
4 G2 SP2 C Group4 1
5 G3 SP1 D Group5 3
6 G3 SP1 D Group5 3
7 G3 SP1 D Group5 3
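For anyone who wants to reproduce this end to end, here is a self-contained sketch of the same idea. It makes two small changes that are my own assumptions rather than part of the answer above: it passes sort=False to groupby, so groups are numbered in order of first appearance rather than in sorted key order (the two coincide for this data because it is already sorted), and it computes Nb_dup with transform('size') on the same grouping instead of grouping a second time on the new Groups column.

```python
import pandas as pd

# Same data as the dict in the question, written as plain lists.
df = pd.DataFrame({
    'COL1': ['G1', 'G1', 'G1', 'G2', 'G2', 'G3', 'G3', 'G3'],
    'COL2': ['SP1', 'SP1', 'SP2', 'SP1', 'SP2', 'SP1', 'SP1', 'SP1'],
    'COL3': ['A', 'A', 'B', 'C', 'C', 'D', 'D', 'D'],
})

cols = ['COL1', 'COL2', 'COL3']

# sort=False (an assumption, see above) numbers groups by first appearance.
grouped = df.groupby(cols, sort=False)

# ngroup() gives a 0-based integer id per row; size is the row count per group.
group_ids = grouped.ngroup()
group_sizes = grouped['COL1'].transform('size')

df['Groups'] = 'Group' + group_ids.add(1).astype(str)
df['Nb_dup'] = group_sizes

print(df)
```

If the rows are not already ordered by the key columns, dropping sort=False will give the sorted numbering used in the answer above instead.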