I have a dataframe such as:
COL1 COL2 COL3
G1 SP1 A
G1 SP1 A
G1 SP2 B
G2 SP1 C
G2 SP2 C
G3 SP1 D
G3 SP1 D
G3 SP1 D
I would like to add a new Groups column that labels each distinct combination of COL1, COL2 and COL3 values, and a Nb_dup column with the number of rows sharing that combination, such as:
COL1 COL2 COL3 Groups Nb_dup
G1 SP1 A Group1 2
G1 SP1 A Group1 2
G1 SP2 B Group2 1
G2 SP1 C Group3 1
G2 SP2 C Group4 1
G3 SP1 D Group5 3
G3 SP1 D Group5 3
G3 SP1 D Group5 3
So far I tried:
key_set = set(df[['COL1','COL2','COL3']])
df_a = pd.DataFrame(list(key_set))
df_a['Groups'] = df_a.index
result = pd.merge(tab,df_a,left_on=['COL1','COL2','COL3'],right_on=0,how='left')
Here is the df in dict format, in case it helps:
{'COL1': {0: 'G1', 1: 'G1', 2: 'G1', 3: 'G2', 4: 'G2', 5: 'G3', 6: 'G3', 7: 'G3'}, 'COL2': {0: 'SP1', 1: 'SP1', 2: 'SP2', 3: 'SP1', 4: 'SP2', 5: 'SP1', 6: 'SP1', 7: 'SP1'}, 'COL3': {0: 'A', 1: 'A', 2: 'B', 3: 'C', 4: 'C', 5: 'D', 6: 'D', 7: 'D'}}
You can number the groups with groupby(...).ngroup() and count the rows per group with transform:
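cols = ['COL1', 'COL2', 'COL3']
# ngroup() numbers each distinct (COL1, COL2, COL3) combination from 0; add 1 so labels start at Group1
df['Groups'] = 'Group' + df.groupby(cols).ngroup().add(1).astype(str)
# count the rows sharing each group label and broadcast that count back to every row
df['Nb_dup'] = df.groupby('Groups')['Groups'].transform('count')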
print(df)
COL1 COL2 COL3 Groups Nb_dup
0 G1 SP1 A Group1 2
1 G1 SP1 A Group1 2
2 G1 SP2 B Group2 1
3 G2 SP1 C Group3 1
4 G2 SP2 C Group4 1
5 G3 SP1 D Group5 3
6 G3 SP1 D Group5 3
7 G3 SP1 D Group5 3
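For reference, here is a self-contained sketch of the same idea, rebuilding the frame from the data in the question. One thing to be aware of: by default groupby sorts the group keys, so ngroup() numbers groups in sorted key order rather than in order of first appearance. With this data the two orders coincide, but if you want the numbering to follow first appearance, passing sort=False should do it. Using transform('size') instead of 'count' is my own substitution here; it counts rows regardless of missing values.

```python
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    "COL1": ["G1", "G1", "G1", "G2", "G2", "G3", "G3", "G3"],
    "COL2": ["SP1", "SP1", "SP2", "SP1", "SP2", "SP1", "SP1", "SP1"],
    "COL3": ["A", "A", "B", "C", "C", "D", "D", "D"],
})
cols = ["COL1", "COL2", "COL3"]

# sort=False numbers the groups in order of first appearance
# instead of sorted key order (the default).
grouped = df.groupby(cols, sort=False)

# ngroup() is 0-based, so add 1 to start at Group1.
df["Groups"] = "Group" + grouped.ngroup().add(1).astype(str)

# size counts every row of the group, including rows with missing values.
df["Nb_dup"] = grouped[cols[0]].transform("size")

print(df)
```

Reusing the single grouped object avoids grouping twice, and it does not rely on the Groups column already existing when the counts are computed.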