
Create a group column based on duplicated elements within 3 columns in pandas

I have a dataframe such as:

COL1 COL2 COL3
G1   SP1  A
G1   SP1  A
G1   SP2  B
G2   SP1  C
G2   SP2  C
G3   SP1  D
G3   SP1  D
G3   SP1  D

And I would simply like to add a new Groups column that assigns a group to each set of duplicated COL1, COL2 and COL3 values, plus a Nb_dup column with the number of duplicates in each group, such as:

COL1 COL2 COL3 Groups Nb_dup
G1   SP1  A    Group1      2
G1   SP1  A    Group1      2
G1   SP2  B    Group2      1
G2   SP1  C    Group3      1
G2   SP2  C    Group4      1
G3   SP1  D    Group5      3
G3   SP1  D    Group5      3
G3   SP1  D    Group5      3

So far I tried:

# collect the unique (COL1, COL2, COL3) combinations, number them, then merge back
key_set = set(map(tuple, df[['COL1', 'COL2', 'COL3']].itertuples(index=False)))
df_a = pd.DataFrame(list(key_set), columns=['COL1', 'COL2', 'COL3'])
df_a['Groups'] = df_a.index
result = pd.merge(df, df_a, on=['COL1', 'COL2', 'COL3'], how='left')

Here is the df in dict format if it helps:

{'COL1': {0: 'G1', 1: 'G1', 2: 'G1', 3: 'G2', 4: 'G2', 5: 'G3', 6: 'G3', 7: 'G3'}, 'COL2': {0: 'SP1', 1: 'SP1', 2: 'SP2', 3: 'SP1', 4: 'SP2', 5: 'SP1', 6: 'SP1', 7: 'SP1'}, 'COL3': {0: 'A', 1: 'A', 2: 'B', 3: 'C', 4: 'C', 5: 'D', 6: 'D', 7: 'D'}}
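For anyone reproducing the example, the frame can be rebuilt directly from that dict with pandas (a minimal sketch; the variable name data is just a placeholder):

import pandas as pd

data = {'COL1': {0: 'G1', 1: 'G1', 2: 'G1', 3: 'G2', 4: 'G2', 5: 'G3', 6: 'G3', 7: 'G3'},
        'COL2': {0: 'SP1', 1: 'SP1', 2: 'SP2', 3: 'SP1', 4: 'SP2', 5: 'SP1', 6: 'SP1', 7: 'SP1'},
        'COL3': {0: 'A', 1: 'A', 2: 'B', 3: 'C', 4: 'C', 5: 'D', 6: 'D', 7: 'D'}}
df = pd.DataFrame(data)  # outer keys become columns, inner keys the 0-7 index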

Let's try:

cols = ['COL1', 'COL2', 'COL3']

# ngroup() assigns a sequential id to each unique (COL1, COL2, COL3) combination
df['Groups'] = 'Group' + df.groupby(cols).ngroup().add(1).astype(str)
# every row then gets the size of its group as Nb_dup
df['Nb_dup'] = df.groupby('Groups')['Groups'].transform('count')
print(df)

  COL1 COL2 COL3  Groups  Nb_dup
0   G1  SP1    A  Group1       2
1   G1  SP1    A  Group1       2
2   G1  SP2    B  Group2       1
3   G2  SP1    C  Group3       1
4   G2  SP2    C  Group4       1
5   G3  SP1    D  Group5       3
6   G3  SP1    D  Group5       3
7   G3  SP1    D  Group5       3
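A note on ordering: ngroup() numbers groups in the order they are iterated by the groupby, so with the default sort=True the ids follow the sorted key order (which happens to coincide with appearance order in this sample). If you want the numbering to follow first appearance and to avoid the second groupby, a variant could be (a sketch, assuming df is the frame built above):

cols = ['COL1', 'COL2', 'COL3']
gb = df.groupby(cols, sort=False)            # keep groups in order of first appearance
df['Groups'] = 'Group' + (gb.ngroup() + 1).astype(str)
df['Nb_dup'] = gb['COL1'].transform('size')  # size of each (COL1, COL2, COL3) group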
