How to assign a unique ID to detect repeated rows in a pandas dataframe?

Question

I am working with a large pandas dataframe, with several columns pretty much like this:

A      B         C    D   

John   Tom       0    1
Homer  Bart      2    3
Tom    Maggie    1    4 
Lisa   John      5    0
Homer  Bart      2    3
Lisa   John      5    0
Homer  Bart      2    3
Homer  Bart      2    3
Tom    Maggie    1    4

How can I assign a unique id to each group of repeated rows? For example:

A      B         C    D      new_id

John   Tom       0    1.2      1
Homer  Bart      2    3.0      2
Tom    Maggie    1    4.2      3
Lisa   John      5    0        4
Homer  Bart      2    3        5
Lisa   John      5    0        4
Homer  Bart      2    3.0      2
Homer  Bart      2    3.0      2
Tom    Maggie    1    4.1      6

I know that I can use duplicated to detect the duplicated rows; however, I cannot see where those rows are repeated. I tried:

df.assign(id=(df.columns).astype('category').cat.codes)
df

However, it is not working. How can I get a unique id for detecting groups of duplicated rows?

Answer 1

Group by the columns in which you are looking for duplicates and use ngroup:

df['new_id'] = df.groupby(['A','B','C','D']).ngroup()
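As a rough sketch of what this produces, here is the same call on the sample frame from the question (the values are retyped by hand from the first table, so treat the construction of df as an assumption). Note that with the default sort=True the ids are zero-based and follow the sorted order of the group keys, not the order in which rows first appear:

```python
import pandas as pd

# Sample data retyped from the question's first table (assumed to match it).
df = pd.DataFrame({
    "A": ["John", "Homer", "Tom", "Lisa", "Homer", "Lisa", "Homer", "Homer", "Tom"],
    "B": ["Tom", "Bart", "Maggie", "John", "Bart", "John", "Bart", "Bart", "Maggie"],
    "C": [0, 2, 1, 5, 2, 5, 2, 2, 1],
    "D": [1, 3, 4, 0, 3, 0, 3, 3, 4],
})

# Rows with identical values in A, B, C and D fall into the same group,
# and ngroup() labels every row with the number of its group.
df["new_id"] = df.groupby(["A", "B", "C", "D"]).ngroup()

print(df)
```

Here all four "Homer Bart 2 3" rows share id 0 because that key sorts first, even though the first row of the frame is "John Tom 0 1".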

Answer 2

For small dataframes, you can convert your rows to tuples, which can be hashed, and then use pd.factorize.

df['new_id'] = pd.factorize(df.apply(tuple, axis=1))[0] + 1
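A minimal sketch of the tuple approach on the question's sample data (again retyped by hand, so the frame itself is an assumption). pd.factorize returns a pair (codes, uniques); the codes are zero-based and numbered in order of first appearance, which is why 1 is added:

```python
import pandas as pd

# Sample data retyped from the question's first table (assumed to match it).
df = pd.DataFrame({
    "A": ["John", "Homer", "Tom", "Lisa", "Homer", "Lisa", "Homer", "Homer", "Tom"],
    "B": ["Tom", "Bart", "Maggie", "John", "Bart", "John", "Bart", "Bart", "Maggie"],
    "C": [0, 2, 1, 5, 2, 5, 2, 2, 1],
    "D": [1, 3, 4, 0, 3, 0, 3, 3, 4],
})

# Turn each row into a hashable tuple, e.g. ('John', 'Tom', 0, 1).
row_keys = df.apply(tuple, axis=1)

# codes[i] is the index of row i's tuple among the distinct tuples seen so far.
codes, uniques = pd.factorize(row_keys)
df["new_id"] = codes + 1

print(df)
```

Because df.apply with axis=1 runs a Python-level call per row, this is convenient but slow on large frames.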

groupby is more efficient for larger dataframes:

df['new_id'] = df.groupby(df.columns.tolist(), sort=False).ngroup() + 1
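To see which rows repeat and how often, something along these lines should work on the question's sample data (the frame is retyped by hand and the helper names key_cols, group_sizes and repeated are only illustrative). With sort=False the ids follow the order in which each distinct row is first seen:

```python
import pandas as pd

# Sample data retyped from the question's first table (assumed to match it).
df = pd.DataFrame({
    "A": ["John", "Homer", "Tom", "Lisa", "Homer", "Lisa", "Homer", "Homer", "Tom"],
    "B": ["Tom", "Bart", "Maggie", "John", "Bart", "John", "Bart", "Bart", "Maggie"],
    "C": [0, 2, 1, 5, 2, 5, 2, 2, 1],
    "D": [1, 3, 4, 0, 3, 0, 3, 3, 4],
})

# Capture the key columns before new_id is added to the frame.
key_cols = df.columns.tolist()

# One-based ids, numbered in order of first appearance.
df["new_id"] = df.groupby(key_cols, sort=False).ngroup() + 1

# Number of rows sharing each id.
group_sizes = df.groupby("new_id").size()

# Only the rows that occur more than once, listed group by group.
repeated = df[df.duplicated(subset=key_cols, keep=False)].sort_values("new_id")

print(group_sizes)
print(repeated)
```

On this data the "John Tom 0 1" row is the only one that never repeats, so it is the only row missing from repeated.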

solution1
5 2018-06-29 22:39:11

solution2
3 ACCEPTED 2018-06-29 22:40:16
