How to assign a unique ID to detect repeated rows in a pandas dataframe?

Question

I am working with a large pandas dataframe with several columns, much like this:

A      B         C    D   

John   Tom       0    1
Homer  Bart      2    3
Tom    Maggie    1    4 
Lisa   John      5    0
Homer  Bart      2    3
Lisa   John      5    0
Homer  Bart      2    3
Homer  Bart      2    3
Tom    Maggie    1    4

How can I assign a unique id to each group of repeated rows? For example:

A      B         C    D      new_id

John   Tom       0    1      1
Homer  Bart      2    3      2
Tom    Maggie    1    4      3
Lisa   John      5    0      4
Homer  Bart      2    3      2
Lisa   John      5    0      4
Homer  Bart      2    3      2
Homer  Bart      2    3      2
Tom    Maggie    1    4      3

I know that I can use duplicated to detect the duplicated rows, but it does not let me see which rows are repeats of each other. I tried:

df.assign(id=(df.columns).astype('category').cat.codes)
df

However, this is not working. How can I get a unique id for detecting groups of duplicated rows?

Answer 1

Group by the columns in which you are trying to find duplicates and use ngroup:

df['new_id'] = df.groupby(['A','B','C','D']).ngroup()
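
As a minimal sketch, the snippet below rebuilds the sample data from the question and applies the line above. One thing to be aware of: ngroup numbers groups from 0 and, with the default sort=True, follows the sorted order of the group keys rather than the order of first appearance, so the ids differ from the ones in the question while still marking the same groups.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "A": ["John", "Homer", "Tom", "Lisa", "Homer", "Lisa", "Homer", "Homer", "Tom"],
    "B": ["Tom", "Bart", "Maggie", "John", "Bart", "John", "Bart", "Bart", "Maggie"],
    "C": [0, 2, 1, 5, 2, 5, 2, 2, 1],
    "D": [1, 3, 4, 0, 3, 0, 3, 3, 4],
})

# Identical rows fall into the same group and therefore share one id.
# Ids start at 0 and follow the sorted order of the keys (Homer, John, Lisa, Tom).
df["new_id"] = df.groupby(["A", "B", "C", "D"]).ngroup()

ids = df["new_id"].tolist()
print(df)
```

If the ids should start at 1 and follow the order in which rows first appear, pass sort=False and add 1, as the second answer does.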

Answer 2

For small dataframes, you can convert your rows to tuples, which can be hashed, and then use pd.factorize.

df['new_id'] = pd.factorize(df.apply(tuple, axis=1))[0] + 1
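
A small self-contained sketch of the tuple approach, using the sample data from the question. pd.factorize returns a pair of (codes, uniques); only the codes are used here, and they are assigned in order of first appearance starting at 0, hence the + 1.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "A": ["John", "Homer", "Tom", "Lisa", "Homer", "Lisa", "Homer", "Homer", "Tom"],
    "B": ["Tom", "Bart", "Maggie", "John", "Bart", "John", "Bart", "Bart", "Maggie"],
    "C": [0, 2, 1, 5, 2, 5, 2, 2, 1],
    "D": [1, 3, 4, 0, 3, 0, 3, 3, 4],
})

# Turn every row into a hashable tuple, e.g. ("John", "Tom", 0, 1).
row_tuples = df.apply(tuple, axis=1)

# factorize maps each distinct tuple to an integer code in order of appearance.
codes, _uniques = pd.factorize(row_tuples)
df["new_id"] = codes + 1

ids = df["new_id"].tolist()
print(df)
```

The row-wise apply runs a Python-level call per row, which is the reason this variant is suggested for small dataframes only.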

groupby is more efficient for larger dataframes:

df['new_id'] = df.groupby(df.columns.tolist(), sort=False).ngroup() + 1
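
To see where the repeats are, the id can then be used to count and filter the groups. The sketch below is one possible way to do that, not part of the original answer; group_size and repeated are names chosen for this example, and the column list is written out explicitly so the grouping does not depend on which columns have already been added.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "A": ["John", "Homer", "Tom", "Lisa", "Homer", "Lisa", "Homer", "Homer", "Tom"],
    "B": ["Tom", "Bart", "Maggie", "John", "Bart", "John", "Bart", "Bart", "Maggie"],
    "C": [0, 2, 1, 5, 2, 5, 2, 2, 1],
    "D": [1, 3, 4, 0, 3, 0, 3, 3, 4],
})

# sort=False numbers the groups in order of first appearance; + 1 starts ids at 1.
df["new_id"] = df.groupby(["A", "B", "C", "D"], sort=False).ngroup() + 1

# group_size (hypothetical column name) holds how many rows share each id.
df["group_size"] = df.groupby("new_id")["new_id"].transform("size")

# Keep only rows that occur more than once and place the repeats next to each other.
repeated = df[df["group_size"] > 1].sort_values("new_id", kind="stable")

ids = df["new_id"].tolist()
sizes = df["group_size"].tolist()
print(repeated)
```

With the sample data, only the John/Tom row is unique, so it is the only row missing from repeated.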

Answer 1
5 2018-06-29 22:39:11

Answer 2
3 Accepted 2018-06-29 22:40:16
