How to assign a unique ID to detect repeated rows in a pandas dataframe?
I am working with a large pandas dataframe with several columns, much like this:
A B C D
John Tom 0 1
Homer Bart 2 3
Tom Maggie 1 4
Lisa John 5 0
Homer Bart 2 3
Lisa John 5 0
Homer Bart 2 3
Homer Bart 2 3
Tom Maggie 1 4
How can I assign a unique id to each group of repeated rows? For example:
A B C D new_id
John Tom 0 1 1
Homer Bart 2 3 2
Tom Maggie 1 4 3
Lisa John 5 0 4
Homer Bart 2 3 2
Lisa John 5 0 4
Homer Bart 2 3 2
Homer Bart 2 3 2
Tom Maggie 1 4 3
I know that I can use df.duplicated() to detect the duplicated rows, but I cannot see where those rows repeat or which rows belong together. I tried:
df.assign(id=(df.columns).astype('category').cat.codes)
df
However, it is not working. How can I get a unique id for detecting groups of duplicated rows?
Group by the columns you want to check for duplicates and use ngroup:
df['new_id'] = df.groupby(['A','B','C','D']).ngroup()
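As a rough illustration on the question's sample data (retyped by hand here, so treat the frame as an assumption about the real data): ngroup numbers the groups starting at 0 and, with groupby's default sort=True, in sorted key order rather than in order of first appearance.

```python
import pandas as pd

# Sample data retyped from the question (assumed to match the real frame).
df = pd.DataFrame({
    "A": ["John", "Homer", "Tom", "Lisa", "Homer", "Lisa", "Homer", "Homer", "Tom"],
    "B": ["Tom", "Bart", "Maggie", "John", "Bart", "John", "Bart", "Bart", "Maggie"],
    "C": [0, 2, 1, 5, 2, 5, 2, 2, 1],
    "D": [1, 3, 4, 0, 3, 0, 3, 3, 4],
})

# Rows that agree on all four columns fall into the same group and
# therefore receive the same integer label.
df["new_id"] = df.groupby(["A", "B", "C", "D"]).ngroup()

print(df)
```

With these labels, the four Homer/Bart rows share one id and John/Tom gets an id of its own, but the numbering follows the alphabetical order of the keys, so it will not match the 1, 2, 3, 4 layout in the question without further adjustment.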
For small dataframes, you can convert your rows to tuples, which can be hashed, and then use pd.factorize:
df['new_id'] = pd.factorize(df.apply(tuple, axis=1))[0] + 1
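A minimal sketch of the tuple approach, again on a hand-typed copy of the question's sample data (an assumption). pd.factorize returns a (codes, uniques) pair; the codes start at 0 and follow the order in which each distinct row first appears, which is why the one-liner takes element [0] and adds 1.

```python
import pandas as pd

# Sample data retyped from the question (assumed to match the real frame).
df = pd.DataFrame({
    "A": ["John", "Homer", "Tom", "Lisa", "Homer", "Lisa", "Homer", "Homer", "Tom"],
    "B": ["Tom", "Bart", "Maggie", "John", "Bart", "John", "Bart", "Bart", "Maggie"],
    "C": [0, 2, 1, 5, 2, 5, 2, 2, 1],
    "D": [1, 3, 4, 0, 3, 0, 3, 3, 4],
})

# Turn every row into a hashable tuple, e.g. ('John', 'Tom', 0, 1).
row_keys = df.apply(tuple, axis=1)

# codes[i] is the position of row i's tuple among the distinct tuples,
# numbered in order of first appearance.
codes, uniques = pd.factorize(row_keys)
df["new_id"] = codes + 1

print(df)
```

Because apply with axis=1 calls a Python function once per row, this tends to get slow as the frame grows.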
groupby is more efficient for larger dataframes:
df['new_id'] = df.groupby(df.columns.tolist(), sort=False).ngroup() + 1
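To sketch how the resulting id can be used to see where the repeats are: passing sort=False makes ngroup label groups in order of first appearance, and a grouped size then tells how often each row pattern occurs. The sample frame is retyped from the question, and the n_repeats column and the repeated variable are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Sample data retyped from the question (assumed to match the real frame).
df = pd.DataFrame({
    "A": ["John", "Homer", "Tom", "Lisa", "Homer", "Lisa", "Homer", "Homer", "Tom"],
    "B": ["Tom", "Bart", "Maggie", "John", "Bart", "John", "Bart", "Bart", "Maggie"],
    "C": [0, 2, 1, 5, 2, 5, 2, 2, 1],
    "D": [1, 3, 4, 0, 3, 0, 3, 3, 4],
})

# Capture the key columns first, so that re-running the assignment
# does not accidentally group by new_id as well.
cols = df.columns.tolist()

# sort=False keeps groups in order of first appearance; + 1 starts ids at 1.
df["new_id"] = df.groupby(cols, sort=False).ngroup() + 1

# Hypothetical helper column: how many rows share each id.
df["n_repeats"] = df.groupby("new_id")["new_id"].transform("size")

# Keep only rows whose pattern occurs more than once, grouped together.
repeated = df[df["n_repeats"] > 1].sort_values("new_id")

print(repeated)
```

Sorting by new_id places all copies of a row next to each other, which is one way to get the overview that duplicated() alone does not give.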