Create a unique_id based on column deduplication in Pandas

I'm trying to generate a unique_id using some columns as the basis.

The current process works as follows:

  • Specify the columns that will be used as the unique key;
  • Create a bool column called is_duplicated;
  • Iterate over the non-duplicated (unique) rows and assign the integer generated by enumerate to every row equal to each of them.
optimal = ["date", "amount", "description", "tenant_id", "comment", "bank_account_id"]
# Flag every row that repeats an earlier row on the key columns
data_normalization["is_duplicated"] = data_normalization.duplicated(subset=optimal)

# For each first occurrence, write its counter into every row with the same key values
for unique_id, row in enumerate(data_normalization.loc[~data_normalization.is_duplicated].itertuples()):
    data_normalization.loc[
        (data_normalization.date == row.date) &
        (data_normalization.amount == row.amount) &
        (data_normalization.description == row.description) &
        (data_normalization.tenant_id == row.tenant_id) &
        (data_normalization.comment == row.comment) &
        (data_normalization.bank_account_id == row.bank_account_id),
        "unique_id"
    ] = unique_id

The approach above works, but I'm wondering whether there is a better way to do it using pandas features.

Example:

  • Suppose that we have a table like the one below:
| Row1     | row2           | Row3     | unique_id |
| -------- | -------------- | -------- | --------  |
| First    | row            | First    |    1      |
| First    | row            | First    |    1      |
| Second   | 22             |scondd    |    2      |
| Second   | 22             |scondd    |    2      |
| Second   | 22             |scondd    |    2      |
| Third    | 22             |scondd    |    3      |
  • Basically, the unique_id is derived from ["Row1", "row2"];
  • Whenever Row1 and row2 are equal, the id stays the same;
  • When they are not, the id is incremented.

-- --

  • The idea is to create a unique integer id over the target columns.
  • The snippet above works, but I want something clearer and more performant that uses the power of pandas.

You can use duplicated and cumsum to get that done, as long as equal rows are adjacent, as they are in your sample.

Starting with your sample data frame:

     Row1 row2    Row3
0   First  row   First
1   First  row   First
2  Second   22  scondd
3  Second   22  scondd
4  Second   22  scondd
5   Third   22  scondd
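
To reproduce this, the sample frame could be built roughly as follows. This is only a sketch: the name df matches the code used in this answer, the values are copied from the question's table, and storing 22 as a string is an assumption, since row2 mixes text and numbers.

```python
import pandas as pd

# Sample data copied from the question's table (unique_id column left out).
# Assumption: row2 is stored as strings because it mixes "row" and "22".
df = pd.DataFrame({
    "Row1": ["First", "First", "Second", "Second", "Second", "Third"],
    "row2": ["row", "row", "22", "22", "22", "22"],
    "Row3": ["First", "First", "scondd", "scondd", "scondd", "scondd"],
})
print(df)
```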

Execute:

df['unique_id'] = (~df.duplicated(['Row1', 'row2'])).cumsum()
print(df)

Result:

     Row1 row2    Row3  unique_id
0   First  row   First          1
1   First  row   First          1
2  Second   22  scondd          2
3  Second   22  scondd          2
4  Second   22  scondd          2
5   Third   22  scondd          3
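
Why this works, and where it may not: duplicated marks every repeat of an earlier key, negating it marks the first occurrence of each key, and cumsum counts the first occurrences seen so far. That running count is only a valid group id when equal rows sit next to each other, as in the sample. If the data might not be ordered that way, one option is groupby(...).ngroup(). The sketch below compares the two on a hypothetical frame named shuffled, invented for illustration, in which a key reappears after a different key.

```python
import pandas as pd

keys = ["Row1", "row2"]

# Hypothetical data: the ("First", "row") key reappears after a different key.
shuffled = pd.DataFrame({
    "Row1": ["First", "Second", "First", "Third"],
    "row2": ["row", "22", "row", "22"],
})

# Step by step: True for the first occurrence of each key, then a running count.
is_first = ~shuffled.duplicated(keys)
shuffled["cumsum_id"] = is_first.cumsum()

# ngroup numbers each distinct key; sort=False keeps first-appearance order,
# and + 1 makes the ids start at 1 like the cumsum version.
shuffled["ngroup_id"] = shuffled.groupby(keys, sort=False).ngroup() + 1

print(shuffled)
```

Here cumsum_id gives the second ("First", "row") row the same id as the ("Second", "22") row, while ngroup_id gives it the id of the first ("First", "row") row. For the frame in the question, the same call with the optimal list in place of keys would be the natural adaptation. By default groupby excludes rows with missing values in a key column, so those would need separate handling.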
