Creating a unique_id in Pandas based on column deduplication

Question

I'm trying to generate a unique_id based on a subset of columns.

The current process works as follows:

Specify the columns that will be used as the unique key;
Create a bool column called is_duplicated;
Iterate over the non-duplicated (unique) rows and assign the integer generated by enumerate to every row equal to each of them.

optimal = ["date", "amount", "description", "tenant_id", "comment", "bank_account_id"]
# Flag every row that repeats an earlier row on the key columns.
data_normalization["is_duplicated"] = data_normalization.duplicated(subset=optimal)

# For each first occurrence, write its running index to all rows with the same key values.
for unique_id, row in enumerate(data_normalization.loc[~data_normalization.is_duplicated].itertuples()):
    data_normalization.loc[
        (data_normalization.date == row.date) &
        (data_normalization.amount == row.amount) &
        (data_normalization.description == row.description) &
        (data_normalization.tenant_id == row.tenant_id) &
        (data_normalization.comment == row.comment) &
        (data_normalization.bank_account_id == row.bank_account_id),
        "unique_id"
    ] = unique_id

The approach above works, but I'm wondering whether there is a better way to do it using pandas features.

Example:

Suppose that we have a table like the one below:

| Row1   | row2 | Row3   | unique_id |
| ------ | ---- | ------ | --------- |
| First  | row  | First  | 1         |
| First  | row  | First  | 1         |
| Second | 22   | scondd | 2         |
| Second | 22   | scondd | 2         |
| Second | 22   | scondd | 2         |
| Third  | 22   | scondd | 3         |

Basically, the unique_id is created from ["Row1", "row2"].
Every time Row1 and row2 are both equal, the id remains the same;
when they are not, the id is increased.

-- --

The idea is to create a unique integer id over the target columns.
The code snippet above works, but I want something clearer and more performant that uses the power of pandas.

Answer 1

You can use duplicated and cumsum to get that done.

Starting with your sample data frame:

     Row1 row2    Row3
0   First  row   First
1   First  row   First
2  Second   22  scondd
3  Second   22  scondd
4  Second   22  scondd
5   Third   22  scondd
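
To make the steps below reproducible, the sample frame can be built roughly like this. This is a minimal sketch: the name `df` matches the code in this answer, the values are copied from the question's table, and storing `row2` as strings is an assumption, since the table mixes "row" and 22 in one column.

```python
import pandas as pd

# Sample data copied from the question's table.
# row2 is kept as strings (an assumption; the column mixes "row" and 22).
df = pd.DataFrame(
    {
        "Row1": ["First", "First", "Second", "Second", "Second", "Third"],
        "row2": ["row", "row", "22", "22", "22", "22"],
        "Row3": ["First", "First", "scondd", "scondd", "scondd", "scondd"],
    }
)
print(df)
```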

Execute:

df['unique_id'] = (~df.duplicated(['Row1','row2'])).cumsum()
print(df)

Result:

     Row1 row2    Row3  unique_id
0   First  row   First          1
1   First  row   First          1
2  Second   22  scondd          2
3  Second   22  scondd          2
4  Second   22  scondd          2
5   Third   22  scondd          3
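
One caveat: duplicated plus cumsum gives every occurrence of a key the same id only when equal rows sit next to each other, as they do in the sample. If a key can reappear later in the frame, the later row picks up the running count of whichever group precedes it. Below is a sketch of an order-independent alternative using `groupby(...).ngroup()`; the `shuffled` frame and the column names `cumsum_id` and `ngroup_id` are made up for illustration and are not part of the question.

```python
import pandas as pd

# Hypothetical frame in which the key ("First", "row") reappears after another key.
shuffled = pd.DataFrame(
    {
        "Row1": ["First", "Second", "First", "Third"],
        "row2": ["row", "22", "row", "22"],
    }
)
keys = ["Row1", "row2"]

# Running count of first occurrences: the repeated "First" row at index 2
# inherits the id of the "Second" group that precedes it.
shuffled["cumsum_id"] = (~shuffled.duplicated(keys)).cumsum()

# ngroup numbers the groups starting at 0; sort=False keeps first-appearance order,
# and + 1 makes the ids start at 1 like the table in the question.
shuffled["ngroup_id"] = shuffled.groupby(keys, sort=False).ngroup() + 1

print(shuffled)
```

If the data is sorted by the key columns first, both approaches should agree. Note that groupby excludes rows with missing values in the key columns by default; in recent pandas versions, passing `dropna=False` to groupby keeps them.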
