Create unique_id based on column deduplication in Pandas
I'm trying to generate a unique_id using some columns as the basis.
The current process has the following steps:
First I create a bool column, is_duplicated; then I loop over the non-duplicated rows and write an id to every row that matches on all of the key columns:

    optimal = ["date", "amount", "description", "tenant_id", "comment", "bank_account_id"]
    data_normalization["is_duplicated"] = data_normalization.duplicated(subset=optimal)

    # Each first occurrence gets the next id, which is copied to every row with the same key values
    for unique_id, row in enumerate(data_normalization.loc[~data_normalization.is_duplicated].itertuples()):
        data_normalization.loc[
            (data_normalization.date == row.date) &
            (data_normalization.amount == row.amount) &
            (data_normalization.description == row.description) &
            (data_normalization.tenant_id == row.tenant_id) &
            (data_normalization.comment == row.comment) &
            (data_normalization.bank_account_id == row.bank_account_id),
            "unique_id"
        ] = unique_id
The approach above works, but I'm wondering whether there is a better way to do it using pandas features.
Example:
| Row1 | row2 | Row3 | unique_id |
| -------- | -------------- | -------- | -------- |
| First | row | First | 1 |
| First | row | First | 1 |
| Second | 22 |scondd | 2 |
| Second | 22 |scondd | 2 |
| Second | 22 |scondd | 2 |
| Third | 22 |scondd | 3 |
-- --
You can use duplicated and cumsum to get that done.
Starting with your sample data frame
         Row1 row2    Row3
    0   First  row   First
    1   First  row   First
    2  Second   22  scondd
    3  Second   22  scondd
    4  Second   22  scondd
    5   Third   22  scondd
Execute

    df['unique_id'] = (~df.duplicated(['Row1','row2'])).cumsum()
    print(df)
Result
         Row1 row2    Row3  unique_id
    0   First  row   First          1
    1   First  row   First          1
    2  Second   22  scondd          2
    3  Second   22  scondd          2
    4  Second   22  scondd          2
    5   Third   22  scondd          3
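One caveat that may matter for real data: the duplicated/cumsum approach gives matching rows the same id only when the duplicates sit next to each other, as they do in the sample. If the same combination can reappear further down the frame, grouping on the key columns and numbering the groups is one possible alternative. The sketch below is an illustration rather than part of the original answer; the shuffled sample frame and the name key_cols are assumptions, and missing values in the key columns are not handled here.

```python
import pandas as pd

# Hypothetical sample in which duplicate rows are NOT adjacent
df = pd.DataFrame({
    "Row1": ["First", "Second", "First", "Third", "Second"],
    "row2": ["row", "22", "row", "22", "22"],
    "Row3": ["First", "scondd", "First", "scondd", "scondd"],
})
key_cols = ["Row1", "row2", "Row3"]  # assumed key columns for this sketch

# duplicated/cumsum: a repeated row inherits the id of whatever row came just before it
cumsum_ids = (~df.duplicated(key_cols)).cumsum()

# groupby/ngroup: sort=False numbers groups in order of first appearance;
# ngroup() is 0-based, so add 1 to start the ids at 1
df["unique_id"] = df.groupby(key_cols, sort=False).ngroup() + 1

print(cumsum_ids.tolist())
print(df)
```

With the question's frame, the same idea would read data_normalization.groupby(optimal, sort=False).ngroup(), which replaces both the is_duplicated column and the loop.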