
Pandas - Generate Unique ID based on row values

I would like to generate an integer-based unique ID for users (in my df).

Let's say I have:

index  first  last    dob
0      peter  jones   20000101
1      john   doe     19870105
2      adam   smith   19441212
3      john   doe     19870105
4      jenny  fast    19640822

I would like to generate an ID column like so:

index  first  last    dob       id
0      peter  jones   20000101  1244821450
1      john   doe     19870105  1742118427
2      adam   smith   19441212  1841181386
3      john   doe     19870105  1742118427
4      jenny  fast    19640822  1687411973

The ID has 10 digits and is based on the values of the fields, so rows with identical values (the two john doe rows) get the same ID.

I've looked into hashing, encryption, and UUIDs, but can't find much related to this specific non-security use case. It's just about generating an internal identifier.

  • I can't use groupby/cat-code style methods, because the IDs would change if the order of the rows changes.
  • The dataset won't grow beyond 50k rows.
  • It is safe to assume there won't be a first, last, dob duplicate.

I feel like I may be tackling this the wrong way, as I can't find much literature on it!

Thanks

You can try using a hash function.

df['id'] = df[['first', 'last']].sum(axis=1).map(hash)

Please note that the hash id is longer than 10 digits, can be negative, and is the same for rows with identical values within one Python session.
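By default, Python salts the built-in hash() of strings per process (unless PYTHONHASHSEED is set), so the ids above can differ from one run to the next. A minimal sketch of one way around that, assuming hashlib from the standard library is acceptable and that dob should be part of the key; stable_id and the rows list are illustrative names, not part of the original answer.

```python
import hashlib

def stable_id(first, last, dob):
    """Return an integer derived from the row values; identical on every run."""
    # Join the fields with a separator so ("ab", "c") and ("a", "bc") differ.
    key = "|".join([str(first), str(last), str(dob)])
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big")

# Sample rows copied from the question.
rows = [
    ("peter", "jones", 20000101),
    ("john", "doe", 19870105),
    ("adam", "smith", 19441212),
    ("john", "doe", 19870105),
    ("jenny", "fast", 19640822),
]
ids = [stable_id(*row) for row in rows]
```

With the DataFrame from the question, this could be applied as df['id'] = [stable_id(f, l, d) for f, l, d in zip(df['first'], df['last'], df['dob'])]. The result is not limited to 10 digits.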

Here's a way of doing it using numpy:

import numpy as np
np.random.seed(1)

# create a list of unique names
names = df[['first', 'last']].agg(' '.join, 1).unique().tolist()

# generate ids (int64 so the 10-digit upper bound also fits where the default integer is 32-bit)
ids = np.random.randint(low=10**9, high=10**10, size=len(names), dtype=np.int64)

# map names to ids
maps = dict(zip(names, ids))

# add new id column
df['id'] = df[['first', 'last']].agg(' '.join, 1).map(maps)

   index  first   last       dob          id
0      0  peter  jones  20000101  9176146523
1      1   john    doe  19870105  8292931172
2      2   adam  smith  19441212  4108641136
3      3   john    doe  19870105  8292931172
4      4  jenny   fast  19640822  6385979058
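np.random.randint draws with replacement, so nothing prevents two names from receiving the same id, even if that is unlikely for a handful of names. A minimal sketch of one alternative, assuming the standard-library random module is acceptable: random.sample picks distinct values from the 10-digit range. The names list and id_by_name are illustrative. As in the answer above, these ids are random rather than derived from the row values, so the mapping would have to be saved and reused to keep ids stable.

```python
import random

# Unique keys, in the same "first last" form as in the answer above.
names = ["peter jones", "john doe", "adam smith", "jenny fast"]

# A seeded generator makes the draw repeatable for the same list of names.
rng = random.Random(1)

# Sampling without replacement guarantees that all ids are distinct.
ids = rng.sample(range(10**9, 10**10), k=len(names))

# Map each name to its id.
id_by_name = dict(zip(names, ids))
```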

You can apply the function below to a column of your data frame.

def generate_id(s):
    return abs(hash(s)) % (10 ** 10)

df['id'] = df['first'].apply(generate_id)

If you find that some values do not have the exact number of digits, you can do something like the following:

def generate_id(s, size):
    val = str(abs(hash(s)) % (10 ** size))
    if len(val) < size:
        diff = size - len(val)
        val = str(val) + str(generate_id(s[:diff], diff))
    return int(val)
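Padding by recursion can still return fewer digits than requested, because int() drops leading zeros. A sketch of a different approach, assuming a hashlib digest is used instead of the built-in hash(): map the digest into the range from 10**(size-1) up to 10**size, so every id has exactly size digits. With only 9 × 10^9 possible 10-digit ids, a rough birthday-problem estimate puts the chance of at least one collision at around 13% for 50k distinct keys, so a collision check is included. fixed_width_id and find_collisions are hypothetical helper names.

```python
import hashlib

def fixed_width_id(key, size=10):
    """Map a key to an integer with exactly `size` digits, stable across runs."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    n = int.from_bytes(digest[:8], "big")
    low = 10 ** (size - 1)
    # The result lies in [low, 10 * low), so the leading digit is never zero.
    return low + n % (9 * low)

def find_collisions(keys, size=10):
    """Return pairs of distinct keys that map to the same id."""
    seen = {}
    collisions = []
    for key in set(keys):
        new_id = fixed_width_id(key, size)
        if new_id in seen:
            collisions.append((seen[new_id], key))
        else:
            seen[new_id] = key
    return collisions

# Keys built from first, last and dob, as in the question.
keys = ["peter|jones|20000101", "john|doe|19870105",
        "adam|smith|19441212", "jenny|fast|19640822"]
ids = [fixed_width_id(k) for k in keys]
```

If find_collisions returns a non-empty list, one option is to raise size or to resolve the affected keys by hand.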
