Create a unique value for each row in Pandas?

Question

Acquire raw data --> transform it and join it with other files --> email to end-users for review

What is the best approach?

Answer 1

If 'employee_id' + 'customer_id' + 'timestamp' is long, and you want something that is unlikely to have collisions, you can replace it with a hash. The range and quality of the hash determine the probability of collisions. Perhaps the simplest option is the builtin hash. Assuming your DataFrame is df and the columns are strings, this is:

(df.employee_id + df.customer_id + df.timestamp).apply(hash)

If you want greater control over the size and collision probability, see this piece on non-cryptographic hash functions in Python.
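One caveat with the builtin hash: in Python 3, string hashes are randomized per process unless PYTHONHASHSEED is set, so the values can change between runs. If the IDs need to be reproducible, a possible alternative is pandas' own pandas.util.hash_pandas_object, which returns one uint64 per row. The following is a minimal sketch; the sample data is made up for illustration, and the column names follow the ones mentioned above.

```python
import pandas as pd

# Hypothetical sample data using the column names from the answer.
df = pd.DataFrame({
    "employee_id": ["e1", "e2", "e1"],
    "customer_id": ["c1", "c1", "c1"],
    "timestamp": ["2016-03-09 18:07", "2016-03-09 18:07", "2016-03-09 18:07"],
})

key_cols = ["employee_id", "customer_id", "timestamp"]

# index=False hashes only the column values, so the result does not
# depend on the row labels. The return value is a uint64 Series.
row_hashes = pd.util.hash_pandas_object(df[key_cols], index=False)
df["survey_id"] = row_hashes
```

Rows with identical values in the key columns get the same hash, so this gives a unique value per row only when the key columns themselves are unique.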

Edit

Building on an answer to this question, you could build 10-character hashes like this:

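import base64
import hashlib
# Python 3: md5 needs bytes, and base64 encoding goes through the base64 module.
df['survey_id'] = (df.employee_id + df.customer_id + df.timestamp).apply(
    lambda s: base64.b64encode(hashlib.md5(s.encode('utf-8')).digest()).decode('ascii')[:10])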

Answer 2

If anyone is looking for a modularized function, save this into a file for use where needed (for Pandas DataFrames).

df is your dataframe, columns is a list of columns to hash over, and name is the name of the new column that holds the hash values.

Returns a copy of the original dataframe with a new column containing the hash of each row.

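from hashlib import sha256

def hash_cols(df, columns, name="hash"):
    # Work on a copy so the caller's dataframe is left unchanged.
    new_df = df.copy()
    def func(row, cols):
        # Collect the string form of each selected column value.
        col_data = []
        for col in cols:
            col_data.append(str(row.at[col]))

        # Concatenate the values, encode to bytes, and hash with SHA-256.
        col_combined = ''.join(col_data).encode()
        hashed_col = sha256(col_combined).hexdigest()
        return hashed_col

    new_df[name] = new_df.apply(lambda row: func(row, columns), axis=1)

    return new_df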
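One thing to watch with plain concatenation: the rows ("ab", "c") and ("a", "bc") both join to "abc" and therefore get the same hash. A possible variant, sketched below, joins the values with a separator that is assumed not to occur in the data. The function name hash_cols_delimited and the sep parameter are hypothetical and not part of the answer above.

```python
from hashlib import sha256

import pandas as pd


def hash_cols_delimited(df, columns, name="hash", sep="\x1f"):
    """Return a copy of df with a SHA-256 hex digest of the chosen columns.

    sep is assumed not to appear in the data; the default is the ASCII
    unit separator character.
    """
    new_df = df.copy()
    # Convert the selected columns to strings and join them row by row.
    combined = new_df[columns].astype(str).apply(sep.join, axis=1)
    new_df[name] = combined.map(lambda s: sha256(s.encode("utf-8")).hexdigest())
    return new_df


# Two rows that collide under plain concatenation ("abc" in both cases).
df = pd.DataFrame({"a": ["ab", "a"], "b": ["c", "bc"]})
out = hash_cols_delimited(df, ["a", "b"])
```

With the separator in place the two rows hash differently, while the original dataframe is still left untouched.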

Answer 3

I had a similar problem, which I solved like this:

import hashlib
import pandas as pd
df = pd.DataFrame.from_dict({'mine': ['yours', 'amazing', 'pajamas'], 'have': ['something', 'nothing', 'between'], 'num': [1, 2, 3]})
hashes = []
for index, row in df.iterrows():
    hashes.append(hashlib.md5(str(row).encode('utf-8')).hexdigest())
# if you want the hashes in the df, 
# in my case, I needed them to form a JSON entry per row
df['hash'] = hashes

The results are MD5 hashes, but you can use any hash function you need.
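Note that str(row) is the printed form of the row, which includes the row's index label (the "Name: ..." footer). Two rows with identical values therefore still get different hashes. That may be exactly what you want for a per-row ID, but if the hash should depend on the values only, one option is to hash the plain tuple of values instead. A small sketch with made-up data to show the difference:

```python
import hashlib

import pandas as pd

# Two rows with identical values but different index labels (0 and 1).
df = pd.DataFrame({"mine": ["yours", "yours"], "num": [1, 1]})

# Hash of the printed row: the index label is part of the input.
repr_hashes = [
    hashlib.md5(str(row).encode("utf-8")).hexdigest()
    for _, row in df.iterrows()
]

# Hash of the values only: plain tuples without the index.
value_hashes = [
    hashlib.md5(repr(values).encode("utf-8")).hexdigest()
    for values in df.itertuples(index=False, name=None)
]
```

Here the two entries in repr_hashes differ, while the two entries in value_hashes are equal.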

Answer 1: 3 (accepted), 2016-03-09 18:07:30

Answer 2: 0, 2021-07-27 02:20:23

Answer 3: 0, 2021-08-16 17:00:35
