
How to concatenate and convert multiple 32-bit hash strings to a unique identifier in Python

I have an issue where I am extracting a dataset from an API for reporting purposes, and unfortunately it has:

  1. No unique identifier field
  2. Three of the four fields that make up a unique composite key are each 32-bit hash values. They shouldn't be hashes, but the developer appears to have hashed them for some reason in this particular API endpoint.

I am using Python 3.7.6 and pandas 1.0.3. The data will ultimately end up in SQL Server.

For my task, whenever I call the REST API I need to be able to check record uniqueness and, if existing records in my database have been updated, use a unique identifier to work out which rows to update.

The 32-bit hash values are causing issues as:

  1. When concatenating the 3 hash fields (and one datetime field) that form a valid composite key, the resulting string is too long to use as a database primary key.

  2. The length appears to be overwhelming pandas' .duplicated() method (see the example below).

  3. I do not know a way to 'unhash' them back to a unique identifier (e.g. an integer) that I could use for checking against existing records in the database that may require updating. How can I convert this long value into a valid unique identifier that is always the same for identical hash strings?

Pandas includes pd.factorize, which could in theory be used to create a unique identifier for newly extracted data, but it is then inconsistent with previously extracted data already in the database, as it will generate different keys for the same hashes. Furthermore, similar to #2, factorize also fails to work accurately given the string length (see below).
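One option I have considered (only a rough sketch, not something I have settled on) is to skip factorize and derive the identifier deterministically myself: re-hash the concatenated composite-key fields with hashlib and keep the first 8 bytes of the digest as a signed 64-bit integer, so the same input strings always produce the same key and the result fits a SQL Server BIGINT. The column names below are the ones from my concatenation further down; the 8-byte truncation is my own choice and carries a tiny (but non-zero) collision risk.

import hashlib

def composite_key(row) -> int:
    # Deterministic 64-bit surrogate key built from the composite-key fields.
    # The same input strings always produce the same integer, so keys computed
    # today match keys computed from earlier extracts (unlike pd.factorize).
    raw = "|".join([
        str(row["Question ID"]),
        str(row["WorkflowInstance ID"]),
        str(row["Agent ID"]),
        str(row["Evaluation Start Date"]),
    ])
    digest = hashlib.sha256(raw.encode("utf-8")).digest()
    # Keep the first 8 bytes as a signed integer so it fits a BIGINT column.
    return int.from_bytes(digest[:8], "big", signed=True)

df["surrogate_id"] = df.apply(composite_key, axis=1)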

Here is the example of #2 above, where pandas' .duplicated() fails. I concatenate the four composite-key fields (three hashes plus the datetime) in pandas as follows:

df['id'] = (df['Question ID'].astype(str)+df['WorkflowInstance ID'].astype('str')+df['Agent ID'].astype('str')+df['Evaluation Start Date'].astype('str')).sort_values()

And then do:

df['id'].duplicated().sum()

I get 19 duplicates:

Out[252]: 19

Here are those 'duplicates' (note: I have removed the dataframe row numbers from the output):

Out[264]: 

100094146736011ea2dbb-ff69-f5d1-8fa5-0242ac11000311e9e0ef-60e7-a470-9724-0242ac1100052020-01-03 11:00:05+11:00

101580231257511ea533b-5eb8-8bb1-a514-0242ac11000211e9e0ef-5ec7-3570-9724-0242ac1100052020-02-20 04:15:04+11:00

102935022988411ea2dbb-ff69-f5d1-8fa5-0242ac11000311e9e0ef-60e7-a470-9724-0242ac1100052020-01-03 11:00:05+11:00

103643806614811ea5122-ee46-6471-a514-0242ac11000211e9e0ef-62d2-f9b0-9724-0242ac1100052020-02-17 12:15:05+11:00

104448956250611ea1d09-3888-4fb1-ad55-0242ac11000411e9e0ef-6156-6bd0-9724-0242ac1100052019-12-13 05:00:03+11:00

104448956250611ea2dbb-ff69-f5d1-8fa5-0242ac11000311e9e0ef-60e7-a470-9724-0242ac1100052020-01-03 11:00:05+11:00

105036638204211ea2dbb-ff69-f5d1-8fa5-0242ac11000311e9e0ef-60e7-a470-9724-0242ac1100052020-01-03 11:00:05+11:00

105439525877511ea1d09-3888-4fb1-ad55-0242ac11000411e9e0ef-6156-6bd0-9724-0242ac1100052019-12-13 05:00:03+11:00

105439525877511ea3cba-ec08-1e01-b4fb-0242ac11000511e9e0ef-6156-6bd0-9724-0242ac1100052020-01-22 13:00:11+11:00

105753070464411ea1d09-3888-4fb1-ad55-0242ac11000411e9e0ef-6156-6bd0-9724-0242ac1100052019-12-13 05:00:03+11:00

105753070464411ea2dbb-ff69-f5d1-8fa5-0242ac11000311e9e0ef-60e7-a470-9724-0242ac1100052020-01-03 11:00:05+11:00

105929086494811ea2dbb-ff69-f5d1-8fa5-0242ac11000311e9e0ef-60e7-a470-9724-0242ac1100052020-01-03 11:00:05+11:00

105942530227011ea1d09-3888-4fb1-ad55-0242ac11000411e9e0ef-6156-6bd0-9724-0242ac1100052019-12-13 05:00:03+11:00

106416598476711ea533b-5eb8-8bb1-a514-0242ac11000211e9e0ef-5ec7-3570-9724-0242ac1100052020-02-20 04:15:04+11:00

107187437764311ea1d09-3888-4fb1-ad55-0242ac11000411e9e0ef-6156-6bd0-9724-0242ac1100052019-12-13 05:00:03+11:00

108645054061411ea1d09-3888-4fb1-ad55-0242ac11000411e9e0ef-6156-6bd0-9724-0242ac1100052019-12-13 05:00:03+11:00

108669477403111ea1d09-3888-4fb1-ad55-0242ac11000411e9e0ef-6156-6bd0-9724-0242ac1100052019-12-13 05:00:03+11:00

108669477403111ea3cba-ec08-1e01-b4fb-0242ac11000511e9e0ef-6156-6bd0-9724-0242ac1100052020-01-22 13:00:11+11:00

108783533783711ea6ebc-2980-1bb1-be7d-0242ac11000311e9e0ef-5a8e-e2f0-a8b3-0242ac1100032020-03-26 04:15:01+11:00

They do not appear to be duplicates. And given they are strings, I would have thought they wouldn't be subject to floating-point or rounding errors?

However, if I rinse and repeat, I get 0 duplicates (as it should be):

df['id'][df['id'].duplicated()].sort_values().duplicated().sum()

Out[257]: 0
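For what it's worth, one way to inspect this further (a small diagnostic sketch, nothing more): duplicated(keep=False) marks every member of a duplicate group rather than only the second and later occurrences, so both halves of each supposed pair can be compared side by side, and value_counts() gives an independent count of repeated values.

# Show every row whose 'id' occurs more than once, sorted so that
# matching values end up next to each other for visual comparison.
dupes = df[df['id'].duplicated(keep=False)].sort_values('id')
print(dupes['id'])

# Independent check: how many distinct 'id' values occur more than once?
print((df['id'].value_counts() > 1).sum())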

What could be causing this strange behaviour?

This workaround lets me at least identify uniqueness in the records, but it doesn't address the other issues above.

The solution I have come up with is to maintain a secondary mapping table of known hash values and an associated integer key. I use this mapping to create a unique identifier and store that integer in the target table instead of the hashes.
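Here is a minimal sketch of that mapping-table idea (the variable and column names are just placeholders; in practice the mapping table lives in SQL Server and is read before, and written back after, each extract). New hash values simply receive the next integer, so the keys stay stable across extracts.

import pandas as pd

def update_mapping(new_hashes: pd.Series, mapping: pd.DataFrame) -> pd.DataFrame:
    # Add any unseen hash values to the mapping table with new integer keys.
    # `mapping` has columns ['hash_value', 'surrogate_key'].
    unseen = new_hashes[~new_hashes.isin(mapping["hash_value"])].drop_duplicates()
    if unseen.empty:
        return mapping
    start = int(mapping["surrogate_key"].max()) + 1 if len(mapping) else 1
    new_rows = pd.DataFrame({
        "hash_value": unseen.to_numpy(),
        "surrogate_key": range(start, start + len(unseen)),
    })
    return pd.concat([mapping, new_rows], ignore_index=True)

# Example: replace one hashed column with its stable integer key.
mapping = pd.DataFrame(columns=["hash_value", "surrogate_key"])
mapping = update_mapping(df["Agent ID"], mapping)
df = df.merge(mapping, left_on="Agent ID", right_on="hash_value", how="left")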

So this does answer the primary question, though I am interested in other options. Hopefully this helps somebody.

Also, the pandas.duplicated() issue is still a mystery.

Cheers Pete
