简体   繁体   中英

How to map string keys to unique integer IDs?

I have some data that comes regularily as a dump from a data souce with a string natural key that is long (up to 60 characters) and not relevant to the end user. I am using this key in a url. This makes urls too long and user unfriendly.

I would like to transform the string keys into integers with the following requirements:

The source dataset will change over time.

The ID should be:

  • non negative integer
  • unique and constant even if the set of input keys changes
  • preferrably reversible back to key (not a strong requirement)

The database is rebuilt from scratch every time so I can not remember the already assigned IDs and match the new data set to existing IDs and generate sequential IDs for the added keys.

There are currently around 30000 distinct keys and the set is constantly growing.

How to implement a function that will map string keys to integer IDs?

What I have thought about:

1. Built-in string.GetHashCode:

ID(key) = Math.Abs(key.GetHashCode())

  • is not guaranteed to be unique
  • (not reversible)

1.1 "Re-hashing" the built-in GetHashCode until a unique ID is generated to prevent collisions.

  • existing IDs may change if something colliding is added to the beginning of the input data set

2. a perfect hashing function

  • I am not sure if this can generate constant IDs if the set of inputs changes
  • (not reversible)

3. translate to base 36/64/??

  • does not shorten the long keys enough

What are the other options?

A Base64-encoded sha1sum is 27 characters. base64(md5(...)) is 22 characters. Any smaller and you will have a non-negligible risk of collisions.

Perfect hashing functions aren't possible when the set of inputs changes.

You can only do that if you can keep a list of assigned IDs.

For any give algorithm that actually gives you unique ID for the current set, any new value is not guaranteed to get a unique ID.

The strings contain about 400 bits of information, so to get an integer that is guaranteed to be unique it would have to contain all the information from the string and be about 400 bits. That's a 120 characters expressed as a decimal number so that's not shorter than what you have now.

Set up a second, persistant DB and store your KEY/ID pairs there. Make sure you also have the data's date in the table so you can do some house-keeping.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM