简体   繁体   中英

Hash algorithm for english phrases

I'm working on an English words app right now, I want every word has a different int id, as all words are different from one another, I think they simply can be assigned an integer (or long?) easily.

I don't want to give them ids serially according to alphabetical order. I think there may be an existing algorithm for this very requirement, I don't want to invent my own wheels, so, please help me.

I prefer integer id because I want the structure to be compact and small enough for transfer over the internet, because one word list might contain hundreds and thousands of words.

Imagine I have data structure as bellow:

struct word {
  int wordId;
  byte familiarity;
}
// I prefer the mapping like this
apple -> 0x1,  0x4
app   -> 0x2E, 0x2
ape   -> 0xEA, 0x1

UPDATE:

Okay, what I'm trying to do is to provide the users several wordlists, each of which contains a couple of words, chances are the user already learn some of the words(eg. apple), so he/she wants to skip those words, and hope they'll never show up again. So, I want to enable the user to skip those words, and the selected words will be sent to the server or kept in a local file, it might be unnecessary to send the whole word or phrase. I've found a question here: http://stackoverflow.com/questions/7700400/whats-a-good-hash-function-for-english-words, do you have any better solutions?

Yes, it seems impossible to find a perfect collision free hashing algorithm, I might end up with maintaining a mapping file.

I also find a great question and answer here .

Actually I don't mind the performance of this algorithm for it's all done on server, and only once while starting up. All I want is the id for every word/phrase is unique, as short as possible, like a fingerprint. I wonder if I can take advantage of the prime numbers..

Finally, I decide to use a long for my id

(8 bit) first letter of the first word
(8 bit) last letter of the last word
(4 bit) word count
(4 bit) the serial number of the longest word in the phrase
(8 bit) character count, space included
(32bit) MurmurHash3 result

You can find murmurHash3 cs implementation here:
https://gist.github.com/automatonic/3725443
I think this approach will generate unique ids for any existing words and phrases with no collision.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM