
How can I generate a unique int from a unique string?

I have an object with a String that holds a unique id (such as "ocx7gf" or "67hfs8"). I need to supply it with an implementation of int hashCode() that will, obviously, be unique.

How do I convert a string to a unique int in the easiest/fastest way?

Thanks.

Edit: OK, I already know that String.hashCode() is an option, but it does not seem to be recommended anywhere. And if no other method is recommended either, should I use it or not, given that my object is in a collection and I need the hash code? Should I concatenate it with another string to make it work better?

No, you don't need an implementation that returns a unique value, "obviously", since obviously the majority of implementations would then be broken.

What you want is a good spread across the bits, especially for common values (if some values are more common than others). Barring special knowledge of your format, just using the hash code of the string itself would be best.
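As a minimal sketch of "just use the hash code of the string itself": the class name Item and the field name id are made up for illustration, since the question doesn't show the real class. equals() compares the id and hashCode() delegates to String.hashCode(), so the two stay consistent with each other.

```java
import java.util.Objects;

// Hypothetical class: the name Item and the field id are illustrative only.
final class Item {
    private final String id;

    Item(String id) {
        this.id = Objects.requireNonNull(id, "id");
    }

    // Two items are equal exactly when their ids are equal.
    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof Item)) return false;
        return id.equals(((Item) other).id);
    }

    // Delegates to String.hashCode(): equal ids always give equal hashes,
    // and unequal ids usually (not always) give different ones.
    @Override
    public int hashCode() {
        return id.hashCode();
    }
}
```

With this in place, a HashSet or HashMap treats two Item objects with the same id as the same element, which is all the collections need.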

With special knowledge of the limits of your id format, it may be possible to customise the hash and get better performance, though false assumptions are more likely to make things worse than better.

Edit: On a good spread of bits.

As stated here and in other answers, being completely unique is impossible and hash collisions are possible. Hash-using methods know this and can deal with it, but it does impact performance, so we want collisions to be rare.

Further, hashes are generally re-hashed, so our 32-bit number may end up being reduced to, e.g., one in the range 0 to 22, and we want as good a distribution within that range as possible too.
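To make the re-hashing step concrete, here is a rough sketch of reducing a 32-bit hash to a bucket index in the range 0 to 22. The 23-bucket table and the plain modulo reduction are assumptions for illustration; real collections use their own reduction, and java.util.HashMap in particular does it differently.

```java
final class BucketDemo {
    // Reduce a 32-bit hash to a bucket index in [0, buckets).
    // Math.floorMod keeps the result non-negative even for negative hashes.
    static int bucketIndex(int hash, int buckets) {
        return Math.floorMod(hash, buckets);
    }

    public static void main(String[] args) {
        String[] ids = {"ocx7gf", "67hfs8"};
        for (String id : ids) {
            System.out.println(id + " -> bucket " + bucketIndex(id.hashCode(), 23));
        }
    }
}
```

Two different 32-bit hashes can easily land in the same bucket (23 and 46 both do here), which is why the spread after reduction matters as much as the spread before it.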

We also want to balance this against taking so long to compute our hash that it becomes a bottleneck in itself. It is an imperfect balancing act.

A classic example of a bad hash method is one for a co-ordinate pair of X, Y ints that does:

return X ^ Y;

While this does a perfectly good job of returning 2^32 possible values out of the 4^32 (that is, 2^64) possible inputs, in real-world use it's quite common to have sets of coordinates where X and Y are equal ({0, 0}, {1, 1}, {2, 2} and so on), which all hash to zero, or mirrored pairs ({2, 3} and {3, 2}), which hash to the same number. We are likely better served by:

return ((X << 16) | (X >>> 16)) ^ Y;

Now, there are just as many possible inputs for which this is dreadful as there are for the former, but it tends to serve better in real-world cases.
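A small sketch comparing the two coordinate hashes side by side. The class and method names are made up; the second method uses Integer.rotateLeft, which is a 16-bit rotation of X and is what the shift-and-or expression is aiming for.

```java
final class PointHashes {
    // The weak version: symmetric in x and y, and zero whenever x == y.
    static int xorHash(int x, int y) {
        return x ^ y;
    }

    // Rotates x by 16 bits before mixing, so swapping x and y,
    // or having x == y, no longer collapses to the same value.
    static int rotatedHash(int x, int y) {
        return Integer.rotateLeft(x, 16) ^ y;
    }

    public static void main(String[] args) {
        System.out.println(xorHash(2, 3) == xorHash(3, 2));
        System.out.println(rotatedHash(2, 3) == rotatedHash(3, 2));
    }
}
```

The rotated version still collides, just on less likely inputs: {1, 65536} and {0, 0} both hash to zero, for instance.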

Of course, the job is different depending on whether you are writing a general-purpose class (with no idea what the possible inputs are) or have a better idea of the purpose at hand. For example, if I was using Date objects but knew that they would all be dates only (time part always midnight) and all within a few years of each other, then I might prefer a custom hash code that used only the day, the month and the lower digits of the year over the standard one. The writer of Date, though, can't rely on such knowledge and has to try to cater for everyone.
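As a rough sketch of that date idea: it uses java.time.LocalDate rather than java.util.Date (my substitution, to keep the example short), and the particular way of combining the fields is just one possible choice. It is only sensible under the stated assumption that all dates fall within the same century.

```java
import java.time.LocalDate;

final class DateOnlyHash {
    // Assumes date-only values that all lie within a span well under a century.
    static int hash(LocalDate d) {
        int yy = Math.floorMod(d.getYear(), 100);    // lower two digits of the year
        int month = d.getMonthValue() - 1;           // 0..11
        int day = d.getDayOfMonth() - 1;             // 0..30
        return (yy * 12 + month) * 31 + day;
    }

    public static void main(String[] args) {
        System.out.println(hash(LocalDate.of(2021, 3, 15)));
    }
}
```

Within one century every date gets its own value; dates exactly 100 years apart collide, which is the price of the assumption.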

Hence, if I knew for instance that a given string was always going to consist of 6 case-insensitive characters in the range [a-z] or [0-9] (which yours seem to, but it isn't clear from your question that this is so), then I might use an algorithm that assigned a value from 0 to 35 (the 36 possible values for each character) to each character, and then walk through the string, each time multiplying the current value by 36 and adding the value of the next char.
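A minimal sketch of that walk, assuming the ids really are base-36 characters. Which character gets which value is my choice here (digits map to 0-9, letters to 10-35, via Character.digit); the names Base36Hash, digitValue and hash are illustrative.

```java
final class Base36Hash {
    // Maps '0'-'9' to 0-9 and 'a'-'z' (either case) to 10-35.
    static int digitValue(char c) {
        int v = Character.digit(c, 36);
        if (v < 0) {
            throw new IllegalArgumentException("not a base-36 character: " + c);
        }
        return v;
    }

    // Multiply the running value by 36 and add the next character's value.
    static int hash(String id) {
        int h = 0;
        for (int i = 0; i < id.length(); i++) {
            h = h * 36 + digitValue(id.charAt(i)); // may wrap around; fine for a hash
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hash("ocx7gf"));
        System.out.println(hash("67hfs8"));
    }
}
```

One caveat: 36^6 is slightly larger than 2^31, so six-character ids near the top of the range wrap around to negative values. That is harmless for a hash code, but it means the result is not guaranteed unique across all possible six-character ids.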

Assuming a good spread in the ids, this would be the way to go, especially if I arranged the order so that the least significant digits in my hash matched the most frequently changing char in the id (if such a call could be made), so that the hash survives re-hashing to a smaller range well.

However, lacking such knowledge of the format for sure, I can't make that call with certainty, and I could well be making things worse (a slower algorithm for little or even negative gain in hash quality).

One advantage you have is that since it's an ID in itself, presumably no other non-equal object has the same ID, and hence no other properties need be examined. This doesn't always hold.

You can't get a unique integer from a String of unlimited length. There are about 4 billion (2^32) unique integers, but an effectively unlimited number of unique strings.

String.hashCode() will not give you unique integers, but it will do its best to give you differing results based on the input string.
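A well-known pair that shows the non-uniqueness: "Aa" and "BB" are different strings with the same String.hashCode(). The helper name sameHash is made up for the example.

```java
final class CollisionDemo {
    // True when two strings share a hash code, whether or not they are equal.
    static boolean sameHash(String a, String b) {
        return a.hashCode() == b.hashCode();
    }

    public static void main(String[] args) {
        System.out.println(sameHash("Aa", "BB")); // prints true
        System.out.println("Aa".equals("BB"));    // prints false
    }
}
```

Hash-based collections cope with this by falling back on equals() when hashes match, so a collision costs a little time but never correctness.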

EDIT

Your edited question says that String.hashCode() is not recommended. This is not true: it is recommended, unless you have some special reason not to use it. If you do have a special reason, please provide details.

Looks like you've got a base-36 number there (a-z + 0-9). Why not convert it to an int using Integer.parseInt(s, 36)? Obviously, if there are too many unique IDs, it won't fit into an int, but in that case you're out of luck with unique integers and will need to get by using String.hashCode(), which does its best to be close to unique.
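A hedged sketch of that suggestion, with the fallback built in. The class and method names are illustrative. Integer.parseInt throws NumberFormatException both for invalid characters and for values that don't fit in an int, so a single catch covers both cases.

```java
final class Base36Id {
    // Returns the id's base-36 value when it fits in an int;
    // otherwise falls back to String.hashCode(), which is no longer unique.
    static int toInt(String id) {
        try {
            return Integer.parseInt(id, 36);
        } catch (NumberFormatException e) {
            return id.hashCode();
        }
    }

    public static void main(String[] args) {
        int n = toInt("ocx7gf");
        System.out.println(n);
        System.out.println(Integer.toString(n, 36)); // converts back to the id
    }
}
```

Both sample ids from the question fit, but not every six-character id does: "zzzzzz" is 36^6 - 1, which is just above Integer.MAX_VALUE. Note also that parseInt ignores case and leading zeros, so the result is only unique if the ids are all the same length and case-insensitive.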

Unless your strings are limited in some way, or your integers hold more bits than the strings you're trying to convert, you cannot guarantee uniqueness.

Let's say you have a 32-bit integer and a 64-character character set for your strings. That means six bits per character, which will allow you to store five characters in an integer. Any more than that and it won't fit.
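To illustrate the six-bits-per-character arithmetic, here is a sketch that packs exactly five characters into the low 30 bits of an int and unpacks them again. The 64-character alphabet is a hypothetical choice; any fixed set of 64 distinct characters would do.

```java
final class SixBitPacker {
    // Hypothetical 64-character alphabet (26 + 26 + 10 + 2 characters).
    private static final String ALPHABET =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

    // Packs exactly five characters, 6 bits each, into the low 30 bits of an int.
    static int pack(String s) {
        if (s.length() != 5) {
            throw new IllegalArgumentException("need exactly 5 characters");
        }
        int packed = 0;
        for (int i = 0; i < 5; i++) {
            int v = ALPHABET.indexOf(s.charAt(i));
            if (v < 0) {
                throw new IllegalArgumentException("character not in alphabet");
            }
            packed = (packed << 6) | v;
        }
        return packed;
    }

    // Reverses pack(): reads five 6-bit groups, last character first.
    static String unpack(int packed) {
        char[] out = new char[5];
        for (int i = 4; i >= 0; i--) {
            out[i] = ALPHABET.charAt(packed & 0x3F);
            packed >>>= 6;
        }
        return new String(out);
    }
}
```

A sixth character would need 36 bits, which is why five is the limit for a 32-bit int with this character set.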

One way to do it is to assign each letter a value and each position in the string its own multiplier: a = 1, b = 2, and so on. Everything in the first position (reading left to right) is multiplied by a prime number, the next position by the next prime number, and so on, such that the final position is multiplied by a prime larger than the number of possible values in that position (26+1 with a space, or 52+1 with capitals, and so on for other supported characters). If the number is mapped back to the first position (the leftmost character), any number you generate from a unique string, mapping back to 1 or 6 or whatever the first letter is, gives a unique value.

Dog might be 30, 3(15), 101(7), or 782, while God might be 33, 3(15), 101(4), or 482. More important than the unique values being generated, they can be useful if the original first digit is kept: 30(782) would be distinct from some 12(782), which differentiates similar strings if you ever managed to exhaust the unique possibilities. Dog would always be Dog, but it would never be Cat or Mouse.

Represent each character of the string by a five-digit binary number, e.g. a by 00001, b by 00010, etc., so 32 combinations are possible. For example, cat would be written as 00011 00001 10100. Then convert this binary number to decimal, which gives 3124, so cat would be 3124. Similarly, you can get cat back from 3124 by converting it to binary first and mapping each five-digit group back to a character.
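A sketch of that five-bit scheme, assuming lowercase a-z only with a = 1 through z = 26. The class and method names are made up. Six letters take 30 bits, so that is the most that fits in an int this way.

```java
final class FiveBitCodec {
    // Encodes up to six lowercase letters, 5 bits each (a = 1 ... z = 26).
    static int encode(String word) {
        if (word.length() > 6) {
            throw new IllegalArgumentException("at most 6 letters fit in an int");
        }
        int value = 0;
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            if (c < 'a' || c > 'z') {
                throw new IllegalArgumentException("only a-z supported");
            }
            value = (value << 5) | (c - 'a' + 1);
        }
        return value;
    }

    // Decodes a value produced by encode(): peels off 5-bit groups from the
    // low end, then reverses them to restore the original order.
    static String decode(int value) {
        StringBuilder sb = new StringBuilder();
        while (value != 0) {
            sb.append((char) ('a' + (value & 0x1F) - 1));
            value >>>= 5;
        }
        return sb.reverse().toString();
    }
}
```

Because no letter maps to 00000, words of different lengths never encode to the same number, so decode(encode(w)) returns w for any valid word of up to six letters.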
