
Why does taking the salted hash of the mod of a hash result in a very non-uniform distribution?

I have a million randomly generated unique IDs.

If I do:

result = int(hash(id + 'some_salt')) % 1000

Then this seems to result in an even distribution of IDs to some integer between 0 and 999, with each integer having approximately 1000 IDs mapped to it.
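That even distribution can be checked empirically. A minimal sketch (using SHA-256, a hypothetical salt `'some_salt'`, and stringified sequential integers standing in for the ids):

```python
import hashlib
from collections import Counter

# Salted hash of each id, reduced mod 1000 (salt and ids are illustrative).
counts = Counter(
    int(hashlib.sha256((str(id) + 'some_salt').encode()).hexdigest(), 16) % 1000
    for id in range(1000000)
)

# Every one of the 1000 buckets is hit, each with roughly 1000 ids.
print(len(counts), min(counts.values()), max(counts.values()))
```

The bucket counts fluctuate around 1000 with a standard deviation of about 32, as expected for a binomial distribution with n = 10^6 and p = 1/1000.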

If I instead first reduce the unsalted hash modulo 1000, then append some salt and hash again:

x = int(hash(id)) % 1000
result = int(hash(str(x) + 'some_salt')) % 1000

Then the resulting distribution is completely non-uniform. For each ID, the result is of course in the range of [0,999], but some integers in this range have zero IDs mapped to them, while others have several thousand.

Why does this result in a very non-uniform distribution of values?

How can I adjust this to result in a uniform distribution of integers in the range [0,999] for my million IDs, and any given salt? I want to keep the intermediate step of reducing the potentially very large input space to some much smaller space (e.g. of size 1000).

I'm using SHA-256 hashing.

Here is some Python code which demonstrates the very non-uniform results:

import numpy as np
import hashlib

OUTPUT_RANGE_SIZE = 1000

unique_ids = range(1000000)  # sequential here, but could be any kind of unique ids
frequencies = np.zeros(OUTPUT_RANGE_SIZE, dtype='int')

for id in unique_ids:
    # First stage: reduce the id's hash to one of 1000 values.
    hash_mod = int(hashlib.sha256(str(id).encode()).hexdigest(), 16) % 1000
    # Second stage: salt and hash that reduced value.
    result = int(hashlib.sha256((str(hash_mod) + 'some_salt').encode()).hexdigest(), 16) % OUTPUT_RANGE_SIZE
    frequencies[result] += 1

print(frequencies)

By applying the modulo operator on your first hash operation, you've ensured that there are only 1000 unique outputs from that stage, regardless of how many unique numbers you had as inputs. When you hash it and modulo it again, by chance some of those hashes will map to the same buckets; as a result the number of values in the bucket will be roughly 1000 times the number of values that hashed to that bucket ID. You can see this by dividing your values in the frequencies array by 1000:

[1, 0, 2, 1, 0, 0, 0, ...]
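The gaps follow from a standard balls-in-bins argument: the second stage hashes only 1000 distinct inputs into 1000 buckets, so any given bucket is empty with probability (1 - 1/1000)^1000 ≈ 1/e ≈ 36.8%, leaving roughly 632 occupied buckets. A sketch that counts the occupied buckets directly (the salt name is illustrative):

```python
import hashlib

# The values 0..999 are the only possible outputs of the first stage,
# so they are the only inputs the salted second stage ever sees.
reached = {
    int(hashlib.sha256((str(x) + 'some_salt').encode()).hexdigest(), 16) % 1000
    for x in range(1000)
}

print(len(reached))  # roughly 632 of the 1000 buckets
```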

If you remove the modulo operator from the first step, your output values in the second step will be evenly distributed as expected.
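A sketch of that fix, feeding the full first-stage digest (rather than its value mod 1000) into the salted second stage; the salt and pipeline details are illustrative:

```python
import hashlib
from collections import Counter

counts = Counter()
for id in range(1000000):
    # Keep the full first-stage digest instead of reducing it mod 1000,
    # so the salted second stage sees (with overwhelming probability) unique inputs.
    first_digest = hashlib.sha256(str(id).encode()).hexdigest()
    result = int(hashlib.sha256((first_digest + 'some_salt').encode()).hexdigest(), 16) % 1000
    counts[result] += 1

# All 1000 buckets occupied, each with roughly 1000 ids again.
print(len(counts), min(counts.values()), max(counts.values()))
```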

Obligatory postscript: Don't invent your own cryptosystems. If this is security critical, learn about the best practices and implement them.
