
A good hashing function for a non-uniform sequence of uniformly distributed 4-bit values?

I have a very specific problem:

I have uniformly random 4-bit values spread over a 15x50 grid, and the sample I want to hash is the 5x5 square of cells centered on any possible grid position.

The number of samples can thus vary from 25 (away from borders, most cases) to 20 or 15 (near a border), down to a minimum of 9 (in a corner).

So even though the cell values are random, the location introduces a deterministic variation in the sequence length.
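
To make the length variation concrete, here is a minimal sketch of how many on-grid cells a clipped 5x5 window contains. The names `GRID_W`, `GRID_H`, `RADIUS`, `clamp_int` and `window_cell_count` are made up for illustration, and treating the grid as 15 wide by 50 tall is an assumption. Besides the sizes listed above, positions next to a corner also give 16 and 12.

```c
#include <stddef.h>

/* Assumed layout: 15 columns by 50 rows; a 5x5 window has radius 2. */
enum { GRID_W = 15, GRID_H = 50, RADIUS = 2 };

/* Clamp v to the inclusive range [lo, hi]. */
static int clamp_int(int v, int lo, int hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Number of on-grid cells in the 5x5 window centered on (x, y).
   (x, y) is expected to be a valid grid position. */
size_t window_cell_count(int x, int y) {
    int x0 = clamp_int(x - RADIUS, 0, GRID_W - 1);
    int x1 = clamp_int(x + RADIUS, 0, GRID_W - 1);
    int y0 = clamp_int(y - RADIUS, 0, GRID_H - 1);
    int y1 = clamp_int(y + RADIUS, 0, GRID_H - 1);
    return (size_t)(x1 - x0 + 1) * (size_t)(y1 - y0 + 1);
}
```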

The hash table size is a small number, typically between 20 and 50.

The function will operate on a large set of randomly generated grids (a few hundred or a few thousand), and might be called a few thousand times per grid. The positions on the grid can be considered random.

I would like a function that spreads the 15x50 possible samples as evenly as possible over the hash table.

I have tried the following pseudo-code:

int32 hash = 0;
int i = 0; // any fixed initial value would do; fixing one keeps the function deterministic
foreach (value in block)
{
    hash ^= (value << (i % 28));
    i++;
}
hash %= table_size;

but the results, though not grossly imbalanced, do not seem very smooth to me. Maybe it's because the sample is too small, but the circumstances make it difficult to run the code on a bigger sample, and I would rather not have to write a complete test harness if someone computer-savvy has an answer ready for me :).

I am not sure that pairing the values two by two and using a general-purpose byte hashing strategy would be the best solution, especially since the number of values might be odd.

I have thought of using a 17th value to represent off-grid cells, but that seems to introduce a bias (the sequences from cells near a border will contain a lot of "off-grid" values).

I am also not sure what would be the best way to test the efficiency of various solutions (how many grids I should generate to get an idea of the performance, for instance).

http://www.partow.net/programming/hashfunctions/

Here are a few different hash functions from experts in various fields. They are designed for 8-bit values, but I am sure you can extend them to your case. I don't know which one to suggest, but I think any of them should work better than your current idea.

The problem with the approach you propose is that the shifted values live in a power-of-two space (2^n). If you take, for example, mod 64 at the end, only the low 6 bits of the hash survive, so most of the values are lost and only a few of them remain in the final result.
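
A direct C transcription of the loop from the question shows the effect (the name `shift_xor_hash` is made up for this sketch, and it assumes one 4-bit value per byte). The value at index i is shifted left by i bits, so with a modulus of 64 nothing at index 6 or later can reach the low 6 bits.

```c
#include <stddef.h>
#include <stdint.h>

/* The shift-and-XOR scheme from the question.
   Each entry of values[] is expected to hold a 4-bit value (0..15). */
uint32_t shift_xor_hash(const uint8_t *values, size_t count) {
    uint32_t hash = 0;
    for (size_t i = 0; i < count; ++i) {
        /* Value i lands in bits (i % 28) .. (i % 28) + 3 of the hash. */
        hash ^= (uint32_t)values[i] << (i % 28);
    }
    return hash;
}
```

Two blocks that differ only at index 10, say, produce different 32-bit hashes but the same result modulo 64. A table size that is not a power of two lets the high bits influence the result, which may be why the imbalance observed in the question is not gross.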

Despite your scepticism, I would just shove them through a standard hash function. If they are well randomised (and relatively independent - you don't say) to begin with, you probably don't need to do too much work. Fowler-Noll-Vo (FNV) is a good candidate in these circumstances.

FNV operates on a series of 8-bit inputs and your input is (logically) 4-bit. I would start without even bothering to pack 'two by two' as you describe. If you feel like trying that, just logically pad odd-length series with the message length (reduced to a 4-bit value, obviously).

I wouldn't expect that packing to improve the hash. It may save you a tiny number of cycles, because for every second value it swaps a relatively expensive * for a << and a |.

Try both and report back!
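
One way to compare candidates without a large harness is to count how many samples land in each bucket and compute a chi-squared statistic against a uniform expectation. The sketch below is only an illustration (the name `chi_squared_uniform` is made up). For a hash that behaves like a uniform random assignment, the statistic should come out in the neighbourhood of the number of buckets minus one; much larger values suggest an uneven spread.

```c
#include <stddef.h>

/* Pearson chi-squared statistic of bucket counts against a uniform
   expectation. Returns 0.0 when every bucket holds exactly
   total / buckets samples, and also for empty input. */
double chi_squared_uniform(const unsigned long *counts, size_t buckets) {
    unsigned long total = 0;
    for (size_t b = 0; b < buckets; ++b) {
        total += counts[b];
    }
    if (buckets == 0 || total == 0) {
        return 0.0;
    }
    double expected = (double)total / (double)buckets;
    double chi2 = 0.0;
    for (size_t b = 0; b < buckets; ++b) {
        double diff = (double)counts[b] - expected;
        chi2 += diff * diff / expected;
    }
    return chi2;
}
```

To use it, increment `counts[hash % table_size]` for each sampled block across your grids, then compare the statistic between hash functions for the same table size.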

Here are implementations of packed and 'normal' versions of FNV1a in C:

#include <stddef.h>
#include <stdint.h>

static const uint32_t sFNVOffsetBasis=2166136261u;
static const uint32_t sFNVPrime=16777619u;

//FNV1a over 4-bit values (one per byte), packing two values into each hashed byte.
uint32_t FNV1aPacked4Bit(const uint8_t*const pBytes,const size_t pSize) {
    uint32_t rHash=sFNVOffsetBasis;
    for(size_t i=0;i+1<pSize;i+=2){//Stop before a trailing unpaired element.
        rHash=rHash^(uint32_t)(pBytes[i]|(pBytes[i+1]<<4));
        rHash=rHash*sFNVPrime;
    }
    if(pSize%2){//Length is odd. The loop missed the last element.
        //Pad the high nibble with bits 1..4 of the length (bit 0 is always 1 here).
        rHash=rHash^(uint32_t)(pBytes[pSize-1]|((pSize&0x1E)<<3));
        rHash=rHash*sFNVPrime;
    }
    return rHash;
}

//Plain FNV1a, treating each 4-bit value as a full byte.
uint32_t FNV1a(const uint8_t*const pBytes,const size_t pSize) {
    uint32_t rHash=sFNVOffsetBasis;
    for(size_t i=0;i<pSize;++i){
        rHash=(rHash^pBytes[i])*sFNVPrime;
    }
    return rHash;
}

NB: I've edited it to skip the bottom bit when adding in the length. Obviously the bottom bit of an odd length is 100% biased to 1. I don't know how the lengths are distributed. It may be wiser to put the length in at the start rather than at the end.
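
As a rough sketch of that last idea, the unpacked version could fold the length in before the data. This is a hypothetical variant, not part of standard FNV; the names `FNV1aLengthFirst` and `bucket_index` are made up here, and the constants are repeated under different names so the block stands on its own.

```c
#include <stddef.h>
#include <stdint.h>

static const uint32_t kFNVOffsetBasis = 2166136261u;
static const uint32_t kFNVPrime = 16777619u;

/* FNV1a over 4-bit values stored one per byte, with the sequence length
   mixed in first (hypothetical variant, not standard FNV). */
uint32_t FNV1aLengthFirst(const uint8_t *pValues, size_t pSize) {
    uint32_t rHash = kFNVOffsetBasis;
    /* Fold the low 8 bits of the length in as if it were the first byte. */
    rHash = (rHash ^ (uint32_t)(pSize & 0xFF)) * kFNVPrime;
    for (size_t i = 0; i < pSize; ++i) {
        rHash = (rHash ^ pValues[i]) * kFNVPrime;
    }
    return rHash;
}

/* Reduce a 32-bit hash to a bucket index; table_size must be nonzero. */
size_t bucket_index(uint32_t hash, size_t table_size) {
    return (size_t)(hash % table_size);
}
```

Whether this spreads your samples any better than putting the length at the end is something you would have to measure on your own grids.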
