
Why right-shift an address by three bits as a hash function for a fixed-size hash table?

I'm following an article that uses a hash table with a fixed number of 2048 buckets.
The hash function takes a pointer and the hash table itself, treats the address as a bit pattern, shifts it right by three bits, and reduces it modulo the size of the hash table (2048):

(It's written as a macro in this case):

#define hash(p, t) (((unsigned long)(p) >> 3) & \
                    (sizeof(t) / sizeof((t)[0]) - 1))
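
To make the macro's two steps easier to follow, here is a minimal sketch of the same computation as a function. The names hash_ptr, hash_ptr_mod and NBUCKETS are hypothetical and not from the book, and uintptr_t is used in place of unsigned long on the assumption that the platform provides it. The mask only equals the modulo because 2048 is a power of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 2048u  /* assumed table size; must be a power of two */

/* Function form of the macro: drop the low 3 bits, keep the next 11. */
static size_t hash_ptr(const void *p)
{
    /* The mask trick is only equivalent to % when NBUCKETS is a power of two. */
    assert((NBUCKETS & (NBUCKETS - 1)) == 0);
    return (size_t)(((uintptr_t)p >> 3) & (NBUCKETS - 1));
}

/* Same computation written with an explicit modulo, for comparison. */
static size_t hash_ptr_mod(const void *p)
{
    return (size_t)(((uintptr_t)p >> 3) % NBUCKETS);
}
```

Both functions return the same bucket index for any pointer; the mask form simply avoids a division.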

The article, however, doesn't explain why it right-shifts the address by three bits (and it seems a bit arbitrary at first). My first guess was that the shift groups pointers with similar addresses by cutting off the last three bits, but I don't see how that would be useful, given that most addresses allocated for one application are similar anyway. Take this as an example:

#include <stdio.h>

int main(void)
{
    int i1 = 0, i2 = 0, i3 = 0;

    /* %p expects a void *, so cast explicitly. */
    printf("%p\n", (void *)&i1);
    printf("%p\n", (void *)&i2);
    printf("%p\n", (void *)&i3);

    /* Bucket index of each address, provided that the hash table has 2048 entries. */
    printf("%lu\n", ((unsigned long)(&i1) >> 3) & 2047);
    printf("%lu\n", ((unsigned long)(&i2) >> 3) & 2047);
    printf("%lu\n", ((unsigned long)(&i3) >> 3) & 2047);

    return 0;
}

Also, I'm wondering why it chooses 2048 as a fixed size and whether this is related to the three-bit shift.

For reference, the article is an extract from "C Interfaces and Implementations: Techniques for Creating Reusable Software" by David R. Hanson.

Memory allocations must be properly aligned. For example, the hardware may specify that an int should be aligned to a 4-byte boundary, or that a double should be aligned to 8 bytes. This means that the last two address bits of an int must be zero, and the last three bits of a double.

Now, C allows you to define complex structures which mix char, int, long, float, and double fields (and more). And while the compiler can add padding to align the offsets of the fields to the appropriate boundaries, the entire structure must also be aligned to the largest alignment that any of its members uses.

malloc() does not know what you are going to do with the memory, so it must return an allocation that's aligned for the worst case. This alignment is specific to the platform, but it's generally not less than 8 bytes. A more typical value today is 16 bytes.
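
The worst-case alignment can be inspected directly. The following is a minimal sketch, assuming a C11 compiler (for alignof and max_align_t) and a platform that provides uintptr_t; the helper names is_max_aligned and malloc_is_max_aligned are made up for illustration.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns 1 if p is a multiple of the platform's worst-case alignment. */
static int is_max_aligned(const void *p)
{
    return ((uintptr_t)p % alignof(max_align_t)) == 0;
}

/* Allocates n bytes and reports whether the returned block is max-aligned. */
static int malloc_is_max_aligned(size_t n)
{
    void *p = malloc(n);
    assert(p != NULL);
    int ok = is_max_aligned(p);
    free(p);
    return ok;
}
```

Calling malloc_is_max_aligned(64) should return 1 on mainstream platforms, which means the low 3 (or 4) bits of every such pointer carry no information.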

So, the hash algorithm simply cuts off the three bits of the address which are virtually always zero, and thus less than worthless for a hash value. This easily reduces the number of hash collisions by a factor of 8. (The fact that it only cuts off three bits indicates that the function was written a while ago. Today it should be programmed to cut off four bits.)
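
The effect on collisions can be checked with synthetic addresses. This is only an illustrative sketch: buckets_used is a hypothetical helper, and the evenly spaced addresses are an assumption that stands in for real allocator output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NBUCKETS 2048u  /* assumed table size, a power of two */

/* Counts how many distinct buckets are hit by `count` addresses that start at
   `base` and are spaced `stride` bytes apart, after shifting right by `shift`. */
static size_t buckets_used(uintptr_t base, size_t stride, size_t count, unsigned shift)
{
    static unsigned char seen[NBUCKETS];
    size_t used = 0;

    assert(count > 0);
    memset(seen, 0, sizeof seen);
    for (size_t i = 0; i < count; i++) {
        size_t b = (size_t)(((base + i * stride) >> shift) & (NBUCKETS - 1));
        if (!seen[b]) {
            seen[b] = 1;
            used++;
        }
    }
    return used;
}
```

For 2048 addresses spaced 8 bytes apart, no shift reaches only 256 of the 2048 buckets, while a shift of 3 reaches all of them. With 16-byte spacing, a shift of 3 reaches 1024 buckets and a shift of 4 reaches all 2048.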

This code assumes that the objects to be hashed are aligned to 8 bytes (more precisely, to 2^(right_shift) bytes). Otherwise this hash function (or macro) will return colliding results.

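#include <stdio.h>

/* Index of the lowest set bit of x (log2 for powers of two up to 32), or -1. */
#define mylog2(x)  (((x) & 1) ? 0 : ((x) & 2) ? 1 : ((x) & 4) ? 2 : ((x) & 8) ? 3 : ((x) & 16) ? 4 : ((x) & 32) ? 5 : -1)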


#define hash(p, t) (((unsigned long)(p) >> mylog2(sizeof(p))) & \
                    (sizeof(t) / sizeof((t)[0]) - 1))

unsigned long h[2048];                    

int main()
{
    
    int i1 = 0, i2 = 0, i3 = 0;
    long l1,l2,l3;
    
    
    printf("sizeof(ix) = %zu\n", sizeof(i1));
    printf("sizeof(lx) = %zu\n", sizeof(l1));
    
    printf("%lu\n", hash(&i1, h)); // Provided that the size of the hash table is 2048.
    printf("%lu\n", hash(&i2, h));
    printf("%lu\n", hash(&i3, h));

    printf("\n%lu\n", hash(&l1, h)); // Provided that the size of the hash table is 2048.
    printf("%lu\n", hash(&l2, h));
    printf("%lu\n", hash(&l3, h));


    return 0;
}

https://godbolt.org/z/zq1zfP

To make it more reliable, you need to take the size of the object into account:

#define hash1(o, p, t) (((unsigned long)(p) >> mylog2(sizeof(o))) & \
                    (sizeof(t) / sizeof((t)[0]) - 1))

Then it will work with data of any size: https://godbolt.org/z/a7dYj9
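
As a variation on the same idea (not part of the answer above), the shift can be derived from the type's alignment rather than its size, since size and alignment differ for arrays and many structs. This sketch assumes C11 for alignof; align_shift and hash_aligned are hypothetical names.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Number of trailing zero bits guaranteed by alignment `a` (a power of two). */
static unsigned align_shift(size_t a)
{
    unsigned s = 0;
    assert(a != 0 && (a & (a - 1)) == 0);
    while (a > 1) {
        a >>= 1;
        s++;
    }
    return s;
}

/* Hash a pointer to an object of type `type` into a table of `nbuckets`
   (a power of two), discarding the bits that the type's alignment forces to zero. */
#define hash_aligned(type, p, nbuckets) \
    ((size_t)(((uintptr_t)(p) >> align_shift(alignof(type))) & ((size_t)(nbuckets) - 1)))
```

For example, hash_aligned(double, &d, 2048) shifts by 3 where double is 8-byte aligned, while hash_aligned(char, p, 2048) does not shift at all.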

Though it's not dictated by the C language standard, on most platforms (where platform = compiler + designated hardware architecture), a variable x of a basic type is allocated at an address which is a multiple of (i.e., divisible by) sizeof(x).

This is because many platforms do not support unaligned load/store operations (e.g., writing a 4-byte value to an address which is not aligned to 4 bytes).

Knowing that sizeof(long) is at most 8 (again, on most platforms), we can further predict that, where it is 8, the last 3 bits of the address of every long variable will always be zero.

When designing a hash-table solution, one would typically strive for as few collisions as possible.

Here, the hashing solution keeps 11 bits of every address (2048 = 2^11). Without the shift, these would be the last 11 bits.

So in order to reduce the number of collisions, we shift every address right by 3 bits, thus replacing those 3 "predictable" zeros with something "more random".
