Create a minimum perfect hash for sparse 64-bit unsigned integer keys
I need a 64-bit to 16-bit perfect hash function for a sparsely populated list of keys.
I have a dictionary in Python with 48,326 keys, each 64 bits long. I would like to create a minimal perfect hash for this list of keys. (I don't want to have to wait a few days to calculate the MPH, so I am also OK with it mapping to a 16-bit hash.)
The objective is to eventually port this dictionary to C as an array that contains the dict values, where the index is calculated by the minimal perfect hash function taking the key as input. I cannot use external hashing libraries in the C port of the application I am building.
Question: Is there any Python library that will take my keys as input and provide the hashing parameters (based on a defined algorithm used for hashing) as output?
I found the library perfection 2.0.0, but since my keys are 64-bit, it just hung (even when I tested it on a subset of 2,000 keys).
EDIT: As suggested in the comments, I looked at Steve Hanov's algorithm and modified the hash function to take a 64-bit integer (changing the values of the FNV prime and offset as per this wiki page).
While I got a result, unfortunately the maps return negative index values. I can make it work, but it means adding another 4 cycles to the hash calculation to check for a negative index.
I would like to avoid this.
Personally, I'd just generate a table with gperf, or for a large number of keys, with CMPH, and be done with it.
If you must do this in Python, then I found this blog post with some Python 2 code that is very efficient at turning string keys into a minimal perfect hash, using an intermediary table.
Adapting the code in the post to your requirements produces a minimal perfect hash for 50k items in under 0.35 seconds:
>>> import random
>>> testdata = {random.randrange(2**64): random.randrange(2**64)
... for __ in range(50000)} # 50k random 64-bit keys
>>> import timeit
>>> timeit.timeit('gen_minimal_perfect_hash(testdata)', 'from __main__ import gen_minimal_perfect_hash, testdata', number=10)
3.461486832005903
The changes I made: I adapted the code to Python 3 best practices, and the hash function now uses int.to_bytes() to convert the 64-bit unsigned integer to 8 bytes before running them through the FNV loop.
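As a quick illustration of that conversion (the key value here is an arbitrary example), int.to_bytes(8, 'big') yields the 8 key bytes most-significant first:

```python
key = 0x0123456789ABCDEF  # arbitrary 64-bit example key

# Big-endian (network order): the most significant byte comes first,
# which is the order the hash function consumes the bytes in.
key_bytes = key.to_bytes(8, 'big')
print(key_bytes.hex())  # -> 0123456789abcdef
```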
The adapted code: 改编的代码:
# Easy Perfect Minimal Hashing
# By Steve Hanov. Released to the public domain.
# Adapted to Python 3 best practices and 64-bit integer keys by Martijn Pieters
#
# Based on:
# Edward A. Fox, Lenwood S. Heath, Qi Fan Chen and Amjad M. Daoud,
# "Practical minimal perfect hash functions for large databases", CACM, 35(1):105-121
# also a good reference:
# Compress, Hash, and Displace algorithm by Djamal Belazzougui,
# Fabiano C. Botelho, and Martin Dietzfelbinger
from itertools import count, groupby
def fnv_hash_int(value, size, d=0x811c9dc5):
    """Calculates a distinct hash function for a given 64-bit integer.

    Each value of the integer d results in a different hash value. The return
    value is the modulus of the hash and size.
    """
    # Use the FNV algorithm from http://isthe.com/chongo/tech/comp/fnv/
    # The unsigned integer is first converted to an 8-byte big-endian string.
    for c in value.to_bytes(8, 'big'):
        d = ((d * 0x01000193) ^ c) & 0xffffffff
    return d % size
def gen_minimal_perfect_hash(dictionary, _hash_func=fnv_hash_int):
    """Computes a minimal perfect hash table using the given Python dictionary.

    It returns a tuple (intermediate, values). intermediate and values are both
    lists. intermediate contains the intermediate table of indices needed to
    compute the index of the value in values; a tuple of (flag, d) is stored,
    where d is either a direct index, or the input for another call to the
    hash function. values contains the values of the dictionary.
    """
    size = len(dictionary)

    # Step 1: Place all of the keys into buckets
    buckets = [[] for __ in dictionary]
    intermediate = [(False, 0)] * size
    values = [None] * size

    for key in dictionary:
        buckets[_hash_func(key, size)].append(key)

    # Step 2: Sort the buckets and process the ones with the most items first.
    buckets.sort(key=len, reverse=True)

    # Only look at buckets of length greater than 1 first; partitioned produces
    # groups of buckets of lengths > 1, then those of length 1, then the empty
    # buckets (we ignore the last group).
    partitioned = (g for k, g in groupby(buckets, key=lambda b: len(b) != 1))
    for bucket in next(partitioned, ()):
        # Try increasing values of d until we find a hash function
        # that places all items in this bucket into free slots
        for d in count(1):
            slots = {}
            for key in bucket:
                slot = _hash_func(key, size, d=d)
                if values[slot] is not None or slot in slots:
                    break
                slots[slot] = dictionary[key]
            else:
                # all slots filled, update the values table; False indicates
                # these values are inputs into the hash function
                intermediate[_hash_func(bucket[0], size)] = (False, d)
                for slot, value in slots.items():
                    values[slot] = value
                break

    # The next group is buckets with only 1 item. Process them more quickly by
    # directly placing them into a free slot.
    freelist = (i for i, value in enumerate(values) if value is None)
    for bucket, slot in zip(next(partitioned, ()), freelist):
        # These are 'direct' slot references
        intermediate[_hash_func(bucket[0], size)] = (True, slot)
        values[slot] = dictionary[bucket[0]]

    return (intermediate, values)

def perfect_hash_lookup(key, intermediate, values, _hash_func=fnv_hash_int):
    "Look up a value in the hash table defined by intermediate and values"
    direct, d = intermediate[_hash_func(key, len(intermediate))]
    return values[d if direct else _hash_func(key, len(values), d=d)]
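To make the behaviour of the hash concrete, here is a small self-contained check (the key below is an arbitrary example, and the hash function is restated so the snippet runs standalone): each value of d selects a different hash function over the same key, and every result is a valid table index.

```python
def fnv_hash_int(value, size, d=0x811c9dc5):
    # Same FNV-style mix as in the answer, restated for a standalone run.
    for c in value.to_bytes(8, 'big'):
        d = ((d * 0x01000193) ^ c) & 0xffffffff
    return d % size

key = 0xDEADBEEFCAFEF00D  # arbitrary 64-bit example key
size = 48326              # the table size from the question

# Four different d values -> four (independent) hash functions over one key
slots = [fnv_hash_int(key, size, d=d) for d in range(1, 5)]
print(slots)  # four slot indices, each in range(size)
assert all(0 <= slot < size for slot in slots)
```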
The above produces two lists with 50k entries each; the values in the first table are (boolean, integer) tuples with the integers in the range [0, tablesize) (in theory the values could range up to 2^16, but I'd be very surprised if it ever took 65k+ attempts to find a slot arrangement for your data). Your table size is < 50k, so the above arrangement makes it possible to store the entries in this list in 4 bytes (bool and short make 3, but alignment rules add one byte of padding) when expressing this as a C array.
A quick test to see that the hash tables are correct and produce the right output again:
>>> tables = gen_minimal_perfect_hash(testdata)
>>> for key, value in testdata.items():
... assert perfect_hash_lookup(key, *tables) == value
...
You only need to implement the lookup function in C:
- The fnv_hash_int operation can take a pointer to your 64-bit integer; just cast that pointer to an array of 8-bit values and increment an index 8 times to access each separate byte. Use a suitable function to ensure big-endian (network) order.
- You do not need to mask with 0xffffffff in C, as overflow on a C integer value is automatically discarded anyway.
- len(intermediate) == len(values) == len(dictionary), so the table size can be captured in a constant.
- Store the intermediate entries as structs with flag being a bool and d as an unsigned short; that's just 3 bytes, plus 1 padding byte to align on a 4-byte boundary.
- The data type of the values array depends on the values in your input dictionary.
If you forgive my C skills, here's a sample implementation:
mph_table.h
#include "mph_generated_table.h"
#include <arpa/inet.h>
#include <stdint.h>
#ifndef htonll
// see https://stackoverflow.com/q/3022552
#define htonll(x) ((1==htonl(1)) ? (x) : ((uint64_t)htonl((x) & 0xFFFFFFFF) << 32) | htonl((x) >> 32))
#endif
uint64_t mph_lookup(uint64_t key);
mph_table.c
#include "mph_table.h"
#include <stdbool.h>
#include <stdint.h>
#define FNV_OFFSET 0x811c9dc5
#define FNV_PRIME 0x01000193
uint32_t fnv_hash_modulo_table(uint32_t d, uint64_t key) {
    d = (d == 0) ? FNV_OFFSET : d;
    uint8_t* keybytes = (uint8_t*)&key;
    for (int i = 0; i < 8; ++i) {
        d = (d * FNV_PRIME) ^ keybytes[i];
    }
    return d % TABLE_SIZE;
}

uint64_t mph_lookup(uint64_t key) {
    _intermediate_entry entry =
        mph_tables.intermediate[fnv_hash_modulo_table(0, htonll(key))];
    return mph_tables.values[
        entry.flag ?
            entry.d :
            fnv_hash_modulo_table((uint32_t)entry.d, htonll(key))];
}
which would rely on a generated header file, produced from:
from textwrap import indent

template = """\
#include <stdbool.h>
#include <stdint.h>

#define TABLE_SIZE %(size)s

typedef struct _intermediate_entry {
    bool flag;
    uint16_t d;
} _intermediate_entry;

typedef struct mph_tables_t {
    _intermediate_entry intermediate[TABLE_SIZE];
    uint64_t values[TABLE_SIZE];
} mph_tables_t;

static const mph_tables_t mph_tables = {
    { // intermediate
%(intermediate)s
    },
    { // values
%(values)s
    }
};
"""

tables = gen_minimal_perfect_hash(dictionary)
size = len(dictionary)
cbool = ['false, ', 'true, ']
perline = lambda i: zip(*([i] * 10))
entries = (f'{{{cbool[e[0]]}{e[1]:#06x}}}' for e in tables[0])
intermediate = indent(',\n'.join([', '.join(group) for group in perline(entries)]), ' ' * 8)
entries = (format(v, '#018x') for v in tables[1])
values = indent(',\n'.join([', '.join(group) for group in perline(entries)]), ' ' * 8)

with open('mph_generated_table.h', 'w') as generated:
    generated.write(template % locals())
where dictionary is your input table.
Compiled with gcc -O3, the hash function is inlined (loop unrolled) and the whole mph_lookup function clocks in at 300 CPU instructions. A quick benchmark looping through all 50k random keys I generated shows my 2.9 GHz Intel Core i7 laptop can look up 50 million values for those keys per second (0.02 microseconds per key).