
Hash function for short strings

I want to send function names from a weak embedded system to the host computer for debugging purposes. Since the two are connected by RS232, which is short on bandwidth, I don't want to send each function's name literally. Some function names are around 15 characters long, and I sometimes want to send them at a fairly high rate.

The solution I thought of was to find a hash function that hashes those function names to a single byte, and to send only that byte. The host computer would scan all the functions in the source, compute their hashes using the same function, and then translate each received hash back to the original string.

The hash function must be:

  1. Collision free for short strings.
  2. Simple (since I don't want too much code in my embedded system).
  3. Small enough to fit in a single byte.

Obviously, it does not need to be secure by any means, only collision free. So I don't think cryptographic hash functions are worth their complexity.

Example code:

int myfunc() {
    sendToHost(hash("myfunc"));
    return 0;
}

The host would then be able to present me with a list of the times at which myfunc was executed.

Is there a known hash function that satisfies the above conditions?

Edit:

  1. I assume I will use far fewer than 256 function names.
  2. I can use more than a single byte; two bytes would have me pretty well covered.
  3. I prefer a hash function over keeping the same function-to-byte map on the client and the server, because (1) I have no map implementation on the client, and I'm not sure I want to add one just for debugging purposes, and (2) it requires another tool in my build chain to inject the function-name table into my embedded system code. A hash is better in this regard, even if it means I'll get a collision once in a while.

Try minimal perfect hashing:

Minimal perfect hashing guarantees that n keys will map to 0..n-1 with no collisions at all.

C code is included.
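A related but simpler idea can be sketched as follows, under assumptions. Because the host knows every function name in advance, it can search for a seed under which an ordinary seeded hash happens to be collision free for that particular set. The result is a perfect hash, though not a minimal one: values are spread over 0..255 rather than 0..n-1. The names `hash8` and `find_seed` and the choice of FNV-1a are illustrative only and are not taken from the answer above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Seeded 32-bit FNV-1a, folded down to one byte. Runs on both sides. */
static uint8_t hash8(const char *s, uint32_t seed)
{
    uint32_t h = 2166136261u ^ seed;      /* FNV offset basis mixed with the seed */
    while (*s) {
        h ^= (uint8_t)*s++;
        h *= 16777619u;                   /* FNV prime */
    }
    return (uint8_t)(h ^ (h >> 8) ^ (h >> 16) ^ (h >> 24));
}

/* Host side: return the first seed below max_seed that gives every name
   a distinct byte, or -1 if no such seed is found. */
static long find_seed(const char *const *names, size_t n, uint32_t max_seed)
{
    assert(n <= 256);                     /* more names than byte values cannot work */
    for (uint32_t seed = 0; seed < max_seed; seed++) {
        uint8_t used[256];
        size_t i;
        memset(used, 0, sizeof used);
        for (i = 0; i < n; i++) {
            uint8_t h = hash8(names[i], seed);
            if (used[h])
                break;                    /* collision: try the next seed */
            used[h] = 1;
        }
        if (i == n)
            return (long)seed;
    }
    return -1;
}
```

The seed found on the host would then be compiled into the embedded code as a constant, so the device only needs `hash8`. The search has to be repeated whenever a function name is added or renamed, and it becomes slow or fails as the number of names approaches 256.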

Hmm, with only 256 possible values, and since you will parse your source code to find all the functions anyway, maybe the best way would be to assign a number to each of your functions?

A real hash function probably won't work, because you have only 256 possible hashes but want to map at least 26^15 possible values (assuming letter-only, case-insensitive function names). Even if you restricted the number of possible strings (by applying some mandatory formatting), you would be hard pressed to get both meaningful names and a valid hash function.

No, there isn't.

You can't make a collision-free hash code, or even come close to one, with just an eight-bit hash. If you allow strings that are longer than one character, you have more possible strings than there are possible hash codes.
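To put a rough number on this, the usual birthday-problem estimate can be computed directly. The sketch below assumes an idealised hash that spreads names uniformly over the available values, which a real hash only approximates; `collision_probability` is a hypothetical helper name.

```c
#include <assert.h>

/* Probability that at least two of n names share a hash value, assuming the
   hash spreads names uniformly over `buckets` values (an idealisation). */
static double collision_probability(unsigned n, double buckets)
{
    assert(buckets > 0.0);
    double p_unique = 1.0;
    for (unsigned i = 0; i < n; i++) {
        if ((double)i >= buckets)
            return 1.0;                   /* pigeonhole: more names than values */
        p_unique *= 1.0 - (double)i / buckets;
    }
    return 1.0 - p_unique;
}
```

Under that assumption, 50 function names hashed to one byte (256 values) collide with a probability of about 99%, while the same 50 names hashed to two bytes (65536 values) collide with a probability of just under 2%.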

Why not just extract the function names and give each function name an id? Then you only need a lookup table on each side of the wire.

(As others have shown, you can generate a hash algorithm without collisions if you already have all the function names, but then it's easier to just assign a number to each name and build a lookup table...)

You could use a Huffman tree to abbreviate your function names according to how frequently they are used in your program. The most common function could be abbreviated to 1 bit, less common ones to 4-5 bits, very rare functions to 10-15 bits, and so on. A Huffman tree is not very hard to implement, but you will have to do something about the bit alignment.
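As a rough illustration of how the code lengths fall out of the call frequencies, here is a minimal host-side sketch. It only computes the length in bits of each function's code; assigning the actual bit patterns and packing them into bytes are not shown. The name `huffman_lengths`, the `MAX_SYMS` limit, and the simple quadratic construction are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SYMS 64

/* Compute Huffman code lengths (in bits) for n symbols with the given call
   counts. Simple O(n^2) construction: repeatedly merge the two lightest
   live nodes until a single root remains. */
static void huffman_lengths(const unsigned long *freq, size_t n, unsigned *len)
{
    unsigned long weight[2 * MAX_SYMS];
    int parent[2 * MAX_SYMS];
    int alive[2 * MAX_SYMS];
    size_t nodes = n;

    assert(n >= 1 && n <= MAX_SYMS);
    for (size_t i = 0; i < n; i++) {
        weight[i] = freq[i];
        parent[i] = -1;
        alive[i] = 1;
    }
    for (size_t merges = 1; merges < n; merges++) {
        int a = -1, b = -1;               /* a: lightest, b: second lightest */
        for (size_t i = 0; i < nodes; i++) {
            if (!alive[i])
                continue;
            if (a < 0 || weight[i] < weight[a]) {
                b = a;
                a = (int)i;
            } else if (b < 0 || weight[i] < weight[b]) {
                b = (int)i;
            }
        }
        weight[nodes] = weight[a] + weight[b];
        parent[nodes] = -1;
        alive[nodes] = 1;
        alive[a] = alive[b] = 0;
        parent[a] = parent[b] = (int)nodes;
        nodes++;
    }
    /* A leaf's code length is its depth: count the hops up to the root. */
    for (size_t i = 0; i < n; i++) {
        unsigned depth = 0;
        for (int p = parent[i]; p >= 0; p = parent[p])
            depth++;
        len[i] = depth;
    }
}
```

For hypothetical call counts of 50, 20, 15, 10 and 5, this yields code lengths of 1, 2, 3, 4 and 4 bits, so the most frequently called function costs a single bit on the wire.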


If you have a way to track the functions within your code (i.e. a text file generated at run-time), you can just use the memory location of each function. It's not exactly a byte, but it is smaller than the entire name and guaranteed to be unique. This has the added benefit of low overhead. All you would need to 'decode' the address is the text file that maps addresses to actual names; this could be sent to the remote location or, as I mentioned, stored on the local machine.
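A minimal sketch of this idea, under assumptions: the device transmits the function's address as an integer, and the host looks it up in a table loaded from the address-to-name file. `struct sym`, `FUNC_ADDR` and `lookup_name` are hypothetical names, and how the table is populated from the file is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One entry of the address-to-name map (one line of the text file). */
struct sym {
    uintptr_t addr;
    const char *name;
};

/* Device side: the value to transmit is simply the function's address.
   Converting a function pointer to an integer is implementation-defined
   in C, though common embedded toolchains support it. */
#define FUNC_ADDR(f) ((uintptr_t)(f))

/* Host side: translate a received address back to a name, or NULL if unknown. */
static const char *lookup_name(const struct sym *map, size_t n, uintptr_t addr)
{
    assert(map != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (map[i].addr == addr)
            return map[i].name;
    return NULL;
}
```

On the device, a call might look like `sendToHost(FUNC_ADDR(myfunc))`, assuming `sendToHost` can carry a value as wide as a pointer on that target.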

Described here is a simple way of implementing it yourself: http://www.devcodenote.com/2015/04/collision-free-string-hashing.html

Here is a snippet from the post:

It derives its inspiration from the way binary numbers are decoded and converted to decimal number format. Each binary string representation uniquely maps to a number in the decimal format.

If, say, we have a character set of capital English letters, then the length of the character set is 26, where A could be represented by the number 0, B by the number 1, C by the number 2, and so on up to Z by the number 25. Now, whenever we want to map a string of this character set to a unique number, we perform the same conversion as we did in the case of the binary format.
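The conversion described in the snippet can be sketched as below. This is an illustration rather than the linked post's code, and `base26_encode` is a hypothetical name. Two caveats follow from the arithmetic: with A mapped to 0, leading A's behave like leading zeros, so the mapping is only unique among strings of the same length ("A" and "AA" both give 0); and the result grows with the string length, so a 15-letter name needs about 71 bits and does not fit in one or two bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Interpret a string of capital letters as a base-26 number (A = 0 ... Z = 25).
   A uint64_t holds at most 13 such digits, since 26^13 < 2^64 < 26^14. */
static uint64_t base26_encode(const char *s)
{
    uint64_t value = 0;
    int digits = 0;
    for (; *s; s++) {
        assert(*s >= 'A' && *s <= 'Z');   /* capital letters only */
        digits++;
        assert(digits <= 13);             /* longer strings would overflow */
        value = value * 26u + (uint64_t)(*s - 'A');
    }
    return value;
}
```

For example, "BA" encodes to 1 * 26 + 0 = 26, and "CAB" encodes to 2 * 676 + 0 * 26 + 1 = 1353.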

In this case you could just use an enum to identify functions. Declare function IDs in some header file:

typedef enum
{
    FUNC_ID_main,
    FUNC_ID_myfunc,
    FUNC_ID_setled,
    FUNC_ID_soundbuzzer
} FUNC_ID_t;

Then in functions:

int myfunc(void)
{
    sendFuncIDToHost(FUNC_ID_myfunc);
    ...
}
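The host still needs to turn a received ID back into a name. One common way to keep the enum and the name table in step is an X-macro, sketched here as an assumption rather than as part of the answer above: `FUNC_LIST`, `AS_ENUM`, `AS_STRING`, `FUNC_ID_COUNT`, `func_names` and `func_name` are all hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Single list of traced functions; the enum and the name table both expand from it. */
#define FUNC_LIST(X) \
    X(main)          \
    X(myfunc)        \
    X(setled)        \
    X(soundbuzzer)

#define AS_ENUM(name) FUNC_ID_##name,
typedef enum {
    FUNC_LIST(AS_ENUM)
    FUNC_ID_COUNT                 /* number of IDs; not a function */
} FUNC_ID_t;

/* Host side only: names in the same order as the enum values. */
#define AS_STRING(name) #name,
static const char *const func_names[] = { FUNC_LIST(AS_STRING) };

/* Translate a received ID back to its function name, or NULL if out of range. */
static const char *func_name(unsigned id)
{
    assert(sizeof func_names / sizeof func_names[0] == (size_t)FUNC_ID_COUNT);
    return id < (unsigned)FUNC_ID_COUNT ? func_names[id] : NULL;
}
```

Only the enum needs to be compiled into the embedded code; the string table stays on the host, so it costs the device no memory. Adding a function means adding one line to `FUNC_LIST`.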

If sender and receiver share the same set of function names, they can build identical hash tables from them. You can then communicate an entry by the path taken to reach it in the table, for example as {starting position + number of hops}, which would take 2 bytes of bandwidth. For a fixed-size table (linear probing), only the final index is needed to address an entry.

NOTE: when building the two "synchronous" hash tables, the order of insertion is important ;-)
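A minimal sketch of the fixed-size variant, under assumptions: a 256-slot table with linear probing, where the final slot index is the one-byte value sent over the wire. `TABLE_SIZE`, `start_slot` and `slot_for` are hypothetical names, and FNV-1a is just one possible choice of hash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE 256            /* fixed size, so a slot index fits in one byte */

static const char *table[TABLE_SIZE];    /* NULL marks an empty slot */

/* Any simple string hash will do, as long as both sides use the same one. */
static uint8_t start_slot(const char *s)
{
    uint32_t h = 2166136261u;            /* 32-bit FNV-1a */
    while (*s) {
        h ^= (uint8_t)*s++;
        h *= 16777619u;
    }
    return (uint8_t)h;
}

/* Insert a name, or find it if already present, and return its final slot
   index, or -1 if the table is full. The table stores the pointer, so the
   string must outlive the table. Both sides must insert the same names in
   the same order to end up with the same indices. */
static int slot_for(const char *name)
{
    assert(name != NULL);
    unsigned start = start_slot(name);
    for (unsigned hops = 0; hops < TABLE_SIZE; hops++) {
        unsigned i = (start + hops) % TABLE_SIZE;
        if (table[i] == NULL)
            table[i] = name;             /* empty slot: claim it */
        if (strcmp(table[i], name) == 0)
            return (int)i;
    }
    return -1;
}
```

The receiver recovers the name from a received index by reading `table[index]` in its own copy of the table.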
