Hash and modulo strings to have equivalent results between C# and Python

I need to group strings into ten different groups in a deterministic fashion, with some level of uniformity. The strings are identifiers that come from different sources, all with different (basically unknown) formats.

To accomplish this I decided to hash the strings and take the result mod 10. However, I am going to be doing this in two different places and I need their results to be consistent: one is a C# app and the other is a Python one.

To ensure consistent hashing I have decided to go with MD5 (reasonably fast and consistent). Python already has this in the hashlib library, and C# has one in System.Security.Cryptography.

However, I need to convert these hashes to integers and take the modulo in a consistent way. In Python this is easy:

import hashlib

md5 = hashlib.md5()
md5.update(my_string.encode("utf-8"))  # hashlib needs bytes on Python 3
int(md5.hexdigest(), 16) % 10

But I can't just do this in C#, as I only have 64-bit integers. So my thought is to just grab the last 16 characters of the hex digest. In Python:

int(md5.hexdigest()[-16:], 16) % 10

Then in C#:

// hashString filled via MD5 code in the C# link above
string subHash = hashString.Substring(hashString.Length - 16);
ulong bucket = Convert.ToUInt64(subHash, 16) % 10;

Now my questions are these. Are these two methods guaranteed to be equivalent? Is MD5 a good choice here? It's certainly consistent, but if there is something faster, that would be ideal. Is grabbing the last 16 characters the best way to prevent overflow?
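As a rough sanity check on the "last 16 characters" idea, the sketch below computes the bucket two ways in Python: from the last 16 hex characters, and from the last 8 raw digest bytes read as a big-endian integer. The function names are hypothetical, and UTF-8 is an assumed encoding. Whatever encoding is chosen, the C# side has to hash the same bytes (for example via `Encoding.UTF8.GetBytes`), otherwise non-ASCII identifiers may land in different groups.

```python
import hashlib


def bucket_from_hex(s: str, buckets: int = 10) -> int:
    """Bucket from the last 16 hex characters (low 64 bits) of the MD5 digest."""
    hex_digest = hashlib.md5(s.encode("utf-8")).hexdigest()  # UTF-8 is an assumption
    return int(hex_digest[-16:], 16) % buckets


def bucket_from_bytes(s: str, buckets: int = 10) -> int:
    """Same value, read directly from the last 8 digest bytes (big-endian)."""
    digest = hashlib.md5(s.encode("utf-8")).digest()
    return int.from_bytes(digest[-8:], "big") % buckets


# Both routes read the same 64 bits, so they should always agree.
for identifier in ["abc-123", "user@example.com", "日本語"]:
    assert bucket_from_hex(identifier) == bucket_from_bytes(identifier)
    print(identifier, bucket_from_hex(identifier))
```

The low 64 bits always fit in a C# `ulong`, so `Convert.ToUInt64(subHash, 16)` should not overflow. Note that this bucket generally differs from the one obtained by taking the full 128-bit digest mod 10, so both sides need to use the same truncation.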

The answer to the question "Where can I find source or algorithm of Python's hash() function?" includes the source code for the Python hash function (in C). Couldn't you implement that in C#? I'm guessing it would be much faster than MD5.

Python hash function for strings:

static long string_hash(PyStringObject *a)
{
    register Py_ssize_t len;
    register unsigned char *p;
    register long x;

    if (a->ob_shash != -1)
        return a->ob_shash;
    len = Py_SIZE(a);
    p = (unsigned char *) a->ob_sval;
    x = *p << 7;
    while (--len >= 0)
        x = (1000003*x) ^ *p++;
    x ^= Py_SIZE(a);
    if (x == -1)
        x = -2;
    a->ob_shash = x;
    return x;
}
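To make the porting idea more concrete, here is a minimal Python sketch of the same loop over raw bytes. It assumes a 64-bit C `long` that wraps on overflow; on builds where `long` is 32 bits the values would differ. The name `py2_string_hash` is hypothetical, and the caching via `ob_shash` is left out.

```python
MASK64 = (1 << 64) - 1  # emulate a 64-bit C long (assumption)


def py2_string_hash(data: bytes) -> int:
    """Re-implementation sketch of the string_hash loop shown above."""
    if not data:
        return 0  # empty string: *p is the NUL terminator, so x stays 0
    x = (data[0] << 7) & MASK64
    for byte in data:
        x = ((1000003 * x) & MASK64) ^ byte  # multiply, wrap to 64 bits, XOR
    x ^= len(data)
    if x >= 1 << 63:  # reinterpret the unsigned value as a signed 64-bit long
        x -= 1 << 64
    if x == -1:  # -1 is reserved as an error marker in CPython
        x = -2
    return x


print(py2_string_hash(b"a"))       # 12416037344
print(py2_string_hash(b"a") % 10)  # 4
```

A C# port would need the same wrapping arithmetic (for example `unchecked` arithmetic on `long`). One thing to watch: the hash can be negative, and Python's `%` returns a non-negative result for a positive modulus while C#'s `%` keeps the sign of the dividend, so the two sides would have to normalise negative values the same way before grouping.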
