
Is it possible to copy a .NET HashAlgorithm (for repeated incremental hash results)?

I have the following use case:

  • Read n bytes from a file
  • Compute the (MD5) hash for these n bytes
  • Read the next m bytes from the file
  • Compute the (MD5) hash for the file up to n+m bytes

Incrementally hashing a file isn't the problem, just call TransformBlock and TransformFinalBlock .

The problem is that I need multiple hashes of data that shares its beginning bytes, but after I have called TransformFinalBlock to read the Hash of the first n bytes, I cannot continue hashing with the same object and need a new one.

Searching for the problem, I saw that both Python as well as OpenSSL have an option to copy a hashing object for exactly this purpose:

hash.copy()

Return a copy (“clone”) of the hash object. This can be used to efficiently compute the digests of strings that share a common initial substring.

EVP_MD_CTX_copy_ex() can be used to copy the message digest state from in to out. This is useful if large amounts of data are to be hashed which only differ in the last few bytes. out must be initialized before calling this function.
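For reference, the Python behavior quoted above maps directly onto the use case from the question; this short hashlib sketch (with made-up placeholder data for the n and m byte ranges) shows the snapshot-then-continue pattern that the .NET API lacks:

```python
import hashlib

first_part = b"shared prefix of the file"  # stands in for the first n bytes
rest = b" plus the remaining m bytes"      # stands in for the next m bytes

h = hashlib.md5()
h.update(first_part)

# Snapshot the state, then read the digest of the first n bytes
# from the copy -- the original object is untouched.
snapshot = h.copy()
digest_of_n_bytes = snapshot.hexdigest()

# Continue hashing the rest of the data with the original object.
h.update(rest)
digest_of_n_plus_m_bytes = h.hexdigest()

# The continued hash equals hashing the whole content in one go.
assert digest_of_n_plus_m_bytes == hashlib.md5(first_part + rest).hexdigest()
```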

Search as I may, I can't find anything within the stock C# HashAlgorithm that would allow me to effectively Clone() == copy such an object before calling its TransformFinalBlock method -- and afterwards continue to hash the rest of the data with the clone.

I found a C# reference implementation for MD5 that could be trivially adapted to support cloning(*), but I would strongly prefer to use what is there instead of introducing such a thing into the codebase.

(*) Indeed, as far as I understand, any hashing algorithm (as opposed to encryption/decryption) I've bothered to check is trivially copyable, because all the state such an algorithm has is a form of a digest.

So am I missing something here, or does the standard C#/.NET interface in fact not offer a way to copy the hash object?


Another data point:

Microsoft's own native API for crypto services has a function CryptDuplicateHash , the docs of which state, quote:

The CryptDuplicateHash function can be used to create separate hashes of two different contents that begin with the same content.

Been around since Windows XP. :-|


Note wrt. MD5: the use case is not cryptographically sensitive, just reliable file checksumming.

I realize this isn't exactly what you are asking for, but if this matches the problem you're trying to solve, it's an alternative approach that would give you the same guarantees and similar streaming performance characteristics. I've used this in the past for a server-to-server file transfer protocol where the sender/receiver weren't always available/reliable. Granted, I had control over the code on both sides of the wire, which I realize you may not. In that case, please ignore ;-)

My approach was to set up one HashAlgorithm that dealt with the entire file and another one for hashing fixed-sized blocks of the file -- not rolling hashes (avoids your problem), but standalone hashes. So imagine a 1034 MB (1 GB + 10 MB) file logically split into 32 MB blocks. The sender loaded the file, calling TransformBlock on both the file-level and the block-level HashAlgorithm at the same time. When it reached the end of a 32 MB block, it called TransformFinalBlock on the block-level one, recorded the hash for that block, and reset/created a new HashAlgorithm for the next block. When it reached the end of the file, it called TransformFinalBlock on both the file- and block-level hashers. Now the sender had a 'plan' for the transfer that included the filename, file size, file hash, and the offset, length, and hash of each block.

It sent the plan to the receiver, which either allocated space for a new file (file length % block size tells it that the last block is smaller than 32 MB) or opened the existing file. If the file was already there, it ran the same algorithm to compute the hashes of the same-sized blocks. Any mismatches against the plan caused it to ask the sender for those blocks only (this would account for not-yet-transferred blocks/all 0's and corrupt blocks). It did this work (verify, ask for blocks) in a loop until there was nothing left to ask for. Then it checked the file-level hash against the plan. If the file-level hash was invalid but the block-level hashes were all valid, it would probably mean either a hash collision or bad RAM (both extremely rare... I used SHA-512). This allowed the receiver to recover from incomplete or corrupt blocks with a worst-case penalty of having to download one bad block again, which could be offset by tuning the block size.
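The sender's one-pass plan building described above can be sketched roughly like this (in Python's hashlib rather than .NET's HashAlgorithm, for brevity; the function name, plan layout, and toy block size are all illustrative, and a real implementation would stream from a file instead of taking bytes):

```python
import hashlib

def build_transfer_plan(data: bytes, block_size: int):
    """One pass over the data: a file-level hash plus a standalone hash per block."""
    file_hasher = hashlib.sha512()
    blocks = []  # one (offset, length, hex digest) entry per block
    for offset in range(0, len(data), block_size):
        chunk = data[offset:offset + block_size]
        # Feed the file-level hasher (the TransformBlock analogue) ...
        file_hasher.update(chunk)
        # ... and hash the block standalone with a fresh hasher,
        # which sidesteps the cannot-continue-after-finalize problem.
        blocks.append((offset, len(chunk), hashlib.sha512(chunk).hexdigest()))
    return {
        "size": len(data),
        "file_hash": file_hasher.hexdigest(),
        "blocks": blocks,
    }

# Toy sizes for demonstration: 100 bytes in 32-byte blocks -> 4 blocks,
# the last one short, mirroring the 1034 MB / 32 MB example.
plan = build_transfer_plan(b"x" * 100, block_size=32)
```

The receiver can rerun the same function over its local copy and compare the per-block digests against the plan to decide which blocks to request again.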

Sigh.

The stock .NET library does not allow this. Sad. Anyways, there are a couple of alternatives:

  • MD5Managed : pure .NET (“default” MD5 RSA license)
  • ClonableHash : wraps the MS Crypto API via PInvoke (may need some work extracting it from the Org.Mentalis namespace, but the license is permissive)

It is also possible to, for example, wrap a C++ implementation in a C++/CLI wrapper -- preliminary tests have shown that this seems to be way faster than the normal .NET library, but don't take my word on it.


Since then, I also wrote/adapted a C++ based solution myself: https://github.com/bilbothebaggins/md5cpp

It hasn't gone into production because the requirements changed, but it was a nice exercise and I like to think it works quite well. (Other than it not being a pure C# implementation.)
