
Quickly (XOR-?) combine SHA1 hashes to generate a new hash

Having a (possibly large) list of unique text lines (stringified JSON data), I need to calculate a unique hash for the whole text document. Often new lines are appended to the document, and occasionally some lines are deleted from it, resulting in a completely new hash for the document.

The ultimate goal is to be able to identify identical documents just using the hash.

Of course, calculating the SHA1 hash for the whole document after each modification would give me the desired unique hash, but it would also be computationally expensive - especially in a situation where just ~40 bytes are appended to a 5 megabyte document and all that data would have to go through the SHA1 calculation again.

So, I'm looking into a solution that allows me to reduce the time it takes to calculate the new hash.

A summary of the problem properties/requirements:

  • each line is guaranteed to be unique
  • the order of the lines does not necessarily matter (even better if it doesn't)
  • the length of a single line is usually small, but the whole document might be large
  • the algorithm can be optimized for appended data (i.e. deleting data might even require a restart from scratch in such a case)

My current idea is to calculate the SHA1 (or whatever) hash for each single line individually and then XOR the hashes together. That should satisfy all requirements. For new lines I just calculate the SHA1 of that line and XOR it with the already known sum.
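As an illustration of that idea, here is a minimal sketch (my own, not part of the original question) using Node's built-in crypto module; the line contents are made-up examples:

var crypto = require('crypto');

// SHA1 of a single line, returned as a 20-byte Buffer
function sha1(line) {
  return crypto.createHash('sha1').update(line).digest();
}

// XOR the digest into the accumulator in place
function xorInto(acc, digest) {
  for (var i = 0; i < acc.length; i++) acc[i] ^= digest[i];
  return acc;
}

var docHash = Buffer.alloc(20);                  // empty XOR sum (all zeros)
['{"a":1}', '{"b":2}'].forEach(function (line) {
  xorInto(docHash, sha1(line));
});
console.log(docHash.toString('hex'));

// appending a line only requires hashing that one line
xorInto(docHash, sha1('{"c":3}'));

// XOR is its own inverse, so a deleted line can be removed the same way
xorInto(docHash, sha1('{"c":3}'));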

However, I'm in doubt because...

  • I'm not sure if the XORed hash would still be strong enough to exactly identify a document (i.e. is there a significantly higher probability of unwanted collisions?)
  • calculating lots of SHA1 hashes of short lines might be computationally expensive on its own (at least during initialization)

Can anybody shed some light on these issues?

Alternatively, is it perhaps generally possible with SHA1 (or a similar hash) to quickly generate a new hash for appended data (old hash + appended data = new hash)?

There are problems with hashing each line individually.

If two identical lines are added, the combined XOR will not change.

You might be better off hashing all the individual line hashes.
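A minimal sketch of that suggestion (my own illustration, assuming Node's crypto module and made-up line contents): feed the per-line digests into an outer SHA1 instead of XOR-ing them. Unlike the XOR approach this is order-sensitive, so the line hashes would need to be sorted first if order should not matter.

var crypto = require('crypto');

var lines = ['{"a":1}', '{"b":2}', '{"b":2}'];   // a duplicate line now changes the result

var combined = crypto.createHash('sha1');
lines.forEach(function (line) {
  // hash each line individually, then feed that digest into the outer hash
  var lineHash = crypto.createHash('sha1').update(line).digest();
  combined.update(lineHash);
});
console.log(combined.digest('hex'));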

Perhaps use a Merkle tree.
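For completeness, here is a rough sketch (mine, not from the answer) of computing a Merkle root over the per-line hashes; with a proper tree structure, an update only requires rehashing the path from the changed leaf to the root:

var crypto = require('crypto');

function sha1(data) {
  return crypto.createHash('sha1').update(data).digest();
}

function merkleRoot(lines) {
  if (lines.length === 0) return sha1('').toString('hex');
  // leaf level: one hash per line
  var level = lines.map(function (line) { return sha1(line); });
  // hash pairs of nodes until a single root remains
  while (level.length > 1) {
    var next = [];
    for (var i = 0; i < level.length; i += 2) {
      var left = level[i];
      var right = level[i + 1] || left;          // duplicate the last node on odd counts
      next.push(sha1(Buffer.concat([left, right])));
    }
    level = next;
  }
  return level[0].toString('hex');
}

console.log(merkleRoot(['{"a":1}', '{"b":2}', '{"c":3}']));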

You can perform incremental updates with a stream-like calculation:

var crypto = require('crypto');

// feed the data in two pieces
var shasum = crypto.createHash('sha1');
shasum.update("Hello, ");
shasum.update("World!");
console.log(shasum.digest('hex'));

// feed the same data in one piece - the digest is identical
shasum = crypto.createHash('sha1');
shasum.update("Hello, World!");
console.log(shasum.digest('hex'));
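Both console.log calls print the same digest, which is what makes this useful for append-only data: keep the hash object around and only call update() with the newly appended bytes. Keep in mind that digest() finalizes the hash, so it cannot be updated afterwards; if you need an intermediate digest after every append, newer Node.js versions (13.1+, as far as I know) let you snapshot the state with hash.copy() and digest the copy while continuing to update the original.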
