Implementing a recursive hashing algorithm

Question

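Suppose file A has the following bytes:

2, 5, 8, 0, 33, 90, 1, 3, 200, 201, 23, 12

and I have a simple hashing algorithm where I store the sum of the last three consecutive bytes, so: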

2
5
8   - = 8+5+2 = 15
0
33
90  - = 90+33+0 = 123
1
3
200 - = 200+3+1 = 204
201
23
12  - = 12+23+201 = 236

So I can represent file A as 15, 123, 204, 236.

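Let's say I copy that file to a new computer B and make some minor modifications, so that the bytes of file B are:

255, 2, 5, 8, 0, 33, 90, 1, 3, 200, 201, 23, 12, 55, 255, 255

"Note that the difference is one extra byte at the beginning of the file and 3 extra bytes at the end, but the rest is very similar."

So I can run the same algorithm to determine whether some parts of the files are the same. Remember that file A is represented by the hash codes 15, 123, 204, 236. Let's see whether file B gives me some of those hash codes!

So on file B I have to do this for every 3 consecutive bytes: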

int[] sums; // array where we will hold the sum of the last bytes


255 sums[0]  =          255
2   sums[1]  =  2+ sums[0]    = 257
5   sums[2]  =  5+ sums[1]    = 262
8   sums[3]  =  8+ sums[2]    = 270   hash = sums[3]-sums[0]   = 15   --> MATCHES FILE A!
0   sums[4]  =  0+ sums[3]    = 270   hash = sums[4]-sums[1]   = 13
33  sums[5]  =  33+ sums[4]   = 303   hash = sums[5]-sums[2]   = 41
90  sums[6]  =  90+ sums[5]   = 393   hash = sums[6]-sums[3]   = 123  --> MATCHES FILE A!
1   sums[7]  =  1+ sums[6]    = 394   hash = sums[7]-sums[4]   = 124
3   sums[8]  =  3+ sums[7]    = 397   hash = sums[8]-sums[5]   = 94
200 sums[9]  =  200+ sums[8]  = 597   hash = sums[9]-sums[6]   = 204  --> MATCHES FILE A!
201 sums[10] =  201+ sums[9]  = 798   hash = sums[10]-sums[7]  = 404
23  sums[11] =  23+ sums[10]  = 821   hash = sums[11]-sums[8]  = 424
12  sums[12] =  12+ sums[11]  = 833   hash = sums[12]-sums[9]  = 236  --> MATCHES FILE A!
55  sums[13] =  55+ sums[12]  = 888   hash = sums[13]-sums[10] = 90
255 sums[14] =  255+ sums[13] = 1143  hash = sums[14]-sums[11] = 322
255 sums[15] =  255+ sums[14] = 1398  hash = sums[15]-sums[12] = 565

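So by looking at that table I know that file B contains the bytes of file A plus some additional ones, because the hash codes match.

The reason I show this algorithm is that it is of order n. In other words, I was able to compute the hash of the last 3 consecutive bytes without having to iterate over them!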
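As a rough sketch of the idea described so far, the following C# computes the block sums of file A and the rolling sums of file B with a prefix-sum array. The class and method names are made up for illustration, and the byte values are the ones from the tables above.

```csharp
using System;
using System.Collections.Generic;

static class WindowSumDemo
{
    // Sum of each non-overlapping block of `window` bytes (how file A is summarized).
    public static List<int> BlockSums(byte[] data, int window)
    {
        var result = new List<int>();
        for (int start = 0; start + window <= data.Length; start += window)
        {
            int sum = 0;
            for (int i = start; i < start + window; i++) sum += data[i];
            result.Add(sum);
        }
        return result;
    }

    // Sum of the last `window` bytes ending at every position, via prefix sums.
    // Each value costs one addition and one subtraction, so the pass is O(n).
    public static List<int> RollingSums(byte[] data, int window)
    {
        var prefix = new long[data.Length + 1]; // prefix[i] = data[0] + ... + data[i - 1]
        var result = new List<int>();
        for (int i = 0; i < data.Length; i++)
        {
            prefix[i + 1] = prefix[i] + data[i];
            if (i + 1 >= window)
                result.Add((int)(prefix[i + 1] - prefix[i + 1 - window]));
        }
        return result;
    }

    static void Main()
    {
        byte[] fileA = { 2, 5, 8, 0, 33, 90, 1, 3, 200, 201, 23, 12 };
        byte[] fileB = { 255, 2, 5, 8, 0, 33, 90, 1, 3, 200, 201, 23, 12, 55, 255, 255 };

        var known = new HashSet<int>(BlockSums(fileA, 3)); // 15, 123, 204, 236
        foreach (int h in RollingSums(fileB, 3))
            Console.WriteLine(known.Contains(h) ? $"{h} matches file A" : $"{h}");
    }
}
```

A plain sum is a very weak hash (any reordering of the same bytes collides), so a match here only suggests that the blocks might be equal.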

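If I were to use a more complex algorithm, such as taking the md5 of the last 3 bytes, it would no longer be of order n: as I iterate through file B I would need an inner for loop that computes the hash of the last three bytes at every position, so the cost grows with n times the window size.

So my question is:

How can I improve the algorithm while keeping it of order n, that is, computing each hash just once? If I use an existing hashing algorithm such as md5, I have to place an inner loop inside the algorithm, which significantly increases its order.

Note that it is possible to do the same thing with multiplication instead of addition, but the counter grows really fast. Maybe I can combine multiplication, addition and subtraction...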

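Edit

Also, if I google for:

recursive hashing functions

a lot of information comes up, and I find those algorithms difficult to understand...

I have to implement this algorithm for a project, which is why I am reinventing the wheel... I know there are a lot of algorithms out there.

Another solution I am considering is to run the same algorithm plus a second, stronger one. So on file A I would run the same algorithm on every 3 bytes, plus an md5 of every 3 bytes. On the second file I would only run the second algorithm when the first one reports a match.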
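A minimal sketch of that two-step check, assuming the cheap hash is the byte sum from the tables above and the strong hash is MD5 from System.Security.Cryptography. The class and method names (TwoTierMatchDemo, FindMatches, Strong) are invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;

static class TwoTierMatchDemo
{
    // Strong hash of one window, encoded as text so it can be used as a set key.
    static string Strong(MD5 md5, byte[] data, int offset, int count) =>
        Convert.ToBase64String(md5.ComputeHash(data, offset, count));

    // Returns the offsets in `b` where a whole `window`-byte block of `a` reappears.
    public static List<int> FindMatches(byte[] a, byte[] b, int window)
    {
        var matches = new List<int>();
        using (MD5 md5 = MD5.Create())
        {
            // Index file A: weak sum -> strong hashes of the blocks with that sum.
            var index = new Dictionary<int, HashSet<string>>();
            for (int start = 0; start + window <= a.Length; start += window)
            {
                int weak = 0;
                for (int i = start; i < start + window; i++) weak += a[i];
                if (!index.TryGetValue(weak, out var strongSet))
                    index[weak] = strongSet = new HashSet<string>();
                strongSet.Add(Strong(md5, a, start, window));
            }

            // Scan file B with a rolling sum; compute MD5 only on a weak match.
            int rolling = 0;
            for (int i = 0; i < b.Length; i++)
            {
                rolling += b[i];
                if (i >= window) rolling -= b[i - window];
                if (i + 1 < window) continue; // window not full yet

                int offset = i + 1 - window;
                if (index.TryGetValue(rolling, out var candidates) &&
                    candidates.Contains(Strong(md5, b, offset, window)))
                    matches.Add(offset);
            }
        }
        return matches;
    }

    static void Main()
    {
        byte[] fileA = { 2, 5, 8, 0, 33, 90, 1, 3, 200, 201, 23, 12 };
        byte[] fileB = { 255, 2, 5, 8, 0, 33, 90, 1, 3, 200, 201, 23, 12, 55, 255, 255 };
        Console.WriteLine(string.Join(", ", FindMatches(fileA, fileB, 3))); // 1, 4, 7, 10
    }
}
```

The expensive hash runs only at positions where the cheap sum already matches, so on mostly different data the scan stays close to one pass over file B.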

Answer 1

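Edit:

The more I think about what you meant by "recursive", the more I doubt that the solution I presented earlier is the one you should implement to do anything useful.

You probably want to implement a hash tree algorithm, which is a recursive operation.

To do this, you hash the list, split the list in two, and recurse into the two sub-lists. Terminate when the list has size 1 or reaches a minimum desired size, because each level of recursion doubles the size of your total hash output.

Pseudocode: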

create-hash-tree(input list, minimum size: default = 1):
  initialize the output list
  hash-sublist(input list, output list, minimum size)
  return output list

hash-sublist(input list, output list, minimum size):
  add sum-based-hash(list) result to output list // easily swap hash styles here
  if size(input list) > minimum size:
    split the list into two halves
    hash-sublist(first half of list, output list, minimum size)
    hash-sublist(second half of list, output list, minimum size)

sum-based-hash(list):
  initialize the running total to 0

  for each item in the list:
    add the current item to the running total

  return the running total

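I think the run time of this entire algorithm is O(hash(m)), with m = n * (log(n) + 1), where hash(m) is usually linear time.

The storage space is something like O(hash * s), with s = 2n - 1, where the hash is usually of constant size.

Note that for C#, I would make the output list a List<HashType>, but I would make the input list an IEnumerable<ItemType> to save on storage space, and use Linq to quickly "split" the list without allocating two new sub-lists.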
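One possible C# rendering of the hash tree pseudocode, offered as a sketch rather than a definitive implementation. It passes a start index and a count instead of using Linq, which is a swapped-in way of "splitting" without copying, and the names HashTreeDemo, CreateHashTree and HashSublist are illustrative only.

```csharp
using System;
using System.Collections.Generic;

static class HashTreeDemo
{
    // Stand-in hash: the sum of the items, as in the pseudocode. Swap in another hash here.
    static long SumBasedHash(int[] items, int start, int count)
    {
        long total = 0;
        for (int i = start; i < start + count; i++) total += items[i];
        return total;
    }

    public static List<long> CreateHashTree(int[] items, int minimumSize = 1)
    {
        var output = new List<long>();
        if (items.Length > 0)
            HashSublist(items, 0, items.Length, output, Math.Max(1, minimumSize));
        return output;
    }

    // Pre-order walk: hash the current range, then recurse into both halves.
    static void HashSublist(int[] items, int start, int count, List<long> output, int minimumSize)
    {
        output.Add(SumBasedHash(items, start, count));
        if (count > minimumSize)
        {
            int half = count / 2;
            HashSublist(items, start, half, output, minimumSize);
            HashSublist(items, start + half, count - half, output, minimumSize);
        }
    }

    static void Main()
    {
        int[] data = { 2, 5, 8, 0 };
        // Prints 15, 7, 2, 5, 8, 8, 0 (2n - 1 = 7 hashes for n = 4).
        Console.WriteLine(string.Join(", ", CreateHashTree(data)));
    }
}
```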

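Original:

I think you can make this O(n + m) execution time, where n is the size of the list, m is the size of the running tally, and m < n (otherwise all sums would be equal).

With double-ended queue

Memory consumption will be the stack size, plus size m for temporary storage.

To do this, use a double-ended queue and a running total. Push newly encountered values onto the queue while adding them to the running total, and once the queue grows beyond size m, pop the front value off and subtract it from the running total.

Here is some pseudocode: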

initialize the running total to 0

for each item in the list:
  add the current item to the running total
  push the current value onto the end of the deque
  if deque.length > m:
    pop off the front of the deque
    subtract the popped value from the running total
  assign the running total to the current sum slot in the list

reset the index to the beginning of the list

while the deque isn't empty:
  add the item in the list at the current index to the running total
  pop off the front of the deque
  subtract the popped value from the running total
  assign the running total to the current sum slot in the list
  increment the index

This is not recursive, it is iterative.

A run of this algorithm looks like this (for m = 4):

value   sum slot   overwritten sum slot
2       2          92
5       7          74
8       15         70
0       15         15
33      46
90      131
1       124
3       127
200     294
201     405
23      427
12      436
55      291
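The queue-based pseudocode might translate to C# roughly as follows. This is a sketch that uses Queue<int>, since only the front is ever popped, and it assumes 0 < m <= values.Length; QueueWindowDemo and WindowSums are made-up names.

```csharp
using System;
using System.Collections.Generic;

static class QueueWindowDemo
{
    // sums[i] = sum of the m values ending at index i, wrapping around to the
    // end of the array for the first m - 1 positions.
    public static int[] WindowSums(int[] values, int m)
    {
        var sums = new int[values.Length];
        var window = new Queue<int>();
        int total = 0;

        // First pass: plain running window (the first m - 1 slots are still partial).
        for (int i = 0; i < values.Length; i++)
        {
            total += values[i];
            window.Enqueue(values[i]);
            if (window.Count > m)
                total -= window.Dequeue();
            sums[i] = total;
        }

        // Second pass: the queue holds the last m values, so walking the start of the
        // list again overwrites the early slots with their wrapped-around sums.
        for (int i = 0; window.Count > 0; i++)
        {
            total += values[i];
            total -= window.Dequeue();
            sums[i] = total;
        }
        return sums;
    }

    static void Main()
    {
        int[] values = { 2, 5, 8, 0, 33, 90, 1, 3, 200, 201, 23, 12, 55 };
        Console.WriteLine(string.Join(", ", WindowSums(values, 4)));
    }
}
```

Called with m = 4 on the thirteen values from the table, it prints 92, 74, 70, 15, 46, 131, 124, 127, 294, 405, 427, 436, 291.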

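With indexes

You can remove the queue and the reassignment of any slots by taking the sum of the last m values to start with, and using an offset of your index instead of popping from a queue, e.g. array[i - m].

This won't decrease your execution time, because you still need two loops: one to build up the running tally and another to populate all the values. But it will decrease your memory usage to stack space only (effectively O(1)).

Here is some pseudocode: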

initialize the running total to 0

for the last m items in the list:
  add those items to the running total

for each item in the list:
  add the current item to the running total
  subtract the value of the item m slots earlier from the running total
  assign the running total to the current sum slot in the list

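The "m slots earlier" is the tricky part. You can either split this up into two loops:

one that indexes from the end of the list, minus m, plus i
one that indexes from i minus m

Or you can use modulo arithmetic to "wrap" the value around when i - m < 0. Add n before taking the remainder, because % in C-family languages returns a negative result for a negative left operand:

int valueToSubtract = array[(i - m + n) % n];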
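Put together, the index-based variant could look like the sketch below, again assuming 0 < m <= n. The helper Wrap and the class name are hypothetical; Wrap exists only because (i - m) % n is negative in C# when i < m.

```csharp
using System;

static class IndexedWindowDemo
{
    // Non-negative remainder: C#'s % keeps the sign of the left operand.
    static int Wrap(int index, int n) => ((index % n) + n) % n;

    // sums[i] = sum of the m values ending at index i, wrapping around the array.
    public static int[] WindowSums(int[] values, int m)
    {
        int n = values.Length;
        var sums = new int[n];
        int total = 0;

        // Seed the running total with the last m values.
        for (int i = n - m; i < n; i++) total += values[i];

        for (int i = 0; i < n; i++)
        {
            total += values[i];               // value entering the window
            total -= values[Wrap(i - m, n)];  // value that left it, m slots earlier
            sums[i] = total;
        }
        return sums;
    }

    static void Main()
    {
        int[] values = { 2, 5, 8, 0, 33, 90, 1, 3, 200, 201, 23, 12, 55 };
        Console.WriteLine(string.Join(", ", WindowSums(values, 4)));
    }
}
```

Apart from the output array, the only state is the running total and the loop index.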

Answer 2

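http://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm uses an updateable hash function, which it calls a http://en.wikipedia.org/wiki/Rolling_hash . This will be a lot easier to compute than MD5/SHA, and it is not necessarily inferior.

You can prove something about it: it is a polynomial of degree d in a chosen constant a. Suppose somebody provides two stretches of text and you then choose a at random. What is the probability of a collision? Well, if the hash values are the same, subtracting them gives you a polynomial which has a as a root. Since a non-zero polynomial of degree d has at most d roots (working modulo a prime), and a was chosen at random, the probability is at most d / modulus, which will be very small for a large modulus.

Of course, MD5/SHA are designed to be secure and this hash is not, but see http://cr.yp.to/mac/poly1305-20050329.pdf for a secure variant.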
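To make the rolling hash concrete, here is a small sketch of a polynomial hash over a sliding window. The modulus 1000000007 and the requirement 0 < a < Modulus are assumptions of this example, not something the linked articles prescribe, and the names are invented.

```csharp
using System;
using System.Collections.Generic;

static class RollingHashDemo
{
    const long Modulus = 1000000007; // a prime, chosen for this sketch

    // Hash of each window: data[i]*a^(w-1) + data[i+1]*a^(w-2) + ... + data[i+w-1] (mod Modulus).
    // Entry k of the result is the hash of the window starting at offset k.
    public static List<long> WindowHashes(byte[] data, int window, long a)
    {
        var hashes = new List<long>();
        if (window <= 0 || data.Length < window) return hashes;

        long highPower = 1; // a^(window - 1) mod Modulus
        for (int i = 1; i < window; i++) highPower = highPower * a % Modulus;

        long hash = 0;
        for (int i = 0; i < data.Length; i++)
        {
            if (i >= window) // drop the contribution of the byte leaving the window
                hash = (hash - data[i - window] * highPower % Modulus + Modulus) % Modulus;
            hash = (hash * a + data[i]) % Modulus; // shift and append the new byte
            if (i + 1 >= window) hashes.Add(hash);
        }
        return hashes;
    }

    static void Main()
    {
        byte[] fileA = { 2, 5, 8, 0, 33, 90, 1, 3, 200, 201, 23, 12 };
        byte[] fileB = { 255, 2, 5, 8, 0, 33, 90, 1, 3, 200, 201, 23, 12, 55, 255, 255 };
        long a = new Random().Next(2, (int)Modulus); // random base, as the argument above requires

        var all = WindowHashes(fileA, 3, a);
        var known = new HashSet<long>();
        for (int i = 0; i < all.Count; i += 3) known.Add(all[i]); // non-overlapping blocks of A

        var hb = WindowHashes(fileB, 3, a);
        for (int i = 0; i < hb.Count; i++)
            if (known.Contains(hb[i]))
                Console.WriteLine($"window at offset {i} of file B matches a block of file A");
    }
}
```

Each step costs a constant number of multiplications and additions regardless of the window size, which keeps the scan of file B at order n.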

Answer 3

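This is what I have got so far. I am only missing the steps that should not take much time, such as comparing the arrays of hashes and opening the files for reading.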

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace RecursiveHashing
{
    static class Utilities
    {

        // used for circular arrays. If my circular array is of size 5 and its
        // current position is 2, shifting 3 units to the left should put me at index
        // 4 of the array.
        public static int Shift(this int number, int shift, int divisor)
        {
            var tempa = (number + shift) % divisor;
            if (tempa < 0)
                tempa = divisor + tempa;
            return tempa;
        }

    }
    class Program
    {
        const int CHUNCK_SIZE = 4; // split the files in chunks of 4 bytes

        /* 
         * formula that I will use to compute the hash
         * 
         *      formula = sum(chunk) * (a[c]+1) * (a[c-1]+1) * (a[c-2]+1) * (-1)^a[c]
         *      
         *          where:
         *              sum(chunk)  = sum of the current chunk
         *              a[c]        = current byte
         *              a[c-1]      = previous byte
         *              a[c-2]      = byte before the previous one
         *              (-1)^a[c]   = either -1 or +1
         *              
         *      this formula is efficient because I can get the sum of any chunk by keeping track of the overall sum
         *      thus this algorithm should be of order n
         */

        static void Main(string[] args)
        {
            Part1(); // Missing implementation to open file for reading
            Part2();
        }



        // first part: compute the hashes of the first file
        static void Part1()
        {
            // pretend the first file contains these bytes
            byte[] FileB = new byte[]{2,3,5,8,2,0,1,0,0,0,1,2,4,5,6,7,8,2,3,4,5,6,7,8,11,};

            // create an array in which to store the hashes
            // column 0 will use a fast hash algorithm. column 1 will use a more secure hashing algorithm
            Int64[,] hashes = new Int64[(FileB.Length / CHUNCK_SIZE) + 10, 2];


            // used to track on what index of the file we are at
            int counter = 0;
            byte[] current = new byte[CHUNCK_SIZE + 1]; // circular array needed to remember the last few bytes
            UInt64[] sum = new UInt64[CHUNCK_SIZE + 1]; // circular array needed to remember the last sums
            int index = 0; // position where in circular array

            int numberOfHashes = 0; // number of hashes created so far


            while (counter < FileB.Length)
            {
                int i = 0;
                for (; i < CHUNCK_SIZE; i++)
                {
                    if (counter == 0)
                    {
                        sum[index] = FileB[counter];
                    }
                    else
                    {
                        sum[index] = FileB[counter] + sum[index.Shift(-1, CHUNCK_SIZE + 1)];
                    }
                    current[index] = FileB[counter];
                    counter++;

                    if (counter % CHUNCK_SIZE == 0 || counter == FileB.Length)
                    {
                        // get the sum of the last chunk
                        var a = (sum[index] - sum[index.Shift(1, CHUNCK_SIZE + 1)]);
                        Int64 tempHash = (Int64)a;

                        // compute my hash function
                        tempHash = tempHash * ((Int64)current[index] + 1)
                                          * ((Int64)current[index.Shift(-1, CHUNCK_SIZE + 1)] + 1)
                                          * ((Int64)current[index.Shift(-2, CHUNCK_SIZE + 1)] + 1)
                                          * (Int64)(Math.Pow(-1, current[index]));


                        // add the hashes to the array: column 0 holds the fast hash,
                        // column 1 is reserved for a stronger hash of the same chunk
                        hashes[numberOfHashes, 0] = tempHash;
                        hashes[numberOfHashes, 1] = -1; // placeholder until a stronger hash is implemented
                        numberOfHashes++;

                        // MISSING IMPLEMENTATION TO STORE A SECOND STRONGER HASH FUNCTION

                        if (counter == FileB.Length)
                            break;
                    }

                    index++;
                    index = index % (CHUNCK_SIZE + 1); // if index is out of bounds in circular array place it at position 0
                }
            }
        }


        static void Part2()
        {
            // simulate file read of a similar file
            byte[] FileB = new byte[]{1,3,5,8,2,0,1,0,0,0,1,2,4,5,6,7,8,2,3,4,5,6,7,8,11};            

            // place where we will place all matching hashes
            Int64[,] hashes = new Int64[(FileB.Length / CHUNCK_SIZE) + 10, 2];


            int counter = 0;
            byte[] current = new byte[CHUNCK_SIZE + 1]; // circular array
            UInt64[] sum = new UInt64[CHUNCK_SIZE + 1]; // circular array
            int index = 0; // position where in circular array



            while (counter < FileB.Length)
            {
                int i = 0;
                for (; i < CHUNCK_SIZE; i++)
                {
                    if (counter == 0)
                    {
                        sum[index] = FileB[counter];
                    }
                    else
                    {
                        sum[index] = FileB[counter] + sum[index.Shift(-1, CHUNCK_SIZE + 1)];
                    }
                    current[index] = FileB[counter];
                    counter++;

                    // here we compute the hash every time and we are missing implementation to 
                    // check if hash is contained by the other file
                    if (counter >= CHUNCK_SIZE)
                    {
                        var a = (sum[index] - sum[index.Shift(1, CHUNCK_SIZE + 1)]);

                        Int64 tempHash = (Int64)a;

                        tempHash = tempHash * ((Int64)current[index] + 1)
                                          * ((Int64)current[index.Shift(-1, CHUNCK_SIZE + 1)] + 1)
                                          * ((Int64)current[index.Shift(-2, CHUNCK_SIZE + 1)] + 1)
                                          * (Int64)(Math.Pow(-1, current[index]));

                        if (counter == FileB.Length)
                            break;
                    }

                    index++;
                    index = index % (CHUNCK_SIZE + 1);
                }
            }
        }
    }
}

The same files represented in tables, using the same algorithm:

file a
                                                  hashes
bytes  sum  Ac  A[c-1]  A[c-2]  (-1)^Ac  sum(chunk) * (Ac+1) * (A[c-1]+1) * (A[c-2]+1) * (-1)^Ac
2      2
3      5
5      10   5   3       2       -1
8      18   8   5       3       1        3888
2      20   2   8       5       1
0      20   0   2       8       1
1      21   1   0       2       -1
0      21   0   1       0       1        6
0      21   0   0       1       1
0      21   0   0       0       1
1      22   1   0       0       -1
2      24   2   1       0       1        18
4      28   4   2       1       1
5      33   5   4       2       -1
6      39   6   5       4       1
7      46   7   6       5       -1       -7392
8      54   8   7       6       1
2      56   2   8       7       1
3      59   3   2       8       -1
4      63   4   3       2       1        1020
5      68   5   4       3       -1
6      74   6   5       4       1
7      81   7   6       5       -1
8      89   8   7       6       1        13104
11     100  11  8       7       -1       -27648






file b
                                                  rolling hashes
bytes  sum  Ac  A[c-1]  A[c-2]  (-1)^Ac  sum(chunk) * (Ac+1) * (A[c-1]+1) * (A[c-2]+1) * (-1)^Ac
1      1
3      4
5      9    5   3       1       -1
8      17   8   5       3       1        3672
2      19   2   8       5       1        2916
0      19   0   2       8       1        405
1      20   1   0       2       -1       -66
0      20   0   1       0       1        6
0      20   0   0       1       1        2
0      20   0   0       0       1        1
1      21   1   0       0       -1       -2
2      23   2   1       0       1        18
4      27   4   2       1       1        210
5      32   5   4       2       -1       -1080
6      38   6   5       4       1        3570
7      45   7   6       5       -1       -7392
8      53   8   7       6       1        13104
2      55   2   8       7       1        4968
3      58   3   2       8       -1       -2160
4      62   4   3       2       1        1020
5      67   5   4       3       -1       -1680
6      73   6   5       4       1        3780
7      80   7   6       5       -1       -7392
8      88   8   7       6       1        13104
11     99   11  8       7       -1       -27648

3 answers

Answer 1: 2, accepted, 2011-12-07 03:50:26
Answer 2: 1, 2011-12-07 05:55:45
Answer 3: 0, 2011-12-07 08:00:56
