
Is this the correct way of testing the avalanche effect of a hash function?

I am trying to implement a hash table in C to deepen my understanding of data structures.

There are a lot of hash functions out there for hash table implementations.

To compare hash functions, one common test is the avalanche effect test: flipping a single input bit should, on average, flip about half of the output bits.

To test the set of hash functions I currently have, I wrote a small program in Java:

    public static void testHashAvalanche() {
        // The input for the hash function: 128 bytes, initially all zero.
        byte[] bytes = new byte[128];
        long count = 0;
        long totalAvalanche = 0;
        // Hash the starting input so the first comparison is against a real
        // hash value rather than against 0.
        long previous = hash(bytes); // any hash function with 64-bit output
        // Generate the inputs for hashing with a slight change each time
        for (int i = 0; i < 128; i++) {
            // Cycle this byte through all 256 values (it wraps back to 0)
            for (int j = 0; j < 256; j++) {
                bytes[i]++;
                long current = hash(bytes);
                totalAvalanche += calculateAvalanche(previous, current);
                count++;
                previous = current;
            }
        }
        System.out.println("Average Avalanche: " + (double) totalAvalanche / (double) count);
    }
    
    // Hamming distance between the two hashes: the number of output bits
    // that differ between a and b.
    public static int calculateAvalanche(long a, long b) {
        long difference = a ^ b;
        return Long.bitCount(difference);
    }

I would like to know whether this is a correct approach, or whether there are other ways to test hash functions.

Thanks!

Let's begin with a quick observation. Suppose you're hashing 128-byte values. This means that you are hashing inputs that are 1024 bits long. How many different 1024-bit numbers are there? Well, the first bit can be zero or one. Independently, the second bit can be zero or one. And independent of that, the third bit can be zero or one, etc. This means that the number of possible combinations for those bits is 2 × 2 × ... × 2, a total of 1024 times, or 2^1024.

For context, that number is staggeringly huge. To call it "astronomically large" would actually be an insult to that number, since the number of atoms in the observable universe is something like 2^300. There's simply no way that you could try out all 2^1024 combinations of inputs to see how much they differ from one another.

So what could you do instead? One option would be to pick a sample of different inputs, and for each of those compute all numbers that are one bit different from them. Then, hash all those, see how many output bits flip, and average those numbers together. Another option would be to pick a random value, flip some bit in it, compute how much that changes the output hash, and repeat this process to get a rough estimate of how the hash changes in practice. Or you could use these approaches on "realistic" inputs (perhaps from a database or list of values somewhere), making "realistic" edits to the inputs (maybe by comparing hashes of one value against hashes of "similar" values, for some definition of "similar").
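
For example, here is a minimal sketch of the first option, reusing the hash and calculateAvalanche methods from your question. Each single-bit neighbor is compared against the hash of the unmodified base input (rather than chaining comparisons), and the sample count of 1000 is an arbitrary illustrative choice:

    public static void testHashAvalancheSampled() {
        final int SAMPLES = 1000; // arbitrary; more samples give a better estimate
        java.util.Random random = new java.util.Random();
        byte[] input = new byte[128];
        long totalAvalanche = 0;
        long count = 0;
        for (int s = 0; s < SAMPLES; s++) {
            random.nextBytes(input); // pick a random base input
            long baseHash = hash(input);
            for (int bit = 0; bit < 128 * 8; bit++) {
                input[bit / 8] ^= (byte) (1 << (bit % 8)); // flip one input bit
                totalAvalanche += calculateAvalanche(baseHash, hash(input));
                count++;
                input[bit / 8] ^= (byte) (1 << (bit % 8)); // flip it back
            }
        }
        // A 64-bit hash with ideal avalanche behavior should average about 32.
        System.out.println("Average output bits flipped: "
                + (double) totalAvalanche / (double) count);
    }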

The approach you're taking is along these lines, but not quite the same. Specifically, your approach maintains an array of bytes and cycles through the patterns below (your loop increments bytes[0] first, so it is the first byte that changes):

    0 0 0 0 ... 0 0
    1 0 0 0 ... 0 0
    2 0 0 0 ... 0 0
    3 0 0 0 ... 0 0
    4 0 0 0 ... 0 0
          ...
    0 1 0 0 ... 0 0
    0 2 0 0 ... 0 0
    0 3 0 0 ... 0 0
    0 4 0 0 ... 0 0
          ...

There are a couple of issues with this. For starters, these inputs might not be a good representative sample of possible inputs. (Though perhaps it's common to hash these sorts of values in your application, in which case you can ignore this. ^_^)

The next issue is that you're changing one byte at a time, rather than one bit at a time. That may or may not be a problem, depending on what you're trying to measure. If you're looking for the avalanche effect at the level of individual bytes, this is fine. But if you're looking for the avalanche effect at the level of individual bits, this won't work. For example, rolling a byte over from 15 (00001111) to 16 (00010000) changes five bits. You could keep the current approach of modifying one byte at a time by using a Gray code to cycle through all the possible byte values in a way that flips exactly one bit at a time.
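
Here's a sketch of how that Gray-code variant might look, again reusing your hash and calculateAvalanche methods (the method name testHashAvalancheGray is just an illustrative choice). The binary-reflected Gray code of j is j ^ (j >>> 1), and consecutive codes differ in exactly one bit:

    public static void testHashAvalancheGray() {
        byte[] bytes = new byte[128];
        long count = 0;
        long totalAvalanche = 0;
        long previous = hash(bytes); // hash of the all-zero input
        for (int i = 0; i < 128; i++) {
            for (int j = 1; j < 256; j++) {
                // Consecutive Gray codes differ in exactly one bit, so each
                // new input flips a single bit relative to the previous one.
                bytes[i] = (byte) (j ^ (j >>> 1));
                long current = hash(bytes);
                totalAvalanche += calculateAvalanche(previous, current);
                count++;
                previous = current;
            }
            // The last Gray code is 10000000 in binary, so resetting this
            // byte to zero is itself a one-bit change; measure that step too.
            bytes[i] = 0;
            long current = hash(bytes);
            totalAvalanche += calculateAvalanche(previous, current);
            count++;
            previous = current;
        }
        System.out.println("Average bits flipped per one-bit input change: "
                + (double) totalAvalanche / (double) count);
    }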
