CRC32 hash collision on the same string for any seed

Question

I am trying to find a seed that hashes short strings of lowercase letters, of the greatest possible length, without collisions. I chose the SSE 4.2 CRC32 instruction to make the task easier. For lengths 4, 5 and 6 there are no collisions for seeds up to some reasonably small value (I can't wait forever).

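#include <algorithm>
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <iterator>
#include <iostream>

#include <x86intrin.h>

// One bit per possible 32-bit hash value (2^32 bits = 512 MiB).
static std::bitset<size_t(std::numeric_limits<uint32_t>::max()) + 1> hashes;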

static void findSeed()
{
    uint8_t c[7];
    const auto findCollision = [&] (uint32_t seed)
    {
        std::cout << "seed = " << seed << std::endl;
        hashes.reset();
        for (c[0] = 'a'; c[0] <= 'z'; ++c[0]) {
            uint32_t hash0 = _mm_crc32_u8(~seed, c[0]);
            for (c[1] = 'a'; c[1] <= 'z'; ++c[1]) {
                uint32_t hash1 = _mm_crc32_u8(hash0, c[1]);
                for (c[2] = 'a'; c[2] <= 'z'; ++c[2]) {
                    uint32_t hash2 = _mm_crc32_u8(hash1, c[2]);
                    for (c[3] = 'a'; c[3] <= 'z'; ++c[3]) {
                        uint32_t hash3 = _mm_crc32_u8(hash2, c[3]);
                        for (c[4] = 'a'; c[4] <= 'z'; ++c[4]) {
                            uint32_t hash4 = _mm_crc32_u8(hash3, c[4]);
                            for (c[5] = 'a'; c[5] <= 'z'; ++c[5]) {
                                uint32_t hash5 = _mm_crc32_u8(hash4, c[5]);
                                for (c[6] = 'a'; c[6] <= 'z'; ++c[6]) {
                                    uint32_t hash6 = _mm_crc32_u8(hash5, c[6]);
                                    if (hashes[hash6]) {
                                        std::cerr << "collision at ";
                                        std::copy(std::cbegin(c), std::cend(c), std::ostream_iterator<uint8_t>(std::cerr, ""));
                                        std::cerr << " " << hash6 << '\n';
                                        return;
                                    }
                                    hashes.set(hash6);
                                }
                            }
                        }
                    }
                }
            }
            std::cout << "c[0] = " << c[0] << std::endl;
        }
    };
    for (uint32_t seed = 0; seed != std::numeric_limits<uint32_t>::max(); ++seed) {
        findCollision(seed);
    }
    findCollision(std::numeric_limits<uint32_t>::max());
}

int main()
{
    findSeed();
}

It is clear that for strings of length 7 no such seed can exist, because ('z' - 'a' + 1)^7 = 26^7 = 8 031 810 176 > 4 294 967 296 = size_t(std::numeric_limits<uint32_t>::max()) + 1. The notable thing is that the first collision always involves the same two strings, abfcmbk and baabaaa, whatever the seed. The value of hash6 at the collision differs from seed to seed, but the colliding strings do not. I find this curious.

How can it be explained?

Answer 1

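Let CRC(seed,dat) be the CRC of dat computed with the specified seed. Then for any two seeds seed1 and seed2, and any two pieces of data dat1 and dat2 of the same length, CRC(seed2,dat1) can be computed as the xor of CRC(seed1,dat1), CRC(seed1,dat2), and CRC(seed2,dat2). This works because every CRC step is linear over xor, so for a fixed data length the result is the xor of a term that depends only on the seed and a term that depends only on the data.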

This in turn implies that if two pieces of data of equal length yield the same CRC value for one particular seed, they yield the same value for every possible seed. Suppose that for some seed1, CRC(seed1,dat1a) equals CRC(seed1,dat1b). Then for any other seed seed2 and any same-length data dat2, CRC(seed2,dat1a) equals CRC(seed1,dat1a) xor CRC(seed1,dat2) xor CRC(seed2,dat2), and CRC(seed2,dat1b) equals CRC(seed1,dat1b) xor CRC(seed1,dat2) xor CRC(seed2,dat2). All three terms of the two xors are equal, so the results are equal as well.
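The xor identity above can be tried out with a small sketch. It assumes a bitwise software stand-in for _mm_crc32_u8 (CRC-32C, reflected polynomial 0x82F63B78, no initial or final inversion), so that it builds without SSE 4.2. The names crc32cByte, crc and crcViaXor are made up for this illustration and are not part of any library.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Assumed software equivalent of one _mm_crc32_u8 step: CRC-32C with the
// reflected polynomial 0x82F63B78 and no initial or final inversion.
inline std::uint32_t crc32cByte(std::uint32_t crc, std::uint8_t byte)
{
    crc ^= byte;
    for (int bit = 0; bit < 8; ++bit) {
        // Shift right; xor in the polynomial if the dropped bit was set.
        crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

// CRC(seed, dat) in the notation of the answer (hypothetical helper).
inline std::uint32_t crc(std::uint32_t seed, const std::string& dat)
{
    std::uint32_t h = seed;
    for (unsigned char ch : dat) {
        h = crc32cByte(h, ch);
    }
    return h;
}

// Predicts CRC(seed2, dat1) from three other CRC values.
inline std::uint32_t crcViaXor(std::uint32_t seed1, std::uint32_t seed2,
                               const std::string& dat1, const std::string& dat2)
{
    assert(dat1.size() == dat2.size()); // the identity needs equal lengths
    return crc(seed1, dat1) ^ crc(seed1, dat2) ^ crc(seed2, dat2);
}
```

For any two seeds s1 and s2 and equal-length strings d1 and d2, crcViaXor(s1, s2, d1, d2) should agree with crc(s2, d1). The pair from the question can be checked the same way by comparing crc(seed, "abfcmbk") with crc(seed, "baabaaa") for a few seeds.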

Answer 2

As noted in another answer, a CRC can't help with this. Instead, simply encode your six or fewer lowercase letters as a base-26 number in a 32-bit integer, with an offset that depends on the length of the string. The sum of 26^n for n = 0 to 6 is 321 272 407, which is much less than 2^32; it fits in 29 bits. Or, as Peter Cordes commented, use 30 bits with six five-bit fields.
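Both encodings could look roughly like the sketch below. The function names encodeBase26 and encodeFiveBit are invented for this example, and the layout (length offsets equal to the count of all shorter strings; five-bit fields holding 1 to 26, with 0 meaning "no letter") is one possible choice, not necessarily the one the commenters had in mind.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Base-26 with length offsets: strings of length n map to the range
// [offset(n), offset(n) + 26^n), where offset(n) = 26^0 + ... + 26^(n-1)
// and offset(0) = 0, so different lengths never overlap.
inline std::uint32_t encodeBase26(const std::string& s)
{
    assert(s.size() <= 6);
    std::uint32_t offset = 0; // number of strings shorter than s
    std::uint32_t power = 1;  // 26^(letters consumed so far)
    std::uint32_t value = 0;  // s read as a base-26 number
    for (unsigned char ch : s) {
        assert(ch >= 'a' && ch <= 'z');
        value = value * 26 + (ch - 'a');
        offset += power;
        power *= 26;
    }
    return offset + value;
}

// Six five-bit fields: each letter is stored as 1..26, and unused leading
// fields stay 0, so shorter strings cannot collide with longer ones.
inline std::uint32_t encodeFiveBit(const std::string& s)
{
    assert(s.size() <= 6);
    std::uint32_t packed = 0;
    for (unsigned char ch : s) {
        assert(ch >= 'a' && ch <= 'z');
        packed = (packed << 5) | (ch - 'a' + 1u);
    }
    return packed;
}
```

With this layout the empty string maps to 0, "a" through "z" map to 1 through 26, "aa" maps to 27, and "zzzzzz" maps to 321 272 406, the largest code. The five-bit variant trades one extra bit for shifts instead of multiplications.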

There will be no collisions. If it's useful, you can apply a 32-bit CRC to that integer to scramble the bits, and there will again be no collisions.
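A minimal sketch of the scrambling step follows, again assuming a bitwise software stand-in for _mm_crc32_u8 (CRC-32C, reflected polynomial 0x82F63B78) rather than the hardware instruction. The names crc32cStep, scramble and scrambledCodesAreDistinct are hypothetical, and the byte order and seed 0 are arbitrary choices for the illustration.

```cpp
#include <cstdint>
#include <unordered_set>

// Assumed software equivalent of one _mm_crc32_u8 step (CRC-32C,
// reflected polynomial 0x82F63B78, no initial or final inversion).
inline std::uint32_t crc32cStep(std::uint32_t crc, std::uint8_t byte)
{
    crc ^= byte;
    for (int bit = 0; bit < 8; ++bit) {
        crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

// Scrambles a 32-bit code by feeding its four bytes, low byte first,
// through the CRC starting from seed 0.
inline std::uint32_t scramble(std::uint32_t code)
{
    std::uint32_t h = 0;
    for (int i = 0; i < 4; ++i) {
        h = crc32cStep(h, static_cast<std::uint8_t>(code >> (8 * i)));
    }
    return h;
}

// Spot check only: verifies that the codes 0 .. count-1 scramble to
// distinct values. It does not prove anything about the full 32-bit range.
inline bool scrambledCodesAreDistinct(std::uint32_t count)
{
    std::unordered_set<std::uint32_t> seen;
    for (std::uint32_t code = 0; code < count; ++code) {
        if (!seen.insert(scramble(code)).second) {
            return false;
        }
    }
    return true;
}
```

The reason no collisions appear is that a 32-bit CRC over exactly 32 bits of input amounts to multiplying the input polynomial by x^32 modulo the CRC polynomial, and that map is invertible, so distinct codes give distinct results.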

As you observed, it is not possible to uniquely encode seven or more lower-case characters in 32 bits.

solution1
6 ACCEPTED 2020-09-26 19:40:29

solution2
3 2020-09-27 14:13:25
