
CRC32 hash collision on the same string for any seed

I tried to find a seed that hashes short strings of lowercase letters, of the maximum possible length, without collisions. I chose the SSE 4.2 CRC32 instruction to make the task easier. For lengths 4, 5, and 6 there are no collisions for seeds up to some reasonably small value (I can't wait forever).

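#include <algorithm>
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <iterator>
#include <limits>

#include <x86intrin.h> // _mm_crc32_u8; build with SSE 4.2 enabled (e.g. -msse4.2 for GCC/Clang)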

static std::bitset<size_t(std::numeric_limits<uint32_t>::max()) + 1> hashes;

static void findSeed()
{
    uint8_t c[7];
    const auto findCollision = [&] (uint32_t seed)
    {
        std::cout << "seed = " << seed << std::endl;
        hashes.reset();
        for (c[0] = 'a'; c[0] <= 'z'; ++c[0]) {
            uint32_t hash0 = _mm_crc32_u8(~seed, c[0]);
            for (c[1] = 'a'; c[1] <= 'z'; ++c[1]) {
                uint32_t hash1 = _mm_crc32_u8(hash0, c[1]);
                for (c[2] = 'a'; c[2] <= 'z'; ++c[2]) {
                    uint32_t hash2 = _mm_crc32_u8(hash1, c[2]);
                    for (c[3] = 'a'; c[3] <= 'z'; ++c[3]) {
                        uint32_t hash3 = _mm_crc32_u8(hash2, c[3]);
                        for (c[4] = 'a'; c[4] <= 'z'; ++c[4]) {
                            uint32_t hash4 = _mm_crc32_u8(hash3, c[4]);
                            for (c[5] = 'a'; c[5] <= 'z'; ++c[5]) {
                                uint32_t hash5 = _mm_crc32_u8(hash4, c[5]);
                                for (c[6] = 'a'; c[6] <= 'z'; ++c[6]) {
                                    uint32_t hash6 = _mm_crc32_u8(hash5, c[6]);
                                    if (hashes[hash6]) {
                                        std::cerr << "collision at ";
                                        std::copy(std::cbegin(c), std::cend(c), std::ostream_iterator<uint8_t>(std::cerr, ""));
                                        std::cerr << " " << hash6 << '\n';
                                        return;
                                    }
                                    hashes.set(hash6);
                                }
                            }
                        }
                    }
                }
            }
            std::cout << "c[0] = " << c[0] << std::endl;
        }
    };
    for (uint32_t seed = 0; seed != std::numeric_limits<uint32_t>::max(); ++seed) {
        findCollision(seed);
    }
    findCollision(std::numeric_limits<uint32_t>::max());
}

int main()
{
    findSeed();
}

Clearly, no such seed can exist for strings of length 7, because ('z' - 'a' + 1)^7 = 26^7 = 8 031 810 176 > 4 294 967 296 = size_t(std::numeric_limits<uint32_t>::max()) + 1. The notable thing is that, for every seed, the first collision occurs on the same pair of strings, abfcmbk and baabaaa. The value of hash6 at which the collision occurs differs from seed to seed. I find this curious.

How can it be explained?

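Let CRC(seed, dat) be the CRC of dat computed with the specified seed. For a fixed data length, the seed and the data contribute to the result independently: CRC(seed, dat) is the xor of a term that depends only on the seed and a term that depends only on the data (plus, in some variants, a fixed constant). Consequently, for any seeds seed1 and seed2 and any equal-length data dat1 and dat2, CRC(seed2, dat1) can be computed as the xor of CRC(seed1, dat1), CRC(seed1, dat2), and CRC(seed2, dat2).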

This in turn implies that if two equal-length pieces of data yield the same CRC value for one particular seed, they yield the same value for every possible seed. Suppose that for some seed1, CRC(seed1, dat1a) equals CRC(seed1, dat1b). Then for any other seed seed2 and any same-length data dat2, CRC(seed2, dat1a) equals CRC(seed1, dat1a) xor CRC(seed1, dat2) xor CRC(seed2, dat2), and CRC(seed2, dat1b) equals CRC(seed1, dat1b) xor CRC(seed1, dat2) xor CRC(seed2, dat2). The three terms of the two xors are pairwise equal, so the results are equal as well.

As noted in another answer, a CRC can't help with this. Instead, simply encode your six or fewer lowercase letters as a base-26 number in a 32-bit integer, adding an offset that depends on the length of the string. The sum of 26^n for n = 0 to 6 is 321 272 407, which is less than 2^32. Much less, in fact: it fits in 29 bits. Or, as Peter Cordes commented, use 30 bits as six five-bit fields.

There will be no collisions. If it's useful, you can apply a 32-bit CRC to that integer to scramble the bits, and there will again be no collisions.

As you observed, it is not possible to uniquely encode seven or more lower-case characters in 32 bits.

