
MySQL bitwise operations, bloom filter

I'd like to implement a bloom filter using MySQL (or a suggested alternative).

The problem is as follows:

Suppose I have a table that stores 8 bit integers, with these following values:

1: 10011010
2: 00110101
3: 10010100
4: 00100110
5: 00111011
6: 01101010

I'd like to find all rows whose value, bitwise ANDed with the following mask, gives back the mask itself (that is, all rows with these bits set):

00011000

The results should be rows 1 and 5.

However, in my problem they aren't 8-bit integers, but rather n-bit integers. How do I store them, and how do I query? Speed is key.

Create a table with an integer column (pick an integer type wide enough for your bit count). Don't store the numbers as strings of 0s and 1s.

For your data it will look like this:

number

154
53
148
38
59
106

and you need to find all entries whose value, ANDed with 24 (binary 00011000), equals 24.

Then you can run a query like

SELECT * FROM test WHERE number & 24 = 24
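For instance, here is a minimal, hypothetical setup for the sample data above (the table and column names are assumptions, not from the original question):

-- 8-bit values fit in TINYINT UNSIGNED; wider filters need a wider type
CREATE TABLE test (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    number TINYINT UNSIGNED NOT NULL
);

-- The six sample rows, stored as plain integers
INSERT INTO test (number) VALUES (154), (53), (148), (38), (59), (106);

-- Returns rows 1 and 5 (154 and 59), the only values with both mask bits set
SELECT * FROM test WHERE number & 24 = 24;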

If you want to avoid converting to base-10 numbers in your application, you can hand that over to MySQL:

INSERT INTO test SET number = b'00110101';

and search like this

SELECT bin(number) FROM test WHERE number & b'00011000' = b'00011000'

Consider not using MySQL for this.

First off, there probably isn't a built-in way to do this for values wider than 64 bits. You'd have to resort to user-defined functions written in C.

Second, each query is going to require a full table scan, because MySQL can't use an index for this kind of bitmask predicate. So, unless your table is very small, this will not be fast.

Bloom filters by their nature require table scans to evaluate matches. In MySQL, there is no bloom filter type. The simple solution is to map the bytes of the bloom filter onto BIGINT values (8-byte words) and perform the check in the query. So, assuming the bloom filter is 8 bytes or fewer (a very small filter), you could execute a prepared statement like:

SELECT * FROM test WHERE CAST(filter AS UNSIGNED) & CAST(? AS UNSIGNED) = CAST(? AS UNSIGNED)

and replace the parameters with the value you are looking for. However, for larger filters, you have to create multiple filter columns and split your target filter into multiple words. You have to cast to unsigned to do the check properly.
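For example, a 128-bit filter could be split across two BIGINT UNSIGNED columns. This is a sketch with illustrative column names (word_hi/word_lo); each parameter pair is bound to the corresponding 64-bit word of the target filter:

-- A row matches when every target bit is present in the stored filter,
-- checked one 64-bit word at a time
SELECT * FROM test
WHERE (word_hi & ?) = ?
  AND (word_lo & ?) = ?;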

Since many reasonable bloom filters are in the kilobyte-to-megabyte range, it makes sense to use BLOBs to store them. But once you switch to BLOBs there is no native mechanism to perform the byte-level comparison, and pulling an entire table of large BLOBs across the network to apply the filter locally in code does not make much sense.

The only reasonable solution I have found is a UDF. The UDF should accept a char*, cast it to an unsigned char*, and iterate over it performing the target & candidate = target check byte by byte. The code would look something like this:

#include <mysql.h>

/* bloommatch(target, candidate): integer-returning UDF.
   Returns 1 if every bit set in target is also set in candidate
   (i.e. target & candidate == target), 0 otherwise. */
long long bloommatch(UDF_INIT *initid, UDF_ARGS *args,
                     char *is_null, char *error)
{
    /* A target longer than the candidate cannot possibly match. */
    if (args->lengths[0] > args->lengths[1])
    {
        return 0;
    }
    char *b1 = args->args[0];   /* target filter */
    char *b2 = args->args[1];   /* stored candidate filter */
    int limit = args->lengths[0];
    unsigned char a;
    unsigned char b;
    int i;
    for (i = 0; i < limit; i++)
    {
        a = (unsigned char) b1[i];
        b = (unsigned char) b2[i];
        /* A target bit missing from the candidate means no match. */
        if ((a & b) != a)
        {
            return 0;
        }
    }
    return 1;
}
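Once compiled into a shared library and placed in MySQL's plugin directory, the UDF can be registered and used roughly like this (the library name and the filter column are assumptions):

CREATE FUNCTION bloommatch RETURNS INTEGER SONAME 'bloommatch.so';

-- Match rows whose stored filter contains every bit of the target filter
SELECT * FROM test WHERE bloommatch(?, filter) = 1;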

This solution is implemented and is available here

Switch to PostgreSQL and use bit(n).

For up to 64 bits, you can use a MySQL integer type: TINYINT (8 bits), SMALLINT (16 bits), MEDIUMINT (24 bits), INT (32 bits), or BIGINT (64 bits). Use the unsigned variants.

Above 64 bits, use the MySQL (VAR)BINARY type. Those are raw byte buffers; for example, BINARY(16) is good for 128 bits.

To prevent table scans you need an index per useful bit, and/or an index per set of related bits. You can create virtual columns for that, and put an index on each of them.
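A minimal sketch, assuming the single-integer layout from the first answer and that bit 3 (value 8) is one of the useful bits; the column and index names are illustrative:

-- Generated column exposing one bit, with a secondary index on it
ALTER TABLE test
    ADD COLUMN bit3 TINYINT AS ((number >> 3) & 1) VIRTUAL,
    ADD INDEX idx_bit3 (bit3);

-- This predicate can now be answered from the index instead of a scan
SELECT * FROM test WHERE bit3 = 1;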

To implement a Bloom filter using a database, I'd think about it differently.

I'd do a two-level filter. Use a single multi-bit hash function to generate an id (this would be more like a hash table bucket index) and then use bits within the row for the remaining k-1 hash functions of the more classical kind. Within the row, it could be (say) 100 bigint columns (I'd compare performance vs BLOBs too).

It would effectively be N separate Bloom filters, where N is the domain of your first hash function. The idea is to reduce the size of the Bloom filter required by choosing a hash bucket. It wouldn't have the full efficiency of an in-memory Bloom filter, but could still greatly reduce the amount of data needing to be stored compared to putting all the values in the database and indexing them. Presumably the reason for using a database in the first place is lack of memory for a full Bloom filter.
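A minimal sketch of that layout, assuming the first hash selects a bucket row and a few BIGINT words hold the per-bucket filter (all names and word counts are illustrative):

CREATE TABLE bloom_buckets (
    bucket INT UNSIGNED PRIMARY KEY,   -- output of the first-level hash
    w0 BIGINT UNSIGNED NOT NULL DEFAULT 0,
    w1 BIGINT UNSIGNED NOT NULL DEFAULT 0,
    w2 BIGINT UNSIGNED NOT NULL DEFAULT 0
    -- ... as many words as the per-bucket filter needs
);

-- Membership test: one primary-key lookup, then check the k-1 remaining
-- bits, here assumed to land in w0 and w2
SELECT (w0 & ?) = ? AND (w2 & ?) = ?
FROM bloom_buckets
WHERE bucket = ?;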
