简体   繁体   中英

Roaring bitmap using more storage than normal bitset

I have a bitset which i am using to track whether an item is present or not example

b = 01100110000

it represents that 2nd and 3rd items are present and 1st and 4th item are not present.

While searching for library which can optimise this bitset array. I came across Roaring bitmaps which sounded very exciting.

I did a quick test with it,

    public static void main(String[] args) throws IOException {
        RoaringBitmap roaringBitMap = new RoaringBitmap();
        BitSet bitSet = new BitSet(5000);
        double prob = 0.001;
        Random random = new Random();
        for (int i = 0; i < 5000; i++) {
            if (random.nextDouble() < prob) {
                bitSet.set(i);
                roaringBitMap.add(i);
            }
        }
        System.out.println(bitSet.cardinality());
        System.out.println("bitset bytes: "+ bitSet.size());
        System.out.println("RoaringBitmap bytes: " + roaringBitMap.getSizeInBytes() * 8);
    }

Basically we are setting some values and check overall size of data structure.

when we run this with multiple prob values. I got

prob byte bitset bytes RoaringBitmap bytes
0.001 5056 288
0.01 5056 944
0.1 5056 7872
0.999 5056 65616

If you see as we insert more and more numbers, the memory footprint of RoaringBitmap increases.

  1. Is this expected?
  2. In the worst case should it not just fall back to bitset based implementaiton?
  3. can't 0.999 be treated as inverse of 0.001 and we would be able to store it in 288 bytes?
  4. What is the most optimal way to represent these bitset as String when we are making inter service calls and using jackson library (but not byte based serialisation libraries)

this seems to be the case when number of entries are small, But as we increase the number of entries, the different becomes less visible. Although it is not confirmed by the lib author ( i asked here and followed up here )

prob number of entries bitset bits RoaringBitmap bits saving %
0.001 50000 50048 928 98
0.01 50000 50048 7744 84
0.1 50000 50048 65616 -31
0.999 50000 50048 65616 <- NOTE it does not increase -31
0.001 500000 500032 8704 98
0.01 500000 500032 80720 83
0.1 500000 500032 524480 -4
0.999 500000 500032 524480 <- NOTE it does not increase -4
0.001 50000000 50000000 835232 98
0.01 50000000 50000000 8036368 83
0.1 50000000 50000000 50016240 -0.03
0.999 50000000 50000000 50016240 <- NOTE it does not increase -0.03

looking at this it seems like as number of entries grows they might be using bitmap only behind the scene. The take away is that don't blindly use the library, test for your use case.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM