
How to create an efficient bit set structure for big data?

Java's BitSet lives entirely in memory and is not compressed.

Say I have 1 billion entries in a bit map: that occupies 125 MB of memory. If I have to do AND and OR operations on 10 such bit maps, they take 1250 MB (about 1.25 GB) of memory, which is unacceptable. How can I do fast operations on such bit maps without holding them uncompressed in memory?
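
For reference, a quick sketch of the arithmetic behind those figures. The class and method names are made up for illustration; it relies only on the fact that java.util.BitSet keeps its bits in a long[], and it ignores object headers and any slack from array growth.

```java
// Rough footprint of uncompressed bit sets, using the numbers from the question.
class BitSetFootprint {
    // One long word holds 64 bits, so round the bit count up to whole words.
    static long bytesFor(long bits) {
        long words = (bits + 63) / 64;
        return words * Long.BYTES;
    }

    public static void main(String[] args) {
        long one = bytesFor(1_000_000_000L);
        System.out.println("one bit set:  " + one + " bytes");
        System.out.println("ten bit sets: " + (10 * one) + " bytes");
    }
}
```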

I do not know the distribution of the bits in the bit set.

I have also looked at JavaEWAH, which is a variant of the Java BitSet class that uses run-length encoding (RLE) compression.
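
To make the RLE idea concrete, here is a toy sketch that stores a bit set as alternating run lengths. This is not JavaEWAH's actual word-aligned format, and ToyRle, encode and decode are hypothetical names; it only shows why the gain depends on the bit distribution: long runs collapse to a single number, while alternating bits produce one entry per bit.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Toy run-length encoding of a BitSet (illustration only, not the EWAH format).
class ToyRle {
    // Encodes bits [0, length) as run lengths: zeros, ones, zeros, ones, ...
    static List<Integer> encode(BitSet bits, int length) {
        List<Integer> runs = new ArrayList<>();
        int pos = 0;
        boolean value = false; // the first run always counts zeros (it may be empty)
        while (pos < length) {
            int next = value ? bits.nextClearBit(pos) : bits.nextSetBit(pos);
            if (next < 0 || next > length) {
                next = length; // nextSetBit returns -1 when no set bit remains
            }
            runs.add(next - pos);
            pos = next;
            value = !value;
        }
        return runs;
    }

    // Rebuilds the BitSet by setting every second run.
    static BitSet decode(List<Integer> runs) {
        BitSet bits = new BitSet();
        int pos = 0;
        boolean value = false;
        for (int run : runs) {
            if (value) {
                bits.set(pos, pos + run);
            }
            pos += run;
            value = !value;
        }
        return bits;
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet();
        bits.set(3, 6);
        bits.set(100);
        System.out.println(encode(bits, 128));
    }
}
```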

Is there a better solution?

One solution is to keep the arrays off the heap.

You'll want to read this answer by @PeterLawrey to a related question.

In summary, the performance of memory-mapped files in Java is quite good, and they avoid keeping huge collections of objects on the heap.

The operating system may limit the size of an individual memory-mapped region. It's easy to work around this limitation by mapping multiple regions. If the regions have a fixed, power-of-two size, simple shift and mask operations on an entry's index are enough to find the corresponding region in the list of mappings and the offset inside it.
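
A minimal sketch of that multi-region idea, using only java.nio. MappedBits, the 1 MiB region size and the file layout are assumptions made for illustration, not something taken from Peter's answer or from Chronicle.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: one logical bit array spread over fixed-size mapped regions.
class MappedBits {
    private static final int REGION_SHIFT = 20;                 // assumed region size: 1 MiB
    private static final int REGION_BYTES = 1 << REGION_SHIFT;
    private final MappedByteBuffer[] regions;

    MappedBits(Path file, long bitCount) throws IOException {
        long bytes = (bitCount + 7) / 8;
        int count = (int) ((bytes + REGION_BYTES - 1) >>> REGION_SHIFT);
        regions = new MappedByteBuffer[count];
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            long total = (long) count * REGION_BYTES;
            if (ch.size() < total) {
                ch.write(ByteBuffer.wrap(new byte[1]), total - 1); // grow the file up front
            }
            for (int i = 0; i < count; i++) {
                // A mapping stays valid after the channel is closed.
                regions[i] = ch.map(FileChannel.MapMode.READ_WRITE,
                        (long) i * REGION_BYTES, REGION_BYTES);
            }
        }
    }

    boolean get(long bit) {
        long byteIndex = bit >>> 3;
        int region = (int) (byteIndex >>> REGION_SHIFT);      // which mapped region
        int offset = (int) (byteIndex & (REGION_BYTES - 1));   // byte offset inside it
        return (regions[region].get(offset) & (1 << (int) (bit & 7))) != 0;
    }

    void set(long bit) {
        long byteIndex = bit >>> 3;
        int region = (int) (byteIndex >>> REGION_SHIFT);
        int offset = (int) (byteIndex & (REGION_BYTES - 1));
        byte old = regions[region].get(offset);
        regions[region].put(offset, (byte) (old | (1 << (int) (bit & 7))));
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("bits", ".bin");
        file.toFile().deleteOnExit();
        MappedBits bits = new MappedBits(file, 20_000_000L);
        bits.set(10_000_000L);
        System.out.println(bits.get(10_000_000L) + " " + bits.get(10_000_001L));
    }
}
```

Because the region size is a power of two, the region lookup is a shift and the offset is a mask, so no division is needed on the hot path.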

Are you sure you need compression? Compression trades time for space. It's possible that the reduced I/O ends up saving you time, but it's also possible that it won't. Can you add an SSD?

If you haven't yet tried memory-mapped files, start with that. I'd take a close look at implementing something on top of Peter's Chronicle.

If you need more speed, you could try doing your binary operations in parallel.
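
One way that could look, as a rough sketch on plain long[] word arrays: Arrays.parallelSetAll splits the index range across the common fork/join pool. ParallelBitOps is a made-up name, and whether this is actually faster depends on memory bandwidth, so measure before relying on it.

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical helper: word-wise AND / OR computed in parallel.
class ParallelBitOps {
    // Words beyond the shorter array are implicitly zero, so the result is truncated.
    static long[] and(long[] a, long[] b) {
        long[] out = new long[Math.min(a.length, b.length)];
        Arrays.parallelSetAll(out, i -> a[i] & b[i]);
        return out;
    }

    // Words beyond the shorter array are copied from the longer one.
    static long[] or(long[] a, long[] b) {
        long[] longer = a.length >= b.length ? a : b;
        long[] shorter = a.length >= b.length ? b : a;
        long[] out = new long[longer.length];
        Arrays.parallelSetAll(out, i -> i < shorter.length ? longer[i] | shorter[i] : longer[i]);
        return out;
    }

    public static void main(String[] args) {
        BitSet x = new BitSet();
        x.set(1); x.set(70); x.set(200);
        BitSet y = new BitSet();
        y.set(70); y.set(300);
        // BitSet.toLongArray / BitSet.valueOf convert to and from the word representation.
        System.out.println(BitSet.valueOf(and(x.toLongArray(), y.toLongArray())));
        System.out.println(BitSet.valueOf(or(x.toLongArray(), y.toLongArray())));
    }
}
```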

If you end up needing compression, you could always implement it on top of Chronicle's memory-mapped arrays.

From the comments, here is what I would add to your initial question:

  • the distribution of the bits is unknown, so BitSet is probably the best we can use
  • you have to use the bit sets in different modules and want to cache them

That being said, my advice would be to implement a dedicated cache, using a LinkedHashMap in access order if LRU is an acceptable eviction strategy, with permanent storage on disk for the BitSets.

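Sketch (reading from disk is left as a placeholder):

import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

class BitSetHolder {
    private final int size; // maximum number of bit sets kept in memory
    private final BitSetCache bitSetCache = new BitSetCache();

    BitSetHolder(int size) {
        this.size = size;
    }

    class BitSetCache extends LinkedHashMap<Integer, BitSet> {
        BitSetCache() {
            super(16, 0.75f, true); // true = access order, which gives LRU behaviour
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<Integer, BitSet> eldest) {
            return size() > BitSetHolder.this.size; // evict once the cache is over capacity
        }
    }

    BitSet get(int i) { // get from the cache, falling back to disk
        BitSet bitSet = bitSetCache.get(i);
        if (bitSet == null) {
            // not in the cache: load it and put it in the cache
            bitSet = readFromDisk(i);
            bitSetCache.put(i, bitSet);
        }
        return bitSet;
    }

    BitSet readFromDisk(int i) {
        // placeholder: load bit set i from permanent storage
        throw new UnsupportedOperationException("not implemented");
    }
}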

That way:

  • you have transparent access to your 10 bit sets
  • you keep the most recently accessed bit sets in memory
  • you limit memory use to the size of the cache (the minimum size should be 3 if you want to create a bit set by combining 2 others)

If this is an option for your requirements, I could develop it a little more. In any case, it can be adapted to other eviction strategies, LRU being the simplest because LinkedHashMap supports it natively.
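
To show the eviction behaviour in isolation, here is a small self-contained sketch. LruBitSetCache, the loader function and the loads counter are hypothetical; the loader stands in for whatever reads a bit set from disk.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntFunction;

// Hypothetical LRU cache of bit sets built on an access-ordered LinkedHashMap.
class LruBitSetCache {
    private final int capacity;
    private final IntFunction<BitSet> loader; // stand-in for reading a bit set from disk
    private final LinkedHashMap<Integer, BitSet> map;
    int loads = 0;                            // how often the loader was called

    LruBitSetCache(int capacity, IntFunction<BitSet> loader) {
        this.capacity = capacity;
        this.loader = loader;
        this.map = new LinkedHashMap<Integer, BitSet>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, BitSet> eldest) {
                return size() > LruBitSetCache.this.capacity;
            }
        };
    }

    BitSet get(int id) {
        BitSet bits = map.get(id); // a hit also marks the entry as most recently used
        if (bits == null) {
            bits = loader.apply(id);
            loads++;
            map.put(id, bits);     // may evict the least recently used entry
        }
        return bits;
    }

    boolean isCached(int id) {
        return map.containsKey(id); // containsKey does not change the access order
    }

    public static void main(String[] args) {
        LruBitSetCache cache = new LruBitSetCache(3, id -> {
            BitSet b = new BitSet();
            b.set(id);
            return b;
        });
        cache.get(0); cache.get(1); cache.get(2);
        cache.get(0);              // touch 0, so 1 is now the eldest
        cache.get(3);              // over capacity: evicts 1
        System.out.println(cache.isCached(0) + " " + cache.isCached(1) + " " + cache.loads);
    }
}
```

With a capacity of 3, two operands and one result fit at once, which matches the minimum mentioned above.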

The best solution depends a great deal on the usage patterns and structure of the data.

If your data has some structure beyond a raw bit blob, you might be able to do better with a different data structure. For example, a word list can be represented very efficiently in both space and lookup time using a directed acyclic graph (DAG).

Sample Directed Graph and Topological Sort Code

BitSet is internally represented as a long[], which makes it slightly more difficult to refactor. If you grab the source from OpenJDK, you'd want to rewrite it so that internally it uses iterators backed by either files or in-memory compressed blobs. That means rewriting all the loops in BitSet to use those iterators, so that the entire blob never has to be instantiated.

http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/BitSet.java
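
As a rough illustration of the "never instantiate the whole blob" idea, the sketch below ANDs two files of 64-bit words one word at a time through buffered streams. It is not a rewrite of BitSet; StreamingAnd and the file layout (big-endian longs, as written by DataOutputStream) are assumptions made for the example.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: file-backed AND that keeps only stream buffers in memory.
class StreamingAnd {
    // ANDs the words of a and b into out; words past the shorter file count as zero and are dropped.
    static void and(Path a, Path b, Path out) throws IOException {
        long words = Math.min(Files.size(a), Files.size(b)) / Long.BYTES;
        try (DataInputStream inA = open(a);
             DataInputStream inB = open(b);
             DataOutputStream result = new DataOutputStream(
                     new BufferedOutputStream(Files.newOutputStream(out)))) {
            for (long i = 0; i < words; i++) {
                result.writeLong(inA.readLong() & inB.readLong());
            }
        }
    }

    private static DataInputStream open(Path p) throws IOException {
        return new DataInputStream(new BufferedInputStream(Files.newInputStream(p)));
    }

    // Small helpers for the demo: write and read a file of words.
    static void write(Path p, long... words) throws IOException {
        try (DataOutputStream o = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(p)))) {
            for (long w : words) {
                o.writeLong(w);
            }
        }
    }

    static long[] read(Path p) throws IOException {
        long[] words = new long[(int) (Files.size(p) / Long.BYTES)];
        try (DataInputStream in = open(p)) {
            for (int i = 0; i < words.length; i++) {
                words[i] = in.readLong();
            }
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        Path a = Files.createTempFile("a", ".bits");
        Path b = Files.createTempFile("b", ".bits");
        Path out = Files.createTempFile("out", ".bits");
        write(a, 0b1100L, -1L, 0L);
        write(b, 0b1010L, 5L, 7L);
        and(a, b, out);
        System.out.println(java.util.Arrays.toString(read(out)));
    }
}
```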

