
How to create efficient bit set structure for big data?

Java's BitSet is held entirely in memory and offers no compression.

Say I have 1 billion entries in a bit map: that occupies 125 MB in memory. Say I have to do AND and OR operations on 10 such bit maps; that takes 1250 MB, or about 1.3 GB of memory, which is unacceptable. How can I do fast operations on such bit maps without holding them uncompressed in memory?
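For reference, this is the uncompressed baseline being described: java.util.BitSet keeps one bit per entry on the heap (10^9 bits / 8 bits per byte = 125,000,000 bytes ≈ 125 MB per set), and `and`/`or` mutate a set in place, so combining sets non-destructively means cloning first:

```java
import java.util.BitSet;

class BitSetDemo {
    public static void main(String[] args) {
        BitSet a = new BitSet();
        BitSet b = new BitSet();
        a.set(3); a.set(10);
        b.set(10); b.set(20);

        // and()/or() modify the receiver, so clone before combining
        BitSet and = (BitSet) a.clone();
        and.and(b);                           // intersection: {10}
        BitSet or = (BitSet) a.clone();
        or.or(b);                             // union: {3, 10, 20}
        System.out.println(and + " " + or);   // {10} {3, 10, 20}
    }
}
```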

I do not know the distribution of the bits in the bit set.

I have also looked at JavaEWAH, a variant of the Java BitSet class that uses run-length encoding (RLE) compression.

Is there any better solution?

One solution is to keep the arrays off the heap.

You'll want to read this answer by @PeterLawrey to a related question.

In summary, the performance of memory-mapped files in Java is quite good, and they avoid keeping huge collections of objects on the heap.

The operating system may limit the size of an individual memory-mapped region. It's easy to work around this limitation by mapping multiple regions. If the regions are a fixed size, simple binary operations on the entity's index can be used to find the corresponding memory-mapped region in the list of memory-mapped files.
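A minimal sketch of that multi-region scheme (class name, region size, and the file layout are illustrative choices, not from the answer; real code would pick regions of hundreds of MB and handle the last, partial region):

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// A bit array backed by several fixed-size memory-mapped regions of one file.
class MappedBitArray implements AutoCloseable {
    private static final int REGION_BYTES = 1 << 20;            // 1 MB per mapping, small for the demo
    private static final long REGION_BITS = REGION_BYTES * 8L;
    private final MappedByteBuffer[] regions;
    private final RandomAccessFile file;

    MappedBitArray(String path, long bits) throws Exception {
        file = new RandomAccessFile(path, "rw");
        int n = (int) ((bits + REGION_BITS - 1) / REGION_BITS);  // number of regions needed
        regions = new MappedByteBuffer[n];
        FileChannel ch = file.getChannel();
        for (int i = 0; i < n; i++) {
            // READ_WRITE mapping past EOF extends the file as needed
            regions[i] = ch.map(FileChannel.MapMode.READ_WRITE,
                                (long) i * REGION_BYTES, REGION_BYTES);
        }
    }

    // Simple arithmetic on the bit index picks the region, byte, and bit.
    void set(long bit) {
        MappedByteBuffer r = regions[(int) (bit / REGION_BITS)];
        int byteIdx = (int) ((bit % REGION_BITS) >>> 3);
        r.put(byteIdx, (byte) (r.get(byteIdx) | (1 << (bit & 7))));
    }

    boolean get(long bit) {
        MappedByteBuffer r = regions[(int) (bit / REGION_BITS)];
        int byteIdx = (int) ((bit % REGION_BITS) >>> 3);
        return (r.get(byteIdx) & (1 << (bit & 7))) != 0;
    }

    @Override public void close() throws Exception { file.close(); }
}
```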

Are you sure you need compression? Compression trades time for space. It's possible that the reduced I/O ends up saving you time, but it's also possible that it won't. Can you add an SSD?

If you haven't yet tried memory-mapped files, start with that. I'd take a close look at implementing something on top of Peter's Chronicle.

If you need more speed, you could try doing your binary operations in parallel.
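As one sketch of that parallel idea (assuming the bit sets are already available as long[] word arrays, e.g. via BitSet.toLongArray()): each 64-bit word of the result depends only on the corresponding words of the operands, so the AND splits trivially across threads.

```java
import java.util.BitSet;
import java.util.stream.IntStream;

class ParallelAnd {
    // AND two word arrays in parallel; each index is independent of the others.
    // Words beyond the shorter array are implicitly zero, so they drop out.
    static long[] parallelAnd(long[] a, long[] b) {
        long[] out = new long[Math.min(a.length, b.length)];
        IntStream.range(0, out.length).parallel()
                 .forEach(i -> out[i] = a[i] & b[i]);
        return out;
    }

    public static void main(String[] args) {
        BitSet x = new BitSet(); x.set(1); x.set(100);
        BitSet y = new BitSet(); y.set(100); y.set(200);
        BitSet and = BitSet.valueOf(parallelAnd(x.toLongArray(), y.toLongArray()));
        System.out.println(and);   // {100}
    }
}
```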

If you end up needing compression, you could always implement it on top of Chronicle's memory-mapped arrays.

From the comments, here is what I would say as a complement to your initial question:

  • the distribution of the bit fields is unknown, so BitSet is probably the best structure we can use
  • you have to use the bit fields in different modules and want to cache them

That being said, my advice would be to implement a dedicated cache solution, using a LinkedHashMap with access order if LRU is an acceptable eviction strategy, with permanent storage on disk for the BitSets.

Pseudo code:

class BitSetHolder {

    class BitSetCache extends LinkedHashMap<Integer, BitSet> {
        BitSetCache() {
            super(size, loadFactor, true); // access order
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<Integer, BitSet> eldest) {
            return size() > BitSetHolder.this.size; // size is known in BitSetHolder
        }
    }

    BitSet get(int i) { // get from the cache, or from disk if absent
        if (bitSetCache.containsKey(i)) {
            return bitSetCache.get(i);
        }
        // not in the cache: load it from disk and cache it
        BitSet bitSet = readFromDisk(i);
        bitSetCache.put(i, bitSet);
        return bitSet;
    }
}

That way:

  • you have transparent access to your 10 bit sets
  • you keep the most recently accessed bit sets in memory
  • you limit memory use to the size of the cache (the minimum size should be 3 if you want to create a bit set by combining 2 others)

If this is an option for your requirements, I could develop it a little more. In any case, this is adaptable to other eviction strategies, LRU being the simplest since it is native to LinkedHashMap.

The best solution depends a great deal on the usage patterns and structure of the data.

If your data has some structure beyond a raw bit blob, you might be able to do better with a different data structure. For example, a word list can be represented very efficiently in both space and lookup time using a DAG.

Sample Directed Graph and Topological Sort Code 样本有向图和拓扑排序代码

BitSet is internally represented as a long[], which makes it slightly more difficult to refactor. If you grab the source out of the OpenJDK, you'd want to rewrite it so that internally it uses iterators, backed by either files or in-memory compressed blobs. However, you would have to rewrite all the loops in BitSet to use iterators, so that the entire blob never has to be instantiated.
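A minimal sketch of that iterator idea (the names here are illustrative, not from the JDK source): expose each operand as an iterator of 64-bit words, so an AND can be computed and consumed one word at a time without ever holding a full long[] for either side.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.PrimitiveIterator;
import java.util.function.LongConsumer;

class StreamingAnd {
    // Word-at-a-time AND over two iterators of 64-bit words. Neither operand
    // has to be fully materialized: the iterators could just as well read
    // words from a file or decode them from a compressed blob.
    static void andInto(PrimitiveIterator.OfLong a, PrimitiveIterator.OfLong b,
                        LongConsumer sink) {
        while (a.hasNext() && b.hasNext()) {
            sink.accept(a.nextLong() & b.nextLong());
        }
        // words left over in the longer operand would AND to zero, so they are dropped
    }

    public static void main(String[] args) {
        BitSet x = new BitSet(); x.set(5); x.set(64);
        BitSet y = new BitSet(); y.set(64);
        // Here the word iterators come from in-memory arrays, purely for the demo.
        andInto(Arrays.stream(x.toLongArray()).iterator(),
                Arrays.stream(y.toLongArray()).iterator(),
                w -> System.out.println(Long.toBinaryString(w)));
    }
}
```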

http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/BitSet.java
