How to create an efficient bit set structure for big data?
Java's BitSet is in memory and has no compression.
Say I have 1 billion entries in a bit map: that occupies 125 MB of memory. Say I have to do AND and OR operations on 10 such bit maps; that takes 1250 MB, or about 1.3 GB of memory, which is unacceptable. How can I do fast operations on such bit maps without holding them uncompressed in memory?
I do not know the distribution of the bits in the bit set.
I have also looked at JavaEWAH, a variant of the Java BitSet class that uses run-length encoding (RLE) compression.
Is there any better solution?
One solution is to keep the arrays off the heap.
You'll want to read this answer by @PeterLawrey to a related question. In summary, the performance of memory-mapped files in Java is quite good, and they avoid keeping huge collections of objects on the heap.
The operating system may limit the size of an individual memory-mapped region. It's easy to work around this limitation by mapping multiple regions. If the regions are a fixed size, simple binary operations on the entity's index can be used to find the corresponding region in the list of memory-mapped files.
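As a sketch of that index arithmetic (the class name `MappedBitSet` and the 1 MiB region size are my own illustrative assumptions, not an existing API), a huge bit set can be split across several fixed-size memory-mapped regions of one file:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a large bit set backed by fixed-size
// memory-mapped regions of a single file.
class MappedBitSet {
    static final int REGION_BYTES = 1 << 20;               // 1 MiB per region (assumed)
    static final long BITS_PER_REGION = REGION_BYTES * 8L;
    private final List<MappedByteBuffer> regions = new ArrayList<>();

    MappedBitSet(Path file, long totalBits) throws IOException {
        long totalBytes = (totalBits + 7) / 8;
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            // Map the file in fixed-size chunks; the last region may be shorter.
            for (long off = 0; off < totalBytes; off += REGION_BYTES) {
                long len = Math.min(REGION_BYTES, totalBytes - off);
                regions.add(ch.map(FileChannel.MapMode.READ_WRITE, off, len));
            }
        }
    }

    // Simple arithmetic locates the region, the byte within it, and the bit.
    void set(long bitIndex) {
        MappedByteBuffer region = regions.get((int) (bitIndex / BITS_PER_REGION));
        int byteInRegion = (int) ((bitIndex % BITS_PER_REGION) >>> 3);
        int bit = (int) (bitIndex & 7);
        region.put(byteInRegion, (byte) (region.get(byteInRegion) | (1 << bit)));
    }

    boolean get(long bitIndex) {
        MappedByteBuffer region = regions.get((int) (bitIndex / BITS_PER_REGION));
        int byteInRegion = (int) ((bitIndex % BITS_PER_REGION) >>> 3);
        int bit = (int) (bitIndex & 7);
        return (region.get(byteInRegion) & (1 << bit)) != 0;
    }
}
```

Because the regions are a fixed size, locating a bit is just a division (region), a shift (byte), and a mask (bit), so lookups stay O(1) no matter how many regions the file is split into.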
Are you sure you need compression? Compression trades time for space. It's possible that the reduced I/O ends up saving you time, but it's also possible that it won't. Can you add an SSD?
If you haven't yet tried memory-mapped files, start with that. I'd take a close look at implementing something on top of Peter's Chronicle.
If you need more speed, you could try doing your binary operations in parallel.
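A minimal sketch of that idea, assuming the bit maps are exposed as the long[] word arrays that back them (the helper name parallelAnd is made up for illustration):

```java
import java.util.stream.IntStream;

class ParallelBitOps {
    // Destructively AND the words of b into a, one word per stream element.
    // Each index is written by exactly one task, so this is race-free.
    static void parallelAnd(long[] a, long[] b) {
        IntStream.range(0, Math.min(a.length, b.length))
                 .parallel()
                 .forEach(i -> a[i] &= b[i]);
    }
}
```

The same shape works for OR and XOR; for bit maps of ~16 MB of words each, the fork/join splitting overhead is small relative to the memory traffic.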
If you end up needing compression, you could always implement it on top of Chronicle's memory-mapped arrays.
From the comments, here is what I would add as a complement to your initial question: BitSet is probably the best we can use. That being said, my advice would be to implement a dedicated cache solution, using a LinkedHashMap with access order if LRU is an acceptable eviction strategy, and keeping permanent storage for the BitSets on disk.
Pseudo code:
class BitSetHolder {
    private final int size = 100; // max number of BitSets kept in memory
    private final BitSetCache bitSetCache = new BitSetCache();

    class BitSetCache extends LinkedHashMap<Integer, BitSet> {
        BitSetCache() {
            super(16, 0.75f, true); // true => access order, giving LRU behavior
        }
        @Override
        protected boolean removeEldestEntry(Map.Entry<Integer, BitSet> eldest) {
            return size() > BitSetHolder.this.size; // size is known in BitSetHolder
        }
    }

    BitSet get(int i) { // get from cache, falling back to disk
        BitSet bitSet = bitSetCache.get(i);
        if (bitSet == null) {
            // not in cache: load from disk and cache it
            bitSet = readFromDisk(i);
            bitSetCache.put(i, bitSet);
        }
        return bitSet;
    }
}
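For reference, the access-order eviction that the cache above relies on can be demonstrated in isolation (the class name LruDemo and the capacity of 2 are illustrative only):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Standalone demo of the LRU mechanism: an access-ordered LinkedHashMap
// that evicts its eldest (least recently accessed) entry past maxEntries.
class LruDemo {
    static LinkedHashMap<Integer, String> newLru(int maxEntries) {
        return new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, String> eldest) {
                return size() > maxEntries;
            }
        };
    }
}
```

Note that with the third constructor argument set to true, a get() counts as an access and moves the entry to the most-recently-used position, which is exactly what makes the eviction LRU rather than FIFO.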
That way:
If this is an option for your requirements, I could develop this a little more. In any case, this is adaptable to other eviction strategies; LRU is the simplest since it is native to LinkedHashMap.
The best solution depends a great deal on the usage patterns and structure of the data.
If your data has some structure beyond a raw bit blob, you might be able to do better with a different data structure. For example, a word list can be represented very efficiently in both space and lookup time using a DAG.
Sample Directed Graph and Topological Sort Code
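As an illustration of why structured data can beat a raw bit blob, here is a minimal trie for a word list; a DAWG goes further by merging identical suffix subtrees, which is what makes the representation so compact. (TrieNode is my own sketch, not taken from the linked sample.)

```java
import java.util.HashMap;
import java.util.Map;

// Minimal trie: shared prefixes are stored once, and membership
// tests cost O(word length) regardless of how many words are stored.
class TrieNode {
    final Map<Character, TrieNode> next = new HashMap<>();
    boolean terminal; // true if a word ends at this node

    void add(String word) {
        TrieNode node = this;
        for (char c : word.toCharArray()) {
            node = node.next.computeIfAbsent(c, k -> new TrieNode());
        }
        node.terminal = true;
    }

    boolean contains(String word) {
        TrieNode node = this;
        for (char c : word.toCharArray()) {
            node = node.next.get(c);
            if (node == null) return false;
        }
        return node.terminal;
    }
}
```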
BitSet is internally represented as a long[], which makes it slightly more difficult to refactor. If you grab the source out of the OpenJDK, you'd want to rewrite it so that internally it uses iterators, backed by either files or in-memory compressed blobs. However, you would have to rewrite all the loops in BitSet to use iterators, so the entire blob never has to be instantiated.
http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/BitSet.java
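The streaming idea can be sketched without touching BitSet itself: process the underlying bytes chunk by chunk, so only a fixed-size buffer is ever resident no matter how large the bit sets are (streamedAnd is a hypothetical helper, not part of any library):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

class StreamedBitOps {
    // AND two on-disk bit sets chunk by chunk; memory use is two 8 KiB
    // buffers regardless of the total size of the bit sets.
    static void streamedAnd(InputStream a, InputStream b, OutputStream out)
            throws IOException {
        byte[] bufA = new byte[8192];
        byte[] bufB = new byte[8192];
        int n;
        while ((n = a.readNBytes(bufA, 0, bufA.length)) > 0) {
            int m = b.readNBytes(bufB, 0, n);
            for (int i = 0; i < m; i++) {
                bufA[i] &= bufB[i]; // combine one chunk in place
            }
            out.write(bufA, 0, m);
        }
    }
}
```

This is effectively what an iterator-based BitSet rewrite would do internally: each loop consumes the words lazily from a backing source instead of requiring the whole long[] up front.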