
Key-value store suggestion

Original post 2011-07-10 04:01:52 8 6 java/ nosql

I need a very basic key-value store for Java. I started with a HashMap, but it seems that HashMap is somewhat space-inefficient (I'm storing ~20 million records, and it seems to require ~6 GB of RAM).

The map is a Map<Integer,String>, so I'm considering using GNU Trove's TIntObjectHashMap<byte[]> and storing each map value as an ASCII byte array rather than a String.
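As a rough illustration of the value-encoding idea, the sketch below converts each String to an ASCII byte[] before storing it and decodes it again on lookup. It uses a plain java.util.HashMap as a stand-in so that it runs without extra jars; with Trove the container would be a TIntObjectHashMap<byte[]> instead. The class name AsciiValueStore and its methods are hypothetical, and the sketch assumes every value is pure ASCII (the encoder replaces any other character with '?').

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical wrapper: stores values as ASCII bytes instead of String objects.
class AsciiValueStore {
    // Stand-in container; a primitive-keyed map such as Trove's would replace it.
    private final Map<Integer, byte[]> map = new HashMap<>();

    void put(int key, String value) {
        // One byte per character; non-ASCII characters are replaced with '?'.
        map.put(key, value.getBytes(StandardCharsets.US_ASCII));
    }

    String get(int key) {
        byte[] raw = map.get(key);
        // Decode only when the value is actually read.
        return raw == null ? null : new String(raw, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        AsciiValueStore store = new AsciiValueStore();
        store.put(42, "hello");
        System.out.println(store.get(42));
    }
}
```

Each lookup allocates a fresh String, so this trades some CPU time for memory. On Java 9 and later, compact strings already store Latin-1 text at one byte per character, so the saving from this trick is smaller there.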

As an alternative, is there a key-value store that only requires adding jar files, does not hold the entire map in RAM at once, and is still reasonably fast?

6 answers

BabuDB

BabuDB is an embedded non-relational database system. Its lean and simple design allows it to persistently store large amounts of key-value pairs without the overhead and complexity of similar approaches such as BerkeleyDB.

License: New BSD license, Language: Java

JDBM2

JDBM2 provides HashMap and TreeMap implementations which are backed by disk storage.

License: Apache License 2.0, Language: Java

Banana DB

Banana DB is a self-contained key/value pair database implemented in Java.

License: Apache License 2.0, Language: Java

I've tried BabuDB and JDBM2 and they work fine. BabuDB is a little more difficult to set up, but potentially delivers higher performance than JDBM2.

These are all databases that persist data on disk. There are also solutions for holding a large map in memory (ehcache, hazelcast, ...).

Use Berkeley DB.

Berkeley DB stores object graphs, objects in collections, or simple binary key/value data directly in a B-tree on disk. This simple, highly efficient approach removes all the unnecessary overhead in ORM solutions. Using the Direct Persistence Layer (DPL), Java developers annotate classes with storage information, much like JPA. This approach is familiar, efficient, and fast. The DPL reduces the complexity of data storage while not sacrificing speed.

This should definitely give you huge gains in memory and speed, while not increasing the complexity of your application. Enjoy!

http://www.mapdb.org/ is what you are looking for. It's a rocking fast persistent implementation of java.util.Map.

Features

Concurrent

MapDB has record-level locking and a state-of-the-art concurrent engine. Its performance scales nearly linearly with the number of cores. Data can be written by multiple parallel threads.

Fast

MapDB has outstanding performance rivaled only by native DBs. It is the result of more than a decade of optimizations and rewrites.

ACID

MapDB optionally supports ACID transactions with full MVCC isolation. MapDB uses a write-ahead log or an append-only store for great write durability.

Flexible

MapDB can be used everywhere from an in-memory cache to a multi-terabyte database. It also has a number of options to trade durability for write performance. This makes it very easy to configure MapDB to exactly fit your needs.

Hackable

MapDB is component-based; most features (instance cache, async writes, compression) are just class wrappers. It is very easy to introduce new functionality or a new component into MapDB.

SQL-like

MapDB was developed as a faster alternative to an SQL engine. It has a number of features which make the transition from a relational database easier: secondary indexes/collections, autoincremental sequential IDs, joins, triggers, composite keys…

Low disk-space usage

MapDB has a number of features (serialization, delta key packing…) to minimize the disk space used by its store. It also has very fast compression and custom serializers. We take disk usage seriously and do not waste a single byte.

Consider Koloboke Collections, which is up to 2 times faster than Trove according to various tests, if configured to consume the same memory as Trove. Alternatively, you can think of it as consuming considerably less memory if configured to be as fast as Trove.

If you want to persist the map between JVM runs with very quick bootstrap, you might also be interested in Chronicle-Map, which stores Strings in UTF-8 by default (so you don't need to bother with String <-> byte[] conversions, as you would with Koloboke/Trove). Chronicle-Map is ultra fast for a persisted key-value store, but a bit slower than Koloboke and even Trove.

Just wanted to reference some more open source options that have become available since this question was first asked.

Apache 2, BTree, Apache Directory Project JDBM replacement effort:

http://directory.apache.org/mavibot/

MPL2/EPL1, RTree, MVStore, H2 Storage Engine:

http://www.h2database.com/html/mvstore.html

Apache 2, Xodus Environments, JetBrains YouTrack and Hub storage engine:

https://github.com/JetBrains/xodus

The map is Map<Integer,String>, and so I'm considering using GNU Trove TIntObjectHashMap<byte[]>, and storing the map value as an ascii byte array rather than String.

This doesn't entirely make sense, because a TIntObjectHashMap is not a Map. However, the approach is sound.
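To make the distinction concrete, here is a minimal sketch of a hash table keyed by primitive int values. It is not Trove's implementation, just an illustration under simple assumptions: open addressing with linear probing, no removal, and no null values. The class name IntObjectMap is hypothetical. Note that it does not implement java.util.Map, which is exactly why keys never need to be boxed into Integer objects.

```java
import java.util.Objects;

// Hypothetical int-keyed hash table; keys live in a plain int[] with no boxing.
class IntObjectMap<V> {
    private int[] keys = new int[16];          // capacity is always a power of two
    private Object[] values = new Object[16];  // null marks an empty slot
    private int size;

    void put(int key, V value) {
        Objects.requireNonNull(value, "null values are not supported in this sketch");
        // Keep the table at most half full so probing always finds an empty slot.
        if ((size + 1) * 2 > keys.length) {
            grow();
        }
        if (insert(keys, values, key, value)) {
            size++;
        }
    }

    @SuppressWarnings("unchecked")
    V get(int key) {
        int mask = keys.length - 1;
        for (int i = slot(key, mask); values[i] != null; i = (i + 1) & mask) {
            if (keys[i] == key) {
                return (V) values[i];
            }
        }
        return null; // reached an empty slot: key is absent
    }

    int size() {
        return size;
    }

    // Returns true if a new key was added, false if an existing key was overwritten.
    private static boolean insert(int[] ks, Object[] vs, int key, Object value) {
        int mask = ks.length - 1;
        int i = slot(key, mask);
        while (vs[i] != null) {
            if (ks[i] == key) {
                vs[i] = value;
                return false;
            }
            i = (i + 1) & mask; // linear probing
        }
        ks[i] = key;
        vs[i] = value;
        return true;
    }

    // Doubles the capacity and re-inserts every occupied slot.
    private void grow() {
        int[] newKeys = new int[keys.length * 2];
        Object[] newValues = new Object[newKeys.length];
        for (int i = 0; i < keys.length; i++) {
            if (values[i] != null) {
                insert(newKeys, newValues, keys[i], values[i]);
            }
        }
        keys = newKeys;
        values = newValues;
    }

    // Scrambles the key bits so that sequential keys spread across the table.
    private static int slot(int key, int mask) {
        int h = key * 0x9E3779B9;
        return (h ^ (h >>> 16)) & mask;
    }

    public static void main(String[] args) {
        IntObjectMap<byte[]> map = new IntObjectMap<>();
        for (int i = 0; i < 1000; i++) {
            map.put(i, new byte[] {(byte) i});
        }
        System.out.println(map.size());
    }
}
```

Code written against the Map interface cannot use a structure like this directly, so switching to a primitive-keyed map usually means changing the declared type at every use site.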

Do you know what kind of space savings I can expect with Trove over HashMap?

The best answer is to try it out.

But here are some rough estimates (assuming a 32-bit JVM):

HashMap keys would need to be Integer instances. They will occupy ~18 bytes per instance + 4 bytes per reference. Total: 24 bytes.
Trove keys would be 4-byte int values.
String values would be 20 bytes + 12 bytes + 2 * number of "characters".
Byte array values would be 12 bytes + 1 * number of "characters".
I haven't examined the details of the respective hash table internal data structures.

That probably amounts to around a 50% memory saving, though it depends critically on the average length of the value "strings". (The longer they are, the more they will dominate the space usage.)
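The arithmetic behind these estimates can be sketched as follows. The per-entry figures are copied from the rough 32-bit numbers above; the class and method names are hypothetical, and the overhead of the hash tables' own internal structures is deliberately left out, as in the estimates.

```java
// Hypothetical helper that plugs the rough 32-bit JVM figures into a per-entry total.
class MemoryEstimate {
    // HashMap<Integer,String>: boxed key plus String value.
    static long hashMapEntryBytes(int chars) {
        long key = 24;                      // Integer instance + reference
        long value = 20 + 12 + 2L * chars;  // String with 2 bytes per character
        return key + value;
    }

    // Primitive int key plus byte[] value.
    static long troveEntryBytes(int chars) {
        long key = 4;                       // primitive int
        long value = 12 + 1L * chars;       // byte[] with 1 byte per character
        return key + value;
    }

    // Percentage of per-entry memory saved by switching representations.
    static double savingPercent(int chars) {
        long before = hashMapEntryBytes(chars);
        long after = troveEntryBytes(chars);
        return 100.0 * (before - after) / before;
    }

    public static void main(String[] args) {
        for (int chars : new int[] {10, 50, 200}) {
            System.out.printf("%d chars: %d -> %d bytes per entry (%.1f%% saved)%n",
                    chars, hashMapEntryBytes(chars), troveEntryBytes(chars),
                    savingPercent(chars));
        }
    }
}
```

With these figures the saving works out to (40 + n) / (56 + 2n) for values of n characters, so it is larger for short values and approaches 50% as the values get longer.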

FWIW, Trove publishes its own benchmarks. They don't look very convincing, but you should be able to dig out the benchmark code and modify it to better match your use case.