Memory-efficient key-value store in Java

Question

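I store 111 million key-value pairs (one key can have multiple values, at most 2 or 3), where the keys are 50-bit integers and the values are 32-bit (max) integers. My requirements are:

Fast insertion of (key, value) pairs [duplicates allowed]

Fast retrieval of the value or values for a given key.

A nice solution based on MultiMap is given here. However, I want to store more key-value pairs in main memory with little or no performance penalty. From web articles I learned that a B+ Tree, R+ Tree, B Tree, Compact Multimap, etc. could be a good solution. Can anybody help me with this:

Is there any Java library that properly satisfies all of these needs (the data structures above or any others are acceptable)? What I want is an efficient Java library data structure to store and retrieve key-value(s) pairs that has a small memory footprint and is built entirely in memory.

Note: I have tried HashMultiMap (Guava, with some modifications using Trove) as mentioned by Louis Wasserman, as well as Kyoto/Tokyo Cabinet and others. My experience with disk-backed solutions has not been good, so please avoid those :). Another important point for choosing the library or data structure: keys are 50 bits (so if we allocate 64 bits, 14 bits are wasted) and values are 32-bit ints (max), mostly 10-12-14 bits. So we can save space there as well.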
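The space-saving idea in the note above can be sketched as follows. This is only an illustration under a strong assumption: it covers the common case where a value fits in 14 bits, so that a 50-bit key and its value share one 64-bit long. The class name `PackedEntry` and its methods are hypothetical and are not taken from any library.

```java
// Hypothetical helper: packs a 50-bit key and a 14-bit value into one long.
// Assumes 0 <= key < 2^50 and 0 <= value < 2^14; wider values need another layout.
final class PackedEntry {
    static final int VALUE_BITS = 14;
    static final long VALUE_MASK = (1L << VALUE_BITS) - 1;
    static final long MAX_KEY = (1L << 50) - 1;

    static long pack(long key, int value) {
        if (key < 0 || key > MAX_KEY) {
            throw new IllegalArgumentException("key needs more than 50 bits: " + key);
        }
        if (value < 0 || value > VALUE_MASK) {
            throw new IllegalArgumentException("value needs more than 14 bits: " + value);
        }
        // key occupies the upper 50 bits, value the lower 14 bits
        return (key << VALUE_BITS) | value;
    }

    static long key(long packed) {
        // unsigned shift, because the top bit is set for keys >= 2^49
        return packed >>> VALUE_BITS;
    }

    static int value(long packed) {
        return (int) (packed & VALUE_MASK);
    }
}
```

With this layout an entry takes 8 bytes in a single long[] instead of 12 bytes across a long[] and an int[]. Values wider than 14 bits would have to go into a separate overflow structure, and the packed word for key 0 with value 0 is 0, so a table built on it cannot use 0 as its empty-slot marker without further care.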

Answer 1

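I don't think there is anything in the JDK that will do this.

However, implementing such a thing is a simple matter of programming. Here is an open-addressed hash table with linear probing, with keys and values stored in parallel arrays: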

public class LongIntParallelHashMultimap {

    private static final long NULL = 0L;

    private final long[] keys;
    private final int[] values;
    private int size;

    public LongIntParallelHashMultimap(int capacity) {
        keys = new long[capacity];
        values = new int[capacity];
    }

    public void put(long key, int value) {
        if (key == NULL) throw new IllegalArgumentException("key cannot be " + NULL);
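        // keep at least one slot empty; with a completely full table the probe loop in get() would never terminate
        if (size >= keys.length - 1) throw new IllegalStateException("map is full");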

        int index = indexFor(key);
        while (keys[index] != NULL) {
            index = successor(index);
        }
        keys[index] = key;
        values[index] = value;
        ++size;
    }

    public int[] get(long key) {
        if (key == NULL) throw new IllegalArgumentException("key cannot be " + NULL);

        int index = indexFor(key);
        int count = countHits(key, index);

        int[] hits = new int[count];
        int hitIndex = 0;

        while (keys[index] != NULL) {
            if (keys[index] == key) {
                hits[hitIndex] = values[index];
                ++hitIndex;
            }
            index = successor(index);
        }

        return hits;
    }

    private int countHits(long key, int index) {
        int numHits = 0;
        while (keys[index] != NULL) {
            if (keys[index] == key) ++numHits;
            index = successor(index);
        }
        return numHits;
    }

    private int indexFor(long key) {
        // the hashing constant is (the golden ratio * Long.MAX_VALUE) + 1
        // see The Art of Computer Programming, section 6.4
        // the constant has two important properties:
        // (1) it is coprime with 2^64, so multiplication by it is a bijective function, and does not generate collisions in the hash
        // (2) it has a 1 in the bottom bit, so it does not add zeroes in the bottom bits of the hash, and does not generate (gratuitous) collisions in the index
        long hash = key * 5700357409661598721L;
        return Math.abs((int) (hash % keys.length));
    }

    private int successor(int index) {
        return (index + 1) % keys.length;
    }

    public int size() {
        return size;
    }

}

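Note that this is a fixed-size structure. You need to create it big enough to hold all your data; for me, 110 million entries take up 1.32 GB. The bigger you make it, beyond what you need to store the data, the faster insertions and lookups will be. I found that for 110 million entries, with a load factor of 0.5 (2.64 GB, twice as much space as needed), a key lookup took 403 nanoseconds on average, but with a load factor of 0.75 (1.76 GB, a third more space than needed), it took 575 nanoseconds. Decreasing the load factor below 0.5 usually doesn't make much difference; indeed, with a load factor of 0.33 (4.00 GB, three times the space needed), I got an average time of 394 nanoseconds. So even if you have 5 GB available, don't use it all.

Note also that zero is not allowed as a key. If this is a problem, change the null value to something else, and pre-fill the keys array with that value on creation.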
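The memory figures above follow from 12 bytes per slot (an 8-byte long key plus a 4-byte int value) divided by the load factor. A minimal sketch of that arithmetic, assuming the parallel-array layout described in this answer and ignoring array headers; the class `TableSizing` and its method names are made up for illustration:

```java
// Hypothetical sizing helper for a table made of a long[] and an int[] of equal length.
final class TableSizing {
    static final int BYTES_PER_SLOT = Long.BYTES + Integer.BYTES; // 8 + 4 = 12

    /** Smallest capacity that keeps the table at or below the given load factor. */
    static int capacityFor(long entries, double loadFactor) {
        if (entries < 0) throw new IllegalArgumentException("entries must be non-negative");
        if (!(loadFactor > 0.0 && loadFactor <= 1.0)) {
            throw new IllegalArgumentException("load factor must be in (0, 1]");
        }
        long capacity = (long) Math.ceil(entries / loadFactor);
        if (capacity > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("capacity does not fit in a Java array");
        }
        return (int) capacity;
    }

    /** Approximate footprint of both arrays, excluding object headers. */
    static long bytesFor(int capacity) {
        return (long) capacity * BYTES_PER_SLOT;
    }

    public static void main(String[] args) {
        long entries = 110_000_000L;
        for (double loadFactor : new double[] {1.0, 0.75, 0.5, 0.33}) {
            int capacity = capacityFor(entries, loadFactor);
            System.out.printf("load factor %.2f -> capacity %d, about %.2f GB%n",
                    loadFactor, capacity, bytesFor(capacity) / 1e9);
        }
    }
}
```

The result of capacityFor can be passed straight to the map's constructor. The helper only reproduces the space side of the trade-off; the lookup times quoted above are measurements and cannot be derived this way.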

Answer 2

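"Is there any Java library that properly satisfies all of these needs?"

AFAIK no. Or at least, not one that minimizes the memory footprint.

However, it should be easy to write a custom map class that is specialized to these requirements.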

Answer 3

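It's a good idea to look at databases, because problems like this are exactly what they are designed for. In recent years key-value databases have become very popular, e.g. for web services (keyword "NoSQL"), so you should be able to find something.

The choice of a custom data structure also depends on whether you want to use a hard drive to store your data (and how safe that has to be), or whether the data may be lost completely when the program exits.

If implementing it manually, and the whole database fits easily into memory, I'd just implement a hashmap in C. Create a hash function that gives a (well-spread) memory address for a value; insert there, or next to it if that slot is already taken. Insertion and retrieval are then O(1). If you implement it in Java, you'll have a 4-byte overhead for each (primitive) object.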

Answer 4

Based on @Tom Anderson's solution, I removed the need to allocate objects and added a performance test.

import java.util.Arrays;
import java.util.Random;

public class LongIntParallelHashMultimap {
    private static final long NULL = Long.MIN_VALUE;

    private final long[] keys;
    private final int[] values;
    private int size;

    public LongIntParallelHashMultimap(int capacity) {
        keys = new long[capacity];
        values = new int[capacity];
        Arrays.fill(keys, NULL);
    }

    public void put(long key, int value) {
        if (key == NULL) throw new IllegalArgumentException("key cannot be " + NULL);
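        // keep at least one slot empty; with a completely full table the probe loop in get() would never terminate
        if (size >= keys.length - 1) throw new IllegalStateException("map is full");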

        int index = indexFor(key);
        while (keys[index] != NULL) {
            index = successor(index);
        }
        keys[index] = key;
        values[index] = value;
        ++size;
    }

    public int get(long key, int[] hits) {
        if (key == NULL) throw new IllegalArgumentException("key cannot be " + NULL);

        int index = indexFor(key);

        int hitIndex = 0;

        while (keys[index] != NULL) {
            if (keys[index] == key) {
                hits[hitIndex] = values[index];
                ++hitIndex;
                if (hitIndex == hits.length)
                    break;
            }
            index = successor(index);
        }

        return hitIndex;
    }

    private int indexFor(long key) {
        return Math.abs((int) (key % keys.length));
    }

    private int successor(int index) {
        index++;
        return index >= keys.length ? index - keys.length : index;
    }

    public int size() {
        return size;
    }

    public static class PerfTest {
        public static void main(String... args) {
            int values = 110* 1000 * 1000;
            long start0 = System.nanoTime();
            long[] keysValues = generateKeys(values);

            LongIntParallelHashMultimap map = new LongIntParallelHashMultimap(222222227);
            long start = System.nanoTime();
            addKeyValues(values, keysValues, map);
            long mid = System.nanoTime();
            int sum = lookUpKeyValues(values, keysValues, map);
            long time = System.nanoTime();
            System.out.printf("Generated %.1f M keys/s, Added %.1f M/s and looked up %.1f M/s%n",
                    values * 1e3 / (start - start0), values * 1e3 / (mid - start), values * 1e3 / (time - mid));
            System.out.println("Expected " + values + " got " + sum);
        }

        private static long[] generateKeys(int values) {
            Random rand = new Random();
            long[] keysValues = new long[values];
            for (int i = 0; i < values; i++)
                keysValues[i] = rand.nextLong();
            return keysValues;
        }

        private static void addKeyValues(int values, long[] keysValues, LongIntParallelHashMultimap map) {
            for (int i = 0; i < values; i++) {
                map.put(keysValues[i], i);
            }
            assert map.size() == values;
        }

        private static int lookUpKeyValues(int values, long[] keysValues, LongIntParallelHashMultimap map) {
            int[] found = new int[8];
            int sum = 0;
            for (int i = 0; i < values; i++) {
                sum += map.get(keysValues[i], found);
            }
            return sum;
        }
    }
}

prints

Generated 34.8 M keys/s, Added 11.1 M/s and looked up 7.6 M/s

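when run on a 3.8 GHz i7 with Java 7 update 3.

This is much slower than the previous test because you are accessing main memory randomly, rather than the cache. It is really a test of the speed of your memory. The writes are faster because they can be performed asynchronously to main memory.

Using this collection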

final SetMultimap<Long, Integer> map = Multimaps.newSetMultimap(
        TDecorators.wrap(new TLongObjectHashMap<Collection<Integer>>()),
        new Supplier<Set<Integer>>() {
            public Set<Integer> get() {
                return TDecorators.wrap(new TIntHashSet());
            }
        });

Running the same test with 50 million entries (which used about 16 GB) and -mx20g, I got the following result.

 Generated 47.2 M keys/s, Added 0.5 M/s and looked up 0.7 M/s

For 110 M entries you would need about 35 GB of memory and a machine 10 times faster than mine (3.8 GHz) to perform 5 million adds per second.
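The 35 GB and 10x figures appear to be straight-line extrapolations from the 50-million-entry run. A small sketch of that arithmetic, assuming memory grows linearly with the number of entries (an assumption, not something the test measured); `ScalingEstimate` and its methods are hypothetical names:

```java
// Hypothetical back-of-the-envelope helper; assumes linear scaling with entry count.
final class ScalingEstimate {
    /** Scales a measured quantity from measuredEntries up or down to targetEntries. */
    static double scaleLinearly(double measured, double measuredEntries, double targetEntries) {
        return measured * targetEntries / measuredEntries;
    }

    /** Factor by which a measured rate must improve to reach the target rate. */
    static double speedupNeeded(double measuredRate, double targetRate) {
        return targetRate / measuredRate;
    }

    public static void main(String[] args) {
        double memoryGb = scaleLinearly(16.0, 50e6, 110e6); // 16 GB was measured for 50 M entries
        double speedup = speedupNeeded(0.5, 5.0);           // 0.5 M adds/s measured, 5 M adds/s wanted
        System.out.printf("estimated memory: %.1f GB, speedup needed: %.0fx%n", memoryGb, speedup);
    }
}
```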

Answer 5

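If you must use Java, then implement your own hashtable/hashmap. An important property of your table is that it uses a linked list to handle collisions. Hence, when you do a lookup, you may return all the elements on the list.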
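One way to read this suggestion without allocating a node object per entry is to thread the chains through primitive arrays. The following is a minimal sketch under that assumption, not the answerer's own code; the class `ChainedLongIntMultimap` and its fields are hypothetical. Unlike the description above, get filters the chain by key, since unrelated keys can share a bucket.

```java
import java.util.Arrays;

// Illustrative sketch: separate chaining where each "linked list" is a chain of
// indices stored in an int array rather than a list of node objects.
final class ChainedLongIntMultimap {
    private static final int END = -1;

    private final int[] heads;   // heads[bucket] = index of the first entry in that chain, or END
    private final int[] next;    // next[i] = index of the following entry in the same chain, or END
    private final long[] keys;
    private final int[] values;
    private int size;

    ChainedLongIntMultimap(int buckets, int maxEntries) {
        heads = new int[buckets];
        Arrays.fill(heads, END);
        next = new int[maxEntries];
        keys = new long[maxEntries];
        values = new int[maxEntries];
    }

    void put(long key, int value) {
        if (size == keys.length) throw new IllegalStateException("map is full");
        int bucket = bucketFor(key);
        keys[size] = key;
        values[size] = value;
        next[size] = heads[bucket];  // the new entry becomes the head of its chain
        heads[bucket] = size;
        ++size;
    }

    /** Copies up to hits.length values stored under key into hits and returns how many were copied. */
    int get(long key, int[] hits) {
        int count = 0;
        for (int i = heads[bucketFor(key)]; i != END && count < hits.length; i = next[i]) {
            if (keys[i] == key) hits[count++] = values[i];
        }
        return count;
    }

    int size() {
        return size;
    }

    private int bucketFor(long key) {
        // floorMod keeps the result non-negative even for negative keys
        return (int) Math.floorMod(key, (long) heads.length);
    }
}
```

Values come back most recently inserted first. Compared with open addressing, this layout spends an extra 4 bytes per entry on next plus 4 bytes per bucket on heads, so it trades memory for never having to probe past unrelated slots.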

Answer 6

I may be late in answering this question, but Elasticsearch will solve your problem.


Answer 1: score 7, accepted, 2012-04-08 23:00:41
Answer 2: score 2, 2012-04-08 16:47:33
Answer 3: score 2, 2012-04-08 17:01:34
Answer 4: score 2, 2012-04-09 07:34:05
Answer 5: score 0, 2012-04-08 17:11:08
Answer 6: score 0, 2015-10-07 11:22:15
