
Java On-Memory Efficient Key-Value Store

I have to store 111 million key-value pairs (one key can have multiple values - at most two or three) whose keys are 50-bit integers and whose values are 32-bit (at most) integers. Now, my requirements are:

  1. Fast insertion of (key, value) pairs [allowing duplicates]
  2. Fast retrieval of the value(s) for a given key.

A nice solution based on MultiMap is given here. However, I want to store more key-value pairs in main memory with no (or very little) performance penalty. From web articles I learned that a B+ tree, R+ tree, B-tree, compact multimap, etc. can be a nice solution for that. Can anybody help me:

Is there any Java library which properly satisfies all those needs (the above-mentioned or other data structures are also acceptable - no issue with that)? Actually, I want an efficient Java library data structure to store/retrieve key-value(s) pairs which takes a small memory footprint and must be built in-memory.

NB: I have tried HashMultiMap (Guava with some modification using Trove), as mentioned by Louis Wasserman, Kyoto/Tokyo Cabinet, etc. My experience with disk-backed solutions is not good, so please avoid those :). Another point: one important consideration for choosing a library/data structure is that keys are 50-bit (so if we allocate 64 bits, 14 bits will be wasted) and values are 32-bit ints (at most) - mostly they are 10-14 bits. So we can save space there too.
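Since the keys use at most 50 bits and most values fit in 14 bits, one space-saving option (my own sketch, not from any library mentioned above) is to pack a key and a small value together into a single 64-bit long, halving the per-entry footprint versus a separate 8-byte key and 4-byte value:

```java
// Sketch: pack a 50-bit key and a 14-bit value into one 64-bit long.
// Values wider than 14 bits would still need a separate array; this only
// illustrates the bit-budget argument from the question.
public class PackedPair {
    private static final int KEY_BITS = 50;
    private static final long KEY_MASK = (1L << KEY_BITS) - 1;  // low 50 bits
    private static final long VALUE_MASK = (1L << 14) - 1;      // high 14 bits

    // Pack key (<= 50 bits) and value (<= 14 bits) into one long.
    static long pack(long key, int value) {
        if ((key & ~KEY_MASK) != 0) throw new IllegalArgumentException("key exceeds 50 bits");
        if ((value & ~VALUE_MASK) != 0) throw new IllegalArgumentException("value exceeds 14 bits");
        return (((long) value) << KEY_BITS) | key;
    }

    static long unpackKey(long packed)   { return packed & KEY_MASK; }
    static int  unpackValue(long packed) { return (int) ((packed >>> KEY_BITS) & VALUE_MASK); }

    public static void main(String[] args) {
        long packed = pack((1L << 50) - 1, 12345);
        System.out.println(unpackKey(packed) == (1L << 50) - 1); // true
        System.out.println(unpackValue(packed) == 12345);        // true
    }
}
```

At 111 million entries this saves roughly 4 bytes per pair, at the cost of rejecting values that do not fit in 14 bits.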

I don't think there's anything in the JDK which will do this.

However, implementing such a thing is a simple matter of programming. Here is an open-addressed hash table with linear probing, with keys and values stored in parallel arrays:

public class LongIntParallelHashMultimap {

    private static final long NULL = 0L;

    private final long[] keys;
    private final int[] values;
    private int size;

    public LongIntParallelHashMultimap(int capacity) {
        keys = new long[capacity];
        values = new int[capacity];
    }

    public void put(long key, int value) {
        if (key == NULL) throw new IllegalArgumentException("key cannot be " + NULL);
        if (size == keys.length) throw new IllegalStateException("map is full");

        int index = indexFor(key);
        while (keys[index] != NULL) {
            index = successor(index);
        }
        keys[index] = key;
        values[index] = value;
        ++size;
    }

    public int[] get(long key) {
        if (key == NULL) throw new IllegalArgumentException("key cannot be " + NULL);

        int index = indexFor(key);
        int count = countHits(key, index);

        int[] hits = new int[count];
        int hitIndex = 0;

        while (keys[index] != NULL) {
            if (keys[index] == key) {
                hits[hitIndex] = values[index];
                ++hitIndex;
            }
            index = successor(index);
        }

        return hits;
    }

    private int countHits(long key, int index) {
        int numHits = 0;
        while (keys[index] != NULL) {
            if (keys[index] == key) ++numHits;
            index = successor(index);
        }
        return numHits;
    }

    private int indexFor(long key) {
        // the hashing constant is (the golden ratio * Long.MAX_VALUE) + 1
        // see The Art of Computer Programming, section 6.4
        // the constant has two important properties:
        // (1) it is coprime with 2^64, so multiplication by it is a bijective function, and does not generate collisions in the hash
        // (2) it has a 1 in the bottom bit, so it does not add zeroes in the bottom bits of the hash, and does not generate (gratuitous) collisions in the index
        long hash = key * 5700357409661598721L;
        return Math.abs((int) (hash % keys.length));
    }

    private int successor(int index) {
        return (index + 1) % keys.length;
    }

    public int size() {
        return size;
    }

}

Note that this is a fixed-size structure. You will need to create it big enough to hold all your data - 110 million entries for me takes up 1.32 GB. The bigger you make it, in excess of what you need to store the data, the faster insertions and lookups will be. I found that for 110 million entries, with a load factor of 0.5 (2.64 GB, twice as much space as needed), it took on average 403 nanoseconds to look up a key, but with a load factor of 0.75 (1.76 GB, a third more space than needed), it took 575 nanoseconds. Decreasing the load factor below 0.5 usually doesn't make much difference, and indeed, with a load factor of 0.33 (4.00 GB, three times more space than needed), I get an average time of 394 nanoseconds. So, even though you have 5 GB available, don't use it all.
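The sizes quoted above follow from simple arithmetic: each slot costs 8 bytes (long key) plus 4 bytes (int value), and capacity is the entry count divided by the load factor. A quick sketch of that calculation:

```java
// Back-of-envelope sizing for the parallel-array multimap:
// capacity = entries / loadFactor, and each slot is 12 bytes (8-byte key + 4-byte value).
public class Sizing {
    public static void main(String[] args) {
        long entries = 110_000_000L;
        for (double loadFactor : new double[]{0.75, 0.5, 0.33}) {
            long capacity = (long) (entries / loadFactor);
            double gb = capacity * 12.0 / 1e9;
            System.out.printf("load factor %.2f -> capacity %,d slots -> %.2f GB%n",
                    loadFactor, capacity, gb);
        }
        // prints 1.76 GB, 2.64 GB and 4.00 GB, matching the figures above
    }
}
```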

Note also that zero is not allowed as a key. If this is a problem, change the null value to something else, and pre-fill the keys array with that value on creation.

Is there any Java library which satisfies all those needs properly?

AFAIK, no. Or at least, not one that minimizes the memory footprint.

However, it should be easy to write a custom map class that is specialized to these requirements.

It's a good idea to look at databases, because problems like these are what they are designed for. In recent years key-value databases have become very popular, e.g. for web services (keyword: "NoSQL"), so you should find something.

The choice of a custom data structure also depends on whether you want to use a hard drive to store your data (and how safe that has to be), or whether it is completely lost on program exit.

If implementing it manually and the whole DB fits into memory fairly easily, I'd just implement a hashmap in C. Create a hash function that gives a (well-spread) memory address from a value. Insert there, or next to it if that slot is already assigned. Assignment and retrieval are then O(1). If you implement it in Java, you'll have the 4-byte overhead for each (primitive) object.

Based on @Tom Anderson's solution, I removed the need to allocate objects and added a performance test.

import java.util.Arrays;
import java.util.Random;

public class LongIntParallelHashMultimap {
    private static final long NULL = Long.MIN_VALUE;

    private final long[] keys;
    private final int[] values;
    private int size;

    public LongIntParallelHashMultimap(int capacity) {
        keys = new long[capacity];
        values = new int[capacity];
        Arrays.fill(keys, NULL);
    }

    public void put(long key, int value) {
        if (key == NULL) throw new IllegalArgumentException("key cannot be " + NULL);
        if (size == keys.length) throw new IllegalStateException("map is full");

        int index = indexFor(key);
        while (keys[index] != NULL) {
            index = successor(index);
        }
        keys[index] = key;
        values[index] = value;
        ++size;
    }

    public int get(long key, int[] hits) {
        if (key == NULL) throw new IllegalArgumentException("key cannot be " + NULL);

        int index = indexFor(key);

        int hitIndex = 0;

        while (keys[index] != NULL) {
            if (keys[index] == key) {
                hits[hitIndex] = values[index];
                ++hitIndex;
                if (hitIndex == hits.length)
                    break;
            }
            index = successor(index);
        }

        return hitIndex;
    }

    private int indexFor(long key) {
        return Math.abs((int) (key % keys.length));
    }

    private int successor(int index) {
        index++;
        return index >= keys.length ? index - keys.length : index;
    }

    public int size() {
        return size;
    }

    public static class PerfTest {
        public static void main(String... args) {
            int values = 110 * 1000 * 1000;
            long start0 = System.nanoTime();
            long[] keysValues = generateKeys(values);

            LongIntParallelHashMultimap map = new LongIntParallelHashMultimap(222222227);
            long start = System.nanoTime();
            addKeyValues(values, keysValues, map);
            long mid = System.nanoTime();
            int sum = lookUpKeyValues(values, keysValues, map);
            long time = System.nanoTime();
            System.out.printf("Generated %.1f M keys/s, Added %.1f M/s and looked up %.1f M/s%n",
                    values * 1e3 / (start - start0), values * 1e3 / (mid - start), values * 1e3 / (time - mid));
            System.out.println("Expected " + values + " got " + sum);
        }

        private static long[] generateKeys(int values) {
            Random rand = new Random();
            long[] keysValues = new long[values];
            for (int i = 0; i < values; i++)
                keysValues[i] = rand.nextLong();
            return keysValues;
        }

        private static void addKeyValues(int values, long[] keysValues, LongIntParallelHashMultimap map) {
            for (int i = 0; i < values; i++) {
                map.put(keysValues[i], i);
            }
            assert map.size() == values;
        }

        private static int lookUpKeyValues(int values, long[] keysValues, LongIntParallelHashMultimap map) {
            int[] found = new int[8];
            int sum = 0;
            for (int i = 0; i < values; i++) {
                sum += map.get(keysValues[i], found);
            }
            return sum;
        }
    }
}

prints

Generated 34.8 M keys/s, Added 11.1 M/s and looked up 7.6 M/s

Run on a 3.8 GHz i7 with Java 7 update 3.

This is much slower than the previous test because you are accessing main memory rather than the cache at random. This is really a test of the speed of your memory. The writes are faster because they can be performed asynchronously to main memory.


Using this collection:

import java.util.Collection;
import java.util.Set;

import com.google.common.base.Supplier;
import com.google.common.collect.Multimaps;
import com.google.common.collect.SetMultimap;

import gnu.trove.TDecorators;
import gnu.trove.map.hash.TLongObjectHashMap;
import gnu.trove.set.hash.TIntHashSet;

final SetMultimap<Long, Integer> map = Multimaps.newSetMultimap(
        TDecorators.wrap(new TLongObjectHashMap<Collection<Integer>>()),
        new Supplier<Set<Integer>>() {
            public Set<Integer> get() {
                return TDecorators.wrap(new TIntHashSet());
            }
        });

Running the same test with 50 million entries (which used about 16 GB) and -mx20g, I got the following result.

 Generated 47.2 M keys/s, Added 0.5 M/s and looked up 0.7 M/s

For 110 M entries you will need about 35 GB of memory and a machine 10x faster than mine (3.8 GHz) to perform 5 million adds per second.

If you must use Java, then implement your own hashtable/hashmap. An important property of your table is to use a linked list to handle collisions. Hence, when you do a lookup, you can return all the elements on the list.
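A minimal sketch of that idea (my own illustration, not a tuned implementation): each bucket holds a singly linked list of (key, value) nodes, so `get` can walk the chain and collect every value stored under a key.

```java
import java.util.ArrayList;
import java.util.List;

// Chained hash multimap: collisions and duplicate keys both go on the
// bucket's linked list, so a lookup returns all values for the key.
public class ChainedLongIntMultimap {
    private static final class Node {
        final long key; final int value; final Node next;
        Node(long key, int value, Node next) { this.key = key; this.value = value; this.next = next; }
    }

    private final Node[] buckets;

    public ChainedLongIntMultimap(int capacity) { buckets = new Node[capacity]; }

    private int indexFor(long key) {
        return (int) Math.floorMod(key, (long) buckets.length);
    }

    public void put(long key, int value) {
        int i = indexFor(key);
        buckets[i] = new Node(key, value, buckets[i]); // prepend to the chain
    }

    public List<Integer> get(long key) {
        List<Integer> hits = new ArrayList<>();
        for (Node n = buckets[indexFor(key)]; n != null; n = n.next) {
            if (n.key == key) hits.add(n.value);
        }
        return hits; // most recently inserted value first
    }

    public static void main(String[] args) {
        ChainedLongIntMultimap map = new ChainedLongIntMultimap(1024);
        map.put(42L, 1);
        map.put(42L, 2);
        System.out.println(map.get(42L)); // prints [2, 1]
    }
}
```

Note that compared with the open-addressed parallel-array version above, every entry here costs an extra node object, so it trades memory for simpler deletion and unbounded load factors.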

I might be late in answering this question, but Elasticsearch will solve your problem.
