
How should I map string keys to values in Java in a memory-efficient way?

I'm looking for a way to store a string->int mapping. A HashMap is, of course, the most obvious solution, but as I'm memory constrained and need to store 2 million pairs with 7-character keys, I need something memory efficient; retrieval speed is a secondary concern.

Currently I'm going along the lines of:

List<Tuple<String, Integer>> list = new ArrayList<Tuple<String, Integer>>(); // generic type arguments cannot be the primitive int
list.add(...); // load from file
Collections.sort(list);

and then for retrieval:

Collections.binarySearch(list, key); // log(n), acceptable

Should I perhaps go for a custom tree (each node a single character, each leaf holding the result), or is there an existing collection that fits this nicely? The strings are practically sequential (UK postcodes, they don't differ much), so I'm expecting nice memory savings here.

Edit: I just saw you mentioned the Strings are UK postcodes, so I'm fairly confident you can't go far wrong with a Trove TLongIntHashMap (by the way, Trove is a small library and very easy to use).

Edit 2: Lots of people seem to find this answer interesting, so I'm adding some information to it.

The goal here is to use a map containing keys/values in a memory-efficient way, so we'll start by looking for memory-efficient collections.

The following SO question is related (but far from identical to this one).

What is the most efficient Java Collections library?

Jon Skeet mentions that Trove is "just a library of collections from primitive types" [sic] and that, indeed, it doesn't add much functionality. We can also see a few benchmarks (by the.duckman) about the memory use and speed of Trove compared to the default Collections. Here's a snippet:

                      100000 put operations      100000 contains operations 
java collections             1938 ms                        203 ms
trove                         234 ms                        125 ms
pcj                           516 ms                         94 ms

And there's also an example showing how much memory can be saved by using Trove instead of a regular Java HashMap:

java collections        oscillates between 6644536 and 7168840 bytes
trove                                      1853296 bytes
pcj                                        1866112 bytes

So even though benchmarks always need to be taken with a grain of salt, these numbers suggest that Trove will not only save memory but also tends to be considerably faster.

So our goal now becomes to use Trove (given that, once you put millions and millions of entries in a regular HashMap, your app begins to feel unresponsive).

You mentioned 2 million pairs, 7-character keys and a String/int mapping.

2 million is really not that much, but you'll still feel the "Object" overhead and the constant (un)boxing of primitives to Integer in a regular HashMap<String,Integer>, which is why Trove makes a lot of sense here.

However, I'd point out that if you have control over the "7 characters", you could go even further: if you're using, say, only ASCII or ISO-8859-1 characters, your 7 characters would fit in a long (*). In that case you can dodge object creation altogether and represent your 7 characters as a long. You'd then use a Trove TLongIntHashMap and bypass the "Java Object" overhead altogether.

You stated specifically that your keys were 7 characters long and then commented that they were UK postcodes: I'd map each postcode to a long and save a tremendous amount of memory by fitting millions of key/value pairs into memory using Trove.

The advantage of Trove is basically that it is not doing constant boxing/unboxing of Objects/primitives: Trove works, in many cases, directly with primitives and primitives only.

(*) Say you use at most 256 distinct codepoints/characters: each then fits in 8 bits, so 7 characters take 7*8 == 56 bits, which is small enough to fit in a 64-bit long.

Sample method for encoding the String keys into longs (assuming ASCII characters, one byte per character for simplicity; 7 bits would be enough):

long encode(final String key) {
    final int length = key.length();
    if (length > 8) {
        throw new IllegalArgumentException(
                "key is longer than 8 characters");
    }
    long result = 0;
    for (int i = 0; i < length; i++) {
        // Mask to one unsigned byte so characters >= 0x80 are not sign-extended,
        // then place character i in byte i of the long.
        result |= (key.charAt(i) & 0xFFL) << (i * 8);
    }
    return result;
}
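To check that the packing loses nothing, here is a small round-trip sketch. The class name PostcodeKey and the decode method are my own additions for illustration, and it assumes keys contain only single-byte characters and never the NUL character (which is used as padding). The resulting long is the kind of value you would hand to a long-keyed map such as the TLongIntHashMap mentioned above.

```java
final class PostcodeKey {
    private PostcodeKey() {}

    // Packs up to 8 single-byte characters into a long, character i in byte i.
    static long encode(String key) {
        if (key.length() > 8) {
            throw new IllegalArgumentException("key is longer than 8 characters");
        }
        long result = 0;
        for (int i = 0; i < key.length(); i++) {
            char c = key.charAt(i);
            if (c == 0 || c > 0xFF) {
                throw new IllegalArgumentException("unsupported character in key");
            }
            result |= (long) c << (i * 8);
        }
        return result;
    }

    // Reverses encode(): reads bytes from the low end until only padding is left.
    static String decode(long packed) {
        StringBuilder sb = new StringBuilder(8);
        while (packed != 0) {
            sb.append((char) (packed & 0xFF));
            packed >>>= 8; // unsigned shift, so the loop also ends for 8-character keys
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        long packed = encode("SW1A1AA");
        System.out.println(decode(packed));
    }
}
```

Because distinct keys always produce distinct longs here, nothing but the long needs to be stored.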

Use the Trove library.

The Trove library has optimized HashMap and HashSet classes for primitives. In this case, TObjectIntHashMap<String> will map the parameterized object (String) to a primitive int.

First off, did you measure that LinkedList is indeed more memory efficient than a HashMap, or how did you come to that conclusion? Secondly, a LinkedList's access time for an element is O(n), so you cannot do an efficient binary search on it. If you want to take this approach, you should use an ArrayList, which should give you the best compromise between performance and space. However, again, I doubt that a HashMap, Hashtable or, in particular, a TreeMap would consume that much more memory; the first two provide constant-time access and the tree map logarithmic access, and all of them provide a nicer interface than a plain list. I would try to do some measurements to see how large the difference in memory consumption really is.

UPDATE: Given, as Adamski pointed out, that the Strings themselves, not the data structure they are stored in, will consume the most memory, it might be a good idea to look into data structures that are specific to strings, such as tries (especially patricia tries), which might reduce the storage space needed for the strings.

What you are looking for is a succinct trie: a trie which stores its data in nearly the least amount of space theoretically possible.

Unfortunately, there are no succinct-trie class libraries currently available for Java. One of my next projects (in a few weeks) is to write one for Java (and other languages).

In the meantime, if you don't mind JNI, there are several good native succinct-trie libraries you could reference.

Have you looked at tries? I've not used them but they may fit with what you're doing.

A custom tree would have the same O(log n) complexity, so don't bother. Your solution is sound, but I would go with an ArrayList instead of the LinkedList because the linked list allocates one extra object per stored value, which will amount to a lot of objects in your case.
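As a rough sketch of the sorted-list-plus-binary-search idea, here is a variation that goes one step beyond an ArrayList and keeps two parallel arrays, so there is no wrapper object per entry at all. The class name SortedArrayMap and the "missing" sentinel parameter are made up for the example; it assumes the data is loaded once and then only read.

```java
import java.util.Arrays;

final class SortedArrayMap {
    private final String[] keys;   // sorted ascending
    private final int[] values;    // values[i] belongs to keys[i]

    SortedArrayMap(String[] unsortedKeys, int[] unsortedValues) {
        if (unsortedKeys.length != unsortedValues.length) {
            throw new IllegalArgumentException("keys and values differ in length");
        }
        int n = unsortedKeys.length;
        // Sort an index permutation so keys and values stay aligned.
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> unsortedKeys[a].compareTo(unsortedKeys[b]));
        keys = new String[n];
        values = new int[n];
        for (int i = 0; i < n; i++) {
            keys[i] = unsortedKeys[order[i]];
            values[i] = unsortedValues[order[i]];
        }
    }

    // O(log n) lookup; returns `missing` when the key is absent.
    int get(String key, int missing) {
        int idx = Arrays.binarySearch(keys, key);
        return idx >= 0 ? values[idx] : missing;
    }
}
```

The temporary Integer[] used for sorting is garbage once the constructor returns; what remains is one String per key plus two arrays.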

As Erick writes, using the Trove library is a good place to start, as you save space by storing int primitives rather than Integers.

However, you are still faced with storing 2 million String instances. Given that these are keys in the map, interning them won't offer any benefit, so the next thing I'd consider is whether there's some characteristic of the Strings that can be exploited. For example:

  • If the Strings represent sentences of common words then you could transform the String into a Sentence class, and intern the individual words.
  • If the Strings only contain a subset of Unicode characters (e.g. only letters A-Z, or letters + digits) you could use a more compact encoding scheme than Java's Unicode.
  • You could consider transforming each String into a UTF-8 encoded byte array and wrapping this in a class: MyString. Obviously the trade-off here is the additional time spent performing look-ups.
  • You could write the map to a file and then memory map a portion or all of the file.
  • You could consider libraries such as Berkeley DB that allow you to define persistent maps and cache a portion of the map in memory. This offers a scalable approach.
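To make the byte-array and memory-mapping ideas a bit more concrete, here is a hedged sketch that stores sorted fixed-width records (7 key bytes followed by a 4-byte int) in a ByteBuffer and binary-searches them. The class name FixedWidthTable and the record layout are assumptions of mine. The buffer is allocated on the heap here so the example runs anywhere; a MappedByteBuffer obtained from FileChannel.map is also a ByteBuffer, so the same lookup code should apply to a memory-mapped file.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class FixedWidthTable {
    static final int KEY_BYTES = 7;
    static final int RECORD_BYTES = KEY_BYTES + Integer.BYTES; // 11 bytes per entry

    private final ByteBuffer records;
    private final int count;

    FixedWidthTable(ByteBuffer records) {
        this.records = records;
        this.count = records.limit() / RECORD_BYTES;
    }

    // Assumes sortedKeys is in ascending order and contains ASCII only.
    static ByteBuffer build(String[] sortedKeys, int[] values) {
        ByteBuffer buf = ByteBuffer.allocate(sortedKeys.length * RECORD_BYTES);
        for (int i = 0; i < sortedKeys.length; i++) {
            buf.put(pad(sortedKeys[i]));
            buf.putInt(values[i]);
        }
        return buf;
    }

    int get(String key, int missing) {
        byte[] wanted = pad(key);
        int lo = 0;
        int hi = count - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = compareAt(mid, wanted);
            if (cmp == 0) {
                return records.getInt(mid * RECORD_BYTES + KEY_BYTES);
            }
            if (cmp < 0) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return missing;
    }

    // Negative when the stored key at `index` sorts before `wanted`.
    private int compareAt(int index, byte[] wanted) {
        int base = index * RECORD_BYTES;
        for (int i = 0; i < KEY_BYTES; i++) {
            int diff = (records.get(base + i) & 0xFF) - (wanted[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return 0;
    }

    // Encodes a key as exactly KEY_BYTES bytes, zero-padded on the right.
    private static byte[] pad(String key) {
        byte[] raw = key.getBytes(StandardCharsets.US_ASCII);
        if (raw.length > KEY_BYTES) {
            throw new IllegalArgumentException("key is longer than " + KEY_BYTES + " bytes");
        }
        return Arrays.copyOf(raw, KEY_BYTES);
    }
}
```

At 11 bytes per record, 2 million entries come to about 22 MB with no per-entry objects, at the cost of an O(log n) lookup that compares raw bytes.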

Maybe you could use a RadixTree.

Use java.util.TreeMap instead of java.util.HashMap. It makes use of a red-black binary search tree and doesn't use more memory than what is required for holding the nodes containing the elements in the map. No extra buckets, unlike HashMap or Hashtable.

I think the solution is to step a little outside of Java. If you have that many values, you should use a database. If you don't feel like installing Oracle, SQLite is quick and easy. That way the data you don't immediately need is stored on the disk, and all of the caching/storage is done for you. Setting up a DB with one table and two columns won't take much time at all.

I'd consider using a cache, since caches usually have the ability to overflow to disk.

The problem is the objects' memory overhead, but with some tricks you can try to implement your own hash set. Something like this. As others said, strings have quite a large overhead, so you need to "compress" them somehow. Also try not to use too many arrays (lists) in the hash table (if you build a chaining hash table), as they are also objects and also carry overhead. Better yet, build an open-addressing hash table.
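A minimal sketch of what an open-addressing table could look like, assuming the keys have already been "compressed" into non-zero longs (for example by packing the characters into a long). The class name LongIntOpenMap is made up, the table uses linear probing, and to keep it short it has a fixed capacity and no removal or resizing.

```java
final class LongIntOpenMap {
    private static final long EMPTY = 0L; // assumption: 0 is never a real key

    private final long[] keys;
    private final int[] values;
    private final int mask;
    private int size;

    LongIntOpenMap(int expectedEntries) {
        int capacity = 16;
        while (capacity < expectedEntries * 2) {
            capacity <<= 1; // power of two, at most half full
        }
        keys = new long[capacity];
        values = new int[capacity];
        mask = capacity - 1;
    }

    void put(long key, int value) {
        if (key == EMPTY) {
            throw new IllegalArgumentException("0 is reserved as the empty marker");
        }
        int slot = findSlot(key);
        if (keys[slot] == EMPTY) {
            if ((size + 1) * 2 > keys.length) {
                throw new IllegalStateException("table is full; this sketch does not resize");
            }
            keys[slot] = key;
            size++;
        }
        values[slot] = value;
    }

    int get(long key, int missing) {
        if (key == EMPTY) {
            return missing;
        }
        int slot = findSlot(key);
        return keys[slot] == key ? values[slot] : missing;
    }

    int size() {
        return size;
    }

    // Linear probing: walk forward until the key or an empty slot is found.
    private int findSlot(long key) {
        int slot = mix(key) & mask;
        while (keys[slot] != EMPTY && keys[slot] != key) {
            slot = (slot + 1) & mask;
        }
        return slot;
    }

    // Spreads the bits of the key before masking down to a slot index.
    private static int mix(long key) {
        long h = key * 0x9E3779B97F4A7C15L;
        return (int) (h ^ (h >>> 32));
    }
}
```

The whole map is two primitive arrays, so the only per-entry cost is the 12 bytes of key and value plus the unused slots.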

You might create a key class that matches your needs. Perhaps like this:

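import java.util.Arrays;

public class MyKey implements Comparable<MyKey>
{
    // Java array types carry no length, so the size is given when allocating.
    private final char[] keyValue = new char[7];

    public MyKey(String keyValue)
    {
        if (keyValue.length() > this.keyValue.length)
        {
            throw new IllegalArgumentException("key is longer than 7 characters");
        }
        // Shorter keys are padded with '\0' at the end.
        keyValue.getChars(0, keyValue.length(), this.keyValue, 0);
    }

    public int compareTo(MyKey rhs)
    {
        // One possible ordering: compare character by character.
        for (int i = 0; i < keyValue.length; i++)
        {
            int diff = keyValue[i] - rhs.keyValue[i];
            if (diff != 0)
            {
                return diff;
            }
        }
        return 0;
    }

    public boolean equals(Object rhs)
    {
        return rhs instanceof MyKey && Arrays.equals(keyValue, ((MyKey) rhs).keyValue);
    }

    public int hashCode()
    {
        return Arrays.hashCode(keyValue);
    }
}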

Try this one:

OptimizedHashMap<String, int[]> myMap = new OptimizedHashMap<String, int[]>();
for(int i = 0; i < 2000000; i++)
{
  myMap.put("iiiiii" + i, new int[]{i});
}
System.out.println(myMap.containsValue(new int[]{3}));
System.out.println(myMap.get("iiiiii" + 1));

import java.util.HashMap;

public class OptimizedHashMap<K,V> extends HashMap<K,V>
{
    // Matches int[] values by their first element rather than by array identity.
    @Override
    public boolean containsValue(Object value)
    {
        if (value instanceof int[])
        {
            int[] newvalue = (int[]) value;
            for (Object val : this.values())
            {
                if (val instanceof int[] && ((int[]) val)[0] == newvalue[0])
                {
                    return true;
                }
            }
        }
        return false;
    }
}

Actually, HashMap and List are too general for such a specific task as looking up an int by zipcode. You should take advantage of what you know about the data being used. One option is to use a prefix tree whose leaves store the int value. Also, it could be pruned if (my guess) a lot of codes with the same prefix map to the same integer.

Looking up the int by zipcode in such a tree takes time linear in the length of the code and does not grow as the number of codes increases, compared to O(log(N)) in the case of binary search.
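A minimal sketch of such a prefix tree, to show why the lookup cost depends only on the key length. The class name CharTrie is made up; each node keeps its children in small sorted arrays, and the pruning of shared subtrees mentioned above is not implemented here.

```java
import java.util.Arrays;

final class CharTrie {
    private static final class Node {
        char[] labels = new char[0]; // sorted child characters
        Node[] kids = new Node[0];   // kids[i] is reached via labels[i]
        boolean hasValue;
        int value;

        Node child(char c) {
            int i = Arrays.binarySearch(labels, c);
            return i >= 0 ? kids[i] : null;
        }

        Node childOrCreate(char c) {
            int i = Arrays.binarySearch(labels, c);
            if (i >= 0) {
                return kids[i];
            }
            int at = -(i + 1); // insertion point that keeps labels sorted
            int oldLen = labels.length;
            char[] newLabels = Arrays.copyOf(labels, oldLen + 1);
            Node[] newKids = Arrays.copyOf(kids, oldLen + 1);
            System.arraycopy(newLabels, at, newLabels, at + 1, oldLen - at);
            System.arraycopy(newKids, at, newKids, at + 1, oldLen - at);
            newLabels[at] = c;
            newKids[at] = new Node();
            labels = newLabels;
            kids = newKids;
            return newKids[at];
        }
    }

    private final Node root = new Node();

    void put(String key, int value) {
        Node node = root;
        for (int i = 0; i < key.length(); i++) {
            node = node.childOrCreate(key.charAt(i));
        }
        node.hasValue = true;
        node.value = value;
    }

    // One step per character of the key, regardless of how many keys are stored.
    int get(String key, int missing) {
        Node node = root;
        for (int i = 0; i < key.length(); i++) {
            node = node.child(key.charAt(i));
            if (node == null) {
                return missing;
            }
        }
        return node.hasValue ? node.value : missing;
    }
}
```

Whether this actually saves memory over a flat array depends on how much prefix sharing the real data has, since every node is still a Java object; it is worth measuring before committing to it.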

Since you are intending to use hashing, you can try numerical conversions of the strings based on their ASCII values. The simplest idea would be:

    int sum=0;
    for(int i=0;i<arr.length;i++){
        sum+=(int)arr[i];

    }

Then hash "sum" using a well-defined hash function. You would choose a hash function based on the expected input patterns, e.g. if you use the division method:

    public int hasher(int sum, int prime){
       return sum % prime; // prime: a prime number chosen as the table size
    }

Selecting a prime number which is not close to an exact power of two improves performance and gives a more uniform distribution of the hashed keys.

Another method is to weight the characters based on their respective positions.

E.g. if you use the above method, both "abc" and "cab" will be hashed to the same location. But if you need them to be stored in two distinct locations, give weights to the positions, as we do in positional number systems.

     int sum=0;
     int weight=1;
     for(int i=0;i<arr.length;i++){
         sum+= (int)arr[i]*weight;
         weight=weight*2; // using powers of 2 gives better results. (you know why :))
     }  
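A quick way to see the difference between the two schemes, as a small sketch (the class and method names are just for illustration): the plain sum cannot tell "abc" from "cab", while the position-weighted sum can.

```java
final class PositionalHash {
    private PositionalHash() {}

    // Order-insensitive: every permutation of the same characters collides.
    static int plainSum(String s) {
        int sum = 0;
        for (int i = 0; i < s.length(); i++) {
            sum += s.charAt(i);
        }
        return sum;
    }

    // Position i gets weight 2^i, so reordering characters usually changes the result.
    static int weightedSum(String s) {
        int sum = 0;
        int weight = 1;
        for (int i = 0; i < s.length(); i++) {
            sum += s.charAt(i) * weight;
            weight = weight * 2;
        }
        return sum;
    }

    // Division method; floorMod keeps the bucket non-negative.
    static int bucket(int hash, int prime) {
        return Math.floorMod(hash, prime);
    }

    public static void main(String[] args) {
        System.out.println(plainSum("abc") + " " + plainSum("cab"));       // 294 294
        System.out.println(weightedSum("abc") + " " + weightedSum("cab")); // 689 685
    }
}
```

Doubling weights still allow some collisions between different strings, and for keys longer than about 31 characters the int weight overflows, so treat this as an illustration rather than a production hash.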

As your sample is quite large, you'd handle collisions with a chaining mechanism rather than a probe sequence. After all, which method you choose depends entirely on the nature of your application.

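Notice: technical posts on this site are licensed under CC BY-SA 4.0. If you republish them, please credit this site's URL or the original address. For any questions, contact yoyou2525@163.com.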

 