How should I map string keys to values in Java in a memory-efficient way?
I'm looking for a way to store a string->int mapping. A HashMap is, of course, the most obvious solution, but as I'm memory constrained and need to store 2 million pairs with 7-character keys, I need something memory efficient; retrieval speed is a secondary concern.
Currently I'm going along the lines of:
List<Tuple<String, Integer>> list = new ArrayList<Tuple<String, Integer>>(); // generics cannot take a primitive int
list.add(...); // load from file
Collections.sort(list);
and then for retrieval:
Collections.binarySearch(list, key); // log(n), acceptable
Should I perhaps go for a custom tree (each node a single character, each leaf with a result), or is there an existing collection that fits this nicely? The strings are practically sequential (UK postcodes, they don't differ much), so I'm expecting nice memory savings here.
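One way to realise the sorted-list idea from the question without allocating a Tuple object per entry is to keep keys and values in two parallel arrays. The following is only a minimal sketch under assumptions: the class name PostcodeTable and its methods are hypothetical, and it still stores one String per key.

```java
import java.util.Arrays;

// Hypothetical sketch: sorted parallel arrays instead of a List of tuples.
final class PostcodeTable {
    private final String[] keys;   // sorted ascending
    private final int[] values;    // values[i] belongs to keys[i]

    // Takes unsorted parallel arrays of equal length and sorts them by key.
    PostcodeTable(String[] unsortedKeys, int[] unsortedValues) {
        int n = unsortedKeys.length;
        Integer[] order = new Integer[n];   // temporary index array, only needed while loading
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> unsortedKeys[a].compareTo(unsortedKeys[b]));
        keys = new String[n];
        values = new int[n];
        for (int i = 0; i < n; i++) {
            keys[i] = unsortedKeys[order[i]];
            values[i] = unsortedValues[order[i]];
        }
    }

    // Returns the value stored for key, or defaultValue if the key is absent. O(log n).
    int get(String key, int defaultValue) {
        int idx = Arrays.binarySearch(keys, key);
        return idx >= 0 ? values[idx] : defaultValue;
    }
}
```

The int values live in a plain int[] here, so nothing is boxed after loading; the String keys remain the dominant cost.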
Edit: I just saw you mentioned the Strings were UK postcodes, so I'm fairly confident you couldn't go very wrong by using a Trove TLongIntHashMap (btw Trove is a small library and it's very easy to use).
Edit 2: Lots of people seem to find this answer interesting, so I'm adding some information to it.
The goal here is to use a map containing keys/values in a memory-efficient way, so we'll start by looking for memory-efficient collections.
The following SO question is related (but far from identical to this one).
What is the most efficient Java Collections library?
Jon Skeet mentions that Trove is "just a library of collections from primitive types" [sic] and that, indeed, it doesn't add much functionality. We can also see a few benchmarks (by the.duckman) about the memory and speed of Trove compared to the default Collections. Here's a snippet:
                   100000 put operations    100000 contains operations
java collections   1938 ms                  203 ms
trove              234 ms                   125 ms
pcj                516 ms                   94 ms
And there's also an example showing how much memory can be saved by using Trove instead of a regular Java HashMap:
java collections oscillates between 6644536 and 7168840 bytes
trove 1853296 bytes
pcj 1866112 bytes
So even though benchmarks always need to be taken with a grain of salt, these numbers suggest that Trove not only saves memory but is often considerably faster as well.
So our goal now becomes to use Trove (since, once you put millions and millions of entries in a regular HashMap, your app begins to feel unresponsive).
You mentioned 2 million pairs, 7 characters long keys and a String/int mapping.
2 million is really not that much, but you'll still feel the "Object" overhead and the constant (un)boxing of primitives to Integer in a regular HashMap<String, Integer>, which is why Trove makes a lot of sense here.
However, I'd point out that if you have control over the "7 characters", you could go even further: if you're using, say, only ASCII or ISO-8859-1 characters, your 7 characters would fit in a long (*). In that case you can dodge object creation altogether and represent your 7 characters as a long. You'd then use a Trove TLongIntHashMap and bypass the "Java Object" overhead altogether.
You stated specifically that your keys were 7 characters long and then commented they were UK postcodes: I'd map each postcode to a long and save a tremendous amount of memory by fitting millions of key/value pairs into memory using Trove.
The advantage of Trove is basically that it is not doing constant boxing/unboxing of Objects/primitives: Trove works, in many cases, directly with primitives and primitives only.
(*) Say you only have at most 256 codepoints/characters used; then the key fits in 7*8 == 56 bits, which is small enough to fit in a long.
Sample method for encoding the String keys into longs (assuming ASCII characters, one byte per character for simplification; 7 bits would be enough):
long encode(final String key) {
    final int length = key.length();
    if (length > 8) {
        throw new IndexOutOfBoundsException(
                "key is longer than 8 characters");
    }
    long result = 0;
    for (int i = 0; i < length; i++) {
        // Mask to one byte so characters >= 0x80 are not sign-extended.
        result |= (key.charAt(i) & 0xFFL) << (i * 8);
    }
    return result;
}
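To get the original key back out of the long, the packing can be reversed byte by byte. The sketch below is an illustration, not part of the original answer: the class name KeyCodec and the decode method are hypothetical, and it assumes ASCII keys of at most 8 characters that never contain a '\0' character, since a zero byte is treated as "no more characters".

```java
// Hypothetical companion to the encode method: packs and unpacks short ASCII keys.
final class KeyCodec {
    // Same layout as the encode method: character i is stored in byte i of the long.
    static long encode(String key) {
        if (key.length() > 8) {
            throw new IllegalArgumentException("key is longer than 8 characters");
        }
        long result = 0;
        for (int i = 0; i < key.length(); i++) {
            result |= (key.charAt(i) & 0xFFL) << (i * 8);
        }
        return result;
    }

    // Reads bytes from least significant upwards until a zero byte is found.
    static String decode(long encoded) {
        StringBuilder sb = new StringBuilder(8);
        for (int i = 0; i < 8; i++) {
            int b = (int) ((encoded >>> (i * 8)) & 0xFF);
            if (b == 0) {
                break;   // unused high bytes are zero
            }
            sb.append((char) b);
        }
        return sb.toString();
    }
}
```

For example, KeyCodec.encode("AB") is 65 + 66 * 256 = 16961, and decoding that value yields "AB" again.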
Use the Trove library.
The Trove library has optimized HashMap and HashSet classes for primitives. In this case, TObjectIntHashMap<String> will map the parameterized object (String) to a primitive int.
First off, did you measure that LinkedList is indeed more memory efficient than a HashMap, or how did you come to that conclusion? Secondly, a LinkedList's access time for an element is O(n), so you cannot do an efficient binary search on it. If you want to take such an approach, you should use an ArrayList, which should give you the best compromise between performance and space. However, again, I doubt that a HashMap, Hashtable or, in particular, a TreeMap would consume that much more memory, but the first two would provide constant-time access and the tree map logarithmic access, and they provide a nicer interface than a normal list. I would try to do some measurements to see how much the difference in memory consumption really is.
UPDATE: Given, as Adamski pointed out, that the Strings themselves, not the data structure they are stored in, will consume the most memory, it might be a good idea to look into data structures that are specific to strings, such as tries (especially Patricia tries), which might reduce the storage space needed for the strings.
What you are looking for is a succinct trie: a trie which stores its data in nearly the least amount of space theoretically possible.
Unfortunately, there are no succinct-trie class libraries currently available for Java. One of my next projects (in a few weeks) is to write one for Java (and other languages).
In the meantime, if you don't mind JNI, there are several good native succinct-trie libraries you could reference.
A custom tree would have the same complexity of O(log n), so don't bother. Your solution is sound, but I would go with an ArrayList instead of the LinkedList because the linked list allocates one extra object per stored value, which will amount to a lot of objects in your case.
As Erick writes, using the Trove library is a good place to start, as you save space by storing int primitives rather than Integers.
However, you are still faced with storing 2 million String instances. Given that these are keys in the map, interning them won't offer any benefit, so the next thing I'd consider is whether there's some characteristic of the Strings that can be exploited. For example:
If the Strings represent sentences of common words, then you could transform the String into a Sentence class, and intern the individual words.
You could consider converting each String into a UTF-8 encoded byte array and wrapping it in a class: MyString. Obviously the trade-off here is the additional time spent performing look-ups.
Perhaps you could use a RadixTree?
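The UTF-8 wrapper idea mentioned above (a class called MyString) could look roughly like the following. This is a hedged sketch: only the class name comes from the answer, while the fields and methods are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical MyString: wraps the UTF-8 bytes of a String so it can serve as a map key.
final class MyString {
    private final byte[] utf8;

    MyString(String s) {
        this.utf8 = s.getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public boolean equals(Object other) {
        // Two keys are equal when their encoded bytes are identical.
        return other instanceof MyString && Arrays.equals(utf8, ((MyString) other).utf8);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(utf8);
    }

    @Override
    public String toString() {
        // Decoding allocates a new String each time; this is part of the look-up cost.
        return new String(utf8, StandardCharsets.UTF_8);
    }
}
```

A look-up then has to wrap the query first, for example map.get(new MyString("SW1A1AA")), which is where the extra time mentioned in the answer goes.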
Use java.util.TreeMap instead of java.util.HashMap. It makes use of a red-black binary search tree and doesn't use more memory than what is required for holding the nodes containing the elements in the map. No extra buckets, unlike HashMap or Hashtable.
I think the solution is to step a little outside of Java. If you have that many values, you should use a database. If you don't feel like installing Oracle, SQLite is quick and easy. That way the data you don't immediately need is stored on disk, and all of the caching/storage is done for you. Setting up a DB with one table and two columns won't take much time at all.
I'd consider using some kind of cache, since caches usually have the ability to overflow to disk.
The problem is objects' memory overhead, but using some tricks you can try to implement your own hash set. Something like this. As others said, strings have quite a large overhead, so you need to "compress" them somehow. Also try not to use too many arrays (lists) in the hash table (if you build a chaining hash table), as they are also objects and also have overhead. Better yet, use an open-addressing hash table.
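As a rough illustration of the open-addressing idea, here is a minimal sketch of a long -> int map that uses two primitive arrays and linear probing. Everything in it is an assumption made for the example: the class name LongIntOpenMap, the fixed power-of-two capacity, the lack of removal, and the convention that key 0 marks an empty slot (so 0 cannot be stored as a key).

```java
// Hypothetical sketch of an open-addressing (linear probing) long -> int map.
final class LongIntOpenMap {
    private final long[] keys;     // 0 means "slot is empty"
    private final int[] values;
    private final int mask;
    private int size;

    // capacityPowerOfTwo must be a power of two and larger than the number of entries.
    LongIntOpenMap(int capacityPowerOfTwo) {
        if (capacityPowerOfTwo <= 0 || Integer.bitCount(capacityPowerOfTwo) != 1) {
            throw new IllegalArgumentException("capacity must be a positive power of two");
        }
        keys = new long[capacityPowerOfTwo];
        values = new int[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    private int slot(long key) {
        long h = key * 0x9E3779B97F4A7C15L;   // multiplicative mixing step
        return (int) (h >>> 32) & mask;
    }

    void put(long key, int value) {
        if (key == 0) {
            throw new IllegalArgumentException("key 0 is reserved");
        }
        int i = slot(key);
        while (keys[i] != 0 && keys[i] != key) {
            i = (i + 1) & mask;               // probe the next slot
        }
        if (keys[i] == 0) {
            if (size == mask) {               // always keep one slot empty so probing terminates
                throw new IllegalStateException("map is full");
            }
            keys[i] = key;
            size++;
        }
        values[i] = value;
    }

    int get(long key, int defaultValue) {
        int i = slot(key);
        while (keys[i] != 0) {
            if (keys[i] == key) {
                return values[i];
            }
            i = (i + 1) & mask;
        }
        return defaultValue;
    }
}
```

With no per-entry objects, the footprint is about 12 bytes per slot (one long plus one int), at the price of choosing the capacity up front.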
You might create a key class that matches your needs. Perhaps like this:
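import java.util.Arrays;

public class MyKey implements Comparable<MyKey>
{
    // Fixed-size backing array; shorter keys are padded with '\0'.
    private final char[] keyValue = new char[7];

    public MyKey(String keyValue)
    {
        // Load this.keyValue from the String keyValue.
        if (keyValue.length() > this.keyValue.length) {
            throw new IllegalArgumentException("key is longer than 7 characters");
        }
        keyValue.getChars(0, keyValue.length(), this.keyValue, 0);
    }

    public int compareTo(MyKey rhs)
    {
        // One possible ordering: compare character by character.
        for (int i = 0; i < keyValue.length; i++) {
            int diff = keyValue[i] - rhs.keyValue[i];
            if (diff != 0) return diff;
        }
        return 0;
    }

    public boolean equals(Object rhs)
    {
        return rhs instanceof MyKey && Arrays.equals(keyValue, ((MyKey) rhs).keyValue);
    }

    public int hashCode()
    {
        return Arrays.hashCode(keyValue);
    }
}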
Try this one:
import java.util.HashMap;

OptimizedHashMap<String, int[]> myMap = new OptimizedHashMap<String, int[]>();
for(int i = 0; i < 2000000; i++)
{
    myMap.put("iiiiii" + i, new int[]{i});
}
System.out.println(myMap.containsValue(new int[]{3}));
System.out.println(myMap.get("iiiiii" + 1)); // prints the array reference, not its contents

public class OptimizedHashMap<K,V> extends HashMap<K,V>
{
    // Compares int[] values by their first element rather than by reference.
    @Override
    public boolean containsValue(Object value) {
        if(value instanceof int[])
        {
            int[] newvalue = (int[]) value;
            for(Object val : this.values())
            {
                int[] newval = (int[]) val;
                if(newval[0] == newvalue[0])
                {
                    return true;
                }
            }
        }
        return false;
    }
}
Actually, HashMap and List are too general for a task as specific as looking up an int by zipcode. You should take advantage of what you know about the data being used. One option is to use a prefix tree with leaves that store the int value. Also, it could be pruned if (my guess) a lot of codes with the same prefix map to the same integer.
Lookup of the int by zipcode will be linear in the length of the key in such a tree and will not grow as the number of codes increases, compared to O(log(N)) for binary search.
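The look-up behaviour described above can be sketched with a very plain prefix tree. This is only an illustration under assumptions: the class name PostcodeTrie is hypothetical, keys are assumed to use only A-Z and 0-9, Integer.MIN_VALUE is assumed never to be a real value, and the naive 36-slot child array per node is not itself memory efficient (a compact or pruned layout would be needed for real savings).

```java
// Hypothetical sketch of a prefix tree over the characters A-Z and 0-9.
final class PostcodeTrie {
    private static final int ALPHABET = 36;
    private static final int NO_VALUE = Integer.MIN_VALUE;   // sentinel, assumed unused as a real value

    private static final class Node {
        final Node[] children = new Node[ALPHABET];
        int value = NO_VALUE;
    }

    private final Node root = new Node();

    // Maps a character to its child slot; rejects anything outside A-Z and 0-9.
    private static int index(char c) {
        if (c >= 'A' && c <= 'Z') return c - 'A';
        if (c >= '0' && c <= '9') return 26 + (c - '0');
        throw new IllegalArgumentException("unsupported character: " + c);
    }

    void put(String key, int value) {
        Node node = root;
        for (int i = 0; i < key.length(); i++) {
            int idx = index(key.charAt(i));
            if (node.children[idx] == null) {
                node.children[idx] = new Node();
            }
            node = node.children[idx];
        }
        node.value = value;
    }

    // Walks one node per character, so the cost depends on the key length, not on the number of keys.
    int get(String key, int defaultValue) {
        Node node = root;
        for (int i = 0; i < key.length(); i++) {
            node = node.children[index(key.charAt(i))];
            if (node == null) {
                return defaultValue;
            }
        }
        return node.value == NO_VALUE ? defaultValue : node.value;
    }
}
```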
Since you intend to use hashing, you can try numerical conversions of the strings based on ASCII values. The simplest idea would be:
int sum=0;
for(int i=0;i<arr.length;i++){
sum+=(int)arr[i];
}
Then hash "sum" using a well-defined hash function. You would choose a hash function based on the expected input patterns, e.g. if you use the division method:
public int hasher(int sum){
return sum%(a prime number);
}
Selecting a prime number which is not close to an exact power of two improves performance and gives a more uniform distribution of hashed keys.
Another method is to weight the characters based on their respective positions.
E.g. if you use the above method, both "abc" and "cab" will be hashed to the same location. But if you need them to be stored in two distinct locations, give weights to the positions, as we do in number systems.
int sum=0;
int weight=1;
for(int i=0;i<arr.length;i++){
sum+= (int)arr[i]*weight;
weight=weight*2; // using powers of 2 gives better results. (you know why :))
}
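To make the "abc" versus "cab" point concrete, the two summing schemes above can be put side by side. The class and method names below (HashDemo, plainSum, weightedSum, bucket) are hypothetical, and the prime passed to bucket is whatever the caller picks; this is a sketch of the idea, not a recommended hash function.

```java
// Hypothetical demo comparing the plain character sum with the position-weighted sum.
final class HashDemo {
    // Plain sum: every permutation of the same characters gives the same result.
    static int plainSum(String s) {
        int sum = 0;
        for (int i = 0; i < s.length(); i++) {
            sum += s.charAt(i);
        }
        return sum;
    }

    // Weighted sum: the weight doubles at each position, so character order matters.
    static int weightedSum(String s) {
        int sum = 0;
        int weight = 1;
        for (int i = 0; i < s.length(); i++) {
            sum += s.charAt(i) * weight;
            weight = weight * 2;
        }
        return sum;
    }

    // Division method: reduce the sum to a bucket index in [0, prime).
    static int bucket(int sum, int prime) {
        return Math.floorMod(sum, prime);
    }
}
```

Here plainSum gives 294 for both "abc" and "cab", while weightedSum gives 689 for "abc" and 685 for "cab".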
As your sample is quite large, you'd handle collisions with a chaining mechanism rather than a probe sequence. After all, which method you choose depends entirely on the nature of your application.