Memory overhead of Java HashMap compared to ArrayList

What is the memory overhead of a Java HashMap compared to an ArrayList?

Update:

I would like to speed up searching for specific values in a large collection (6 million+) of identical objects.

So I am thinking about using one or several HashMaps instead of an ArrayList, but I am wondering what the overhead of a HashMap is.

As far as I understand, the key is not stored, only the hash of the key, so the overhead should be something like the size of the object's hash plus one pointer.

But which hash function is used? Is it the one provided by Object, or another one?

If you're comparing HashMap with ArrayList, I presume you're doing some sort of searching/indexing of the ArrayList, such as a binary search or a custom hash table? A .get(key) through 6 million entries would be infeasible with a linear search.

On that assumption, I ran some empirical tests and concluded that "you can store 2.5 times as many small objects in the same amount of RAM if you use an ArrayList with binary search or a custom hash map implementation, versus HashMap". My test used small objects containing only 3 fields, one of which is the key, and the key is an integer. I used a 32-bit JDK 1.6. See below for caveats on this figure of "2.5".
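As a rough illustration of the "ArrayList with binary search" option, the sketch below keeps records sorted by their integer key and looks them up with Collections.binarySearch. The Record class, the BY_KEY comparator and the findByKey helper are hypothetical names made up for this example; they are not part of the test code further down.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class SortedListLookup {
    // Hypothetical record type: the key lives inside the value itself.
    static final class Record {
        final int key;
        final String name;
        Record(int key, String name) { this.key = key; this.name = name; }
    }

    // Orders records by key; Integer.compare avoids the overflow risk of subtraction.
    static final Comparator<Record> BY_KEY = new Comparator<Record>() {
        @Override public int compare(Record a, Record b) {
            return Integer.compare(a.key, b.key);
        }
    };

    // Returns the record with the given key, or null if absent, in O(log n).
    // The list must already be sorted with BY_KEY.
    static Record findByKey(List<Record> sorted, int key) {
        int idx = Collections.binarySearch(sorted, new Record(key, null), BY_KEY);
        return idx >= 0 ? sorted.get(idx) : null;
    }

    public static void main(String[] args) {
        List<Record> records = new ArrayList<Record>();
        records.add(new Record(42, "answer"));
        records.add(new Record(7, "seven"));
        records.add(new Record(1000, "thousand"));
        Collections.sort(records, BY_KEY);   // sort once, then search many times

        System.out.println(findByKey(records, 7).name);  // prints "seven"
        System.out.println(findByKey(records, 8));       // prints "null"
    }
}
```

The list holds only one reference per record, so the per-entry overhead is a single array slot; the price is that inserts into a sorted list are O(n) unless the data is loaded in bulk and sorted once.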

The key things to note are:

(a) It's not the space required for references or the "load factor" that kills you, but the overhead of object creation. If the key is a primitive type, or a combination of 2 or more primitive or reference values, then each key requires its own object, which carries an overhead of 8 bytes.

(b) In my experience you usually need the key as part of the value (e.g. to store customer records indexed by customer id, you still want the customer id as part of the Customer object). This means it is, IMO, somewhat wasteful that a HashMap stores references to keys and values separately.

Caveats:

  1. The most common type used for HashMap keys is String. The object creation overhead doesn't apply here, so the difference would be smaller.

  2. I got a figure of 2.8 (8880502 entries inserted into the ArrayList compared with 3148004 into the HashMap on a -Xmx256M JVM), but my ArrayList load factor was 80% and my objects were quite small: 12 bytes plus 8 bytes of object overhead.

  3. My figure, and my implementation, require that the key is contained within the value; otherwise I'd have the same problem with object creation overhead and it would be just another implementation of HashMap.
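To see where a ratio in this range could come from, here is a back-of-the-envelope estimate. Every size in it is an assumption for a typical 32-bit HotSpot JVM of that era (8-byte object header, 4-byte references, object sizes rounded up to a multiple of 8, and a HashMap entry object holding a hash plus key, value and next references); none of these numbers were measured in the test above, and the class and method names are invented for the sketch.

```java
class PerEntryEstimate {
    // Assumed values for a 32-bit HotSpot JVM; adjust them for your own JVM.
    static final int HEADER = 8;   // object header in bytes
    static final int REF = 4;      // size of one reference in bytes
    static final int ALIGN = 8;    // object sizes are rounded up to this multiple

    // Rounds an object size up to the next multiple of ALIGN.
    static int align(int bytes) {
        return (bytes + ALIGN - 1) / ALIGN * ALIGN;
    }

    // Payload object (3 ints) plus one slot in a backing array that is 80% full.
    static double arrayListBytesPerEntry() {
        int payload = align(HEADER + 3 * 4);
        return payload + REF / 0.80;
    }

    // Payload, boxed Integer key, entry object, and one slot in a table
    // that is 75% full (the best case; a freshly resized table is emptier).
    static double hashMapBytesPerEntry() {
        int payload = align(HEADER + 3 * 4);
        int boxedKey = align(HEADER + 4);
        int entry = align(HEADER + 4 + 3 * REF);   // hash + key, value, next
        return payload + boxedKey + entry + REF / 0.75;
    }

    public static void main(String[] args) {
        double list = arrayListBytesPerEntry();
        double map = hashMapBytesPerEntry();
        System.out.printf("ArrayList-based: ~%.1f bytes/entry%n", list);
        System.out.printf("HashMap:         ~%.1f bytes/entry%n", map);
        System.out.printf("ratio:           ~%.1f%n", map / list);
    }
}
```

Under these assumptions the estimate is about 29 bytes per entry for the array-based structure and about 69 for HashMap, a ratio of roughly 2.4, which is in the same ballpark as the measured figures. Most of the gap comes from the extra entry object and the boxed key, not from the reference slots.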

My code:

public class Payload {
    int key,b,c;
    Payload(int _key) { key = _key; }
}


import org.junit.Test;

import java.util.HashMap;
import java.util.Map;


public class Overhead {
    @Test
    public void useHashMap()
    {
        int i=0;
        try {
            Map<Integer, Payload> map = new HashMap<Integer, Payload>();
            for (i=0; i < 4000000; i++) {
                int key = (int)(Math.random() * Integer.MAX_VALUE);
                map.put(key, new Payload(key));
            }
        }
        catch (OutOfMemoryError e) {
            System.out.println("Got up to: " + i);
        }
    }

    @Test
    public void useArrayList()
    {
        int i=0;
        try {
            ArrayListMap map = new ArrayListMap();
            for (i=0; i < 9000000; i++) {
                int key = (int)(Math.random() * Integer.MAX_VALUE);
                map.put(key, new Payload(key));
            }
        }
        catch (OutOfMemoryError e) {
            System.out.println("Got up to: " + i);
        }
    }
}


import java.util.ArrayList;


public class ArrayListMap {
    private ArrayList<Payload> map = new ArrayList<Payload>();
    private int[] primes = new int[128];

    static boolean isPrime(int n)
    {
        for (int i=(int)Math.sqrt(n); i >= 2; i--) {
            if (n % i == 0)
                return false;
        }
        return true;
    }

    ArrayListMap()
    {
        for (int i=0; i < 11000000; i++)    // this is clumsy, I admit
            map.add(null);
        int n=31;
        for (int i=0; i < 128; i++) {
            while (! isPrime(n))
                n+=2;
            primes[i] = n;
            n += 2;
        }
        System.out.println("Capacity = " + map.size());
    }

    public void put(int key, Payload value)
    {
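        // First probe position; Java's % can be negative, so wrap it into range below.
        int hash = key % map.size();
        // Probe step for double hashing; the sign bit is masked so that
        // a negative key cannot produce a negative array index.
        int hash2 = primes[(key & 0x7fffffff) % primes.length];
        if (hash < 0)
            hash += map.size();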
        do {
            if (map.get(hash) == null) {
                map.set(hash, value);
                return;
            }
            hash += hash2;
            if (hash >= map.size())
                hash -= map.size();
        } while (true);
    }

    public Payload get(int key)
    {
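        // Same probe sequence as put(): start at key % size, step by a prime.
        int hash = key % map.size();
        // The sign bit is masked so that a negative key cannot produce
        // a negative array index.
        int hash2 = primes[(key & 0x7fffffff) % primes.length];
        if (hash < 0)
            hash += map.size();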
        do {
            Payload payload = map.get(hash);
            if (payload == null)
                return null;
            if (payload.key == key)
                return payload;
            hash += hash2;
            if (hash >= map.size())
                hash -= map.size();
        } while (true);
    }
}

The simplest thing would be to look at the source and work it out that way. However, you're really comparing apples and oranges: lists and maps are conceptually quite distinct. It's rare that you would choose between them on the basis of memory usage.

What's the background behind this question?

All that is stored in either is pointers. Depending on your architecture, a pointer should be 32 or 64 bits (or more or less).

An array list of 10 tends to allocate 10 "pointers" at a minimum (plus some one-time overhead).

A map has to allocate twice that (20 pointers) because it stores two values at a time. On top of that, it has to store the "hash", which should be bigger than the map: at a loading of 75% it SHOULD be around 13 32-bit values (hashes).

So if you want an offhand answer, the ratio should be about 1:3.25 or so. But you are only talking about pointer storage, which is very small unless you are storing a massive number of objects, and if you are, the utility of being able to reference instantly (HashMap) vs. iterating (array) should be MUCH more significant than the memory size.

Oh, also: arrays can be fit to the exact size of your collection. HashMaps can as well if you specify the size, but if one "grows" beyond that size, it will re-allocate a larger array and not use some of it, so there can be a little waste there as well.

I don't have an answer for you either, but a quick Google search turned up a function in Java that might help.

Runtime.getRuntime().freeMemory();

So I propose that you populate a HashMap and an ArrayList with the same data. Record the free memory, delete the first object, record the memory, delete the second object, record the memory, compute the differences, ..., profit!!!

You should probably do this with different magnitudes of data, i.e. start with 1000, then 10000, 100000, 1000000.

EDIT: Corrected, thanks to amischiefr.

EDIT: Sorry for editing your post, but this is pretty important if you are going to use this (and it's a little much for a comment). freeMemory does not work like you think it would. First, its value is changed by garbage collection. Second, its value is changed when Java allocates more memory. Using the freeMemory call alone doesn't provide useful data.

Try this:

public static void displayMemory() {
    Runtime r=Runtime.getRuntime();
    r.gc();
    r.gc(); // YES, you NEED 2!
    System.out.println("Memory Used="+(r.totalMemory()-r.freeMemory()));
}

Or you can return the memory used and store it, then compare it to a later value. Either way, remember the 2 gc() calls and the subtraction from totalMemory().

Again, sorry to edit your post!
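Putting the two pieces of advice together, a minimal sketch of the comparison might look like the following. The class name HeapDelta, the usedMemory helper and the element count are assumptions for illustration. Because gc() is only a request to the JVM, the figures are estimates and will vary between runs, JVM versions and heap settings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class HeapDelta {
    // Approximate heap in use after asking the JVM to collect garbage.
    // gc() is only a hint to the JVM, so treat the result as an estimate.
    static long usedMemory() {
        Runtime r = Runtime.getRuntime();
        r.gc();
        r.gc();
        return r.totalMemory() - r.freeMemory();
    }

    public static void main(String[] args) {
        int n = 1000000;   // arbitrary size chosen for the example

        long before = usedMemory();
        List<Integer> list = new ArrayList<Integer>(n);
        for (int i = 0; i < n; i++) list.add(i);
        long afterList = usedMemory();

        Map<Integer, Integer> map = new HashMap<Integer, Integer>();
        for (int i = 0; i < n; i++) map.put(i, i);
        long afterMap = usedMemory();

        // Both collections are still referenced here, so neither has been collected.
        System.out.println("ArrayList of " + list.size() + ": ~" + (afterList - before) + " bytes");
        System.out.println("HashMap of " + map.size() + ": ~" + (afterMap - afterList) + " bytes");
    }
}
```

Repeating the run with 1000, 10000, 100000 and 1000000 elements, as suggested above, shows how the per-element cost settles as the fixed overhead becomes negligible.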

HashMaps try to maintain a load factor (usually 75% full); you can think of a HashMap as a sparsely filled array list. The problem with a straight comparison of size is that the map's backing table grows with the data so as to maintain this load factor. ArrayList, on the other hand, grows to meet its needs by enlarging its internal array by about 50% at a time. For relatively small sizes they are comparable; however, as you pack more and more data into the map, it requires a lot of empty references in order to maintain the hash performance.

In either case I recommend priming the structure with the expected size of the data before you start adding. This gives the implementation a better initial setting and will likely consume less overall in both cases.
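Priming the size could look like the sketch below. Both constructors taking an initial capacity are part of the standard library; the Presize class and the hashMapCapacityFor helper are hypothetical, and the formula assumes the default load factor of 0.75.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class Presize {
    // Initial capacity that lets a HashMap hold `expected` entries
    // at the default 0.75 load factor without having to resize.
    static int hashMapCapacityFor(int expected) {
        return (int) (expected / 0.75f) + 1;
    }

    public static void main(String[] args) {
        int expected = 100000;

        // The backing array is allocated once at its final size.
        List<String> list = new ArrayList<String>(expected);
        // The table is sized up front, so no rehashing happens while filling it.
        Map<Integer, String> map = new HashMap<Integer, String>(hashMapCapacityFor(expected));

        for (int i = 0; i < expected; i++) {
            String s = "item" + i;
            list.add(s);
            map.put(i, s);
        }
        System.out.println(list.size() + " " + map.size());  // prints "100000 100000"
    }
}
```

Note that HashMap rounds the requested capacity up to a power of two, so the table can still be noticeably larger than the number of entries.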

Update:

Based on your updated problem, check out Glazed Lists. This is a neat little tool written by some of the Google people for doing operations similar to the one you describe. It's also very quick. It allows clustering, filtering, searching, etc.

HashMap holds a reference to the value and a reference to the key.

ArrayList just holds a reference to the value.

So, assuming that the key uses the same amount of memory as the value, HashMap uses 50% more memory (although, strictly speaking, it is not the HashMap that uses that memory, because it just keeps a reference to it).

On the other hand, HashMap provides constant-time performance for the basic operations (get and put). So although it may use more memory, getting an element may be much faster with a HashMap than with an ArrayList.

So the next thing you should do is to care not about which one uses more memory but about what each is good for.

Choosing the correct data structure for your program saves more CPU/memory than the details of how the library is implemented underneath.

EDIT

After Grant Welch's answer I decided to measure with 2,000,000 integers.

Here's the source code

This is the output

$
$javac MemoryUsage.java  
Note: MemoryUsage.java uses unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.
$java -Xms128m -Xmx128m MemoryUsage 
Using ArrayListMemoryUsage@8558d2 size: 0
Total memory: 133.234.688
Initial free: 132.718.608
  Final free: 77.965.488

Used: 54.753.120
Memory Used 41.364.824
ArrayListMemoryUsage@8558d2 size: 2000000
$
$java -Xms128m -Xmx128m MemoryUsage H
Using HashMapMemoryUsage@8558d2 size: 0
Total memory: 133.234.688
Initial free: 124.329.984
  Final free: 4.109.600

Used: 120.220.384
Memory Used 129.108.608
HashMapMemoryUsage@8558d2 size: 2000000

Basically, you should be using the "right tool for the job". Since there are cases where you'll need key/value pairs (where you may use a HashMap) and cases where you'll just need a list of values (where you may use an ArrayList), the question of "which one uses more memory" is, in my opinion, moot, since it is not a consideration in choosing one over the other.

But to answer the question: since HashMap stores key/value pairs while ArrayList stores just values, I would assume that the addition of keys alone means the HashMap takes up more memory, assuming, of course, that we are comparing them with the same value type (e.g. where the values in both are Strings).

I think the wrong question is being asked here.

If you would like to improve the speed at which you can search for an object in a List containing six million entries, then you should look into how fast these datatypes' retrieval operations perform.

As usual, the Javadocs for these classes state pretty plainly what type of performance they offer:

HashMap:

This implementation provides constant-time performance for the basic operations (get and put), assuming the hash function disperses the elements properly among the buckets.

This means that HashMap.get(key) is O(1).

ArrayList:

The size, isEmpty, get, set, iterator, and listIterator operations run in constant time. The add operation runs in amortized constant time, that is, adding n elements requires O(n) time. All of the other operations run in linear time (roughly speaking).

This means that most of ArrayList's operations are O(1), but likely not the ones that you would be using to find objects that match a certain value.

If you are iterating over every element in the ArrayList and testing for equality, or using contains(), then your operation is running in O(n) time (or worse).

If you are unfamiliar with O(1) or O(n) notation, it describes how the time an operation takes grows with the size of the input. In this case, if you can get constant-time performance, you want to take it. If HashMap.get() is O(1), retrieval operations take roughly the same amount of time regardless of how many entries are in the Map.

The fact that something like ArrayList.contains() is O(n) means that the time it takes grows as the size of the list grows, so iterating through an ArrayList with six million entries will not be very efficient at all.
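The difference between the two lookups can be sketched as follows. The Customer class and both helper methods are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LookupComparison {
    // Hypothetical value type that carries its own id.
    static final class Customer {
        final int id;
        Customer(int id) { this.id = id; }
    }

    // O(n): walks the list until an element with the given id is found.
    static Customer findInList(List<Customer> customers, int id) {
        for (Customer c : customers) {
            if (c.id == id) return c;
        }
        return null;
    }

    // One O(n) pass to build an index; each later get() is O(1) on average.
    static Map<Integer, Customer> indexById(List<Customer> customers) {
        Map<Integer, Customer> byId = new HashMap<Integer, Customer>();
        for (Customer c : customers) byId.put(c.id, c);
        return byId;
    }

    public static void main(String[] args) {
        List<Customer> customers = new ArrayList<Customer>();
        for (int i = 0; i < 1000; i++) customers.add(new Customer(i));
        Map<Integer, Customer> byId = indexById(customers);

        // Both approaches return the same object; only the cost differs.
        System.out.println(findInList(customers, 999) == byId.get(999));  // prints "true"
    }
}
```

With six million entries, the list scan touches up to six million elements per lookup, while the map lookup inspects one bucket on average.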

I don't know the exact number, but HashMaps are much heavier. Comparing the two, ArrayList's internal representation is self-evident, but HashMaps retain Entry objects, which can balloon your memory consumption.

It's not that much larger, but it's larger. A great way to visualize this would be with a dynamic profiler such as YourKit, which allows you to see all heap allocations. It's pretty nice.

This article provides a lot of information about object sizes in Java.

As Jon Skeet noted, these are completely different structures. A map (such as HashMap) is a mapping from one value to another, i.e. you have a key that maps to a value, in a Key->Value kind of relationship. The key is hashed and placed in an array for quick lookup.

A List, on the other hand, is a collection of elements with order. ArrayList happens to use an array as the back-end storage mechanism, but that is irrelevant. Each indexed element is a single element in the list.

Edit: based on your comment, I have added the following information:

The key is stored in a HashMap. This is because a hash is not guaranteed to be unique for any two different elements, so the key has to be stored to handle hash collisions. If you simply want to see whether an element exists in a set of elements, use a Set (the standard implementation being HashSet). If the order matters but you need a quick lookup, use a LinkedHashSet, as it keeps the order in which the elements were inserted. The lookup time is O(1) on both, but the insertion time is slightly longer on a LinkedHashSet. Use a Map only if you are actually mapping from one value to another: if you simply have a set of unique objects, use a Set; if you have ordered objects, use a List.
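A small example makes the collision point concrete. The strings "Aa" and "BB" are a well-known pair with identical String.hashCode() values; the class and helper names are made up for the sketch.

```java
import java.util.HashMap;
import java.util.Map;

class CollisionDemo {
    // True when two objects report the same hash code.
    static boolean sameHash(Object a, Object b) {
        return a.hashCode() == b.hashCode();
    }

    public static void main(String[] args) {
        // Two different strings, one hash code (both are 2112).
        System.out.println(sameHash("Aa", "BB"));   // prints "true"

        Map<String, Integer> map = new HashMap<String, Integer>();
        map.put("Aa", 1);
        map.put("BB", 2);

        // Both entries survive because the map compares the stored keys
        // with equals() after the hash has selected a bucket.
        System.out.println(map.get("Aa"));   // prints "1"
        System.out.println(map.get("BB"));   // prints "2"
        System.out.println(map.size());      // prints "2"
    }
}
```

If the map kept only the hash, the second put would be indistinguishable from an update of the first entry, which is why a reference to each key has to be retained.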

If you're considering two ArrayLists vs. one HashMap, it's indeterminate; both are partially full data structures. If you were comparing Vector vs. Hashtable, Vector is probably more memory efficient, because it only allocates the space it uses, whereas Hashtables allocate more space.

If you need a key-value pair and aren't doing incredibly memory-hungry work, just use the HashMap.

This site lists the memory consumption for several commonly (and not so commonly) used data structures. From there one can see that the HashMap takes roughly 5 times the space of an ArrayList. The map will also allocate one additional object per entry.

If you need a predictable iteration order and use a LinkedHashMap, the memory consumption will be even higher.

You can do your own memory measurements with Memory Measurer.

There are two important facts to note, however:

  1. Many data structures (including ArrayList and HashMap) allocate more space than they currently need, because otherwise they would frequently have to execute a costly resize operation. Thus the memory consumption per element depends on how many elements are in the collection. For example, an ArrayList with the default settings uses the same memory for 0 to 10 elements.
  2. As others have said, the keys of the map are stored, too. So if they are not in memory anyway, you will have to add this memory cost as well. An additional object will usually take 8 bytes of overhead alone, plus the memory for its fields and possibly some padding. So this will also add up to a lot of memory.
