Java hashmap vs hashset performance

I have a file consisting of 7.6M lines. Each line is of the form A,B,C,D, where B,C,D are values used to calculate a level of importance for A, a String identifier that is unique to each line. My approach:

// Assumed class-level declarations (the original post defines these elsewhere in the class):
private static final char delimiter = ',';
private final String[] splitted = new String[4]; // holds the A,B,C,D tokens of one line

private void read(String filename) throws Throwable {
    BufferedReader br = new BufferedReader(new FileReader(filename));

    // Preallocate capacity to avoid rehashing while the 7.6M entries are inserted
    Map<String, Double> mmap = new HashMap<>(10000000, 0.8f);
    String line;
    long t0 = System.currentTimeMillis();
    while ((line = br.readLine()) != null) {
        split(line);
        mmap.put(splitted[0], 0.0); // dummy value for profiling purposes
    }
    long t1 = System.currentTimeMillis();
    br.close();
    System.out.println("Completed in " + (t1 - t0) / 1000.0 + " seconds");
}

private void split(String line) {
    int idxComma, idxToken = 0, fromIndex = 0;
    // Copy each comma-delimited token into the splitted array without regex overhead
    while ((idxComma = line.indexOf(delimiter, fromIndex)) != -1) {
        splitted[idxToken++] = line.substring(fromIndex, idxComma);
        fromIndex = idxComma + 1;
    }
    splitted[idxToken] = line.substring(fromIndex); // the final token after the last comma
}

where the dummy value 0.0 is inserted for "profiling" purposes, and splitted is a simple String array defined as a field of the class. I initially worked with String's split() method, but found the above to be faster.

When I run the above code, it takes 12 seconds to parse the file, which is far more than I think it should take. If, for example, I replace the HashMap with a Vector of strings and just take the first entry from each line (i.e., I do not put an associated value with it, as insertion should be amortized constant time), the entire file can be read in less than 3 seconds.
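For reference, the Vector-based variant described above might look like the following. This is a hypothetical sketch (the original post does not show this code), reusing the split() helper and splitted array from above:

// Hypothetical sketch of the Vector-based comparison (not from the original post)
private void readKeysOnly(String filename) throws Throwable {
    List<String> keys = new Vector<>(10000000);
    BufferedReader br = new BufferedReader(new FileReader(filename));
    String line;
    while ((line = br.readLine()) != null) {
        split(line);
        keys.add(splitted[0]); // store the identifier only: no hashing, no boxing
    }
    br.close();
}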

This suggests to me that either (i) there are a lot of collisions in the HashMap (I have tried to minimise the number of resizes by preallocating the size and setting the load factor accordingly), or (ii) the hashCode() function is somehow slow. I doubt it is (ii), because if I use a HashSet the file can be read in under 4 seconds.

My question is: what could be the reason that the HashMap performs so slowly? Is hashCode() insufficient for maps of this size, or is there something fundamental that I have overlooked?

HashMap vs Vector: Inserting into a HashMap is considerably more costly than inserting into a Vector. Although both are amortized constant-time operations, the HashMap performs a number of other operations internally (such as generating the hashCode, checking for collisions, resolving collisions, etc.), whereas the Vector just inserts the element at the end (growing the backing structure if required).

HashMap vs HashSet: HashSet internally uses a HashMap. So there shouldn't be any performance difference whatsoever if you use them for the same purpose. Ideally the two serve different purposes, so the discussion of which is better is moot.
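As an abridged sketch of what OpenJDK's HashSet does internally (simplified, not the exact source):

// Abridged sketch: HashSet.add() delegates to a backing HashMap
public class HashSet<E> {
    private transient HashMap<E, Object> map = new HashMap<>();
    // one shared dummy value, reused for every key
    private static final Object PRESENT = new Object();

    public boolean add(E e) {
        return map.put(e, PRESENT) == null; // a null return means the key was new
    }
}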

Since you need B,C,D as the value for key A, you should definitely stick with HashMap. If you really want to just compare the performance, put null instead of 0.0 as the value for all the keys (because that is what HashSet uses when putting keys into its backing HashMap).

Update: HashSet uses a dummy constant value (static final) when inserting into the HashMap, not null. Sorry about that. You can replace your 0.0 with any constant and the performance should be similar to HashSet's.
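For instance, the put loop from the question could be changed like this (a hypothetical tweak, not code from the original answer):

// Box the dummy value once (Double.valueOf(0.0) runs a single time, here)
private static final Double DUMMY = 0.0;

// ... then inside the read loop:
mmap.put(splitted[0], DUMMY); // reuses the shared instance; no per-line allocation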

You could use a more memory-efficient Collections library. 您可以使用更具内存效率的Collections库。

I suggest Eclipse Collections ( https://www.eclipse.org/collections/ ), which has an ObjectDoubleMap ( https://www.eclipse.org/collections/javadoc/8.0.0/org/eclipse/collections/api/map/primitive/ObjectDoubleMap.html ): a map from objects (String in your case) to primitive double values. It is much better at handling memory, and performs better too.

You can get an empty instance of this by doing:

ObjectDoubleMaps.mutable.empty();
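A minimal usage sketch, assuming the eclipse-collections dependency is on the classpath (the original answer shows only the factory call):

import org.eclipse.collections.api.map.primitive.MutableObjectDoubleMap;
import org.eclipse.collections.impl.factory.primitive.ObjectDoubleMaps;

MutableObjectDoubleMap<String> mmap = ObjectDoubleMaps.mutable.empty();
mmap.put(splitted[0], 0.0);           // stores a primitive double: no Double boxing
double value = mmap.get(splitted[0]); // reads it back as a primitive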

Yep, I checked your example with 0.0 as the dummy value vs. a static final constant as the dummy value vs. HashSet. That's a rough comparison; for better precision I would recommend using the JMH tool, but my HashSet performance was pretty much the same as with the static-constant dummy.
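A minimal JMH sketch of the boxing difference might look like this (hypothetical; assumes the org.openjdk.jmh:jmh-core dependency):

import org.openjdk.jmh.annotations.Benchmark;

public class BoxingBenchmark {
    private static final Double DUMMY = 0.0; // boxed exactly once

    @Benchmark
    public Double autoboxed() {
        return 0.0; // compiles to Double.valueOf(0.0): a new Double per call
    }

    @Benchmark
    public Double sharedConstant() {
        return DUMMY; // returns the shared instance, no allocation
    }
}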

So, most probably, the low performance is caused by boxing your 0.0 dummy value for every line (it is replaced by Double.valueOf() during compilation, which explicitly creates a new Double object every time).
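In other words, the put call in the question's loop is equivalent to the following (illustrative):

// What the compiler emits for mmap.put(splitted[0], 0.0):
mmap.put(splitted[0], Double.valueOf(0.0)); // allocates a fresh Double on every line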

That would explain the low performance, as HashSet uses a predefined static final dummy object (which is not null, by the way).
