
Why is HashSet in Java taking so much memory?

I'm loading a 1GB ASCII text file with about 38 million rows into a HashSet. Using Java 11, the process takes about 8GB of memory.

HashSet<String> addresses = new HashSet<>(38741847);
try (Stream<String> lines = Files.lines(Paths.get("test.txt"), Charset.defaultCharset())) {
    lines.forEach(addresses::add);
}
System.out.println(addresses.size());
Thread.sleep(100000);
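As a side note, heap usage around the load can be observed with the standard `Runtime` API rather than an external profiler. The sketch below is my own illustration, not part of the original post; the numbers it reports are approximate since the JVM only garbage-collects on a best-effort basis.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.stream.Stream;

public class HeapUsage {
    // Approximate bytes of heap currently in use.
    static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) throws IOException {
        long before = usedHeap();
        HashSet<String> addresses = new HashSet<>(38741847);
        try (Stream<String> lines = Files.lines(Paths.get("test.txt"))) {
            lines.forEach(addresses::add);
        }
        System.gc(); // best effort; the measurement stays approximate
        long after = usedHeap();
        System.out.printf("%d entries, ~%d MB used%n",
                addresses.size(), (after - before) / (1024 * 1024));
    }
}
```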

Why is Java taking so much memory?

In comparison, I've implemented the same thing in Python, which takes only 4GB of memory.

import time

s = set()
with open("test.txt") as file:
    for line in file:
        s.add(line)
print(len(s))
time.sleep(1000)

A HashSet has a load factor which defaults to 0.75. That means the backing table is resized, and all entries rehashed, once the set is 75% full. If your hash set should hold 38741847 elements, you have to initialize it with a capacity of 38741847/0.75 (about 51.7 million) or set a higher load factor:

new HashSet<>(38741847, 1); // load factor 1 (100%)
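The capacity needed to avoid any resize can be computed directly. A minimal sketch (the method name `capacityFor` is my own, not from the post):

```java
public class CapacityDemo {
    // HashSet resizes once size > capacity * loadFactor,
    // so holding n elements without a resize needs capacity >= n / loadFactor.
    static int capacityFor(int n, float loadFactor) {
        return (int) Math.ceil(n / (double) loadFactor);
    }

    public static void main(String[] args) {
        // 38741847 / 0.75 = 51655796 slots up front
        System.out.println(capacityFor(38741847, 0.75f));
    }
}
```

Note that `HashMap` internally rounds the requested capacity up to the next power of two, so the actual table may be somewhat larger than this figure.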

Meanwhile I found the answer here, where I also discovered a few alternative HashSet implementations which are part of the trove4j and hppc libraries. I've tested them with the same code.

trove4j took only 5.5GB:

THashSet<String> s = new THashSet<>(38742847,1);

hppc took only 5GB:

ObjectIdentityHashSet<String> s2 = new ObjectIdentityHashSet<>(38742847,1, 0.99); 
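As for why the JDK collections need so much more than the raw 1GB of text: each short ASCII line becomes several heap objects. A rough back-of-the-envelope estimate (my own, not from the post; sizes assume 64-bit HotSpot with compressed oops and are approximations, not exact values):

```java
public class OverheadEstimate {
    // Approximate per-entry cost of a HashSet<String>, in bytes.
    static long bytesPerEntry(int avgChars) {
        long node   = 32;            // HashMap.Node: header + hash + key/value/next refs
        long table  = 8;             // share of the Object[] bucket table (~1 slot per entry)
        long string = 24;            // String object: header + hash + coder + byte[] ref
        long array  = 16 + avgChars; // backing byte[] (Latin-1): header + length + data
        return node + table + string + array;
    }

    public static void main(String[] args) {
        int avgChars = 27; // ~1GB of ASCII spread over ~38.7M lines
        long total = bytesPerEntry(avgChars) * 38_741_847L;
        System.out.printf("~%.1f GB estimated%n", total / 1e9);
    }
}
```

So a ~27-byte line costs on the order of 100 bytes once stored, and the observed 8GB is higher still because resizing temporarily keeps two bucket tables alive and the load itself produces garbage that may not have been collected yet.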
