
Java 8 String deduplication vs. String.intern()

Asked 2015-09-29 22:46:02. Tags: java, string, jvm-hotspot, deduplication

I am reading about the String deduplication feature introduced in Java 8 update 20 (more info), but I am not sure whether it basically makes String.intern() obsolete.

I know that this JVM feature requires the G1 garbage collector, which might not be an option for many. But assuming one is using G1GC, is there any difference, advantage, or disadvantage between the automatic deduplication done by the JVM and manually interning your strings? (One obvious advantage is not having to pollute your code with calls to intern().)

This is especially interesting considering that Oracle might make G1GC the default GC in Java 9.

3 answers

With this feature, if you have 1000 distinct String objects, all with the same content "abc", the JVM could make them share the same char[] internally. However, you still have 1000 distinct String objects.

With intern(), you will have just one String object. So if memory saving is your concern, intern() would be better. It'll save space as well as GC time.
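To make the object-count difference concrete, here is a minimal sketch. The class and method names (InternIdentityDemo, makeCopies, distinctObjects) are made up for illustration. It only counts object identities; it does not observe G1's sharing of the backing array, which is not visible from Java code.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

class InternIdentityDemo {
    /** Builds n Strings with content "abc", optionally passing each through intern(). */
    static String[] makeCopies(int n, boolean intern) {
        String[] result = new String[n];
        for (int i = 0; i < n; i++) {
            String s = new String("abc"); // new String(...) always allocates a fresh object
            result[i] = intern ? s.intern() : s;
        }
        return result;
    }

    /** Counts distinct object identities (==), not distinct contents (equals). */
    static int distinctObjects(String[] strings) {
        Set<String> identities =
                Collections.newSetFromMap(new IdentityHashMap<String, Boolean>());
        Collections.addAll(identities, strings);
        return identities.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctObjects(makeCopies(1000, false))); // prints 1000
        System.out.println(distinctObjects(makeCopies(1000, true)));  // prints 1
    }
}
```

Even with deduplication enabled, the first count stays at 1000: each String header remains, and only the character storage behind the strings can be shared.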

However, the performance of intern() wasn't that great, last I heard. You might be better off with your own string cache, even one built on a ConcurrentHashMap... but you need to benchmark it to make sure.
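A hand-rolled cache of that kind could look roughly like the sketch below. StringCache and canonical are hypothetical names, not an existing API. Note the assumption baked in: the map holds strong references, so entries are never reclaimed, and that only suits a bounded set of values.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class StringCache {
    // Strong references: cached strings stay alive as long as the cache does.
    private final ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();

    /** Returns the canonical instance equal to s, registering s if none exists yet. */
    String canonical(String s) {
        String existing = cache.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    int size() {
        return cache.size();
    }
}
```

Whether this beats intern() depends on the workload and JVM version, so it would need to be benchmarked against realistic data before adopting it.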

As a comment mentions, do see: http://java-performance.info/string-intern-in-java-6-7-8/ . It is a very insightful reference and I learned a lot; however, I'm not sure its conclusions are necessarily "one size fits all". Each aspect depends on the needs of your own application. Taking measurements with realistic input data is highly recommended!

The main factor probably depends on what you have control over:

Do you have full control over the choice of GC? In a GUI application, for example, there is still a strong case to be made for using Serial GC (a far lower total memory footprint for the process, think 400 MB vs ~1 GB for a moderately complex app, and a much greater willingness to release memory, e.g. after a transient spike in usage). So you might pick that, or give your users the option. (If the heap remains small, the pauses should not be a big deal.)
Do you have full control over the code? The G1GC option is great for 3rd-party libraries (and applications!) which you can't edit.
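For reference, deduplication is switched on from the command line with -XX:+UseG1GC -XX:+UseStringDeduplication, with no code changes. The sketch below is one way to check from inside a running process whether those flags are in effect. It assumes a HotSpot-based JVM (Java 8u20 or later) that exposes com.sun.management.HotSpotDiagnosticMXBean. DedupFlagCheck and vmOption are names invented here.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

class DedupFlagCheck {
    /** Reads the current value of a HotSpot VM option, e.g. "true" or "false". */
    static String vmOption(String name) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // Throws IllegalArgumentException if this JVM does not know the option.
        return bean.getVMOption(name).getValue();
    }

    public static void main(String[] args) {
        System.out.println("UseG1GC = " + vmOption("UseG1GC"));
        System.out.println("UseStringDeduplication = " + vmOption("UseStringDeduplication"));
    }
}
```

Running it as java -XX:+UseG1GC -XX:+UseStringDeduplication DedupFlagCheck should report both options as true on such a JVM.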

The second consideration (as per @ZhongYu's answer) is that String.intern can de-duplicate the String objects themselves, whereas G1GC is necessarily limited to de-duplicating their private char[] field.

A third consideration may be CPU usage, say if the impact on laptop battery life might be of concern to your users. G1GC will run an extra thread dedicated to de-duplicating the heap. For example, I tried running Eclipse with this and found it caused an initial period of increased CPU activity after starting up (think 1 - 2 minutes), but it then settled on a smaller "in-use" heap with no obvious CPU overhead or slow-down thereafter (just eyeballing the task manager). So I imagine a certain % of a CPU core will be taken up by de-duplication (during? after?) periods of high memory churn. (Of course there may be a comparable overhead if you call String.intern everywhere, which would also run serially, but then...)

You probably don't need string de-duplication everywhere. There are probably only certain areas of code that:

really impact long-term heap usage, and
create a high proportion of duplicate strings

By using String.intern selectively, other parts of the code (which may create temporary or semi-temporary strings) don't pay the price.

And finally, a quick plug for the Guava utility Interner, which:

Provides equivalent behavior to String.intern() for other immutable types

You can also use that for Strings. Memory probably is (and should be) your top performance concern, so this probably doesn't apply often. However, when you need to squeeze every drop of speed out of some hot-spot area, my experience is that Java-based weak-reference HashMap solutions do run slightly but consistently faster than the JVM's C++ implementation of String.intern(), even after tuning the JVM options. (And bonus: you don't need to tune the JVM options to scale to different input.)
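With Guava on the classpath, Interners.newWeakInterner() gives you such an interner directly. To show the general idea of a weak-reference HashMap solution without the dependency, here is a JDK-only sketch. WeakInterner is a hypothetical class, not Guava's implementation, and it uses a single lock for simplicity rather than speed.

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.WeakHashMap;

class WeakInterner<T> {
    // Keys are held weakly; values are weak too, so the pool never keeps
    // an otherwise unreachable instance alive.
    private final Map<T, WeakReference<T>> pool = new WeakHashMap<>();

    /** Returns the canonical instance equal to value, registering value if needed. */
    synchronized T intern(T value) {
        WeakReference<T> ref = pool.get(value);
        T canonical = (ref != null) ? ref.get() : null;
        if (canonical == null) {
            // No live canonical instance yet: this one becomes canonical.
            pool.put(value, new WeakReference<>(value));
            canonical = value;
        }
        return canonical;
    }
}
```

Usage is the same as with String.intern(): replace a freshly created value with interner.intern(value) and keep only the returned reference.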

I want to introduce another decision factor, regarding the target audience:

For a system integrator whose system is composed of many different libraries/frameworks, with little ability to influence those libraries' internal development, String deduplication could be a quick win if memory is a problem. It will affect all the Strings in the JVM, but G1 will use only spare time to do it. You can even tune when deduplication is performed by using another parameter (StringDeduplicationAgeThreshold).
For a developer profiling their own code, String.intern could be more interesting. A thoughtful review of the domain model is necessary to decide whether to call intern, and when. As a rule of thumb, you may use intern when you know the String will hold one of a limited set of values, like a kind of enumerated set (e.g. country name, month, day of week...).
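As an illustration of that rule of thumb, the sketch below interns only the low-cardinality field of a parsed record and leaves the high-cardinality one alone. The CustomerParser class, the Customer type, and the "name,country" line format are assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

class CustomerParser {
    static final class Customer {
        final String name;    // mostly unique values: not worth interning
        final String country; // small, enumerated-like set of values: interned

        Customer(String name, String country) {
            this.name = name;
            this.country = country;
        }
    }

    /** Parses "name,country" lines; only the country field is interned. */
    static List<Customer> parse(List<String> lines) {
        List<Customer> customers = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.split(",", 2);
            customers.add(new Customer(parts[0], parts[1].intern()));
        }
        return customers;
    }
}
```

After parsing many lines, every customer from the same country references a single String instance, while the names remain ordinary short-lived or per-record strings.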