
Java 8 String deduplication vs. String.intern()

I am reading about the String deduplication feature introduced in Java 8 update 20, but I am not sure whether it basically makes String.intern() obsolete.

I know that this JVM feature needs the G1 garbage collector, which might not be an option for many. But assuming one is using G1GC, is there any difference, advantage, or disadvantage between the automatic deduplication done by the JVM and manually interning your strings? (One obvious advantage is not having to pollute your code with calls to intern().)

This is especially interesting considering that Oracle might make G1GC the default GC in Java 9.

With this feature, if you have 1000 distinct String objects, all with the same content "abc", the JVM could make them share the same char[] internally. However, you still have 1000 distinct String objects.

With intern(), you will have just one String object. So if memory saving is your concern, intern() would be better. It will save space as well as GC time.
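To make the difference concrete, here is a minimal sketch of what intern() changes at the object level. The class name InternIdentityDemo and the helper sameObject are made up for this illustration. It only shows object identity; the char[] sharing done by G1 deduplication is internal to the JVM and cannot be observed this way.

```java
final class InternIdentityDemo {
    /** Returns true when both references point to the very same object. */
    static boolean sameObject(String a, String b) {
        return a == b;
    }

    public static void main(String[] args) {
        // new String(...) forces two distinct objects with equal content.
        String a = new String("abc");
        String b = new String("abc");

        System.out.println(a.equals(b));                        // true: same content
        System.out.println(sameObject(a, b));                   // false: two separate objects
        System.out.println(sameObject(a.intern(), b.intern())); // true: one canonical object
    }
}
```

Note that the saving only materializes if you keep the reference returned by intern() and let the original object become garbage.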

However, the performance of intern() isn't that great, last time I heard. You might be better off with your own string cache, even one built on a ConcurrentHashMap, but you need to benchmark it to make sure.
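A hand-rolled cache along those lines could look roughly like the sketch below. StringCache and canonicalize are hypothetical names, not an existing API. One assumption to be aware of: this version holds strong references, so cached strings are never released until the cache itself is dropped, which may or may not suit your application.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class StringCache {
    // Maps each distinct content to the first instance seen with that content.
    private final ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();

    /** Returns the canonical instance for s, registering s if it is the first one seen. */
    String canonicalize(String s) {
        if (s == null) {
            return null; // ConcurrentHashMap does not accept null keys
        }
        String existing = cache.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    int size() {
        return cache.size();
    }

    public static void main(String[] args) {
        StringCache cache = new StringCache();
        String a = new String("abc");
        String b = new String("abc");
        // Both calls return the instance registered first (a).
        System.out.println(cache.canonicalize(a) == cache.canonicalize(b)); // true
    }
}
```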

As a comment points out, do see: http://java-performance.info/string-intern-in-java-6-7-8/. It is a very insightful reference and I learned a lot from it; however, I'm not sure its conclusions are necessarily "one size fits all". Each aspect depends on the needs of your own application, so taking measurements with realistic input data is highly recommended!

The main factor is probably what you have control over:

  • Do you have full control over the choice of GC? In a GUI application, for example, there is still a strong case to be made for using Serial GC: it has a far lower total memory footprint for the process (think 400 MB vs ~1 GB for a moderately complex app) and is much more willing to release memory, e.g. after a transient spike in usage. So you might pick that, or give your users the option. (If the heap remains small, the pauses should not be a big deal.)

  • Do you have full control over the code? The G1GC option is great for 3rd party libraries (and applications!) which you can't edit.

The second consideration (as per @ZhongYu's answer) is that String.intern can de-duplicate the String objects themselves, whereas G1GC can only de-duplicate their private char[] field.

A third consideration may be CPU usage, say if the impact on laptop battery life might be of concern to your users. G1GC will run an extra thread dedicated to de-duplicating the heap. For example, I played with this to run Eclipse and found it caused an initial period of increased CPU activity after starting up (think 1 - 2 minutes), but it settled on a smaller "in-use" heap with no obvious CPU overhead or slow-down thereafter (just eye-balling the task manager). So I imagine a certain percentage of a CPU core will be taken up by de-duplication during or after periods of high memory churn. (Of course there may be a comparable overhead if you call String.intern everywhere, which would also run in serial, but then...)

You probably don't need string de-duplication everywhere. There are probably only certain areas of code that:

  • really impact long-term heap usage, and
  • create a high proportion of duplicate strings

By using String.intern selectively, other parts of the code (which may create temporary or semi-temporary strings) don't pay the price.

And finally, a quick plug for the Guava utility Interner, which:

Provides equivalent behavior to String.intern() for other immutable types

You can also use that for Strings. Memory probably is (and should be) your top performance concern, so this probably doesn't apply often. However, when you need to squeeze every drop of speed out of some hot-spot area, my experience is that Java-based weak-reference HashMap solutions do run slightly but consistently faster than the JVM's C++ implementation of String.intern(), even after tuning the JVM options. (And a bonus: you don't need to tune the JVM options to scale to different input.)
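For readers without Guava on the classpath, the idea of a weak-reference interner can be sketched with the standard library alone. The class below is a hypothetical, simplified illustration built on WeakHashMap; it is not how Guava implements its interner, and the single lock makes it a poor fit for heavily contended code. With Guava, Interners.newWeakInterner() gives you a ready-made equivalent.

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.WeakHashMap;

final class WeakStringInterner {
    // The key is weakly held by WeakHashMap; the value must also be weak,
    // otherwise the map would keep its own keys alive forever.
    private final Map<String, WeakReference<String>> pool = new WeakHashMap<>();

    /** Returns the canonical instance for s; entries vanish once nobody else references them. */
    synchronized String intern(String s) {
        WeakReference<String> ref = pool.get(s);
        String canonical = (ref != null) ? ref.get() : null;
        if (canonical == null) {
            // First time this content is seen (or the old instance was collected).
            pool.put(s, new WeakReference<>(s));
            canonical = s;
        }
        return canonical;
    }

    public static void main(String[] args) {
        WeakStringInterner interner = new WeakStringInterner();
        String a = new String("abc");
        String b = new String("abc");
        System.out.println(interner.intern(a) == interner.intern(b)); // true while a is reachable
    }
}
```

As with everything else in this answer, whether this beats String.intern() for your workload is something to measure rather than assume.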

I want to introduce another decision factor, regarding the target audience:

  • For a system integrator whose system is composed of many different libraries and frameworks, with little capacity to influence their internal development, string deduplication (UseStringDeduplication) could be a quick win if memory is a problem. It will affect all the Strings in the JVM, but G1 will use only spare time to do it. You may even tweak when deduplication kicks in by using another parameter (StringDeduplicationAgeThreshold).
  • For a developer profiling their own code, String.intern could be more interesting. A thoughtful review of the domain model is necessary to decide whether to call intern, and when. As a rule of thumb, you may use intern when you know the String will take a limited set of values, like a kind of enumerated set (e.g. country name, month, day of week...).
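As an illustration of that rule of thumb, the sketch below interns only the field that behaves like an enumerated set. The Customer class and its fields are invented for this example; the assumption is that country takes few distinct values while name is nearly unique, so interning name would add cost without saving memory.

```java
final class Customer {
    final String name;    // high cardinality: left as-is
    final String country; // low cardinality: interned

    Customer(String name, String country) {
        this.name = name;
        // Only the enumerated-like value goes through the intern pool.
        this.country = country.intern();
    }

    public static void main(String[] args) {
        // new String(...) simulates values freshly parsed from a file or socket.
        Customer c1 = new Customer("Alice", new String("France"));
        Customer c2 = new Customer("Bob", new String("France"));
        System.out.println(c1.country == c2.country); // true: one shared String object
    }
}
```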
