
Real Life, Practical Example of Using String.intern() in Java?

Original post: 2010-08-18 08:59:18 7 5 java/ string/ permgen/ string-interning

I've seen many primitive examples describing how String intern()'ing works, but I have yet to see a real-life use-case that would benefit from it.

The only situation that I can dream up is having a web service that receives a considerable number of requests, each being very similar in nature due to a rigid schema. By intern()'ing the request field names in this case, memory consumption can be significantly reduced.

Can anyone provide an example of using intern() in a production environment with great success? Maybe an example of it in a popular open source offering?

Edit: I am referring to manual interning, not the guaranteed interning of String literals, etc.

5 answers

Interning can be very beneficial if you have N strings that can take only K different values, where N far exceeds K. Now, instead of storing N strings in memory, you will only be storing up to K.

For example, you may have an ID type which consists of 5 digits. Thus, there can only be 10^5 different values. Suppose you're now parsing a large document that has many references/cross references to ID values. Let's say this document has 10^9 references in total (obviously some references are repeated in other parts of the document).

So N = 10^9 and K = 10^5 in this case. If you are not interning the strings, you will be storing 10^9 strings in memory, where many of those strings are equal (by the pigeonhole principle). If you intern() the ID string you get when you're parsing the document, and you don't keep any reference to the uninterned strings you read from the document (so they can be garbage collected), then you will never need to store more than 10^5 strings in memory.
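A minimal sketch of the N-versus-K argument above, on a tiny scale. The class name, the whitespace-separated document format, and the helper methods are assumptions made for illustration; a real parser would look different.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

final class IdInterningSketch {

    // Hypothetical parser: reads whitespace-separated ID tokens and
    // optionally interns each one before storing it.
    static List<String> parseIds(String document, boolean intern) {
        List<String> ids = new ArrayList<>();
        for (String token : document.trim().split("\\s+")) {
            // new String(...) stands in for the fresh String a real parser
            // would allocate for every reference it reads.
            String id = new String(token);
            ids.add(intern ? id.intern() : id);
        }
        return ids;
    }

    // Counts distinct String objects by identity (==), not by equals().
    static int distinctInstances(List<String> strings) {
        Map<String, Boolean> seen = new IdentityHashMap<>();
        for (String s : strings) {
            seen.put(s, Boolean.TRUE);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        String doc = "00042 00007 00042 00042 00007 12345";
        System.out.println(distinctInstances(parseIds(doc, false))); // 6 (N references)
        System.out.println(distinctInstances(parseIds(doc, true)));  // 3 (K distinct values)
    }
}
```

Without interning, the list keeps one String object alive per reference; with interning, only one object per distinct ID stays reachable and the temporary copies become garbage.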

Not a complete answer but additional food for thought (found here):

Therefore, the primary benefit in this case is that using the == operator for internalized strings is a lot faster than use the equals() method [for not internalized Strings]. So, use the intern() method if you're going to be comparing strings more than a time or three.
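To make the comparison point concrete, here is a small sketch of how == behaves before and after interning. The class and method names are made up for this example, and it says nothing about actual timings; it only shows when identity comparison is safe.

```java
final class InternComparisonSketch {

    // Identity comparison; only meaningful when both arguments are known to be interned.
    static boolean sameInstance(String x, String y) {
        return x == y;
    }

    public static void main(String[] args) {
        // Fresh instances, as a parser or reader would produce them.
        String a = new String("status");
        String b = new String("status");

        System.out.println(a.equals(b));                          // true
        System.out.println(sameInstance(a, b));                   // false
        System.out.println(sameInstance(a.intern(), b.intern())); // true
        System.out.println(sameInstance(a.intern(), "status"));   // true
    }
}
```

Note that == gives the right answer only if every string on both sides of the comparison has been interned; a single non-interned string makes it silently return false.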

We had a production system that processes literally millions of pieces of data at a time, many of which have string fields. We should have been interning strings, but there was a bug which meant we were not. By fixing the bug we avoided having to do a very costly (at least 6 figures, possibly 7) server upgrade.

Examples where interning will be beneficial involve a large number of strings where:

the strings are likely to survive multiple GC cycles, and
there are likely to be multiple copies of a large percentage of the Strings.

Typical examples involve splitting / parsing a text into symbols (words, identifiers, URIs) and then attaching those symbols to long-lived data structures. XML processing, programming language compilation and RDF / OWL triple stores spring to mind as applications where interning is likely to be beneficial.

But interning is not without its problems, especially if it turns out that the assumptions above are not correct:

the pool data structure used to hold the interned strings takes extra space,
interning takes time, and
interning doesn't prevent the creation of the duplicate string in the first place.

Finally, interning potentially increases GC overheads by increasing the number of objects that need to be traced and copied, and by increasing the number of weak references that need to be dealt with. This increase in overheads has to be balanced against the decrease in GC overheads that results from effective interning.
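The point that interning does not prevent the duplicate from being created can be seen in a short sketch. The class and method names are illustrative assumptions; the behavior shown relies only on the documented contract of String.intern().

```java
final class DuplicateBeforeInternSketch {

    // Returns true when a second copy of the same text was allocated
    // and then discarded in favor of the already pooled instance.
    static boolean secondCopyIsDiscarded(char[] chars) {
        // First occurrence: its interned form becomes the canonical instance.
        String canonical = new String(chars).intern();

        // Second occurrence: the duplicate is allocated here, before intern() runs.
        String fresh = new String(chars);
        String pooled = fresh.intern();

        // 'fresh' is a distinct object that is now garbage; 'pooled' is the canonical one.
        return fresh != pooled && pooled == canonical;
    }

    public static void main(String[] args) {
        System.out.println(secondCopyIsDiscarded("urn:example:subject-42".toCharArray())); // true
    }
}
```

So the saving comes from the duplicate being short-lived, not from it never existing; the allocation and the pool lookup are still paid on every call.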

Never, ever, use intern() on user-supplied data, as that can cause denial-of-service attacks (as intern()ed strings are never freed). You can do validation on the user-supplied strings, but then again you've done most of the work needed for intern().
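One possible reading of that last sentence, as a sketch: if user-supplied strings are validated against a fixed set of known values, the lookup itself can hand back a shared instance, so a separate intern() call adds little. The class name and the field names ("id", "name", "email") are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class KnownFieldNames {

    // Fixed set of accepted values; each key maps to its own canonical instance.
    private static final Map<String, String> CANONICAL = new HashMap<>();

    static {
        for (String name : new String[] {"id", "name", "email"}) {
            CANONICAL.put(name, name);
        }
    }

    // Returns the shared instance for a known field name, or null for anything else.
    // Unknown input is never stored, so callers cannot grow the table.
    static String canonicalize(String userSupplied) {
        return CANONICAL.get(userSupplied);
    }

    public static void main(String[] args) {
        String a = canonicalize(new String("email"));
        String b = canonicalize(new String("email"));
        System.out.println(a == b);                        // true
        System.out.println(canonicalize("not-a-field"));   // null
    }
}
```

Because the table is populated once and never written to from request handling, its size is bounded by the set of accepted values rather than by what clients send.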