
String.intern() vs manual string-to-identifier mapping?

I recall seeing a couple of string-intensive programs that do a lot of string comparison but relatively little string manipulation, and that use a separate table to map strings to identifiers for efficient equality checks and a lower memory footprint, e.g.:

public class Name {
    public static Map<String, Name> names = new SomeMap<String, Name>();
    public static Name from(String s) {
        Name n = names.get(s);
        if (n == null) {
            n = new Name(s);
            names.put(s, n);
        }
        return n;
    }
    private final String str;
    private Name(String str) { this.str = str; }
    @Override public String toString() { return str; }
    // equals() and hashCode() are not overridden!
}

I'm pretty sure one of these programs was javac from OpenJDK, so not some toy application. Of course the actual class was more complex (and I think it also implemented CharSequence), but you get the idea - the entire program was littered with Name in any location where you would expect String, and in the rare cases where string manipulation was needed, it converted to strings and then cached them again, conceptually like:

Name newName = Name.from(name.toString().substring(5));
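For reference, here is a minimal runnable sketch of the pattern above. It is not javac's actual class: ConcurrentHashMap and computeIfAbsent (Java 8+) are assumptions standing in for SomeMap, and NameDemo exists only to show usage.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class Name {
    // Assumption: a concurrent map stands in for the unspecified SomeMap.
    private static final ConcurrentMap<String, Name> NAMES = new ConcurrentHashMap<>();

    private final String str;

    private Name(String str) { this.str = str; }

    // Returns the single canonical Name for the given text.
    static Name from(String s) {
        return NAMES.computeIfAbsent(s, Name::new);
    }

    @Override public String toString() { return str; }
    // equals() and hashCode() are deliberately not overridden: identity is equality.
}

final class NameDemo {
    public static void main(String[] args) {
        Name a = Name.from("prefix.value");
        Name b = Name.from(new String("prefix.value"));
        System.out.println(a == b); // true: same canonical instance

        // String manipulation goes through String and back into the table.
        Name c = Name.from(a.toString().substring(7));
        System.out.println(c == Name.from("value")); // true
    }
}
```

Since each distinct text has exactly one Name, == is enough to compare two Names.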

I think I understand the point of this - especially when there are a lot of identical strings all around and a lot of comparisons - but couldn't the same be achieved by just using regular strings and interning them? The documentation for String.intern() explicitly says:

...
When the intern method is invoked, if the pool already contains a string equal to this String object as determined by the equals(Object) method, then the string from the pool is returned. Otherwise, this String object is added to the pool and a reference to this String object is returned.

It follows that for any two strings s and t, s.intern() == t.intern() is true if and only if s.equals(t) is true.
...
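The quoted guarantee can be checked with a small sketch. The class and method names (InternIdentityDemo, sameAfterIntern) are made up for illustration; only String.intern() itself comes from the JDK.

```java
final class InternIdentityDemo {
    // True exactly when the two strings have equal contents, per the documented contract.
    static boolean sameAfterIntern(String s, String t) {
        return s.intern() == t.intern();
    }

    public static void main(String[] args) {
        String lo = "lo";
        // Concatenation with a non-constant operand creates a new String each time.
        String s = "hel" + lo;
        String t = "hel" + lo;

        System.out.println(s == t);                // false: two distinct objects
        System.out.println(s.equals(t));           // true: same contents
        System.out.println(sameAfterIntern(s, t)); // true: one pooled instance
        System.out.println(s.intern() == "hello"); // true: string literals are interned
    }
}
```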

So, what are the advantages and disadvantages of manually managing a Name-like class vs using intern()?

What I've thought of so far:

  • Manually managing the map means using the regular heap, while intern() uses the permgen.
  • When manually managing the map you get type-checking that can verify something is a Name, while an interned string and a non-interned string share the same type, so it's possible to forget interning in some places.
  • Relying on intern() means reusing an existing, optimized, tried-and-tested mechanism without coding any extra classes.
  • Manually managing the map results in code that is more confusing to new users, and string operations become more cumbersome.

... but I feel like I'm missing something else here.

Unfortunately, String.intern() can be slower than a simple synchronized HashMap. It doesn't need to be so slow, but as of today in Oracle's JDK, it is slow (probably due to JNI).

Another thing to consider: you are writing a parser; you have collected some chars in a char[], and you need to make a String out of them. Since the string is probably common and can be shared, we'd like to use a pool.

String.intern() uses such a pool; yet to look something up, you need a String to begin with. So we first need to call new String(char[], offset, length).

We can avoid that overhead in a custom pool, where the lookup can be done directly on char[], offset, length. For example, the pool can be a trie. The string is most likely already in the pool, so we get the String without any memory allocation.

If we don't want to write our own pool but use the good old HashMap, we still need to create a key object that wraps char[], offset, length (something like a CharSequence). This is still cheaper than a new String, since we don't copy chars.
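A rough sketch of the HashMap variant, assuming a hypothetical CharSlice key and SlicePool wrapper (neither is a JDK class). A lookup allocates only the small key object; chars are copied only on a miss, when the new entry needs its own stable backing array.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical key: a view over part of a char[] with content-based equals/hashCode. */
final class CharSlice {
    private final char[] buf;
    private final int offset;
    private final int length;

    CharSlice(char[] buf, int offset, int length) {
        this.buf = buf;
        this.offset = offset;
        this.length = length;
    }

    @Override public int hashCode() {
        int h = 0;
        for (int i = 0; i < length; i++) {
            h = 31 * h + buf[offset + i];
        }
        return h;
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof CharSlice)) return false;
        CharSlice other = (CharSlice) o;
        if (other.length != length) return false;
        for (int i = 0; i < length; i++) {
            if (buf[offset + i] != other.buf[other.offset + i]) return false;
        }
        return true;
    }
}

/** Hypothetical pool; not thread-safe as written. */
final class SlicePool {
    private final Map<CharSlice, String> pool = new HashMap<>();

    String get(char[] buf, int offset, int length) {
        // Lookup key is a view over the caller's buffer: no chars are copied here.
        String hit = pool.get(new CharSlice(buf, offset, length));
        if (hit != null) return hit;

        // Miss: build the String once and store a key that owns its own chars,
        // because the caller's buffer will probably be overwritten later.
        String s = new String(buf, offset, length);
        pool.put(new CharSlice(s.toCharArray(), 0, length), s);
        return s;
    }
}
```

With char[] buf = "foo bar foo".toCharArray(), the calls get(buf, 0, 3) and get(buf, 8, 3) on the same SlicePool return the same String instance, and the second call creates no new String.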

What are the advantages and disadvantages of manually managing a Name-like class vs using intern()?

Type checking is a major concern, but invariant preservation is also a significant concern.

Adding a simple check to the Name constructor

Name(String s) {
  if (!isValidName(s)) { throw new IllegalArgumentException(s); }
  ...
}

can ensure* that there exist no Name instances corresponding to invalid names like "12#blue,,", which means that methods that take Names as arguments or consume Names returned by other methods don't need to worry about where invalid Names might creep in.
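As a sketch of what such a guarded class could look like: the validity rule below (Java-identifier-like names) is only an assumption chosen for illustration, since isValidName is not defined here, and ValidatedName is a made-up class name.

```java
import java.util.HashMap;
import java.util.Map;

final class ValidatedName {
    private static final Map<String, ValidatedName> NAMES = new HashMap<>();

    private final String str;

    // The only way to create an instance, so every instance has passed the check.
    private ValidatedName(String s) {
        if (!isValidName(s)) { throw new IllegalArgumentException(s); }
        this.str = s;
    }

    // Assumed rule for illustration: non-empty and shaped like a Java identifier.
    static boolean isValidName(String s) {
        if (s == null || s.isEmpty() || !Character.isJavaIdentifierStart(s.charAt(0))) {
            return false;
        }
        for (int i = 1; i < s.length(); i++) {
            if (!Character.isJavaIdentifierPart(s.charAt(i))) return false;
        }
        return true;
    }

    // Canonicalizing factory; invalid input never reaches the table.
    static synchronized ValidatedName from(String s) {
        ValidatedName n = NAMES.get(s);
        if (n == null) {
            n = new ValidatedName(s);
            NAMES.put(s, n);
        }
        return n;
    }

    @Override public String toString() { return str; }
}
```

Here ValidatedName.from("blue") returns the canonical instance, while ValidatedName.from("12#blue,,") throws IllegalArgumentException before anything is stored.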

To generalize this argument, imagine your code is a castle with walls designed to protect it from invalid inputs. You want some inputs to get through, so you install gates with guards that check inputs as they come through. The Name constructor is an example of a guard.

The difference between String and Name is that Strings can't be guarded against. Any piece of code, malicious or naive, inside or outside the perimeter, can create any string value. Buggy String manipulation code is analogous to a zombie outbreak inside the castle. The guards can't protect the invariants because the zombies don't need to get past them. The zombies just spread and corrupt data as they go.

That a value "is a" String satisfies fewer useful invariants than that a value "is a" Name.

See "stringly typed" for another way to look at the same topic.

* The usual caveat applies: deserialization of a Serializable class can bypass the constructor.

I would always go with the Map because intern() has to do a (probably linear) search inside the internal pool of strings. If you do that quite often, it is not as efficient as a Map, which is made for fast lookup.

String.intern() in Java 5.0 & 6 uses the perm gen space, which usually has a low maximum size. This can mean you run out of space even though there is plenty of free heap.

Java 7 uses the regular heap to store intern()ed Strings.

String comparison is pretty fast, and I don't imagine there is much advantage in cutting comparison times when you consider the overhead.

Another reason this might be done is if there are many duplicate strings. If there is enough duplication, this can save a lot of memory.

A simpler way to cache Strings is to use an LRU cache like LinkedHashMap:

import java.util.LinkedHashMap;
import java.util.Map;

private static final int MAX_SIZE = 10000;
// Access-ordered LinkedHashMap that evicts the least recently used entry once it grows past MAX_SIZE.
// Not thread-safe: with access order enabled, even get() modifies the map.
private static final Map<String, String> STRING_CACHE = new LinkedHashMap<String, String>(MAX_SIZE*10/7, 0.70f, true) {
    @Override
    protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
        return size() > MAX_SIZE;
    }
};

public static String intern(String s) {
    // s2 is a String equal to s, or null if it's not there.
    String s2 = STRING_CACHE.get(s);
    if (s2 == null) {
        // put the string in the map if it's not there already.
        s2 = s;
        STRING_CACHE.put(s2,s2);
    }
    return s2;
}

Here is an example of how it works.

public static void main(String... args) {
    String lo = "lo";
    for (int i = 0; i < 10; i++) {
        String a = "hel" + lo + " " + (i & 1);
        String b = intern(a);
        System.out.println("String \"" + a + "\" has an id of "
                + Integer.toHexString(System.identityHashCode(a))
                + " after interning it has an id of "
                + Integer.toHexString(System.identityHashCode(b))
        );
    }
    System.out.println("The cache contains "+STRING_CACHE);
}

prints

String "hello 0" has an id of 237360be after interning it has an id of 237360be
String "hello 1" has an id of 5736ab79 after interning it has an id of 5736ab79
String "hello 0" has an id of 38b72ce1 after interning it has an id of 237360be
String "hello 1" has an id of 64a06824 after interning it has an id of 5736ab79
String "hello 0" has an id of 115d533d after interning it has an id of 237360be
String "hello 1" has an id of 603d2b3 after interning it has an id of 5736ab79
String "hello 0" has an id of 64fde8da after interning it has an id of 237360be
String "hello 1" has an id of 59c27402 after interning it has an id of 5736ab79
String "hello 0" has an id of 6d4e5d57 after interning it has an id of 237360be
String "hello 1" has an id of 2a36bb87 after interning it has an id of 5736ab79
The cache contains {hello 0=hello 0, hello 1=hello 1}

This ensures the cache of intern()ed Strings is limited in number.

A faster but less effective way is to use a fixed array.

private static final int MAX_SIZE = 10191;
private static final String[] STRING_CACHE = new String[MAX_SIZE];

public static String intern(String s) {
    int hash = (s.hashCode() & 0x7FFFFFFF) % MAX_SIZE;
    String s2 = STRING_CACHE[hash];
    if (!s.equals(s2))
        STRING_CACHE[hash] = s2 = s;
    return s2;
}

The test above works the same, except you need

System.out.println("The cache contains "+ new HashSet<String>(Arrays.asList(STRING_CACHE)));

to print out the contents, which shows the following, including one null for the empty entries.

The cache contains [null, hello 1, hello 0]

The advantage of this approach is speed, and that it can be safely used by multiple threads without locking, i.e. it doesn't matter if different threads have different views of STRING_CACHE. Because String is immutable, a stale read only costs a cache miss.

So, what are the advantages and disadvantages of manually managing a Name-like class vs using intern()?

One advantage is:

It follows that for any two strings s and t, s.intern() == t.intern() is true if and only if s.equals(t) is true.

In a program where a great many small strings must be compared often, this may pay off. It also saves space in the end. Consider a source program that uses names like AbstractSyntaxTreeNodeItemFactorySerializer quite often. With intern(), this string is stored once and that is it. Everything else is just references to it, and you need the references anyway.
