Alternatives to Java string interning

Since Java's default string interning has gotten a lot of bad press, I am looking for an alternative.

Can you suggest an API that is a good alternative to Java string interning? My application uses Java 6. My main requirement is to avoid duplicate strings via interning.

Regarding the bad press:

  • String intern is implemented via a native method. The C implementation uses a fixed size of some 1k entries and scales very poorly for a large number of strings.
  • Java 6 stores interned strings in Perm gen, so they are not GC'd and can lead to perm gen errors. I know this is fixed in Java 7, but I can't upgrade to Java 7.

Why do I need to use interning?

  • My application is a server app with a heap size of 10-20G, depending on the deployment.
  • During profiling we found that hundreds of thousands of strings are duplicates, and we can significantly reduce memory usage by not storing duplicate strings.
  • Memory has been a bottleneck for us, so we are targeting it specifically rather than doing premature optimization.

"String intern is implemented via a native method. The C implementation uses a fixed size of some 1k entries and scales very poorly for a large number of strings."

It scales poorly for many thousands of Strings.

"Java 6 stores interned strings in Perm gen, so they are not GC'd"

They will be cleaned up when the perm gen is cleaned up, which is not often. If you don't increase the size of the perm gen, you can reach the maximum of this space.

"My application is a server app with a heap size of 10-20G, depending on the deployment."

I suggest you consider using off-heap memory. In one application I have 500 GB in off-heap memory and about 1 GB on the heap. It isn't useful in all cases, but it is worth considering.
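To make "off heap" concrete, here is a minimal sketch that keeps string bytes in a direct ByteBuffer, which lives outside the garbage-collected heap. It is not necessarily how the application mentioned above does it. The class name OffHeapStrings, its put/get methods, and the length-prefixed UTF-8 layout are all assumptions made for illustration, and the sketch is single-threaded and append-only. It sticks to APIs available in Java 6.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

// Hypothetical append-only store: strings are kept as length-prefixed UTF-8
// bytes in a direct (off-heap) buffer and identified by their byte offset.
class OffHeapStrings {
    private static final Charset UTF8 = Charset.forName("UTF-8");
    private final ByteBuffer buffer;

    public OffHeapStrings(int capacityBytes) {
        // allocateDirect reserves memory outside the Java heap.
        buffer = ByteBuffer.allocateDirect(capacityBytes);
    }

    // Appends the string and returns the offset to read it back with.
    // Throws BufferOverflowException when the buffer is full.
    public int put(String s) {
        byte[] bytes = s.getBytes(UTF8);
        int offset = buffer.position();
        buffer.putInt(bytes.length);
        buffer.put(bytes);
        return offset;
    }

    // Rebuilds a String on the heap from the bytes stored at the offset.
    public String get(int offset) {
        int length = buffer.getInt(offset);
        byte[] bytes = new byte[length];
        ByteBuffer view = buffer.duplicate(); // independent position, shared content
        view.position(offset + 4);            // skip the 4-byte length prefix
        view.get(bytes);
        return new String(bytes, UTF8);
    }
}
```

The trade-off is that every get() decodes a fresh String, so this suits data that is stored in bulk and read occasionally rather than strings that are compared constantly.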

"During profiling we found that hundreds of thousands of strings are duplicates, and we can significantly reduce memory usage by not storing duplicate strings."

For this I have used a simple array of String. It is very lightweight, and you can easily control the upper bound on the number of Strings stored.

Here is an example of a generic interner.

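// A lossy, fixed-size interner: each slot remembers only the last value hashed to it.
class Interner<T> {
    private final T[] cache;

    // primeSize is the number of slots; a prime spreads hashCodes more evenly under %.
    @SuppressWarnings("unchecked")
    public Interner(int primeSize) {
        cache = (T[]) new Object[primeSize];
    }

    public T intern(T t) {
        // The remainder lies in (-length, length), so Math.abs cannot overflow here.
        int hash = Math.abs(t.hashCode() % cache.length);
        T t2 = cache[hash];
        // Hit: return the cached instance so the caller can drop its duplicate.
        if (t2 != null && t.equals(t2))
            return t2;
        // Miss or collision: overwrite the slot, which keeps memory bounded.
        cache[hash] = t;
        return t;
    }
}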

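An interesting property of this cache is that it doesn't matter that it's not thread safe. A race can at worst overwrite a slot or miss a duplicate, and for immutable values such as String the caller still gets back a valid, equal object.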
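A short usage sketch may help show what this cache does and does not guarantee. The Interner class is repeated so the example compiles on its own; the InternerDemo class, the sample strings, and the sizes 1009 and 1 are made up for illustration. The one-slot interner is there only to force a collision and show that entries can be evicted.

```java
// Same lossy interner as in the answer, repeated so this sketch is self-contained.
class Interner<T> {
    private final T[] cache;

    @SuppressWarnings("unchecked")
    public Interner(int primeSize) {
        cache = (T[]) new Object[primeSize];
    }

    public T intern(T t) {
        int hash = Math.abs(t.hashCode() % cache.length);
        T t2 = cache[hash];
        if (t2 != null && t.equals(t2))
            return t2;
        cache[hash] = t;
        return t;
    }
}

class InternerDemo {
    public static void main(String[] args) {
        Interner<String> interner = new Interner<String>(1009);

        // Two distinct String objects with equal content collapse to one instance.
        String a = interner.intern(new String("order-status"));
        String b = interner.intern(new String("order-status"));
        System.out.println(a == b); // prints true

        // With a single slot every value collides, so "x" is evicted by "y".
        Interner<String> tiny = new Interner<String>(1);
        String x1 = tiny.intern(new String("x"));
        tiny.intern("y");
        String x2 = tiny.intern(new String("x"));
        System.out.println(x1 == x2);      // prints false
        System.out.println(x1.equals(x2)); // prints true
    }
}
```

So the cache removes most duplicates rather than all of them, in exchange for a fixed memory footprint and no locking.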

For extra speed you can use a power-of-2 size and a bit mask instead of the modulo, but it's more complicated and may not work very well, depending on how your hashCodes are calculated, since the mask keeps only the low bits of the hash.
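A minimal sketch of the power-of-2 variant follows. The class name MaskedInterner, the capacity() accessor, and the rounding of the requested size up to a power of two are assumptions for illustration. The step that XORs the high 16 bits of the hash into the low 16 bits is also an addition, not part of the answer; it is one common way to reduce the risk that hashCodes differing only in their high bits all land in the same slot.

```java
// Hypothetical power-of-two variant of the lossy interner.
class MaskedInterner<T> {
    private final T[] cache;
    private final int mask;

    @SuppressWarnings("unchecked")
    public MaskedInterner(int minSize) {
        if (minSize > (1 << 30))
            throw new IllegalArgumentException("minSize too large: " + minSize);
        // Round up to the next power of two so that (size - 1) is an all-ones mask.
        int size = 1;
        while (size < minSize)
            size <<= 1;
        cache = (T[]) new Object[size];
        mask = size - 1;
    }

    public int capacity() {
        return cache.length;
    }

    public T intern(T t) {
        int h = t.hashCode();
        // Assumption: fold the high bits into the low bits before masking.
        h ^= (h >>> 16);
        // The mask replaces Math.abs(h % length); the result is always in [0, size).
        int index = h & mask;
        T t2 = cache[index];
        if (t2 != null && t.equals(t2))
            return t2;
        cache[index] = t;
        return t;
    }
}
```

For example, new MaskedInterner<String>(1000) allocates 1024 slots. Whether this beats the prime-size version in practice depends on the hash distribution of the real data, so it is worth measuring both.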

Notice: technical posts on this site are published under the CC BY-SA 4.0 license. If you repost, please credit this site's URL or the original address.