绝对最快的Java HTML转义功能

Question

Basically, this post is a challenge. 基本上，这篇文章是一个挑战。 I've been trying to optimize an HTML escape function today with moderate success. 今天，我一直在尝试优化HTML转义功能，并取得了一定的成功。 But I know there are some serious Java hackers out there who can probably do this way better than me and I'd love to learn. 但是我知道那里有一些严重的Java黑客，他们可能比我做得更好，我很想学习。

I've been profiling my Java web app and have found that a major hotspot was our String escaping function. 我一直在分析Java Web应用程序，发现一个主要的热点是我们的字符串转义功能。 We currently use Apache Commons Lang for this task, calling StringEscapeUtils.escapeHtml(). 当前，我们将Apache Commons Lang用于此任务，调用StringEscapeUtils.escapeHtml（）。 I assumed since it is so widely used it would be reasonably fast, but even my most naive implementation was significantly faster. 我以为它被广泛使用会相当快，但即使是我最幼稚的实现也明显更快。

Here's the benchmark code I used along with my Naive implementation. 这是我与Naive实现一起使用的基准代码。 It tests a strings of varying length, some containing nothing but plain text and some containing HTML that needs escaping. 它测试了不同长度的字符串，其中一些仅包含纯文本，而另一些则包含需要转义的HTML。

public class HTMLEscapeBenchmark {

    public static String escapeHtml(String text) {
        if (text == null) return null;

        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);

            if (c == '&') {
                sb.append("&amp;");
            } else if (c == '\'') {
                sb.append("&#39;");
            } else if (c == '"') {
                sb.append("&quot;");
            } else if (c == '<') {
                sb.append("&lt;");
            } else if (c == '>') {
                sb.append("&gt;");
            } else {
                sb.append(c);
            }
        }

        return sb.toString();
    }

    /*
    public static String escapeHtml(String text) {
        if (text == null) return null;
        return StringEscapeUtils.escapeHtml(text);
    }
    */


    public static void main(String[] args) {

        final int RUNS = 5;
        final int ITERATIONS = 1000000;


        // Standard lorem ipsum text.
        String loremIpsum = "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut " +
            "labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut " +
            "aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum " +
            "dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia " +
            "deserunt mollit anim id est laborum. ";
        while (loremIpsum.length() < 1000) loremIpsum += loremIpsum;

        // Add some characters that need HTML escaping.  Bold every 2 and 3 letter word, quote every 5 letter word.
        String loremIpsumHtml = loremIpsum.replaceAll("[A-Za-z]{2}]", "<b>$0</b>").replaceAll("[A-Za-z]{5}", "\"$0\"");

        System.out.print("\nNormal-10");
        String text = loremIpsum.substring(0, 10);
        for (int run = 1; run <= RUNS; run++) {
            long start = System.nanoTime();
            for (int i = 0; i < ITERATIONS; i++) {
                escapeHtml(text);
            }
            System.out.printf("\t%.3f", (System.nanoTime() - start) / 1e9);
        }

        System.out.print("\nNormal-100");
        text = loremIpsum.substring(0, 100);
        for (int run = 1; run <= RUNS; run++) {
            long start = System.nanoTime();
            for (int i = 0; i < ITERATIONS; i++) {
                escapeHtml(text);
            }
            System.out.printf("\t%.3f", (System.nanoTime() - start) / 1e9);
        }

        System.out.print("\nNormal-1000");
        text = loremIpsum.substring(0, 1000);
        for (int run = 1; run <= RUNS; run++) {
            long start = System.nanoTime();
            for (int i = 0; i < ITERATIONS; i++) {
                escapeHtml(text);
            }
            System.out.printf("\t%.3f", (System.nanoTime() - start) / 1e9);
        }

        System.out.print("\nHtml-10");
        text = loremIpsumHtml.substring(0, 10);
        for (int run = 1; run <= RUNS; run++) {
            long start = System.nanoTime();
            for (int i = 0; i < ITERATIONS; i++) {
                escapeHtml(text);
            }
            System.out.printf("\t%.3f", (System.nanoTime() - start) / 1e9);
        }

        System.out.print("\nHtml-100");
        text = loremIpsumHtml.substring(0, 100);
        for (int run = 1; run <= RUNS; run++) {
            long start = System.nanoTime();
            for (int i = 0; i < ITERATIONS; i++) {
                escapeHtml(text);
            }
            System.out.printf("\t%.3f", (System.nanoTime() - start) / 1e9);
        }

        System.out.print("\nHtml-1000");
        text = loremIpsumHtml.substring(0, 1000);
        for (int run = 1; run <= RUNS; run++) {
            long start = System.nanoTime();
            for (int i = 0; i < ITERATIONS; i++) {
                escapeHtml(text);
            }
            System.out.printf("\t%.3f", (System.nanoTime() - start) / 1e9);
        }
    }
}

On my two year old MacBook pro, I get the following results. 在两岁的MacBook Pro上，我得到以下结果。

Commons Lang StringEscapeUtils.escapeHtml 公地郎StringEscapeUtils.escapeHtml

Normal-10     0.439     0.357     0.351     0.343     0.342
Normal-100     2.244     0.934     0.930     0.932     0.931
Normal-1000     8.993     9.020     9.007     9.043     9.052
Html-10     0.270     0.259     0.258     0.258     0.257
Html-100     1.769     1.753     1.765     1.754     1.759
Html-1000     17.313     17.479     17.347     17.266     17.246

Naive Implementation 天真的实现

Normal-10    0.111    0.091    0.086     0.084     0.088
Normal-100    0.636     0.627     0.626     0.626     0.627
Normal-1000     5.740     5.755     5.721     5.728     5.720
Html-10     0.145     0.138     0.138     0.138     0.138
Html-100     0.899     0.901     0.896     0.901     0.900
Html-1000     8.249     8.288     8.272     8.262     8.284

I'll post my own best attempt at optimization as an answer. 我将发表自己的最佳尝试作为答案。 So, my question is, can you do better? 所以，我的问题是，你能做得更好吗？ What is the fastest possible method for escaping HTML? 逃脱HTML最快的方法是什么？

Answer 1

Here's my best attempt at optimizing it. 这是我对其进行优化的最佳尝试。 I optimized for what I hope is the common case of plain text strings, but I wasn't able to make it much better for strings with HTML entities. 我针对普通文本字符串的常见情况进行了优化，但是对于带有HTML实体的字符串，我无法做到更好。

    public static String escapeHtml(String value) {
        if (value == null) return null;

        int length = value.length();
        String encoded;

        for (int i = 0; i < length; i++) {
            char c = value.charAt(i);

            if (c <= 62 && (encoded = getHtmlEntity(c)) != null) {
                // We found a character to encode, so we need to start from here and buffer the encoded string.
                StringBuilder sb = new StringBuilder((int) (length * 1.25));
                sb.append(value.substring(0, i));
                sb.append(encoded);

                i++;

                for (; i < length; i++) {
                    c = value.charAt(i);

                    if (c <= 62 && (encoded = getHtmlEntity(c)) != null) {
                        sb.append(encoded);
                    } else {
                        sb.append(c);
                    }
                }

                value = sb.toString();
                break;
            }
        }

        return value;
    }

    private static String getHtmlEntity(char c) {
        switch (c) {
            case '&': return "&amp;";
            case '\'': return "&#39;";
            case '"': return "&quot;";
            case '<': return "&lt;";
            case '>': return "&gt;";
            default: return null;
        }
    }


Normal-10     0.021     0.023     0.011     0.012     0.011
Normal-100     0.074     0.074     0.073     0.074     0.074
Normal-1000     0.671     0.678     0.675     0.675     0.680
Html-10     0.222     0.152     0.153     0.153     0.154
Html-100     0.739     0.715     0.718     0.724     0.706
Html-1000     6.812     6.828     6.802     6.802     6.806

Answer 2

I assumed since it is so widely used it would be reasonably fast, but even my most naive implementation was significantly faster. 我以为它被广泛使用会相当快，但即使是我最幼稚的实现也明显更快。

If you look at the source code of the Apache version ( for example ) you will see that it is dealing with a number of cases that your version ignores: 如果查看Apache版本的源代码（例如），您会发现它正在处理许多您的版本忽略的情况：

it encodes all of the entities defined in HTML 4 (and others added by the application), not just a hard-wired minimal subset 它对HTML 4中定义的所有实体（以及应用程序添加的其他实体）进行编码，而不仅仅是硬连接的最小子集
it encodes all characters greater or equal to 0x7F. 它编码所有大于或等于0x7F的字符。

In short, it is slower because it is more general. 简而言之，它更慢，因为它更通用。

绝对最快的Java HTML转义功能

问题描述

2 个解决方案

解决方案1
3 2012-10-20 00:26:49

解决方案2
2 2012-10-20 01:03:05

绝对最快的Java HTML转义功能

问题描述

2 个解决方案

解决方案1 3 2012-10-20 00:26:49

解决方案2 2 2012-10-20 01:03:05

解决方案1
3 2012-10-20 00:26:49

解决方案2
2 2012-10-20 01:03:05