絕對最快的Java HTML轉義功能

Question

基本上，這篇文章是一個挑戰。 今天，我一直在嘗試優化HTML轉義功能，並取得了一定的成功。 但是我知道那里有一些嚴重的Java黑客，他們可能比我做得更好，我很想學習。

我一直在分析Java Web應用程序，發現一個主要的熱點是我們的字符串轉義功能。 當前，我們將Apache Commons Lang用於此任務，調用StringEscapeUtils.escapeHtml（）。 我以為它被廣泛使用會相當快，但即使是我最幼稚的實現也明顯更快。

這是我與Naive實現一起使用的基准代碼。 它測試了不同長度的字符串，其中一些僅包含純文本，而另一些則包含需要轉義的HTML。

public class HTMLEscapeBenchmark {

    public static String escapeHtml(String text) {
        if (text == null) return null;

        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);

            if (c == '&') {
                sb.append("&amp;");
            } else if (c == '\'') {
                sb.append("&#39;");
            } else if (c == '"') {
                sb.append("&quot;");
            } else if (c == '<') {
                sb.append("&lt;");
            } else if (c == '>') {
                sb.append("&gt;");
            } else {
                sb.append(c);
            }
        }

        return sb.toString();
    }

    /*
    public static String escapeHtml(String text) {
        if (text == null) return null;
        return StringEscapeUtils.escapeHtml(text);
    }
    */


    public static void main(String[] args) {

        final int RUNS = 5;
        final int ITERATIONS = 1000000;


        // Standard lorem ipsum text.
        String loremIpsum = "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut " +
            "labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut " +
            "aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum " +
            "dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia " +
            "deserunt mollit anim id est laborum. ";
        while (loremIpsum.length() < 1000) loremIpsum += loremIpsum;

        // Add some characters that need HTML escaping.  Bold every 2 and 3 letter word, quote every 5 letter word.
        String loremIpsumHtml = loremIpsum.replaceAll("[A-Za-z]{2}]", "<b>$0</b>").replaceAll("[A-Za-z]{5}", "\"$0\"");

        System.out.print("\nNormal-10");
        String text = loremIpsum.substring(0, 10);
        for (int run = 1; run <= RUNS; run++) {
            long start = System.nanoTime();
            for (int i = 0; i < ITERATIONS; i++) {
                escapeHtml(text);
            }
            System.out.printf("\t%.3f", (System.nanoTime() - start) / 1e9);
        }

        System.out.print("\nNormal-100");
        text = loremIpsum.substring(0, 100);
        for (int run = 1; run <= RUNS; run++) {
            long start = System.nanoTime();
            for (int i = 0; i < ITERATIONS; i++) {
                escapeHtml(text);
            }
            System.out.printf("\t%.3f", (System.nanoTime() - start) / 1e9);
        }

        System.out.print("\nNormal-1000");
        text = loremIpsum.substring(0, 1000);
        for (int run = 1; run <= RUNS; run++) {
            long start = System.nanoTime();
            for (int i = 0; i < ITERATIONS; i++) {
                escapeHtml(text);
            }
            System.out.printf("\t%.3f", (System.nanoTime() - start) / 1e9);
        }

        System.out.print("\nHtml-10");
        text = loremIpsumHtml.substring(0, 10);
        for (int run = 1; run <= RUNS; run++) {
            long start = System.nanoTime();
            for (int i = 0; i < ITERATIONS; i++) {
                escapeHtml(text);
            }
            System.out.printf("\t%.3f", (System.nanoTime() - start) / 1e9);
        }

        System.out.print("\nHtml-100");
        text = loremIpsumHtml.substring(0, 100);
        for (int run = 1; run <= RUNS; run++) {
            long start = System.nanoTime();
            for (int i = 0; i < ITERATIONS; i++) {
                escapeHtml(text);
            }
            System.out.printf("\t%.3f", (System.nanoTime() - start) / 1e9);
        }

        System.out.print("\nHtml-1000");
        text = loremIpsumHtml.substring(0, 1000);
        for (int run = 1; run <= RUNS; run++) {
            long start = System.nanoTime();
            for (int i = 0; i < ITERATIONS; i++) {
                escapeHtml(text);
            }
            System.out.printf("\t%.3f", (System.nanoTime() - start) / 1e9);
        }
    }
}

在兩歲的MacBook Pro上，我得到以下結果。

公地郎StringEscapeUtils.escapeHtml

Normal-10     0.439     0.357     0.351     0.343     0.342
Normal-100     2.244     0.934     0.930     0.932     0.931
Normal-1000     8.993     9.020     9.007     9.043     9.052
Html-10     0.270     0.259     0.258     0.258     0.257
Html-100     1.769     1.753     1.765     1.754     1.759
Html-1000     17.313     17.479     17.347     17.266     17.246

天真的實現

Normal-10    0.111    0.091    0.086     0.084     0.088
Normal-100    0.636     0.627     0.626     0.626     0.627
Normal-1000     5.740     5.755     5.721     5.728     5.720
Html-10     0.145     0.138     0.138     0.138     0.138
Html-100     0.899     0.901     0.896     0.901     0.900
Html-1000     8.249     8.288     8.272     8.262     8.284

我將發表自己的最佳嘗試作為答案。 所以，我的問題是，你能做得更好嗎？ 逃脫HTML最快的方法是什么？

Answer 1

這是我對其進行優化的最佳嘗試。 我針對普通文本字符串的常見情況進行了優化，但是對於帶有HTML實體的字符串，我無法做到更好。

    public static String escapeHtml(String value) {
        if (value == null) return null;

        int length = value.length();
        String encoded;

        for (int i = 0; i < length; i++) {
            char c = value.charAt(i);

            if (c <= 62 && (encoded = getHtmlEntity(c)) != null) {
                // We found a character to encode, so we need to start from here and buffer the encoded string.
                StringBuilder sb = new StringBuilder((int) (length * 1.25));
                sb.append(value.substring(0, i));
                sb.append(encoded);

                i++;

                for (; i < length; i++) {
                    c = value.charAt(i);

                    if (c <= 62 && (encoded = getHtmlEntity(c)) != null) {
                        sb.append(encoded);
                    } else {
                        sb.append(c);
                    }
                }

                value = sb.toString();
                break;
            }
        }

        return value;
    }

    private static String getHtmlEntity(char c) {
        switch (c) {
            case '&': return "&amp;";
            case '\'': return "&#39;";
            case '"': return "&quot;";
            case '<': return "&lt;";
            case '>': return "&gt;";
            default: return null;
        }
    }


Normal-10     0.021     0.023     0.011     0.012     0.011
Normal-100     0.074     0.074     0.073     0.074     0.074
Normal-1000     0.671     0.678     0.675     0.675     0.680
Html-10     0.222     0.152     0.153     0.153     0.154
Html-100     0.739     0.715     0.718     0.724     0.706
Html-1000     6.812     6.828     6.802     6.802     6.806

Answer 2

我以為它被廣泛使用會相當快，但即使是我最幼稚的實現也明顯更快。

如果查看Apache版本的源代碼（例如），您會發現它正在處理許多您的版本忽略的情況：

它對HTML 4中定義的所有實體（以及應用程序添加的其他實體）進行編碼，而不僅僅是硬連接的最小子集
它編碼所有大於或等於0x7F的字符。

簡而言之，它更慢，因為它更通用。

絕對最快的Java HTML轉義功能

問題描述

2 個解決方案

解決方案1
3 2012-10-20 00:26:49

解決方案2
2 2012-10-20 01:03:05

絕對最快的Java HTML轉義功能

問題描述

2 個解決方案

解決方案1 3 2012-10-20 00:26:49

解決方案2 2 2012-10-20 01:03:05

解決方案1
3 2012-10-20 00:26:49

解決方案2
2 2012-10-20 01:03:05