简体   繁体   中英

Absolute fastest possible Java HTML escape function

Basically, this post is a challenge. I've been trying to optimize an HTML escape function today with moderate success. But I know there are some serious Java hackers out there who can probably do this way better than me and I'd love to learn.

I've been profiling my Java web app and have found that a major hotspot was our String escaping function. We currently use Apache Commons Lang for this task, calling StringEscapeUtils.escapeHtml(). I assumed since it is so widely used it would be reasonably fast, but even my most naive implementation was significantly faster.

Here's the benchmark code I used along with my Naive implementation. It tests a strings of varying length, some containing nothing but plain text and some containing HTML that needs escaping.

public class HTMLEscapeBenchmark {

    public static String escapeHtml(String text) {
        if (text == null) return null;

        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);

            if (c == '&') {
                sb.append("&amp;");
            } else if (c == '\'') {
                sb.append("&#39;");
            } else if (c == '"') {
                sb.append("&quot;");
            } else if (c == '<') {
                sb.append("&lt;");
            } else if (c == '>') {
                sb.append("&gt;");
            } else {
                sb.append(c);
            }
        }

        return sb.toString();
    }

    /*
    public static String escapeHtml(String text) {
        if (text == null) return null;
        return StringEscapeUtils.escapeHtml(text);
    }
    */


    public static void main(String[] args) {

        final int RUNS = 5;
        final int ITERATIONS = 1000000;


        // Standard lorem ipsum text.
        String loremIpsum = "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut " +
            "labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut " +
            "aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum " +
            "dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia " +
            "deserunt mollit anim id est laborum. ";
        while (loremIpsum.length() < 1000) loremIpsum += loremIpsum;

        // Add some characters that need HTML escaping.  Bold every 2 and 3 letter word, quote every 5 letter word.
        String loremIpsumHtml = loremIpsum.replaceAll("[A-Za-z]{2}]", "<b>$0</b>").replaceAll("[A-Za-z]{5}", "\"$0\"");

        System.out.print("\nNormal-10");
        String text = loremIpsum.substring(0, 10);
        for (int run = 1; run <= RUNS; run++) {
            long start = System.nanoTime();
            for (int i = 0; i < ITERATIONS; i++) {
                escapeHtml(text);
            }
            System.out.printf("\t%.3f", (System.nanoTime() - start) / 1e9);
        }

        System.out.print("\nNormal-100");
        text = loremIpsum.substring(0, 100);
        for (int run = 1; run <= RUNS; run++) {
            long start = System.nanoTime();
            for (int i = 0; i < ITERATIONS; i++) {
                escapeHtml(text);
            }
            System.out.printf("\t%.3f", (System.nanoTime() - start) / 1e9);
        }

        System.out.print("\nNormal-1000");
        text = loremIpsum.substring(0, 1000);
        for (int run = 1; run <= RUNS; run++) {
            long start = System.nanoTime();
            for (int i = 0; i < ITERATIONS; i++) {
                escapeHtml(text);
            }
            System.out.printf("\t%.3f", (System.nanoTime() - start) / 1e9);
        }

        System.out.print("\nHtml-10");
        text = loremIpsumHtml.substring(0, 10);
        for (int run = 1; run <= RUNS; run++) {
            long start = System.nanoTime();
            for (int i = 0; i < ITERATIONS; i++) {
                escapeHtml(text);
            }
            System.out.printf("\t%.3f", (System.nanoTime() - start) / 1e9);
        }

        System.out.print("\nHtml-100");
        text = loremIpsumHtml.substring(0, 100);
        for (int run = 1; run <= RUNS; run++) {
            long start = System.nanoTime();
            for (int i = 0; i < ITERATIONS; i++) {
                escapeHtml(text);
            }
            System.out.printf("\t%.3f", (System.nanoTime() - start) / 1e9);
        }

        System.out.print("\nHtml-1000");
        text = loremIpsumHtml.substring(0, 1000);
        for (int run = 1; run <= RUNS; run++) {
            long start = System.nanoTime();
            for (int i = 0; i < ITERATIONS; i++) {
                escapeHtml(text);
            }
            System.out.printf("\t%.3f", (System.nanoTime() - start) / 1e9);
        }
    }
}

On my two year old MacBook pro, I get the following results.

Commons Lang StringEscapeUtils.escapeHtml

Normal-10     0.439     0.357     0.351     0.343     0.342
Normal-100     2.244     0.934     0.930     0.932     0.931
Normal-1000     8.993     9.020     9.007     9.043     9.052
Html-10     0.270     0.259     0.258     0.258     0.257
Html-100     1.769     1.753     1.765     1.754     1.759
Html-1000     17.313     17.479     17.347     17.266     17.246

Naive Implementation

Normal-10    0.111    0.091    0.086     0.084     0.088
Normal-100    0.636     0.627     0.626     0.626     0.627
Normal-1000     5.740     5.755     5.721     5.728     5.720
Html-10     0.145     0.138     0.138     0.138     0.138
Html-100     0.899     0.901     0.896     0.901     0.900
Html-1000     8.249     8.288     8.272     8.262     8.284

I'll post my own best attempt at optimization as an answer. So, my question is, can you do better? What is the fastest possible method for escaping HTML?

Here's my best attempt at optimizing it. I optimized for what I hope is the common case of plain text strings, but I wasn't able to make it much better for strings with HTML entities.

    public static String escapeHtml(String value) {
        if (value == null) return null;

        int length = value.length();
        String encoded;

        for (int i = 0; i < length; i++) {
            char c = value.charAt(i);

            if (c <= 62 && (encoded = getHtmlEntity(c)) != null) {
                // We found a character to encode, so we need to start from here and buffer the encoded string.
                StringBuilder sb = new StringBuilder((int) (length * 1.25));
                sb.append(value.substring(0, i));
                sb.append(encoded);

                i++;

                for (; i < length; i++) {
                    c = value.charAt(i);

                    if (c <= 62 && (encoded = getHtmlEntity(c)) != null) {
                        sb.append(encoded);
                    } else {
                        sb.append(c);
                    }
                }

                value = sb.toString();
                break;
            }
        }

        return value;
    }

    private static String getHtmlEntity(char c) {
        switch (c) {
            case '&': return "&amp;";
            case '\'': return "&#39;";
            case '"': return "&quot;";
            case '<': return "&lt;";
            case '>': return "&gt;";
            default: return null;
        }
    }


Normal-10     0.021     0.023     0.011     0.012     0.011
Normal-100     0.074     0.074     0.073     0.074     0.074
Normal-1000     0.671     0.678     0.675     0.675     0.680
Html-10     0.222     0.152     0.153     0.153     0.154
Html-100     0.739     0.715     0.718     0.724     0.706
Html-1000     6.812     6.828     6.802     6.802     6.806

I assumed since it is so widely used it would be reasonably fast, but even my most naive implementation was significantly faster.

If you look at the source code of the Apache version ( for example ) you will see that it is dealing with a number of cases that your version ignores:

  • it encodes all of the entities defined in HTML 4 (and others added by the application), not just a hard-wired minimal subset
  • it encodes all characters greater or equal to 0x7F.

In short, it is slower because it is more general.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM