简体   繁体   中英

Most efficient way to escape XML/HTML in C++ string?

I can't believe this question hasn't been asked before. I have a string that needs to be inserted into an HTML file but it may contain special HTML characters. I want to replace these with the appropriate HTML representation.

The code below works but is pretty verbose and ugly. Performance is not critical for my application but I guess there are scalability problems here also. How can I improve this? I guess this is a job for STL algorithms or some esoteric Boost function, but the code below is the best I can come up with myself.

void escape(std::string *data)
{
    std::string::size_type pos = 0;
    for (;;)
    {
        pos = data->find_first_of("\"&<>", pos);
        if (pos == std::string::npos) break;
        std::string replacement;
        switch ((*data)[pos])
        {
        case '\"': replacement = "&quot;"; break;   
        case '&':  replacement = "&amp;";  break;   
        case '<':  replacement = "&lt;";   break;   
        case '>':  replacement = "&gt;";   break;   
        default: ;
        }
        data->replace(pos, 1, replacement);
        pos += replacement.size();
    };
}

Instead of just replacing in the original string, you can do copying with on-the-fly replacement which avoids having to move characters in the string. This will have much better complexity and cache behavior, so I'd expect a huge improvement. Or you can use boost::spirit::xml encode or http://code.google.com/p/pugixml/ .

void encode(std::string& data) {
    std::string buffer;
    buffer.reserve(data.size());
    for(size_t pos = 0; pos != data.size(); ++pos) {
        switch(data[pos]) {
            case '&':  buffer.append("&amp;");       break;
            case '\"': buffer.append("&quot;");      break;
            case '\'': buffer.append("&apos;");      break;
            case '<':  buffer.append("&lt;");        break;
            case '>':  buffer.append("&gt;");        break;
            default:   buffer.append(&data[pos], 1); break;
        }
    }
    data.swap(buffer);
}

EDIT: A small improvement can be achieved by using an heuristic to determine the size of the buffer. Replace the buffer.reserve line with data.size()*1.1 (10%) or something similar depending of how much replacements are expected.

void escape(std::string *data)
{
    using boost::algorithm::replace_all;
    replace_all(*data, "&",  "&amp;");
    replace_all(*data, "\"", "&quot;");
    replace_all(*data, "\'", "&apos;");
    replace_all(*data, "<",  "&lt;");
    replace_all(*data, ">",  "&gt;");
}

Could win the prize for least verbose?

Here is a simple ~30 line C program that does the trick in a rather good manner. Here I am assuming that the temp_str will have allocated memory enough to have the additional escaped characters.

void toExpatEscape(char *temp_str)
{
    const char cEscapeChars[6]={'&','\'','\"','>','<','\0'};
    const char * const pEscapedSeqTable[] =
    {
        "&amp;",
        "&apos;",
        "&quot;",
        "&gt;",
        "&lt;",
    };
    unsigned int i, j, k, nRef = 0, nEscapeCharsLen = strlen(cEscapeChars), str_len = strlen(temp_str);
    int nShifts = 0; 

    for (i=0; i<str_len; i++)
    {
        for(nRef=0; nRef<nEscapeCharsLen; nRef++)
        {
            if(temp_str[i] == cEscapeChars[nRef])
            {
                if((nShifts = strlen(pEscapedSeqTable[nRef]) - 1) > 0)
                {
                    memmove(temp_str+i+nShifts, temp_str+i, str_len-i+nShifts); 
                    for(j=i,k=0; j<=i+nShifts,k<=nShifts; j++,k++)
                        temp_str[j] = pEscapedSeqTable[nRef][k];
                    str_len += nShifts;
                }
            }
        }  
    }
    temp_str[str_len] = '\0';
}

My tests showed this answer gave the best performance from offered (not surprising it has the most rate).
I've implemented same algorithm for my project (I really want good performance & memory usage) - my tests showed my implementation has ~2.6-3.25 better speed performace. Also I don't like previous best offered algorithm bcs of bad memory usage - you will have extra memory usage as when apply 1.1 multiplier 'heuristic', as when .append() lead to resize.
So, leave my code here - maybe somebody find it useful.

HtmlPreprocess.h:

#ifndef _HTML_PREPROCESS_H_
#define _HTML_PREPROCESS_H_

#include <string>

class HtmlPreprocess
{
public:
    HtmlPreprocess();
    ~HtmlPreprocess();

    static void htmlspecialchars(
        const std::string & in,
        std::string & out
        );
};

#endif // _HTML_PREPROCESS_H_

HtmlPreprocess.cpp:

#include "HtmlPreprocess.h"


HtmlPreprocess::HtmlPreprocess()
{
}


HtmlPreprocess::~HtmlPreprocess()
{
}


const unsigned char map_char_to_final_size[] = 
{
   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
   1,   1,   6,   1,   1,   1,   5,   6,   1,   1,   1,   1,   1,   1,   1,   1,
   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   4,   1,   4,   1,
   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1
};


const unsigned char map_char_to_index[] = 
{
   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,
   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,
   0xFF,   0xFF,   2,      0xFF,   0xFF,   0xFF,   0,      1,      0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,
   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   4,      0xFF,   3,      0xFF,
   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,
   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,
   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,
   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,
   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,
   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,
   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,
   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,
   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,
   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,
   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,
   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF,   0xFF
};


void HtmlPreprocess::htmlspecialchars(
    const std::string & in,
    std::string & out
    )
{
    const char * lp_in_stored = &in[0];
    size_t in_size = in.size();

    const char * lp_in = lp_in_stored;
    size_t final_size = 0;
    for (size_t i = 0; i < in_size; i++)
        final_size += map_char_to_final_size[*lp_in++];

    out.resize(final_size);

    lp_in = lp_in_stored;
    char * lp_out = &out[0];

    for (size_t i = 0; i < in_size; i++)
    {
        char current_char = *lp_in++;
        unsigned char next_action = map_char_to_index[current_char];

        switch (next_action){
        case 0:
            *lp_out++ = '&';
            *lp_out++ = 'a';
            *lp_out++ = 'm';
            *lp_out++ = 'p';
            *lp_out++ = ';';
            break;
        case 1:
            *lp_out++ = '&';
            *lp_out++ = 'a';
            *lp_out++ = 'p';
            *lp_out++ = 'o';
            *lp_out++ = 's';
            *lp_out++ = ';';
            break;
        case 2:
            *lp_out++ = '&';
            *lp_out++ = 'q';
            *lp_out++ = 'u';
            *lp_out++ = 'o';
            *lp_out++ = 't';
            *lp_out++ = ';';
            break;
        case 3:
            *lp_out++ = '&';
            *lp_out++ = 'g';
            *lp_out++ = 't';
            *lp_out++ = ';';
            break;
        case 4:
            *lp_out++ = '&';
            *lp_out++ = 'l';
            *lp_out++ = 't';
            *lp_out++ = ';';
            break;
        default:
            *lp_out++ = current_char;
        }
    }
}

If you're going for processing speed, then it seems to me that the best would be to have a second string that you build as you go, copying from the first string to the second string, and then appending the html escapes as you encounter them. Since I assume that the replace method involves first a memory move, followed by a copy into the replaced position, it's going to be very slow for large strings. If you have a second string to build using .append(), it will avoid the memory move.

As far was code "cleanness", I think that's about as pretty as you're going to get. You could create an array of characters and their replacements, and then search the array, but that would probably be slower and not much cleaner anyway.

I'd honestly go with a more generic version using iterators, such that you can "stream" the encoding. Consider the following implementation:

#include <algorithm>

namespace xml {

    // Helper for null-terminated ASCII strings (no end of string iterator).
    template<typename InIter, typename OutIter>
    OutIter copy_asciiz ( InIter begin, OutIter out )
    {
        while ( *begin != '\0' ) {
            *out++ = *begin++;
        }
        return (out);
    }

    // XML escaping in it's general form.  Note that 'out' is expected
    // to an "infinite" sequence.
    template<typename InIter, typename OutIter>
    OutIter escape ( InIter begin, InIter end, OutIter out )
    {
        static const char bad[] = "&<>";
        static const char* rep[] = {"&amp;", "&lt;", "&gt;"};
        static const std::size_t n = sizeof(bad)/sizeof(bad[0]);

        for ( ; (begin != end); ++begin )
        {
            // Find which replacement to use.
            const std::size_t i =
                std::distance(bad, std::find(bad, bad+n, *begin));

            // No need for escaping.
            if ( i == n ) {
                *out++ = *begin;
            }
            // Escape the character.
            else {
                out = copy_asciiz(rep[i], out);
            }
        }
        return (out);
    }

}

Then, you can simplify the average case using a few overloads:

#include <iterator>
#include <string>

namespace xml {

    // Get escaped version of "content".
    std::string escape ( const std::string& content )
    {
        std::string result;
        result.reserve(content.size());
        escape(content.begin(), content.end(), std::back_inserter(result));
        return (result);
    }

    // Escape data on the fly, using "constant" memory.
    void escape ( std::istream& in, std::ostream& out )
    {
        escape(std::istreambuf_iterator<char>(in),
            std::istreambuf_iterator<char>(),
            std::ostreambuf_iterator<char>(out));
    }

}

Finally, test the whole lot:

#include <iostream>

int main ( int, char ** )
{
    std::cout << xml::escape("<foo>bar & qux</foo>") << std::endl;
}

I profiled 3 solutions with Visual Studio 2017. Input were 10 000 000 strings of size 5-20 with a probability of 9,4% that a char needs to be escaped.

  1. Solution from Giovanni Funchal
  2. Solution from HostageBrain
  3. Solution is mine

The result:

  1. needs 1.675 seconds
  2. needs 0.769 seconds
  3. needs 0.368 seconds

In mine Solution, the final size is precalculated and a copy of string data is done, only when needed. So the heap memory allocations should be minimal.

const unsigned char calcFinalSize[] =
{
1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
1,   1,   6,   1,   1,   1,   5,   6,   1,   1,   1,   1,   1,   1,   1,   1,
1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   4,   1,   4,   1,
1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1
};

void escapeXml(std::string & in)
{
    const char* dataIn = in.data();
    size_t sizeIn = in.size();

    const char* dataInCurrent = dataIn;
    const char* dataInEnd = dataIn + sizeIn;
    size_t outSize = 0;
    while (dataInCurrent < dataInEnd)
    {
        outSize += calcFinalSize[static_cast<uint8_t>(*dataInCurrent)];
        dataInCurrent++;
    }


    if (outSize == sizeIn)
    {
        return;
    }
    std::string out;
    out.resize(outSize);

    dataInCurrent = dataIn;
    char* dataOut = &out[0];
    while (dataInCurrent < dataInEnd)
    {
        switch (*dataInCurrent) {
        case '&':
            memcpy(dataOut, "&amp;", sizeof("&amp;") - 1);
            dataOut += sizeof("&amp;") - 1;
            break;
        case '\'':
            memcpy(dataOut, "&apos;", sizeof("&apos;") - 1);
            dataOut += sizeof("&apos;") - 1;
            break;
        case '\"':
            memcpy(dataOut, "&quot;", sizeof("&quot;") - 1);
            dataOut += sizeof("&quot;") - 1;
            break;
        case '>':
            memcpy(dataOut, "&gt;", sizeof("&gt;") - 1);
            dataOut += sizeof("&gt;") - 1;
            break;
        case '<':
            memcpy(dataOut, "&lt;", sizeof("&lt;") - 1);
            dataOut += sizeof("&lt;") - 1;
            break;
        default:
            *dataOut++ = *dataInCurrent;
        }
        dataInCurrent++;
    }
    in.swap(out);
}

Edit: Replaced "&quote;" with "&quot;" . Old solution was overwriting memory, because the look-up table contained a length of 6 for "&quote;" .

You can use the boost::property_tree::xml_parser::encode_char_entities if you don't want to write it yourself.

For reference, here's the code in boost 1.64.0 :

```

template<class Str>
Str encode_char_entities(const Str &s)
{
    // Don't do anything for empty strings.
    if(s.empty()) return s;

    typedef typename Str::value_type Ch;

    Str r;
    // To properly round-trip spaces and not uglify the XML beyond
    // recognition, we have to encode them IF the text contains only spaces.
    Str sp(1, Ch(' '));
    if(s.find_first_not_of(sp) == Str::npos) {
        // The first will suffice.
        r = detail::widen<Str>("&#32;");
        r += Str(s.size() - 1, Ch(' '));
    } else {
        typename Str::const_iterator end = s.end();
        for (typename Str::const_iterator it = s.begin(); it != end; ++it)
        {
            switch (*it)
            {
                case Ch('<'): r += detail::widen<Str>("&lt;"); break;
                case Ch('>'): r += detail::widen<Str>("&gt;"); break;
                case Ch('&'): r += detail::widen<Str>("&amp;"); break;
                case Ch('"'): r += detail::widen<Str>("&quot;"); break;
                case Ch('\''): r += detail::widen<Str>("&apos;"); break;
                default: r += *it; break;
            }
        }
    }
    return r;
}

```

Or with just stl :

 std::string& rep(std::string &s, std::string from, std::string to)
    {
      int pos = -1;
      while ( (pos = s.find(from, pos+1) ) != string::npos)
        s.erase(pos, from.length()).insert(pos, to);

      return s;
    }

Usage:

rep(s, "&", "&quot;");
rep(s, "\"", "&quot;");

or:

rep(s, "HTML","xxxx");

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM