Replace HTML escape sequence with its single character equivalent in C

Question

My program loads news articles from the web, so I end up with an array of HTML documents representing those articles. I need to parse them and show only the relevant content on screen. That includes converting all HTML escape sequences into readable symbols, so I need a function similar to unescape in JavaScript.

I know there are C libraries for parsing HTML. But is there an easy way to convert HTML escape sequences like &amp;amp; or &amp;#33; to just & and !?

Answer 1

Just wrote and tested a version that does this (crudely). Didn't take long.

You'll want something like this:

typedef struct  {
    int gotLen; // save myriad calls to strlen()
    char *got;
    char *want;
} trx_t;

trx_t lut[] = {
    { 5, "&amp;", "&" },
    { 5, "&#33;", "!" },
    { 8, "&dagger;", "*" },
};
const int nLut = sizeof lut/sizeof lut[0];

And then a loop with two pointers that copies characters within the same buf, sniffing for the '&' that triggers a search of the replacement table. If found, copy the replacement string to the destination and advance the source pointer to skip past the HTML token. If not found, then the LUT may need additional tokens.

Here's a beginning...

void replace( char *buf ) {
    char *pd = buf, *ps = buf;
    while( *ps )
        if( *ps != '&' )
            *pd++ = *ps++;
        else {
            // EDIT: Credit @Craig Estey
            if( ps[1] == '#' ) {
                if( ps[2] == 'x' || ps[2] == 'X' ) {
                     /* decode hex value and save as char(s) */
                } else {
                     /* decode decimal value and save as char(s) */
                }
                 /* advance pointers and continue */
            }
            for( int i = 0; i < nLut; i++ ) {
                /* not giving it all away */
                /* handle "found" and "not found" in LUT */
            }
        }
    *pd = '\0';
}

This was the test program

#include <stdio.h>

int main(void) {
    char str[] = "The fox &amp; hound&dagger; went for a walk&#33; & chat.";

    puts( str );
    replace( str );
    puts( str );

    return 0;
}

and this was the output

The fox &amp; hound&dagger; went for a walk&#33; & chat.
The fox & hound* went for a walk! & chat.

The "project" is to write the interesting bit of the code. It's not difficult.

Caveat: This only works in place when each replacement is no longer than the escape sequence it replaces. Otherwise you need two buffers.
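For readers who want to see the skeleton filled in, here is one possible completion. It is a sketch and not the answerer's actual code: the helper numeric_ref, the extra &lt; and &gt; table entries, and the decision to decode numeric references only in the ASCII range (1 to 127) are all assumptions made for illustration. Anything it does not recognise is copied through unchanged.

```c
#include <ctype.h>
#include <string.h>

typedef struct {
    size_t gotLen;      /* length of the escape sequence, precomputed */
    const char *got;    /* escape sequence as it appears in the input */
    const char *want;   /* replacement text, never longer than got */
} trx_t;

static const trx_t lut[] = {
    { 5, "&amp;",    "&" },
    { 5, "&#33;",    "!" },   /* also reachable via the numeric path below */
    { 8, "&dagger;", "*" },   /* ASCII stand-in, as in the table above */
    { 4, "&lt;",     "<" },   /* added for illustration */
    { 4, "&gt;",     ">" },   /* added for illustration */
};
static const size_t nLut = sizeof lut / sizeof lut[0];

/* Hypothetical helper: parse "&#NNN;" or "&#xHH;" starting at ps.
   On success store the value in *cp and return the bytes consumed,
   otherwise return 0. */
static size_t numeric_ref( const char *ps, unsigned long *cp ) {
    int hex = ( ps[2] == 'x' || ps[2] == 'X' );
    const char *p = ps + ( hex ? 3 : 2 );
    const char *first = p;
    unsigned long v = 0;
    while( hex ? isxdigit( (unsigned char)*p ) : isdigit( (unsigned char)*p ) ) {
        int c = (unsigned char)*p;
        int d = isdigit( c ) ? c - '0' : tolower( c ) - 'a' + 10;
        v = v * ( hex ? 16UL : 10UL ) + (unsigned long)d;
        if( v > 0x10FFFF ) return 0;        /* beyond Unicode, give up */
        p++;
    }
    if( p == first || *p != ';' ) return 0; /* no digits or no terminator */
    *cp = v;
    return (size_t)( p - ps ) + 1;
}

void replace( char *buf ) {
    char *pd = buf;             /* write position, never ahead of ps */
    const char *ps = buf;       /* read position */
    while( *ps ) {
        if( *ps == '&' ) {
            unsigned long cp = 0;
            size_t used;
            /* Numeric reference: this sketch only decodes ASCII values. */
            if( ps[1] == '#' && ( used = numeric_ref( ps, &cp ) ) != 0
                && cp >= 1 && cp <= 0x7F ) {
                *pd++ = (char)cp;
                ps += used;
                continue;
            }
            /* Named reference: linear search of the table. */
            size_t i;
            for( i = 0; i < nLut; i++ )
                if( strncmp( ps, lut[i].got, lut[i].gotLen ) == 0 )
                    break;
            if( i < nLut ) {
                size_t n = strlen( lut[i].want );
                memcpy( pd, lut[i].want, n );
                pd += n;
                ps += lut[i].gotLen;
                continue;
            }
        }
        *pd++ = *ps++;          /* ordinary character or unrecognised '&' */
    }
    *pd = '\0';
}
```

Run against the test string from the answer, this produces the same second line of output shown there. A numeric reference outside ASCII, such as &amp;#8224;, is left as it is, because writing multi-byte output in place would need the encoding decision this sketch avoids.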

Answer 2

This is something that you typically wouldn't use C for. I would have used Python. Here are two questions that could be a good start:

What's the easiest way to escape HTML in Python?

How do you call Python code from C code?

But apart from that, the solution is to write a proper parser. There are lots of resources out there on that topic, but basically you could do something like this:

parseFile()
    while not EOF
        ch = readNextCharacter()
        if ch == '\'
            readNextCharacter()
        elseif ch == '&'
            readEscapeSequence()
        else
            output += ch

readEscapeSequence()
    seq = ""
    ch = readNextCharacter()
    while ch != ';' and not EOF
        seq += ch
        ch = readNextCharacter()
    replace = lookupEscape(seq)
    output += replace

Note that this is only pseudocode to get you started.
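As a rough illustration of how the pseudocode might look in C, here is a sketch that writes into a separate output buffer and emits UTF-8. The names unescape_html, parse_number and utf8_encode, and the tiny entity table, are hypothetical and chosen for this example only; a real table would be far larger. The backslash branch is left out because HTML text has no backslash escapes, and unknown sequences are copied through unchanged rather than dropped.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, deliberately tiny entity table (name without '&' and ';'). */
static const struct { const char *name; const char *utf8; } entities[] = {
    { "amp", "&" }, { "lt", "<" }, { "gt", ">" }, { "quot", "\"" },
    { "apos", "'" }, { "nbsp", "\xC2\xA0" }, { "dagger", "\xE2\x80\xA0" },
};

/* Parse the digits in [p, end) in the given base; 0 means "invalid". */
static unsigned long parse_number( const char *p, const char *end, int base ) {
    unsigned long v = 0;
    for( ; p < end; p++ ) {
        int c = (unsigned char)*p, d;
        if( isdigit( c ) ) d = c - '0';
        else if( base == 16 && isxdigit( c ) ) d = tolower( c ) - 'a' + 10;
        else return 0;
        v = v * (unsigned long)base + (unsigned long)d;
        if( v > 0x10FFFF ) return 0;
    }
    return v;
}

/* Encode one code point as UTF-8; returns bytes written, 0 if invalid. */
static size_t utf8_encode( unsigned long cp, char *out ) {
    if( cp == 0 || cp > 0x10FFFF || ( cp >= 0xD800 && cp <= 0xDFFF ) ) return 0;
    if( cp < 0x80 ) { out[0] = (char)cp; return 1; }
    if( cp < 0x800 ) {
        out[0] = (char)( 0xC0 | ( cp >> 6 ) );
        out[1] = (char)( 0x80 | ( cp & 0x3F ) );
        return 2;
    }
    if( cp < 0x10000 ) {
        out[0] = (char)( 0xE0 | ( cp >> 12 ) );
        out[1] = (char)( 0x80 | ( ( cp >> 6 ) & 0x3F ) );
        out[2] = (char)( 0x80 | ( cp & 0x3F ) );
        return 3;
    }
    out[0] = (char)( 0xF0 | ( cp >> 18 ) );
    out[1] = (char)( 0x80 | ( ( cp >> 12 ) & 0x3F ) );
    out[2] = (char)( 0x80 | ( ( cp >> 6 ) & 0x3F ) );
    out[3] = (char)( 0x80 | ( cp & 0x3F ) );
    return 4;
}

/* Returns a malloc'd copy of src with recognised escapes decoded, or NULL
   if allocation fails. The caller frees the result. */
char *unescape_html( const char *src ) {
    /* With the table above, no decoded sequence is longer than its escape,
       so strlen(src) + 1 bytes suffice. A full entity table would need a
       growable buffer instead. */
    char *out = malloc( strlen( src ) + 1 );
    if( !out ) return NULL;
    char *pd = out;
    while( *src ) {
        const char *semi = ( *src == '&' ) ? strchr( src, ';' ) : NULL;
        size_t n = 0;                   /* bytes produced for this escape */
        if( semi && src[1] == '#' ) {   /* numeric: &#NNN; or &#xHH; */
            int hex = ( src[2] == 'x' || src[2] == 'X' );
            unsigned long cp = parse_number( src + ( hex ? 3 : 2 ), semi,
                                             hex ? 16 : 10 );
            n = utf8_encode( cp, pd );
        } else if( semi ) {             /* named: look up text between & and ; */
            size_t len = (size_t)( semi - src ) - 1;
            for( size_t i = 0; i < sizeof entities / sizeof entities[0]; i++ )
                if( strlen( entities[i].name ) == len
                    && memcmp( src + 1, entities[i].name, len ) == 0 ) {
                    n = strlen( entities[i].utf8 );
                    memcpy( pd, entities[i].utf8, n );
                    break;
                }
        }
        if( n ) { pd += n; src = semi + 1; }  /* recognised: skip past ';' */
        else    *pd++ = *src++;               /* plain char or unknown escape */
    }
    *pd = '\0';
    return out;
}
```

Calling unescape_html on a string and freeing the result afterwards is all the usage involves. Allocating a second buffer sidesteps the in-place length restriction, at the cost of one allocation per document.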

solution1
0 2022-09-09 22:28:40

solution2
0 2022-09-10 08:12:08
