Stripping out malformed HTML from string

Question

Sometimes from a 3rd party API I get malformed HTML elements returned:

olor:red">Text</span>

when I expect:

<span style="color:red">Text</span>

For my context, the text content of the HTML is more important so it does not matter if I lose surrounding tags/formatting.

What would be the best way to strip out the malformed tags such that the first example would read

Text

and the second would not change?

Answer 1

I recommend taking a look at HtmlAgilityPack, which is also a very handy tool for HTML sanitization.

Here's an example approach using that library:

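using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var inputs = new[] {
            @"olor:red"">Text</span>",
            @"<span style=""color:red"">Text</span>",
            @"Text</span>",
            @"<span style=""color:red"">Text",
            @"<span style=""color:red"">Text"
        };
        var doc = new HtmlDocument();
        foreach (var input in inputs)
        {
            var html = input;
            if (!html.StartsWith("<"))
            {
                // The first ">" is not the last character, so a truncated
                // opening tag precedes the text: restore its "<".
                if (html.IndexOf(">") != html.Length - 1)
                    html = "<" + html;
                // Otherwise only an end tag remains: keep the text before its "<".
                else
                    html = html.Substring(0, html.IndexOf("<"));
                doc.LoadHtml(html);
                Console.WriteLine(doc.DocumentNode.InnerText);
            }
            else
            {
                // Starts with a tag: let the parser repair any missing end tag.
                doc.LoadHtml(html);
                Console.WriteLine(doc.DocumentNode.OuterHtml);
            }
        }
    }
}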

Outputs:

Text
<span style="color:red">Text</span>
Text
<span style="color:red">Text</span>
<span style="color:red">Text</span>

Answer 2

Very crudely, you could strip out all 'tags' by discarding everything up to and including a > and keeping everything before a <.

I'm assuming you also need to consider the situation where the text you receive has no tags at all, e.g. Text.

In pseudo-code:

returnText = ""

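loop:
    gtI = text.IndexOf(">")
    ltI = text.IndexOf("<")
    if gtI == -1 and ltI == -1:
        // no tag characters left: keep the remaining text
        returnText += text
        return returnText
    if gtI == -1:
        // only a "<" remains: keep the text before it
        returnText += text up to position ltI
        return returnText
    if ltI == -1:
        // only a ">" remains: keep the text after it
        returnText += text after gtI
        return returnText
    if ltI < gtI:
        // text followed by a tag opening: keep the text, skip past the "<"
        returnText += text before ltI
        text = text after ltI
        continue loop
    // gtI < ltI: drop everything up to and including the ">"
    text = text after gtI
    continue loop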

It's crude and can be done much better (and faster) with a custom coded parser, but essentially the logic would be the same.
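As a rough illustration, the loop above could be written in C# along these lines. This is only a sketch of the same crude logic: the names TagStripper and StripTags are made up for the example, and like the pseudo-code it treats every < and > as part of a tag, so literal angle brackets in the text will be mangled.

```csharp
using System;
using System.Text;

static class TagStripper
{
    // Hypothetical helper mirroring the pseudo-code: drops anything that
    // looks like (part of) a tag and keeps the text in between.
    public static string StripTags(string text)
    {
        var result = new StringBuilder();
        while (true)
        {
            int gt = text.IndexOf('>');
            int lt = text.IndexOf('<');

            // No tag characters left: the rest is plain text.
            if (gt == -1 && lt == -1)
                return result.Append(text).ToString();
            // Only a "<" remains: keep the text before it.
            if (gt == -1)
                return result.Append(text.Substring(0, lt)).ToString();
            // Only a ">" remains: keep the text after it.
            if (lt == -1)
                return result.Append(text.Substring(gt + 1)).ToString();

            if (lt < gt)
            {
                // Text followed by a tag opening: keep the text, skip the "<".
                result.Append(text.Substring(0, lt));
                text = text.Substring(lt + 1);
            }
            else
            {
                // A tag ends first: drop everything up to and including the ">".
                text = text.Substring(gt + 1);
            }
        }
    }
}

class Program
{
    static void Main()
    {
        Console.WriteLine(TagStripper.StripTags("olor:red\">Text</span>"));
        Console.WriteLine(TagStripper.StripTags("<span style=\"color:red\">Text</span>"));
        Console.WriteLine(TagStripper.StripTags("Text"));
    }
}
```

All three calls print Text. Note that this strips the tags from well-formed input too, which differs from the behaviour asked for in the question, where a complete element should be left unchanged.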

You should really be asking why the API returns only part of what you require: I can't see why it should be returning ext</span> either, which really messes you up.

Answer 3

If you just need the content of the tags and not what type of tag they are, you could use a regular expression:

var r = new Regex(">([^>]+)<");
var text = "olor:red\">Text</span>";

var m = r.Match(text);

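After this, m.Groups[1].Value holds the inner text (Text in this case). Note that Match returns only the first occurrence; to find the inner text of every tag, use r.Matches(text) instead.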
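To collect the inner text of every tag rather than only the first one, something like the following sketch may help. The class and method names (InnerTextFinder, Find) are invented for the example, and the pattern is tightened slightly to [^<>]+ so that a captured run cannot contain either bracket; that change is an assumption about what is wanted, not part of the answer above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class InnerTextFinder
{
    // Text between a ">" and the next "<", with no angle brackets inside.
    private static readonly Regex Between = new Regex(">([^<>]+)<");

    // Hypothetical helper: returns every captured run of text, in order.
    public static List<string> Find(string html)
    {
        var found = new List<string>();
        foreach (Match m in Between.Matches(html))
            found.Add(m.Groups[1].Value);
        return found;
    }
}

class Program
{
    static void Main()
    {
        foreach (var s in InnerTextFinder.Find("olor:red\">Text</span>"))
            Console.WriteLine(s);
        foreach (var s in InnerTextFinder.Find("<p>One</p><p>Two</p>"))
            Console.WriteLine(s);
    }
}
```

This prints Text, One and Two on separate lines. Because the pattern requires a < after the text, input with a missing end tag such as <span style="color:red">Text yields no match, and plain text without any tags is not matched either, so those cases would need separate handling.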

3 answers

solution1
1 ACCEPTED 2013-11-26 16:17:51

solution2
0 2013-11-26 15:44:43

solution3
0 2013-11-26 15:47:51
