Handling of errors while parsing HTML

Question

For various reasons that are beyond the scope of this question, I am using an ad hoc HTML parsing class written in Python. So far this simple class has been sufficient for the kind of input it was fed, but it recently had to parse this page: http://forum.macbidouille.com/index.php?showtopic=160607

This webpage is obviously generated automatically by some PHP code, but it contains user-generated HTML that is included verbatim as the signature of each post. Most notably, http://forum.macbidouille.com/index.php?showtopic=160607#entry1563022 contains the following HTML (comments removed and tags indented for clarity):

<div class="signature">
  <span style="font-family:Verdana">
    <span style="color:#8B0000">
      <span style="font-size:12pt;line-height:100%">
        <div align='center'>La Culture coûte cher, mais l&#39;inculture coûte encore plus cher à la Société. <br />
          <span style="font-size:8pt;line-height:100%"><i>Marcel Landowsky</i></span>
      </span><br />
        </div>
      </span>
    </span>
  <div align='left'><br />macbook unibody 10.6.8 - 2.26ghz - 4Go- 250Go - <br />Je n&#39;ai pas de télévision &#33;</div>
</div>

As should be obvious from the above, there is a stray tag that is closed too early, i.e. the HTML is invalid. Nothing extraordinary, but it is enough to make my parsing code fail. So far, that parsing code has had a very simple error handling strategy: it tries to match each closing tag with the currently open tag, and if the closing tag does not match, it is ignored.

In the case of the above code, this results in ignoring </span> on line 7 because it does not match the currently open <div> tag from line 5, and then ignoring </div> on the last line because it does not match the currently open <span> tag from line 2. The result is that all the HTML that follows this block is assumed to be hierarchically included within the first <div> tag, which leads to other problems later.
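The behaviour described above can be reproduced with a minimal sketch. This is not the actual parsing class from the question; `NaiveParser`, `VOID_TAGS` and `SIGNATURE` are hypothetical names, the standard library's `html.parser.HTMLParser` is used only as a tokenizer, and the signature text is shortened while keeping the same tag structure.

```python
from html.parser import HTMLParser

# Tags that never take a closing tag (partial list, enough for this example).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}

# Same tag structure as the signature above, with the text shortened.
SIGNATURE = """\
<div class="signature">
  <span style="font-family:Verdana">
    <span style="color:#8B0000">
      <span style="font-size:12pt;line-height:100%">
        <div align='center'>quote<br />
          <span style="font-size:8pt;line-height:100%"><i>author</i></span>
      </span><br />
        </div>
      </span>
    </span>
  <div align='left'><br />text</div>
</div>
"""


class NaiveParser(HTMLParser):
    """Ignores any closing tag that does not match the innermost open tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open tag names
        self.ignored = []    # closing tags that were thrown away

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return  # e.g. the implicit end of <br />
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.ignored.append(tag)


parser = NaiveParser()
parser.feed(SIGNATURE)
parser.close()
print(parser.ignored)    # ['span', 'div']
print(parser.open_tags)  # ['div', 'span']
```

After the whole block has been fed, two tags are still on the stack, so anything parsed afterwards ends up nested inside them.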

What I would like to achieve is to 'synchronize' the parsing state better, and I wonder what kind of simple approach would lead to a parser that can handle this block of HTML. I can see how I could try to minimize the number of discarded closing tags by rearranging the generated tree once parsing is complete, but I am looking for a simpler solution.

I know that the first answer will be "use library X", and this is likely what I am going to end up doing, but I am curious about what kinds of interesting parsing and error handling strategies could be used in this case. That is, I am trying to get educated :)

Thanks!

Answer 1

Your best bet is to parse (and fix) the user-supplied HTML first; otherwise you may end up with all kinds of corruption of the original DOM structure. First, check the user HTML for tag nesting and sanitize it (i.e. the </span> that has no corresponding start tag should be removed). If you have an HTML-only parser, enclose the user HTML in <div>..</div> before parsing; this should do the trick.
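One possible way to do such a nesting check is sketched below, under a few assumptions. It is only one recovery strategy among several: a closing tag with no matching open tag is dropped, a closing tag that matches an outer open tag implicitly closes everything nested inside it, and anything still open at the end is closed. `NestingFixer`, `fix_nesting` and `VOID_TAGS` are hypothetical names; only `html.parser.HTMLParser` and `html.escape` come from the Python standard library. Comments and doctype declarations are not preserved by this sketch.

```python
from html import escape
from html.parser import HTMLParser

# Tags that never take a closing tag (partial list, for illustration).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class NestingFixer(HTMLParser):
    """Re-emits HTML with balanced tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open tag names
        self.out = []        # pieces of the sanitized output

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br />: emit as-is, nothing to track.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # no corresponding start tag: drop the closing tag
        # Close every tag opened after the matching one, then the match.
        while True:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        # Character references were decoded by the parser; re-escape them.
        self.out.append(escape(data, quote=False))

    def close(self):
        super().close()  # flush any buffered text first
        while self.open_tags:
            self.out.append(f"</{self.open_tags.pop()}>")


def fix_nesting(html):
    fixer = NestingFixer()
    fixer.feed(html)
    fixer.close()
    return "".join(fixer.out)


print(fix_nesting("<div><span><div>x</span></div></span><p>y"))
# <div><span><div>x</div></span></div><p>y</p>
```

Run on each signature separately before it is merged into the page, this keeps a broken signature from affecting the nesting of the surrounding document, at the cost of sometimes closing an element earlier than its author intended.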
