How do I parse HTML-like data that contains errors?

Question

I have data that looks like part of an HTML document. However, it contains some errors, such as

<td class= foo"bar">

on which all the parsers I tried (lxml, xml.etree) fail with an error.

Since I don't actually care about this specific part of the document, I am looking for a more robust parser.

Ideally, something that lets me ignore errors in specific subtrees (perhaps by simply not inserting those nodes), or something that only lazily parses the parts of the tree I am traversing.

Answer 1

You are using XML parsers. XML is a strict language, while the HTML standard requires parsers to be tolerant of errors.

Use a compliant HTML parser such as lxml.html or html5lib, or the wrapper library BeautifulSoup (which uses either of those behind a cleaner API). html5lib is slower but closely mimics how a modern browser would treat errors.
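To illustrate the strict-versus-tolerant difference, here is a minimal sketch using only the Python standard library. It swaps in the stdlib html.parser module (one of the backends BeautifulSoup can use) as a stand-in for lxml.html or html5lib, so it is not the exact setup recommended above. The sample markup in `broken` and the `TagTextCollector` class are made up for this illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample containing the malformed attribute from the question.
broken = '<table><tr><td class= foo"bar">cell</td><td>ok</td></tr></table>'

# A strict XML parser rejects the unquoted, malformed attribute value.
try:
    ET.fromstring(broken)
    xml_ok = True
except ET.ParseError:
    xml_ok = False


class TagTextCollector(HTMLParser):
    """Records the start tags and non-blank text the tolerant parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


# The tolerant HTML parser keeps going past the bad attribute.
collector = TagTextCollector()
collector.feed(broken)
collector.close()

print("XML parse succeeded:", xml_ok)
print("HTML tags seen:", collector.tags)
print("HTML text seen:", collector.text)
```

The XML parse fails on the bad attribute, while the HTML parser still reports both cells. html.parser is event-based and does not build a tree, so for actual tree traversal one of the libraries named above is probably the better fit.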

Answer 2

Use lxml. Create an HTML parser with recover set to True:

from io import StringIO
from lxml import etree
parser = etree.HTMLParser(recover=True)
tree = etree.parse(StringIO(broken_html), parser)

See the tutorial "Parsing XML and HTML with lxml".
