I have data that looks like it is part of an HTML document. However there are some bugs in it like
<td class= foo"bar">
on which all the parsers I tried (lxml, xml.etree) fail with an error.
Since I don't actually care about this specific part of the document I am looking for a more robust parser.
Something where I can allow errors in specific subtrees to be ignored and maybe just not insert the nodes or something that will only lazily parse the parts of the tree I am traversing for example.
You are using XML parsers. XML is a strict language, while the HTML standard requires parsers to be tolerant of errors.
Use a compliant HTML parser like lxml.html
, or html5lib
, or the wrapper library BeautifulSoup (which uses either of the previous with a cleaner API). html5lib
is slower but closely mimics how a modern browser would treat errors.
Use lxml:
Create a HTML parser with the recover
set to True:
parser = etree.HTMLParser(recover=True)
tree = etree.parse(StringIO(broken_html), parser)
See the tutorial Parsing XML and HTML with lxml .
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.