
How do I parse HTML-like data with errors?

I have data that looks like part of an HTML document. However, it contains some errors, such as

<td class= foo"bar">

on which all the parsers I tried (lxml, xml.etree) fail with an error.

Since I don't actually care about this specific part of the document, I am looking for a more robust parser.

Ideally it would let me ignore errors in specific subtrees, perhaps by simply not inserting the affected nodes, or it would lazily parse only the parts of the tree I am traversing.

You are using XML parsers. XML is a strict language, while the HTML standard requires parsers to be tolerant of errors.
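The difference can be seen on the snippet from the question. The following is a minimal sketch using only the Python standard library: xml.etree stands for the strict XML side, and html.parser stands in as a tolerant HTML parser (it is not one of the parsers recommended in this answer). The surrounding table markup, the cell text, and the TagCollector helper are assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Fragment modelled on the question's broken attribute; the surrounding
# table markup and the cell text are made up for illustration.
broken_html = '<table><tr><td class= foo"bar">cell</td></tr></table>'

# A strict XML parser rejects the malformed attribute outright.
try:
    ET.fromstring(broken_html)
    xml_parsed = True
except ET.ParseError:
    xml_parsed = False


class TagCollector(HTMLParser):
    """Hypothetical helper: records start tags and text as they are seen."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


# A tolerant HTML parser keeps going and still reports the elements.
collector = TagCollector()
collector.feed(broken_html)
collector.close()

print("XML parse succeeded:", xml_parsed)
print("HTML tags seen:", collector.tags)
print("HTML text seen:", collector.text)
```

The XML parser stops at the first malformed token, while the HTML parser still reports the td element and its text, which is usually enough when the broken attribute itself is not of interest.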

Use a compliant HTML parser such as lxml.html or html5lib, or the wrapper library BeautifulSoup (which can use either of them behind a cleaner API). html5lib is slower but closely mimics how a modern browser treats errors.
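As a rough sketch of the lxml.html route, assuming lxml is installed: the fragment below reuses the broken attribute from the question, while the surrounding table markup and the cell text are invented for illustration.

```python
import lxml.html

# Fragment modelled on the question's broken attribute; the table markup
# and the cell text are made up for illustration.
broken_html = '<table><tr><td class= foo"bar">cell</td></tr></table>'

# lxml.html uses a tolerant HTML parser, so the malformed attribute
# does not abort parsing.
root = lxml.html.fromstring(broken_html)

# Walk the recovered tree and collect the text of every td element.
cells = [td.text_content() for td in root.iter("td")]
print(cells)
```

How the malformed class attribute itself is recovered is up to the parser, so code that depends on its value should check what was actually produced.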

Use lxml:

Create an HTML parser with recover set to True:

from io import StringIO
from lxml import etree
parser = etree.HTMLParser(recover=True)
tree = etree.parse(StringIO(broken_html), parser)

See the tutorial "Parsing XML and HTML with lxml".
