
Python xml.dom and bad XML

I'm trying to extract some data from various HTML pages using a Python program. Unfortunately, some of these pages contain user-entered data which occasionally has "slight" errors, namely tag mismatching.

Is there a good way to have Python's xml.dom try to correct errors or something of the sort? Alternatively, is there a better way to extract data from HTML pages which may contain errors?
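To illustrate why xml.dom struggles here: its underlying Expat parser is a strict XML parser and rejects the whole document at the first mismatched tag, rather than recovering. A minimal sketch (the sample markup is made up):

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError

# A single unclosed <b> is enough to make strict XML parsing fail.
bad = "<html><body><p>First <b>bold paragraph</p></body></html>"

try:
    parseString(bad)
except ExpatError as exc:
    print("strict XML parser rejected the page:", exc)
```

This is why the answers below all reach for lenient HTML parsers instead of the XML toolchain.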

You could use HTML Tidy to clean up, or Beautiful Soup to parse. You may have to save the result to a temporary file, but it should work.

Cheers,
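A minimal Beautiful Soup sketch of the approach above (assuming the `beautifulsoup4` package is installed; the sample markup is made up). Beautiful Soup tolerates the mismatched and unclosed tags that break xml.dom:

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# User-entered HTML with an unclosed <b> and a missing </p>.
broken = "<html><body><p>First <b>bold paragraph<p>Second paragraph</body></html>"

# The stdlib "html.parser" backend needs no extra dependencies.
soup = BeautifulSoup(broken, "html.parser")
print(soup.get_text())
print(soup.b.get_text())
```

Despite the broken markup, all the text is reachable through the usual `find`/`find_all`/`get_text` API.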

I used to use BeautifulSoup for such tasks, but I have since switched to html5lib ( http://code.google.com/p/html5lib/ ), which works well in many cases where BeautifulSoup fails.

Another alternative is " Element Soup " ( http://effbot.org/zone/element-soup.htm ), which is a wrapper for Beautiful Soup using ElementTree.
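A small html5lib sketch (assuming the `html5lib` package is installed; the sample markup is made up). html5lib parses the way a browser would, inserting the implied closing tags the HTML5 spec requires, and by default builds a standard `xml.etree.ElementTree` tree:

```python
import html5lib  # third-party: pip install html5lib

# Two paragraphs, neither closed, plus an unterminated <b>.
broken = "<p>First<p>Second <b>unterminated"

# namespaceHTMLElements=False keeps tag names plain ("p", not "{...}p").
doc = html5lib.parse(broken, namespaceHTMLElements=False)
for p in doc.iter("p"):
    print(p.tag)
```

Because html5lib applies the spec's error-recovery rules, the second `<p>` implicitly closes the first, so the tree contains two separate paragraph elements.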

lxml does a decent job at parsing invalid HTML.

According to their documentation, Beautiful Soup and html5lib sometimes perform better, depending on the input. With lxml you can choose which parser to use, and access them via a unified API.
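A minimal lxml sketch (assuming the `lxml` package is installed; the sample markup is made up). The `lxml.html` module uses libxml2's forgiving HTML parser, which also supplies the missing end tags:

```python
from lxml import html  # third-party: pip install lxml

# Unclosed <p> and <b> tags, and no </html>.
broken = "<html><body><p>First <b>bold<p>Second</body>"

tree = html.fromstring(broken)
for p in tree.findall(".//p"):
    print(p.text_content())
```

The same ElementTree-style API (`findall`, XPath via `xpath()`, `text_content()`) then works exactly as it would on clean markup.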

If Jython is acceptable to you, TagSoup is very good at parsing junk; if it is, I found the JDOM libraries far easier to use than other XML alternatives.

This is a snippet from a demo mock-up that does screen scraping of TfL's journey planner:

import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import org.jdom.Document;
import org.jdom.input.SAXBuilder;

private Document getRoutePage(HashMap params) throws Exception {
        String uri = "http://journeyplanner.tfl.gov.uk/bcl/XSLT_TRIP_REQUEST2";
        HttpWrapper hw = new HttpWrapper();        // author's own HTTP helper
        String page = hw.urlEncPost(uri, params);  // POST the form parameters
        // Use TagSoup's SAX parser so malformed HTML is cleaned into a valid tree.
        SAXBuilder builder = new SAXBuilder("org.ccil.cowan.tagsoup.Parser");
        Reader pageReader = new StringReader(page);
        return builder.build(pageReader);
    }
