
Ignoring XML errors in Python

Original 2008-12-30 10:48:33 4 3 python/ xml/ minidom

I am using minidom (xml.dom.minidom) in Python, but any error in the XML kills the parser. Is it possible to ignore such errors, as a browser does? I am trying to write a browser in Python, but the parser just throws an exception if the tags aren't fully compatible.
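The failure described in the question can be reproduced with a short sketch. It assumes Python 3; the helper name `try_parse` and the sample strings are made up for illustration and do not come from the original post. minidom hands the text to expat, which stops at the first well-formedness error instead of recovering from it.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def try_parse(markup):
    """Return a minidom Document, or None if the markup is not well-formed.

    Hypothetical helper for illustration only.
    """
    try:
        return minidom.parseString(markup)
    except ExpatError as err:
        # expat reports the first error it meets and gives up
        print(f"parse failed: {err}")
        return None


WELL_FORMED = "<p>Hello <b>world</b></p>"
TAG_SOUP = "<p>Hello <b>world</p>"  # <b> is never closed

for sample in (WELL_FORMED, TAG_SOUP):
    doc = try_parse(sample)
    print(sample, "->", "parsed" if doc is not None else "rejected")
```

Catching the exception keeps the program alive, but it does not recover any of the document's content, which is why the answers below point to a more tolerant parser.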

3 answers

There is a library called BeautifulSoup; I think it's what you're looking for. Because you're trying to parse invalid XML, a normal XML parser won't work. BeautifulSoup is more fault-tolerant and can still extract information from invalid XML.

Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping. Three features make it powerful:

Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away.

Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don't have to create a custom parser for each application.

Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't autodetect one. Then you just have to specify the original encoding.

Beautiful Soup parses anything you give it, and does the tree traversal stuff for you. You can tell it "Find all the links", or "Find all the links of class externalLink", or "Find all the links whose urls match 'foo.com'", or "Find the table heading that's got bold text, then give me that text."
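The first three queries quoted above might look roughly like this. This is a sketch under assumptions: it uses the current third-party `beautifulsoup4` package (imported as `bs4`) together with the standard library's `html.parser` backend. The answer predates that package, so the calls may differ from the version its author had in mind. The function name `find_links` and the sample markup are invented for illustration.

```python
import re


def find_links(markup):
    """Run three link queries against possibly broken markup.

    Hypothetical helper; requires the third-party package beautifulsoup4.
    """
    # Imported inside the function so this file still loads without bs4
    from bs4 import BeautifulSoup

    # "html.parser" selects the standard-library backend, so no extra parser is needed
    soup = BeautifulSoup(markup, "html.parser")

    all_links = soup.find_all("a")                                # "Find all the links"
    external = soup.find_all("a", class_="externalLink")          # links of class externalLink
    foo_links = soup.find_all("a", href=re.compile(r"foo\.com"))  # urls matching foo.com

    return {
        "all": [a.get("href") for a in all_links],
        "external": [a.get("href") for a in external],
        "foo": [a.get("href") for a in foo_links],
    }


# Deliberately broken: <p>, the second <a>, and <b> are never closed
BROKEN_MARKUP = (
    '<p>Unclosed paragraph '
    '<a class="externalLink" href="http://foo.com/page">foo</a> '
    '<a href="http://bar.org">bar<b>bold'
)
```

Calling `find_links(BROKEN_MARKUP)` returns the collected `href` values instead of raising, even though the same string would be rejected by an XML parser.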

It should be noted that while HTML looks like XML, it is not XML. XHTML is an XML form of HTML.
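One way to see the difference is to feed the same fragment, once as HTML and once as XHTML, to an XML parser and to an HTML parser. The sketch below uses only the Python 3 standard library; `TagCollector`, `is_well_formed_xml`, `html_start_tags`, and the two snippets are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser
from xml.dom import minidom
from xml.parsers.expat import ExpatError


class TagCollector(HTMLParser):
    """Record every start tag the HTML parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def is_well_formed_xml(markup):
    """True if minidom (an XML parser) accepts the markup."""
    try:
        minidom.parseString(markup)
        return True
    except ExpatError:
        return False


def html_start_tags(markup):
    """Start tags seen by the lenient HTML parser."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


HTML_SNIPPET = "<p>First line<br>Second line</p>"    # <br> has no end tag, which is normal in HTML
XHTML_SNIPPET = "<p>First line<br/>Second line</p>"  # self-closed, as XML requires

for snippet in (HTML_SNIPPET, XHTML_SNIPPET):
    print(snippet, is_well_formed_xml(snippet), html_start_tags(snippet))
```

The HTML parser reports the same tags for both snippets, while the XML parser accepts only the XHTML form.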

For example, see extract-text-from-html-file-using-python for suggestions on ways to parse HTML in Python.