Parsing a large XML file with Python: etree.parse error

Question

I am trying to parse the following XML file with the lxml.etree.iterparse function.

"sampleoutput.xml"

<item>
  <title>Item 1</title>
  <desc>Description 1</desc>
</item>
<item>
  <title>Item 2</title>
  <desc>Description 2</desc>
</item>

I tried the code from "Parsing Large XML files with Python lxml and Iterparse".

Before the etree.iterparse(MYFILE) call, I did MYFILE = open("/Users/eric/Desktop/wikipedia_map/sampleoutput.xml", "r")

But it fails with the following error:

Traceback (most recent call last):
  File "/Users/eric/Documents/Programming/Eclipse_Workspace/wikipedia_mapper/testscraper.py", line 6, in <module>
    for event, elem in context :
  File "iterparse.pxi", line 491, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:98565)
  File "iterparse.pxi", line 543, in lxml.etree.iterparse._read_more_events (src/lxml/lxml.etree.c:99086)
  File "parser.pxi", line 590, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74712)
lxml.etree.XMLSyntaxError: Extra content at the end of the document, line 5, column 1

Any ideas? Thanks!

Answer 1

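The problem is that XML is not well-formed unless it has exactly one top-level tag. You can fix your sample by wrapping the whole document in <items></items> tags. You also need to rename the <desc> tags to <description> so that they match the name your query is looking for (description).

With your existing code, the following document produces the correct result: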

<items>
  <item>
    <title>Item 1</title>
    <description>Description 1</description>
  </item>
  <item>
    <title>Item 2</title>
    <description>Description 2</description>
  </item>
</items>
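If editing a large file by hand is not practical, the same wrapping can probably be done at parse time. What follows is only a minimal sketch, not the code from the question: it uses the standard library's xml.etree.ElementTree.XMLPullParser instead of lxml, the name iter_items and the chunk size are made up for illustration, and it assumes the file has no XML declaration and that its elements are still called <desc>.

```python
import io
import xml.etree.ElementTree as ET

# The root-less sample from the question, kept in memory for illustration.
SAMPLE = """<item>
  <title>Item 1</title>
  <desc>Description 1</desc>
</item>
<item>
  <title>Item 2</title>
  <desc>Description 2</desc>
</item>
"""


def iter_items(stream, chunk_size=65536):
    """Yield each <item> from a text stream that lacks a single root element.

    iter_items is a hypothetical helper. It feeds a synthetic <items> root
    around the real data, so the file itself is never modified.
    """
    parser = ET.XMLPullParser(events=("end",))

    def chunks():
        yield "<items>"               # synthetic opening root tag
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            yield chunk
        yield "</items>"              # synthetic closing root tag

    for chunk in chunks():
        parser.feed(chunk)
        for _event, elem in parser.read_events():
            if elem.tag == "item":
                yield elem
                # Drop the children once the caller is done with this item.
                elem.clear()
    parser.close()


# Read the fields inside the loop, because each item is cleared afterwards.
for item in iter_items(io.StringIO(SAMPLE)):
    print(item.findtext("title"), "|", item.findtext("desc"))
```

For a real file, the io.StringIO(SAMPLE) argument would be replaced by a file object opened in text mode.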

Answer 2

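As far as I know, xml.etree.ElementTree normally expects an XML file to contain a single "root" element, i.e. one XML tag that encloses the complete document structure. Judging from the error message you posted, I think that is the problem here as well:

"line 5" refers to the second <item> tag, so I guess Python is complaining that there is more data after the assumed root element (i.e. the first <item> tag) has been closed.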
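A quick way to check this reading is to parse the sample and look at where the parser stops. This is a small sketch using the standard library's xml.etree.ElementTree rather than lxml, so the wording of the error message will differ from the traceback in the question; first_error_position is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# The sample from the question: two top-level <item> elements, no root.
ROOTLESS = """<item>
  <title>Item 1</title>
  <desc>Description 1</desc>
</item>
<item>
  <title>Item 2</title>
  <desc>Description 2</desc>
</item>
"""


def first_error_position(xml_text):
    """Return (line, column) of the first parse error, or None if well-formed."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return exc.position
    return None


# The parser stops where the second top-level <item> starts.
line, _column = first_error_position(ROOTLESS)
print("parse error on line", line)

# Wrapping everything in one root element makes the error go away.
print(first_error_position("<items>" + ROOTLESS + "</items>"))
```

The reported line should match the "line 5" in the lxml message, which is where the second <item> begins.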

Answer 1: score 11, accepted, 2012-07-09 05:01:29

Answer 2: score 5, 2012-07-09 04:39:49
