
Parsing Large XML file with Python lxml and Iterparse

I'm attempting to write a parser using lxml and its iterparse method to step through a very large XML file containing many items.

My file is of the format:

<item>
  <title>Item 1</title>
  <desc>Description 1</desc>
  <url>
     <item>http://www.url1.com</item>
  </url>
</item>
<item>
  <title>Item 2</title>
  <desc>Description 2</desc>
  <url>
     <item>http://www.url2.com</item>
  </url>
</item>

and so far my solution is:

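from lxml import etree

# MYFILE is the path to (or a file object for) the XML file
context = etree.iterparse(MYFILE, tag='item')

for event, elem in context:
    # only "end" events are reported by default
    print(elem.xpath('desc/text()'))
    # free the element and any already-processed siblings to keep memory low
    elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]

del context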

When I run it, I get something similar to:

[]
['description1']
[]
['description2']

The empty lists appear because it also pulls out the item tags that are children of the url tag, and they obviously have no description field to extract with xpath. My hope was to parse out each of the items one by one and then process the child fields as required. I'm just learning the lxml library, so I'm curious whether there is a way to pull out the main items while leaving any sub-items alone if encountered?

The entire XML is parsed anyway by the core implementation. etree.iterparse is just a generator-style view that provides simple filtering by tag name (see the docstring: http://lxml.de/api/lxml.etree.iterparse-class.html). If you want more complex filtering, you have to do it yourself.
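As a rough illustration of doing the filtering yourself, one option is to keep the tag filter and skip any matched element that has an ancestor with the same tag. This is only a sketch under assumptions: the XML literal wraps the question's items in an assumed <items> root, and is_nested is a hypothetical helper, not part of lxml.

```python
from io import BytesIO
from lxml import etree

# Hypothetical sample: the question's two items inside an assumed <items> root.
XML = b"""<items>
<item><title>Item 1</title><desc>Description 1</desc><url><item>http://www.url1.com</item></url></item>
<item><title>Item 2</title><desc>Description 2</desc><url><item>http://www.url2.com</item></url></item>
</items>"""


def is_nested(elem):
    """Return True if any ancestor of elem has the same tag as elem."""
    return next(elem.iterancestors(elem.tag), None) is not None


top_level_titles = []
for event, elem in etree.iterparse(BytesIO(XML), tag="item"):
    if is_nested(elem):
        # Nested <item>: do not clear it, its parent still needs the text.
        continue
    top_level_titles.append(elem.findtext("title"))
    # Free the finished element and the siblings that came before it.
    elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]

print(top_level_titles)
```

Skipping the nested element before calling clear() matters here: clearing it at its own "end" event would wipe the URL text before the enclosing item is processed.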

One solution is to register for the "start" event as well:

etree.iterparse(MYFILE, events=("start", "end"), tag="item")

and keep a bool to tell whether you are at the end of a top-level "item" or at the end of a nested "item/url/item".
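The start/end idea could be sketched roughly as follows. This is not a definitive implementation: it uses a depth counter instead of a single bool so that deeper nesting is also handled, iter_top_level_items is a hypothetical helper name, and SAMPLE wraps the question's items in an assumed <items> root element.

```python
from io import BytesIO
from lxml import etree

# Hypothetical sample: the question's data inside an assumed <items> root.
SAMPLE = b"""<items>
<item>
  <title>Item 1</title>
  <desc>Description 1</desc>
  <url>
     <item>http://www.url1.com</item>
  </url>
</item>
<item>
  <title>Item 2</title>
  <desc>Description 2</desc>
  <url>
     <item>http://www.url2.com</item>
  </url>
</item>
</items>"""


def iter_top_level_items(source, tag="item"):
    """Yield only the outermost elements named `tag`, fully parsed."""
    depth = 0  # number of currently open elements named `tag`
    for event, elem in etree.iterparse(source, events=("start", "end"), tag=tag):
        if event == "start":
            depth += 1
            continue
        depth -= 1
        if depth:
            # "end" of a nested item: leave it in place for its parent.
            continue
        yield elem
        # The caller is done with the element: free it and earlier siblings.
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]


descriptions = []
urls = []
for item in iter_top_level_items(BytesIO(SAMPLE)):
    descriptions.append(item.findtext("desc"))
    urls.append(item.findtext("url/item"))

print(descriptions)
print(urls)
```

Because nested items are never cleared on their own "end" event, the url/item text is still available when the enclosing item is yielded.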
