
Using lxml and iterparse() to parse a big (+- 1Gb) XML file

I have to parse a 1Gb XML file with a structure such as below and extract the text within the tags "Author" and "Content":

<Database>
    <BlogPost>
        <Date>MM/DD/YY</Date>
        <Author>Last Name, Name</Author>
        <Content>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.</Content>
    </BlogPost>

    <BlogPost>
        <Date>MM/DD/YY</Date>
        <Author>Last Name, Name</Author>
        <Content>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.</Content>
    </BlogPost>

    [...]

    <BlogPost>
        <Date>MM/DD/YY</Date>
        <Author>Last Name, Name</Author>
        <Content>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.</Content>
    </BlogPost>
</Database>

So far I've tried two things: i) reading the whole file and going through it with .find(xmltag), and ii) parsing the XML file with lxml and iterparse(). I've gotten the first option to work, but it is very slow. The second option I haven't managed to get off the ground.

Here's part of what I have:

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
    if element.tag == "BlogPost":
        print element.text
    else:
        print 'Finished'

The result of that is only blank spaces, with no text in them.

I must be doing something wrong, but I can't grasp it. Also, in case it wasn't obvious enough, I am quite new to Python and it is the first time I'm using lxml. Please, help!

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
  for child in element:
    print(child.tag, child.text)
  element.clear()

The final clear() will stop you from using too much memory.
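To tie this answer back to the question's goal of pulling out "Author" and "Content", here is a self-contained sketch of the same pattern; the small inline sample and the use of a BytesIO object stand in for the real ~1 GB file, since iterparse() accepts any file path or file-like object:

```python
from lxml import etree
import io

# Small inline sample standing in for the real ~1 GB file.
sample = b"""<Database>
  <BlogPost>
    <Date>MM/DD/YY</Date>
    <Author>Last Name, Name</Author>
    <Content>Lorem ipsum dolor sit amet.</Content>
  </BlogPost>
</Database>"""

for event, element in etree.iterparse(io.BytesIO(sample), tag='BlogPost'):
    # findtext() returns the text of the first matching direct child.
    print(element.findtext('Author'), ':', element.findtext('Content'))
    element.clear()  # free the parsed subtree once it has been processed
```

This prints `Last Name, Name : Lorem ipsum dolor sit amet.` once per `<BlogPost>` record.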

[update:] To get "everything between ... as a string", I guess you want one of:

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
  print(etree.tostring(element, encoding='unicode'))
  element.clear()

or

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
  print(''.join(etree.tostring(child, encoding='unicode') for child in element))
  element.clear()

or perhaps even:

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
  print(''.join([child.text for child in element]))
  element.clear()

For future searchers: the top answer here suggests clearing the element on each iteration, but that still leaves you with an ever-increasing set of empty elements that will slowly build up in memory:

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
  for child in element:
    print(child.tag, child.text)
  element.clear()

^ This is not a scalable solution, especially as your source file gets larger and larger. The better solution is to get the root element, and clear it every time you load a complete record. This will keep memory usage pretty stable (sub-20MB, I would say).

Here's a solution that doesn't require looking for a specific tag. This function returns a generator that yields all first-level child nodes (e.g. <BlogPost> elements) underneath the root node (e.g. <Database>). It does this by recording the start of the first tag after the root node, then waiting for the corresponding end tag, yielding the entire element, and then clearing the root node.

from lxml import etree

xmlfile = '/path/to/xml/file.xml'

def iterate_xml(xmlfile):
    doc = etree.iterparse(xmlfile, events=('start', 'end'))
    _, root = next(doc)
    start_tag = None
    for event, element in doc:
        if event == 'start' and start_tag is None:
            start_tag = element.tag
        if event == 'end' and element.tag == start_tag:
            yield element
            start_tag = None
            root.clear()

I prefer XPath for such things:

In [1]: from lxml.etree import parse

In [2]: tree = parse('/tmp/database.xml')

In [3]: for post in tree.xpath('/Database/BlogPost'):
   ...:     print 'Author:', post.xpath('Author')[0].text
   ...:     print 'Content:', post.xpath('Content')[0].text
   ...: 
Author: Last Name, Name
Content: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.
Author: Last Name, Name
Content: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.
Author: Last Name, Name
Content: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas dictum dictum vehicula.

I'm not sure if it's different in terms of processing big files, though. Comments about this would be appreciated.
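One caveat worth noting: parse() builds the entire tree in memory, so the approach above will not scale to a 1 GB file the way iterparse() does. If you prefer XPath syntax, one compromise (a sketch, not part of the original answer) is to evaluate a relative XPath against each streamed element instead of the whole document:

```python
import io
from lxml import etree

# Inline sample in place of the real file.
sample = io.BytesIO(
    b"<Database><BlogPost><Author>Last Name, Name</Author>"
    b"<Content>Lorem ipsum.</Content></BlogPost></Database>"
)

for event, post in etree.iterparse(sample, tag='BlogPost'):
    # Calling xpath() on an element evaluates the path relative to it.
    print('Author:', post.xpath('Author')[0].text)
    print('Content:', post.xpath('Content')[0].text)
    post.clear()  # keep memory bounded while streaming
```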

Doing it your way,

for event, element in etree.iterparse(path_to_file, tag="BlogPost"):
     for info in element.iter():
         if info.tag in ('Author', 'Content'):
             print(info.tag, ':', info.text)
