
Using Python Iterparse For Large XML Files

I need to write a parser in Python that can process some extremely large files (> 2 GB) on a computer without much memory (only 2 GB). I wanted to use iterparse in lxml to do it.

My file is of the format:

<item>
  <title>Item 1</title>
  <desc>Description 1</desc>
</item>
<item>
  <title>Item 2</title>
  <desc>Description 2</desc>
</item>

and so far my solution is:

from lxml import etree

context = etree.iterparse(MYFILE, tag='item')

for event, elem in context:
    print(elem.xpath('desc/text()'))

del context

Unfortunately though, this solution is still eating up a lot of memory. I think the problem is that after dealing with each "item" I need to do something to clean up empty children. Can anyone offer some suggestions on what I might do after processing my data to properly clean up?

Try Liza Daly's fast_iter. After processing an element, elem, it calls elem.clear() to remove descendants and also removes preceding siblings.

def fast_iter(context, func, *args, **kwargs):
    """
    http://lxml.de/parsing.html#modifying-the-tree
    Based on Liza Daly's fast_iter
    http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    See also http://effbot.org/zone/element-iterparse.htm
    """
    for event, elem in context:
        func(elem, *args, **kwargs)
        # It's safe to call clear() here because no descendants will be
        # accessed
        elem.clear()
        # Also eliminate now-empty references from the root node to elem
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context


def process_element(elem):
    print(elem.xpath('desc/text()'))

context = etree.iterparse(MYFILE, tag='item')
fast_iter(context, process_element)

Daly's article is an excellent read, especially if you are processing large XML files.


Edit: The fast_iter posted above is a modified version of Daly's fast_iter. After processing an element, it is more aggressive at removing other elements that are no longer needed.

The script below shows the difference in behavior. Note in particular that orig_fast_iter does not delete the A1 element, while mod_fast_iter does delete it, thus saving more memory.

import lxml.etree as ET
import textwrap
import io

def setup_ABC():
    content = textwrap.dedent('''\
      <root>
        <A1>
          <B1></B1>
          <C>1<D1></D1></C>
          <E1></E1>
        </A1>
        <A2>
          <B2></B2>
          <C>2<D></D></C>
          <E2></E2>
        </A2>
      </root>
        ''')
    return content


def study_fast_iter():
    def orig_fast_iter(context, func, *args, **kwargs):
        for event, elem in context:
            print('Processing {e}'.format(e=ET.tostring(elem, encoding='unicode')))
            func(elem, *args, **kwargs)
            print('Clearing {e}'.format(e=ET.tostring(elem, encoding='unicode')))
            elem.clear()
            while elem.getprevious() is not None:
                print('Deleting {p}'.format(
                    p=(elem.getparent()[0]).tag))
                del elem.getparent()[0]
        del context

    def mod_fast_iter(context, func, *args, **kwargs):
        """
        http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
        Author: Liza Daly
        See also http://effbot.org/zone/element-iterparse.htm
        """
        for event, elem in context:
            print('Processing {e}'.format(e=ET.tostring(elem, encoding='unicode')))
            func(elem, *args, **kwargs)
            # It's safe to call clear() here because no descendants will be
            # accessed
            print('Clearing {e}'.format(e=ET.tostring(elem, encoding='unicode')))
            elem.clear()
            # Also eliminate now-empty references from the root node to elem
            for ancestor in elem.xpath('ancestor-or-self::*'):
                print('Checking ancestor: {a}'.format(a=ancestor.tag))
                while ancestor.getprevious() is not None:
                    print(
                        'Deleting {p}'.format(p=(ancestor.getparent()[0]).tag))
                    del ancestor.getparent()[0]
        del context

    content = setup_ABC()
    context = ET.iterparse(io.BytesIO(content.encode('utf-8')), events=('end', ), tag='C')
    orig_fast_iter(context, lambda elem: None)
    # Processing <C>1<D1/></C>
    # Clearing <C>1<D1/></C>
    # Deleting B1
    # Processing <C>2<D/></C>
    # Clearing <C>2<D/></C>
    # Deleting B2

    print('-' * 80)
    """
    The improved fast_iter deletes A1. The original fast_iter does not.
    """
    content = setup_ABC()
    context = ET.iterparse(io.BytesIO(content.encode('utf-8')), events=('end', ), tag='C')
    mod_fast_iter(context, lambda elem: None)
    # Processing <C>1<D1/></C>
    # Clearing <C>1<D1/></C>
    # Checking ancestor: root
    # Checking ancestor: A1
    # Checking ancestor: C
    # Deleting B1
    # Processing <C>2<D/></C>
    # Clearing <C>2<D/></C>
    # Checking ancestor: root
    # Checking ancestor: A2
    # Deleting A1
    # Checking ancestor: C
    # Deleting B2

study_fast_iter()

iterparse() lets you do stuff while building the tree, which means that unless you remove what you don't need anymore, you'll still end up with the whole tree in the end.
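A minimal, runnable sketch of that idea, using the stdlib's xml.etree.ElementTree (the same pattern works with lxml); the in-memory sample document and the descriptions list are illustrative, not from the OP's code:

```python
import io
from xml.etree.ElementTree import iterparse

# In-memory stand-in for a large file on disk
xml = (b"<root>"
       b"<item><title>Item 1</title><desc>Description 1</desc></item>"
       b"<item><title>Item 2</title><desc>Description 2</desc></item>"
       b"</root>")

descriptions = []
for event, elem in iterparse(io.BytesIO(xml)):   # default: 'end' events only
    if elem.tag == "item":
        descriptions.append(elem.findtext("desc"))
        elem.clear()  # discard the item's children so the tree stops growing

print(descriptions)  # ['Description 1', 'Description 2']
```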

For more information: read this article by the author of the original ElementTree implementation (it is also applicable to lxml).

In my experience, iterparse with or without element.clear (see F. Lundh and L. Daly) cannot always cope with very large XML files: it goes well for some time, then suddenly memory consumption goes through the roof and a memory error occurs or the system crashes. If you encounter the same problem, maybe you can use the same solution: the expat parser. See also F. Lundh or the following example using the OP's XML snippet (plus two umlauts to check that there are no encoding issues):

import xml.parsers.expat
from collections import deque

def iter_xml(inpath: str, outpath: str) -> None:
    def handle_cdata_end():
        nonlocal in_cdata
        in_cdata = False

    def handle_cdata_start():
        nonlocal in_cdata
        in_cdata = True

    def handle_data(data: str):
        nonlocal in_cdata
        if not in_cdata and open_tags and open_tags[-1] == 'desc':
            data = data.replace('\\', '\\\\').replace('\n', '\\n')
            outfile.write(data + '\n')

    def handle_endtag(tag: str):
        while open_tags:
            open_tag = open_tags.pop()
            if open_tag == tag:
                break

    def handle_starttag(tag: str, attrs: 'Dict[str, str]'):
        open_tags.append(tag)

    open_tags = deque()
    in_cdata = False
    parser = xml.parsers.expat.ParserCreate()
    parser.CharacterDataHandler = handle_data
    parser.EndCdataSectionHandler = handle_cdata_end
    parser.EndElementHandler = handle_endtag
    parser.StartCdataSectionHandler = handle_cdata_start
    parser.StartElementHandler = handle_starttag
    with open(inpath, 'rb') as infile:
        with open(outpath, 'w', encoding = 'utf-8') as outfile:
            parser.ParseFile(infile)

iter_xml('input.xml', 'output.txt')

input.xml:

<root>
    <item>
    <title>Item 1</title>
    <desc>Description 1ä</desc>
    </item>
    <item>
    <title>Item 2</title>
    <desc>Description 2ü</desc>
    </item>
</root>

output.txt:

Description 1ä
Description 2ü

Why not use SAX's "callback" approach?
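A SAX handler for the OP's format might look roughly like this, a sketch with the stdlib's xml.sax; the DescHandler class name and the sample document are made up for illustration:

```python
import xml.sax

class DescHandler(xml.sax.ContentHandler):
    """Collect the text of every <desc> element, streaming, no tree."""
    def __init__(self):
        super().__init__()
        self.in_desc = False
        self.descriptions = []

    def startElement(self, name, attrs):
        if name == "desc":
            self.in_desc = True
            self.descriptions.append("")

    def endElement(self, name):
        if name == "desc":
            self.in_desc = False

    def characters(self, content):
        # characters() may be called several times per text node
        if self.in_desc:
            self.descriptions[-1] += content

handler = DescHandler()
xml.sax.parseString(b"<root><item><desc>Description 1</desc></item>"
                    b"<item><desc>Description 2</desc></item></root>", handler)
print(handler.descriptions)  # ['Description 1', 'Description 2']
```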

Note that iterparse still builds a tree, just like parse, but you can safely rearrange or remove parts of the tree while parsing. For example, to parse large files, you can get rid of elements as soon as you've processed them:

for event, elem in iterparse(source):
    if elem.tag == "record":
        ...  # process record elements
        elem.clear()

The above pattern has one drawback: it does not clear the root element, so you will end up with a single element with lots of empty child elements. If your files are huge, rather than just large, this might be a problem. To work around this, you need to get your hands on the root element. The easiest way to do this is to enable start events and save a reference to the first element in a variable:

# get an iterable
context = iterparse(source, events=("start", "end"))

# turn it into an iterator
context = iter(context)

# get the root element
event, root = next(context)

for event, elem in context:
    if event == "end" and elem.tag == "record":
        ... process record elements ...
        root.clear()
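Put together as a small runnable sketch (the <log>/<record> document is made up for illustration; note that modern Python spells context.next() as next(context)):

```python
import io
from xml.etree.ElementTree import iterparse

xml = b"<log><record>one</record><record>two</record></log>"

context = iterparse(io.BytesIO(xml), events=("start", "end"))
event, root = next(context)      # the first event is the 'start' of the root

seen = []
for event, elem in context:
    if event == "end" and elem.tag == "record":
        seen.append(elem.text)   # use the element before discarding it
        root.clear()             # drop finished records from the root

print(seen)  # ['one', 'two']
```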

So this is a question of incremental parsing. This link can give you a detailed answer; for a summarized answer you can refer to the above.

The only problem with the root.clear() method is that it returns None. This means you can't, for instance, edit the data you parse with string methods like replace() or title() in the same expression. That said, this is an optimal method to use if you're just parsing the data as is.
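In other words, copy the string out of the element and transform the copy first; clearing only frees memory. A small sketch with made-up data:

```python
import io
from xml.etree.ElementTree import iterparse

xml = b"<root><item><desc>description 1</desc></item></root>"

texts = []
for event, elem in iterparse(io.BytesIO(xml)):
    if elem.tag == "desc":
        # clear() returns None, so apply string methods to a copy first
        texts.append(elem.text.title())
        elem.clear()

print(texts)  # ['Description 1']
```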
