
Python: removing elements from a large XML file with iterparse

I'm completely new to Python and have recently been using it to try to parse a large-ish (700 MB) XML file.

Having looked around, I have been attempting to use the iterparse method to remove an element called Revision_History from the XML, since we no longer require this information.

I've been through a couple of variations of this script, so it could be horribly wrong now. It seems to work fine for the first two removals. However, it then stops working and finds no further Revision_History tags.

import xml.etree.ElementTree as ET

for event, elem in ET.iterparse("AAT.xml", events=("end",)):
    if event == "end":
        for subject in elem.findall("{http://localhost/namespace}Subject"):
            print("subject found")
            for revision in subject.findall("{http://localhost/namespace}Revision_History"):
                print("revision found")
                subject.remove(revision)
                print("done")
    elem.clear()

Any advice much appreciated!

Adam

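Try using cElementTree instead of ElementTree. It's been significantly faster for me, but I've never parsed files as large as the one you are parsing. (This applies to Python 2 and early Python 3: since Python 3.3, xml.etree.ElementTree uses the C accelerator automatically, and the cElementTree module was removed in Python 3.9.)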

from xml.etree import cElementTree as ET

Secondly, try using iterfind() instead of findall() on the matching elements.

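try:
    from xml.etree import cElementTree as ET  # Python 2 and Python 3 before 3.9
except ImportError:
    import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9

for event, elem in ET.iterparse("books.xml", events=("end",)):
    if elem.tag == "book":
        # Snapshot the matches first so that removing does not disturb the iteration.
        for d in list(elem.iterfind("description")):
            elem.remove(d)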
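Applied to the namespaced tags from the question, the same idea might look like the sketch below. It is only a minimal sketch under assumptions: the `strip_children` helper, the `Vocabulary` root element, and the sample document are made up for illustration, and only the namespace and the Subject / Revision_History names come from the question. It removes the children when the Subject's own end event arrives, and it does not call elem.clear() on every element, because clearing each element as it ends may empty a Subject before its parent is ever inspected.

```python
import io
import xml.etree.ElementTree as ET

NS = "{http://localhost/namespace}"  # namespace as written in the question


def strip_children(source, parent_tag, child_tag):
    """Stream-parse `source`, removing each `child_tag` child of `parent_tag`.

    Hypothetical helper. Returns (root, removed_count); the whole tree
    stays in memory, so this trades memory for simplicity.
    """
    removed = 0
    root = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if root is None:
                root = elem  # the first start event belongs to the document root
            continue
        if elem.tag == parent_tag:
            # list() snapshots the matches so removal is safe inside the loop
            for child in list(elem.iterfind(child_tag)):
                elem.remove(child)
                removed += 1
    return root, removed


# Invented sample data; "Vocabulary" and "Name" are assumptions.
sample = b"""<Vocabulary xmlns="http://localhost/namespace">
  <Subject><Name>a</Name><Revision_History><Revision/></Revision_History></Subject>
  <Subject><Name>b</Name><Revision_History/></Subject>
  <Subject><Name>c</Name></Subject>
</Vocabulary>"""

root, removed = strip_children(
    io.BytesIO(sample), NS + "Subject", NS + "Revision_History"
)
print(removed)  # 2
```

If the stripped tree fits in memory, something like `ET.ElementTree(root).write("AAT_stripped.xml", encoding="utf-8", xml_declaration=True)` could save it; the output file name here is just a placeholder.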

Thirdly, depending on how much RAM you want to use, you could try using XPath to find the elements that have the child you wish to delete. Then iterate through those parents, deleting the matching children. Very poor example:

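import xml.etree.ElementTree as ET

for event, elem in ET.iterparse("books.xml", events=("end",)):
    for book_with_desc in elem.iterfind(".//Subject[Revision_History]"):
        # Copy the child list so that removing a child does not skip its siblings.
        for child in list(book_with_desc):
            if child.tag == "Revision_History":
                book_with_desc.remove(child)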

With XPath, try to avoid the .//foo path if you know the structure of your document, and write a more efficient query instead, such as ./path/to/element/foo[@attr='bar'] or similar.
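To make the difference concrete, here is a small sketch using the XPath subset that ElementTree supports. The document, the `group` element, and the `id` attribute are invented for the example. Note that ElementTree expects the attribute value inside a predicate to be quoted.

```python
import xml.etree.ElementTree as ET

# Invented document: two Subject elements under a hypothetical <group> wrapper.
doc = ET.fromstring(
    "<root><group>"
    "<Subject id='1'><Revision_History/></Subject>"
    "<Subject id='2'/>"
    "</group></root>"
)

# Descendant search: walks the whole subtree looking for matching Subjects.
anywhere = doc.findall(".//Subject[Revision_History]")

# Explicit path: only follows root -> group -> Subject.
explicit = doc.findall("./group/Subject[Revision_History]")

# Attribute predicate with a quoted value.
by_attr = doc.findall("./group/Subject[@id='2']")

print([s.get("id") for s in anywhere])  # ['1']
print([s.get("id") for s in explicit])  # ['1']
print([s.get("id") for s in by_attr])   # ['2']
```

Both queries return the same element here; the explicit path simply gives the search less of the tree to visit.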

There are much better ways to solve this, I'm sure.
