Python removing elements from large xml file with xml iterparse

I'm completely new to Python and have recently been using it to parse a large-ish XML file (700 MB).

Having looked around, I have been trying to use iterparse to remove an element called Revision_History from the XML, since we no longer need this information.

I've been through a couple of variations of this script, so it could be horribly wrong by now. It seems to work fine for the first two removals. However, it then stops working and finds no further Revision_History tags.

import xml.etree.ElementTree as ET

for event, elem in ET.iterparse("AAT.xml", events=("end",)):
    if event == "end":
        for subject in elem.findall("{http://localhost/namespace}Subject"):
            print("subject found")
            for revision in subject.findall("{http://localhost/namespace}Revision_History"):
                print("revision found")
                subject.remove(revision)
                print("done")
    elem.clear()

Any advice much appreciated!

Adam

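Try using cElementTree instead of ElementTree. It has been significantly faster for me, although I've never parsed files as large as yours. (This applies to Python 2: since Python 3.3, xml.etree.ElementTree uses the C accelerator automatically when it is available, and the cElementTree module was removed in Python 3.9.)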

from xml.etree import cElementTree as ET

Secondly, try using iterfind() instead of findall() on the matching elements.

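from xml.etree import cElementTree as ET  # on Python 3.9+: import xml.etree.ElementTree as ET

for event, elem in ET.iterparse("books.xml", events=("end",)):
    if elem.tag == "book":
        # Copy the matches first: removing children while iterating can skip siblings
        for d in list(elem.iterfind("description")):
            elem.remove(d)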
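
When removing children inside a loop like this, it is safer to iterate over a copy of the matches, because deleting from an element while walking it can skip the next sibling. Here is a minimal sketch of that idea. The sample XML is made up to stand in for books.xml, and strip_descriptions is a hypothetical helper name, not part of ElementTree.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for books.xml
SAMPLE = b"""<catalog>
  <book id="1"><title>A</title><description>x</description><description>y</description></book>
  <book id="2"><title>B</title><description>z</description></book>
</catalog>"""


def strip_descriptions(source):
    """Remove every <description> child of each <book>; return (root, removed_count)."""
    removed = 0
    root = None
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "book":
            # list() snapshots the matches before any of them is removed
            for d in list(elem.iterfind("description")):
                elem.remove(d)
                removed += 1
        root = elem  # the last "end" event belongs to the root element
    return root, removed


root, removed = strip_descriptions(io.BytesIO(SAMPLE))
print(removed)                          # 3
print(root.findall(".//description"))   # []
```

The same function should accept a filename instead of the BytesIO object, since iterparse takes either.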

Thirdly, depending on how much RAM you want to use, you could try using XPath to find the elements which have the child you wish to delete. Then, iterate through the parents, deleting those children. Very poor example:

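for event, elem in ET.iterparse("books.xml", events=("end",)):
    # Find every Subject below elem that has a Revision_History child
    for subject in elem.iterfind(".//Subject[Revision_History]"):
        # list() takes a snapshot so that removal does not disturb the iteration
        for child in list(subject):
            if child.tag == "Revision_History":
                subject.remove(child)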

With XPath, try to avoid the .//foo path if you know the structure of your document, since it searches every descendant. Write a more specific query instead, such as ./path/to/element/foo[@attr='bar'] or similar (ElementTree requires the attribute value to be quoted).
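
To illustrate the difference between the two kinds of query, here is a small sketch using ElementTree's limited XPath support. The document, the Group element and the id attribute are invented for the example and are not taken from the asker's file.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; element and attribute names are for illustration only
doc = ET.fromstring("""
<Vocabulary>
  <Subject id="1"><Revision_History/><Term>a</Term></Subject>
  <Subject id="2"><Term>b</Term></Subject>
  <Group><Subject id="3"><Revision_History/></Subject></Group>
</Vocabulary>
""")

# Descendant search: considers every element below doc
anywhere = doc.findall(".//Subject[Revision_History]")

# Explicit path: only looks at direct children of doc
direct = doc.findall("./Subject[Revision_History]")

# Attribute predicate: the value has to be quoted
by_id = doc.findall("./Subject[@id='2']")

print([s.get("id") for s in anywhere])  # ['1', '3']
print([s.get("id") for s in direct])    # ['1']
print([s.get("id") for s in by_id])     # ['2']
```

Note that the explicit path misses the nested Subject, so it is only a substitute for .//Subject when you know where the elements live.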

There are much better ways to solve this, I'm sure.
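
One alternative, offered as a rough sketch rather than a tested solution for a 700 MB file: listen for both "start" and "end" events, keep a stack of open elements, and detach each Revision_History from its parent as soon as it closes. The namespace URI is the one from the question; strip_revision_history and the sample document are hypothetical. Note the assumption that the rest of the tree (minus the revision histories) fits in memory, since it is written out in one go at the end.

```python
import io
import xml.etree.ElementTree as ET

NS = "http://localhost/namespace"  # namespace URI from the question
REVISION = "{%s}Revision_History" % NS


def strip_revision_history(source, target):
    """Copy source to target without Revision_History elements; return the number removed."""
    stack = []    # currently open elements, innermost last
    root = None
    removed = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if root is None:
                root = elem  # the first "start" event is the root
            stack.append(elem)
        else:
            stack.pop()  # elem itself; its parent is now on top of the stack
            if elem.tag == REVISION and stack:
                stack[-1].remove(elem)
                removed += 1
    # Serialize NS as the default namespace instead of an ns0: prefix
    ET.register_namespace("", NS)
    ET.ElementTree(root).write(target, encoding="utf-8", xml_declaration=True)
    return removed


# Hypothetical miniature of the asker's file
SAMPLE = (
    '<Vocabulary xmlns="http://localhost/namespace">'
    '<Subject><Term>a</Term><Revision_History><Revision/></Revision_History></Subject>'
    '<Subject><Term>b</Term><Revision_History/></Subject>'
    '</Vocabulary>'
).encode("utf-8")

out = io.BytesIO()
print(strip_revision_history(io.BytesIO(SAMPLE), out))  # 2
print(out.getvalue().decode("utf-8"))
```

For the real file you would pass "AAT.xml" as the source and an output filename as the target. If even the trimmed tree is too large for memory, each finished Subject would have to be written out and cleared individually, which this sketch does not attempt.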
