
Removing Elements From 300MB XML In Python / ElementTree

I'm trying to parse a 300MB XML file with ElementTree, following advice such as "Can Python xml ElementTree parse a very large xml file?"

from xml.etree import ElementTree as Et

for event, elem in Et.iterparse('C:\...path...\desc2015.xml'):  
    if elem.tag == 'DescriptorRecord':
        for e in elem._children:
            if str(e.tag) in ['DateCreated', 'Year', 'Month', 'TreeNumber', 'HistoryNote', 'PreviousIndexing']:
                e.clear()
                elem.remove(e)
                print 'removed %s' % e

giving...

removed <Element 'HistoryNote' at 0x557cc7f0>
removed <Element 'DateCreated' at 0x557fa990>
removed <Element 'HistoryNote' at 0x55809af0>
removed <Element 'DateCreated' at 0x5580f5d0>

However, this just keeps going: the file isn't getting any smaller, and on inspection the elements are still there. I tried both e.clear() and elem.remove(e), with the same result. Regards

UPDATE

Error output from my first comment on @alexanderlukanin13's answer:

Traceback (most recent call last):
  File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydevd.py", line 1570, in trace_dispatch
Traceback (most recent call last):
  File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydevd.py", line 2278, in <module>
    globals = debugger.run(setup['file'], None, None)
  File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydevd.py", line 1704, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\runfiles.py", line 234, in <module>
    main()
  File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\runfiles.py", line 78, in main
    return pydev_runfiles.main(configuration)  # Note: still doesn't return a proper value.
  File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydev_runfiles.py", line 835, in main
    PydevTestRunner(configuration).run_tests()
  File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydev_runfiles.py", line 762, in run_tests
    file_and_modules_and_module_name = self.find_modules_from_files(files)
  File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydev_runfiles.py", line 517, in find_modules_from_files
    mod = self.__get_module_from_str(import_str, print_exception, pyfile)
  File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydev_runfiles.py", line 476, in __get_module_from_str
    buf_err = pydevd_io.StartRedirect(keep_original_redirection=True, std='stderr')
  File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydevd_io.py", line 72, in StartRedirect
    import sys
MemoryError

The main problem in your script is that you don't save the altered XML back to disk. You need to store a reference to the root element and then call ElementTree.write:

from xml.etree import ElementTree as Et

context = Et.iterparse('input.xml')
root = None
for event, elem in context:
    if elem.tag == 'DescriptorRecord':
        for e in list(elem):  # Iterate over a copy; don't use _children (private) or getchildren() (removed in Python 3.9)
            if e.tag in ['DateCreated', 'Year', 'Month', 'TreeNumber', 'HistoryNote', 'PreviousIndexing']:
                elem.remove(e)  # You need remove(), not clear()
    root = elem

with open('output.xml', 'wb') as file:
    Et.ElementTree(root).write(file, encoding='utf-8', xml_declaration=True)

Note: here I use an awkward (and probably unsafe) way to get the root element: I assume that it's always the last element in the iterparse output. If anyone knows a better way, please let me know.
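One commonly used alternative for the root element, offered here as a sketch rather than as part of the original answer: ask iterparse for 'start' events as well, and take the element from the very first event, which is the start of the document's root. The function name strip_elements, the UNWANTED set, and the in-memory sample document are hypothetical names introduced for illustration; the tags are copied from the question.

```python
import io
from xml.etree import ElementTree as ET

# Tag names taken from the question; the constant name itself is hypothetical.
UNWANTED = {'DateCreated', 'Year', 'Month', 'TreeNumber',
            'HistoryNote', 'PreviousIndexing'}


def strip_elements(source, record_tag='DescriptorRecord', unwanted=UNWANTED):
    """Parse `source`, drop unwanted children of each record, return the root."""
    context = ET.iterparse(source, events=('start', 'end'))
    # The first event emitted is the 'start' of the root element,
    # so there is no need to guess that the root arrives last.
    _, root = next(context)
    for event, elem in context:
        # Only act on 'end' events: by then the record's children are complete.
        if event == 'end' and elem.tag == record_tag:
            for child in list(elem):  # iterate over a copy while removing
                if child.tag in unwanted:
                    elem.remove(child)
    return root


# Small in-memory stand-in for the real file (assumed structure).
sample = (b"<Set><DescriptorRecord><Name>A</Name>"
          b"<Year>2015</Year></DescriptorRecord></Set>")
root = strip_elements(io.BytesIO(sample))
result = ET.tostring(root)
```

The returned root can be passed to ElementTree(root).write(...) exactly as in the code above. This version still keeps the whole tree in memory, so on its own it is unlikely to help with the MemoryError shown in the question.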
