
Memory-efficient way to change and parse a large XML file in Python

I want to parse a large XML file (25 GB) in Python and change some of its elements.

I tried ElementTree from xml.etree, but the very first step (ElementTree.parse) takes too much time.

I read somewhere that SAX is fast and does not load the entire file into memory, but that it is only for parsing, not modifying.

As far as I understand, 'iterparse' is also only for parsing, not modifying.

Is there any other option that is fast and memory-efficient?

What matters here is that you need a streaming parser, which is what SAX is. (Python has a built-in SAX implementation, and lxml provides one as well.) The problem is that, since you are trying to modify the XML file, you will have to rewrite the file as you read it.

An XML file is a text file. You can't change some data in the middle of a text file without rewriting the entire file (unless the new data is exactly the same size as the old, which is unlikely).
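A minimal sketch of that rewrite pattern: stream the original into a temporary file in the same directory, then swap the new file in with os.replace. The function name rewrite_in_place is made up for illustration, and the sketch assumes the file contains line breaks and that the bytes being replaced never span a line.

```python
import os
import tempfile


def rewrite_in_place(path, old, new):
    """Stream `path` into a temporary file with `old` replaced by `new`, then swap it in.

    Hypothetical helper: assumes `old` (bytes) never spans a line break.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the original so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as dst, open(path, "rb") as src:
            for line in src:  # only one line is held in memory at a time
                dst.write(line.replace(old, new))
        os.replace(tmp_path, path)  # swap the rewritten file into place
    except BaseException:
        os.unlink(tmp_path)  # clean up the partial copy on any failure
        raise
```

Note that this needs free disk space for a second copy of the file while the rewrite is running.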

You can use SAX to read each element and register an event handler that writes each element back out after it has been read and modified. If your changes are really simple, it may be even faster to skip XML parsing altogether and just match the text you are looking for.
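One way to sketch the SAX approach with the standard library is to chain an XMLFilterBase subclass (which modifies events) into an XMLGenerator (which writes them back out). The class name PriceRewriter, the helper rewrite, and the element name "price" are hypothetical and stand in for whatever needs changing in the real file.

```python
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator


class PriceRewriter(XMLFilterBase):
    """Pass every SAX event through, replacing the text of <price> elements.

    "price" is a made-up element name used only for illustration.
    """

    def __init__(self, parent, new_text):
        super().__init__(parent)
        self._new_text = new_text
        self._in_target = False

    def startElement(self, name, attrs):
        if name == "price":
            self._in_target = True
        super().startElement(name, attrs)

    def characters(self, content):
        # Suppress the original text while inside the target element.
        if not self._in_target:
            super().characters(content)

    def endElement(self, name):
        if name == "price":
            # Emit the replacement text just before the closing tag.
            super().characters(self._new_text)
            self._in_target = False
        super().endElement(name)


def rewrite(source, sink, new_text):
    """Read XML from `source` and write the modified document to `sink`."""
    reader = PriceRewriter(xml.sax.make_parser(), new_text)
    reader.setContentHandler(XMLGenerator(sink, encoding="utf-8"))
    reader.parse(source)
```

Here source can be a file name or an open file, and sink an output file opened for writing, so only the current event is held in memory. The output is not guaranteed to be byte-for-byte identical to the input outside the changed elements; for example, comments are not delivered to a content handler and so would be dropped by this sketch.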

If you are doing any significant work with an XML file this large, then I would say you shouldn't be using an XML file; you should be using a database.
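As a rough illustration of the database route, the records could be loaded once into SQLite and modified there afterwards. Everything about the layout is an assumption: the sketch supposes the records are direct children of the root element, are named "record", and carry an "id" attribute plus a text value. The function load_records and the table schema are made up for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET


def load_records(source, conn, tag="record"):
    """Stream <record id="...">text</record> elements into a SQLite table.

    Hypothetical layout: records are direct children of the root element.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records (id TEXT PRIMARY KEY, value TEXT)"
    )
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            conn.execute(
                "INSERT OR REPLACE INTO records VALUES (?, ?)",
                (elem.get("id"), elem.text),
            )
            root.clear()  # drop already-processed children so memory stays flat
    conn.commit()
```

Once the data is loaded, a change becomes a single statement such as UPDATE records SET value = ? WHERE id = ?, with no need to rewrite a 25 GB file for every edit.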

The problem you have run into here is the same one that COBOL programmers on mainframes had when they were working with file-based data.
