Modifying a large XML file with lxml

Question

Language: Python 2.7.6

File size: 1.5 GB

XML format:

<myfeed>
    <product>
        <id>876543</id>
        <name>ABC</name>
        ....
    </product>

    <product>
        <id>876567</id>
        <name>DEF</name>
        ....
    </product>

    <product>
        <id>986543</id>
        <name>XYZ</name>
        ....
    </product>

I have to:

A) Read all the <product> nodes

B) Delete some of these nodes (if the text of the <id> child element is in a Python set())

C) Update/alter a few nodes (if the text of the <id> child element is in a Python dict)

D) Append/write some new nodes

The problem is that my XML file is huge (approx. 1.5 GB). I did some research and decided to use lxml for all of these purposes.

I am trying to use iterparse() with element.clear() to achieve this, because that way it will not consume all my memory.

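from lxml import etree

# big_xml_file, unique_tag and the two lookup collections are defined elsewhere
for event, element in etree.iterparse(big_xml_file, tag='product'):
    for child in element:
        if child.tag == unique_tag:
            if child.text in products_id_hash_set_to_delete:  # Python set()
                pass  # delete this element node
            elif child.text in products_dict_to_update:  # Python dict
                pass  # update this element node
            else:
                print(child.text)
    element.clear()  # release the children of the processed <product>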

Note: I want to achieve all 4 of these tasks in one scan of the XML file.

Questions

1) Can I achieve all this in one scan of the file?

2) If yes, how do I delete and update the element nodes I am processing?

3) Should I use tree.xpath() instead? If yes, how much memory will it consume for a 1.5 GB file, or does it work the same way as iterparse()?

I am not very experienced in Python. I come from a Java background.

Answer 1

You can't edit an XML file in-place. You have to write the output to a new (temporary) file, and then replace the original file with the new file.

So the basic algorithm is:

1) Loop over all elements.
2) If the node is one to delete, proceed to the next element.
3) If the node is one to change, change its value.
4) Write out the node. ««« This is the crucial bit you are missing.
5) When you are about to finish processing a node which is the parent of one of the new nodes, write out the new node, and remove it from the collection of new nodes.
6) Close the output file.
7) Rename.

To answer the supplemental question: You need to realize that an XML file is a (long) string of characters. If you want to insert a character, you have to shuffle all the other ones up; if you want to delete a character, you have to shuffle all the other ones down. You can't do that with a file; you can't just delete a character from the middle of a file.
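A small experiment illustrates the point. The sketch below uses a throwaway temporary file and made-up content: seeking into the middle of the file and writing does not insert anything, it overwrites the bytes that were already there.

```python
import os
import tempfile

# Create a tiny file holding two elements (16 bytes in total).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"<a>1</a><b>2</b>")

# Try to "insert" a new element between them.
with open(path, "r+b") as f:
    f.seek(8)            # position just after </a>
    f.write(b"<new/>")   # overwrites the next 6 bytes; nothing is shifted

with open(path, "rb") as f:
    result = f.read()
os.remove(path)

# result is b"<a>1</a><new/>b>": the start of <b>2</b> has been destroyed.
```

The file is still 16 bytes long and is no longer well-formed XML, which is why the output has to go to a new file.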

If you have millions of elements (and this is a real problem, not an exercise for a class), then you need to use a database. SQLite is my first thought when somebody says "database", but as Charles Duffy points out below, an XQuery database would probably be a better place to start given you already have XML. See BaseX or eXist for some open-source implementations.
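To give a feel for the SQLite route, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and sample values are assumptions made for illustration; a real feed would be loaded from the XML once and stored in a file on disk rather than in memory.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a real feed would use a file on disk
conn.execute("CREATE TABLE product (id TEXT PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO product VALUES (?, ?)",
    [("876543", "ABC"), ("876567", "DEF"), ("986543", "XYZ")],
)

# Hypothetical inputs mirroring the set, dict, and new nodes from the question.
ids_to_delete = {"876543"}
updates = {"876567": "DEF2"}
new_products = [("111", "NEW")]

with conn:  # commit all three kinds of change together
    conn.executemany("DELETE FROM product WHERE id = ?",
                     [(i,) for i in ids_to_delete])
    conn.executemany("UPDATE product SET name = ? WHERE id = ?",
                     [(name, i) for i, name in updates.items()])
    conn.executemany("INSERT INTO product VALUES (?, ?)", new_products)

rows = conn.execute("SELECT id, name FROM product ORDER BY id").fetchall()
```

Each delete, update, or insert then touches only the affected rows instead of rewriting the whole 1.5 GB file.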

Answer 1: 2, accepted, 2015-12-16 08:19:50