简体   繁体   English

使用lxml修改大型xml文件

[英]Modify large xml file using lxml

Language :- Python 2.7.6 语言:-Python 2.7.6

File Size :- 1.5 GB 档案大小:-1.5 GB

XML Format XML格式

<myfeed>
    <product>
        <id>876543</id>
        <name>ABC</name>
        ....
     </product>

    <product>
        <id>876567</id>
        <name>DEF</name>
        ....
     </product>

    <product>
        <id>986543</id>
        <name>XYZ</name>
        ....
     </product>

I have to 我必须

A) Read all the nodes <product> A)读取所有节点<product>

B) Delete some of these nodes ( if the <id> attribute's text is in python set() B)删除其中一些节点(如果<id>属性的文本在python set()中

C) Update/Alter few nodes ( if the <id> attribute's text is in python dict C)更新/更改几个节点(如果<id>属性的文本在python dict中

D) Append/Write some new nodes D)追加/写入一些新节点

The problem is my XML file is huge ( approx 1.5 GB ). 问题是我的XML文件很大(大约1.5 GB)。 I did some research and decide to use lxml for all these purposes. 我进行了一些研究,并决定将lxml用于所有这些目的。

I am trying to use iterparse() with element.clear() to achieve this because it will not consume all my memory. 我正在尝试将iterparse()与element.clear()配合使用,因为它不会消耗我的所有内存。

for event, element in etree.iterparse(big_xml_file,tag = 'product'):
        for child in element:
            if child.tag == unique_tag:
                if child.text in products_id_hash_set_to_delete: #python set()
                    #delete this element node

                else:
                    if child.text in products_dict_to_update:
                        #update this element node  
                        else:
                            print child.text
        element.clear()

Note:- I want to achieve all these 4 task in one scan of the XML file 注意:-我想一次扫描XML文件来完成所有这4个任务

Questions 问题

1) Can I achieve all this in one scan of the file ? 1)我可以一次扫描文件来实现所有这些功能吗?

2) If yes, how to delete and update the element nodes I am processing? 2)如果是,如何删除和更新我正在处理的元素节点?

3) Should I use tree.xpath() instead ? 3)我应该改用tree.xpath()吗? If yes, how much memory will it consume for 1.5 GB file or does it works in same way as iterparse() 如果是,则将为1.5 GB的文件消耗多少内存,或者它与iterparse()的工作方式相同

I am not very experienced in python. 我对python不太有经验。 I am from Java background. 我来自Java背景。

You can't edit an XML file in-place. 您不能就地编辑XML文件。 You have to write the output to a new (temporary) file, and then replace the original file with the new file. 您必须将输出写入新的(临时)文件,然后用新文件替换原始文件。

So the basic algorithm is: 因此,基本算法是:

  • Loop over all elements. 循环遍历所有元素。
  • If the node is one to delete, proceed to the next element 如果该节点是要删除的节点,则继续下一个元素
  • If the node is one to change, change its value 如果该节点是要更改的节点,请更改其值
  • Write out the node ««« This is the crucial bit you are missing 写出节点«««这是您缺少的关键点
  • When you are about to finish processing a node which is a parent of one of the new nodes, write out the new node, and remove it from the collection of new nodes. 当您要完成对作为新节点之一的父节点的节点的处理时,请写出新节点,并将其从新节点集合中删除。
  • Close the output file 关闭输出文件
  • Rename. 改名。

To answer the supplemental question: You need to realize that an XML file is a (long) string of characters. 要回答补充问题:您需要认识到XML文件是一个(长)字符串。 If you want to insert a character, you have to shuffle all the other ones up; 如果要插入一个字符,则必须将其他所有字符都洗牌; if you want to delete a character, you have to shuffle all the other ones down. 如果要删除一个字符,则必须将其他所有字符都洗掉。 You can't do that with a file; 您不能使用文件来执行此操作; you can't just delete a character from the middle of a file. 您不能只是从文件中间删除字符。

If you have millions of elements (and this is a real problem, not an exercise for a class), then you need to use a database. 如果您有数百万个元素(这是一个实际的问题,而不是一个类的练习),那么您需要使用数据库。 SQLite is my first thought when somebody says "database", but as Charles Duffy points out below, an XQuery database would probably be a better place to start given you already have XML. 当有人说“数据库”时,我首先想到了SQLite,但是正如Charles Duffy在下面指出的那样,如果您已经有了XML,那么XQuery数据库可能是一个更好的起点。 See BaseX or eXist for some open-source implementations. 有关某些开源实现,请参见BaseX或eXist。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM