Processing XML files with LXML in Python is extremely slow

Question

I'm processing XML documents like the following.

<tok lemma="i" xpos="CC">e</tok> 
<tok lemma="que" xpos="CS">que</tok> 
<tok lemma="aquey" xpos="PD0MP0">aqueys</tok> 
<tok lemma="marit" xpos="NCMP000">marits</tok> 
<tok lemma="estar" xpos="VMIP3P0">stiguen</tok>  
[...]
<tok lemma="habitar" xpos="VMIP3P0">habiten</tok> 
<tok lemma="en" xpos="SPS00">en</tok>
<tok lemma="aquex" xpos="PD0FS0">aqueix</tok> 
<tok lemma="terra" xpos="NCMS000">món</tok>
[...]
<tok lemma="viure" xpos="VMIP3P0">viuen</tok> 
<tok lemma="en" xpos="SPS00">en</tok>
<tok lemma="aquex" xpos="PD0FP0">aqueixes</tok> 
<tok lemma="casa" xpos="NCFP000">cases</tok>

I need to change the attributes of certain elements whenever certain conditions are met. With the help of @LMC (see: https://stackoverflow.com/questions/73545510/python-and-lxml-extremely-slow-more-efficient-code/73545789 ), I optimized the initial code I had to process the XML files. Here's an exact copy of the code I'm using now.

# coding: utf-8
import os
import lxml.etree as et


ROOT = '/Path-to-input-xml-files'
ext = ('.xml')


def xml_change(root_element):


    for el in root.xpath('//tok[following-sibling::tok[1][starts-with(@xpos, "N")]]'):        
                          
        if el.text == 'aquest' or el.text == 'Aquest' or el.text == 'AQUEST' or el.text == 'aquast' or el.text == 'Aquast' or el.text == 'AQUAST' or el.text == 'aqast' or el.text == 'Aqast' or el.text == 'AQAST' or el.text == 'aqax' or el.text == 'Aqax' or el.text == 'AQAX' or el.text == 'aqest' or el.text == 'Aqest' or el.text == 'AQEST' or el.text == 'aqet' or el.text == 'Aqet' or el.text == 'AQET' or el.text == 'aquet' or el.text == 'Aquet' or el.text == 'AQUET':

            print('Current value is:', el.get('lemma'), el.get('xpos'))
            el.set('xpos', 'DD0MS0')
            el.set('lemma', 'aquest')



        elif el.text == 'aquel' or el.text == 'Aquel' or el.text == 'AQUEL' or el.text == 'aquell' or el.text == 'Aquell' or el.text == 'AQUELL' or el.text == 'aqal' or el.text == 'Aqal' or el.text == 'AQAL' or el.text == 'aqual' or el.text == 'Aqual' or el.text == 'AQUAL' or el.text == 'aqueyl' or el.text == 'Aqueyl' or el.text == 'AQUEYL' or el.text == 'aqueil' or el.text == 'Aqueil' or el.text == 'AQUEIL':

            print('Current value is:', el.get('lemma'), el.get('xpos'))
            el.set('xpos', 'DD0MS0')
            el.set('lemma', 'aquell')
       

        elif el.text == 'aquests' or el.text == 'Aquests' or el.text == 'AQUESTS' or el.text == 'aquets' or el.text == 'Aquets' or el.text == 'AQUETS' or el.text == 'aquetz' or el.text == 'Aquetz' or el.text == 'AQUETZ':

            print('Current value is:', el.get('lemma'), el.get('xpos'))
            el.set('xpos', 'DD0MP0')
            el.set('lemma', 'aquest')

        elif el.text == 'aquells' or el.text == 'Aquells' or el.text == 'AQUELLS' or el.text == 'aqueys' or el.text == 'Aqueys'  or el.text == 'AQUEYS' or el.text == 'aqueyls'  or el.text == 'Aqueyls'  or el.text == 'AQUEYLS':

            print('Current value is:', el.get('lemma'), el.get('xpos'))
            el.set('xpos', 'DD0MP0')
            el.set('lemma', 'aquell')

        elif el.text == 'aquestas' or el.text == 'Aquestas' or el.text == 'AQUESTAS' or el.text == 'aquestes' or el.text == 'Aquestes' or el.text == 'AQUESTES' or el.text == 'aquetes' or el.text == 'Aquetes' or el.text == 'AQUETES' or el.text == 'aquastes' or el.text == 'Aquastes' or el.text == 'AQUASTES' or el.text == 'aquastas' or el.text == 'Aquastas' or el.text == 'AQUASTAS'  or el.text == 'aqastas' or el.text == 'Aqastas' or el.text == 'AQASTAS' or el.text == 'aquexas' or el.text == 'Aquexas' or el.text == 'AQUEXAS':

            print('Current value is:', el.get('lemma'), el.get('xpos'))
            el.set('xpos', 'DD0FP0')
            el.set('lemma', 'aquest')
        
        elif el.text == 'aqualas' or el.text == 'Aqualas' or el.text == 'AQUALAS' or el.text == 'aquelas' or el.text == 'Aquelas' or el.text == 'AQUELAS' or el.text == 'aqueles' or el.text == 'Aqueles' or el.text == 'AQUELES' or el.text == 'aquellas' or el.text == 'Aquellas' or el.text == 'AQUELLAS' or el.text == 'aquelles' or el.text == 'Aquelles' or el.text == 'AQUELLES' or el.text == 'aquales' or el.text == 'Aquales' or el.text == 'AQUALES' or el.text == 'aqueylas' or el.text == 'Aqueylas' or el.text == 'AQUEYLAS' or el.text == 'aqueyles' or el.text == 'Aqueyles' or el.text == 'AQUEYLES':

            print('Current value is:', el.get('lemma'), el.get('xpos'))
            el.set('xpos', 'DD0FP0')
            el.set('lemma', 'aquell')

        elif el.text == 'aquesta' or el.text == 'Aquesta' or el.text == 'AQUESTA' or el.text == 'aquasta' or el.text == 'Aquasta' or el.text == 'AQUASTA' or el.text == 'aquaste' or el.text == 'Aquaste' or el.text == 'AQUASTE' or el.text == 'aqasta' or el.text == 'Aqasta' or el.text == 'AQASTA' or el.text == 'aquetes' or el.text == 'aqaste' or el.text == 'Aqaste' or el.text == 'AQASTE' or el.text == 'aquaxa' or el.text == 'Aquaxa' or el.text == 'AQUAXA' or el.text == 'aqexa' or el.text == 'Aqexa'  or el.text == 'AQEXA' or el.text == 'aquexa' or el.text == 'Aquexa' or el.text == 'AQUEXA':

            print('Current value is:', el.get('lemma'), el.get('xpos'))
            el.set('xpos', 'DD0FS0')
            el.set('lemma', 'aquest')

        elif el.text == 'aquala' or el.text == 'Aquala' or el.text == 'AQUALA' or el.text == 'aquale' or el.text == 'Aquale' or el.text == 'AQUALE'  or el.text == 'aquela' or el.text == 'Aquela' or el.text == 'AQUELA' or el.text == 'aqueyla' or el.text == 'Aqueyla' or el.text == 'AQUEYLA' or el.text == 'aqueila' or el.text == 'Aqueila' or el.text == 'AQUEILA':

            print('Current value is:', el.get('lemma'), el.get('xpos'))
            el.set('xpos', 'DD0FS0')
            el.set('lemma', 'aquell')
# iterate all dirs
for root, dirs, files in os.walk(ROOT):

    # iterate all files
    for file in files:
        if file.endswith(ext):
            # join root dir and file name
            file_path = os.path.join(ROOT, file)

            # load root element from file
            root = et.parse(file_path).getroot()

            # recursively change  elements from xml
            xml_change(root)
    
        

            # init tree object from root
            tree = et.ElementTree(root)

            # save cleaned xml tree object to file. Important to specify encoding
                
            tree.write(file_path.replace('.xml', '-clean.xml'), encoding='utf-8', doctype='<!DOCTYPE document SYSTEM "estcorpus.dtd">', xml_declaration=True)

@LMC's advice was indeed useful, and in a test run involving a few XML documents I noticed that the optimization resulted in a slight increase in speed. I think, however, that there is something fundamentally wrong with what I'm doing, because it has already been 38 hours and the process still has not finished. Granted, there are a lot of conditions to check, and processing these kinds of text documents is supposed to be slow. But 38 hours and counting on a pretty powerful computer (Mac Studio with M1 Max chip)? I have never experienced anything like this.

Here is some more information that could be useful to people with experience on similar projects. I'm processing 395 XML documents with a total size of 585 MB. The largest document is 34 MB and the smallest is 3 KB, but most documents are between 100 KB and 4 MB.

Now, here's the odd thing. The speed of the process does not seem to be related to the length of the processed documents. It is as if the processing is done in bursts. All of a sudden I get a bunch of print statements (from print('Current value is:', el.get('lemma'), el.get('xpos'))) indicating that matches were found, and a bunch of output documents of different sizes are generated.

However, after that many hours can go by without any new print statements or output documents being generated. Here are a couple of screenshots of the directory where the output files are created so that you can see the time gaps between the creation of new files.

I cannot see much of a correlation between the size of the files and the time it takes to process them. At any rate, even if the file is large, 17 hours to process a single file seems like far too much to me. What do you think? Is this what should be expected with these kinds of jobs, or is there something I'm doing wrong? Is there anything I could do to make this faster?

Answer 1

There's something pathological going on here; there's no way it should take this long. Things I would try in order to isolate the cause:

(a) See if there is any network traffic generated.

(b) Take a look at memory consumption to see if there's excessive paging or garbage collection.

(c) Reduce the processing you're doing on each document to something trivial, to see whether the problem is with parsing/saving the documents or with the processing you are doing on each document.
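Step (c) could be sketched roughly as follows: time parsing, processing and writing separately for each file, with a trivial stand-in for the real processing. The names time_stages and count_toks are hypothetical, and the fallback to xml.etree.ElementTree is only there so the sketch still runs where lxml is not installed.

```python
import os
import tempfile
import time

try:
    import lxml.etree as et  # the library used in the question
except ImportError:  # fallback so the sketch still runs without lxml
    import xml.etree.ElementTree as et


def time_stages(in_path, out_path, process=None):
    """Return the seconds spent parsing, processing and writing one file."""
    timings = {}

    start = time.perf_counter()
    tree = et.parse(in_path)
    timings["parse"] = time.perf_counter() - start

    start = time.perf_counter()
    if process is not None:
        process(tree.getroot())
    timings["process"] = time.perf_counter() - start

    start = time.perf_counter()
    tree.write(out_path, encoding="utf-8", xml_declaration=True)
    timings["write"] = time.perf_counter() - start
    return timings


def count_toks(root_element):
    """Trivial stand-in for the real processing: just visit every <tok>."""
    return sum(1 for _ in root_element.iter("tok"))


# Small demonstration on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.xml")
    with open(sample, "w", encoding="utf-8") as handle:
        handle.write('<doc><tok lemma="en" xpos="SPS00">en</tok></doc>')
    print(time_stages(sample, os.path.join(tmp, "sample-out.xml"), count_toks))
```

If "parse" and "write" stay small with the trivial stand-in but "process" dominates once the real xml_change is passed in, the XPath query or the condition chain is the likely culprit rather than I/O.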

Answer 2

There might be a problem with variable naming, since the root variable has two meanings in the code, which could cause a memory problem.
Given the example below:

>>> t = os.walk('/home/lmc/tmp/a')
>>> for root, dirs, files in t:
...     print(root)
...     root= uuid.uuid4()
...     print(root)
... 
/home/lmc/tmp/a
ab5839a8-43b5-4d9d-bbb3-4836c612abaf
/home/lmc/tmp/a/b
7a8ba22e-7a02-45d6-82ce-538e11b70e7d
/home/lmc/tmp/a/b/c
de7c0e08-edc4-43e6-9bc1-9b1d7dd7e9db
/home/lmc/tmp/a/b/c/f
2536e2dc-11d1-4b41-86fd-128c3eeaddbc
/home/lmc/tmp/a/b/c/f/g
7d7e61b0-31d4-4af4-9097-540fc2bbac1c
/home/lmc/tmp/a/b/d
1a671eb2-7efe-4dc4-891b-94d1710ef638
/home/lmc/tmp/a/b/d/e
420d5228-44f1-493d-9dae-e2005c4e0f61

So instead of a directory name, root might be holding an XML element on each iteration of that loop.
Removing whitespace from the parsed tree could also reduce the number of nodes in the tree:

for root, dirs, files in os.walk(ROOT):

    # iterate all files
    for file in files:
        if file.endswith(ext):
            # join the directory currently being walked and the file name
            file_path = os.path.join(root, file)

            # load root element from file, dropping whitespace-only text nodes
            parser = et.XMLParser(remove_blank_text=True)
            root_ele = et.parse(file_path, parser).getroot()

            # change elements in the parsed tree
            xml_change(root_ele)
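To keep the directory name and the XML root element from ever sharing a variable, the file discovery can be moved into its own helper. The sketch below is only an illustration: iter_xml_files and output_path are hypothetical names, and skipping files that already end in "-clean.xml" is an assumption about what is wanted on a re-run, since the output is written next to the input.

```python
import os


def iter_xml_files(top, suffix=".xml", skip_suffix="-clean.xml"):
    """Yield the full path of every input XML file below `top`."""
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in sorted(filenames):
            if name.endswith(suffix) and not name.endswith(skip_suffix):
                # Join with dirpath (the directory currently being walked),
                # so files in subdirectories resolve correctly.
                yield os.path.join(dirpath, name)


def output_path(in_path, suffix=".xml", out_suffix="-clean.xml"):
    """Map 'x.xml' to 'x-clean.xml' by replacing only the final suffix."""
    return in_path[: -len(suffix)] + out_suffix
```

The main loop then reduces to iterating over iter_xml_files(ROOT), parsing each path into a separately named element, and writing to output_path(path).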

Finally, as suggested, changing the XPath search strategy also makes a difference:

for el in root_element.xpath('//tok[starts-with(@xpos, "N")]/preceding-sibling::tok[1]'):
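Another strategy worth trying, offered here as an untested-on-the-real-corpus sketch rather than a confirmed fix, is to avoid sibling-axis XPath altogether: walk each parent's tok children once, pair every tok with the next one, and replace the long or-chains with a dictionary lookup. Only a handful of forms from the question are listed; the names BASE_FORMS and LOOKUP are hypothetical, the three casings per form mirror the original conditions, and the fallback import is only so the sketch runs without lxml.

```python
try:
    import lxml.etree as et
except ImportError:  # fallback so the sketch still runs without lxml
    import xml.etree.ElementTree as et

# Illustrative subset only: base form -> (lemma, xpos), taken from the question.
BASE_FORMS = {
    "aquest": ("aquest", "DD0MS0"),
    "aquell": ("aquell", "DD0MS0"),
    "aquests": ("aquest", "DD0MP0"),
    "aqueys": ("aquell", "DD0MP0"),
    "aquestes": ("aquest", "DD0FP0"),
    "aquelles": ("aquell", "DD0FP0"),
    "aquesta": ("aquest", "DD0FS0"),
    "aquela": ("aquell", "DD0FS0"),
}

# Expand each base form to the three casings the original conditions test.
LOOKUP = {}
for base, values in BASE_FORMS.items():
    for variant in (base, base.capitalize(), base.upper()):
        LOOKUP[variant] = values


def xml_change(root_element):
    """Retag listed forms whose next <tok> sibling has an xpos starting with 'N'.

    Returns the number of elements changed.
    """
    changed = 0
    for parent in root_element.iter():
        # Only <tok> children count, matching following-sibling::tok[1].
        toks = [child for child in parent if child.tag == "tok"]
        for current, following in zip(toks, toks[1:]):
            if not following.get("xpos", "").startswith("N"):
                continue
            values = LOOKUP.get(current.text)
            if values is not None:
                lemma, xpos = values
                current.set("lemma", lemma)
                current.set("xpos", xpos)
                changed += 1
    return changed
```

Each tok is visited a constant number of times, so the work grows linearly with document size, and a dictionary lookup replaces up to a couple of hundred string comparisons per element.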

Answer 1: 2022-09-01 14:37:23

Answer 2: 2022-09-03 18:05:09
