
More efficient way to search large XML file in Python

I have two XML files (A and B) with around 90k records each.

The format of the files is as follows:

<trips>
    <trip id="" speed=""/>
              .
              .
              .
              .
</trips>

I need to compare the speed attributes of trips that have the same id attribute in both files. But a given id is not at the same position in both files. For example, the following won't work:

from xml.dom import minidom

A = minidom.parse('A.xml')
B = minidom.parse('B.xml')

triplistA = A.getElementsByTagName('trip')
triplistB = B.getElementsByTagName('trip')

for i in range(len(triplistA)):  # A and B have the same number of trip tags
    tripA = triplistA[i]
    tripB = triplistB[i]  # not necessarily the trip with the same id as tripA

    # get the speed from tripA and tripB and compare, then do something

That means I have to search through file B to find the same id, and only then can I compare the speeds. In the worst case this takes O(n^2) time, which is very long for 90k records.

I thought that after matching a pair of trips, I could remove the record from file B so that searching B takes less time in the next iteration. I tried removing the node using minidom, but it somehow took longer. Therefore I am using ElementTree to do the node removal.

Then I have:

from xml.dom import minidom
import xml.etree.ElementTree as ET

A = minidom.parse('A.xml')
triplist = A.getElementsByTagName('trip')
B = ET.parse('B.xml')
rootB = B.getroot()

for tripA in triplist:
    for tripB in rootB.findall('trip'):
        if tripB.get('id') == tripA.attributes['id'].value:
            # take speed from both nodes and do something
            rootB.remove(tripB)
            break

The process got faster and faster as time passed because of the shrinking number of nodes in file B, but it still took half an hour to finish.

My project requires me to do this comparison many times, and after comparing the speeds there is a further process that also takes half an hour (a simulation; this part of the time cost is unavoidable). So I would like to know if there is a more efficient way to search a large XML file.

Thank you all in advance.

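Put the trips from one file into a dict keyed by id, then look up each trip from the other file in that dict. Each dict lookup takes constant time on average, so the whole comparison is linear in the number of trips instead of quadratic:

trips_a = {}
for trip in A.getElementsByTagName('trip'):
    # key: trip id, value: speed
    trips_a[trip.attributes['id'].value] = trip.attributes['speed'].value
for trip in B.getElementsByTagName('trip'):
    trip_value_from_B = trip.attributes['speed'].value
    trip_value_from_A = trips_a[trip.attributes['id'].value]
    # Do something with trip_value_from_A and trip_value_from_B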
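
For reference, here is a self-contained sketch of the same dict-based idea using only xml.etree.ElementTree. The helper names load_speeds and compare_speeds are made up for illustration, and it assumes every trip element carries both an id and a speed attribute, as in the format shown in the question. Ids that appear in only one file are skipped.

```python
import xml.etree.ElementTree as ET


def load_speeds(xml_source):
    """Return a dict mapping each trip's id attribute to its speed attribute.

    xml_source may be a file name or a file object (hypothetical helper).
    """
    root = ET.parse(xml_source).getroot()
    return {trip.get("id"): trip.get("speed") for trip in root.iter("trip")}


def compare_speeds(source_a, source_b):
    """Yield (id, speed_a, speed_b) for every id present in both files."""
    speeds_a = load_speeds(source_a)
    speeds_b = load_speeds(source_b)
    for trip_id, speed_a in speeds_a.items():
        speed_b = speeds_b.get(trip_id)  # average O(1) lookup, no inner loop
        if speed_b is not None:
            yield trip_id, speed_a, speed_b
```

It could be used as `for trip_id, speed_a, speed_b in compare_speeds('A.xml', 'B.xml'): ...`. The attribute values come back as strings, so convert them with float() before comparing them numerically.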
