
Sorting XML document with Python and ElementTree

I'm trying to reorganize some xml files that contain several segments of a complete route structured as:

<trk>
    <name>GPSRoute.XML</name>
    <trkseg>
        <trkpt lat="37.077882" lon="-112.242785">
            <ele>1688.00</ele>
            <time>2020-04-18T01:56:39.80Z</time>
        </trkpt>
        <extensions>
            <name>14</name>
            <gte:color>#00ce00</gte:color>
        </extensions>
    </trkseg>
    <trkseg>
        <trkpt lat="37.077888" lon="-112.242783">
            <ele>1688.00</ele>
            <time>2020-04-18T01:56:39.80Z</time>
        </trkpt>
        <extensions>
            <name>1</name>
            <gte:color>#00ce00</gte:color>
        </extensions>
    </trkseg>
</trk>

I'm trying to sort the file by name instead of by time, as it currently is, and write the result to a new file. This is how far I've gotten: it successfully captures the names in a list, but it errors on data.sort() with:

"TypeError: '<' not supported between instances of 'xml.etree.ElementTree.Element' and 'xml.etree.ElementTree.Element'"

If anyone could point me in the right direction it would be much appreciated!

import xml.etree.ElementTree as ET

tree = ET.parse('Filename.xml')

root = tree.getroot()
data = []
for track in root:
    for segment in track:
        for extension in segment:
            for name in extension.findall('name'):
                print(name.text)
                data.append((name))
            data.sort()


tree.write('Sorted.xml')

XPath has no sort function before version 3.1 (fn:sort), and lxml only implements XPath 1.0, so there is no real way to sort xml with an XPath expression alone. It's possible to kludge your way around that, though.

Note that the xml in your question isn't well-formed as posted (the gte namespace prefix is never declared), so I used the more forgiving html parser. With your actual file you should use an xml parser, as indicated in the code comments below.

What this code does is collect the text of each <name> node under <extensions> (i.e., your target number) from every <trkseg>, save those values to a list, sort the list, use the sorted list to select the <trkseg> nodes again in that order, and join them (together with the opening and closing tags) into a new xml string.

import lxml.html as lh # with actual xml you would probably use "from lxml import etree"
trk = """your xml above"""

doc = lh.fromstring(trk) # with actual xml you should probably use "doc = etree.XML(trk)"

names = []
new_trk = """<trk>
    <name>GPSRoute.XML</name>""" # this is the preamble which is left untouched
for nam in doc.xpath('//extensions//name'):
    names.append(nam.text) #grab the numbers
for name in sorted(set(names), key=int): # unique numbers, sorted numerically (a plain sort would compare them as strings)
    target = doc.xpath(f'//trkseg[.//name/text()={name}]')
    for t in target:
        new_trk += lh.tostring(t).decode()
new_trk += '</trk>' # append the closing tag, which is also left untouched
print(new_trk)

Output:

<trk>
    <name>GPSRoute.XML</name><trkseg>
        <trkpt lat="37.077888" lon="-112.242783">
            <ele>1688.00</ele>
            <time>2020-04-18T01:56:39.80Z</time>
        </trkpt>
        <extensions>
            <name>1</name>
            <color>#00ce00</color>
        </extensions>
    </trkseg>
<trkseg>
        <trkpt lat="37.077882" lon="-112.242785">
            <ele>1688.00</ele>
            <time>2020-04-18T01:56:39.80Z</time>
        </trkpt>
        <extensions>
            <name>14</name>
            <color>#00ce00</color>
        </extensions>
    </trkseg>
    </trk>
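
The name values come out of the XML as text, so the sort has to compare them as integers; a plain string sort would put 14 before 2. A small illustration, using hypothetical values (the sample in the question only contains 14 and 1):

```python
# Hypothetical name values; the question's sample only has "14" and "1"
names = ["14", "1", "2"]

as_text = sorted(names)              # compared character by character
as_numbers = sorted(names, key=int)  # compared by numeric value

print(as_text)     # ['1', '14', '2']
print(as_numbers)  # ['1', '2', '14']
```

With only the two segments from the question both orders happen to agree, which is why the problem only shows up on longer routes.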

An Element object can be treated as an iterable with the child elements as members. This makes it easy to sort the children of the root element. In this case we need to make an exception for the first child (<name>GPSRoute.XML</name>), which is not involved in the sorting.

There is an undeclared namespace prefix in the XML document, so to make it work I changed gte:color to color.
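
To see the list-like behaviour in isolation, here is a minimal sketch on a made-up three-child document (the r and item names are purely illustrative). It also reproduces the TypeError from the question: Element objects define no ordering, so a sort needs a key function.

```python
import xml.etree.ElementTree as ET

# Hypothetical three-child document, just to show the list-like behaviour
root = ET.fromstring("<r><item>3</item><item>1</item><item>2</item></r>")

# Iterating over (or indexing) an Element yields its direct children
print([child.text for child in root])   # ['3', '1', '2']
print(len(root), root[0].text)          # 3 3

# Elements define no ordering, so sorting them without a key fails
try:
    sorted(root)
except TypeError as exc:
    print("TypeError:", exc)

# With a key function the sort works, and slice assignment
# replaces the children in place with the sorted sequence
root[:] = sorted(root, key=lambda child: int(child.text))
sorted_texts = [child.text for child in root]
print(sorted_texts)                     # ['1', '2', '3']
```

The same slice assignment, with extensions/name as the key, is what the code for the actual file relies on.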

import xml.etree.ElementTree as ET

tree = ET.parse('Filename.xml')
root = tree.getroot()

# Temporarily remove the 'name' element
name = root.find("name")
root.remove(name)

# Sort the 'trkseg' elements using 'extensions/name' as key
root[:] = sorted(root, key=lambda trkseg: int(trkseg.findtext("extensions/name")))

# Put the 'name' element back
root.insert(0, name)

print(ET.tostring(root).decode())

Result:

<trk>
  <name>GPSRoute.XML</name>
  <trkseg>
    <trkpt lat="37.077888" lon="-112.242783">
      <ele>1688.00</ele>
      <time>2020-04-18T01:56:39.80Z</time>
    </trkpt>
    <extensions>
      <name>1</name>
      <color>#00ce00</color>
    </extensions>
  </trkseg>
<trkseg>
    <trkpt lat="37.077882" lon="-112.242785">
      <ele>1688.00</ele>
      <time>2020-04-18T01:56:39.80Z</time>
    </trkpt>
    <extensions>
      <name>14</name>
      <color>#00ce00</color>
    </extensions>
  </trkseg>
  </trk>
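
If the real file does declare the gte prefix, the same idea should work without renaming anything, and the result can be written to a new file as the question asks. The following is only a sketch under assumptions: the namespace URI is a placeholder (the real one is whatever the file's xmlns:gte declaration says), the XML is inlined so the example is self-contained, and the elements are assumed not to be in a default namespace, as in the posted sample.

```python
import io
import xml.etree.ElementTree as ET

# Assumed namespace URI; the real one comes from the xmlns:gte declaration in the file
GTE_NS = "http://www.example.com/gte"
ET.register_namespace("gte", GTE_NS)  # keep the 'gte' prefix when serializing

xml_text = f"""<trk xmlns:gte="{GTE_NS}">
  <name>GPSRoute.XML</name>
  <trkseg><extensions><name>14</name><gte:color>#00ce00</gte:color></extensions></trkseg>
  <trkseg><extensions><name>1</name><gte:color>#00ce00</gte:color></extensions></trkseg>
</trk>"""

tree = ET.parse(io.StringIO(xml_text))  # with a real file: ET.parse('Filename.xml')
root = tree.getroot()

# Sort only the trkseg children; everything else (here the route name) stays in front
segments = root.findall("trkseg")
others = [child for child in root if child.tag != "trkseg"]
segments.sort(key=lambda seg: int(seg.findtext("extensions/name")))
root[:] = others + segments

order = [seg.findtext("extensions/name") for seg in root.findall("trkseg")]
print(order)  # ['1', '14']

# Serialize the sorted tree; with a real file: tree.write('Sorted.xml', ...)
out = io.BytesIO()
tree.write(out, encoding="utf-8", xml_declaration=True)
```

Splitting the children into trkseg and non-trkseg lists avoids removing and re-inserting the name element by hand. Because reordering also moves the whitespace between elements, the indentation of the output can look uneven; on Python 3.9 and later, calling ET.indent(tree) before writing tidies it up.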
