
Using lxml to get xml data

This should be an easy question for the Python experts.

I have this XML that I'm trying to parse (it comes from a URL):

<calculateRouteResponse xmlns="http://api.tomtom.com/routing" formatVersion="0.0.12">
<copyright>...</copyright>
<privacy>...</privacy>
<route>
<summary>
<lengthInMeters>5144</lengthInMeters>
<travelTimeInSeconds>687</travelTimeInSeconds>
<trafficDelayInSeconds>0</trafficDelayInSeconds>
<departureTime>2018-01-16T11:16:06+11:00</departureTime>
<arrivalTime>2018-01-16T11:27:33+11:00</arrivalTime>
<noTrafficTravelTimeInSeconds>478</noTrafficTravelTimeInSeconds>
<historicTrafficTravelTimeInSeconds>687</historicTrafficTravelTimeInSeconds>
<liveTrafficIncidentsTravelTimeInSeconds>687</liveTrafficIncidentsTravelTimeInSeconds>
</summary>
<leg>
<summary>
<lengthInMeters>806</lengthInMeters>
<travelTimeInSeconds>68</travelTimeInSeconds>
<trafficDelayInSeconds>0</trafficDelayInSeconds>
<departureTime>2018-01-16T11:16:06+11:00</departureTime>
<arrivalTime>2018-01-16T11:17:14+11:00</arrivalTime>
<noTrafficTravelTimeInSeconds>59</noTrafficTravelTimeInSeconds>
<historicTrafficTravelTimeInSeconds>68</historicTrafficTravelTimeInSeconds>
<liveTrafficIncidentsTravelTimeInSeconds>68</liveTrafficIncidentsTravelTimeInSeconds>
</summary>
<points>...</points>
</leg>
<leg>
<summary>
<lengthInMeters>958</lengthInMeters>
<travelTimeInSeconds>114</travelTimeInSeconds>
<trafficDelayInSeconds>0</trafficDelayInSeconds>
<departureTime>2018-01-16T11:17:14+11:00</departureTime>
<arrivalTime>2018-01-16T11:19:08+11:00</arrivalTime>
<noTrafficTravelTimeInSeconds>77</noTrafficTravelTimeInSeconds>
<historicTrafficTravelTimeInSeconds>114</historicTrafficTravelTimeInSeconds>
<liveTrafficIncidentsTravelTimeInSeconds>114</liveTrafficIncidentsTravelTimeInSeconds>
</summary>
<points>...</points>
</leg>
<leg>
<summary>
<lengthInMeters>1798</lengthInMeters>
<travelTimeInSeconds>224</travelTimeInSeconds>
<trafficDelayInSeconds>0</trafficDelayInSeconds>
<departureTime>2018-01-16T11:19:08+11:00</departureTime>
<arrivalTime>2018-01-16T11:22:53+11:00</arrivalTime>
<noTrafficTravelTimeInSeconds>181</noTrafficTravelTimeInSeconds>
<historicTrafficTravelTimeInSeconds>224</historicTrafficTravelTimeInSeconds>
<liveTrafficIncidentsTravelTimeInSeconds>224</liveTrafficIncidentsTravelTimeInSeconds>
</summary>
<points>...</points>
</leg>
<leg>
<summary>
<lengthInMeters>1582</lengthInMeters>
<travelTimeInSeconds>280</travelTimeInSeconds>
<trafficDelayInSeconds>0</trafficDelayInSeconds>
<departureTime>2018-01-16T11:22:53+11:00</departureTime>
<arrivalTime>2018-01-16T11:27:33+11:00</arrivalTime>
<noTrafficTravelTimeInSeconds>160</noTrafficTravelTimeInSeconds>
<historicTrafficTravelTimeInSeconds>280</historicTrafficTravelTimeInSeconds>
<liveTrafficIncidentsTravelTimeInSeconds>280</liveTrafficIncidentsTravelTimeInSeconds>
</summary>
<points>...</points>
</leg>
<sections>
<section>
<startPointIndex>0</startPointIndex>
<endPointIndex>139</endPointIndex>
<sectionType>TRAVEL_MODE</sectionType>
<travelMode>car</travelMode>
</section>
</sections>
</route>
</calculateRouteResponse>

And I have this script that I'm trying to use to extract specific information from it.

from lxml import etree
import urllib.request

def parseXML(xmlFile):
    """
    Parse the xml
    """
    with urllib.request.urlopen("https://api.tomtom.com/routing/1/calculateRoute/-37.79205923474775,145.03010268799338:-37.798883995180496,145.03040309540322:-37.807106781970354,145.02895470253526:-37.80320743019992,145.01021142594075:-37.79990,144.99318476311566:?routeType=shortest&key=xxxx&computeTravelTimeFor=all") as fobj:
        xml = fobj.read()
#Look at Parent and Child XML organisation as this is where the data is going wrong at the moment
    root = etree.fromstring(xml)

    for appt in root.getchildren():
        for elem in appt.getchildren():
            if not elem.text:
                text = "None"
            else:
                text = elem[0][0].text

            ##This is doing something with the xml based on it's tag and value.
            if elem.tag == 'travelTimeInSeconds' and int(text) > 700:
                print('******** Do something with ', elem.tag, ' : ', text)
            print(elem.tag + " => " + text)

if __name__ == "__main__":
    parseXML("example.xml")

The output I am getting comes only from the summary and leg tags.

For example, the desired output is this:

<summary>
<lengthInMeters>5144</lengthInMeters>
<travelTimeInSeconds>687</travelTimeInSeconds>
<trafficDelayInSeconds>0</trafficDelayInSeconds>
<departureTime>2018-01-16T11:16:06+11:00</departureTime>
<arrivalTime>2018-01-16T11:27:33+11:00</arrivalTime>
<noTrafficTravelTimeInSeconds>478</noTrafficTravelTimeInSeconds>
<historicTrafficTravelTimeInSeconds>687</historicTrafficTravelTimeInSeconds>
<liveTrafficIncidentsTravelTimeInSeconds>687</liveTrafficIncidentsTravelTimeInSeconds>
</summary>

And each leg, if possible, like so:

<leg>
<summary>
<lengthInMeters>958</lengthInMeters>
<travelTimeInSeconds>114</travelTimeInSeconds>
<trafficDelayInSeconds>0</trafficDelayInSeconds>
<departureTime>2018-01-16T11:17:14+11:00</departureTime>
<arrivalTime>2018-01-16T11:19:08+11:00</arrivalTime>
<noTrafficTravelTimeInSeconds>77</noTrafficTravelTimeInSeconds>
<historicTrafficTravelTimeInSeconds>114</historicTrafficTravelTimeInSeconds>
<liveTrafficIncidentsTravelTimeInSeconds>114</liveTrafficIncidentsTravelTimeInSeconds>
</summary>
</leg>

What I want is the data between the XML tags, for example 1582 for lengthInMeters.

How do I change this script to read lengthInMeters, travelTimeInSeconds and the other specific children? I especially want what is inside the summary tags, since that is the most valuable information. Thanks!

Appreciate your time!
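
One likely reason the comparison `elem.tag == 'travelTimeInSeconds'` never matches: the response declares a default namespace (`xmlns="http://api.tomtom.com/routing"`), so every tag is stored with that namespace in front of it. The sketch below shows this on a trimmed copy of the XML above. The `local_name` helper and the `SAMPLE` string are made up for illustration, and the fallback import is only there so the snippet also runs where lxml is not installed.

```python
try:
    from lxml import etree  # the library used in this thread
except ImportError:  # assumption: fall back to the stdlib parser if lxml is missing
    import xml.etree.ElementTree as etree

# Trimmed copy of the response above: same namespace, a single value
SAMPLE = (
    '<calculateRouteResponse xmlns="http://api.tomtom.com/routing">'
    "<route><summary>"
    "<travelTimeInSeconds>687</travelTimeInSeconds>"
    "</summary></route>"
    "</calculateRouteResponse>"
)


def local_name(tag):
    """Drop a leading '{namespace}' from a tag, if there is one (hypothetical helper)."""
    return tag.rpartition("}")[2]


root = etree.fromstring(SAMPLE)
elem = root[0][0][0]  # route -> summary -> travelTimeInSeconds

print(elem.tag)                                        # namespace-qualified tag
print(elem.tag == "travelTimeInSeconds")               # plain comparison
print(local_name(elem.tag) == "travelTimeInSeconds")   # comparison on the local name
```

Comparing against the local name, or against the fully qualified tag, avoids the mismatch.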

This was the solution based on the answer I got and my own interpretation.

Now on to formatting the data and then learning how to pickle it!

from lxml import etree
import urllib.request

def handleLeg(leg):
    # print this leg as text, or save it to file maybe...
    # encoding="unicode" makes tostring return str instead of bytes,
    # so the pretty-printed XML is shown with real line breaks
    text = etree.tostring(leg, pretty_print=True, encoding="unicode")
    print(text)
    # also process individual elements of interest here if we want
    tagsOfInterest = ["noTrafficTravelTimeInSeconds", "lengthInMeters", "departureTime", "trafficDelayInSeconds"]  # whatever
    for child in leg:
        # tags are namespace-qualified, hence the substring checks
        if 'summary' in child.tag:
            for elem in child:
                for item in tagsOfInterest:
                    if item in elem.tag:
                        print(item + " : " + elem.text)

def parseXML(url):
    """
    Fetch the XML from the given URL and parse it
    """
    with urllib.request.urlopen(url) as fobj:
        xml = fobj.read()
    root = etree.fromstring(xml)

    # walk route -> leg and hand each leg to handleLeg
    for child in root:
        if 'route' in child.tag:
            for elem in child:
                if 'leg' in elem.tag:
                    handleLeg(elem)

if __name__ == "__main__":
    parseXML("https://api.tomtom.com/routing/1/calculateRoute/-37.79205923474775,145.03010268799338:-37.798883995180496,145.03040309540322:-37.807106781970354,145.02895470253526:-37.80320743019992,145.01021142594075:-37.79990,144.99318476311566:?routeType=shortest&key=xxxxx&computeTravelTimeFor=all")




'''
import pickle
favorite_color = { "lion": "yellow", "kitty": "red" }

pickle.dump( favorite_color, open( "save.p", "wb" ) )
'''

My understanding is that parseXML fetches the data from the website and turns it into an etree, which is then searched for 'route' and then for 'leg' elements; each leg is passed to handleLeg. The tags of interest are used to pick out the text to print in the interpreter.

I am still trying to make sure the route-level summary tag is included as well.
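
One possible way to reach the route-level summary as well as each leg's summary is to use `find`/`findall`/`findtext` with a namespace prefix map instead of looping and checking tags. This is only a sketch on a trimmed copy of the XML from the question; the prefix `r` and the `SAMPLE` string are arbitrary choices for the example, not anything the TomTom API prescribes.

```python
try:
    from lxml import etree
except ImportError:  # assumption: fall back to the stdlib parser if lxml is missing
    import xml.etree.ElementTree as etree

# Map an arbitrary prefix to the namespace declared on the root element
NS = {"r": "http://api.tomtom.com/routing"}

# Trimmed copy of the response: route summary plus the first two legs
SAMPLE = (
    '<calculateRouteResponse xmlns="http://api.tomtom.com/routing">'
    "<route>"
    "<summary><lengthInMeters>5144</lengthInMeters></summary>"
    "<leg><summary><lengthInMeters>806</lengthInMeters></summary></leg>"
    "<leg><summary><lengthInMeters>958</lengthInMeters></summary></leg>"
    "</route>"
    "</calculateRouteResponse>"
)

root = etree.fromstring(SAMPLE)

# The route-level summary is a direct child of <route>
route_length = root.findtext("r:route/r:summary/r:lengthInMeters", namespaces=NS)

# Each <leg> carries its own <summary>
leg_lengths = [
    leg.findtext("r:summary/r:lengthInMeters", namespaces=NS)
    for leg in root.findall("r:route/r:leg", namespaces=NS)
]

print("route:", route_length)
print("legs:", leg_lengths)
```

Because the paths name the parent explicitly, the route summary and the leg summaries do not get mixed up.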

The next stage is to put this information into a class / object / dictionary and collate it for future use.
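
As a rough idea of what that next stage could look like, the sketch below collects the summaries into plain dictionaries and round-trips them through pickle. The names `FIELDS`, `summary_to_dict` and `collect` are hypothetical, the sample is a trimmed copy of the XML from the question, and converting all-digit text to `int` is an assumption about how the values will be used.

```python
import pickle

try:
    from lxml import etree
except ImportError:  # assumption: fall back to the stdlib parser if lxml is missing
    import xml.etree.ElementTree as etree

NS = {"r": "http://api.tomtom.com/routing"}
FIELDS = ("lengthInMeters", "travelTimeInSeconds", "departureTime")

# Trimmed copy of the response: route summary plus the first leg
SAMPLE = (
    '<calculateRouteResponse xmlns="http://api.tomtom.com/routing"><route>'
    "<summary><lengthInMeters>5144</lengthInMeters>"
    "<travelTimeInSeconds>687</travelTimeInSeconds>"
    "<departureTime>2018-01-16T11:16:06+11:00</departureTime></summary>"
    "<leg><summary><lengthInMeters>806</lengthInMeters>"
    "<travelTimeInSeconds>68</travelTimeInSeconds>"
    "<departureTime>2018-01-16T11:16:06+11:00</departureTime></summary></leg>"
    "</route></calculateRouteResponse>"
)


def summary_to_dict(summary):
    """Map each field of interest to its text; all-digit text becomes int."""
    record = {}
    for name in FIELDS:
        text = summary.findtext("r:" + name, namespaces=NS)
        if text is None:
            continue  # field missing from this summary
        record[name] = int(text) if text.isdigit() else text
    return record


def collect(xml_text):
    """Return {'route': {...}, 'legs': [{...}, ...]} for one response."""
    root = etree.fromstring(xml_text)
    route = root.find("r:route", namespaces=NS)
    return {
        "route": summary_to_dict(route.find("r:summary", namespaces=NS)),
        "legs": [
            summary_to_dict(s)
            for s in route.findall("r:leg/r:summary", namespaces=NS)
        ],
    }


data = collect(SAMPLE)
restored = pickle.loads(pickle.dumps(data))  # same round trip as dump/load to a file
print(restored)
```

Plain dictionaries pickle without any extra work; a class would work too, as long as it is importable when the data is loaded again.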

I don't have access to the TomTom API, so I can't run all your code as posted, but I did have a look at the XML string you posted.

Below is some code I used to extract the individual "leg" elements and process them. I've just printed them as text (could save them to file instead), and also extracted specific children and printed them.

It's not clear from your question exactly what you wanted to do with the data, but maybe this gives you a starting point to work from.

from lxml import etree

xml = '<calculateRouteResponse xmlns="http://api.tomtom.com/routing" formatVersion="0.0.12">\
<copyright>...</copyright>\
<privacy>...</privacy>\
<route>\
<summary>\
<lengthInMeters>5144</lengthInMeters>\
<travelTimeInSeconds>687</travelTimeInSeconds>\
<trafficDelayInSeconds>0</trafficDelayInSeconds>\
<departureTime>2018-01-16T11:16:06+11:00</departureTime>\
<arrivalTime>2018-01-16T11:27:33+11:00</arrivalTime>\
<noTrafficTravelTimeInSeconds>478</noTrafficTravelTimeInSeconds>\
<historicTrafficTravelTimeInSeconds>687</historicTrafficTravelTimeInSeconds>\
<liveTrafficIncidentsTravelTimeInSeconds>687</liveTrafficIncidentsTravelTimeInSeconds>\
</summary>\
<leg>\
<summary>\
<lengthInMeters>806</lengthInMeters>\
<travelTimeInSeconds>68</travelTimeInSeconds>\
<trafficDelayInSeconds>0</trafficDelayInSeconds>\
<departureTime>2018-01-16T11:16:06+11:00</departureTime>\
<arrivalTime>2018-01-16T11:17:14+11:00</arrivalTime>\
<noTrafficTravelTimeInSeconds>59</noTrafficTravelTimeInSeconds>\
<historicTrafficTravelTimeInSeconds>68</historicTrafficTravelTimeInSeconds>\
<liveTrafficIncidentsTravelTimeInSeconds>68</liveTrafficIncidentsTravelTimeInSeconds>\
</summary>\
<points>...</points>\
</leg>\
<leg>\
<summary>\
<lengthInMeters>958</lengthInMeters>\
<travelTimeInSeconds>114</travelTimeInSeconds>\
<trafficDelayInSeconds>0</trafficDelayInSeconds>\
<departureTime>2018-01-16T11:17:14+11:00</departureTime>\
<arrivalTime>2018-01-16T11:19:08+11:00</arrivalTime>\
<noTrafficTravelTimeInSeconds>77</noTrafficTravelTimeInSeconds>\
<historicTrafficTravelTimeInSeconds>114</historicTrafficTravelTimeInSeconds>\
<liveTrafficIncidentsTravelTimeInSeconds>114</liveTrafficIncidentsTravelTimeInSeconds>\
</summary>\
<points>...</points>\
</leg>\
<leg>\
<summary>\
<lengthInMeters>1798</lengthInMeters>\
<travelTimeInSeconds>224</travelTimeInSeconds>\
<trafficDelayInSeconds>0</trafficDelayInSeconds>\
<departureTime>2018-01-16T11:19:08+11:00</departureTime>\
<arrivalTime>2018-01-16T11:22:53+11:00</arrivalTime>\
<noTrafficTravelTimeInSeconds>181</noTrafficTravelTimeInSeconds>\
<historicTrafficTravelTimeInSeconds>224</historicTrafficTravelTimeInSeconds>\
<liveTrafficIncidentsTravelTimeInSeconds>224</liveTrafficIncidentsTravelTimeInSeconds>\
</summary>\
<points>...</points>\
</leg>\
<leg>\
<summary>\
<lengthInMeters>1582</lengthInMeters>\
<travelTimeInSeconds>280</travelTimeInSeconds>\
<trafficDelayInSeconds>0</trafficDelayInSeconds>\
<departureTime>2018-01-16T11:22:53+11:00</departureTime>\
<arrivalTime>2018-01-16T11:27:33+11:00</arrivalTime>\
<noTrafficTravelTimeInSeconds>160</noTrafficTravelTimeInSeconds>\
<historicTrafficTravelTimeInSeconds>280</historicTrafficTravelTimeInSeconds>\
<liveTrafficIncidentsTravelTimeInSeconds>280</liveTrafficIncidentsTravelTimeInSeconds>\
</summary>\
<points>...</points>\
</leg>\
<sections>\
<section>\
<startPointIndex>0</startPointIndex>\
<endPointIndex>139</endPointIndex>\
<sectionType>TRAVEL_MODE</sectionType>\
<travelMode>car</travelMode>\
</section>\
</sections>\
</route>\
</calculateRouteResponse>'


def handleLeg(leg):
    """
    Handle a single leg element pulled from the main xml block
    """
    # now that we have a leg element, we can handle it as we want.
    # first, print this leg as text, so that we can see what it contains
    # NB we could also just append this text block to a file of "legs"
    # encoding="unicode" makes tostring return str instead of bytes
    text = etree.tostring(leg, pretty_print=True, encoding="unicode")
    print(text)
    # we can see that there are individual elements of interest,
    # held within the "summary" child element
    # for each element of interest, extract the data and print it
    tagsOfInterest = ["noTrafficTravelTimeInSeconds", "lengthInMeters", "departureTime"]  # whatever
    for child in leg:
        if 'summary' in child.tag:
            # we've found the "summary" child
            # so inspect each of its child element tags
            # to see if it is of interest
            for elem in child:
                for item in tagsOfInterest:
                    if item in elem.tag:
                        # it's of interest...
                        # print it here
                        print(item + " : " + elem.text)

def parseXML(xml):
    """
    Parse the xml
    """
    root = etree.fromstring(xml)
    # look for the main "route" element, there should only be one...
    # do this by checking if the text "route" appears in the element tag
    for child in root:
        if 'route' in child.tag:
            # OK we found a/the route element. Now iterate over its "leg"
            # elements and handle each one
            for elem in child:
                if 'leg' in elem.tag:
                    # this is a "leg" element so handle it 
                    handleLeg(elem)    

if __name__ == "__main__":
    parseXML(xml)
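
The substring checks above (`'leg' in elem.tag`) work for this response, but they would also match any other tag that happens to contain the same text. A stricter variant, sketched here under the assumption that the namespace stays `http://api.tomtom.com/routing`, builds the fully qualified tag and matches it exactly. It also covers the threshold check from the question. The names `qualified` and `slow_legs` are made up for the example, and the sample keeps only the travel times from the posted XML.

```python
try:
    from lxml import etree
except ImportError:  # assumption: fall back to the stdlib parser if lxml is missing
    import xml.etree.ElementTree as etree

NS_URI = "http://api.tomtom.com/routing"


def qualified(name):
    """Return the tag in '{namespace}name' form, as the parser stores it."""
    return "{%s}%s" % (NS_URI, name)


def slow_legs(xml_text, threshold):
    """Return the travel times (seconds) of legs slower than threshold."""
    root = etree.fromstring(xml_text)
    slow = []
    # iter() with a qualified tag visits only <leg> elements,
    # so the route-level summary is never looked at here
    for leg in root.iter(qualified("leg")):
        summary = leg.find(qualified("summary"))
        seconds = int(summary.findtext(qualified("travelTimeInSeconds")))
        if seconds > threshold:
            slow.append(seconds)
    return slow


# Trimmed copy of the response: travel times only
SAMPLE = (
    '<calculateRouteResponse xmlns="http://api.tomtom.com/routing"><route>'
    "<summary><travelTimeInSeconds>687</travelTimeInSeconds></summary>"
    "<leg><summary><travelTimeInSeconds>68</travelTimeInSeconds></summary></leg>"
    "<leg><summary><travelTimeInSeconds>114</travelTimeInSeconds></summary></leg>"
    "<leg><summary><travelTimeInSeconds>224</travelTimeInSeconds></summary></leg>"
    "<leg><summary><travelTimeInSeconds>280</travelTimeInSeconds></summary></leg>"
    "</route></calculateRouteResponse>"
)

print(slow_legs(SAMPLE, 200))
```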
