
Parsing XML with etree

I am trying to parse an XML response from Amazon's Product Advertising API. This is the XML:

<?xml version="1.0" ?>
    <ItemLookupResponse xmlns="http://webservices.amazon.com/AWSECommerceService/2010-11-01"> <OperationRequest>
        <HTTPHeaders>
            <Header Name="UserAgent" Value="TSN (Language=Python)"></Header>
        </HTTPHeaders>
        <RequestId>96ef9bc3-68a8-4bf3-a2c7-c98b8aeae00f</RequestId>
        <Arguments>
            <Argument Name="Operation" Value="ItemLookup"></Argument>
            <Argument Name="Service" Value="AWSECommerceService"></Argument>
            <Argument Name="Signature" Value="gjc4wRNum3YT82app1d06vMIDM7v44fOmZTP8Uh3LqE="></Argument><Argument Name="AssociateTag" Value="sneakick-20"></Argument>
            <Argument Name="Version" Value="2010-11-01"></Argument>
            <Argument Name="ItemId" Value="810056013349,810056013264"></Argument>
            <Argument Name="IdType" Value="UPC"></Argument>
            <Argument Name="AWSAccessKeyId" Value="AKIAIFMUMJLJOOINRVRA"></Argument>
            <Argument Name="Timestamp" Value="2012-01-03T21:26:39Z"></Argument>
            <Argument Name="ResponseGroup" Value="ItemIds"></Argument>
            <Argument Name="SearchIndex" Value="Apparel"></Argument>
        </Arguments>
       <RequestProcessingTime>0.0595830000000000</RequestProcessingTime>
      </OperationRequest>
      <Items>
          <Request>
              <IsValid>True</IsValid>
              <ItemLookupRequest>
                  <IdType>UPC</IdType>
                  <ItemId>810056013349</ItemId>
                  <ItemId>810056013264</ItemId>
                  <ResponseGroup>ItemIds</ResponseGroup>
                  <SearchIndex>Apparel</SearchIndex>
                  <VariationPage>All</VariationPage>
              </ItemLookupRequest>
          </Request>
          <Item>
              <ASIN>B000XR4K6U</ASIN>
          </Item>
          <Item>
              <ASIN>B000XR2UU8</ASIN>
          </Item>
       </Items>
    </ItemLookupResponse>

All I am interested in is the Item tags inside Items. Amazon returned all of that XML as a string, which I parsed like so:

from xml.etree.ElementTree import fromstring

response = "xml string returned by amazon"
parsed = fromstring(response)
items = parsed[1] # This is how i get the Items element

# These were my attempts at getting the Item element
items.find('Item')
items.findall('Item')

Here items is the Items element, but so far I have had no success: find returns None and findall returns an empty list. Am I missing something, or is there another way to go about this?

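It is a namespace issue. The root element declares a default namespace, so ElementTree stores every tag as the namespace URI in braces followed by the local name, and a search for the bare name 'Item' matches nothing. This works: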

from xml.etree import ElementTree as ET

XML = """<?xml version="1.0" ?>
    <ItemLookupResponse xmlns="http://webservices.amazon.com/AWSECommerceService/2010-11-01"> 
      <OperationRequest>
        <HTTPHeaders>
            <Header Name="UserAgent" Value="TSN (Language=Python)"></Header>
        </HTTPHeaders>
        <RequestId>96ef9bc3-68a8-4bf3-a2c7-c98b8aeae00f</RequestId>
        <Arguments>
            <Argument Name="Operation" Value="ItemLookup"></Argument>
            <Argument Name="Service" Value="AWSECommerceService"></Argument>
            <Argument Name="Signature" Value="gjc4wRNum3YT82app1d06vMIDM7v44fOmZTP8Uh3LqE="></Argument>
            <Argument Name="AssociateTag" Value="sneakick-20"></Argument>
            <Argument Name="Version" Value="2010-11-01"></Argument>
            <Argument Name="ItemId" Value="810056013349,810056013264"></Argument>
            <Argument Name="IdType" Value="UPC"></Argument>
            <Argument Name="AWSAccessKeyId" Value="AKIAIFMUMJLJOOINRVRA"></Argument>
            <Argument Name="Timestamp" Value="2012-01-03T21:26:39Z"></Argument>
            <Argument Name="ResponseGroup" Value="ItemIds"></Argument>
            <Argument Name="SearchIndex" Value="Apparel"></Argument>
        </Arguments>
       <RequestProcessingTime>0.0595830000000000</RequestProcessingTime>
      </OperationRequest>
      <Items>
          <Request>
              <IsValid>True</IsValid>
              <ItemLookupRequest>
                  <IdType>UPC</IdType>
                  <ItemId>810056013349</ItemId>
                  <ItemId>810056013264</ItemId>
                  <ResponseGroup>ItemIds</ResponseGroup>
                  <SearchIndex>Apparel</SearchIndex>
                  <VariationPage>All</VariationPage>
              </ItemLookupRequest>
          </Request>
          <Item>
              <ASIN>B000XR4K6U</ASIN>
          </Item>
          <Item>
              <ASIN>B000XR2UU8</ASIN>
          </Item>
       </Items>
    </ItemLookupResponse>"""

NS = "{http://webservices.amazon.com/AWSECommerceService/2010-11-01}"

doc = ET.fromstring(XML)
Item_elems = doc.findall(".//" + NS + "Item")  # All Item elements in document

print(Item_elems)

Output:

[<Element '{http://webservices.amazon.com/AWSECommerceService/2010-11-01}Item' at 0xbf0c50>, 
<Element '{http://webservices.amazon.com/AWSECommerceService/2010-11-01}Item' at 0xbf0cd0>]
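
To see why the bare name fails, it can help to look at the tag ElementTree actually stores. The following is a minimal sketch, not part of the original answer; SAMPLE is a shortened, hypothetical stand-in for the Amazon response that keeps the same default namespace.

```python
from xml.etree import ElementTree as ET

# Shortened stand-in for the Amazon response (assumption: same default namespace).
SAMPLE = """<ItemLookupResponse xmlns="http://webservices.amazon.com/AWSECommerceService/2010-11-01">
  <Items>
    <Item><ASIN>B000XR4K6U</ASIN></Item>
    <Item><ASIN>B000XR2UU8</ASIN></Item>
  </Items>
</ItemLookupResponse>"""

NS = "{http://webservices.amazon.com/AWSECommerceService/2010-11-01}"

root = ET.fromstring(SAMPLE)
items = root[0]  # Items is the first child in this shortened sample

# ElementTree stores each tag in "{namespace-uri}localname" form.
print(items.tag)

# The bare name matches nothing, because no element is literally named "Item".
bare_matches = items.findall("Item")
print(bare_matches)

# The qualified name matches both Item children.
qualified_matches = items.findall(NS + "Item")
print(len(qualified_matches))
```

Printing elem.tag like this is a quick way to check whether a document uses a namespace before writing any search paths.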

Variation closer to your own code:

NS = "{http://webservices.amazon.com/AWSECommerceService/2010-11-01}"
doc = ET.fromstring(XML)
items = doc[1]                           # Items element

first_item = items.find(NS + 'Item')     # First direct Item child
all_items =  items.findall(NS + 'Item')  # List of all direct Item children
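
As an alternative to concatenating the namespace string by hand, find, findall and findtext also accept a prefix-to-URI mapping. The sketch below is an illustration, not part of the original answer: the prefix aws is an arbitrary choice, and SAMPLE is a shortened, hypothetical stand-in for the full response.

```python
from xml.etree import ElementTree as ET

# Shortened stand-in for the Amazon response (assumption: same default namespace).
SAMPLE = """<ItemLookupResponse xmlns="http://webservices.amazon.com/AWSECommerceService/2010-11-01">
  <Items>
    <Item><ASIN>B000XR4K6U</ASIN></Item>
    <Item><ASIN>B000XR2UU8</ASIN></Item>
  </Items>
</ItemLookupResponse>"""

# The prefix "aws" is a local label only; the URI is what must match the document.
NSMAP = {"aws": "http://webservices.amazon.com/AWSECommerceService/2010-11-01"}

doc = ET.fromstring(SAMPLE)

# "aws:Item" is expanded to "{uri}Item" using the mapping.
all_items = doc.findall(".//aws:Item", NSMAP)

# Child lookups need the prefix too, since ASIN is in the same namespace.
asins = [item.findtext("aws:ASIN", namespaces=NSMAP) for item in all_items]
print(asins)
```

The mapping keeps the long URI in one place, which tends to read better once several different tags are being searched for. On Python 3.8 and later, the wildcard form ".//{*}Item" should also match regardless of namespace.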

This is a namespace issue.

You can put the namespace in front of every tag name, as the answer above does. A possibly simpler solution is to ignore the namespace with a quick hack that renames the xmlns attribute before parsing, so the parser treats it as an ordinary attribute:

from xml.etree.ElementTree import fromstring

# raw_xml is the XML string returned by Amazon
xml_hacked_namespace = raw_xml.replace(' xmlns=', ' xmlnamespace=')
doc = fromstring(xml_hacked_namespace)
item_list = doc.findall('.//Item')
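
If editing the raw string feels too fragile, a similar effect can be had by stripping the namespace from each tag after parsing. This is a sketch of that alternative, not part of the original answer; strip_namespaces is a hypothetical helper name and SAMPLE is a shortened stand-in for the full response.

```python
from xml.etree import ElementTree as ET

# Shortened stand-in for the Amazon response (assumption: same default namespace).
SAMPLE = """<ItemLookupResponse xmlns="http://webservices.amazon.com/AWSECommerceService/2010-11-01">
  <Items>
    <Item><ASIN>B000XR4K6U</ASIN></Item>
    <Item><ASIN>B000XR2UU8</ASIN></Item>
  </Items>
</ItemLookupResponse>"""


def strip_namespaces(root):
    """Remove the "{uri}" prefix from every element tag, in place."""
    for elem in root.iter():
        # Only string tags of the form "{uri}name" are rewritten.
        if isinstance(elem.tag, str) and elem.tag.startswith("{"):
            elem.tag = elem.tag.split("}", 1)[1]
    return root


doc = strip_namespaces(ET.fromstring(SAMPLE))

# After stripping, bare names work in search paths.
item_list = doc.findall(".//Item")
asins = [item.findtext("ASIN") for item in item_list]
print(asins)
```

This only rewrites element tags, not attribute names, and it discards the namespace information, so it suits read-only extraction better than documents that will be written back out.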

If you find that you are doing a lot of work with XML, you may also be interested in lxml. It is generally faster and provides a few extra methods, such as full XPath support, that some find nice to have.
