Parsing Python Alexa results with lxml.etree

Question

I am using the Alexa API from AWS, but I am having trouble parsing the result to get what I want.

The Alexa API returns an object tree of type <type 'lxml.etree._ElementTree'>.

I use this code to print the tree:

from lxml import etree

# `tree` is the lxml.etree._ElementTree object returned by the Alexa API call
root = tree.getroot()
print(etree.tostring(root))

I get the XML below:

<aws:UrlInfoResponse xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/"><aws:Response xmlns:aws="http://awis.amazonaws.com/doc/2005-07-11"><aws:OperationRequest><aws:RequestId>ccf3f263-ab76-ab63-db99-244666044e85</aws:RequestId></aws:OperationRequest><aws:UrlInfoResult><aws:Alexa>

  <aws:ContentData>
    <aws:DataUrl type="canonical">google.com/</aws:DataUrl>
    <aws:SiteData>
      <aws:Title>Google</aws:Title>
      <aws:Description>Enables users to search the world's information, including webpages, images, and videos. Offers unique features and search technology.</aws:Description>
      <aws:OnlineSince>15-Sep-1997</aws:OnlineSince>
    </aws:SiteData>
    <aws:LinksInCount>3453627</aws:LinksInCount>
  </aws:ContentData>
  <aws:TrafficData>
    <aws:DataUrl type="canonical">google.com/</aws:DataUrl>
    <aws:Rank>1</aws:Rank>
  </aws:TrafficData>
</aws:Alexa></aws:UrlInfoResult><aws:ResponseStatus xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/"><aws:StatusCode>Success</aws:StatusCode></aws:ResponseStatus></aws:Response></aws:UrlInfoResponse>

I use root.find('LinksInCount').text to get the value of the element, but it does not work.

I want to know how to get the text 3453627 of aws:LinksInCount.

Answer 1

You have run into two challenges:

the XML uses namespaces
two namespaces share the same namespace prefix

XML document with a reused prefix for 2 different namespaces

You see the "aws:" prefix, but it is used for two different namespaces:

xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/"
xmlns:aws="http://awis.amazonaws.com/doc/2005-07-11"

Reusing a namespace prefix in XML is completely legal. The rule is that the innermost declaration wins: a prefix redeclared on an element applies to that element and its descendants until it is redeclared again.
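A small check of that rule, as a sketch rather than part of the original answer: it parses a shortened, hypothetical sample that has the same prefix layout as the response in the question and records the namespace URI that lxml resolved for each element. The `sample` string and the `resolved` dictionary are names made up for this illustration.

```python
from lxml import etree

# Shortened, hypothetical sample: same "aws" prefix layout as the real response.
sample = (
    '<aws:UrlInfoResponse xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/">'
    '<aws:Response xmlns:aws="http://awis.amazonaws.com/doc/2005-07-11">'
    '<aws:LinksInCount>3453627</aws:LinksInCount>'
    '<aws:ResponseStatus xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/">'
    '<aws:StatusCode>Success</aws:StatusCode>'
    '</aws:ResponseStatus>'
    '</aws:Response>'
    '</aws:UrlInfoResponse>'
)
root = etree.fromstring(sample)

# Map each local element name to the namespace URI lxml resolved for it.
resolved = {}
for el in root.iter():
    qname = etree.QName(el)
    resolved[qname.localname] = qname.namespace
    print("%s -> %s" % (qname.localname, qname.namespace))
```

The root element and everything under aws:ResponseStatus end up in the alexa.amazonaws.com namespace, while aws:Response and aws:LinksInCount end up in the awis.amazonaws.com one, even though all of them are written with the same "aws:" prefix.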

xmlstr = """
<?xml version="1.0"?>
<aws:UrlInfoResponse xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/">
  <aws:Response xmlns:aws="http://awis.amazonaws.com/doc/2005-07-11">
    <aws:OperationRequest>
      <aws:RequestId>ccf3f263-ab76-ab63-db99-244666044e85</aws:RequestId>
    </aws:OperationRequest>
    <aws:UrlInfoResult>
      <aws:Alexa>
        <aws:ContentData>
          <aws:DataUrl type="canonical">google.com/</aws:DataUrl>
          <aws:SiteData>
            <aws:Title>Google</aws:Title>
            <aws:Description>Enables users to search the world's information, including webpages, images, and videos. Offers unique features and search technology.</aws:Description>
            <aws:OnlineSince>15-Sep-1997</aws:OnlineSince>
          </aws:SiteData>
          <aws:LinksInCount>3453627</aws:LinksInCount>
        </aws:ContentData>
        <aws:TrafficData>
          <aws:DataUrl type="canonical">google.com/</aws:DataUrl>
          <aws:Rank>1</aws:Rank>
        </aws:TrafficData>
      </aws:Alexa>
    </aws:UrlInfoResult>
    <aws:ResponseStatus xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/">
      <aws:StatusCode>Success</aws:StatusCode>
    </aws:ResponseStatus>
  </aws:Response>
</aws:UrlInfoResponse>
"""

The next challenge is how to search for namespaced elements.

I prefer using xpath. With it, you can use whatever namespace prefix you like in the xpath expression, but you have to tell the xpath call what you mean by those prefixes. This is done with a namespaces dictionary:

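from lxml import etree

# strip() removes the leading newline so the XML declaration comes first
doc = etree.fromstring(xmlstr.strip())

# Map the prefix used in the xpath expression to the namespace of LinksInCount
namespaces = {"aws": "http://awis.amazonaws.com/doc/2005-07-11"}
texts = doc.xpath("//aws:LinksInCount/text()", namespaces=namespaces)
print(texts[0])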
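As for why root.find('LinksInCount') returned nothing: a bare tag name matches only elements in no namespace, and without a leading ".//" the search covers direct children only. The sketch below is an alternative to the xpath call, not the original answer's method. It uses find and findtext with either the "{namespace-uri}name" form or a prefix mapping. The `sample` string is a shortened, hypothetical stand-in for the real response, and the prefixes "awis" and "alexa" are arbitrary names chosen here to tell the two namespaces apart.

```python
from lxml import etree

# Shortened, hypothetical stand-in for the real response.
sample = (
    '<aws:UrlInfoResponse xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/">'
    '<aws:Response xmlns:aws="http://awis.amazonaws.com/doc/2005-07-11">'
    '<aws:UrlInfoResult><aws:Alexa><aws:ContentData>'
    '<aws:LinksInCount>3453627</aws:LinksInCount>'
    '</aws:ContentData></aws:Alexa></aws:UrlInfoResult>'
    '<aws:ResponseStatus xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/">'
    '<aws:StatusCode>Success</aws:StatusCode>'
    '</aws:ResponseStatus>'
    '</aws:Response>'
    '</aws:UrlInfoResponse>'
)
root = etree.fromstring(sample)

AWIS = "http://awis.amazonaws.com/doc/2005-07-11"
ALEXA = "http://alexa.amazonaws.com/doc/2005-10-05/"

# Bare name: no namespace, direct children only, so nothing matches.
missing = root.find("LinksInCount")

# "{namespace-uri}localname" form, with ".//" to search all descendants.
links_clark = root.find(".//{%s}LinksInCount" % AWIS).text

# Same search with prefixes; the dictionary tells lxml what each prefix means.
ns = {"awis": AWIS, "alexa": ALEXA}
links_prefixed = root.findtext(".//awis:LinksInCount", namespaces=ns)
status = root.findtext(".//alexa:StatusCode", namespaces=ns)

print(missing)
print(links_clark)
print(links_prefixed)
print(status)
```

Giving the two namespaces different prefixes in the dictionary makes it possible to reach both aws:LinksInCount and aws:StatusCode from the same root, which a single "aws" entry cannot do.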
