
Python Alexa result parsing with lxml.etree

I am using the Alexa API from AWS, but I find it difficult to parse the result to get what I want.

The Alexa API returns an object tree of type <type 'lxml.etree._ElementTree'>.

I use this code to print the tree:

from lxml import etree
# tree is the lxml.etree._ElementTree object returned by the Alexa API call
root = tree.getroot()
print(etree.tostring(root))

I get the XML below:

<aws:UrlInfoResponse xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/"><aws:Response xmlns:aws="http://awis.amazonaws.com/doc/2005-07-11"><aws:OperationRequest><aws:RequestId>ccf3f263-ab76-ab63-db99-244666044e85</aws:RequestId></aws:OperationRequest><aws:UrlInfoResult><aws:Alexa>

  <aws:ContentData>
    <aws:DataUrl type="canonical">google.com/</aws:DataUrl>
    <aws:SiteData>
      <aws:Title>Google</aws:Title>
      <aws:Description>Enables users to search the world's information, including webpages, images, and videos. Offers unique features and search technology.</aws:Description>
      <aws:OnlineSince>15-Sep-1997</aws:OnlineSince>
    </aws:SiteData>
    <aws:LinksInCount>3453627</aws:LinksInCount>
  </aws:ContentData>
  <aws:TrafficData>
    <aws:DataUrl type="canonical">google.com/</aws:DataUrl>
    <aws:Rank>1</aws:Rank>
  </aws:TrafficData>
</aws:Alexa></aws:UrlInfoResult><aws:ResponseStatus xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/"><aws:StatusCode>Success</aws:StatusCode></aws:ResponseStatus></aws:Response></aws:UrlInfoResponse>

I use root.find('LinksInCount').text to get the value of the element, but it does not work.

I want to know how to get the text 3453627 of aws:LinksInCount.

You run into two challenges:

  • XML using namespaces
  • two namespaces sharing the same namespace prefix

XML document with a reused prefix for two different namespaces

You see the "aws:" prefix, but it is used for two different namespaces:

xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/"
xmlns:aws="http://awis.amazonaws.com/doc/2005-07-11"

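Reusing the same namespace prefix in XML is completely legal. The rule is that the innermost declaration in scope wins: a prefix declared on an element applies to that element and its descendants until a nested element redeclares it.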
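To see how the parser resolves the reused prefix, here is a minimal sketch that prints the fully qualified tag of every element. The `sample` string is a trimmed-down stand-in for the real response, not the full document, and the fallback to the standard library's `xml.etree.ElementTree` is an assumption added only so the snippet also runs where lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # assumption: stdlib fallback, same API for this snippet
    import xml.etree.ElementTree as etree

# Trimmed-down stand-in for the response shown in the question.
sample = (
    '<aws:UrlInfoResponse xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/">'
    '<aws:Response xmlns:aws="http://awis.amazonaws.com/doc/2005-07-11">'
    '<aws:LinksInCount>3453627</aws:LinksInCount>'
    '<aws:ResponseStatus xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/">'
    '<aws:StatusCode>Success</aws:StatusCode>'
    '</aws:ResponseStatus>'
    '</aws:Response>'
    '</aws:UrlInfoResponse>'
)
root = etree.fromstring(sample)

# Each tag is stored as "{namespace-uri}localname"; the prefix itself is gone.
tags = [el.tag for el in root.iter()]
for tag in tags:
    print(tag)
```

The root element and StatusCode come out in the alexa.amazonaws.com namespace, while Response and LinksInCount come out in the awis.amazonaws.com one, even though all of them are written with the same "aws:" prefix. The same pattern appears in the full response, stored here in xmlstr for the steps that follow: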

xmlstr = """
<?xml version="1.0"?>
<aws:UrlInfoResponse xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/">
  <aws:Response xmlns:aws="http://awis.amazonaws.com/doc/2005-07-11">
    <aws:OperationRequest>
      <aws:RequestId>ccf3f263-ab76-ab63-db99-244666044e85</aws:RequestId>
    </aws:OperationRequest>
    <aws:UrlInfoResult>
      <aws:Alexa>
        <aws:ContentData>
          <aws:DataUrl type="canonical">google.com/</aws:DataUrl>
          <aws:SiteData>
            <aws:Title>Google</aws:Title>
            <aws:Description>Enables users to search the world's information, including webpages, images, and videos. Offers unique features and search technology.</aws:Description>
            <aws:OnlineSince>15-Sep-1997</aws:OnlineSince>
          </aws:SiteData>
          <aws:LinksInCount>3453627</aws:LinksInCount>
        </aws:ContentData>
        <aws:TrafficData>
          <aws:DataUrl type="canonical">google.com/</aws:DataUrl>
          <aws:Rank>1</aws:Rank>
        </aws:TrafficData>
      </aws:Alexa>
    </aws:UrlInfoResult>
    <aws:ResponseStatus xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/">
      <aws:StatusCode>Success</aws:StatusCode>
    </aws:ResponseStatus>
  </aws:Response>
</aws:UrlInfoResponse>
"""

The next challenge is how to search for namespaced elements.

I prefer using XPath. In an XPath expression you can use whatever namespace prefix you like, but you have to tell the xpath call what you mean by those prefixes. This is done with a namespaces dictionary:

from lxml import etree
doc = etree.fromstring(xmlstr.strip())

namespaces = {"aws": "http://awis.amazonaws.com/doc/2005-07-11"}
texts = doc.xpath("//aws:LinksInCount/text()", namespaces=namespaces)
print(texts[0])
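If you would rather stay with find, as in the question, the same idea applies. The sketch below is an illustration under a few assumptions: `sample` is a shortened stand-in for the response, the prefixes `awis` and `alexa` are arbitrary labels chosen here (only the URIs have to match the document), and the standard-library fallback import is there only so the snippet runs without lxml.

```python
try:
    from lxml import etree
except ImportError:  # assumption: stdlib fallback, same API for this snippet
    import xml.etree.ElementTree as etree

# Shortened stand-in for the response shown in the question.
sample = (
    '<aws:UrlInfoResponse xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/">'
    '<aws:Response xmlns:aws="http://awis.amazonaws.com/doc/2005-07-11">'
    '<aws:LinksInCount>3453627</aws:LinksInCount>'
    '<aws:ResponseStatus xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/">'
    '<aws:StatusCode>Success</aws:StatusCode>'
    '</aws:ResponseStatus>'
    '</aws:Response>'
    '</aws:UrlInfoResponse>'
)
root = etree.fromstring(sample)

# Hypothetical prefixes: one per namespace URI, so they no longer collide.
ns = {
    "awis": "http://awis.amazonaws.com/doc/2005-07-11",
    "alexa": "http://alexa.amazonaws.com/doc/2005-10-05/",
}

# ".//" searches all descendants; a bare name would only match direct children.
links_in = root.findtext(".//awis:LinksInCount", namespaces=ns)
status = root.findtext(".//alexa:StatusCode", namespaces=ns)

# Equivalent lookup with the namespace URI written inline ("{uri}name").
same_links_in = root.findtext(
    ".//{http://awis.amazonaws.com/doc/2005-07-11}LinksInCount"
)

# The un-namespaced name matches nothing, so find returns None.
missing = root.find(".//LinksInCount")

print(links_in, status, same_links_in, missing)
```

This also suggests why root.find('LinksInCount').text fails in the question: the name carries no namespace, and without a leading ".//" only the direct children of the root are searched, so find returns None and .text raises an AttributeError.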
