Python: parsing Alexa API results with lxml.etree
I am using the Alexa API from AWS, but I am finding it difficult to parse the result and get what I want.
The Alexa API returns a tree object of type <type 'lxml.etree._ElementTree'>.
I use this code to print the tree:
from lxml import etree

# `tree` is the lxml.etree._ElementTree object returned by the Alexa API call
root = tree.getroot()
print(etree.tostring(root))
I get the XML below:
<aws:UrlInfoResponse xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/"><aws:Response xmlns:aws="http://awis.amazonaws.com/doc/2005-07-11"><aws:OperationRequest><aws:RequestId>ccf3f263-ab76-ab63-db99-244666044e85</aws:RequestId></aws:OperationRequest><aws:UrlInfoResult><aws:Alexa>
<aws:ContentData>
<aws:DataUrl type="canonical">google.com/</aws:DataUrl>
<aws:SiteData>
<aws:Title>Google</aws:Title>
<aws:Description>Enables users to search the world's information, including webpages, images, and videos. Offers unique features and search technology.</aws:Description>
<aws:OnlineSince>15-Sep-1997</aws:OnlineSince>
</aws:SiteData>
<aws:LinksInCount>3453627</aws:LinksInCount>
</aws:ContentData>
<aws:TrafficData>
<aws:DataUrl type="canonical">google.com/</aws:DataUrl>
<aws:Rank>1</aws:Rank>
</aws:TrafficData>
</aws:Alexa></aws:UrlInfoResult><aws:ResponseStatus xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/"><aws:StatusCode>Success</aws:StatusCode></aws:ResponseStatus></aws:Response></aws:UrlInfoResponse>
I use root.find('LinksInCount').text to get the value of the element, but it does not work.
I want to know how to get the text 3453627 of aws:LinksInCount.
You run into two challenges.
The first is that you see the "aws:" prefix, but it is used for two different namespaces:
xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/"
xmlns:aws="http://awis.amazonaws.com/doc/2005-07-11"
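Reusing the same namespace prefix in XML is completely legal. The rule is that the innermost declaration in scope applies: everything inside aws:Response, including aws:LinksInCount, belongs to the awis namespace, while aws:ResponseStatus rebinds the prefix to the alexa namespace again.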
xmlstr = """
<?xml version="1.0"?>
<aws:UrlInfoResponse xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/">
<aws:Response xmlns:aws="http://awis.amazonaws.com/doc/2005-07-11">
<aws:OperationRequest>
<aws:RequestId>ccf3f263-ab76-ab63-db99-244666044e85</aws:RequestId>
</aws:OperationRequest>
<aws:UrlInfoResult>
<aws:Alexa>
<aws:ContentData>
<aws:DataUrl type="canonical">google.com/</aws:DataUrl>
<aws:SiteData>
<aws:Title>Google</aws:Title>
<aws:Description>Enables users to search the world's information, including webpages, images, and videos. Offers unique features and search technology.</aws:Description>
<aws:OnlineSince>15-Sep-1997</aws:OnlineSince>
</aws:SiteData>
<aws:LinksInCount>3453627</aws:LinksInCount>
</aws:ContentData>
<aws:TrafficData>
<aws:DataUrl type="canonical">google.com/</aws:DataUrl>
<aws:Rank>1</aws:Rank>
</aws:TrafficData>
</aws:Alexa>
</aws:UrlInfoResult>
<aws:ResponseStatus xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/">
<aws:StatusCode>Success</aws:StatusCode>
</aws:ResponseStatus>
</aws:Response>
</aws:UrlInfoResponse>
"""
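To see which namespace each element really ends up in, it can help to print the parsed tags. The following is a minimal sketch, not part of the original answer: it uses a trimmed-down copy of the response (the variable name xml_text is only for illustration) and the standard library's xml.etree.ElementTree so that it runs without extra dependencies. lxml.etree offers the same fromstring(), .iter() and .tag calls.

```python
import xml.etree.ElementTree as ET

# Trimmed-down copy of the response; the same "aws" prefix is bound three times.
xml_text = (
    '<aws:UrlInfoResponse xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/">'
    '<aws:Response xmlns:aws="http://awis.amazonaws.com/doc/2005-07-11">'
    '<aws:LinksInCount>3453627</aws:LinksInCount>'
    '<aws:ResponseStatus xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/">'
    '<aws:StatusCode>Success</aws:StatusCode>'
    '</aws:ResponseStatus>'
    '</aws:Response>'
    '</aws:UrlInfoResponse>'
)

root = ET.fromstring(xml_text)

# Parsed tags use the "{namespace-uri}localname" form, so the prefix disappears
# and the namespace actually in effect for each element becomes visible.
resolved_tags = [element.tag for element in root.iter()]
for tag in resolved_tags:
    print(tag)
```

The root element and the two status elements come out in the alexa namespace, while Response and LinksInCount come out in the awis namespace, which is why the awis URI is the one to use when searching for LinksInCount.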
The next challenge is how to search for namespaced elements.
I prefer using xpath. With it, you can use whatever namespace prefix you like in the xpath expression, but you have to tell the xpath call what you mean by those prefixes. This is done with a namespaces dictionary:
from lxml import etree

doc = etree.fromstring(xmlstr.strip())
# Bind the "aws" prefix to the namespace that is in effect around LinksInCount
namespaces = {"aws": "http://awis.amazonaws.com/doc/2005-07-11"}
texts = doc.xpath("//aws:LinksInCount/text()", namespaces=namespaces)
print(texts[0])
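If you would rather stay with find(), as in the question, the same idea applies. The sketch below is an illustration rather than part of the original answer. It assumes a shortened copy of the response, and the names xml_text, AWIS_NS and the "awis" prefix are chosen here only for the example. It uses the standard library's xml.etree.ElementTree so it runs as-is; lxml.etree accepts the same find() paths and the same namespaces argument. The bare name fails because find('LinksInCount') only looks at direct children of the root and only matches elements that have no namespace.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the response, keeping the nesting around LinksInCount.
xml_text = (
    '<aws:UrlInfoResponse xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/">'
    '<aws:Response xmlns:aws="http://awis.amazonaws.com/doc/2005-07-11">'
    '<aws:UrlInfoResult><aws:Alexa><aws:ContentData>'
    '<aws:LinksInCount>3453627</aws:LinksInCount>'
    '</aws:ContentData></aws:Alexa></aws:UrlInfoResult>'
    '</aws:Response>'
    '</aws:UrlInfoResponse>'
)

root = ET.fromstring(xml_text)
AWIS_NS = "http://awis.amazonaws.com/doc/2005-07-11"

# A bare name matches nothing: the element is namespaced and not a direct
# child of the root, so find() returns None and .text would raise an error.
missing = root.find("LinksInCount")

# Option 1: spell out the namespace URI in "{uri}name" form and search
# all descendants with ".//".
by_uri = root.find(".//{%s}LinksInCount" % AWIS_NS)

# Option 2: pick any prefix and map it to the URI via the namespaces argument.
by_prefix = root.find(".//awis:LinksInCount", namespaces={"awis": AWIS_NS})

print(missing)
print(by_uri.text)
print(by_prefix.text)
```

Both options look up the element by its namespace URI; the prefix used in the search path does not have to match the one used in the document.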