Problems with XPath in Python using lxml on an XML file
I'm trying to parse some data from an RSS feed. This is an example of how it looks:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns="http://purl.org/rss/1.0/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:admin="http://webns.net/mvcb/" xmlns:syn="http://purl.org/rss/1.0/modules/syndication/">
<channel rdf:about="http://somelink.com">
<!-- ordinary stuff goes here -->
</channel>
<item rdf:about="http://www.some/random/link/123">
<title>title</title>
<link>
http://www.some/random/link/123
</link>
<description>
<![CDATA[
..description..
]]>
</description>
<dc:date>the date</dc:date>
</item>
</rdf:RDF>
Now, I'm trying to get every item element from the RSS feed, which is no problem with a normal feed, but I can't seem to get anything from this one. It just returns an empty list.
This is the code I'm using:
from lxml import etree
tree = etree.parse(url)
items = tree.xpath("//item")
Does it have to do with the rdf:RDF at the start, or the rdf:about=.... in every item tag?
Just in case:
- The file is at least loading, because etree.tostring(tree) does yield the whole file.
- I've tried using nsmap = tree.getroot().nsmap(), but I don't know if I did it right.
- On a regular RSS feed, tree.getroot() yields <Element rss at 0x2fa4260>, but on this file it yields <Element {http://www.w3.org/1999/02/22-rdf-syntax-ns#}RDF at 0x2fa4288>.
As soon as you start using namespaces (even a default namespace, i.e. one declared without a prefix), you must be very explicit in XPath about which namespace you mean.
For this purpose, lxml lets you pass a dictionary in which the keys are namespace prefixes (any names you like) and the values are the respective namespace URIs:
from lxml import etree
xmlstr = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns="http://purl.org/rss/1.0/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/"
xmlns:admin="http://webns.net/mvcb/"
xmlns:syn="http://purl.org/rss/1.0/modules/syndication/">
<channel rdf:about="http://somelink.com">
<!-- ordinary stuff goes here -->
</channel>
<item rdf:about="http://www.some/random/link/123">
<title>title</title>
<link>
http://www.some/random/link/123
</link>
<description>
<![CDATA[
..description..
]]>
</description>
<dc:date>the date</dc:date>
</item>
</rdf:RDF>"""
xmldoc = etree.fromstring(xmlstr)
nsmap = {"purl": "http://purl.org/rss/1.0/"}
res = xmldoc.xpath("//purl:item", namespaces=nsmap)
print(res)
# encoding="unicode" returns str rather than bytes on Python 3
print("xml", etree.tostring(res[0], encoding="unicode"))
Running such code prints:
[<Element {http://purl.org/rss/1.0/}item at 0x7fc8fb20af80>]
xml <item xmlns="http://purl.org/rss/1.0/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:admin="http://webns.net/mvcb/" xmlns:syn="http://purl.org/rss/1.0/modules/syndication/" rdf:about="http://www.some/random/link/123">
<title>title</title>
<link>
http://www.some/random/link/123
</link>
<description>
..description..
</description>
<dc:date>the date</dc:date>
</item>
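The same idea extends to the children of each item. The following is only a sketch on a trimmed-down copy of the feed: the prefix names "rss" and "dc" in the dictionary are arbitrary choices made for this example, and only the URIs have to match the document. It also shows that the un-prefixed query from the question finds nothing, because it looks for item elements in no namespace at all.

```python
from lxml import etree

# Trimmed-down copy of the feed from the question (bytes input for lxml)
xml = b"""<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <item rdf:about="http://www.some/random/link/123">
    <title>title</title>
    <dc:date>the date</dc:date>
  </item>
</rdf:RDF>"""

root = etree.fromstring(xml)

# Prefix names are made up for this sketch; the URIs come from the feed.
ns = {
    "rss": "http://purl.org/rss/1.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Un-prefixed name: matches only elements that are in no namespace.
no_prefix = root.xpath("//item")

# Prefixed name: matches item elements in the RSS 1.0 namespace.
items = root.xpath("//rss:item", namespaces=ns)

# Child lookups need the namespace as well, including the default one.
rows = []
for item in items:
    title = item.findtext("rss:title", namespaces=ns)
    date = item.findtext("dc:date", namespaces=ns)
    rows.append((title, date))

print(no_prefix)
print(rows)
```

Note that title needs the "rss" prefix even though it is written without one in the feed, since it inherits the default namespace declared on the root element.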
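The lesson is: in XPath, always state the namespace explicitly, even for elements that live in the document's default namespace.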
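As for the nsmap attempt in the question: in lxml, nsmap is an attribute holding a dictionary, not a method, and it stores the default namespace under the key None. XPath has no concept of a default namespace, so a common workaround is to give that entry a real prefix before passing the map on. The sketch below assumes the hypothetical prefix "rss" for that purpose, and also shows the alternative of spelling the tag in {uri}name form with findall(), which needs no prefix map at all.

```python
from lxml import etree

# Trimmed-down copy of the feed from the question (bytes input for lxml)
xml = b"""<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <item rdf:about="http://www.some/random/link/123">
    <title>title</title>
  </item>
</rdf:RDF>"""

root = etree.fromstring(xml)

# nsmap is a plain attribute: no parentheses.
raw = root.nsmap

# The default namespace sits under the key None. Rename it to a real
# prefix ("rss" is an arbitrary name chosen for this sketch).
nsmap = {prefix or "rss": uri for prefix, uri in raw.items()}

items = root.xpath("//rss:item", namespaces=nsmap)

# Alternative: fully qualified tag name, searched among the root's children.
clark = root.findall("{http://purl.org/rss/1.0/}item")

print(sorted(nsmap))
print([el.tag for el in items])
print([el.tag for el in clark])
```

Building the map from the document saves retyping the URIs, but it ties the code to whatever prefixes the feed happens to declare, so a hand-written dictionary is often the more predictable choice.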