I'm trying to parse some data from an RSS feed. This is an example of what it looks like:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns="http://purl.org/rss/1.0/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:admin="http://webns.net/mvcb/" xmlns:syn="http://purl.org/rss/1.0/modules/syndication/">
<channel rdf:about="http://somelink.com">
<!-- ordinary stuff goes here -->
</channel>
<item rdf:about="http://www.some/random/link/123">
<title>title</title>
<link>
http://www.some/random/link/123
</link>
<description>
<![CDATA[
..description..
]]>
</description>
<dc:date>the date</dc:date>
</item>
</rdf:RDF>
Now, I'm trying to get every item element from the RSS feed. That is no problem with a normal feed, but I can't seem to get anything from this one: it just returns an empty list.
This is the code I'm using:
from lxml import etree
tree = etree.parse(url)
items = tree.xpath("//item")
Does it have to do with the rdf:RDF at the start, or the rdf:about=.... in every item tag?
Just in case:
- The file is at least loading, because etree.tostring(tree) does yield the whole file.
- I've tried using nsmap = tree.getroot().nsmap(), but I don't know if I did it right.
- On a regular RSS feed, tree.getroot() yields <Element rss at 0x2fa4260>, but on this file it yields <Element {http://www.w3.org/1999/02/22-rdf-syntax-ns#}RDF at 0x2fa4288>.
As soon as a document uses namespaces (even a default namespace declared without a prefix), you must state explicitly in XPath which namespace each element name belongs to. Here, xmlns="http://purl.org/rss/1.0/" puts item into that namespace, so the unprefixed //item matches nothing.
For this purpose, lxml's xpath() accepts a namespaces dictionary whose keys are namespace prefixes (whatever you like) and whose values are the corresponding namespace URIs:
from lxml import etree
xmlstr = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns="http://purl.org/rss/1.0/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/"
xmlns:admin="http://webns.net/mvcb/"
xmlns:syn="http://purl.org/rss/1.0/modules/syndication/">
<channel rdf:about="http://somelink.com">
<!-- ordinary stuff goes here -->
</channel>
<item rdf:about="http://www.some/random/link/123">
<title>title</title>
<link>
http://www.some/random/link/123
</link>
<description>
<![CDATA[
..description..
]]>
</description>
<dc:date>the date</dc:date>
</item>
</rdf:RDF>"""
xmldoc = etree.fromstring(xmlstr)
nsmap = {"purl": "http://purl.org/rss/1.0/"}
res = xmldoc.xpath("//purl:item", namespaces=nsmap)
print(res)
# encoding="unicode" returns str rather than bytes on Python 3
print("xml", etree.tostring(res[0], encoding="unicode"))
Running such code prints:
[<Element {http://purl.org/rss/1.0/}item at 0x7fc8fb20af80>]
xml <item xmlns="http://purl.org/rss/1.0/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:admin="http://webns.net/mvcb/" xmlns:syn="http://purl.org/rss/1.0/modules/syndication/" rdf:about="http://www.some/random/link/123">
<title>title</title>
<link>
http://www.some/random/link/123
</link>
<description>
..description..
</description>
<dc:date>the date</dc:date>
</item>
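Going one step further, the same namespace dictionary can be reused to read the fields inside each item. The following is only a sketch: the feed is a shortened copy of the one in the question, and the names FEED, NS and parse_items are made up for illustration. Note that dc:date lives in a different namespace from item, so it needs its own prefix entry.

```python
from lxml import etree

# Shortened copy of the feed from the question (assumed to be representative).
FEED = b"""<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <item rdf:about="http://www.some/random/link/123">
    <title>title</title>
    <link>
      http://www.some/random/link/123
    </link>
    <dc:date>the date</dc:date>
  </item>
</rdf:RDF>"""

# Prefixes are arbitrary local names; only the URIs must match the document.
NS = {
    "rss": "http://purl.org/rss/1.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def parse_items(xml_bytes):
    """Return one dict per item element (hypothetical helper)."""
    root = etree.fromstring(xml_bytes)
    items = []
    for item in root.xpath("//rss:item", namespaces=NS):
        items.append({
            "title": item.findtext("rss:title", namespaces=NS),
            # The link text is surrounded by newlines in the feed, so strip it.
            "link": (item.findtext("rss:link", namespaces=NS) or "").strip(),
            "date": item.findtext("dc:date", namespaces=NS),
        })
    return items


parsed = parse_items(FEED)
print(parsed)
```

Every step of the path needs its prefix: "rss:title" finds the title, while a bare "title" would look for an element in no namespace and return None.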
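The lesson is: when a document declares a namespace, including a default one, every element name in your XPath needs a prefix that is mapped to that namespace.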
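Regarding the nsmap attempt in the question: in lxml, nsmap is a property of the element, not a method, and the default namespace is stored under the key None. XPath 1.0 has no concept of a default namespace, so that mapping cannot be handed to xpath() unchanged. A minimal sketch of one workaround, where the helper name xpath_namespaces and the prefix "rss" are my own choices, not part of lxml:

```python
from lxml import etree

# Shortened copy of the feed from the question (assumed to be representative).
FEED = b"""<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <item rdf:about="http://www.some/random/link/123">
    <title>title</title>
  </item>
</rdf:RDF>"""

root = etree.fromstring(FEED)


def xpath_namespaces(element, default_prefix="rss"):
    """Copy element.nsmap, renaming the None (default) key to a usable prefix."""
    return {(prefix or default_prefix): uri
            for prefix, uri in element.nsmap.items()}


ns = xpath_namespaces(root)

# Prefixed XPath using the mapping derived from the document itself.
items = root.xpath("//rss:item", namespaces=ns)

# Alternative without any mapping: Clark notation, "{namespace-uri}localname".
clark_items = root.findall("{http://purl.org/rss/1.0/}item")

# The unprefixed query from the question still matches nothing.
unqualified = root.xpath("//item")

print(len(items), len(clark_items), len(unqualified))
```

Deriving the mapping from the root only covers namespaces declared on the root element; a feed that declares namespaces deeper in the tree would need a hand-written dictionary like the one in the answer.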