
Problems with XPath in Python using lxml on an XML file

I'm trying to parse some data from an RSS feed. This is an example of what it looks like:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns="http://purl.org/rss/1.0/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:admin="http://webns.net/mvcb/"     xmlns:syn="http://purl.org/rss/1.0/modules/syndication/">
    <channel rdf:about="http://somelink.com">
        <!-- ordinary stuff goes here -->
    </channel>
    <item rdf:about="http://www.some/random/link/123">
        <title>title</title>
        <link>
        http://www.some/random/link/123
        </link>
        <description>
            <![CDATA[
                ..description..
                ]]>
        </description>
        <dc:date>the date</dc:date>
    </item>
</rdf:RDF>

Now, I'm trying to get every item element from the RSS feed, which is no problem with a normal feed, but I can't seem to get anything from this one. It just returns an empty list.

This is the code I'm using:

from lxml import etree
tree = etree.parse(url)
items = tree.xpath("//item")

Does it have to do with the rdf:RDF at the start, or the rdf:about=.... in every item tag?

Just in case:
-The file is at least loading, because etree.tostring(tree) does yield the whole file.
-I've tried using nsmap = tree.getroot().nsmap(), but I don't know if I did it right.
-On a regular RSS feed, tree.getroot() yields <Element rss at 0x2fa4260>, but on this file it yields <Element {http://www.w3.org/1999/02/22-rdf-syntax-ns#}RDF at 0x2fa4288>.

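As soon as a document uses namespaces (even a default namespace declared without a prefix), every XPath expression must say explicitly which namespace each element name belongs to. An unprefixed name such as //item matches only elements in no namespace, which is why the query returns an empty list here: the item elements live in the default namespace http://purl.org/rss/1.0/.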

For this purpose, the xpath() method in lxml accepts a namespaces dictionary whose keys are namespace prefixes (whatever you like) and whose values are the corresponding namespace URIs:

from lxml import etree

xmlstr = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns="http://purl.org/rss/1.0/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/"
    xmlns:admin="http://webns.net/mvcb/"
    xmlns:syn="http://purl.org/rss/1.0/modules/syndication/">
    <channel rdf:about="http://somelink.com">
        <!-- ordinary stuff goes here -->
    </channel>
    <item rdf:about="http://www.some/random/link/123">
        <title>title</title>
        <link>
        http://www.some/random/link/123
        </link>
        <description>
            <![CDATA[
                ..description..
                ]]>
        </description>
        <dc:date>the date</dc:date>
    </item>
</rdf:RDF>"""

xmldoc = etree.fromstring(xmlstr)
nsmap = {"purl": "http://purl.org/rss/1.0/"}
res = xmldoc.xpath("//purl:item", namespaces=nsmap)
print(res)

# encoding="unicode" returns str rather than bytes on Python 3
print("xml", etree.tostring(res[0], encoding="unicode"))

Running this code prints:

[<Element {http://purl.org/rss/1.0/}item at 0x7fc8fb20af80>]
xml <item xmlns="http://purl.org/rss/1.0/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:admin="http://webns.net/mvcb/" xmlns:syn="http://purl.org/rss/1.0/modules/syndication/" rdf:about="http://www.some/random/link/123">
        <title>title</title>
        <link>
        http://www.some/random/link/123
        </link>
        <description>

                ..description..

        </description>
        <dc:date>the date</dc:date>
    </item>
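
Regarding the nsmap attempt from the question: in lxml, nsmap is a property (a plain dict), not a method, and the default namespace is stored under the key None, which XPath cannot use as a prefix. The following is only a sketch of one way to build the dictionary from the document itself and pull out the fields of each item. The prefix "rss" and the name rows are made up for this example, and the feed is a shortened copy of the one above.

```python
from lxml import etree

xmlstr = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns="http://purl.org/rss/1.0/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
    <item rdf:about="http://www.some/random/link/123">
        <title>title</title>
        <link>
        http://www.some/random/link/123
        </link>
        <dc:date>the date</dc:date>
    </item>
</rdf:RDF>"""

root = etree.fromstring(xmlstr)

# root.nsmap maps prefixes to URIs; the default namespace has the key None.
# XPath 1.0 has no default namespace, so give it an arbitrary prefix ("rss").
ns = {(prefix or "rss"): uri for prefix, uri in root.nsmap.items()}

rows = []
for item in root.xpath("//rss:item", namespaces=ns):
    title = item.findtext("rss:title", namespaces=ns)
    link = item.findtext("rss:link", namespaces=ns).strip()
    date = item.findtext("dc:date", namespaces=ns)
    # Namespaced attributes are looked up with the {uri}name form.
    about = item.get("{%s}about" % ns["rdf"])
    rows.append((title, link, date, about))

print(rows)
```

This only picks up namespaces declared on the root element. If a feed declares them deeper in the tree, writing the dictionary by hand as in the answer above is the safer choice.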

The lesson is:

  • Feel free to ignore the namespace prefixes used in your document; they are secondary information. Note that XML allows the same prefix to be bound to different namespace URIs in different parts of one document (a scary idea, but true).
  • Do care about (and understand well) which namespace URI you are actually working with.
  • The dictionary of prefixes and namespace URIs may use whatever prefixes you like. They have nothing to do with the prefixes in the source XML file.
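
The points above can be illustrated with a small, made-up document in which the prefix a is bound to two different URIs. The example.com URIs and the prefixes x and y are assumptions chosen purely for illustration. The last query uses the {uri}name form that lxml accepts in iter(), which avoids prefixes altogether.

```python
from lxml import etree

# The same prefix "a" is bound to two different namespace URIs.
doc = etree.fromstring(
    '<root xmlns:a="http://example.com/one">'
    '<a:thing>first</a:thing>'
    '<inner xmlns:a="http://example.com/two"><a:thing>second</a:thing></inner>'
    '</root>'
)

# Our own prefixes; they need not match anything in the document.
ns = {"x": "http://example.com/one", "y": "http://example.com/two"}

one = doc.xpath("//x:thing/text()", namespaces=ns)
two = doc.xpath("//y:thing/text()", namespaces=ns)

# Prefix-free alternative: match on the full {uri}name tag.
clark = [el.text for el in doc.iter("{http://example.com/two}thing")]

print(one, two, clark)
```

Both a:thing elements look identical in the source, yet each query matches only the one whose URI it names, because matching is done on the URI and never on the prefix.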
