Problems with XPath in Python using lxml on an XML file

I'm trying to parse some data from an RSS feed. This is an example of how it looks:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns="http://purl.org/rss/1.0/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:admin="http://webns.net/mvcb/"     xmlns:syn="http://purl.org/rss/1.0/modules/syndication/">
    <channel rdf:about="http://somelink.com">
        <!-- ordinary stuff goes here -->
    </channel>
    <item rdf:about="http://www.some/random/link/123">
        <title>title</title>
        <link>
        http://www.some/random/link/123
        </link>
        <description>
            <![CDATA[
                ..description..
                ]]>
        </description>
        <dc:date>the date</dc:date>
    </item>
</rdf:RDF>

Now I'm trying to get every item element from the RSS feed. That is no problem with a normal feed, but I can't seem to get anything from this one: it just returns an empty list.

This is the code I'm using:

from lxml import etree
tree = etree.parse(url)
items = tree.xpath("//item")

Does it have to do with the rdf:RDF at the start, or the rdf:about=.... in every item tag?

Just in case:
- The file is at least loading, because etree.tostring(tree) does yield the whole file.
- I've tried using nsmap = tree.getroot().nsmap(), but I don't know if I did it right.
- On a regular RSS feed, tree.getroot() yields <Element rss at 0x2fa4260>, but on this file it yields <Element {http://www.w3.org/1999/02/22-rdf-syntax-ns#}RDF at 0x2fa4288>.

As soon as namespaces are in play (even the default namespace, which has no prefix), you must be explicit in XPath about which namespace you mean. In XPath 1.0 an unprefixed name such as item matches only elements in no namespace, which is why //item returns an empty list here.

For this purpose, lxml accepts a dictionary where the keys are namespace prefixes (whatever you like) and the values are the respective namespace URIs (the fully qualified names):

from lxml import etree

xmlstr = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns="http://purl.org/rss/1.0/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/"
    xmlns:admin="http://webns.net/mvcb/"
    xmlns:syn="http://purl.org/rss/1.0/modules/syndication/">
    <channel rdf:about="http://somelink.com">
        <!-- ordinary stuff goes here -->
    </channel>
    <item rdf:about="http://www.some/random/link/123">
        <title>title</title>
        <link>
        http://www.some/random/link/123
        </link>
        <description>
            <![CDATA[
                ..description..
                ]]>
        </description>
        <dc:date>the date</dc:date>
    </item>
</rdf:RDF>"""

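xmldoc = etree.fromstring(xmlstr)
# The prefix "purl" is arbitrary; only the namespace URI has to match the document.
nsmap = {"purl": "http://purl.org/rss/1.0/"}
res = xmldoc.xpath("//purl:item", namespaces=nsmap)
print(res)

# encoding="unicode" makes tostring() return str rather than bytes on Python 3.
print("xml", etree.tostring(res[0], encoding="unicode"))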

Running this code prints:

[<Element {http://purl.org/rss/1.0/}item at 0x7fc8fb20af80>]
xml <item xmlns="http://purl.org/rss/1.0/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:admin="http://webns.net/mvcb/" xmlns:syn="http://purl.org/rss/1.0/modules/syndication/" rdf:about="http://www.some/random/link/123">
        <title>title</title>
        <link>
        http://www.some/random/link/123
        </link>
        <description>

                ..description..

        </description>
        <dc:date>the date</dc:date>
    </item>
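
Once the item elements are found, the same namespace dictionary can be reused to read their children. The following is a minimal sketch, not part of the original answer: the helper name parse_items and the prefixes "rss" and "dc" are my own choices, and the feed is a trimmed copy of the one in the question.

```python
from lxml import etree

# Trimmed-down copy of the feed from the question.
xmlstr = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns="http://purl.org/rss/1.0/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
    <item rdf:about="http://www.some/random/link/123">
        <title>title</title>
        <link>
        http://www.some/random/link/123
        </link>
        <dc:date>the date</dc:date>
    </item>
</rdf:RDF>"""

# Prefixes are our own choice; only the URIs must match the document.
nsmap = {
    "rss": "http://purl.org/rss/1.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def parse_items(xml_text):
    """Hypothetical helper: return one dict per item element in the feed."""
    root = etree.fromstring(xml_text)
    rows = []
    for item in root.xpath("//rss:item", namespaces=nsmap):
        rows.append({
            "title": item.findtext("rss:title", namespaces=nsmap),
            # The link text is surrounded by whitespace in the feed, so strip it.
            "link": item.findtext("rss:link", default="", namespaces=nsmap).strip(),
            "date": item.findtext("dc:date", namespaces=nsmap),
        })
    return rows


items = parse_items(xmlstr)
print(items)
```

Note that title and link need the "rss" prefix too, because they inherit the default namespace of the document, while date lives in the Dublin Core namespace.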

The lesson is:

  • Feel free to ignore the namespace prefixes in your document; they are in fact secondary information. Note that XML allows reusing the same namespace prefix multiple times in one document for different fully qualified namespaces (scary idea, but true).
  • Do care about (and understand well) which fully qualified namespace you are really working with.
  • The dictionary of namespace prefixes and qualified names may use whatever prefixes you like. They have nothing to do with the prefixes in the source XML file.
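
These points can be checked with a short sketch (not from the original answer; the prefixes "foo", "bar" and "rss" are arbitrary names picked for illustration). It also touches on the nsmap attempt from the question: in lxml, nsmap is a property rather than a method, and it stores the default namespace under the key None, which as far as I know xpath() does not accept, so the sketch renames that key before reusing the mapping.

```python
from lxml import etree

xmlstr = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns="http://purl.org/rss/1.0/">
    <item rdf:about="http://www.some/random/link/123">
        <title>title</title>
    </item>
</rdf:RDF>"""

root = etree.fromstring(xmlstr)
RSS = "http://purl.org/rss/1.0/"

# An unprefixed name in XPath 1.0 means "no namespace", so this finds nothing.
plain = root.xpath("//item")

# Any prefix works as long as it maps to the right URI.
by_foo = root.xpath("//foo:item", namespaces={"foo": RSS})
by_bar = root.xpath("//bar:item", namespaces={"bar": RSS})

# find/findall/iter also accept "{uri}tag" (Clark notation), no prefix needed.
by_clark = root.findall("{%s}item" % RSS)

# root.nsmap is a property (no parentheses). Rename the None key (the default
# namespace) to a prefix of our choosing before handing the map to xpath().
ns_from_doc = {(prefix or "rss"): uri for prefix, uri in root.nsmap.items()}
by_doc_map = root.xpath("//rss:item", namespaces=ns_from_doc)

print(len(plain), len(by_foo), len(by_bar), len(by_clark), len(by_doc_map))
```

Only the unprefixed query comes back empty; every variant that names the namespace URI, under whatever prefix, finds the single item.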
