Scrapy response.xpath returns an empty array on an XML document with a default namespace, while response.re works

I am new to Scrapy and was playing with the Scrapy shell, trying to crawl this site: www.spiegel.de/sitemap.xml

I started the shell with

scrapy shell "http://www.spiegel.de/sitemap.xml"

and it all works fine. When I use

response.body 

I can see the whole page, including the XML tags.

However, this, for instance:

response.xpath('//loc') 

simply won't work.

The result I get is an empty array,

while

response.selector.re('somevalidregexpexpression') 

would work.

Any idea what the reason could be? Could it be related to encoding? The site is not UTF-8.

I am using Python 2.7 on Windows 7. I tried xpath() on another site (dmoz) and it worked fine.

The problem is due to the default namespace declared at the root element of the XML:

xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"

So in that XML, the root element and all of its unprefixed descendants implicitly inherit the same namespace.

XPath, on the other hand, has no implied default namespace: an unprefixed name such as loc only matches elements that are in no namespace. To reference an element in a namespace, you need to use a prefix bound to that namespace URI.
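To make the implicit inheritance visible, here is a minimal sketch that uses the standard library's xml.etree.ElementTree instead of Scrapy, so it runs without a network connection. The two-entry sitemap and the example.com URLs are made up for illustration; only the namespace URI comes from the real document.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the real sitemap; only the xmlns URI is the real one.
SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/a</loc></url>
  <url><loc>http://example.com/b</loc></url>
</urlset>"""

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

root = ET.fromstring(SITEMAP)

# ElementTree spells out the namespace that the unprefixed tag inherited.
print(root.tag)  # {http://www.sitemaps.org/schemas/sitemap/0.9}urlset

# An unprefixed path looks for <loc> in no namespace, so nothing matches.
unprefixed = root.findall(".//loc")
print(unprefixed)  # []

# Binding a prefix (the name "d" is arbitrary) to the URI makes the path match.
prefixed = root.findall(".//d:loc", {"d": NS})
print([e.text for e in prefixed])  # ['http://example.com/a', 'http://example.com/b']
```

The prefix used in the query does not have to appear anywhere in the document; only the URI it is bound to matters.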

You can use selector.register_namespace() to bind a prefix to the default namespace URI, and then use that prefix in your XPath:

response.selector.register_namespace('d', 'http://www.sitemaps.org/schemas/sitemap/0.9')
response.xpath('//d:loc')

You can also match on the element's local name, ignoring its namespace, as in:

response.xpath("//*[local-name()='loc']")

This is especially useful if you are parsing responses from multiple heterogeneous sources and you don't want to register each and every namespace.
