scrapy xpath selector works in browser, but not in crawl or shell

I'm crawling the following page: http://www.worldfootball.net/all_matches/eng-premier-league-2015-2016/

The first parse goes through and should get all the links with scores as the text. I first loop through all the match rows:

for sel in response.xpath('(//table[@class="standard_tabelle"])[1]/tr'):

And then get the links in the 6th column of the table:

    matchHref = sel.xpath('.//td[6]/a/@href').extract()

This, however, returns nothing. I tried the same selector in Chrome (with the addition of 'tbody' between the table and tr selectors) and got results there. But if I try the same selector (without the tbody) in scrapy shell, I only get results from the first response.xpath, while the link extraction that follows returns nothing.

I've done a handful of these loops before, but this simple thing has me stumped. Is there a better way to debug this? Here is some shell output where I simplify my second selection to just select any td:

In [36]: for sel in response.xpath('(//table[@class="standard_tabelle"])[1]/tr'):
   ....:     sel.xpath('.//td')
   ....:     

Nothing. Ideas?
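As a side note on the shell debugging itself: when the scrapy shell runs under IPython (as the In [36]: prompt suggests), expressions inside a for loop are not auto-echoed the way a top-level expression is, so the loop above can appear to return nothing even when it does match cells. Printing the extracted values explicitly makes the results visible; a minimal sketch of the same loop:

    for sel in response.xpath('(//table[@class="standard_tabelle"])[1]/tr'):
        print(sel.xpath('.//td').extract())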

What I would do is use the fact that the links in the 6th column contain report in their href attribute value. Demo from the shell:

$ scrapy shell "http://www.worldfootball.net/all_matches/eng-premier-league-2015-2016/"
>>> for row in response.xpath('(//table[@class="standard_tabelle"])[1]/tr[not(th)]'):
...     print(row.xpath(".//a[contains(@href, 'report')]/@href").extract_first())
... 
/report/premier-league-2015-2016-manchester-united-tottenham-hotspur/
/report/premier-league-2015-2016-afc-bournemouth-aston-villa/
/report/premier-league-2015-2016-everton-fc-watford-fc/
...
/report/premier-league-2015-2016-stoke-city-west-ham-united/
/report/premier-league-2015-2016-swansea-city-manchester-city/
/report/premier-league-2015-2016-watford-fc-sunderland-afc/
/report/premier-league-2015-2016-west-bromwich-albion-liverpool-fc/

Also note this part: tr[not(th)] - this helps to skip header rows with no relevant links.
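Putting it together in a spider, a minimal sketch could look like the following. The spider name, callback names, and the yielded item are placeholders for illustration; it assumes you want to follow each report link to its own page:

    import scrapy


    class MatchReportsSpider(scrapy.Spider):
        # hypothetical spider name, for illustration only
        name = "match_reports"
        start_urls = [
            "http://www.worldfootball.net/all_matches/eng-premier-league-2015-2016/",
        ]

        def parse(self, response):
            # iterate over the data rows of the first results table,
            # skipping header rows that only contain <th> cells
            rows = response.xpath('(//table[@class="standard_tabelle"])[1]/tr[not(th)]')
            for row in rows:
                href = row.xpath(".//a[contains(@href, 'report')]/@href").extract_first()
                if href:
                    # follow the relative report link to a separate callback
                    yield scrapy.Request(response.urljoin(href), callback=self.parse_report)

        def parse_report(self, response):
            # placeholder: extract whatever you need from the match report page
            yield {"url": response.url}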
