
Scrapy Xpath getting the correct pagination

First of all, thank you if you are reading this.

I have been scraping for some time to collect minor data. Now I want to pull in some additional information, but I got stuck on pagination.

I would like to get the data-href of the link, however the link needs to contain an element with a specific class.

I have been using [contains()], but how do you get the data-href when the condition has to match an object with a specific class inside the link?

<li><a class="cursor" data-type="js" data-href="test"><i class="fa fa-chevron-right" aria-hidden="true"></i></a></li>

I have been using the following:

next_page_url = response.selector.xpath('//*[@class="text-center"]/ul/li/a[contains(@class,"cursor")]/@data-href').extract_first()

which works, but not for the correct data-href.

Many thanks for the help.

Full source code:

<div class="pagination-container margin-bottom-20">
  <div class="text-center">
    <ul class="pagination">
      <li><a class="cursor" data-type="js" data-href="/used-truck/1-32/truck-ads.html"><i class="fa fa-chevron-left" aria-hidden="true"></i></a></li>
      <li><a href="/used-truck/1-32/truck-ads.html">1</a></li>
      <li class="active"><a>2</a></li>
      <li><a href="/used-truck/1-32/truck-ads.html?p=3">3</a></li>
      <li class="hidden-xs no-link"><a>...</a></li>
      <li class="hidden-xs"><a href="/used-truck/1-32/truck-ads.html?p=12">12</a></li>
      <li class="hidden-xs no-link"><a>...</a></li>
      <li class="hidden-xs"><a href="/used-truck/1-32/truck-ads.html?p=22">22</a></li>
      <li><a class="cursor" data-type="js" data-href="/used-truck/1-32/truck-ads.html?p=3"><i class="fa fa-chevron-right" aria-hidden="true"></i></a></li>
    </ul>
  </div>
</div>

Huh... Turned out to be such a simple case (:

Your mistake is .extract_first(); you should extract the last item to get the next page.

next_page = response.xpath('//a[@class="cursor"]/@data-href').extract()[-1]
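The root cause is that both chevron arrows (previous and next) share class="cursor", so the first match is the previous-page arrow and the last match is the next-page one. A minimal sketch of the same selection, using only the standard library's xml.etree in place of Scrapy's selectors, on the pagination markup from the question:

```python
import xml.etree.ElementTree as ET

# the <ul class="pagination"> from the question, trimmed to the relevant entries
pagination = (
    '<ul class="pagination">'
    '<li><a class="cursor" data-type="js" data-href="/used-truck/1-32/truck-ads.html">'
    '<i class="fa fa-chevron-left" aria-hidden="true"></i></a></li>'
    '<li><a href="/used-truck/1-32/truck-ads.html">1</a></li>'
    '<li class="active"><a>2</a></li>'
    '<li><a class="cursor" data-type="js" data-href="/used-truck/1-32/truck-ads.html?p=3">'
    '<i class="fa fa-chevron-right" aria-hidden="true"></i></a></li>'
    '</ul>'
)
ul = ET.fromstring(pagination)
# both arrows match class "cursor"; the last match is the "next" arrow
hrefs = [a.get('data-href') for a in ul.findall('.//a[@class="cursor"]')]
print(hrefs[-1])  # /used-truck/1-32/truck-ads.html?p=3
```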

This will do the trick. But I'd recommend extracting all the links from the pagination list, since Scrapy manages duplicate crawling for you. This does a better job and leaves less room for mistakes:

pages = response.xpath('//ul[@class="pagination"]//a/@href').extract()
for url in pages:
    yield scrapy.Request(url=response.urljoin(url), callback=self.whatever)
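The hrefs in the pagination list are relative, and response.urljoin resolves them against the page's own URL before the requests are scheduled. Its behaviour matches the standard library's urljoin; a quick sketch with a hypothetical base URL:

```python
from urllib.parse import urljoin

# hypothetical absolute URL of the page being parsed
base = "https://example.com/used-truck/1-32/truck-ads.html"
# relative hrefs as extracted from the pagination <ul>
pages = ["/used-truck/1-32/truck-ads.html?p=3",
         "/used-truck/1-32/truck-ads.html?p=12"]
absolute = [urljoin(base, p) for p in pages]
print(absolute[0])  # https://example.com/used-truck/1-32/truck-ads.html?p=3
```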

And so on.. 等等..

Try this:

next_page_url = response.selector.xpath('//*[@class="text-center"]/ul/li/a[@class="cursor"]/@data-href').extract_first()
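Note that @class="cursor" matches both chevron arrows, so extract_first() here still returns the previous-page link. One way to pin down the next arrow is a nested predicate on the child icon; in the full XPath 1.0 that Scrapy supports this would be //a[i[contains(@class, "fa-chevron-right")]]/@data-href. A sketch of the same idea with only the standard library (xml.etree implements a subset of XPath, so it matches the exact class string and steps up to the parent with ..):

```python
import xml.etree.ElementTree as ET

# the list item from the question
li = ET.fromstring(
    '<li><a class="cursor" data-type="js" data-href="test">'
    '<i class="fa fa-chevron-right" aria-hidden="true"></i></a></li>'
)
# locate the chevron-right icon, then step up to its parent <a>
link = li.find('.//i[@class="fa fa-chevron-right"]/..')
print(link.get('data-href'))  # test
```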

I'd suggest you first make sure that your element exists in the initial HTML:

just press Ctrl+U in Chrome and then Ctrl+F to find the element.

If the element can be found there, something is wrong with your XPath selector. Otherwise, the element is generated by JavaScript and you have to use another way to get the data.

PS. You shouldn't use the Chrome DevTools "Elements" tab to check whether an element exists, because that tab shows the DOM with JS already applied. So check the source only (Ctrl+U).

