
Xpath or css selector - scrapy

I'm trying to select a "next" navigation link and can't seem to find the right selector combination in Scrapy.

This is the web URL: the search page on the boat listing site

The link I'm trying to select is this tag:

<a rel="nofollow" class="icon-chevron-right " href="/boats-for-sale/condition-used/type-power/class-power-sport-fishing/?year=2006-2014&amp;length=40-65&amp;page=2"><span class="aria-fixes">2</span></a>

I've tried many combinations of response.xpath and response.css selectors but can't seem to find the right one.

Using the Google Chrome inspector, I get this XPath: //*[@id="root"]/div[2]/div[2]/div[2]/div/div[3]/a[9]

Ultimately, I'm trying to get the href attribute of the tag, which contains the URL I want to follow.

Am I running into problems with the rel='nofollow' attribute and a Scrapy setting?

EDIT - this code used to work, but now I get an error on the CSS selector:

def parse(self, response):
    listing_objs = response.xpath("//div[@class='listings-container']/a")
    for listing in listing_objs:
        yield response.follow(listing.attrib['href'], callback=self.parse_detail)

    next_page = response.css("a.icon-chevron-right").attrib['href']

    if next_page is not None:
        yield response.follow(next_page, callback=self.parse)

In this case you can access any page of the website by adding &page=# at the end of the URL; this approach will reach the next page's content after the current page has been crawled.
For instance, you can do something like this:

def start_requests(self):
    main_url = "https://www.yachtworld.com/boats-for-sale/condition-used/type-power" \
        "/class-power-sport-fishing/?year=2006-2014&length=40-65&page=%(page)s"
    # pages: the total number of result pages you want to crawl
    for i in range(1, pages + 1):
        yield scrapy.Request(main_url % {'page': i}, callback=self.parse)
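The template-formatting step above can be sketched in plain Python without Scrapy; `total_pages` here is just a placeholder for however many result pages you decide to crawl:

```python
# Sketch of the page-URL templating used in the answer above.
# total_pages is a made-up value for illustration.
main_url = (
    "https://www.yachtworld.com/boats-for-sale/condition-used/type-power"
    "/class-power-sport-fishing/?year=2006-2014&length=40-65&page=%(page)s"
)

total_pages = 3
urls = [main_url % {"page": i} for i in range(1, total_pages + 1)]
for u in urls:
    print(u)
```

Each generated URL differs only in its trailing `page=` value, so every request lands on a distinct result page.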

@Piron's answer above is probably the easiest way to iterate over the pages, but should you still want to go the XPath route:

response.xpath(".//div[@class='search-page-nav']/a[contains(@class, 'icon-chevron-right')]/@href").get()

Where search-page-nav is the parent div class of the other page links, icon-chevron-right is the class of the a tag you're looking for, and @href selects that tag's href attribute, which is already returned as a string (no text() call is needed on an attribute node). Note that the class attribute in the page source has a trailing space ("icon-chevron-right "), so contains(@class, ...) is more robust than an exact @class match.
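To illustrate what the extracted value should look like, here is a stdlib-only sketch that pulls the href out of the sample anchor from the question; in Scrapy itself, response.xpath(".../@href").get() returns the same string directly, so this is just for inspection outside a spider:

```python
from html.parser import HTMLParser

# The sample "next" anchor from the question, verbatim.
SAMPLE = (
    '<a rel="nofollow" class="icon-chevron-right " '
    'href="/boats-for-sale/condition-used/type-power/'
    'class-power-sport-fishing/?year=2006-2014&amp;length=40-65&amp;page=2">'
    '<span class="aria-fixes">2</span></a>'
)

class NextLinkParser(HTMLParser):
    """Grab the href of an <a> whose class mentions icon-chevron-right."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Substring match, because the real class attribute carries a
        # trailing space: "icon-chevron-right "
        if tag == "a" and "icon-chevron-right" in attrs.get("class", ""):
            self.href = attrs.get("href")

parser = NextLinkParser()
parser.feed(SAMPLE)
print(parser.href)
```

Note that HTMLParser unescapes entity references in attribute values, so the `&amp;` in the source comes back as a plain `&` in the extracted URL, ready to pass to response.follow.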

