
Xpath or css selector - scrapy

I'm trying to select a "next" navigation link and can't seem to find the right selector combination in Scrapy.

This is the web URL: the search page on the boat listing site

The link I'm trying to select is this tag:

<a rel="nofollow" class="icon-chevron-right " href="/boats-for-sale/condition-used/type-power/class-power-sport-fishing/?year=2006-2014&amp;length=40-65&amp;page=2"><span class="aria-fixes">2</span></a>

I've tried many combinations of response.xpath and response.css selectors but can't seem to find the right one.

Using the Google Chrome inspector, I get this XPath: //*[@id="root"]/div[2]/div[2]/div[2]/div/div[3]/a[9]

Ultimately, I'm trying to get the href attribute of the tag, which contains the URL I want to follow.

Am I running into problems with the rel='nofollow' attribute and a Scrapy setting?

EDIT - this code used to work, but now I get an error on the CSS selector:

def parse(self, response):
    listing_objs = response.xpath("//div[@class='listings-container']/a")
    for listing in listing_objs:
        yield response.follow(listing.attrib['href'], callback=self.parse_detail)

    next_page = response.css("a.icon-chevron-right").attrib['href']

    if next_page is not None:
        yield response.follow(next_page, callback=self.parse)

In this case you can access any page of the website by adding &page=# at the end of the URL; this approach will reach the next page's content after the current page has been crawled.
For instance, you can do something like this:

def start_requests(self):
    main_url = "https://www.yachtworld.com/boats-for-sale/condition-used/type-power" \
        "/class-power-sport-fishing/?year=2006-2014&length=40-65&page=%(page)s"
    # pages: the total number of result pages you want to crawl
    for i in range(1, pages + 1):
        yield scrapy.Request(main_url % {'page': i}, callback=self.parse)
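The template-formatting step above can be sketched in plain Python without Scrapy; `total_pages` here is just a placeholder for however many result pages you decide to crawl:

```python
# Sketch of the page-URL templating used in the answer above.
# total_pages is a made-up value for illustration.
main_url = (
    "https://www.yachtworld.com/boats-for-sale/condition-used/type-power"
    "/class-power-sport-fishing/?year=2006-2014&length=40-65&page=%(page)s"
)

total_pages = 3
urls = [main_url % {"page": i} for i in range(1, total_pages + 1)]
for u in urls:
    print(u)
```

Each generated URL differs only in its trailing `page=` value, so every request lands on a distinct result page.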

@Piron's answer above is probably the easiest way to iterate over the pages, but should you still want to go the XPath route:

response.xpath(".//div[@class='search-page-nav']/a[contains(@class, 'icon-chevron-right')]/@href").get()

Where search-page-nav is the parent div class of the other page links, icon-chevron-right is the class of the a tag you're looking for, and @href selects that tag's href attribute, which is already returned as a string (no text() call is needed on an attribute node). Note that the class attribute in the page source has a trailing space ("icon-chevron-right "), so contains(@class, ...) is more robust than an exact @class match.
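To illustrate what the extracted value should look like, here is a stdlib-only sketch that pulls the href out of the sample anchor from the question; in Scrapy itself, response.xpath(".../@href").get() returns the same string directly, so this is just for inspection outside a spider:

```python
from html.parser import HTMLParser

# The sample "next" anchor from the question, verbatim.
SAMPLE = (
    '<a rel="nofollow" class="icon-chevron-right " '
    'href="/boats-for-sale/condition-used/type-power/'
    'class-power-sport-fishing/?year=2006-2014&amp;length=40-65&amp;page=2">'
    '<span class="aria-fixes">2</span></a>'
)

class NextLinkParser(HTMLParser):
    """Grab the href of an <a> whose class mentions icon-chevron-right."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Substring match, because the real class attribute carries a
        # trailing space: "icon-chevron-right "
        if tag == "a" and "icon-chevron-right" in attrs.get("class", ""):
            self.href = attrs.get("href")

parser = NextLinkParser()
parser.feed(SAMPLE)
print(parser.href)
```

Note that HTMLParser unescapes entity references in attribute values, so the `&amp;` in the source comes back as a plain `&` in the extracted URL, ready to pass to response.follow.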

