
Can't scrape next page contents using Scrapy

I want to scrape the contents from the next pages too, but the spider never goes to the next page. My code is:

import scrapy

class AggregatorSpider(scrapy.Spider):
    name = 'aggregator'
    allowed_domains = ['startech.com.bd/component/processor']
    start_urls = ['https://startech.com.bd/component/processor']

    def parse(self, response):
        processor_details = response.xpath('//*[@class="col-xs-12 col-md-4 product-layout grid"]')
        for processor in processor_details:
            name = processor.xpath('.//h4/a/text()').extract_first()
            price = processor.xpath('.//*[@class="price space-between"]/span/text()').extract_first()
            print('\n')
            print(name)
            print(price)
            print('\n')
        next_page_url = response.xpath('//*[@class="pagination"]/li/a/@href').extract_first()
        # absolute_next_page_url = response.urljoin(next_page_url)
        yield scrapy.Request(next_page_url)

I didn't use urljoin because next_page_url already gives me the whole URL. I also tried the dont_filter=True argument on the yielded Request, which gives me an infinite loop through the 1st page. The message I'm getting from the terminal is:

[scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.startech.com.bd': <GET https://www.startech.com.bd/component/processor?page=2>
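Two quick standard-library checks (with the URLs taken from the log above; variable names are my own) illustrate what is happening, assuming the default offsite filtering compares hostnames against allowed_domains:

```python
from urllib.parse import urljoin, urlparse

page1 = 'https://www.startech.com.bd/component/processor'
page2 = 'https://www.startech.com.bd/component/processor?page=2'

# urljoin is a no-op when the second argument is already an absolute URL,
# which is why skipping response.urljoin() made no difference here:
print(urljoin(page1, page2))  # prints the page-2 URL unchanged

# The offsite check works on the request's hostname. A hostname never
# contains a path, so an allowed_domains entry that includes
# '/component/processor' can never match it:
print(urlparse(page2).hostname)  # www.startech.com.bd
```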

This is because your allowed_domains variable is wrong: entries there should be bare domain names with no scheme or path, so use allowed_domains = ['www.startech.com.bd'] instead (see the docs).

You can also modify your next-page selector to avoid going back to page one:

import scrapy
class AggregatorSpider(scrapy.Spider):
    name = 'aggregator'
    allowed_domains = ['www.startech.com.bd']
    start_urls = ['https://startech.com.bd/component/processor']

    def parse(self, response):
        processor_details = response.xpath('//*[@class="col-xs-12 col-md-4 product-layout grid"]')
        for processor in processor_details:
            name = processor.xpath('.//h4/a/text()').extract_first()
            price = processor.xpath('.//*[@class="price space-between"]/span/text()').extract_first()
            yield({'name': name, 'price': price})
        next_page_url = response.css('.pagination li:last-child a::attr(href)').extract_first()
        if next_page_url:
            yield scrapy.Request(next_page_url)
