yield scrapy.Request() not working properly for crawling next page
The same code works for a different site, but not for this one.
Site: http://quotes.toscrape.com/
It gives no error and successfully crawls 8 pages (or however many pages count allows):
import scrapy
# Number of pages to crawl before stopping
count = 8


class QuotesSpiderSpider(scrapy.Spider):
    name = 'quotes_spider'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.xpath('//*[@class="quote"]')
        for quote in quotes:
            text = quote.xpath('.//*[@class="text"]/text()').extract_first()
            author = quote.xpath('.//*[@class="author"]/text()').extract_first()
            yield {
                'Text': text,
                'Author': author
            }

        # Follow the "next" link until the page budget is used up
        global count
        count = count - 1
        if count > 0:
            next_page = response.xpath('//*[@class="next"]/a/@href').extract_first()
            absolute_next_page = response.urljoin(next_page)
            yield scrapy.Request(absolute_next_page)
But for this site it crawls only the first page.
Site: https://www.goodreads.com/list/show/7
# -*- coding: utf-8 -*-
import scrapy

# Number of pages to crawl before stopping
count = 5


class BooksSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ["goodreads.com/list/show/7"]
    start_urls = ["https://goodreads.com/list/show/7/"]

    def parse(self, response):
        # Each book sits in the third cell of a table row
        books = response.xpath('//tr/td[3]')
        for book in books:
            bookTitle = book.xpath('.//*[@class="bookTitle"]/span/text()').extract_first()
            authorName = book.xpath('.//*[@class="authorName"]/span/text()').extract_first()
            yield {
                'BookTitle': bookTitle,
                'AuthorName': authorName
            }

        # Follow the "next_page" link until the page budget is used up
        global count
        count = count - 1
        if count > 0:
            next_page_url = response.xpath('//*[@class="pagination"]/a[@class="next_page"]/@href').extract_first()
            absolute_next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=absolute_next_page_url)
I want to crawl either a limited number of pages or all pages of the second site.
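You are using a domain with a path in allowed_domains. Scrapy compares the entries of allowed_domains against the host name of each request URL, so an entry that contains a path never matches and the request for the next page is dropped as offsite.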
allowed_domains = ["goodreads.com/list/show/7"]
should be
allowed_domains = ["goodreads.com"]
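To see why the entry with a path filters out every follow-up request, here is a minimal sketch of a host-based check using only the standard library. It is a simplified approximation of the idea, not Scrapy's actual code: the helper names build_host_regex and is_allowed are hypothetical, and the ?page=2 URL is an assumed example of a next-page link.

```python
import re
from urllib.parse import urlparse


def build_host_regex(allowed_domains):
    """Build a pattern matching an allowed domain or any of its subdomains."""
    escaped = "|".join(re.escape(d) for d in allowed_domains)
    return re.compile(rf"^(.*\.)?({escaped})$")


def is_allowed(url, allowed_domains):
    """Return True if the URL's host name is covered by allowed_domains."""
    # Only the host name is compared; scheme, path and query are ignored.
    host = urlparse(url).hostname or ""
    return bool(build_host_regex(allowed_domains).search(host))


# Assumed example of a next-page URL (hypothetical, for illustration only)
next_page = "https://www.goodreads.com/list/show/7?page=2"

# Entry with a path: the host "www.goodreads.com" can never match it.
print(is_allowed(next_page, ["goodreads.com/list/show/7"]))  # False

# Bare domain: matches the host and its subdomains.
print(is_allowed(next_page, ["goodreads.com"]))  # True
```

Because only the host name takes part in the comparison, restricting a crawl to one section of a site has to be done another way, for example by checking the URL inside parse before yielding the request.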