next page crawl in Scrapy
I am trying to get some data from a website, but my spider is not crawling to the next page, even though the pagination link is correct.
import scrapy


class NspiderSpider(scrapy.Spider):
    name = "nspider"
    allowed_domains = ["elimelechlab.yale.edu/"]
    start_urls = ["https://elimelechlab.yale.edu/pub"]

    def parse(self, response):
        title = response.xpath(
            '//*[@class="views-field views-field-title"]/span/text()'
        ).extract()
        doi_link = response.xpath(
            '//*[@class="views-field views-field-field-doi-link"]//a[1]/@href'
        ).extract()
        yield {"paper_title": title, "doi_link": doi_link}

        next_page = response.xpath(
            '//*[@title="Go to next page"]/@href'
        ).extract_first()  # extracting next page link
        if next_page:
            yield scrapy.Request(url=response.urljoin(next_page), callback=self.parse)
PS: I do not want to use LinkExtractor. Any help would be appreciated.
There is nothing wrong with your next_page logic; the code simply never reaches that point, because the yield of the item is at the same indentation level. Try the following approach:
import scrapy


class NspiderSpider(scrapy.Spider):
    name = "nspider"
    allowed_domains = ["elimelechlab.yale.edu"]
    start_urls = ["https://elimelechlab.yale.edu/pub"]

    def parse(self, response):
        for view in response.css('div.views-row'):
            yield {
                'paper_title': view.css('div.views-field-title span.field-content::text').get(),
                'doi_link': view.css('div.views-field-field-doi-link div.field-content a::attr(href)').get()
            }

        next_page = response.xpath(
            '//*[@title="Go to next page"]/@href'
        ).extract_first()  # extracting next page link
        if next_page:
            yield scrapy.Request(url=response.urljoin(next_page), callback=self.parse)
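A side note on `response.urljoin`: the next-page href extracted from the pager is normally a relative URL, and `response.urljoin` resolves it against the URL of the page it was found on, the same way the standard-library `urllib.parse.urljoin` does. A minimal sketch of that resolution (the `?page=N` values here are an assumption about how the site paginates, not taken from the question):

```python
from urllib.parse import urljoin

# Resolve a relative pagination href against the page it was found on,
# which is what response.urljoin does with the spider's current URL.
base = "https://elimelechlab.yale.edu/pub"
print(urljoin(base, "?page=1"))              # a query-only href replaces the query string
print(urljoin(base + "?page=1", "?page=2"))  # each page resolves the link to the next
```

In Scrapy 1.4+ you can also write `yield response.follow(next_page, callback=self.parse)`, which accepts relative URLs directly and does the join for you.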