Response 405 from Scrapy
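I'm trying to scrape author data from http://quotes.toscrape.com/, but unfortunately, when I run the spider, the author pages return 405, whereas in the browser or in the Scrapy shell the same pages return 200.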
import datetime

import scrapy


class AuthorsSpider(scrapy.Spider):
    name = 'authors'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']
    custom_settings = {
        'CONCURRENT_REQUESTS': 50,
        'DOWNLOAD_DELAY': 0.1,
        'FEED_URI': f'output/authors_{datetime.datetime.today().strftime("%Y-%m-%d %H-%M-%S")}.csv',
        'FEED_FORMAT': 'csv',
        'FEED_EXPORTERS': {'csv': 'scrapy.exporters.CsvItemExporter'},
        'FEED_EXPORT_ENCODING': 'utf-8',
        'FEED_EXPORT_FIELDS': ('name', 'birth_date', 'birth_location', 'description',)
    }

    def parse(self, response):
        for _ in response.xpath("//div[@class='quote']"):
            author_page = response.xpath("//a[text()='(about)']/@href").get()
            yield response.follow(author_page,
                                  method="GET",
                                  callback=self.parse_author)
        next_page = response.xpath("//li[@class='next']/a/@href").get()
        if next_page:
            yield response.follow(next_page, self.parse)

    def parse_author(self, response):
        yield {
            'name': response.xpath("//h3[@class='author-title']/text()").get(),
            'birth_date': response.xpath("//span[@class='author-born-date']/text()").get(),
            'birth_location': response.xpath("//span[@class='author-born-location']/text()").get(),
            'description': response.xpath("//div[@class='author-description']/text()").get()
        }
This is part of the output when I run scrapy crawl authors:
2023-01-02 10:53:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/10/> (referer: http://quotes.toscrape.com/page/9/)
2023-01-02 10:53:33 [scrapy.core.engine] DEBUG: Crawled (405) <NONE http://quotes.toscrape.com/author/Suzanne-Collins/> (referer: http://quotes.toscrape.com/page/7/)
2023-01-02 10:53:34 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <405 http://quotes.toscrape.com/author/Suzanne-Collins/>: HTTP status code is not handled or not allowed
2023-01-02 10:53:34 [scrapy.core.engine] DEBUG: Crawled (405) <NONE http://quotes.toscrape.com/author/W-C-Fields/> (referer: http://quotes.toscrape.com/page/8/)
2023-01-02 10:53:34 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (308) to <NONE http://quotes.toscrape.com/author/John-Lennon/> from <GET http://quotes.toscrape.com/author/John-Lennon>
2023-01-02 10:53:34 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <405 http://quotes.toscrape.com/author/W-C-Fields/>: HTTP status code is not handled or not allowed
2023-01-02 10:53:34 [scrapy.core.engine] DEBUG: Crawled (405) <NONE http://quotes.toscrape.com/author/Alfred-Tennyson/> (referer: http://quotes.toscrape.com/page/8/)
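To make the pattern in the log easier to see, here is a minimal sketch that pulls the status, HTTP method, and URL out of the "Crawled" lines. The helper name crawled_requests and the regular expression are illustrative assumptions, not part of Scrapy; the sample lines are copied from the output above. It shows that the 405 responses (HTTP "Method Not Allowed") belong to requests whose method is logged as NONE rather than GET.

```python
import re

# Matches Scrapy engine lines such as:
#   ... DEBUG: Crawled (405) <NONE http://host/path/> (referer: ...)
# Hypothetical pattern written for this log format only.
CRAWLED = re.compile(r"Crawled \((\d{3})\) <(\S+) (\S+)>")


def crawled_requests(log_text):
    """Return (status, method, url) for every 'Crawled' line in log_text."""
    return [(int(status), method, url)
            for status, method, url in CRAWLED.findall(log_text)]


sample_log = """\
2023-01-02 10:53:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/10/> (referer: http://quotes.toscrape.com/page/9/)
2023-01-02 10:53:33 [scrapy.core.engine] DEBUG: Crawled (405) <NONE http://quotes.toscrape.com/author/Suzanne-Collins/> (referer: http://quotes.toscrape.com/page/7/)
"""

for status, method, url in crawled_requests(sample_log):
    print(status, method, url)
```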
Basically, the suggestion is to yield the author request with scrapy.Request() instead of response.follow(), passing parse_author as the callback. Note that response.follow() also accepts a callback and resolves relative URLs itself, whereas scrapy.Request() requires an absolute URL, so the relative href has to go through response.urljoin() first. If you want to pass the author page URL to parse_author, your code should look like this:

import datetime

import scrapy


class AuthorsSpider(scrapy.Spider):
    name = 'authors'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']
    custom_settings = {
        'CONCURRENT_REQUESTS': 50,
        'DOWNLOAD_DELAY': 0.1,
        'FEED_URI': f'output/authors_{datetime.datetime.today().strftime("%Y-%m-%d %H-%M-%S")}.csv',
        'FEED_FORMAT': 'csv',
        'FEED_EXPORTERS': {'csv': 'scrapy.exporters.CsvItemExporter'},
        'FEED_EXPORT_ENCODING': 'utf-8',
        'FEED_EXPORT_FIELDS': ('name', 'birth_date', 'birth_location', 'description',)
    }

    def parse(self, response):
        # Query relative to each quote so every author link is followed,
        # not only the first "(about)" link on the page.
        for quote in response.xpath("//div[@class='quote']"):
            author_page = quote.xpath(".//a[text()='(about)']/@href").get()
            # scrapy.Request() needs an absolute URL, so resolve the relative href.
            yield scrapy.Request(response.urljoin(author_page),
                                 method="GET",
                                 callback=self.parse_author)
        next_page = response.xpath("//li[@class='next']/a/@href").get()
        if next_page:
            yield response.follow(next_page, self.parse)

    def parse_author(self, response):
        yield {
            'name': response.xpath("//h3[@class='author-title']/text()").get(),
            'birth_date': response.xpath("//span[@class='author-born-date']/text()").get(),
            'birth_location': response.xpath("//span[@class='author-born-location']/text()").get(),
            'description': response.xpath("//div[@class='author-description']/text()").get()
        }
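The log in the question shows a 308 redirect from /author/John-Lennon to /author/John-Lennon/, and the redirected request is the one logged with method NONE. As a minimal sketch using only the standard library, the hypothetical helper absolute_author_url below resolves a relative href against the page URL (the same kind of resolution response.urljoin() performs) and appends the trailing slash, so the request can go straight to the final URL. Whether skipping the redirect avoids the 405 is an assumption drawn from the log, not something the source confirms; the helper also assumes the href carries no query string or fragment.

```python
from urllib.parse import urljoin


def absolute_author_url(page_url, href):
    """Resolve href against page_url and make sure the result ends with '/'.

    Hypothetical helper for illustration; assumes href has no query or fragment.
    """
    url = urljoin(page_url, href)
    return url if url.endswith("/") else url + "/"


print(absolute_author_url("http://quotes.toscrape.com/page/8/", "/author/W-C-Fields"))
```

Inside the spider, the same idea would amount to adding the trailing slash to response.urljoin(author_page) before building the request.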