
response 405 from Scrapy

I was trying to scrape the author data from http://quotes.toscrape.com/, but unfortunately the author pages return 405 when I run the spider, whereas in the browser, or when fetching the URL in the Scrapy shell, they return 200.

import datetime

import scrapy


class AuthorsSpider(scrapy.Spider):
    name = 'authors'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']
    custom_settings = {
        'CONCURRENT_REQUESTS': 50,
        'DOWNLOAD_DELAY': 0.1,
        'FEED_URI': f'output/authors_{datetime.datetime.today().strftime("%Y-%m-%d %H-%M-%S")}.csv',
        'FEED_FORMAT': 'csv',
        'FEED_EXPORTERS': {'csv': 'scrapy.exporters.CsvItemExporter'},
        'FEED_EXPORT_ENCODING': 'utf-8',
        'FEED_EXPORT_FIELDS': ('name','birth_date','birth_location','description',) 
    }

    def parse(self, response):
        for _ in response.xpath("//div[@class='quote']"):
            author_page = response.xpath("//a[text()='(about)']/@href").get()
            yield response.follow(author_page,
                                method="GET",
                                callback=self.parse_author)

        next_page = response.xpath("//li[@class='next']/a/@href").get()
        if next_page:
            yield response.follow(next_page, self.parse)


    def parse_author(self, response):
        yield {
            'name': response.xpath("//h3[@class='author-title']/text()").get(),
            'birth_date': response.xpath("//span[@class='author-born-date']/text()").get(),
            'birth_location': response.xpath("//span[@class='author-born-location']/text()").get(),
            'description': response.xpath("//div[@class='author-description']/text()").get()
        }

Here is part of the output when I run scrapy crawl authors:

2023-01-02 10:53:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/10/> (referer: http://quotes.toscrape.com/page/9/)
2023-01-02 10:53:33 [scrapy.core.engine] DEBUG: Crawled (405) <NONE http://quotes.toscrape.com/author/Suzanne-Collins/> (referer: http://quotes.toscrape.com/page/7/)
2023-01-02 10:53:34 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <405 http://quotes.toscrape.com/author/Suzanne-Collins/>: HTTP status code is not handled or not allowed
2023-01-02 10:53:34 [scrapy.core.engine] DEBUG: Crawled (405) <NONE http://quotes.toscrape.com/author/W-C-Fields/> (referer: http://quotes.toscrape.com/page/8/)
2023-01-02 10:53:34 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (308) to <NONE http://quotes.toscrape.com/author/John-Lennon/> from <GET http://quotes.toscrape.com/author/John-Lennon>
2023-01-02 10:53:34 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <405 http://quotes.toscrape.com/author/W-C-Fields/>: HTTP status code is not handled or not allowed
2023-01-02 10:53:34 [scrapy.core.engine] DEBUG: Crawled (405) <NONE http://quotes.toscrape.com/author/Alfred-Tennyson/> (referer: http://quotes.toscrape.com/page/8/)
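One way to read output like this is to tally the status code and request method of each "Crawled" line. The sketch below uses only the standard library. The regular expression and the name summarize_crawl_log are illustrative assumptions and not part of Scrapy; the two sample lines are copied from the output above.

```python
import re
from collections import Counter

# Matches Scrapy engine lines of the form:
#   ... DEBUG: Crawled (405) <NONE http://host/path/> (referer: ...)
CRAWLED_RE = re.compile(r"Crawled \((\d{3})\) <(\S+) (\S+)>")


def summarize_crawl_log(lines):
    """Count (status, method) pairs across 'Crawled' log lines."""
    counts = Counter()
    for line in lines:
        match = CRAWLED_RE.search(line)
        if match:
            status, method, _url = match.groups()
            counts[(int(status), method)] += 1
    return counts


sample = [
    "2023-01-02 10:53:33 [scrapy.core.engine] DEBUG: Crawled (200) "
    "<GET http://quotes.toscrape.com/page/10/> "
    "(referer: http://quotes.toscrape.com/page/9/)",
    "2023-01-02 10:53:34 [scrapy.core.engine] DEBUG: Crawled (405) "
    "<NONE http://quotes.toscrape.com/author/W-C-Fields/> "
    "(referer: http://quotes.toscrape.com/page/8/)",
]
summary = summarize_crawl_log(sample)
for (status, method), count in sorted(summary.items()):
    print(status, method, count)
```

In the excerpt above, every 405 (Method Not Allowed) line shows the method as NONE, while the 200 line shows GET.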

Basically, try building the author-page requests with scrapy.Request() rather than response.follow(), and pass parse_author as the callback. If you want to pass the author page URLs to parse_author, your code should look like this:

import datetime

import scrapy


class AuthorsSpider(scrapy.Spider):
    name = 'authors'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']
    custom_settings = {
        'CONCURRENT_REQUESTS': 50,
        'DOWNLOAD_DELAY': 0.1,
        'FEED_URI': f'output/authors_{datetime.datetime.today().strftime("%Y-%m-%d %H-%M-%S")}.csv',
        'FEED_FORMAT': 'csv',
        'FEED_EXPORTERS': {'csv': 'scrapy.exporters.CsvItemExporter'},
        'FEED_EXPORT_ENCODING': 'utf-8',
        'FEED_EXPORT_FIELDS': ('name','birth_date','birth_location','description',) 
    }

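    def parse(self, response):
        # Use each quote block as the XPath context so that every author link
        # on the page is followed, not only the first one.
        for quote in response.xpath("//div[@class='quote']"):
            author_page = quote.xpath(".//a[text()='(about)']/@href").get()
            if author_page:
                # scrapy.Request requires an absolute URL, so join the relative href first.
                yield scrapy.Request(response.urljoin(author_page),
                                     method="GET",
                                     callback=self.parse_author)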

        next_page = response.xpath("//li[@class='next']/a/@href").get()
        if next_page:
            yield response.follow(next_page, self.parse)


    def parse_author(self, response):
        yield {
            'name': response.xpath("//h3[@class='author-title']/text()").get(),
            'birth_date': response.xpath("//span[@class='author-born-date']/text()").get(),
            'birth_location': response.xpath("//span[@class='author-born-location']/text()").get(),
            'description': response.xpath("//div[@class='author-description']/text()").get()
        }
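Note that scrapy.Request() only accepts absolute URLs, so a relative href has to be resolved against the page URL before the request is created. Below is a minimal sketch of that joining step using only the standard library. The helper name absolute_author_url is hypothetical, and the relative form of the href is an assumption based on the redirect line in the log. Inside a spider, response.urljoin(href) does the same job.

```python
from urllib.parse import urljoin


def absolute_author_url(page_url, href):
    """Resolve an author link against the URL of the page it was found on."""
    return urljoin(page_url, href)


# Assumed example values: a listing page URL and a relative author href.
page_url = "http://quotes.toscrape.com/page/7/"
href = "/author/Suzanne-Collins"

author_url = absolute_author_url(page_url, href)
print(author_url)  # http://quotes.toscrape.com/author/Suzanne-Collins
```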

(Attached image: screenshot.) If you still have any questions, please reply to this answer. Happy learning.
