
scrapy 503 Service Unavailable on starturl

I modified this spider, but it gives this error:

Gave up retrying <GET https://lib.maplelegends.com/robots.txt> (failed 3 times): 503 Service Unavailable 
2019-01-06 23:43:56 [scrapy.core.engine] DEBUG: Crawled (503) <GET https://lib.maplelegends.com/robots.txt> (referer: None)
2019-01-06 23:43:56 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://lib.maplelegends.com/?p=etc&id=4004003> (failed 1 times): 503 Service Unavailable
2019-01-06 23:43:56 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://lib.maplelegends.com/?p=etc&id=4004003> (failed 2 times): 503 Service Unavailable
2019-01-06 23:43:56 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://lib.maplelegends.com/?p=etc&id=4004003> (failed 3 times): 503 Service Unavailable
2019-01-06 23:43:56 [scrapy.core.engine] DEBUG: Crawled (503) <GET https://lib.maplelegends.com/?p=etc&id=4004003> (referer: None)
2019-01-06 23:43:56 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <503 https://lib.maplelegends.com/?p=etc&id=4004003>: HTTP status code is not handled or not allowed

Spider code:

#!/usr/bin/env python3

import scrapy
import time

start_url = 'https://lib.maplelegends.com/?p=etc&id=4004003'


class MySpider(scrapy.Spider):
    name = 'MySpider'

    start_urls = [start_url]

    def parse(self, response):
        # print('url:', response.url)

        products = response.xpath('.//div[@class="table-responsive"]/table/tbody')

        for product in products:
            item = {
                #'name': product.xpath('./tr/td/b[1]/a/text()').extract(),
                'link': product.xpath('./tr/td/b[1]/a/@href').extract(),
            }

            # url = response.urljoin(item['link'])
            # yield scrapy.Request(url=url, callback=self.parse_product, meta={'item': item})

            # .extract() returns a list, but response.follow() needs a single
            # URL, so follow each extracted link individually
            for link in item['link']:
                yield response.follow(link, callback=self.parse_product, meta={'item': item})

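        # note: time.sleep() blocks the whole Twisted reactor, pausing every
        # concurrent request; Scrapy's DOWNLOAD_DELAY setting is the idiomatic
        # way to throttle requests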
        time.sleep(5)

        # re-crawl the start URL with low priority
        yield scrapy.Request(start_url, dont_filter=True, priority=-1)

    def parse_product(self, response):
        # print('url:', response.url)

        # extract the name list; it is used in the zip() below
        name = response.xpath('(//strong)[1]/text()').re(r'(\w+)')

        hp = response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "image", " " ))] | //img').re(r':(\d+)')

        scrolls = response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "image", " " ))] | //strong+//a//img/@title').re(r'\bScroll\b')

        for n, h, s in zip(name, hp, scrolls):
            yield {'name': n.strip(), 'hp': h.strip(), 'scroll': s.strip()}

--- It runs without a project and saves the output in output.csv ---

from scrapy.crawler import CrawlerRunner

def _run_crawler(spider_cls, settings):
    """
    spider_cls: Scrapy Spider class
    returns: Twisted Deferred
    """
    runner = CrawlerRunner(settings)
    return runner.crawl(spider_cls)     # return Deferred


def test_scrapy_crawler():
    deferred = _run_crawler(MySpider, settings)

    @deferred.addCallback
    def _success(results):
        """
        After crawler completes, this function will execute.
        Do your assertions in this function.
        """

    @deferred.addErrback
    def _error(failure):
        raise failure.value

    return deferred
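For reference, a minimal sketch of how this runner could be driven: the settings object passed to _run_crawler is not shown in the snippet above, so the dict below is an assumption (the FEEDS entry, available in Scrapy >= 2.1, is one way the output.csv mentioned above could be produced), and the reactor boilerplate is standard Twisted usage.

from twisted.internet import reactor

# Assumed settings: not shown in the original snippet. A plain dict works
# with CrawlerRunner; the FEEDS entry is a guess at how output.csv was made.
settings = {
    'ROBOTSTXT_OBEY': False,
    'FEEDS': {'output.csv': {'format': 'csv'}},
}

if __name__ == '__main__':
    d = test_scrapy_crawler()
    # stop the reactor when the crawl finishes, on success or failure
    d.addBoth(lambda _: reactor.stop())
    reactor.run()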

robots.txt

Your crawler is trying to fetch the robots.txt file, but the website does not provide one.

To avoid this, you can set ROBOTSTXT_OBEY to False in your settings.py file.
It defaults to False, but new Scrapy projects generated with the scrapy startproject command include ROBOTSTXT_OBEY = True in the settings.py template.
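For example, in settings.py, or scoped to a single spider via custom_settings (Scrapy's standard per-spider override) — a minimal sketch:

# settings.py
ROBOTSTXT_OBEY = False

or, per spider:

class MySpider(scrapy.Spider):
    name = 'MySpider'
    # per-spider override: skips the robots.txt request entirely
    custom_settings = {'ROBOTSTXT_OBEY': False}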

503 responses

Additionally, the website appears to respond with 503 to every first request; it is using some kind of bot protection:

The first request returns a 503, then some JavaScript runs that makes an AJAX request to generate a __shovlshield cookie:

(screenshot: browser developer tools showing the __shovlshield cookie being set)

It appears that https://shovl.io/ DDoS protection is being used.

To work around it, you would need to reverse-engineer how the JavaScript generates the cookie, or use a JavaScript rendering technology/service such as Selenium or Splash; a sketch of the Selenium approach follows.
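A minimal sketch of the Selenium route, assuming chromedriver is installed: a real browser loads the page once so the challenge JavaScript can set its cookies, and those cookies (including __shovlshield) are then attached to normal Scrapy requests. The ShieldSpider name and the fixed five-second wait are illustrative choices, not part of the original code.

import time

import scrapy
from selenium import webdriver


class ShieldSpider(scrapy.Spider):
    # hypothetical spider name, for illustration only
    name = 'ShieldSpider'
    custom_settings = {'ROBOTSTXT_OBEY': False}

    def start_requests(self):
        driver = webdriver.Chrome()
        driver.get('https://lib.maplelegends.com/?p=etc&id=4004003')
        # crude fixed wait for the challenge JS to run and set cookies;
        # an explicit wait on the cookie itself would be more robust
        time.sleep(5)
        cookies = {c['name']: c['value'] for c in driver.get_cookies()}
        driver.quit()
        yield scrapy.Request(
            'https://lib.maplelegends.com/?p=etc&id=4004003',
            cookies=cookies,
            callback=self.parse,
        )

    def parse(self, response):
        # with the shield cookie attached, the page should now return 200
        self.logger.info('status: %s', response.status)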
