
Scrapy API - Spider class init argument turned to None

After a fresh install of the Miniconda 64-bit exe installer for Windows (with Python 2.7) on Windows 7, Scrapy was installed, giving the following versions:

  • Python 2.7.12
  • Scrapy 1.1.1
  • Twisted 16.4.1

This minimal script, run with "python scrapy_test.py" (it uses the Scrapy API):

#!/usr/bin/env python2.7
# -*- coding: utf-8 -*-

import scrapy.spiders.crawl
import scrapy.crawler
import scrapy.utils.project

class MySpider(scrapy.spiders.crawl.CrawlSpider) :
    name = "stackoverflow.com"
    allowed_domains = ["stackoverflow.com"]
    start_urls = ["http://stackoverflow.com/"]
    download_delay = 1.5

    def __init__(self, my_arg = None) :
        # trace each instantiation and the value the argument receives
        print "def __init__"

        self.my_arg = my_arg
        print "self.my_arg"
        print self.my_arg

    def parse(self, response) :
        pass

def main() :
    my_arg = "Value"

    process = scrapy.crawler.CrawlerProcess(scrapy.utils.project.get_project_settings())
    process.crawl(MySpider(my_arg))
    process.start()

if __name__ == "__main__" :
    main()

gives this output:

[scrapy] INFO: Scrapy 1.1.1 started (bot: scrapy_project)
[scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'scrapy_project.spiders', 'SPIDER_MODULES': ['scrapy_project.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'scrapy_project'}
def __init__
self.my_arg
Value
[scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
def __init__
self.my_arg
None
[...]

Note how the __init__ method ran twice, and how the stored argument was turned to None after the second run, which is not what I want. Is this supposed to happen?

If I change:

def __init__(self, my_arg = None) :

to:

def __init__(self, my_arg) :

the output is:

[...]
Unhandled error in Deferred:
[twisted] CRITICAL: Unhandled error in Deferred:


Traceback (most recent call last):
  File "scrapy_test.py", line 28, in main
    process.crawl(MySpider(my_arg))
  File "C:\Users\XYZ\Miniconda2\lib\site-packages\scrapy\crawler.py", line 163, in crawl
    return self._crawl(crawler, *args, **kwargs)
  File "C:\Users\XYZ\Miniconda2\lib\site-packages\scrapy\crawler.py", line 167, in _crawl
    d = crawler.crawl(*args, **kwargs)
  File "C:\Users\XYZ\Miniconda2\lib\site-packages\twisted\internet\defer.py", line 1331, in unwindGenerator
    return _inlineCallbacks(None, gen, Deferred())
--- <exception caught here> ---
  File "C:\Users\XYZ\Miniconda2\lib\site-packages\twisted\internet\defer.py", line 1185, in _inlineCallbacks
    result = g.send(result)
  File "C:\Users\XYZ\Miniconda2\lib\site-packages\scrapy\crawler.py", line 90, in crawl
    six.reraise(*exc_info)
  File "C:\Users\XYZ\Miniconda2\lib\site-packages\scrapy\crawler.py", line 71, in crawl
    self.spider = self._create_spider(*args, **kwargs)
  File "C:\Users\XYZ\Miniconda2\lib\site-packages\scrapy\crawler.py", line 94, in _create_spider
    return self.spidercls.from_crawler(self, *args, **kwargs)
  File "C:\Users\XYZ\Miniconda2\lib\site-packages\scrapy\spiders\crawl.py", line 96, in from_crawler
    spider = super(CrawlSpider, cls).from_crawler(crawler, *args, **kwargs)
  File "C:\Users\XYZ\Miniconda2\lib\site-packages\scrapy\spiders\__init__.py", line 50, in from_crawler
    spider = cls(*args, **kwargs)
exceptions.TypeError: __init__() takes exactly 2 arguments (1 given)
[twisted] CRITICAL:
Traceback (most recent call last):
  File "C:\Users\XYZ\Miniconda2\lib\site-packages\twisted\internet\defer.py", line 1185, in _inlineCallbacks
    result = g.send(result)
  File "C:\Users\XYZ\Miniconda2\lib\site-packages\scrapy\crawler.py", line 90, in crawl
    six.reraise(*exc_info)
  File "C:\Users\XYZ\Miniconda2\lib\site-packages\scrapy\crawler.py", line 71, in crawl
    self.spider = self._create_spider(*args, **kwargs)
  File "C:\Users\XYZ\Miniconda2\lib\site-packages\scrapy\crawler.py", line 94, in _create_spider
    return self.spidercls.from_crawler(self, *args, **kwargs)
  File "C:\Users\XYZ\Miniconda2\lib\site-packages\scrapy\spiders\crawl.py", line 96, in from_crawler
    spider = super(CrawlSpider, cls).from_crawler(crawler, *args, **kwargs)
  File "C:\Users\XYZ\Miniconda2\lib\site-packages\scrapy\spiders\__init__.py", line 50, in from_crawler
    spider = cls(*args, **kwargs)
TypeError: __init__() takes exactly 2 arguments (1 given)

I don't know how to work around this problem. Any ideas?

Here is the method definition for scrapy.crawler.CrawlerProcess.crawl():

crawl(crawler_or_spidercls, *args, **kwargs)

  • crawler_or_spidercls (Crawler instance, Spider subclass or string) – an already created crawler, or a spider class or spider's name inside the project, to create it
  • args (list) – arguments to initialize the spider
  • kwargs (dict) – keyword arguments to initialize the spider
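
To see why passing an instance made __init__ run twice, it helps to look at what crawl() does with its first argument. Here is a condensed sketch paraphrasing the relevant parts of Scrapy 1.1 (scrapy/crawler.py and scrapy/spiders/__init__.py) - simplified, not the exact source:

# Condensed paraphrase of Scrapy 1.1 internals (simplified, not the exact source).

class Crawler(object):
    def __init__(self, spidercls, settings=None):
        # If crawl() was given MySpider(my_arg), that *instance* lands here,
        # where a class is expected.
        self.spidercls = spidercls

    def crawl(self, *args, **kwargs):
        # from_crawler is a classmethod, so calling it through an instance
        # still resolves to the class -- which gets instantiated a second
        # time, with only the args/kwargs passed to crawl() (here: none).
        self.spider = self.spidercls.from_crawler(self, *args, **kwargs)

class Spider(object):
    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = cls(*args, **kwargs)  # second __init__ call; my_arg defaults to None
        spider._set_crawler(crawler)
        return spider

That is why the first __init__ call (your own MySpider(my_arg)) printed Value, while the second one, made by Scrapy itself, printed None.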

This means you should pass your Spider class (or its name), separately from the kwargs needed to initialize it, like so:

process.crawl(MySpider, my_arg='Value')
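
For completeness, here is how the question's script could look with that fix (a sketch; only __init__ and main() change). Since the spider subclasses CrawlSpider, its custom __init__ should also chain to super(), which among other things compiles the crawling rules:

import scrapy.spiders.crawl
import scrapy.crawler
import scrapy.utils.project

class MySpider(scrapy.spiders.crawl.CrawlSpider) :
    name = "stackoverflow.com"
    allowed_domains = ["stackoverflow.com"]
    start_urls = ["http://stackoverflow.com/"]
    download_delay = 1.5

    def __init__(self, my_arg = None, *args, **kwargs) :
        # Chain to CrawlSpider.__init__, which compiles the crawling rules.
        super(MySpider, self).__init__(*args, **kwargs)
        self.my_arg = my_arg

    def parse(self, response) :
        pass

def main() :
    process = scrapy.crawler.CrawlerProcess(scrapy.utils.project.get_project_settings())
    # Pass the spider *class*; Scrapy calls MySpider.from_crawler(crawler,
    # my_arg='Value'), so __init__ runs exactly once and keeps the value.
    process.crawl(MySpider, my_arg='Value')
    process.start()

if __name__ == "__main__" :
    main()

Run this way, "def __init__" is printed exactly once and self.my_arg keeps its value.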
