繁体   English   中英

从Python脚本将参数传递给Scrapy Spider

[英]Passing Argument to Scrapy Spider from Python Script

我只提到我在发布这个问题之前提到的一些问题(在发布这个问题之前,我目前没有链接到我提到过的所有问题) - :

我可以完全运行此代码,如果我没有传递参数并要求用户从BBSpider类输入(没有主函数 - 在name =“dmoz”行下方),或者将它们作为预定义(即静态)参数。

我的代码在这里

我基本上试图从Python脚本执行Scrapy蜘蛛而不需要任何其他文件(甚至是设置文件)。 这就是为什么我在代码本身内部也指定了设置。

这是我执行此脚本时的输出 - :

http://bigbasket.com/ps/?q=apple
2015-06-26 12:12:34 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-06-26 12:12:34 [scrapy] INFO: Optional features available: ssl, http11
2015-06-26 12:12:34 [scrapy] INFO: Overridden settings: {}
2015-06-26 12:12:35 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
None
2015-06-26 12:12:35 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-06-26 12:12:35 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-06-26 12:12:35 [scrapy] INFO: Enabled item pipelines: 
2015-06-26 12:12:35 [scrapy] INFO: Spider opened
2015-06-26 12:12:35 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-06-26 12:12:35 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-06-26 12:12:35 [scrapy] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/engine.py", line 110, in _next_request
    request = next(slot.start_requests)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 70, in start_requests
    yield self.make_requests_from_url(url)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 73, in make_requests_from_url
    return Request(url, dont_filter=True)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 24, in __init__
    self._set_url(url)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 57, in _set_url
    raise TypeError('Request url must be str or unicode, got %s:' % type(url).__name__)
TypeError: Request url must be str or unicode, got NoneType:
2015-06-26 12:12:35 [scrapy] INFO: Closing spider (finished)
2015-06-26 12:12:35 [scrapy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 6, 26, 6, 42, 35, 342543),
 'log_count/DEBUG': 1,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'start_time': datetime.datetime(2015, 6, 26, 6, 42, 35, 339158)}
2015-06-26 12:12:35 [scrapy] INFO: Spider closed (finished)

我目前面临的问题 - :

  • 如果你仔细看到我的输出的第1行和第6行,我传递给我的蜘蛛的start_url被打印两次,即使我在我的代码的第31行(我上面给出的链接)上只写了一次print语句。 为什么会发生这种情况,并且具有不同的值(第1行(我的输出)的初始打印语句输出给出正确的结果,虽然第6行(我的输出)的print语句输出?不仅如此,即使我写 - 打印'hi' - 然后它也被打印两次。为什么会这样?
  • 接下来,如果你看到我输出的这一行 - :TypeError:请求url必须是str或unicode,得到NoneType:为什么会出现这种情况(即使我上面发布的问题的链接写了相同的东西)? 我不知道如何解决它? 我甚至试过`self.start_urls = [str(kwargs.get('start_url'))]` - 然后它给出了以下输出:
http://bigbasket.com/ps/?q=apple
2015-06-26 12:28:00 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-06-26 12:28:00 [scrapy] INFO: Optional features available: ssl, http11
2015-06-26 12:28:00 [scrapy] INFO: Overridden settings: {}
2015-06-26 12:28:00 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
None
2015-06-26 12:28:01 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-06-26 12:28:01 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-06-26 12:28:01 [scrapy] INFO: Enabled item pipelines: 
2015-06-26 12:28:01 [scrapy] INFO: Spider opened
2015-06-26 12:28:01 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-06-26 12:28:01 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-06-26 12:28:01 [scrapy] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/engine.py", line 110, in _next_request
    request = next(slot.start_requests)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 70, in start_requests
    yield self.make_requests_from_url(url)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 73, in make_requests_from_url
    return Request(url, dont_filter=True)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 24, in __init__
    self._set_url(url)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 59, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: None
2015-06-26 12:28:01 [scrapy] INFO: Closing spider (finished)
2015-06-26 12:28:01 [scrapy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 6, 26, 6, 58, 1, 248350),
 'log_count/DEBUG': 1,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'start_time': datetime.datetime(2015, 6, 26, 6, 58, 1, 236056)}
2015-06-26 12:28:01 [scrapy] INFO: Spider closed (finished)

请帮我解决以上2个错误。

您需要在CrawlerProcesscrawl方法上传递参数,因此您需要像这样运行它:

crawler = CrawlerProcess(Settings())
crawler.crawl(BBSpider, start_url=url)
crawler.start()

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM