
CrawlSpider not following defined rules when used in a script

I have this scraper that works perfectly fine when I call it from the command line, like:

scrapy crawl generic

and this is how my scraper looks:

import scrapy
from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'generic'
    rules = (Rule(LinkExtractor(allow=(r'.{22}.+')), callback='parse_item', follow=True),)
    start_urls = ["someurl"]
    allowed_domains = ["somedomain"]

    def parse_item(self, response):
        # extract some data and store it somewhere
        pass

I'm trying to use this spider from a Python script, and I followed the documentation at http://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script

This is what the script looks like:

from scrapy.settings import Settings
from scrapy.crawler import CrawlerProcess
import scrapy
from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'generic'
    rules = (Rule(LinkExtractor(allow=(r'.{22}.+')), callback='parse_item', follow=True),)
    start_urls = ["someurl"]
    allowed_domains = ["somedomain"]

    def parse_item(self, response):
        # extract some data and store it somewhere
        pass

settings = Settings()
settings.set('DEPTH_LIMIT', 1)

process = CrawlerProcess(settings)
process.crawl(MySpider)
process.start()

This is what I see on the terminal when I run from the script:

Desktop $ python newspider.py  
2015-10-14 21:46:39 [scrapy] INFO: Scrapy 1.0.3 started (bot: scrapybot)
2015-10-14 21:46:39 [scrapy] INFO: Optional features available: ssl, http11
2015-10-14 21:46:39 [scrapy] INFO: Overridden settings: {'DEPTH_LIMIT': 1}
2015-10-14 21:46:39 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-10-14 21:46:39 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-10-14 21:46:39 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-10-14 21:46:39 [scrapy] INFO: Enabled item pipelines: 
2015-10-14 21:46:39 [scrapy] INFO: Spider opened
2015-10-14 21:46:39 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-10-14 21:46:39 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-10-14 21:46:39 [scrapy] DEBUG: Redirecting (302) to <GET http://thevine.com.au/> from <GET http://thevine.com.au/>
2015-10-14 21:46:41 [scrapy] DEBUG: Crawled (200) <GET http://thevine.com.au/> (referer: None)
2015-10-14 21:46:41 [scrapy] DEBUG: Filtered offsite request to 'thevine.com.au': <GET http://thevine.com.au/>
2015-10-14 21:46:41 [scrapy] DEBUG: Filtered offsite request to 'www.pinterest.com': <GET https://www.pinterest.com/thevineonline/>
2015-10-14 21:46:41 [scrapy] DEBUG: Filtered offsite request to 'www.twitter.com': <GET http://www.twitter.com/thevineonline>
2015-10-14 21:46:41 [scrapy] DEBUG: Filtered offsite request to 'www.facebook.com': <GET http://www.facebook.com/sharer.php?u=http%3A%2F%2Fthevine.com.au%2Fentertainment%2Fcelebrity%2Fleonardo-dicaprio-is-producing-a-movie-about-the-volkswagen-emissions-scandal%2F>
2015-10-14 21:46:41 [scrapy] DEBUG: Filtered offsite request to 'twitter.com': <GET https://twitter.com/intent/tweet?text=Leonardo+DiCaprio+is+Producing+A+Movie+About+The+Volkswagen+Emissions+Scandal&url=http%3A%2F%2Fthevine.com.au%2Fentertainment%2Fcelebrity%2Fleonardo-dicaprio-is-producing-a-movie-about-the-volkswagen-emissions-scandal%2F&via=thevineonline>
2015-10-14 21:46:41 [scrapy] DEBUG: Filtered offsite request to 'plus.google.com': <GET http://plus.google.com/share?url=http%3A%2F%2Fthevine.com.au%2Fentertainment%2Fcelebrity%2Fleonardo-dicaprio-is-producing-a-movie-about-the-volkswagen-emissions-scandal%2F>
2015-10-14 21:46:41 [scrapy] DEBUG: Filtered offsite request to 'pinterest.com': <GET http://pinterest.com/pin/create/button/?media=http%3A%2F%2Fs3-ap-southeast-2.amazonaws.com%2Fthevine-online%2Fwp-content%2Fuploads%2F2015%2F10%2F13202447%2FScreen-Shot-2015-10-14-at-7.24.25-AM.jpg&url=http%3A%2F%2Fthevine.com.au%2Fentertainment%2Fcelebrity%2Fleonardo-dicaprio-is-producing-a-movie-about-the-volkswagen-emissions-scandal%2F>
2015-10-14 21:46:41 [scrapy] INFO: Closing spider (finished)
2015-10-14 21:46:41 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 424,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 28536,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/302': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 10, 14, 16, 16, 41, 270707),
 'log_count/DEBUG': 10,
 'log_count/INFO': 7,
 'offsite/domains': 7,
 'offsite/filtered': 139,
 'request_depth_max': 1,
 'response_received_count': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2015, 10, 14, 16, 16, 39, 454120)}

In this case, the start_url was http://thevine.com.au/ and allowed_domains was thevine.com.au.
The same start URL and domain, given to the spider running as a Scrapy project, give this:

$ scrapy crawl generic -a start="http://thevine.com.au/" -a domains="thevine.com.au"
2015-10-14 22:14:45 [scrapy] INFO: Scrapy 1.0.3 started (bot: mary)
2015-10-14 22:14:45 [scrapy] INFO: Optional features available: ssl, http11
2015-10-14 22:14:45 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'mary.spiders', 'SPIDER_MODULES': ['mary.spiders'], 'DEPTH_LIMIT': 1, 'BOT_NAME': 'mary'}
2015-10-14 22:14:45 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-10-14 22:14:46 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-10-14 22:14:46 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-10-14 22:14:46 [scrapy] INFO: Enabled item pipelines:
2015-10-14 22:14:46 [scrapy] INFO: Spider opened
2015-10-14 22:14:46 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-10-14 22:14:46 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-10-14 22:14:47 [scrapy] DEBUG: Crawled (200) <GET http://thevine.com.au/> (referer: None)
2015-10-14 22:14:47 [scrapy] DEBUG: Filtered offsite request to 'www.pinterest.com': <GET https://www.pinterest.com/thevineonline/>
.
.
2015-10-14 22:14:48 [scrapy] DEBUG: Crawled (200) <GET http://thevine.com.au/category/entertainment/> (referer: http://thevine.com.au/)

2015-10-14 22:16:10 [scrapy] DEBUG: Ignoring link (depth > 1): http://thevine.com.au/category/entertainment/ 
2015-10-14 22:16:10 [scrapy] DEBUG: Ignoring link (depth > 1): http://thevine.com.au/category/entertainment/viral/
.
.

2015-10-14 22:16:10 [scrapy] DEBUG: Crawled (200) <GET http://thevine.com.au/gear/tech/elon-musk-plans-to-launch-4000-satellites-to-bring-wi-fi-to-most-remote-locations-on-earth/> (referer: http://thevine.com.au/)  
2015-10-14 22:19:31 [scrapy] INFO: Crawled 26 pages (at 16 pages/min), scraped 0 items (at 0 items/min)

and so on; it just keeps going.

So basically, this is what I understand about what happens when I run from the script:
The Rule is not followed at all. My parse_item callback doesn't work, and neither does any callback other than the default parse. It only crawls the URLs in start_urls, and only calls back to the default parse method if one is included.
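For context, the command-line run above passes start and domains as -a spider arguments, so the project version of the spider presumably turns them into start_urls and allowed_domains in __init__, roughly along these lines (a hypothetical sketch; the posted code omits it):

from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'generic'
    rules = (Rule(LinkExtractor(allow=(r'.{22}.+')), callback='parse_item', follow=True),)

    def __init__(self, start=None, domains=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # -a start=... and -a domains=... arrive here as keyword arguments
        if start:
            self.start_urls = [start]
        if domains:
            self.allowed_domains = [domains]

    def parse_item(self, response):
        # extract some data and store it somewhere
        pass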

You need to pass an instance of the spider class to the .crawl method:

...
spider = MySpider()
process.crawl(spider)
...

But it should still work the way you are doing it.

The logs show that you are doing offsite requests. Try removing allowed_domains from the spider (if you don't care about it), but you could also pass domain to process.crawl:

process.crawl(spider, domain="mydomain")
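Keyword arguments given to process.crawl are forwarded to the spider's constructor, the same way -a arguments are on the command line. Assuming the spider accepts them (as in the hypothetical __init__ sketched earlier), the script could pass the same values like this:

from scrapy.settings import Settings
from scrapy.crawler import CrawlerProcess

settings = Settings()
settings.set('DEPTH_LIMIT', 1)

process = CrawlerProcess(settings)
# keyword arguments here reach the spider just like -a on the command line
process.crawl(MySpider, start="http://thevine.com.au/", domains="thevine.com.au")
process.start()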
