
python scrapy unable to find spider while trying to use arguments

I have successfully created a spider that retrieves the links on every web page of a domain.

I would like to do the same for every domain I host, reusing the same spider and simply passing it the domain to monitor as an argument.

The documentation here explains that we should explicitly define the constructor and add parameters to it, then launch the spider with the command scrapy crawl myspider.

Here is my code:

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class MySpider(BaseSpider):
    name = 'spider'

    def __init__(self, domain='some_domain.net', *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.domain = domain
        self.allowed_domains = [self.domain]
        self.start_urls = ['http://' + self.domain]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for url in hxs.select('//a/@href').extract():
            if not url.startswith('http://'):
                # resolve relative links against the monitored domain
                url = 'http://' + self.domain + url
            print url
            yield Request(url, callback=self.parse)

However, launching

scrapy crawl spider -a domain='mydomain.my_extension'

returns:

ERROR: unable to find spider: spider

When I launch the same code without the explicit constructor, I can't do this with crawl; I have to use this command instead:

scrapy runspider /path/to/spider/spider.py

and runspider does not accept parameters, so for that I would need crawl instead.

Why is it not possible to use scrapy crawl spider? Why is the spider's name never read by scrapy crawl, while it is by scrapy runspider?

Scrapy looks great, but is quite confusing on second sight :/

Many thanks for your help

If you run Scrapy 0.14, you should set the variables at the class level, not at the instance level. I think this changed in 0.15.

I recommend reading the documentation: http://doc.scrapy.org/en/0.14/topics/spiders.html

class MySpider(BaseSpider):
    name = 'spider'
    domain = 'some_domain.net'
    # class-level names are in scope inside the class body, no self needed
    allowed_domains = [domain]
    start_urls = ['http://' + domain]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for url in hxs.select('//a/@href').extract():
            if not url.startswith('http://'):
                url = 'http://' + self.domain + url
            print url
            yield Request(url, callback=self.parse)
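To see why the class level matters, here is a hypothetical sketch in plain Python (not Scrapy's actual implementation): a registry that looks spiders up by name inspects the class itself, before any instance exists, so a name set only inside __init__ is invisible to it. The `find_spider` helper and the two toy classes are made up for illustration.

```python
class SpiderA:
    name = 'spider_a'           # class-level: visible without instantiating

class SpiderB:
    def __init__(self):
        self.name = 'spider_b'  # instance-level: only exists after __init__ runs

def find_spider(classes, wanted):
    # mimic a registry that matches on the class attribute
    return [c for c in classes if getattr(c, 'name', None) == wanted]

print(find_spider([SpiderA, SpiderB], 'spider_a'))  # finds SpiderA
print(find_spider([SpiderA, SpiderB], 'spider_b'))  # finds nothing
```

This is the same reason "unable to find spider" can appear even though the file defines a spider: the lookup happens on the class attributes, not on a constructed object.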
