I have successfully written a spider that retrieves the links of every web page of a domain.
I would like to do the same for every domain I host, preferably by reusing the same spider and simply passing it the domain to monitor as an argument.
The documentation explains that we should explicitly define the constructor, add parameters to it, and then launch the spider with the command scrapy crawl myspider.
Here is my code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request

class MySpider(BaseSpider):
    name = 'spider'

    def __init__(self, domain='some_domain.net', **kwargs):
        super(MySpider, self).__init__(**kwargs)
        self.domain = domain
        self.allowed_domains = [self.domain]
        self.start_urls = ['http://' + self.domain]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for url in hxs.select('//a/@href').extract():
            if not url.startswith('http://'):
                # resolve relative links against the domain root
                url = 'http://' + self.domain + url
            print url
            yield Request(url, callback=self.parse)
However, launching
scrapy crawl spider -a domain='mydomain.my_extension'
returns:
ERROR: unable to find spider: spider
When I run the same code without an explicit constructor, I can't use crawl; I have to use this command instead:
scrapy runspider /path/to/spider/spider.py
But runspider doesn't accept parameters; for those I would have to use crawl.
Why can't I use scrapy crawl spider? Why is the spider's name never read by scrapy crawl, while it is by scrapy runspider?
Scrapy looks great, but it is quite confusing at second sight :/
Many thanks for your help.
If you are running Scrapy 0.14, you should set the variables at the class level, not at the instance level. I think this changed in 0.15.
I recommend reading the documentation: http://doc.scrapy.org/en/0.14/topics/spiders.html
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request

class MySpider(BaseSpider):
    name = 'spider'
    domain = 'some_domain.net'
    allowed_domains = [domain]
    start_urls = ['http://' + domain]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for url in hxs.select('//a/@href').extract():
            if not url.startswith('http://'):
                # resolve relative links against the domain root
                url = 'http://' + self.domain + url
            print url
            yield Request(url, callback=self.parse)
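The class-level requirement can be illustrated in plain Python. Scrapy's spider manager looks up spiders by reading the name attribute from each spider class without instantiating it, so a name assigned inside __init__ is invisible at lookup time. The sketch below is a hypothetical simplified registry, not Scrapy's actual code, but it shows the behavior:

```python
# Hypothetical simplified registry (not Scrapy's real code): it finds a
# spider by reading the class attribute `name` WITHOUT instantiating
# the class, which is roughly how `scrapy crawl` resolves spider names.

class SpiderWithClassName(object):
    name = 'spider'           # visible on the class itself

class SpiderWithInstanceName(object):
    def __init__(self):
        self.name = 'spider'  # only exists after instantiation

def find_spider(classes, wanted):
    """Return the first class whose class-level `name` matches, else None."""
    for cls in classes:
        if getattr(cls, 'name', None) == wanted:
            return cls
    return None

# The class-level name is found; the instance-level one is not,
# because find_spider never calls the constructor.
print(find_spider([SpiderWithClassName], 'spider'))     # the class is found
print(find_spider([SpiderWithInstanceName], 'spider'))  # None
```

This is why moving name (and the other attributes) to the class body makes scrapy crawl spider work: the command inspects spider classes before any constructor runs.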