
Running multiple instances of a CrawlSpider

I'm just getting started with Scrapy and I'd like to do the following:

Have a list of n domains
i=0
loop for i to n
Use a (mostly) generic CrawlSpider to get all links (a href) of domain[i]
Save results as json lines

To do this, the spider needs to receive the domain it has to crawl as an argument.

I already successfully created the CrawlSpider:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field
from scrapy.crawler import CrawlerProcess

class MyItem(Item):
    # MyItem fields (omitted); "pass" keeps the class definition valid
    pass


class SubsiteSpider(CrawlSpider):
    name = "subsites"
    start_urls = []
    allowed_domains = []
    rules = (Rule(LinkExtractor(), callback='parse_obj', follow=True),)

    def __init__(self, starturl, allowed, *args, **kwargs):
        print(args)
        self.start_urls.append(starturl)
        self.allowed_domains.append(allowed)
        super().__init__(**kwargs)

    def parse_obj(self, response):
        item = MyItem()
        #fill Item Fields
        return item


process = CrawlerProcess({'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'})
process.crawl(SubsiteSpider)
process.start()

If I call it with scrapy crawl subsites -a starturl=http://example.com -a allowed=example.com -o output.jl, the result is exactly what I want, so this part is fine already.

What I fail to do is create multiple instances of SubsiteSpider, each with a different domain as an argument.

I tried the following (in SpiderRunner.py):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())

process.crawl('subsites', ['https://example.com', 'example.com'])
process.start()

Variant:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())

allowed = ["example.com"]
start = ["https://example.com"]
process.crawl('subsites', start, allowed)
process.start()

But I get an error that occurs, I presume, because the arguments are not properly passed to __init__, for example TypeError: __init__() missing 1 required positional argument: 'allowed' or TypeError: __init__() missing 2 required positional arguments: 'starturl' and 'allowed'. (The loop is yet to be implemented.)

So, here are my questions:

1) What is the proper way to pass arguments to __init__ if I do not start crawling via the scrapy shell, but from within Python code?

2) How can I also pass the -o output.jl argument? (Or maybe use the allowed argument as the filename?)

3) I am fine with running each spider one after another - would it still be considered best/good practice to do it that way? Could you point me to a more extensive tutorial about "running the same spider again and again, with different arguments (= target domains), optionally in parallel", if there is one?

Thank you all very much in advance! If there are any spelling mistakes (I am not a native English speaker), or if the question or details are not precise enough, please tell me how to correct them.

There are a few problems with your code:

  1. start_urls and allowed_domains are class attributes which you modify in __init__(), making them shared across all instances of your class.
    What you should do instead is make them instance attributes:

     class SubsiteSpider(CrawlSpider):
         name = "subsites"
         rules = (Rule(LinkExtractor(), callback='parse_obj', follow=True),)

         def __init__(self, starturl, allowed, *args, **kwargs):
             self.start_urls = [starturl]
             self.allowed_domains = [allowed]
             super().__init__(*args, **kwargs)
  2. Those last three lines should not be in the same file as your spider class, since you probably don't want to run that code each time your spider is imported.

  3. Your call to CrawlerProcess.crawl() is slightly wrong. You can use it like this, passing the arguments the same way you would pass them to the spider class's __init__() (see the loop sketch after this list for crawling multiple domains):

     process = CrawlerProcess(get_project_settings())
     process.crawl('subsites', 'https://example.com', 'example.com')
     process.start()
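
To cover the loop from the question, here is a minimal sketch (not part of the original answer) that schedules one crawl per domain before starting the process. The domains mapping, the keyword-argument style, and the spider name 'subsites' are assumptions based on the code above:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Hypothetical mapping of allowed domain -> start URL; replace with your list of n domains.
domains = {
    'example.com': 'https://example.com',
    'example.org': 'https://example.org',
}

process = CrawlerProcess(get_project_settings())

for allowed, starturl in domains.items():
    # Positional or keyword arguments are forwarded to SubsiteSpider.__init__()
    process.crawl('subsites', starturl=starturl, allowed=allowed)

# start() blocks until every scheduled crawl has finished.
process.start()

Note that crawls scheduled this way run concurrently in the same reactor; if you need them to run strictly one after another, the Scrapy documentation on running multiple spiders in the same process shows how to chain them with CrawlerRunner and deferreds.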

How can I also pass the -o output.jl argument? (Or maybe use the allowed argument as the filename?)

You can achieve the same effect using custom_settings , giving each instance a different FEED_URI setting.
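
Here is a minimal sketch of one way to do that (not from the original answer): because custom_settings is read from the spider class when its crawler is created, you can build a small subclass per domain whose custom_settings point at a per-domain feed file. The import path, the make_spider_for helper, and the <domain>.jl naming scheme are assumptions:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Hypothetical import; adjust to wherever SubsiteSpider lives in your project.
from myproject.spiders.subsites import SubsiteSpider

def make_spider_for(allowed):
    # Build a throwaway subclass whose custom_settings write JSON lines
    # to a file named after the allowed domain, e.g. example.com.jl
    return type(
        'SubsiteSpider_' + allowed.replace('.', '_'),
        (SubsiteSpider,),
        {'custom_settings': {'FEED_URI': allowed + '.jl',
                             'FEED_FORMAT': 'jsonlines'}},
    )

process = CrawlerProcess(get_project_settings())
for allowed, starturl in [('example.com', 'https://example.com'),
                          ('example.org', 'https://example.org')]:
    process.crawl(make_spider_for(allowed), starturl=starturl, allowed=allowed)
process.start()

Each crawler gets its own copy of the settings, so the per-class FEED_URI values do not interfere with one another.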
