
Running multiple instances of a CrawlSpider

I'm just getting started with Scrapy and I'd like to do the following:

Have a list of n domains
i = 0
loop for i to n:
    Use a (mostly) generic CrawlSpider to get all links (a href) of domain[i]
    Save results as JSON lines

To do this, the spider needs to receive the domain it has to crawl as an argument.

I already successfully created the CrawlSpider:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field
from scrapy.crawler import CrawlerProcess

class MyItem(Item):
    # MyItem fields
    pass


class SubsiteSpider(CrawlSpider):
    name = "subsites"
    start_urls = []
    allowed_domains = []
    rules = (Rule(LinkExtractor(), callback='parse_obj', follow=True),)

    def __init__(self, starturl, allowed, *args, **kwargs):
        print(args)
        self.start_urls.append(starturl)
        self.allowed_domains.append(allowed)
        super().__init__(**kwargs)

    def parse_obj(self, response):
        item = MyItem()
        #fill Item Fields
        return item


process = CrawlerProcess({'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'})
process.crawl(SubsiteSpider)
process.start()

If I call it with scrapy crawl subsites -a starturl=http://example.com -a allowed=example.com -o output.jl, the result is exactly what I want, so this part is fine already.

What I fail to do is create multiple instances of SubsiteSpider, each with a different domain as an argument.

I tried (in SpiderRunner.py):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())

process.crawl('subsites', ['https://example.com', 'example.com'])
process.start()

Variant:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())

allowed = ["example.com"]
start = ["https://example.com"]
process.crawl('subsites', start, allowed)
process.start()

But I get an error which occurs, I presume, because the arguments are not properly passed to __init__, for example TypeError: __init__() missing 1 required positional argument: 'allowed' or TypeError: __init__() missing 2 required positional arguments: 'starturl' and 'allowed'. (The loop is yet to be implemented.)

So, here are my questions:

1) What is the proper way to pass arguments to __init__ if I do not start crawling via the scrapy crawl command line, but from within Python code?
2) How can I also pass the -o output.jl argument? (Or maybe use the allowed argument as the filename?)
3) I am fine with the spiders running one after another - would it still be considered best/good practice to do it that way? Could you point me to a more extensive tutorial about "running the same spider again and again, with different arguments (= target domains), optionally in parallel", if there is one?

Thank you all very much in advance! If there are any spelling mistakes (I am not a native English speaker), or if the question/details are not precise enough, please tell me how to correct them.

There are a few problems with your code:

  1. start_urls and allowed_domains are class attributes which you modify in __init__(), making them shared across all instances of your class. What you should do instead is make them instance attributes:

     class SubsiteSpider(CrawlSpider):
         name = "subsites"
         rules = (Rule(LinkExtractor(), callback='parse_obj', follow=True),)

         def __init__(self, starturl, allowed, *args, **kwargs):
             self.start_urls = [starturl]
             self.allowed_domains = [allowed]
             super().__init__(*args, **kwargs)
  2. Those last 3 lines should not be in the file with your spider class, since you probably don't want to run that code each time your spider is imported.

  3. Your call to CrawlerProcess.crawl() is slightly wrong. You can use it like this, passing the arguments in the same manner you'd pass them to the spider class's __init__():

     process = CrawlerProcess(get_project_settings())
     process.crawl('subsites', 'https://example.com', 'example.com')
     process.start()
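
If you want to crawl a whole list of domains in one run, you can schedule one crawl per domain before calling start(); CrawlerProcess.crawl() may be called several times, and the scheduled spiders then run concurrently in the same process. A minimal sketch (the targets list below is just an illustration, not from the original code):

     from scrapy.crawler import CrawlerProcess
     from scrapy.utils.project import get_project_settings

     # hypothetical list of (start URL, allowed domain) pairs
     targets = [
         ('https://example.com', 'example.com'),
         ('https://example.org', 'example.org'),
     ]

     process = CrawlerProcess(get_project_settings())
     for starturl, allowed in targets:
         # schedule one SubsiteSpider instance per domain
         process.crawl('subsites', starturl, allowed)
     process.start()  # blocks until all scheduled spiders have finished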

How can I also pass the -o output.jl argument? (Or maybe use the allowed argument as the filename?)

You can achieve the same effect using custom_settings, giving each instance a different FEED_URI setting.
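
A sketch of that idea (the per-domain subclass trick and the target list are my own illustration, not part of the original answer, and it assumes SubsiteSpider is importable from your spider module): because custom_settings is read from the spider class rather than the instance, one way to give every crawl its own output file is to build a small subclass per domain whose custom_settings carries the FEED_URI:

     from scrapy.crawler import CrawlerProcess
     from scrapy.utils.project import get_project_settings

     process = CrawlerProcess(get_project_settings())

     for starturl, allowed in [('https://example.com', 'example.com'),
                               ('https://example.org', 'example.org')]:
         # custom_settings is a class attribute, so create a throwaway subclass
         # whose feed settings are derived from the allowed domain
         spider_cls = type(
             'SubsiteSpider_' + allowed.replace('.', '_'),
             (SubsiteSpider,),
             {'custom_settings': {'FEED_URI': allowed + '.jl',
                                  'FEED_FORMAT': 'jsonlines'}},
         )
         process.crawl(spider_cls, starturl, allowed)

     process.start()

Each crawler then resolves its own FEED_URI, so every domain's items end up in a separate .jl file.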
