
Scrapy crawler - creating 10,000 spiders or one spider crawling 10,000 domains?

I need to crawl up to 10,000 websites.

Every website is unique, with its own HTML structure, and so requires its own XPath logic and its own logic for creating and delegating Request objects. I'm tempted to create a separate spider for each website.

But is this the best way forward? Or should I have a single spider, add all 10,000 websites to start_urls and allowed_domains, write scraping libraries, and go for it?
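Something like this minimal sketch (the domain names and the dispatch-by-domain logic below are placeholders, just to illustrate the idea):

import scrapy
from urllib.parse import urlparse

class MonolithSpider(scrapy.Spider):
    name = "monolith"
    # Placeholder domains; the real lists would hold 10,000 entries,
    # probably loaded from a file rather than hard-coded.
    allowed_domains = ["site-a.com", "site-b.com"]
    start_urls = ["https://site-a.com/", "https://site-b.com/"]

    def parse(self, response):
        # Dispatch to per-domain extraction logic
        domain = urlparse(response.url).netloc
        if domain.endswith("site-a.com"):
            yield from self.parse_site_a(response)
        elif domain.endswith("site-b.com"):
            yield from self.parse_site_b(response)

    def parse_site_a(self, response):
        yield {"url": response.url}  # placeholder extraction

    def parse_site_b(self, response):
        yield {"url": response.url}  # placeholder extraction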

What is the best practice in this regard?

I faced a similar problem, and I took a middle road.

Much of the data you encounter will likely be handled the same way when you finally process it, which means much of the logic you need can be reused. The site-specific parts come down to where to look for the data and how to transform it into a common format. I suggest the following:

Create a MainSpider class containing most of the logic and tasks you need. Then, for each site, subclass MainSpider and define the site-specific logic as required.

main_spider.py

class MainSpider(object):
    # Shared logic and tasks live here
    def get_links(self, url):
        links = []
        # ... common link-extraction logic ...
        return links

spider_mysite.py

from main_spider import MainSpider

class SpiderMysite(MainSpider):
    def get_data(self, links):
        for link in links:
            # Site-specific extraction for each link goes here
            pass
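For completeness, here is how that split might map onto actual Scrapy classes. This is only a sketch; the spider name, domain, start URL, and XPath expressions are invented for illustration:

import scrapy

class BaseSiteSpider(scrapy.Spider):
    # Shared flow: request the start URLs, then delegate extraction
    # to the per-site subclass via item_xpath and extract_item().
    item_xpath = None  # subclasses override with their own XPath

    def parse(self, response):
        for row in response.xpath(self.item_xpath):
            yield self.extract_item(row)

    def extract_item(self, row):
        raise NotImplementedError

class ExampleComSpider(BaseSiteSpider):
    # Hypothetical site-specific subclass
    name = "example_com"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/listing"]
    item_xpath = "//div[@class='product']"

    def extract_item(self, row):
        return {
            "title": row.xpath(".//h2/text()").get(),
            "price": row.xpath(".//span[@class='price']/text()").get(),
        }

With this layout, each new site costs only a small subclass, while scheduling, error handling, and item pipelines stay in one place.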

Hope it helps.
