
Fastest architecture for multithreaded web crawler

There should be a frontier object holding the set of URLs that have been visited and those waiting to be crawled. There should be some threads responsible for crawling web pages. There would also be some kind of controller object that creates the crawling threads.

I don't know what architecture would be faster and easier to extend. How should responsibilities be divided so that as little synchronization as possible is needed, and the number of checks of whether the current URL has already been visited is minimized?

Should the controller object be responsible for providing new URLs to the worker threads? That would mean the worker threads crawl all the URLs they were given and then sleep for an undefined time. The controller would be interrupting those threads, so each crawling thread would have to handle InterruptedException (how expensive is that in Java? Exception handling does not seem to be very fast). Or should the controller only start the threads and let the crawling threads fetch from the frontier themselves?

Create a shared, thread-safe list with the URLs to be crawled. Create an Executor with a number of threads corresponding to the number of crawlers you want to run concurrently. Start your crawlers as Runnables with a reference to the shared list and submit each of them to the Executor. Each crawler removes the next URL from the list, does whatever you need it to do, and loops until the list is empty.
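A minimal sketch of that approach, assuming a ConcurrentLinkedQueue as the shared list, placeholder seed URLs, and a stubbed-out crawl() method:

import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class CrawlerPool {

    // Thread-safe queue shared by all crawler tasks.
    private static final ConcurrentLinkedQueue<String> urls = new ConcurrentLinkedQueue<>();

    public static void main(String[] args) throws InterruptedException {
        urls.add("http://example.com");          // seed URLs (placeholders)
        urls.add("http://example.org");

        int crawlerCount = 4;                    // number of concurrent crawlers
        ExecutorService pool = Executors.newFixedThreadPool(crawlerCount);

        for (int i = 0; i < crawlerCount; i++) {
            pool.submit(() -> {
                String url;
                // poll() returns null once the queue is empty, ending the loop.
                while ((url = urls.poll()) != null) {
                    crawl(url);
                }
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    private static void crawl(String url) {
        // Fetch and process the page here; stubbed for illustration.
        System.out.println(Thread.currentThread().getName() + " crawling " + url);
    }
}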

It's been a few years since this question was asked, but as of Nov 2015 we are currently using frontera and scrapyd.

Scrapy uses Twisted, which makes it a good multithreaded crawler, and on multi-core machines that means we are only limited by the inbound bandwidth. Frontera-distributed uses HBase and Kafka to score links and keep all the data accessible to clients.

Create a central resource with a hash map that stores each URL as a key together with the time it was last scanned. Make this resource thread safe. Then just spawn threads with links in a queue, which the crawlers can pick up as starting points. Each thread then carries on crawling and updating the resource. A thread in the resource clears up outdated crawls. The in-memory resource can be serialised at start, or it could live in a database, depending on your application's needs.
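One way such a central registry could look, as a sketch only: a ConcurrentHashMap keyed by URL with the last-scan timestamp as the value, a hypothetical tryClaim() method that atomically decides whether a URL is due for crawling, and a clearOutdated() method for the cleanup thread.

import java.util.concurrent.ConcurrentHashMap;

/** Central, thread-safe registry of URLs and when they were last scanned. */
public class CrawlRegistry {

    private final ConcurrentHashMap<String, Long> lastScanned = new ConcurrentHashMap<>();
    private final long maxAgeMillis;

    public CrawlRegistry(long maxAgeMillis) {
        this.maxAgeMillis = maxAgeMillis;
    }

    /**
     * Atomically claims a URL for crawling: returns true only if the URL has
     * never been scanned or its last scan is older than maxAgeMillis.
     */
    public boolean tryClaim(String url) {
        long now = System.currentTimeMillis();
        boolean[] claimed = {false};
        // compute() runs atomically per key, so two threads cannot both
        // claim the same URL at the same time.
        lastScanned.compute(url, (key, old) -> {
            if (old == null || now - old > maxAgeMillis) {
                claimed[0] = true;
                return now;
            }
            return old;
        });
        return claimed[0];
    }

    /** Drops entries older than maxAgeMillis (the cleanup thread's job). */
    public void clearOutdated() {
        long cutoff = System.currentTimeMillis() - maxAgeMillis;
        lastScanned.values().removeIf(ts -> ts < cutoff);
    }
}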

You could make this resource accessible via remote services to allow multiple machines to share it. You could also spread the resource itself over several machines by partitioning the URLs. Etc...

You should use a blocking queue that contains the URLs that need to be fetched. In this case you can create multiple consumers that fetch URLs in multiple threads. If the queue is empty, all fetchers are blocked. In that case you should start all threads at the beginning and should not need to control them later. You also need to maintain a list of already downloaded pages in some persistent storage and check it before adding a URL to the queue.
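A sketch of that design, with a LinkedBlockingQueue as the frontier. The placeholder seed URL and the stubbed fetch() method are assumptions, and the in-memory Set stands in for whatever persistent storage actually records already-downloaded pages:

import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

public class BlockingQueueCrawler {

    private static final BlockingQueue<String> frontier = new LinkedBlockingQueue<>();
    // Stand-in for persistent storage of already-downloaded pages.
    private static final Set<String> visited = ConcurrentHashMap.newKeySet();

    public static void main(String[] args) {
        enqueue("http://example.com");           // seed URL (placeholder)

        int fetcherCount = 4;
        for (int i = 0; i < fetcherCount; i++) {
            Thread fetcher = new Thread(() -> {
                try {
                    while (true) {
                        // take() blocks while the queue is empty, so idle
                        // fetchers simply wait instead of needing a controller.
                        String url = frontier.take();
                        for (String link : fetch(url)) {
                            enqueue(link);
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();   // allow shutdown
                }
            });
            fetcher.start();
        }
    }

    /** Adds a URL to the frontier only if it has not been seen before. */
    private static void enqueue(String url) {
        if (visited.add(url)) {
            frontier.offer(url);
        }
    }

    private static java.util.List<String> fetch(String url) {
        // Download and parse the page, returning discovered links; stubbed out.
        System.out.println(Thread.currentThread().getName() + " fetching " + url);
        return java.util.Collections.emptyList();
    }
}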

If you don't want to reinvent the wheel, why not have a look at Apache Nutch?
