
Implementing Web Crawler on a distributed architecture in Java

Friends, I have implemented a multi-threaded web crawler in Java. To make it more efficient, I want to convert it into a distributed architecture, i.e. running on 3 machines. As far as I have searched, a master-slave architecture seems best. Can anyone provide some insight into which architecture is best and how I can implement it in Java?

You could compute a hash code for each domain being crawled and use that hash to determine which node should crawl the domain. That way all nodes can work in parallel without much interaction; a sketch of this partitioning is shown below.
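A minimal sketch of the idea, assuming 3 crawler nodes and plain `String.hashCode()` on the host name; the class and method names are illustrative, not from the original question:

```java
import java.net.URI;
import java.net.URISyntaxException;

/**
 * Assigns each URL to a crawler node based on the hash of its domain,
 * so that all URLs from the same domain land on the same node and the
 * nodes can crawl independently.
 */
public class DomainPartitioner {

    private final int nodeCount;

    public DomainPartitioner(int nodeCount) {
        this.nodeCount = nodeCount;
    }

    /** Returns the index of the node responsible for crawling this URL's domain. */
    public int nodeFor(String url) throws URISyntaxException {
        String host = new URI(url).getHost();
        // Math.floorMod keeps the result non-negative even if hashCode() is negative.
        return Math.floorMod(host.hashCode(), nodeCount);
    }

    public static void main(String[] args) throws URISyntaxException {
        DomainPartitioner partitioner = new DomainPartitioner(3);
        String[] urls = {
            "https://example.com/page1",
            "https://example.org/about",
            "https://example.com/page2"   // same domain -> same node as page1
        };
        for (String url : urls) {
            System.out.println(url + " -> node " + partitioner.nodeFor(url));
        }
    }
}
```

Because the mapping from domain to node is deterministic, any node that discovers a new link can forward it to the owning node without central coordination.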

You also need some code to merge the crawled results, either after crawling is complete or periodically. It is probably best to just copy the produced archives from the nodes and process them in a central location; see the sketch below.
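A minimal sketch of such a merge step, assuming each node has written its crawled URLs to a plain-text file (one URL per line) that has already been copied to the central machine; the file names are hypothetical:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Stream;

/**
 * Merges the per-node result files into a single deduplicated list of URLs.
 */
public class ResultMerger {

    public static void main(String[] args) throws IOException {
        List<Path> nodeOutputs = List.of(
            Path.of("node0-results.txt"),
            Path.of("node1-results.txt"),
            Path.of("node2-results.txt"));

        // Deduplicate URLs across nodes while preserving insertion order.
        Set<String> merged = new LinkedHashSet<>();
        for (Path file : nodeOutputs) {
            try (Stream<String> lines = Files.lines(file)) {
                lines.forEach(merged::add);
            }
        }

        Files.write(Path.of("merged-results.txt"), merged);
        System.out.println("Merged " + merged.size() + " unique URLs");
    }
}
```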

A cloud of virtual machines looks like a good deployment platform, as crawling is not very CPU- or memory-intensive.
