
Nutch v Solr v Nutch+Solr

A related question on Stack Overflow exists, but it was asked six and a half years ago. A lot has changed since then, especially in Nutch. Basically, I have two questions.

  1. How do we compare Nutch to Solr?

  2. In what circumstances do we need to integrate the two for crawling, and why is that better? How does it differ from using either of them standalone (or with Hadoop)?

At its current stage, Nutch is responsible only for crawling the web: visit a web page, extract the content, find more links, and repeat the process (I'm skipping a lot of complicated stuff in between, but hopefully you get the idea).
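To make that loop concrete, here is a minimal sketch of the visit / extract / find links / repeat cycle. It is not how Nutch is implemented (Nutch runs these phases as batch jobs, optionally on Hadoop); the `crawl` function, the `PageParser` class, and the in-memory `FAKE_WEB` pages are all made up for illustration so that the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collects text fragments and outgoing links from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # Resolve every <a href="..."> against the URL of the current page.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())


def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl; `fetch(url)` returns HTML or None."""
    frontier = deque(seed_urls)   # URLs waiting to be visited
    seen = set(seed_urls)         # URLs already queued, to avoid loops
    documents = {}                # url -> extracted text
    while frontier and len(documents) < max_pages:
        url = frontier.popleft()
        html = fetch(url)         # 1. visit the page
        if html is None:
            continue
        parser = PageParser(url)
        parser.feed(html)         # 2. extract the content
        documents[url] = " ".join(parser.text_parts)
        for link in parser.links: # 3. find more links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return documents              # 4. the loop above is the "repeat"


# Hypothetical pages standing in for the real web.
FAKE_WEB = {
    "http://example.test/": '<p>Home</p><a href="/a">A</a><a href="/b">B</a>',
    "http://example.test/a": '<p>Page A</p><a href="/">home</a>',
    "http://example.test/b": '<p>Page B</p><a href="/missing">gone</a>',
}

docs = crawl(["http://example.test/"], FAKE_WEB.get)
```

After the crawl finishes, `docs` only lives in memory. Nothing in this loop makes the text searchable, which is the gap a search backend fills.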

The last stage of the crawling process is to store the data in your backend (ES/Solr are the supported data stores on the 1.x branch). This is where Solr comes into play: after Nutch has completed its work, you need to store the data somewhere so that you can run queries on it. That is Solr's job.
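As a rough illustration of the "store, then query" half, the sketch below builds the two HTTP requests a Solr backend typically receives: a JSON post to the `/update` handler and a search against `/select`. In a real setup Nutch's indexer sends the documents for you; the host, the core name `nutch`, the field names, and the sample document are assumptions made for this example, and the requests are only constructed, not sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Assumptions: a local Solr instance and a core named "nutch".
SOLR_BASE = "http://localhost:8983/solr/nutch"


def build_update_request(docs):
    """Build a POST that adds documents through Solr's JSON update handler."""
    body = json.dumps(docs).encode("utf-8")
    return Request(
        SOLR_BASE + "/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_search_request(query, rows=10):
    """Build a GET against Solr's standard /select query handler."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return Request(SOLR_BASE + "/select?" + params, method="GET")


# A made-up crawled page; the field names are illustrative only.
crawled = [
    {
        "id": "http://example.test/",
        "title": "Home",
        "content": "Nutch is a crawler",
    }
]

update_req = build_update_request(crawled)
search_req = build_search_request("content:crawler")

# To actually talk to a running Solr you would pass either request
# to urllib.request.urlopen(...).
```

The split mirrors the division of labour: the update request is what the crawler side produces, and the search request is what an application issues later, without involving the crawler at all.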

Some time ago Nutch included the ability to write the inverted index itself (as explained in the question), but the decision (also made some time ago) was to deprecate this in favor of using Solr/ES (or any other storage that you can write an indexer plugin for). Right now the indexing plugins are pluggable, so you can write a plugin for any data store that you want.
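The idea of pluggable index writers can be sketched as a small interface with interchangeable backends. This is only an illustration of the pattern: Nutch's real plugins are written in Java against its own extension points, and the `IndexWriter`, `InMemoryWriter`, `JsonLinesWriter`, and `index_documents` names below are hypothetical.

```python
import io
import json
from abc import ABC, abstractmethod


class IndexWriter(ABC):
    """Hypothetical plugin contract: receive documents, then commit."""

    @abstractmethod
    def write(self, doc):
        """Queue one document for the backend."""

    @abstractmethod
    def commit(self):
        """Make everything written so far durable/visible."""


class InMemoryWriter(IndexWriter):
    """Toy backend that keeps committed documents in a list."""

    def __init__(self):
        self.pending = []
        self.committed = []

    def write(self, doc):
        self.pending.append(doc)

    def commit(self):
        self.committed.extend(self.pending)
        self.pending.clear()


class JsonLinesWriter(IndexWriter):
    """Toy backend that writes one JSON object per line to a stream."""

    def __init__(self, stream):
        self.stream = stream

    def write(self, doc):
        self.stream.write(json.dumps(doc, sort_keys=True) + "\n")

    def commit(self):
        self.stream.flush()


def index_documents(docs, writers):
    """The crawler side only knows the interface, not the backend."""
    for writer in writers:
        for doc in docs:
            writer.write(doc)
        writer.commit()


docs = [
    {"id": "http://example.test/a", "content": "Page A"},
    {"id": "http://example.test/b", "content": "Page B"},
]
memory = InMemoryWriter()
buffer = io.StringIO()
index_documents(docs, [memory, JsonLinesWriter(buffer)])
```

Swapping Solr for Elasticsearch, or for something custom, then amounts to supplying another class that honours the same contract, which is the point of making the indexing step a plugin.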

Summary: Nutch is a crawler and Solr is the search engine where Nutch stores the data that is crawled.

  1. Nutch and Solr are two different things. Nutch just crawls the web and parses the contents of the web pages, while Solr is responsible for indexing, i.e. storing the contents crawled by Nutch, when the two are integrated.

  2. You need to integrate Solr with Nutch when you have to retrieve and store data while crawling the web. If you don't have to store or index anything, then you don't need Solr. Solr is useful when you want to store the data Nutch crawls and then perform a search on the data.
