简体   繁体   中英

What is the role of NUTCH if we are going to make a search engine using Hadoop and Solr?

I want to make a search engine. In which i want to crawl some sites and stored their indexes and info in Hadoop. And then using Solr search will be done. But I am facing lots of issues. If search over google then different people give different suggestions and different configuring ways for setup a hadoop based search engine. These are my some questions :

1) How the crawling will be done? Is there any use of NUTCH for completing the crawling or not? If yes then how Hadoop and NUTCH communicate with each other?

2) What is the use of Solr? If NUTCH done Crawling and stored their crawled indexes and their information into the Hadoop then what's the role of Solr?

3) Can we done searching using Solr and Nutch? If yes then where they will saved their crawled indexes?

4) How Solr communicate with Hadoop?

5) Please explain me one by one steps if possible, that how can i crawl some sites and save their info into DB(Hadoop or any other) and then do search .

I am really really stuck with this. Any help will really appreciated.

A very big Thanks in advance. :)

Please help me to sort out my huge issue please

We are using Nutch as a webcrawler and Solr for searching in some productive environments. So I hope I can give you some information about 3).

How does this work? Nutch has it's own crawling db and some websites where it starts crawling. It has some plugins where you can configure different things like pdf crawling, which fields will be extracted of html sites and so on. When crawling Nutch stores all links extracted from a website and will follow them in the next cycle. All crawling results will be stored in a crawl db. In Nutch you configure an intervall where crawled results will be outdated and the crawler begins from the defined startsites.

The results inside the crawl db will be synchronized to the solr index. So you are searching on the solr index. Nutch is in this constallation only to get data from websites and providing them for solr.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM