
How to auto-index data using solr and nutch?

Original 2015-05-28 06:08:36 6 2 apache/ solr/ nutch/ solrj/ moss2007enterprisesearch

I want to automatically index a document or a website when it is fed to Apache Solr. How can we achieve this? I have seen examples that use a cron job invoked via a PHP script, but their explanations are not very clear. Using the Java API SolrJ, is there any way to index data automatically, without having to do it manually?

2 answers

You can write a scheduler that calls the SolrJ code doing the indexing/reindexing.

For writing the scheduler, please refer to the links below:

http://www.mkyong.com/java/how-to-run-a-task-periodically-in-java/

http://archive.oreilly.com/pub/a/java/archive/quartz.html
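As a rough illustration of the scheduler idea, here is a minimal sketch using the JDK's `ScheduledExecutorService` rather than Quartz. The class name `PeriodicIndexer` and the `start` helper are hypothetical names chosen for this example; the `indexTask` is a placeholder where your own SolrJ code (for instance, adding documents through a `SolrClient` and then committing) would go.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: runs an indexing task at a fixed rate.
class PeriodicIndexer {

    // Schedules indexTask and returns the executor so the caller can shut it down.
    static ScheduledExecutorService start(Runnable indexTask, long initialDelay,
                                          long period, TimeUnit unit) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        Runnable safeTask = () -> {
            try {
                indexTask.run();
            } catch (RuntimeException e) {
                // An exception escaping the task would cancel all later runs,
                // so report the failure and let the schedule continue.
                System.err.println("Indexing run failed: " + e);
            }
        };
        scheduler.scheduleAtFixedRate(safeTask, initialDelay, period, unit);
        return scheduler;
    }

    public static void main(String[] args) throws InterruptedException {
        // Placeholder task (assumption): replace the body with your SolrJ indexing code.
        Runnable indexTask = () -> System.out.println("Re-indexing...");

        ScheduledExecutorService scheduler = start(indexTask, 0, 1, TimeUnit.SECONDS);
        Thread.sleep(3500);   // let a few runs happen for the demo
        scheduler.shutdown(); // stop scheduling further runs
    }
}
```

In a real deployment the period would more likely be minutes or hours, and Quartz (second link above) is worth considering if you need cron-style expressions or persistent job state.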

If you are using Apache Nutch, you have to use the Nutch solr-index plugin. With this plugin you can index web documents as soon as they are crawled by Nutch. The main question, then, is how to schedule Nutch to start periodically.

As far as I know, you have to use a scheduler for this purpose. I know of an old Nutch project called Nutch-base, which uses the Quartz scheduler to schedule Nutch jobs. You can find the source code of Nutch-base at the following link:

https://github.com/mathieuravaux/nutchbase

If you look at this project, there is a plugin called admin-scheduling. Although it was implemented for an old version of Nutch, it could be a good starting point for developing a scheduler plugin for Nutch.
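If writing a Nutch plugin is more than you need, a simpler alternative (not what Nutch-base does) is to launch the Nutch crawl script as an external process from any Java scheduler. The sketch below is only an assumption-laden illustration: `NutchCrawlLauncher` is a hypothetical class, and the argument layout (`bin/crawl -i -D solr.server.url=... <seedDir> <crawlDir> <rounds>`) is assumed to match the crawl script of a Nutch 1.x release; check the usage message of the script in your own installation, since the arguments have changed between versions.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: builds and runs a Nutch crawl-and-index command.
class NutchCrawlLauncher {

    // Assumed argument layout; verify against the bin/crawl usage text of your Nutch version.
    static List<String> buildCommand(File nutchHome, String seedDir, String crawlDir,
                                     String solrUrl, int rounds) {
        String script = new File(nutchHome, "bin/crawl").getAbsolutePath();
        return Arrays.asList(
                script,
                "-i",                                // index into Solr after each round
                "-D", "solr.server.url=" + solrUrl,  // where to send the documents
                seedDir,
                crawlDir,
                String.valueOf(rounds));
    }

    // Runs the command once and returns the process exit code (0 usually means success).
    static int runOnce(File nutchHome, List<String> command)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.directory(nutchHome); // run from the Nutch installation directory
        pb.inheritIO();          // show the crawl output on this process's console
        return pb.start().waitFor();
    }
}
```

A scheduled task would then call `runOnce(nutchHome, buildCommand(...))` on each tick; the paths, Solr URL, and number of rounds are all values you supply for your own setup.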

It is worth noting that if you are going to crawl a website periodically and fetch newly added links, you can use this tutorial.

How to see data crawled by nutch using solr?

Nutch deployment on hadoop will not index to solr

Crawl image and their metadata using nutch and index them into solr

Solr Indexing using Nutch Crawler

apache nutch to index to solr via REST

Extracting HTML meta tags in Nutch 2.x and having Solr 4 index it

how to import and index data from database in solr 6.2.1(New to solr)

how to crawl data on few topics using apache nutch?

Integrating Nutch and Solr

How to skip documents with empty content field during Nutch to Solr indexing?





solution1
0 2015-05-28 06:16:14

solution2
0 2015-05-28 06:46:16
