
Indexing HTML files using SOLR

I am trying to index a set of HTML files using Solr. The basic idea is to implement site search for a website I have developed. I am very new to Lucene and Solr; I have tried a few of the samples available on the site and indexed a few documents with them, but I cannot decide what the best approach would be. Some sources suggest using DataImportHandler, while elsewhere I see ExtractingRequestHandler being used. My own simple attempt used ExtractingRequestHandler. Also, I will have to keep the list of files up to date: for example, some HTML files may be removed in the future and others added. Please suggest which factors to consider when choosing an approach.

Cheers!!

I would recommend using Nutch to crawl your HTML files and index them into Solr. It has built-in support for tracking the removal and addition of files on the site.

Also check out the Nutch Wiki for tutorials on getting started.
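
If the site is small and the files sit on local disk, the ExtractingRequestHandler route from the question can also be scripted by hand. The following is only a rough sketch under several assumptions: a Solr core named `site` running at `localhost:8983` with the extraction handler enabled at `/update/extract`, the file's relative path used as the document id, and a list of previously indexed ids that you keep yourself (for example in a manifest file). The helper names are made up for illustration; this is not something Solr or Nutch provides.

```python
import json
from pathlib import Path
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed location of the Solr core; adjust to your installation.
SOLR_CORE_URL = "http://localhost:8983/solr/site"


def extract_url(doc_id, core_url=SOLR_CORE_URL):
    """Build the ExtractingRequestHandler URL for one document."""
    params = urlencode({"literal.id": doc_id, "commit": "true"})
    return f"{core_url}/update/extract?{params}"


def plan_sync(indexed_ids, root):
    """Compare ids already in the index with the *.html files under root.

    Returns (to_index, to_delete). Every file on disk is re-posted so that
    edited pages are refreshed; ids with no file left are deleted.
    """
    on_disk = {p.relative_to(root).as_posix() for p in Path(root).rglob("*.html")}
    return sorted(on_disk), sorted(set(indexed_ids) - on_disk)


def index_file(root, doc_id, core_url=SOLR_CORE_URL):
    """POST one HTML file to Solr for text extraction and indexing."""
    body = (Path(root) / doc_id).read_bytes()
    req = Request(extract_url(doc_id, core_url), data=body,
                  headers={"Content-Type": "text/html"})
    with urlopen(req) as resp:
        return resp.status


def delete_ids(ids, core_url=SOLR_CORE_URL):
    """Remove documents whose source files no longer exist."""
    payload = json.dumps({"delete": list(ids)}).encode("utf-8")
    req = Request(f"{core_url}/update?commit=true", data=payload,
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return resp.status
```

A run would call `plan_sync(previous_ids, "/path/to/site")`, then `index_file` for each id in the first list and `delete_ids` on the second. Committing on every request keeps the sketch short but is slow for many files; Nutch handles the crawling, change tracking and batching for you, which is why it is the better fit once the site grows.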

