How can I make Apache Tika index a directory of PDF and text files, including subdirectories, and submit the content to Apache Solr so that I have a search engine for that directory?
Any advice appreciated; Windows or Linux, it doesn't matter. I have not been able to get this working because the documentation for these two projects is mostly geared toward developers, which is fine, but as a non-Java developer I cannot follow it: it is vague and not clear enough.
So, very simply: how do I build a search engine using the Apache Lucene family of projects that can index and provide search over /home/material, c:/material, or /cygdrive/c/material?
Thanks a lot in advance
What programming language are you familiar with?
As a Python guy, I would gain familiarity with urllib2, an HTTP client library, and the os module, which handles the filesystem (listing the files in a directory, opening a file pointer for POSTing a file to Solr). Also relevant is the set data type, which can be used to compare the documents on the filesystem with those in the Solr index.
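To make that concrete, here is a minimal sketch of the filesystem half. The helper name `collect_documents` and the extension list are my own choices, not anything Tika or Solr mandates; note also that urllib2 is Python 2 only, and its Python 3 counterpart is urllib.request.

```python
import os

def collect_documents(root, extensions=(".pdf", ".txt")):
    """Recurse through root with os.walk and collect matching file paths."""
    paths = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(extensions):
                paths.add(os.path.join(dirpath, name))
    return paths

# Because this returns a set, a set difference tells you which files exist
# on disk but are not yet in the index (assuming you can fetch the list of
# indexed ids from Solr):
#     to_index = collect_documents("/home/material") - already_indexed
```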
Solr provides the ExtractingRequestHandler, which handles indexing of rich documents (it runs Tika for you). The examples on that page use curl to feed data to Solr. A simple script that iterates through the folders and subfolders and executes the corresponding curl commands can build an index over all the documents.
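A sketch of that iteration in Python, building the request URL for each file rather than shelling out to curl. The host, port, and core name ("material") are assumptions for illustration; `/update/extract`, `literal.id`, and `commit` are the ExtractingRequestHandler's own endpoint and parameters.

```python
import urllib.parse

# Assumed Solr location and core name; adjust to your setup.
SOLR_EXTRACT = "http://localhost:8983/solr/material/update/extract"

def extract_url(path, base=SOLR_EXTRACT):
    """Build the ExtractingRequestHandler URL for one file.

    literal.id stores the file path as the document id; commit=true makes
    the document searchable immediately (fine for a one-off batch script).
    """
    params = urllib.parse.urlencode({"literal.id": path, "commit": "true"})
    return "%s?%s" % (base, params)

# POSTing the file body to that URL is the equivalent of:
#     curl "<url>" -F "myfile=@/home/material/report.pdf"
# and could be done with urllib.request; sketch only, assumes Solr is running:
#     import urllib.request
#     with open(path, "rb") as fp:
#         req = urllib.request.Request(extract_url(path), data=fp.read())
#         urllib.request.urlopen(req)
```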
If you are using a Solr client library such as SolrJ (Java) or RSolr (Ruby), you can just as easily iterate through the directory and issue the HTTP requests to index the documents.