
How do I configure Apache Tika and Apache Solr to index and search a directory of PDF files?

How can I make Apache Tika index a directory of PDF and text files, including subdirectories, and submit the results to Apache Solr so that I have a search engine for the content of this directory?

Any advice is appreciated; Windows or Linux, it doesn't matter. I have not been able to get this to work because the documentation for these two projects is mostly geared towards developers. That is fine, but it is too vague for a non-Java developer to follow.

So, very simply: how do I build a search engine using the Apache Lucene family of projects that can index and provide search for /home/material or c:/material or /cygdrive/c/material?

Thanks a lot in advance

What programming language are you familiar with?

As a Python guy, I would get familiar with urllib2 (urllib.request in Python 3), an HTTP client library, and the os module, which handles the filesystem (listing the files in a directory, opening a file so it can be POSTed to Solr). Also relevant is the set data type, which can be used to compare the documents on the filesystem with those in the Solr index.

So,

  1. learn to POST rich documents to Solr (using a Solr library or an HTTP client library)
  2. write logic to retrieve all document names from both Solr and the directory
  3. upload all missing or changed documents to Solr.
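Steps 2 and 3 could be sketched roughly like this. It is only an illustration: list_local_files and plan_sync are made-up helper names, and the indexed argument assumes you have already fetched a mapping of document id to last-indexed modification time from Solr (how you store and query that is up to your schema).

```python
import os


def list_local_files(root):
    """Return {relative path: modification time} for every file under root."""
    found = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found[os.path.relpath(full, root)] = os.path.getmtime(full)
    return found


def plan_sync(local, indexed):
    """Compare the filesystem with the index using set operations.

    local   -- {path: mtime} as returned by list_local_files()
    indexed -- {path: mtime} describing what Solr already holds
               (assumed to be fetched separately; hypothetical shape)
    Returns (paths to upload, paths to delete from the index).
    """
    local_ids = set(local)
    indexed_ids = set(indexed)
    missing = local_ids - indexed_ids            # on disk, not in Solr
    deleted = indexed_ids - local_ids            # in Solr, gone from disk
    changed = {p for p in local_ids & indexed_ids if local[p] > indexed[p]}
    return sorted(missing | changed), sorted(deleted)
```

For example, plan_sync(list_local_files("/home/material"), indexed) gives the list of files that still need to be POSTed, plus the ids you may want to remove from the index.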

Solr provides the ExtractingRequestHandler, which uses Tika to extract text from rich documents (PDF, Word and so on) and index it.
The examples on its documentation page use curl to feed data to Solr.
A simple script that walks through the folders and subfolders and runs a curl command for each file can build an index over all the documents.
If you are using a Solr client such as SolrJ or rsolr, you can just as easily iterate through the directory and send the HTTP requests that index the documents.
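As a rough sketch of such a script in Python 3, using only the standard library instead of curl: the core name "material" and the port are assumptions, and it presumes the /update/extract handler is enabled in your Solr configuration. Each file's path is used as its document id via the literal.id parameter, and the raw file is sent as the request body.

```python
import mimetypes
import os
import urllib.parse
import urllib.request

# Assumed location of the Solr core; "material" is a hypothetical core name.
SOLR_CORE_URL = "http://localhost:8983/solr/material"


def build_extract_request(path, core_url=SOLR_CORE_URL):
    """Build (but do not send) a POST to the extracting handler for one file."""
    params = urllib.parse.urlencode({"literal.id": path})
    content_type = mimetypes.guess_type(path)[0] or "application/octet-stream"
    with open(path, "rb") as fh:
        body = fh.read()
    return urllib.request.Request(
        core_url + "/update/extract?" + params,
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )


def index_tree(root, core_url=SOLR_CORE_URL):
    """Walk root recursively, send every file to Solr, then commit once."""
    count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            request = build_extract_request(os.path.join(dirpath, name), core_url)
            with urllib.request.urlopen(request) as response:
                response.read()
            count += 1
    # A single commit at the end makes the new documents searchable.
    commit = urllib.request.Request(
        core_url + "/update",
        data=b'{"commit": {}}',
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(commit) as response:
        response.read()
    return count
```

Calling index_tree("/home/material") (or index_tree("c:/material") on Windows) would then push the whole directory; there is no error handling or retry logic here, so treat it as a starting point only.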
