How can I make Apache Tika index a directory of PDF and text files, including subdirectories, and submit the content to Apache Solr so that I have a search engine for that directory?
Any advice appreciated; Windows or Linux, it doesn't matter. I have not been able to get this working because the documentation for these two projects is mostly geared toward developers, which is fine, but as a non-Java developer I cannot follow it: it is vague and not clear enough.
So, very simply: how do I build a search engine using the Apache Lucene family of projects that can index and provide search over /home/material, c:/material, or /cygdrive/c/material?
Thanks a lot in advance
What programming language are you familiar with?
As a Python guy, I would gain familiarity with urllib2, an HTTP client library, and the os module, which handles the filesystem (listing the files in a directory, opening a file pointer for POSTing a file to Solr). Also relevant is the set data type, which can be used to compare the documents on the filesystem with those in the Solr index.
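To make that concrete, here is a minimal sketch of the filesystem half. The helper name `collect_documents` and the extension list are my own choices, not anything Tika or Solr mandates; note also that urllib2 is Python 2 only, and its Python 3 counterpart is urllib.request.

```python
import os

def collect_documents(root, extensions=(".pdf", ".txt")):
    """Recurse through root with os.walk and collect matching file paths."""
    paths = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(extensions):
                paths.add(os.path.join(dirpath, name))
    return paths

# Because this returns a set, a set difference tells you which files exist
# on disk but are not yet in the index (assuming you can fetch the list of
# indexed ids from Solr):
#     to_index = collect_documents("/home/material") - already_indexed
```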
Solr provides the ExtractingRequestHandler, which handles indexing of rich documents (it runs Tika for you). The examples on that page use curl to feed data to Solr. A simple script that iterates through the folders and subfolders and executes the corresponding curl commands can build an index over all the documents.
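A sketch of that iteration in Python, building the request URL for each file rather than shelling out to curl. The host, port, and core name ("material") are assumptions for illustration; `/update/extract`, `literal.id`, and `commit` are the ExtractingRequestHandler's own endpoint and parameters.

```python
import urllib.parse

# Assumed Solr location and core name; adjust to your setup.
SOLR_EXTRACT = "http://localhost:8983/solr/material/update/extract"

def extract_url(path, base=SOLR_EXTRACT):
    """Build the ExtractingRequestHandler URL for one file.

    literal.id stores the file path as the document id; commit=true makes
    the document searchable immediately (fine for a one-off batch script).
    """
    params = urllib.parse.urlencode({"literal.id": path, "commit": "true"})
    return "%s?%s" % (base, params)

# POSTing the file body to that URL is the equivalent of:
#     curl "<url>" -F "myfile=@/home/material/report.pdf"
# and could be done with urllib.request; sketch only, assumes Solr is running:
#     import urllib.request
#     with open(path, "rb") as fp:
#         req = urllib.request.Request(extract_url(path), data=fp.read())
#         urllib.request.urlopen(req)
```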
If you are using a Solr client library such as SolrJ (Java) or RSolr (Ruby), you can just as easily iterate through the directory and issue the HTTP requests to index the documents.