
Apache Solr does not index scanned PDFs

Original post: 2017-01-16 11:48:16 3 1 java/ solr/ lucene/ apache-tika

I would like to index scanned PDF files. I have installed Solr 6.3.0, Tesseract 3.04, and Leptonica 1.74 on CentOS 6. I have configured my solrconfig according to the documentation.

I have tested Tesseract and Solr with PNG and JPG files, and everything looks fine. But when I try to index scanned PDF files, Solr does not index the scanned image; it only extracts the PDF comment message (sample document). (According to the index response, DefaultParser and PDFParser were used.)

After that I Googled the problem and found this solution (I tested it, and it works!). However, I could not convert the Java code to XML configuration. How should I express that Java code in the XML configuration file?

Any help would be great!
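For the question above, a minimal sketch of what an XML equivalent might look like, assuming the linked Java solution enables inline image extraction on Tika's `PDFParserConfig` (the usual way to make Tika pass images embedded in a PDF to Tesseract). Solr's extracting request handler accepts a `parseContext.config` parameter that points to a file of this shape; the file name `parseContext.xml` is an assumption chosen for illustration, and the property shown may not match what the linked solution actually sets.

```xml
<!-- parseContext.xml (hypothetical file name), placed in the core's conf directory -->
<entries>
  <!-- Configure Tika's PDF parser; the property maps to PDFParserConfig.setExtractInlineImages(true) -->
  <entry class="org.apache.tika.parser.pdf.PDFParserConfig"
         impl="org.apache.tika.parser.pdf.PDFParserConfig">
    <property name="extractInlineImages" value="true"/>
  </entry>
</entries>
```

The file would then be referenced from the `/update/extract` request handler in solrconfig.xml with `<str name="parseContext.config">parseContext.xml</str>`. Whether OCR actually runs still depends on Tesseract being available to the Solr process.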

1 answer

You can use Lucene 3.0 to index and search scanned PDF files. I have used Lucene 3.0 to index a scanned PDF file and to search for the most frequently repeated words in it.


The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please indicate the site URL or the original address. For any questions, please contact: yoyou2525@163.com.


solution1 0 2017-03-21 07:10:01
