
Apache Solr does not index scanned PDFs

I would like to index scanned PDF files. I have installed Solr 6.3.0, Tesseract 3.04, and Leptonica 1.74 on CentOS 6. I have configured my solrconfig according to the documentation.

I have tested Tesseract and Solr with PNG and JPG files, and everything looks fine. But when I try to index scanned PDF files, Solr does not index the scanned image; it only extracts the PDF comment message (sample document). (According to the index response, DefaultParser and PDFParser were used.)

After that I Googled the problem and found this solution (I tested it, and it works!). However, I could not convert the Java code to XML configuration. How should I express that Java code in the XML configuration file?

Any help would be great!

You can use Lucene 3.0 to index and search scanned PDF files. I have used Lucene 3.0 to index a scanned PDF file and to search for the most frequently repeated words in it.
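
As for the XML configuration asked about in the question, here is a rough sketch rather than a confirmed fix. It assumes the Java solution works by enabling inline image extraction on Tika's PDFParserConfig (the linked code is not shown here, so that is a guess), and that your Solr version supports the parseContext.config parameter of the ExtractingRequestHandler. The file name parseContext.xml is only an example, and the handler would keep whatever other settings it already has.

In solrconfig.xml, point the extracting handler at a parse context file:

```xml
<!-- Sketch only: assumes the existing /update/extract handler; keep its other settings. -->
<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
  <!-- parseContext.xml is an example file name, resolved from the core's conf directory -->
  <str name="parseContext.config">parseContext.xml</str>
</requestHandler>
```

Then, in conf/parseContext.xml, set the PDF parser option:

```xml
<!-- Sketch only: assumes the Java code called setExtractInlineImages(true) on PDFParserConfig -->
<entries>
  <entry class="org.apache.tika.parser.pdf.PDFParserConfig"
         impl="org.apache.tika.parser.pdf.PDFParserConfig">
    <!-- Pass images embedded in the PDF on to the image parsers (and so to Tesseract OCR) -->
    <property name="extractInlineImages" value="true"/>
  </entry>
</entries>
```

After reloading the core, re-index the sample document and check whether the OCR text shows up in the extracted content.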
