Apache Solr does not index scanned PDFs

I would like to index scanned PDF files. I have installed Solr 6.3.0, Tesseract 3.04, and Leptonica 1.74 on CentOS 6, and I have configured my solrconfig according to the documentation.

I have tested Tesseract and Solr with PNG and JPG files, and everything looks fine. But when I try to index scanned PDF files, Solr does not index the scanned image; it only extracts the PDF comment message (sample document). According to the index response, DefaultParser and PDFParser were used.

After that I Googled the problem and found this solution (I tested it, and it works!). However, I could not convert the Java code to XML configuration. How should I express that Java code in the XML configuration file?

Any help would be great!
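For reference, a possible shape of the XML being asked about. This is only a sketch under assumptions: the linked Java solution is not reproduced here, so it is assumed to be the commonly cited Tika setting `PDFParserConfig.setExtractInlineImages(true)`. The file name `parseContext.xml` and the handler path `/update/extract` are illustrative choices, not taken from the question. Solr's ExtractingRequestHandler accepts a `parseContext.config` parameter that points at a file of this form:

```xml
<!-- parseContext.xml (hypothetical file name), placed in the core's conf directory -->
<!-- Assumes the Java fix was PDFParserConfig.setExtractInlineImages(true) -->
<entries>
  <entry class="org.apache.tika.parser.pdf.PDFParserConfig"
         impl="org.apache.tika.parser.pdf.PDFParserConfig">
    <!-- Maps to the setter setExtractInlineImages(true) -->
    <property name="extractInlineImages" value="true"/>
  </entry>
</entries>
```

```xml
<!-- solrconfig.xml: reference the file from the extracting handler -->
<requestHandler name="/update/extract"
                class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <str name="parseContext.config">parseContext.xml</str>
</requestHandler>
```

If the linked solution sets other properties, each one would presumably become another `<property>` element named after the corresponding setter; this should be checked against the Solr and Tika versions in use.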

You can use Lucene 3.0 to index and search scanned PDF files. I have used Lucene 3.0 to index a scanned PDF file and to search for the most frequently repeated words in it.
