Apache Solr does not index scanned PDFs
I would like to index scanned PDF files. I have installed Solr 6.3.0, Tesseract 3.04, and Leptonica 1.74 on CentOS 6, and I have configured my solrconfig according to the documentation.
I have tested Tesseract and Solr with PNG and JPG files, and everything looks fine. But when I try to index scanned PDF files, Solr does not index the scanned image; it only extracts the PDF comment message (sample document). (According to the index response, DefaultParser and PDFParser were used.)
After that I Googled the problem and found this solution (I tested it, and it works!). However, I could not convert the Java code to XML configuration. How should I express that Java code in the XML configuration file?
Any help would be great!
You can use Lucene 3.0 to index and search scanned PDF files. I have used Lucene 3.0 to index a scanned PDF file and to search for the most frequently repeated words in it.
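Regarding the Java-to-XML part of the question, one possible approach is sketched below. The question does not show the Java code, so this assumes it enables inline image extraction on Tika's `PDFParserConfig` (for example via `setExtractInlineImages(true)`), and that the Solr version in use supports the `parseContext.config` parameter of the ExtractingRequestHandler. The file name `parseContext.xml` and the handler path `/update/extract` are illustrative choices, not taken from the original post.

```xml
<!-- parseContext.xml, placed in the core's conf directory (file name is an assumption) -->
<entries>
  <entry class="org.apache.tika.parser.pdf.PDFParserConfig"
         impl="org.apache.tika.parser.pdf.PDFParserConfig">
    <!-- Assumed equivalent of calling setExtractInlineImages(true) in Java -->
    <property name="extractInlineImages" value="true"/>
  </entry>
</entries>
```

```xml
<!-- solrconfig.xml: point the extract handler at the file above -->
<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler">
  <str name="parseContext.config">parseContext.xml</str>
</requestHandler>
```

If this matches the setup, reloading the core and re-indexing the sample PDF should show whether the embedded scanned images are now handed to the OCR parser. Whether this works depends on the Tika version bundled with Solr.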