
Apache Solr does not index scanned PDFs

Original post: 2017-01-16 11:48:16. Tags: java, solr, lucene, apache-tika

I would like to index scanned PDF files. I have installed Solr 6.3.0, Tesseract 3.04, and Leptonica 1.74 on CentOS 6, and I have configured my solrconfig according to the documentation.

I have tested Tesseract and Solr with PNG and JPG files, and everything looks fine. But when I try to index scanned PDF files, Solr does not index the scanned image; it only extracts the PDF comment message (sample document). According to the index response, DefaultParser and PDFParser were used.

After that I googled the problem and found this solution (I tested it, and it works!). However, I could not convert the Java code to XML configuration. How should I express that Java code in the XML configuration file?

Any help would be great!

1 answer

You can use Lucene 3.0 to index and search scanned PDF files. I have used Lucene 3.0 to index a scanned PDF file and then search for the most frequently repeated words in it.
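
On the XML side of the question: Solr's ExtractingRequestHandler accepts a `parseContext.config` parameter that points to a file of Tika parser settings. The sketch below assumes the Java solution mentioned in the question works by enabling inline image extraction on Tika's `PDFParserConfig`. That code is not reproduced in this thread, so this is an assumption. The file name `parseContext.xml` and the handler path `/update/extract` are illustrative choices, not values taken from the question.

```xml
<!-- parseContext.xml, placed in the core's conf directory (file name is an assumption) -->
<entries>
  <entry class="org.apache.tika.parser.pdf.PDFParserConfig"
         impl="org.apache.tika.parser.pdf.PDFParserConfig">
    <!-- Ask the PDF parser to hand embedded images to other parsers, such as the OCR parser -->
    <property name="extractInlineImages" value="true"/>
  </entry>
</entries>

<!-- solrconfig.xml: reference the file from the extracting request handler -->
<requestHandler name="/update/extract"
                class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <str name="parseContext.config">parseContext.xml</str>
</requestHandler>
```

Each `property` name corresponds to a setter on the configured class, so `extractInlineImages` here would map to `PDFParserConfig.setExtractInlineImages(true)` in Java. Whether this is sufficient for a given scanned PDF depends on the Tika version bundled with Solr and on Tesseract being visible to the Solr process.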