
Apache Solr does not index scanned PDFs

Original post: 2017-01-16 11:48:16 · Tags: java, solr, lucene, apache-tika

I want to index scanned PDF files. I have installed Solr 6.3.0, tesseract 3.04, and leptonica 1.74 on CentOS 6, and I have configured solrconfig according to the documentation.

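I have tested tesseract and Solr with PNG and JPG files, and everything looks fine. However, when I try to index a scanned PDF file, Solr does not index the scanned images; it extracts only the PDF annotation messages (sample document). According to the index response, DefaultParser and PDFParser are used.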

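After that, I googled the problem and found this solution (I tested it, and it works!), but I could not convert the Java code into XML configuration. How can I express the Java code in an XML configuration file?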

Any help would be great!

1 Answer

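You can use Lucene 3.0 to index and search scanned PDF files. I have been using Lucene 3.0 to index scanned PDF files and then search for the most frequently repeated words in them.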
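
On the original question of moving the Java-side settings into XML, here is a minimal sketch. It assumes the Solr version in use supports the `parseContext.config` parameter of the ExtractingRequestHandler, and that the working Java code enabled `extractInlineImages` on Tika's `PDFParserConfig`. The linked solution is not reproduced here, so both points are assumptions to check against your setup. The file name `parseContext.xml` is illustrative.

```xml
<!-- parseContext.xml, placed in the core's conf directory (illustrative file name) -->
<entries>
  <!-- Assumption: the Java solution called PDFParserConfig.setExtractInlineImages(true),
       so that images embedded in the PDF are handed to the OCR parser -->
  <entry class="org.apache.tika.parser.pdf.PDFParserConfig"
         impl="org.apache.tika.parser.pdf.PDFParserConfig">
    <property name="extractInlineImages" value="true"/>
  </entry>
</entries>
```

```xml
<!-- solrconfig.xml: point the extract handler at the file above -->
<requestHandler name="/update/extract"
                class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <str name="parseContext.config">parseContext.xml</str>
</requestHandler>
```

If this applies to your version, reload the core and re-post the scanned PDF to see whether the OCR text now appears in the extracted content. OCR of embedded images can slow extraction considerably, so it may be worth testing on a small document first.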

