How to index PDF files from HDFS into Solr

Question

I am new to Apache Solr. My project requires me to upload PDF documents from HDFS to Solr and then retrieve them through Solr's REST APIs. I have 40k PDF documents in total on my local file system; first I will push them to HDFS. But I have no idea how to get them from there into Solr.

Another thing: while indexing into Solr, I also want to read some data from each PDF document and index that data as well. For example, I want to extract the candidate name and candidate location from the PDF document and push them into a Solr schema that looks like this:

name: "candidate_name"
location: "candidate_location"
document: "pdf_document"

I searched for this on the internet but couldn't find the right solution.

Answer 1

Try using https://github.com/lucidworks/hadoop-solr

You should try the DirectoryIngestMapper. It has Tika parsing, but you will have to customize it.
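To give a rough idea of what the customization step has to do, here is a minimal Python sketch of the field-extraction logic only. It is not hadoop-solr code (that project is Java), and it does not run Tika: it assumes the PDF text has already been extracted. The label patterns (`Name:`, `Location:`), the helper names `build_solr_doc` and `to_update_payload`, the sample HDFS path, and the use of `id` as the unique key are all assumptions made for illustration; only the `name`, `location`, and `document` fields come from the question.

```python
import json
import re

# Hypothetical label patterns: assumes the extracted text contains lines such as
# "Name: Jane Doe" and "Location: Pune". Real documents will need other rules.
NAME_RE = re.compile(r"^\s*Name\s*:\s*(.+?)\s*$", re.IGNORECASE | re.MULTILINE)
LOCATION_RE = re.compile(r"^\s*Location\s*:\s*(.+?)\s*$", re.IGNORECASE | re.MULTILINE)


def build_solr_doc(doc_id, text):
    """Map plain text extracted from one PDF onto the question's three fields."""
    name = NAME_RE.search(text)
    location = LOCATION_RE.search(text)
    # "id" is assumed to be the collection's unique key field.
    doc = {"id": doc_id, "document": text}
    if name:
        doc["name"] = name.group(1)
    if location:
        doc["location"] = location.group(1)
    return doc


def to_update_payload(docs):
    """Serialize documents as a JSON array of Solr documents."""
    return json.dumps(docs, ensure_ascii=False)


# Sample input standing in for text that Tika would extract from a PDF.
sample_text = "Name: Jane Doe\nLocation: Pune\nExperience: 5 years"
doc = build_solr_doc("hdfs:/resumes/jane.pdf", sample_text)
payload = to_update_payload([doc])
```

A payload shaped like this could then be sent with `Content-Type: application/json` to the collection's `/update` handler. Inside a customized mapper, the same extraction would have to happen in Java on the text Tika returns.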

Solution 1
0 2016-05-26 03:45:10