
How to index PDF files from HDFS to Solr

I am new to Apache Solr. My project requires me to upload PDF documents from HDFS to Solr and then retrieve them through Solr's REST APIs. I have a total of 40k PDF documents on my local file system, which I will first push to HDFS. But I have no idea how to get them from there into Solr.

Another thing: while indexing into Solr, I also want to read some data from each PDF document and index that data as well. Example: I want to extract the candidate name and candidate location from the PDF document and push them into a Solr schema that looks like:

name: "candidate_name"
location: "candidate_location"
document: "pdf_document"

I searched the internet for this but couldn't find the right solution.

Try using https://github.com/lucidworks/hadoop-solr

You should try the DirectoryIngestMapper. It has Tika parsing, but you will have to customize it.
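To give an idea of what that customization might involve, here is a minimal Python sketch (not the hadoop-solr API) of the field-mapping step. It assumes the PDF text has already been extracted, for example by Tika, and that the documents contain lines such as "Name: ..." and "Location: ...". Those label patterns, the `id` field, and the function names are assumptions for illustration; only `name`, `location` and `document` come from the schema in the question.

```python
import json
import re
import urllib.request

# Hypothetical label patterns; adjust them to the real layout of the PDFs.
NAME_RE = re.compile(r"^[ \t]*Name[ \t]*:[ \t]*(.+?)[ \t]*$",
                     re.IGNORECASE | re.MULTILINE)
LOCATION_RE = re.compile(r"^[ \t]*Location[ \t]*:[ \t]*(.+?)[ \t]*$",
                         re.IGNORECASE | re.MULTILINE)


def build_solr_doc(doc_id, text):
    """Map already-extracted PDF text to the name/location/document fields."""
    # "id" is assumed to be the collection's unique key.
    doc = {"id": doc_id, "document": text}
    name = NAME_RE.search(text)
    location = LOCATION_RE.search(text)
    if name:
        doc["name"] = name.group(1)
    if location:
        doc["location"] = location.group(1)
    return doc


def post_to_solr(docs, update_url):
    """POST a list of documents as JSON to a Solr update endpoint."""
    body = json.dumps(docs).encode("utf-8")
    request = urllib.request.Request(
        update_url,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    # Sample text standing in for Tika output; no network call is made here.
    sample = "Name: Jane Doe\nLocation: Pune\nExperience: 5 years\n"
    print(json.dumps(build_solr_doc("hdfs:///resumes/jane.pdf", sample), indent=2))
```

With a running Solr instance, `post_to_solr` would be pointed at the collection's update handler, something like `http://localhost:8983/solr/<collection>/update?commit=true` (host and collection name are placeholders). For 40k files the same mapping logic would sit inside the customized mapper so that extraction and indexing run on the cluster.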
