
How do I configure Apache Tika and Apache Solr to index and search a directory of pdf files?

Original 2012-02-17 10:22:33 4 2 pdf/ solr/ lucene/ full-text-search/ apache-tika

How can I make Apache Tika index a directory of PDF and text files, including subdirectories, and submit it to Apache Solr so that I have a search engine for the content of this directory?

Any advice is appreciated; Windows or Linux, it doesn't matter. I have not been able to get this to work because the documentation for these two projects is mostly geared toward developers. That is fine, but it is too vague for a non-Java developer like me to follow.

So, very simply: how do I build a search engine using the Apache Lucene family of projects that can index and provide search for /home/material or c:/material or /cygdrive/c/material?

Thanks a lot in advance.

2 answers

What programming language are you familiar with?

As a Python guy, I would get familiar with urllib2, an HTTP client library, and the os module, which can handle the filesystem (list the files in a directory, open a file pointer for POSTing a file to Solr). Also relevant is the set data type, which can be used to compare the documents in the filesystem with those in the Solr index.

So,

Learn to POST rich documents to Solr (using a Solr library or an HTTP client library).
Write logic to retrieve all document names from Solr and from the directory.
Upload all missing or changed documents to Solr.
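
The comparison step can be sketched roughly as follows. This is a minimal sketch, not a tested recipe: it uses Python 3's urllib.request (the successor to urllib2), and it assumes a core named "material" on localhost:8983, that each document's id is its file path, and that Solr returns JSON for wt=json. The function names are made up for illustration. It only finds missing files; detecting changed files would need something extra, such as storing a modification time or checksum per document.

```python
import json
import os
import urllib.parse
import urllib.request

# Assumed Solr core URL; adjust host, port, and core name to your setup.
SOLR_URL = "http://localhost:8983/solr/material"


def list_files(root):
    """Return the set of all file paths under root, including subdirectories."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.add(os.path.join(dirpath, name))
    return found


def ids_from_response(payload):
    """Extract document ids from a decoded Solr JSON select response."""
    return {doc["id"] for doc in payload["response"]["docs"]}


def fetch_indexed_ids(solr_url=SOLR_URL, rows=100000):
    """Ask Solr for the ids of indexed documents (needs a running Solr)."""
    query = urllib.parse.urlencode(
        {"q": "*:*", "fl": "id", "rows": rows, "wt": "json"}
    )
    with urllib.request.urlopen(f"{solr_url}/select?{query}") as resp:
        return ids_from_response(json.load(resp))


def missing_from_index(on_disk, indexed):
    """Files present on disk but not yet in the index (a set difference)."""
    return on_disk - indexed
```

With a running Solr, `missing_from_index(list_files("/home/material"), fetch_indexed_ids())` would give the paths that still need to be uploaded.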

Solr provides the ExtractingRequestHandler, which helps with indexing rich documents.
The examples listed on that page use curl to feed data to Solr.
A simple script that iterates through the folders and subfolders and executes curl commands can create an index over all the documents.
If you are using a Solr client such as Solrj or rsolr, you can easily iterate through the directory and issue the HTTP requests to index the documents.
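
As an alternative to shelling out to curl, the same idea can be sketched in Python with only the standard library. Treat this as an illustration under assumptions: the /update/extract path and the core name "material" depend on your Solr version and configuration, the file path is used as the document id via the literal.id parameter, and the function names are hypothetical.

```python
import mimetypes
import os
import urllib.parse
import urllib.request

# Assumed endpoint; the path depends on your Solr version and core name.
EXTRACT_URL = "http://localhost:8983/solr/material/update/extract"


def build_extract_url(path, base=EXTRACT_URL, commit=False):
    """Build the request URL, using the file path as the document id."""
    params = {"literal.id": path}
    if commit:
        params["commit"] = "true"
    return base + "?" + urllib.parse.urlencode(params)


def post_file(path, base=EXTRACT_URL):
    """POST one file's raw bytes to Solr (needs a running Solr)."""
    content_type = mimetypes.guess_type(path)[0] or "application/octet-stream"
    with open(path, "rb") as handle:
        request = urllib.request.Request(
            build_extract_url(path, base),
            data=handle.read(),
            headers={"Content-Type": content_type},
            method="POST",
        )
    with urllib.request.urlopen(request) as response:
        return response.status


def index_directory(root, post=post_file):
    """Walk root recursively and hand every file to post; return paths sent."""
    sent = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            post(path)
            sent.append(path)
    return sent
```

Calling `index_directory("/home/material")` would send every file under that directory. Documents only become searchable after a commit, so a final request with commit=true (see `build_extract_url`) or an auto-commit setting in Solr is still needed.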