Searching (extracting text) PDF files with Algolia

Original 2016-07-28 15:35:57 3 2 php/ search/ algolia

This is just a speculative idea for a client who has a lot of PDF files.

Algolia says in its FAQ that to search PDF files you first need to extract the text from them. How would you go about this?

The way I envisage the system working would be:

1. Client uploads a PDF via the CMS
2. The CMS calls some service / program to extract the text
3. Algolia indexes the extracted text, and it's somehow linked to the original PDF

It would need to be an automated system, as the client shouldn't have to tell it to index. It would be built in PHP, probably Laravel running on Ubuntu.

What software / service could do the text extraction from the PDFs, and is any magic needed to 'link' the extracted text with the PDF file?

I'm also happy to have suggestions on other search services which may handle this.

2 answers

Fortunately, text extraction from PDFs is a subject that has been covered multiple times. On the command line, you could use pdftotext (available on Linux or Mac), or in your code a library such as Apache Tika (for which you can find a PHP wrapper).
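As a rough illustration of the command-line route, here is a minimal sketch that shells out to pdftotext. It is written in Python for brevity; the same call could be made from PHP or a Laravel job. It assumes pdftotext (from poppler-utils) is installed and on the PATH, and the function names and file path are hypothetical.

```python
from __future__ import annotations

import subprocess


def pdftotext_command(pdf_path: str) -> list[str]:
    """Build the pdftotext argument list for one PDF (hypothetical helper)."""
    # "-enc UTF-8" sets the output text encoding; the trailing "-" asks
    # pdftotext to write the text to stdout instead of a .txt file.
    return ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]


def extract_text(pdf_path: str) -> str:
    """Run pdftotext and return the extracted text as a string."""
    # check=True raises CalledProcessError if pdftotext exits with an error,
    # for example on a missing or unreadable file.
    result = subprocess.run(
        pdftotext_command(pdf_path), capture_output=True, check=True
    )
    return result.stdout.decode("utf-8", errors="replace")
```

In the upload flow described in the question, the CMS would call something like extract_text("uploads/report.pdf") right after the file is saved. Note that scanned PDFs without a text layer would come back empty and would need OCR instead.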

To avoid having too much noise in your records, I'd recommend that you then split the text and create one record per paragraph. You can then use Algolia's distinct feature to deduplicate the results, so that each PDF appears only once even when several of its paragraphs match.
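The splitting step could look roughly like the sketch below (Python again, though it ports directly to PHP). The record attribute names (docId, url, title, content), the min_chars noise threshold, and the helper names are all assumptions made for illustration. attributeForDistinct and distinct are Algolia index settings; the upload call itself is left out because it depends on which API client and version you use.

```python
import re


def split_paragraphs(text, min_chars=20):
    """Split extracted text on blank lines and drop very short fragments."""
    # pdftotext separates pages with form feeds; treat them as paragraph breaks.
    text = text.replace("\f", "\n\n")
    paragraphs = []
    for block in re.split(r"\n\s*\n", text):
        paragraph = " ".join(block.split())  # collapse line wraps and spaces
        if len(paragraph) >= min_chars:      # skip page numbers and stray debris
            paragraphs.append(paragraph)
    return paragraphs


def build_records(text, doc_id, pdf_url, title, min_chars=20):
    """Create one record per paragraph, each carrying the link to its PDF."""
    return [
        {
            "objectID": f"{doc_id}-{i}",  # unique per paragraph
            "docId": doc_id,              # shared by all paragraphs of one PDF
            "url": pdf_url,               # used by the front end to link the file
            "title": title,
            "content": paragraph,
        }
        for i, paragraph in enumerate(split_paragraphs(text, min_chars))
    ]


# Index settings so that only the best-matching paragraph per PDF is returned.
INDEX_SETTINGS = {
    "searchableAttributes": ["title", "content"],
    "attributeForDistinct": "docId",
    "distinct": True,
}
```

Because every record stores the same docId and url as its siblings, deduplicating on docId still leaves you with a hit that knows which file to link to.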

You should already have the links to your files somewhere; just store them in your records, and then in your front end you'll easily be able to create links to them using, for instance, autocomplete.js or instantsearch.js.

For anyone still looking for a solution, I put together a GitHub repository that does exactly that: https://github.com/PDFTron/pdftron-document-search

The sample app uses React + Firebase + Algolia, and the text extraction happens client-side as the user uploads the document.

You can check out a quick video walking you through the sample app: https://youtu.be/IQATnzHTp7Q

Let me know if you have any questions.