Elasticsearch attachment plugin vs. own Tika implementation

I want to use the Tika toolkit to index the content of document files (PDF, DOCX, ...) and images (via the Tesseract plugin).

I tried the Elasticsearch ingest attachment plugin ( https://www.elastic.co/guide/en/elasticsearch/plugins/master/ingest-attachment.html ). It works pretty well, but it has no built-in OCR. I also have to send my file as base64, which means high memory usage, and Elasticsearch indexes the "data" (base64) field, which is useless.

I'm thinking of using the Tika toolkit directly and then indexing the extracted content in Elasticsearch.

So I'm wondering whether that is the better approach.
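To make the complaint about the indexed "data" field concrete, here is a minimal sketch of a pipeline body that runs the attachment processor and then a `remove` processor to drop the base64 source before the document is stored. The helper names `build_attachment_pipeline` and `build_document` are made up for illustration, and the sketch only builds the JSON bodies; it does not talk to a cluster. It does not address the memory cost of sending base64 in the first place.

```python
import base64
import json


def build_attachment_pipeline(source_field="data"):
    """Return an ingest pipeline body: extract text, then drop the base64 source."""
    return {
        "description": "Extract attachment text and discard the base64 payload",
        "processors": [
            # The attachment processor decodes the base64 field and extracts its text.
            {"attachment": {"field": source_field}},
            # The remove processor deletes the raw base64 field before indexing.
            {"remove": {"field": source_field}},
        ],
    }


def build_document(raw_bytes, source_field="data"):
    """Return a document body carrying the file as base64 (hypothetical helper)."""
    return {source_field: base64.b64encode(raw_bytes).decode("ascii")}


if __name__ == "__main__":
    print(json.dumps(build_attachment_pipeline(), indent=2))
    print(json.dumps(build_document(b"hello world")))
```

The pipeline body would be registered with `PUT _ingest/pipeline/<name>` and applied by indexing with `?pipeline=<name>`; the pipeline name is up to you.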

We've created a system to process files (Crawl -> OCR -> Index -> Search). It's called Ambar. We built it to be a good, solid replacement for Ingest Attachment.

As the search engine we use Elasticsearch; as the content extractor we use Tika + Tesseract + ImageMagick + custom extractors for PDF.

We made it to provide a simple yet powerful alternative to your own Tika + ES implementation.

Check out GitHub for more details.

At the time of writing, there is little to no documentation about enabling OCR via Tesseract in the elasticsearch-mapper-attachments plugin.

Everything points to handling OCR outside of Elasticsearch and then indexing the content separately.

Reference: https://github.com/elastic/elasticsearch-mapper-attachments/issues/10
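One way to handle extraction outside Elasticsearch is sketched below. It assumes a Tika Server instance is running at its default address (`http://localhost:9998`) and an Elasticsearch node at `http://localhost:9200`; both URLs, the index layout (`filename`, `content`) and the function names are illustrative assumptions, not something taken from this thread. To my understanding, Tika only performs OCR on images when Tesseract is installed where the server can find it, so treat that part as something to verify in your setup.

```python
import json
import urllib.request

TIKA_URL = "http://localhost:9998/tika"  # assumed: default Tika Server endpoint
ES_URL = "http://localhost:9200"         # assumed: local Elasticsearch node


def build_tika_request(raw_bytes, tika_url=TIKA_URL):
    """Build a PUT request that sends raw file bytes and asks for plain text back."""
    return urllib.request.Request(
        tika_url, data=raw_bytes, headers={"Accept": "text/plain"}, method="PUT"
    )


def build_index_request(index, doc_id, filename, content, es_url=ES_URL):
    """Build a PUT request that indexes only the extracted text, not the file."""
    body = json.dumps({"filename": filename, "content": content}).encode("utf-8")
    return urllib.request.Request(
        f"{es_url}/{index}/_doc/{doc_id}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def extract_and_index(path, index, doc_id):
    """Extract text with Tika Server, then index it in Elasticsearch."""
    with open(path, "rb") as handle:
        raw_bytes = handle.read()
    # Raw bytes go to Tika directly, so no base64 copy is ever created.
    with urllib.request.urlopen(build_tika_request(raw_bytes)) as response:
        text = response.read().decode("utf-8")
    with urllib.request.urlopen(build_index_request(index, doc_id, path, text)) as response:
        return json.load(response)
```

With both services running, `extract_and_index("report.pdf", "files", "1")` would store just the filename and extracted text. The sketch skips error handling and does not URL-quote `doc_id`, so it suits simple identifiers only.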
