
Elasticsearch attachment plugin performance improvement

I am new to Elasticsearch and am trying to parse PDF files through an ingest pipeline using the Elasticsearch attachment plugin, but parsing takes a long time, and the time grows with the PDF size: 1 MB = 2 sec, 5 MB = 15 sec, 10 MB = 25 sec, and so on. Please advise how to improve this execution time.

PUT _ingest/pipeline/attachment
{
 "description" : "Extract attachment information",
 "processors" : [
 {
  "attachment" : {
    "field" : "data"
  }
 }
]
}

PUT my-index-000001/_doc/my_id?pipeline=attachment
{
 "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}

Thanks

It's an expensive operation and will cost resources. I would explore using FSCrawler ( https://fscrawler.readthedocs.io/en/fscrawler-2.9/ ) or another Tika library to offload the whole operation from ES. You might be able to get a lot of things done in parallel, or process the data before it's ready to index.
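
To make the "offload and parallelize" idea a bit more concrete, here is a minimal sketch in Python. It is not FSCrawler's or Tika's API: `placeholder_extract` and `build_bulk_body` are hypothetical names, and the placeholder simply decodes bytes where a real setup would call an actual extractor (for example a Tika server). The sketch extracts text from several documents concurrently on the client side and builds an NDJSON body in the shape the Elasticsearch `_bulk` API expects, so that ES only has to index plain text.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def placeholder_extract(raw: bytes) -> str:
    """Hypothetical stand-in for a real extractor such as a Tika client.

    Assumption: a real implementation would parse the PDF bytes here.
    """
    return raw.decode("utf-8", errors="replace")


def build_bulk_body(docs, index, extract=placeholder_extract, workers=4):
    """Extract text from (doc_id, raw_bytes) pairs concurrently and return
    an NDJSON string: one action line plus one source line per document."""
    docs = list(docs)
    # Threads suit an extractor that waits on an external service;
    # pool.map preserves the input order of the documents.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        texts = list(pool.map(lambda d: extract(d[1]), docs))

    lines = []
    for (doc_id, _), text in zip(docs, texts):
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps({"content": text}))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = [("a", b"hello"), ("b", b"world")]
    print(build_bulk_body(sample, "my-index-000001"), end="")
```

The resulting string could then be sent to `POST _bulk` without the `attachment` pipeline, since the text is already extracted. If the extractor is CPU-bound pure Python rather than a call to an external service, a process pool would probably be a better fit than threads.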


 