Elasticsearch attachment plugin performance improvement

I am new to Elasticsearch and am trying to parse PDF files through an ingest pipeline using the Elasticsearch attachment plugin. Parsing takes a long time and the time grows with file size: roughly 2 seconds for 1 MB, 15 seconds for 5 MB, 25 seconds for 10 MB, and so on. How can I improve this execution time?

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}

PUT my-index-000001/_doc/my_id?pipeline=attachment
{
 "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}

Thanks

It's an expensive operation and will cost resources. I would explore using FSCrawler ( https://fscrawler.readthedocs.io/en/fscrawler-2.9/ ) or another Tika-based library to offload the whole operation from ES. You might be able to do a lot of the work in parallel, or process the data before it is ready to index.
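As a rough illustration of offloading the work, here is a minimal Python sketch that extracts text from several files in parallel on the client side and then builds a request body for the Elasticsearch _bulk API, so the attachment pipeline is not needed at all. The names extract_all and build_bulk_body are hypothetical helpers, and extract_text stands in for whatever extractor you choose (for example, a call to a Tika server); none of this comes from FSCrawler itself. A thread pool is used on the assumption that extraction is a call out to an external service; for in-process, CPU-bound parsing a process pool would likely suit better.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List


def extract_all(paths: Iterable[str],
                extract_text: Callable[[str], str],
                workers: int = 4) -> List[Dict[str, str]]:
    """Run the caller-supplied (hypothetical) extractor over many files in parallel."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so texts line up with paths.
        texts = list(pool.map(extract_text, paths))
    return [{"path": p, "content": t} for p, t in zip(paths, texts)]


def build_bulk_body(index: str, docs: List[Dict[str, str]]) -> str:
    """Serialize documents as newline-delimited JSON for the _bulk API."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(doc))                           # document source
    # The _bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"
```

The resulting string can be sent with POST _bulk using the Content-Type application/x-ndjson. Because the documents already contain plain text, Elasticsearch only has to index them, and the number of workers can be tuned independently of the cluster.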
