
Extract text from files in multiple formats and store it in Elasticsearch

I need to extract the content of all the files stored in a folder. The files can be in formats such as PDF, Word documents, TXT, MSG, PPT, etc. I then need to store the content in Elasticsearch. The solution needs to be built as a pipeline architecture. I am planning to extract the content using Apache Tika and then store it in Elasticsearch. Is there a better approach to implement this solution?

You should look into the ingest attachment plugin, which bundles Apache Tika and does exactly what you need, i.e. it extracts content from PDF, DOC, PPT, etc.

Simply install it:

bin/elasticsearch-plugin install ingest-attachment

Then you can create a new ingest pipeline that runs the attachment processor on the "data" field:

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}

Finally, you can index your documents through that pipeline. The "data" field must contain the file content encoded as Base64:

PUT my_index/my_type/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
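
Since the question is about a whole folder of files, the indexing step above can be sketched in Python as follows. This is only a minimal sketch using the standard library: the host http://localhost:9200, the function names, and the use of the file name as the document ID are assumptions made for illustration, and the URL follows the same index/type/id pattern as the request above (newer Elasticsearch versions use a different path).

```python
import base64
import json
import urllib.parse
import urllib.request
from pathlib import Path


def build_document(path):
    """Return the request body for one file: its bytes, Base64-encoded, under "data"."""
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return {"data": encoded}


def build_request(base_url, index, doc_type, doc_id, body, pipeline="attachment"):
    """Build (but do not send) the PUT request that indexes `body` through the pipeline."""
    # Quote the ID so file names with spaces or slashes stay a single path segment.
    safe_id = urllib.parse.quote(doc_id, safe="")
    url = f"{base_url}/{index}/{doc_type}/{safe_id}?pipeline={pipeline}"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def index_folder(folder, base_url="http://localhost:9200",
                 index="my_index", doc_type="my_type"):
    """Send every regular file in `folder` to Elasticsearch, one document per file."""
    # base_url, index and doc_type are assumed values; adjust them to your cluster.
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        request = build_request(base_url, index, doc_type, path.name,
                                build_document(path))
        with urllib.request.urlopen(request) as response:
            print(path.name, response.status)
```

Calling index_folder("/path/to/files") would then send each file through the attachment pipeline, with text extraction happening on the Elasticsearch side. Each file is read fully into memory and grows by about a third when Base64-encoded, so very large files may need a different approach.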

You can find more usage information at https://www.elastic.co/guide/en/elasticsearch/plugins/5.3/using-ingest-attachment.html


 