Extract text from files in multiple formats and store it in Elasticsearch

Question

I need to extract the content of all the files stored in a folder. The files can be in various formats: PDF, Word documents, TXT, MSG, PPT, etc. I then need to store the content in Elasticsearch. The solution needs to be built as a pipeline architecture. I am planning to extract the content using Apache Tika and then store it in Elasticsearch. Is there a better approach to implement this solution?

Answer 1

You should investigate the ingest attachment plugin, which bundles Apache Tika and does exactly what you need, i.e. extracting content from PDF, DOC, PPT, etc.

Simply install it:

bin/elasticsearch-plugin install ingest-attachment

Then you can create a new pipeline:

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}
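
If you would rather create the pipeline from a script than from the console, the same request can be built in Python. This is only a sketch using the standard library: the node address http://localhost:9200 and the helper name pipeline_request are assumptions of mine, not something from the plugin documentation, and it assumes a node without authentication.

```python
import json
import urllib.request

# Assumption: a local Elasticsearch node without authentication.
ES_URL = "http://localhost:9200"


def pipeline_request(name: str = "attachment", field: str = "data") -> urllib.request.Request:
    """Build (but do not send) the PUT request that defines the ingest pipeline."""
    definition = {
        "description": "Extract attachment information",
        "processors": [
            # Same processor definition as in the console request above.
            {"attachment": {"field": field}}
        ],
    }
    return urllib.request.Request(
        url=f"{ES_URL}/_ingest/pipeline/{name}",
        data=json.dumps(definition).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

Calling urllib.request.urlopen(pipeline_request()) would send it to the node.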

Finally, you can index your documents like this (the data field must contain the file content encoded in base64):

PUT my_index/my_type/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
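
Since your files sit in a folder, each one has to be read and base64-encoded before it goes into the data field. Here is a rough sketch of that step in Python, standard library only. The function names are hypothetical, using the file name as the document id is just one possible choice, and my_index / my_type / attachment are simply the names from the request above.

```python
import base64
import json
import sys
from pathlib import Path
from urllib.parse import quote


def encode_file(path: Path) -> str:
    """Return the raw bytes of a file as a base64 string."""
    return base64.b64encode(path.read_bytes()).decode("ascii")


def build_index_requests(folder: Path, index: str = "my_index",
                         doc_type: str = "my_type", pipeline: str = "attachment"):
    """Yield (request_path, json_body) for every regular file in the folder."""
    for path in sorted(folder.iterdir()):
        if not path.is_file():
            continue
        # Assumption: the URL-encoded file name serves as the document id.
        doc_id = quote(path.name, safe="")
        request_path = f"{index}/{doc_type}/{doc_id}?pipeline={pipeline}"
        body = json.dumps({"data": encode_file(path)})
        yield request_path, body


if __name__ == "__main__" and len(sys.argv) > 1:
    # Print the request path and body size for each file in the given folder.
    for request_path, body in build_index_requests(Path(sys.argv[1])):
        print("PUT", request_path, f"({len(body)} bytes)")
```

Each pair can then be sent as a PUT request with whatever HTTP client you already use. Large files are read fully into memory here, so you may want a size limit in a real pipeline.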

You can find more usage information at https://www.elastic.co/guide/en/elasticsearch/plugins/5.3/using-ingest-attachment.html

solution1
1 2017-04-04 04:07:50