I have a requirement to extract the content of all the files stored in a folder, which can be in formats such as PDF, Word, TXT, MSG, PPT, etc. I then need to store the content in Elasticsearch. The solution needs to be built as a pipeline architecture. I am planning to extract the content using Apache Tika and then store it in Elasticsearch. Is there a better approach to implement this solution?
You should investigate the ingest attachment plugin, which bundles Apache Tika and does exactly what you need, i.e. extracting content from PDF, DOC, PPT, etc.
Simply install it:
bin/elasticsearch-plugin install ingest-attachment
Then you can create a new pipeline:
PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}
Finally you can index your documents like this, where the data field holds the Base64-encoded content of the file:
PUT my_index/my_type/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
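Since the question is about a whole folder of files, here is a minimal Python sketch of how the request bodies could be produced for each file. It only uses the standard library and builds the requests without sending them. The helper names (build_attachment_doc, build_index_request, iter_folder_requests), the http://localhost:9200 address, and the choice of the file name as document id are assumptions made for illustration; the index, type, and pipeline names are taken from the example above.

```python
import base64
import json
from pathlib import Path
from urllib import request
from urllib.parse import quote


def build_attachment_doc(path):
    """Return the request body for one file: its raw bytes, Base64-encoded, under "data"."""
    raw = Path(path).read_bytes()
    return {"data": base64.b64encode(raw).decode("ascii")}


def build_index_request(base_url, index, doc_type, doc_id, doc, pipeline="attachment"):
    """Build (but do not send) the PUT request that indexes `doc` through the pipeline."""
    url = "{}/{}/{}/{}?pipeline={}".format(
        base_url.rstrip("/"),
        index,
        doc_type,
        quote(doc_id, safe=""),  # escape characters that are not URL-safe
        quote(pipeline, safe=""),
    )
    body = json.dumps(doc).encode("utf-8")
    return request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def iter_folder_requests(folder, base_url="http://localhost:9200",
                         index="my_index", doc_type="my_type"):
    """Yield one index request per regular file in `folder` (assumed layout: flat folder)."""
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            # Assumption: the file name is used as the document id.
            yield build_index_request(
                base_url, index, doc_type, path.name, build_attachment_doc(path)
            )
```

To actually index the files, each request can be passed to urllib.request.urlopen, for example: for req in iter_folder_requests("docs"): request.urlopen(req). Note that the my_index/my_type/my_id URL shape matches the 5.x example above; newer Elasticsearch versions without mapping types use my_index/_doc/my_id instead.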
You can find more usage information at https://www.elastic.co/guide/en/elasticsearch/plugins/5.3/using-ingest-attachment.html