简体   繁体   中英

How to index a pdf file using Elasticsearch ingest-attachment plugin?

I have to implement a full-text based search in a pdf document using Elasticsearch ingest plugin. I'm getting an empty hit array when I'm trying to search the word someword in the pdf document.

//Code for creating pipeline

PUT _ingest/pipeline/attachment
{
    "description" : "Extract attachment information",
    "processors" : [
      {
        "attachment" : {
        "field" : "data",
        "indexed_chars" : -1
        }
      }
    ]
}

//Code for creating the index

PUT my_index/my_type/my_id?pipeline=attachment
{
   "filename" : "C:\\Users\\myname\\Desktop\\bh1.pdf",
   "title" : "Quick",
   "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="

}

//Code for searching the word in pdf 

GET /my_index/my_type/_search
{
    "query": {
    "match": {
      "data" : {
        "query" : "someword"
    }
 }
}

When you index your document with the second command by passing the Base64 encoded content, the document then looks like this:

        {
           "filename": "C:\\Users\\myname\\Desktop\\bh1.pdf",
           "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
           "attachment": {
              "content_type": "application/rtf",
              "language": "ro",
              "content": "Lorem ipsum dolor sit amet",
              "content_length": 28
           },
           "title": "Quick"
        }

So your query needs to look into the attachment.content field and not the data one (which only serves the purpose of sending the raw content during indexing)

Modify your query to this and it will work:

POST /my_index/my_type/_search
{
   "query": {
      "match": {
         "attachment.content": {         <---- change this
            "query": "lorem"
         }
      }
   }
}

PS: Use POST instead of GET when sending a payload

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM