How to index a PDF file using the Elasticsearch ingest-attachment plugin?

Question

I have to implement full-text search over a PDF document using the Elasticsearch ingest-attachment plugin. When I search for the word someword in the PDF document, I get an empty hits array.

//Code for creating pipeline

PUT _ingest/pipeline/attachment
{
    "description" : "Extract attachment information",
    "processors" : [
      {
        "attachment" : {
        "field" : "data",
        "indexed_chars" : -1
        }
      }
    ]
}

//Code for indexing the document

PUT my_index/my_type/my_id?pipeline=attachment
{
   "filename" : "C:\\Users\\myname\\Desktop\\bh1.pdf",
   "title" : "Quick",
   "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="

}

//Code for searching the word in pdf 

GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "data" : {
                "query" : "someword"
            }
        }
    }
}

Answer 1

When you index your document with the second command, passing the Base64-encoded content, the stored document looks like this:

        {
           "filename": "C:\\Users\\myname\\Desktop\\bh1.pdf",
           "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
           "attachment": {
              "content_type": "application/rtf",
              "language": "ro",
              "content": "Lorem ipsum dolor sit amet",
              "content_length": 28
           },
           "title": "Quick"
        }

So your query needs to search the attachment.content field, not the data field. The data field only carries the raw Base64 content during indexing; the text extracted from it is stored in attachment.content.
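A quick way to see why a match on data finds nothing is to decode the payload locally. The sketch below is an illustration added here, not part of the original answer; it uses only Python's standard base64 module and the string from the question. The decoded bytes turn out to be RTF markup, which agrees with the content_type of application/rtf in the document above, so this sample payload is not the contents of bh1.pdf.

```python
import base64

# The Base64 string stored in the "data" field of the example document.
data = "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="

# Decoding gives the raw file bytes that the attachment processor parses.
raw = base64.b64decode(data)
print(raw)

# The search term only exists in the decoded text, not in the Base64 string.
print("lorem" in data.lower())   # False
print(b"lorem" in raw.lower())   # True
```

The Base64 string is opaque to a match query, so only the extracted text in attachment.content is useful for full-text search.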

Change your query to target attachment.content and it will work:

POST /my_index/my_type/_search
{
   "query": {
      "match": {
         "attachment.content": {
            "query": "lorem"
         }
      }
   }
}
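The same query can be sent from a script. The following is a minimal sketch using Python's standard urllib; the base URL http://localhost:9200 and the helper name build_search_request are assumptions for illustration and do not come from the original post. The request is only built here, not sent.

```python
import json
import urllib.request

# Assumption: a local node on the default port; adjust for your cluster.
ES_URL = "http://localhost:9200"


def build_search_request(term):
    """Build (but do not send) a POST search on attachment.content."""
    body = {"query": {"match": {"attachment.content": {"query": term}}}}
    return urllib.request.Request(
        url=ES_URL + "/my_index/my_type/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request("lorem")
print(request.get_method(), request.full_url)

# To run it against a live cluster:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["hits"]["hits"])
```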

PS: Use POST instead of GET when sending a payload.
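The question's filename points at a PDF, but the attachment processor only sees whatever bytes are Base64-encoded into data. Below is a hedged sketch of how that field could be produced from a file on disk. The helper name build_attachment_doc is hypothetical, and the temporary file holds placeholder bytes rather than a valid PDF, so it only demonstrates the encoding step.

```python
import base64
import json
import tempfile
from pathlib import Path


def build_attachment_doc(path, title):
    """Return the JSON body for indexing a file through the attachment pipeline."""
    raw = Path(path).read_bytes()  # read the file as bytes, not text
    return {
        "filename": str(path),
        "title": title,
        # The attachment processor expects the file content Base64-encoded.
        "data": base64.b64encode(raw).decode("ascii"),
    }


# Demo with a stand-in file; point the function at a real PDF in practice.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.pdf"
    sample.write_bytes(b"%PDF-1.4 placeholder bytes")  # not a valid PDF
    doc = build_attachment_doc(sample, "Quick")
    print(json.dumps(doc, indent=2))
```

The resulting dictionary has the same shape as the body of the PUT my_index/my_type/my_id?pipeline=attachment request in the question.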

Solution 1
2, accepted, 2017-02-11 14:38:29
