繁体   English   中英

如何使用Elasticsearch ingest-attachment插件索引pdf文件?

[英]How to index a pdf file using Elasticsearch ingest-attachment plugin?

我必须使用Elasticsearch ingest插件在pdf文档中实现基于全文的搜索。 当我试图在pdf文档中搜索单词someword时,我得到一个空的命中数组。

//Code for creating pipeline

PUT _ingest/pipeline/attachment
{
    "description" : "Extract attachment information",
    "processors" : [
      {
        "attachment" : {
        "field" : "data",
        "indexed_chars" : -1
        }
      }
    ]
}

//Code for creating the index

PUT my_index/my_type/my_id?pipeline=attachment
{
   "filename" : "C:\\Users\\myname\\Desktop\\bh1.pdf",
   "title" : "Quick",
   "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="

}

//Code for searching the word in pdf 

GET /my_index/my_type/_search
{
    "query": {
    "match": {
      "data" : {
        "query" : "someword"
    }
 }
}

通过传递Base64编码内容使用第二个命令索引文档时,文档如下所示:

        {
           "filename": "C:\\Users\\myname\\Desktop\\bh1.pdf",
           "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
           "attachment": {
              "content_type": "application/rtf",
              "language": "ro",
              "content": "Lorem ipsum dolor sit amet",
              "content_length": 28
           },
           "title": "Quick"
        }

因此,您的查询需要查看attachment.content字段而不是data字段(仅用于在索引期间发送原始内容的目的)

修改您的查询,它将工作:

POST /my_index/my_type/_search
{
   "query": {
      "match": {
         "attachment.content": {         <---- change this
            "query": "lorem"
         }
      }
   }
}

PS:发送有效载荷时使用POST而不是GET

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM