How to index a PDF file using the Elasticsearch ingest-attachment plugin?

I have to implement full-text search over a PDF document using the Elasticsearch ingest-attachment plugin. When I search for the word someword in the PDF document, I get an empty hits array.

//Code for creating pipeline

PUT _ingest/pipeline/attachment
{
    "description" : "Extract attachment information",
    "processors" : [
      {
        "attachment" : {
        "field" : "data",
        "indexed_chars" : -1
        }
      }
    ]
}

//Code for indexing the document through the pipeline

PUT my_index/my_type/my_id?pipeline=attachment
{
   "filename" : "C:\\Users\\myname\\Desktop\\bh1.pdf",
   "title" : "Quick",
   "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="

}
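
The `data` field in the request above has to carry the file's bytes as Base64 text. A minimal Python sketch of producing such a body follows; the helper name `build_attachment_document` and the sample bytes are illustrative assumptions, not part of the original question.

```python
import base64
import json


def build_attachment_document(filename: str, title: str, raw: bytes) -> dict:
    """Build an index request body whose `data` field is Base64 text.

    `build_attachment_document` is a hypothetical helper for illustration.
    """
    return {
        "filename": filename,
        "title": title,
        # The attachment processor reads Base64 text from this field.
        "data": base64.b64encode(raw).decode("ascii"),
    }


# Placeholder bytes; in practice these would come from reading the PDF
# in binary mode, e.g. open(path, "rb").read().
sample_bytes = b"%PDF-1.4 placeholder"
document = build_attachment_document("bh1.pdf", "Quick", sample_bytes)
print(json.dumps(document, indent=2))
```

The printed JSON can then be used as the body of the PUT request shown above.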

//Code for searching the word in pdf

GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "data": {
                "query": "someword"
            }
        }
    }
}

When you index your document with the second command, passing the Base64-encoded content, the stored document looks like this:

        {
           "filename": "C:\\Users\\myname\\Desktop\\bh1.pdf",
           "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
           "attachment": {
              "content_type": "application/rtf",
              "language": "ro",
              "content": "Lorem ipsum dolor sit amet",
              "content_length": 28
           },
           "title": "Quick"
        }

So your query needs to look into the attachment.content field, not the data field (which only serves to carry the raw content during indexing).
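
To see where the extracted text comes from, the `data` value from the question can be decoded locally. The Python sketch below is only a sanity check using the standard library. It suggests that the sample payload is a short RTF snippet rather than the PDF named in `filename`, which matches the `application/rtf` content type above and would also explain why someword is not found.

```python
import base64

# The Base64 string copied verbatim from the `data` field in the question.
DATA = "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="

decoded = base64.b64decode(DATA)

# Print the raw bytes so the RTF header and the text are both visible.
print(decoded)
print("contains 'Lorem ipsum':", b"Lorem ipsum" in decoded)
print("contains 'someword':", b"someword" in decoded)
```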

Modify your query as follows and it will work:

POST /my_index/my_type/_search
{
   "query": {
      "match": {
         "attachment.content": {         <---- change this
            "query": "lorem"
         }
      }
   }
}

PS: Use POST instead of GET when sending a payload.
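
For readers calling Elasticsearch from code rather than a console, the corrected query might be sent as a POST like this. It is a minimal Python sketch using only the standard library; the helper name `build_search_request` and the address `http://localhost:9200` are assumptions for illustration, and the call that actually contacts the node is left commented out.

```python
import json
import urllib.request


def build_search_request(base_url: str, search_path: str, term: str) -> urllib.request.Request:
    """Build a POST request that matches `term` against attachment.content.

    `build_search_request` is a hypothetical helper for illustration.
    """
    body = {"query": {"match": {"attachment.content": {"query": term}}}}
    return urllib.request.Request(
        url=f"{base_url}/{search_path}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",  # send the payload with POST, as the answer suggests
    )


# Assumed local node address; adjust to your cluster.
request = build_search_request("http://localhost:9200", "my_index/my_type/_search", "lorem")
print(request.get_method(), request.full_url)

# To run the search against a live node:
# with urllib.request.urlopen(request) as response:
#     hits = json.load(response)["hits"]["hits"]
```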

