How to index a PDF file using the Elasticsearch ingest-attachment plugin?
I have to implement full-text search in a PDF document using the Elasticsearch ingest-attachment plugin. I'm getting an empty hits array when I search for the word someword in the PDF document.
//Code for creating pipeline
PUT _ingest/pipeline/attachment
{
"description" : "Extract attachment information",
"processors" : [
{
"attachment" : {
"field" : "data",
"indexed_chars" : -1
}
}
]
}
//Code for indexing the document through the pipeline
PUT my_index/my_type/my_id?pipeline=attachment
{
"filename" : "C:\\Users\\myname\\Desktop\\bh1.pdf",
"title" : "Quick",
"data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
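For reference, the data value in a request like this is the Base64 encoding of the file's bytes; with the pipeline configured as above, the attachment processor reads only that field, and filename is plain metadata. A minimal Python sketch of preparing such a body follows. The helper name build_document and the placeholder bytes are assumptions for illustration, not part of any Elasticsearch client.

```python
import base64
import json


def build_document(filename: str, title: str, raw: bytes) -> dict:
    """Build the JSON body for the indexing request (hypothetical helper)."""
    return {
        # Metadata only: the pipeline above does not open this path.
        "filename": filename,
        "title": title,
        # The attachment processor expects the file bytes as Base64 text.
        "data": base64.b64encode(raw).decode("ascii"),
    }


if __name__ == "__main__":
    # Stand-in bytes; for a real PDF, read them with open(path, "rb").read().
    sample = b"%PDF-1.4 placeholder bytes"
    print(json.dumps(build_document("bh1.pdf", "Quick", sample), indent=2))
```

The resulting dictionary can be serialized and sent as the body of the PUT request with ?pipeline=attachment.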
//Code for searching the word in pdf
GET /my_index/my_type/_search
{
"query": {
"match": {
"data" : {
"query" : "someword"
}
}
}
}
When you index your document with the second command by passing the Base64-encoded content, the stored document looks like this:
{
"filename": "C:\\Users\\myname\\Desktop\\bh1.pdf",
"data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
"attachment": {
"content_type": "application/rtf",
"language": "ro",
"content": "Lorem ipsum dolor sit amet",
"content_length": 28
},
"title": "Quick"
}
So your query needs to look into the attachment.content field and not the data field (which only serves to send the raw content during indexing).
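To see why the data field is a poor search target, it can help to compare the encoded payload with what it decodes to. The sketch below uses a plain substring check, which is only an illustration and not how Elasticsearch analyzes text; the function names are hypothetical.

```python
import base64

# Base64 payload copied from the indexing request above.
DATA = "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="


def word_in_raw_field(word: str) -> bool:
    """Check whether the word appears literally in the Base64 text."""
    return word.lower() in DATA.lower()


def word_in_decoded_bytes(word: str) -> bool:
    """Check whether the word appears in the decoded file bytes."""
    return word.lower().encode("ascii") in base64.b64decode(DATA).lower()


if __name__ == "__main__":
    print(word_in_raw_field("lorem"))      # the encoded text hides the word
    print(word_in_decoded_bytes("lorem"))  # the decoded bytes contain it
```

The extracted text that the pipeline stores in attachment.content corresponds to the decoded side of this comparison, which is why the match query has to target that field.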
Modify your query to target attachment.content, as follows, and it will work:
POST /my_index/my_type/_search
{
"query": {
"match": {
"attachment.content": {
"query": "lorem"
}
}
}
}
PS: Use POST instead of GET when sending a payload.
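The corrected search can also be issued from a script. The following is a minimal sketch using only the Python standard library; it builds the POST request without sending it. The base URL http://localhost:9200 and the helper name build_search_request are assumptions, so adjust them to your cluster.

```python
import json
import urllib.request


def build_search_request(
    base_url: str, index: str, doc_type: str, term: str
) -> urllib.request.Request:
    """Build (but do not send) a POST search on attachment.content."""
    body = {"query": {"match": {"attachment.content": {"query": term}}}}
    return urllib.request.Request(
        url=f"{base_url}/{index}/{doc_type}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",  # a payload is attached, so POST rather than GET
    )


if __name__ == "__main__":
    # Assumed local node address; nothing is sent over the network here.
    req = build_search_request("http://localhost:9200", "my_index", "my_type", "lorem")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

To actually run the search, the request object can be passed to urllib.request.urlopen and the JSON response parsed with json.load.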