How to index a PDF file in Elasticsearch 5.0.0 with the ingest-attachment plugin?
I have to implement full-text search in PDF documents using the Elasticsearch ingest plugin. When I try to search for the word someword in a PDF document, I get an empty hits array.
//Code for creating pipeline
PUT _ingest/pipeline/attachment
{
"description" : "Extract attachment information",
"processors" : [
{
"attachment" : {
"field" : "data",
"indexed_chars" : -1
}
}
]
}
//Code for indexing the document through the pipeline
PUT my_index/my_type/my_id?pipeline=attachment
{
"filename" : "C:\\Users\\myname\\Desktop\\bh1.pdf",
"title" : "Quick",
"data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
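For context, the `data` field has to carry the file's bytes as a Base64 string. The following is a minimal Python sketch of how that value might be produced. The helper names `encode_for_attachment` and `build_document` are hypothetical and not part of Elasticsearch; only the field names come from the request above.

```python
import base64
from pathlib import Path


def encode_for_attachment(path):
    """Return the file's bytes as a Base64 ASCII string for the `data` field."""
    raw = Path(path).read_bytes()
    return base64.b64encode(raw).decode("ascii")


def build_document(path, title):
    """Build a JSON-serializable body for PUT my_index/my_type/my_id?pipeline=attachment."""
    return {
        "filename": str(path),   # stored as plain metadata, not read by the pipeline
        "title": title,
        "data": encode_for_attachment(path),  # the field the attachment processor reads
    }
```

Note that the `filename` value is only a stored string. The attachment processor works on whatever bytes are encoded in `data`.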
//Code for searching the word in pdf
GET /my_index/my_type/_search
{
"query": {
"match": {
"data" : {
"query" : "someword"
}
}
}
}
When you index the document with the second command, passing the Base64-encoded content, the stored document looks like this:
{
"filename": "C:\\Users\\myname\\Desktop\\bh1.pdf",
"data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
"attachment": {
"content_type": "application/rtf",
"language": "ro",
"content": "Lorem ipsum dolor sit amet",
"content_length": 28
},
"title": "Quick"
}
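The extracted values can be cross-checked by decoding the sample `data` string locally. This short Python sketch uses only the standard library and the payload copied from the request above. It suggests why `content_type` is reported as `application/rtf` and why the word someword cannot match this particular document.

```python
import base64

# Sample `data` value copied verbatim from the indexing request in this thread.
DATA = "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="

decoded = base64.b64decode(DATA).decode("ascii")
print(repr(decoded))

is_rtf = decoded.startswith("{\\rtf1")          # RTF header, not a PDF header
has_lorem = "Lorem ipsum dolor sit amet" in decoded
has_someword = "someword" in decoded
print(is_rtf, has_lorem, has_someword)  # True True False
```

The readable text ends up in `attachment.content` after the pipeline runs, while `data` keeps the raw Base64 string.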
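So your query needs to look at the attachment.content field rather than the data field (which only serves to send the raw content at indexing time).
Modify your query as follows and it will work: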
POST /my_index/my_type/_search
{
"query": {
"match": {
"attachment.content": { <---- change this
"query": "lorem"
}
}
}
}
PS: Use POST instead of GET when sending a payload.
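If you send the search from code rather than from the console, the same request might look like the Python sketch below. It uses only the standard library. The base URL `http://localhost:9200` and the helper name `build_search_request` are assumptions for illustration; the index, type, and field names come from the thread.

```python
import json
import urllib.request


def build_search_request(base_url, term):
    """Build a POST search request that matches on attachment.content."""
    body = {"query": {"match": {"attachment.content": {"query": term}}}}
    return urllib.request.Request(
        url=f"{base_url}/my_index/my_type/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",  # a request with a payload is sent as POST, per the PS above
    )


# Assumed local node address; adjust for your cluster.
req = build_search_request("http://localhost:9200", "lorem")
print(req.get_method(), req.full_url)

# Sending it requires a running cluster, so it is left commented out:
# with urllib.request.urlopen(req) as resp:
#     hits = json.load(resp)["hits"]["hits"]
```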