Elasticsearch Using the Attachment Processor in a Pipeline doesn't remove images from files
I am using the Attachment Processor in a pipeline defined as:
PUT _ingest/pipeline/attachment
{
"description": "Extract attachment information",
"processors": [
{
"foreach": {
"field": "attachments",
"processor": {
"attachment": {
"field": "_ingest._value.data",
"target_field": "_ingest._value.attachment",
"properties": [ "content" ]
}
}
}
},
{
"foreach": {
"field": "attachments",
"processor" : {
"remove" : { "field" : "_ingest._value.data" }
}
}
}
]
}
EXPECTED:
Given a set of attachments of different file types, such as DOC, DOCX, or PDF, the files should be processed (by Tika) into raw text, with table layouts, font types, font colors and IMAGES removed.
But it looks like the images are still there after ingestion. I can see some very long base64 strings in the content field, like
lWQk9SdzBLR2dvQUFBQU5TVWhFVWdBQUFQb0FBQUQ2Q0FZQUFBQ0k3Rm85QUFBZ0FFbEVRVlI0WGx5OUI1TWNXWEtsNjVGYWxBWlFBQnBvTVQz...
which, I believe, correspond to images in the files.
Any suggestions for getting rid of the images?
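One way to check the guess that these strings are image data is to decode the visible fragment. The sketch below is only a diagnostic under assumptions: the helper name `decode_misaligned` is made up for illustration, and because the fragment is truncated and may start in the middle of a 4-character base64 group, it tries each alignment and keeps the first one whose decoded bytes are themselves base64 characters.

```python
import base64
import string
from typing import Tuple

# Bytes of the standard base64 alphabet (padding excluded).
B64_ALPHABET = set((string.ascii_letters + string.digits + "+/").encode("ascii"))

# Visible prefix of the string quoted in the question; the rest is elided there.
SAMPLE = (
    "lWQk9SdzBLR2dvQUFBQU5TVWhFVWdBQUFQb0FBQUQ2Q0FZQUFBQ0k3Rm85QUFBZ0FF"
    "bEVRVlI0WGx5OUI1TWNXWEtsNjVGYWxBWlFBQnBvTVQz"
)


def decode_misaligned(fragment: str) -> Tuple[int, bytes]:
    """Hypothetical helper: return (offset, decoded bytes) for the first
    alignment whose decoded bytes are all base64 characters."""
    for offset in range(4):
        chunk = fragment[offset:]
        chunk = chunk[: len(chunk) // 4 * 4]  # base64 decodes in groups of 4
        decoded = base64.b64decode(chunk)
        if decoded and all(b in B64_ALPHABET for b in decoded):
            return offset, decoded
    raise ValueError("no alignment decodes to base64 text")


offset, inner = decode_misaligned(SAMPLE)
print(offset, inner[:24])
```

With the fragment above, the decoded text begins with BORw0KGgo, which is the tail of iVBORw0KGgo, the way the 8-byte PNG signature appears when base64-encoded. That is consistent with the strings being image data, and it may mean the images were already present as base64 text inside the source files, although a single truncated fragment is not enough to be sure.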
Could not replicate. I installed the plugin via
bin/elasticsearch-plugin install ingest-attachment
and exported a simple document to PDF and DOCX.
The PDF converted into base64 is in this gist, and the DOCX here.
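For reference, the base64 conversion and the request body used in the next step can be produced with a few lines of Python. This is a minimal sketch, not part of the original answer: the function name `build_attachments_body` and the placeholder byte strings are assumptions, and in practice the bytes would come from the exported files (for example via `pathlib.Path(...).read_bytes()`).

```python
import base64
import json


def build_attachments_body(blobs):
    """Hypothetical helper: wrap raw file bytes in the document shape that
    the `attachment` pipeline reads (a list of objects with a `data` field)."""
    return {
        "attachments": [
            {"data": base64.b64encode(blob).decode("ascii")} for blob in blobs
        ]
    }


# Placeholder bytes standing in for the real PDF and DOCX contents.
blobs = [b"%PDF-1.4 placeholder", b"PK placeholder"]
body = build_attachments_body(blobs)
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted as the body of the indexing request, with the index name and pipeline name taken from the request itself.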
Running
PUT my-index-000001/_doc/1?pipeline=attachment
{
"attachments": [
{
"data": "pdf-base64-txt..."
},
{
"data": "docx-base64-txt..."
}
]
}
correctly removed the images and left only the text.
As such,
GET my-index-000001/_search
resulted in
{
"_index" : "my-index-000001",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"attachments" : [
{
"attachment" : {
"content" : "Some text"
}
},
{
"attachment" : {
"content" : "Some text"
}
}
]
}
}
In the end I went with something like this, removing the long strings as suggested by Joe Sorocin:
{
"description": "Extract attachment information",
"processors": [
{
"foreach": {
"field": "attachments",
"processor": {
"attachment": {
"field": "_ingest._value.data",
"target_field": "_ingest._value.attachment",
"properties": [ "content" ]
}
}
}
},
{
"foreach": {
"field": "attachments",
"processor" : {
"remove" : { "field" : "_ingest._value.data" }
}
}
},
{
"foreach": {
"field": "attachments",
"processor": {
"gsub": {
"field": "_ingest._value.attachment.content",
"pattern": "[A-Za-z0-9+/]{100,}",
"replacement": ""
}
}
}
}
]
}
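The effect of the gsub step can be tried outside Elasticsearch first. The sketch below is an approximation using Python's `re` module rather than the Java regex engine that the gsub processor uses; for a simple character class and quantifier like this one the two behave the same. The function name `strip_long_runs` and the sample text are made up for illustration. Note that inside a character class `|` is a literal pipe rather than alternation, so the class here lists only the base64 characters.

```python
import re

# Runs of 100 or more base64 characters (letters, digits, '+', '/').
LONG_BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{100,}")


def strip_long_runs(content: str) -> str:
    """Hypothetical helper: delete every long base64-looking run."""
    return LONG_BASE64_RUN.sub("", content)


# 120 base64 characters surrounded by ordinary text.
sample = "Some text " + "QUFB" * 30 + " more text"
print(repr(strip_long_runs(sample)))
```

Two limits of this approach are worth keeping in mind: any trailing `=` padding is left behind, and any legitimate token of 100 or more such characters (a very long URL path, for instance) would be removed as well.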