How to consume a document in an attachment pipeline using elasticsearch-ruby?
Elasticsearch: using the Attachment Processor in a pipeline doesn't remove images from files
I am using the attachment processor in a pipeline defined as:
PUT _ingest/pipeline/attachment
{
  "description": "Extract attachment information",
  "processors": [
    {
      "foreach": {
        "field": "attachments",
        "processor": {
          "attachment": {
            "field": "_ingest._value.data",
            "target_field": "_ingest._value.attachment",
            "properties": [ "content" ]
          }
        }
      }
    },
    {
      "foreach": {
        "field": "attachments",
        "processor": {
          "remove": { "field": "_ingest._value.data" }
        }
      }
    }
  ]
}
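Since the question is about elasticsearch-ruby, here is a minimal sketch of the same pipeline body as a plain Ruby hash; with the elasticsearch gem and a running cluster (both assumptions, not shown here) you would pass it to `client.ingest.put_pipeline`:

```ruby
# Sketch: the attachment pipeline above, expressed as a Ruby hash.
# With the elasticsearch gem installed you would register it with:
#   client.ingest.put_pipeline(id: 'attachment', body: PIPELINE_BODY)
PIPELINE_BODY = {
  description: 'Extract attachment information',
  processors: [
    {
      foreach: {
        field: 'attachments',
        processor: {
          attachment: {
            field: '_ingest._value.data',
            target_field: '_ingest._value.attachment',
            properties: ['content']
          }
        }
      }
    },
    {
      foreach: {
        field: 'attachments',
        # drop the raw base64 payload once the text has been extracted
        processor: { remove: { field: '_ingest._value.data' } }
      }
    }
  ]
}.freeze
```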
Expected:
Given an array of attachments with different file types, e.g. DOC, DOCX or PDF, the files are processed (by Tika) to extract the raw text, with table layout, font types, font colors and images removed.
But it looks like the images survive ingestion: I can see some very long base64 strings in the content field, such as
lWQk9SdzBLR2dvQUFBQU5TVWhFVWdBQUFQb0FBQUQ2Q0FZQUFBQ0k3Rm85QUFBZ0FFbEVRVlI0WGx5OUI1TWNXWEtsNjVGYWxBWlFBQnBvTVQz...
which I believe relate to the images in the files.
Any suggestions for getting rid of the images?
Can't reproduce. I installed the plugin via bin/elasticsearch-plugin install ingest-attachment and exported a simple document to both PDF and DOCX:
the PDF converted to base64 is in this gist, the DOCX here.
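In Ruby, the base64 payloads for such a request can be produced by reading each file in binary mode and strict-encoding it (the helper name and file paths below are placeholders, not part of any API):

```ruby
require 'base64'

# Build the document body for the attachment pipeline from a list of
# file paths: each file becomes one { "data" => <base64> } entry.
def attachment_payload(*paths)
  {
    'attachments' => paths.map do |path|
      # binread avoids newline/encoding mangling of binary PDF/DOCX bytes;
      # strict_encode64 emits base64 without line breaks, as ES expects.
      { 'data' => Base64.strict_encode64(File.binread(path)) }
    end
  }
end
```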
Running
PUT my-index-000001/_doc/1?pipeline=attachment
{
  "attachments": [
    {
      "data": "pdf-base64-txt..."
    },
    {
      "data": "docx-base64-txt..."
    }
  ]
}
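With the elasticsearch-ruby client, the same request is an index call with the pipeline parameter; a sketch of the request arguments (the actual `client.index` call needs the elasticsearch gem and a running cluster, so it is left commented):

```ruby
# Sketch: arguments for indexing through the attachment pipeline.
# The base64 strings are placeholders, as in the request above.
INDEX_REQUEST = {
  index: 'my-index-000001',
  id: 1,
  pipeline: 'attachment',
  body: {
    attachments: [
      { data: 'pdf-base64-txt...' },
      { data: 'docx-base64-txt...' }
    ]
  }
}.freeze

# client.index(**INDEX_REQUEST)  # assumes a configured Elasticsearch::Client
```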
correctly removed the images, leaving only the text. So a search like
GET my-index-000001/_search
returns
{
  "_index" : "my-index-000001",
  "_type" : "_doc",
  "_id" : "1",
  "_score" : 1.0,
  "_source" : {
    "attachments" : [
      {
        "attachment" : {
          "content" : "Some text"
        }
      },
      {
        "attachment" : {
          "content" : "Some text"
        }
      }
    ]
  }
}
In the end I decided to do it this way, removing the long strings as Joe Sorocin suggested:
PUT _ingest/pipeline/attachment
{
  "description": "Extract attachment information",
  "processors": [
    {
      "foreach": {
        "field": "attachments",
        "processor": {
          "attachment": {
            "field": "_ingest._value.data",
            "target_field": "_ingest._value.attachment",
            "properties": [ "content" ]
          }
        }
      }
    },
    {
      "foreach": {
        "field": "attachments",
        "processor": {
          "remove": { "field": "_ingest._value.data" }
        }
      }
    },
    {
      "foreach": {
        "field": "attachments",
        "processor": {
          "gsub": {
            "field": "_ingest._value.attachment.content",
            "pattern": "[a-z|A-Z|0-9|+|\/]{100,}",
            "replacement": ""
          }
        }
      }
    }
  ]
}
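The gsub processor's behavior can be sanity-checked locally with Ruby's String#gsub, since both use the same kind of regex. One aside: inside a character class the | characters are literal, so [a-z|A-Z|0-9|+|\/] is equivalent to the simpler [a-zA-Z0-9+/|]; for base64 content (which never contains |) the class used below behaves the same.

```ruby
# Runs of 100+ base64-ish characters are stripped; ordinary words survive
# because natural text is broken up by spaces and punctuation.
BASE64_RUN = /[a-zA-Z0-9+\/]{100,}/

def strip_long_runs(text)
  text.gsub(BASE64_RUN, '')
end
```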