Elasticsearch: using the Attachment Processor in a pipeline does not remove images from files

Question

I am using the Attachment Processor in a pipeline defined as:

PUT _ingest/pipeline/attachment
{
  "description": "Extract attachment information",
  "processors": [
    {
      "foreach": {
        "field": "attachments",
        "processor": {
          "attachment": {
            "field": "_ingest._value.data",
            "target_field": "_ingest._value.attachment",
            "properties": [ "content" ]
          }
        }
      }
    },
    {
      "foreach": {
        "field": "attachments",
        "processor": {
          "remove": { "field": "_ingest._value.data" }
        }
      }
    }
  ]
}

EXPECTED:

Given a set of attachments with different file types, such as DOC, DOCX, or PDF, the files should be processed (by Tika) to extract the raw text, with table layouts, font types, font colors, and IMAGES removed.

But it looks like the images are still there after ingestion. I can see some very long base64 strings in the content field, like

lWQk9SdzBLR2dvQUFBQU5TVWhFVWdBQUFQb0FBQUQ2Q0FZQUFBQ0k3Rm85QUFBZ0FFbEVRVlI0WGx5OUI1TWNXWEtsNjVGYWxBWlFBQnBvTVQz...

which, I believe, are related to images in the files.
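One way to check that belief, as a rough sketch: decoding the first characters of the string quoted above yields text that itself matches the well-known base64 prefix of a PNG file (`iVBORw0KGgo...`). That would suggest the images sit in the source files as base64 text rather than as binary parts, which is an interpretation and not something confirmed here. The two-character offset is an assumption, made because the quoted string appears to start mid-stream.

```python
import base64

# First 30 characters of the string quoted in the question.
sample = "lWQk9SdzBLR2dvQUFBQU5TVWhFVWdB"

# Assumption: the quoted string starts mid-stream, so skip 2 characters
# to land on a 4-character base64 boundary before decoding.
inner = base64.b64decode(sample[2:]).decode("ascii")

# Base64 of the PNG signature plus the start of the IHDR chunk.
png_header = b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"
png_prefix_b64 = base64.b64encode(png_header).decode("ascii")

print(inner)
print(png_prefix_b64)
# True if the decoded text lines up with the PNG prefix.
print(inner[:20] in png_prefix_b64)
```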

Any suggestions for getting rid of the images?

Answer 1

I could not reproduce this. I installed the plugin via bin/elasticsearch-plugin install ingest-attachment and exported a simple document to PDF and DOCX.

The PDF converted to base64 is in this gist, and the DOCX is here.

Running

PUT my-index-000001/_doc/1?pipeline=attachment
{
  "attachments": [
    {
      "data": "pdf-base64-txt..."
    },
    {
      "data": "docx-base64-txt..."
    }
  ]
}

correctly removed the images and left only the text.

Afterwards,

GET my-index-000001/_search

resulted in

{
  "_index" : "my-index-000001",
  "_type" : "_doc",
  "_id" : "1",
  "_score" : 1.0,
  "_source" : {
    "attachments" : [
      {
        "attachment" : {
          "content" : "Some text"
        }
      },
      {
        "attachment" : {
          "content" : "Some text"
        }
      }
    ]
  }
}
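For reference, the `data` values in the request above are the base64-encoded bytes of each file. A minimal sketch of building such a request body follows; the function name `build_body` and the placeholder bytes are illustrative assumptions, not part of the original answer.

```python
import base64
import json


def build_body(files):
    """Return an index request body with one base64 `data` entry per file."""
    return {
        "attachments": [
            {"data": base64.b64encode(raw).decode("ascii")} for raw in files
        ]
    }


# Placeholder bytes; in practice these would come from open(path, "rb").read().
body = build_body([b"hello"])
print(json.dumps(body, indent=2))
```

The printed JSON can then be sent as the body of the PUT request shown above.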

Answer 2

In the end I went with something like this, removing long strings as suggested by Joe Sorocin:

{
  "description": "Extract attachment information",
  "processors": [
    {
      "foreach": {
        "field": "attachments",
        "processor": {
          "attachment": {
            "field": "_ingest._value.data",
            "target_field": "_ingest._value.attachment",
            "properties": [ "content" ]
          }
        }
      }
    },
    {
      "foreach": {
        "field": "attachments",
        "processor": {
          "remove": { "field": "_ingest._value.data" }
        }
      }
    },
    {
      "foreach": {
        "field": "attachments",
        "processor": {
          "gsub": {
            "field": "_ingest._value.attachment.content",
            "pattern": "[a-zA-Z0-9+/]{100,}",
            "replacement": ""
          }
        }
      }
    }
  ]
}
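To see what the gsub step removes before running it on real documents, the same idea can be tried locally. This is only a sketch in Python, assuming that Python's `re` treats this simple pattern the same way as the Java regex engine Elasticsearch uses; the character class is written without `|` separators, since inside brackets a `|` is a literal character. The names `LONG_BASE64_RUN` and `strip_long_runs` are made up for the example.

```python
import re

# Same intent as the gsub pattern: 100+ consecutive base64-alphabet characters.
LONG_BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{100,}")


def strip_long_runs(content):
    """Remove long base64-looking runs, leaving ordinary words untouched."""
    return LONG_BASE64_RUN.sub("", content)


fake_image = "QUFB" * 50  # 200 characters, stands in for an embedded image
cleaned = strip_long_runs("Some text " + fake_image + " more text")
print(repr(cleaned))
```

Note that any trailing `=` padding is left behind, and that any unbroken token of 100 or more letters and digits would be removed as well, whether or not it is an image.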


Answer 1: 2021-04-03 14:43:43

Answer 2: 2021-04-11 17:24:06
