
Elasticsearch: using the Attachment Processor in a pipeline doesn't remove images from files

I am using the Attachment Processor in a pipeline defined as:

PUT _ingest/pipeline/attachment
{
  "description": "Extract attachment information",
  "processors": [
    {
      "foreach": {
        "field": "attachments",
        "processor": {
          "attachment": {
            "field": "_ingest._value.data",
            "target_field": "_ingest._value.attachment",
            "properties": [ "content" ]
          }
        }
      }
    },
    {
      "foreach": {
        "field": "attachments",
        "processor": {
          "remove": { "field": "_ingest._value.data" }
        }
      }
    }
  ]
}

EXPECTED:

Given a set of attachments of different file types, such as DOC, DOCX, or PDF, the files are processed (by Tika) to extract the raw text, with table layouts, font types, font colors and IMAGES removed.

But it looks like the images are still there after ingestion. I can see some very long base64 strings in the content field, like

lWQk9SdzBLR2dvQUFBQU5TVWhFVWdBQUFQb0FBQUQ2Q0FZQUFBQ0k3Rm85QUFBZ0FFbEVRVlI0WGx5OUI1TWNXWEtsNjVGYWxBWlFBQnBvTVQz...

which I believe are related to images in the files.
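One way to check that belief is to decode a piece of the string. The sketch below takes the fragment quoted above, drops its first two characters so it starts on a 4-character Base64 boundary, and decodes it. The result is itself Base64 text, so the embedded data appears to have been encoded twice. Prepending "iV" is an assumption (those characters are not visible in the truncated fragment), made because a Base64-encoded PNG normally starts that way.

```python
import base64

# Fragment copied from the content field above, minus its first two
# characters ("lW"), so that it starts on a 4-character Base64 boundary.
fragment = "Qk9SdzBLR2dvQUFBQU5TVWhFVWdBQUFQ"

# First decode: the result is printable ASCII that is itself Base64 text.
inner = base64.b64decode(fragment).decode("ascii")
print(inner)

# Assumption: the two characters lost to the misalignment were "iV",
# which is how a Base64-encoded PNG normally begins.
candidate = "iV" + inner
usable = candidate[: len(candidate) // 4 * 4]  # trim to a multiple of 4
header = base64.b64decode(usable)

# Second decode: compare against the 8-byte PNG file signature.
is_png = header.startswith(b"\x89PNG\r\n\x1a\n")
print(is_png)
```

If the second decode yields the PNG signature, the long strings are most likely inline images that were stored as Base64 text inside the original document, which a text extractor would pass through as ordinary characters.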

Any suggestions for getting rid of the images?

Could not replicate. I installed the plugin via bin/elasticsearch-plugin install ingest-attachment and exported a simple document to PDF and DOCX:


The PDF converted into base64 is in this gist, and the DOCX is here.

Running

PUT my-index-000001/_doc/1?pipeline=attachment
{
  "attachments": [
    {
      "data": "pdf-base64-txt..."
    },
    {
      "data": "docx-base64-txt..."
    }
  ]
}

correctly removed the images and left only the text.

As such,

GET my-index-000001/_search

resulted in

{
  "_index" : "my-index-000001",
  "_type" : "_doc",
  "_id" : "1",
  "_score" : 1.0,
  "_source" : {
    "attachments" : [
      {
        "attachment" : {
          "content" : "Some text"
        }
      },
      {
        "attachment" : {
          "content" : "Some text"
        }
      }
    ]
  }
}
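For anyone repeating this test, the request body can be produced programmatically. The following is a minimal sketch, not part of the original answer: `build_attachments_body` is a hypothetical helper name, and the byte strings are stand-ins for real PDF and DOCX file contents. Each file is Base64-encoded exactly once before being placed in the `data` field.

```python
import base64
import json


def build_attachments_body(files):
    """Return the JSON body for the index request, one entry per file.

    `files` is a list of raw file contents (bytes).
    """
    return {
        "attachments": [
            {"data": base64.b64encode(raw).decode("ascii")} for raw in files
        ]
    }


# Stand-in bytes; real use would read PDF/DOCX files with open(path, "rb").
body = build_attachments_body([b"%PDF-1.4 placeholder", b"PK placeholder"])
print(json.dumps(body, indent=2))
```

The printed JSON has the same shape as the body sent to my-index-000001 with ?pipeline=attachment above.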

In the end I settled on something like this, removing long strings as suggested by Joe Sorocin:

{
  "description": "Extract attachment information",
  "processors": [
    {
      "foreach": {
        "field": "attachments",
        "processor": {
          "attachment": {
            "field": "_ingest._value.data",
            "target_field": "_ingest._value.attachment",
            "properties": [ "content" ]
          }
        }
      }
    },
    {
      "foreach": {
        "field": "attachments",
        "processor": {
          "remove": { "field": "_ingest._value.data" }
        }
      }
    },
    {
      "foreach": {
        "field": "attachments",
        "processor": {
          "gsub": {
            "field": "_ingest._value.attachment.content",
            "pattern": "[a-zA-Z0-9+/]{100,}",
            "replacement": ""
          }
        }
      }
    }
  ]
}
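The effect of the gsub step can be tried outside Elasticsearch. This is a rough sketch using Python's re module, which behaves the same as Java regular expressions for a simple character class like this one. The names `LONG_BASE64_RUN` and `strip_long_runs` and the sample text are made up for illustration.

```python
import re

# Runs of 100 or more Base64 alphabet characters (letters, digits, "+", "/").
LONG_BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{100,}")


def strip_long_runs(text):
    """Delete every long Base64-looking run from the extracted text."""
    return LONG_BASE64_RUN.sub("", text)


# 120 Base64 characters embedded between ordinary words.
sample = "Some text " + "QUFB" * 30 + " more text"
cleaned = strip_long_runs(sample)
print(repr(cleaned))

# Short tokens are left alone because they are below the 100-character limit.
print(strip_long_runs("short QUFB token"))
```

Two caveats apply to this approach: trailing "=" padding is not part of the character class, so it would be left behind, and any legitimate unbroken alphanumeric run of 100 characters or more would be removed as well.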
