Elasticsearch: using the Attachment Processor in a pipeline doesn't remove images from files

I am using the Attachment Processor in a Pipeline defined as:

PUT _ingest/pipeline/attachment
{
  "description": "Extract attachment information",
  "processors": [
    {
      "foreach": {
        "field": "attachments",
        "processor": {
          "attachment": {
            "field": "_ingest._value.data",
            "target_field": "_ingest._value.attachment",
            "properties": [ "content" ]
          }
        }
      }
    },
    {
      "foreach": {
        "field": "attachments",
        "processor": {
          "remove": { "field": "_ingest._value.data" }
        }
      }
    }
  ]
}

EXPECTED:

Given a set of attachments of different file types, such as doc, docx, or pdf, the files are processed (by Tika) to extract the raw text; table layouts, font types, font colors, and IMAGES are removed.

But it looks like the images are still there after ingestion. I can see some very long base64 strings in the content field, like

lWQk9SdzBLR2dvQUFBQU5TVWhFVWdBQUFQb0FBQUQ2Q0FZQUFBQ0k3Rm85QUFBZ0FFbEVRVlI0WGx5OUI1TWNXWEtsNjVGYWxBWlFBQnBvTVQz...

which, I believe, come from images in the files.
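One way to check that belief is to decode the start of the quoted string. The sketch below is only a sanity check, not part of the pipeline. It assumes the excerpt was cut in the middle of a 4-character base64 group, so the leading `lW` is dropped, and it assumes the two characters lost that way were `iV` (which is consistent with the bits that `lW` encodes). Under those assumptions the text decodes to the usual base64 prefix of a PNG file.

```python
import base64

# Start of the fragment quoted above, without the leading "lW"
# (assumed to be the tail of a base64 group cut off by the excerpt).
fragment = "Qk9SdzBLR2dvQUFBQU5TVWhFVWdB"

# First decoding pass: the result is itself printable base64 text.
inner = base64.b64decode(fragment).decode("ascii")

# Assumption: the two characters lost to the truncation were "iV".
candidate = "iV" + inner

# Second decoding pass over the first 12 characters (9 bytes).
header = base64.b64decode(candidate[:12])

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
looks_like_png = header.startswith(PNG_SIGNATURE)

print(inner)
print(looks_like_png)
```

If this holds, the string is image data that was base64-encoded twice, which may mean the files carry the images as encoded text rather than as embedded image objects.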

Any suggestions for getting rid of the images?

Could not replicate. I installed the plugin via bin/elasticsearch-plugin install ingest-attachment and exported a simple document to PDF and DOCX.

The PDF converted into base64 is in this gist, and the DOCX is here.

Running

PUT my-index-000001/_doc/1?pipeline=attachment
{
  "attachments": [
    {
      "data": "pdf-base64-txt..."
    },
    {
      "data": "docx-base64-txt..."
    }
  ]
}

correctly removed the images and left only the text.
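The `pdf-base64-txt...` and `docx-base64-txt...` placeholders above stand for the base64-encoded bytes of each file. A minimal Python sketch of building that request body follows; `build_attachment_doc` and the sample byte strings are hypothetical stand-ins for real files, not something from the original test.

```python
import base64
import json


def build_attachment_doc(blobs):
    """Return a document body with one base64 "data" entry per file."""
    return {
        "attachments": [
            {"data": base64.b64encode(blob).decode("ascii")} for blob in blobs
        ]
    }


# Placeholder bytes; in practice these would be read from the PDF and DOCX files.
pdf_bytes = b"%PDF-1.4 placeholder"
docx_bytes = b"PK\x03\x04 placeholder"

doc = build_attachment_doc([pdf_bytes, docx_bytes])
body = json.dumps(doc, indent=2)
print(body)
```

The resulting `body` can be sent as the body of the PUT request shown above.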

As such,

GET my-index-000001/_search

resulted in

{
  "_index" : "my-index-000001",
  "_type" : "_doc",
  "_id" : "1",
  "_score" : 1.0,
  "_source" : {
    "attachments" : [
      {
        "attachment" : {
          "content" : "Some text"
        }
      },
      {
        "attachment" : {
          "content" : "Some text"
        }
      }
    ]
  }
}

In the end I settled on something like this, adding a gsub step that removes long strings, as suggested by Joe Sorocin:

{
  "description": "Extract attachment information",
  "processors": [
    {
      "foreach": {
        "field": "attachments",
        "processor": {
          "attachment": {
            "field": "_ingest._value.data",
            "target_field": "_ingest._value.attachment",
            "properties": [ "content" ]
          }
        }
      }
    },
    {
      "foreach": {
        "field": "attachments",
        "processor": {
          "remove": { "field": "_ingest._value.data" }
        }
      }
    },
    {
      "foreach": {
        "field": "attachments",
        "processor": {
          "gsub": {
            "field": "_ingest._value.attachment.content",
            "pattern": "[A-Za-z0-9+/]{100,}",
            "replacement": ""
          }
        }
      }
    }
  ]
}
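To see what the gsub step does to the extracted content, here is a rough Python equivalent. It is only an illustration: Elasticsearch applies the pattern with Java regular expressions, and the helper name `strip_long_base64` and the sample strings are made up for this sketch. The character class used is `[A-Za-z0-9+/]` (letters, digits, `+` and `/`), repeated 100 or more times.

```python
import re

# Runs of 100 or more base64-alphabet characters.
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{100,}")


def strip_long_base64(text: str) -> str:
    """Remove long base64-looking runs, leaving ordinary text untouched."""
    return BASE64_RUN.sub("", text)


blob = "QUFB" * 30  # 120 characters of base64-like data

cleaned = strip_long_base64("Some text " + blob + " more text")
short_kept = strip_long_base64("token abc123 stays")
padding_left = strip_long_base64(blob + "==")

print(repr(cleaned))
print(repr(short_kept))
print(repr(padding_left))
```

Two things to keep in mind: any trailing `=` padding is not part of the character class and stays in the content, and a genuine token of 100 or more such characters would also be removed.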
