Elasticsearch: using the Attachment Processor in a pipeline doesn't remove images from files

Question

I am using the Attachment Processor in a Pipeline defined as:

PUT _ingest/pipeline/attachment
{
  "description": "Extract attachment information",
  "processors": [
    {
      "foreach": {
        "field": "attachments",
        "processor": {
          "attachment": {
            "field": "_ingest._value.data",
            "target_field": "_ingest._value.attachment",
            "properties": [ "content" ]
          }
        }
      }
    },
    {
      "foreach": {
        "field": "attachments",
        "processor": {
          "remove": { "field": "_ingest._value.data" }
        }
      }
    }
  ]
}

EXPECTED:

Given a set of attachments of different file types, such as DOC, DOCX, or PDF, the files are processed (by Tika) to extract the raw text, so that table layouts, font types, font colors and IMAGES are removed.

But it looks like the images are still there after ingestion. I can see some very long base64 strings in the content field, like

lWQk9SdzBLR2dvQUFBQU5TVWhFVWdBQUFQb0FBQUQ2Q0FZQUFBQ0k3Rm85QUFBZ0FFbEVRVlI0WGx5OUI1TWNXWEtsNjVGYWxBWlFBQnBvTVQz...

which I believe are related to images in the files.

Any suggestions for getting rid of the images?
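One way to check whether such a string really comes from an image is to decode it and look for a known file signature. The sketch below is only an illustration and not part of the pipeline: it takes the fragment quoted above, which starts mid-stream so its 4-character alignment is unknown, tries each alignment, and keeps the ones that decode to text made up of base64-alphabet characters. The helper names are hypothetical.

```python
import base64
import string

# Fragment copied from the content field shown above (without the trailing "...").
FRAGMENT = "lWQk9SdzBLR2dvQUFBQU5TVWhFVWdBQUFQb0FBQUQ2Q0FZQUFBQ0k3Rm85QUFBZ0FFbEVRVlI0WGx5OUI1TWNXWEtsNjVGYWxBWlFBQnBvTVQz"

# Byte values that may appear in base64 text.
B64_ALPHABET = set((string.ascii_letters + string.digits + "+/=").encode())


def decode_at(text, offset):
    """Decode text[offset:], trimmed to a whole number of 4-character groups."""
    chunk = text[offset:]
    chunk = chunk[: len(chunk) - len(chunk) % 4]
    return base64.b64decode(chunk)


def plausible_decodings(text):
    """Return {offset: decoded bytes} for alignments that yield base64-like text."""
    results = {}
    for offset in range(4):
        decoded = decode_at(text, offset)
        if decoded and all(byte in B64_ALPHABET for byte in decoded):
            results[offset] = decoded
    return results


# Base64 encoding of the 8-byte PNG file signature, without padding.
PNG_PREFIX = base64.b64encode(b"\x89PNG\r\n\x1a\n").decode().rstrip("=")

decodings = plausible_decodings(FRAGMENT)
for offset, decoded in decodings.items():
    text = decoded.decode("ascii")
    # The fragment starts mid-stream, so only the tail of the prefix can appear.
    print(offset, text[:24], PNG_PREFIX[2:] in text)
```

If an alignment decodes to text that itself looks like base64 and contains the tail of the PNG prefix, that is consistent with (though not proof of) the strings being encoded image data that survived extraction as plain text.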

Answer 1

Could not replicate. I installed the plugin via bin/elasticsearch-plugin install ingest-attachment and exported a simple document to PDF and DOCX.

The PDF converted to base64 is in this gist, and the DOCX is here.

Running

PUT my-index-000001/_doc/1?pipeline=attachment
{
  "attachments": [
    {
      "data": "pdf-base64-txt..."
    },
    {
      "data": "docx-base64-txt..."
    }
  ]
}

correctly removed the images and left only the text.

As such,

GET my-index-000001/_search

resulted in

{
  "_index" : "my-index-000001",
  "_type" : "_doc",
  "_id" : "1",
  "_score" : 1.0,
  "_source" : {
    "attachments" : [
      {
        "attachment" : {
          "content" : "Some text"
        }
      },
      {
        "attachment" : {
          "content" : "Some text"
        }
      }
    ]
  }
}
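For reference, the data values in the request above are the base64-encoded bytes of each file. A minimal sketch of how such a request body might be built in Python follows; the function name is hypothetical and the byte strings are stand-ins for real file contents.

```python
import base64
import json


def build_attachment_document(files):
    """Build the request body for the pipeline from raw file contents (bytes)."""
    return {
        "attachments": [
            {"data": base64.b64encode(content).decode("ascii")} for content in files
        ]
    }


# Stand-in bytes; in practice these would come from open(path, "rb").read().
body = build_attachment_document([b"hello", b"world"])
print(json.dumps(body, indent=2))
```

The resulting JSON can then be sent as the body of the PUT my-index-000001/_doc/1?pipeline=attachment request with whatever HTTP client is at hand.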

Answer 2

In the end I settled on something like this, removing long strings as suggested by Joe Sorocin:

{
  "description": "Extract attachment information",
  "processors": [
    {
      "foreach": {
        "field": "attachments",
        "processor": {
          "attachment": {
            "field": "_ingest._value.data",
            "target_field": "_ingest._value.attachment",
            "properties": [ "content" ]
          }
        }
      }
    },
    {
      "foreach": {
        "field": "attachments",
        "processor": {
          "remove": { "field": "_ingest._value.data" }
        }
      }
    },
    {
      "foreach": {
        "field": "attachments",
        "processor": {
          "gsub": {
            "field": "_ingest._value.attachment.content",
            "pattern": "[A-Za-z0-9+/]{100,}",
            "replacement": ""
          }
        }
      }
    }
  ]
}
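To see what a pattern of this kind does before putting it in a pipeline, it can be tried on a sample string. The sketch below uses Python's re module purely for illustration; the gsub processor runs on Java's regex engine, which should treat such a simple character class the same way. Inside a character class, | is a literal pipe rather than alternation, so the class here lists only the base64 alphabet. The function name and sample text are made up.

```python
import re

# A run of 100 or more base64-alphabet characters.
LONG_BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{100,}")


def strip_long_runs(content):
    """Remove long base64-looking runs, leaving ordinary words untouched."""
    return LONG_BASE64_RUN.sub("", content)


# 120 base64-alphabet characters surrounded by ordinary text.
sample = "Some text " + "QUFB" * 30 + " more text"
cleaned = strip_long_runs(sample)
print(cleaned)

# A 40-character run is below the threshold and is left in place.
short_sample = "Some text " + "QUFB" * 10
print(strip_long_runs(short_sample) == short_sample)
```

Two caveats with this approach: any legitimate unbroken token of 100 or more such characters would also be removed, and trailing = padding is not part of the class, so it would be left behind.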

2 answers

solution1
1 2021-04-03 14:43:43

solution2
1 2021-04-11 17:24:06
