
With Elasticsearch, how can I index a document containing an array into multiple documents, one per array item?

Suppose I have a JSON document (sent from packetbeat in this case) containing some structure like this:

{
  "source": "http://some/url/",
  "items": [
    {"name":"item1", "value":1},
    {"name":"item2", "value":2}
  ]
}

How can I have Elasticsearch index these as separate documents, such that I can retrieve them like this:

GET http://elasticsearch:9200/indexname/doc/item1
{
  "source": "http://some/url/",
  "item": {
     "name":"item1", 
     "value":1
  }
}
GET http://elasticsearch:9200/indexname/doc/item2
{
  "source": "http://some/url/",
  "item": {
     "name":"item2", 
     "value":2
  }
}

Can an ingest pipeline, using Painless or some other means, achieve this? (Perhaps with reindexing?)

(The data come from Packetbeat, which is efficient for the large volumes involved, and consist of arrays of similar items, more complex than the example above. I'm not using Logstash, and would rather avoid it for simplicity, but if necessary I can add it. Obviously I could split the document with a programming language before sending it, but if possible I'd like to do this within the Elastic Stack, to minimise additional dependencies.)

According to the previous question at elasticsearch split document ingest processor, it isn't possible to split documents using Elasticsearch's ingest node.

I got splitting of documents sent by Packetbeat to work using Logstash and its split filter, with config something like the below:

input {
  beats {
    # Listen for events shipped by Packetbeat
    port => "5044"
  }
}
filter {
  split {
    # Emit one event per element of the [body][requests] array,
    # copying the remaining fields into each resulting event
    field => "[body][requests]"
    target => "[body][requests]"
  }
}
output {
  # Print events to the console for debugging
  stdout { codec => rubydebug }
}
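
To index the split events rather than just print them, the stdout output above could be swapped for Logstash's elasticsearch output plugin. A minimal sketch, in which the host, the index name and the use of the split item's name field as the document _id are assumptions for illustration:

output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]
    index => "indexname"
    # Assumed: derive the document _id from the split item's name field
    document_id => "%{[body][requests][name]}"
  }
}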

The JSON filter was also useful to parse stringified JSON:

filter {
  json {
    source => "_body"
    target => "_body"
  }
}

However, it proved quite memory-intensive to run Logstash where it wasn't otherwise needed, and it would sometimes crash with stack overflows. I opted instead to use Node.js, with Puppeteer and Chromium to harvest the data instead of Packetbeat, and handled the parsing and splitting in Node.js before sending the data directly to Elasticsearch. This works well for my use case, where the data being captured are AJAX calls from a web page, but it might not suit others.
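
A minimal sketch of that splitting step (not the full harvesting code), in TypeScript, assuming Node 18+ for the built-in fetch, the example document shape from the question, an index named indexname, and that each item's name becomes its document _id:

interface Item { name: string; value: number }
interface Harvested { source: string; items: Item[] }

// Build one bulk "index" action per array item, copying the shared fields,
// then POST the whole batch to the _bulk endpoint as newline-delimited JSON.
async function indexItems(doc: Harvested): Promise<void> {
  const lines: string[] = [];
  for (const item of doc.items) {
    lines.push(JSON.stringify({ index: { _index: "indexname", _id: item.name } }));
    lines.push(JSON.stringify({ source: doc.source, item }));
  }
  const res = await fetch("http://elasticsearch:9200/_bulk", {
    method: "POST",
    headers: { "Content-Type": "application/x-ndjson" },
    body: lines.join("\n") + "\n", // _bulk requires a trailing newline
  });
  if (!res.ok) throw new Error(`bulk indexing failed: ${res.status}`);
}

indexItems({
  source: "http://some/url/",
  items: [
    { name: "item1", value: 1 },
    { name: "item2", value: 2 },
  ],
}).catch(console.error);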
