Split/Slice large JSON using jq

Question

I would like to slice a huge JSON file (~20GB) into smaller chunks based on array size (10,000, 50,000, etc.).

Input: {"recDt":"2021-01-05","country":"US","name":"ABC","number":"9828","add":[{"evnCd":"O","rngNum":"1","state":"TX","city":"ANDERSON","postal":"77830"},{"evnCd":"O","rngNum":"2","state":"TX","city":"ANDERSON","postal":"77832"},{"evnCd":"O","rngNum":"3","state":"TX","city":"ANDERSON","postal":"77831"},{"evnCd":"O","rngNum":"4","state":"TX","city":"ANDERSON","postal":"77834"}]}

I am currently running a loop that produces the desired output by incrementing the x/y values, but performance is very slow: each iteration takes 8-20 seconds, depending on the size of the file. I am using jq 1.6. Is there an alternative way to get the result below?

Expected output for a slice size of 2 objects per array:

{"recDt":"2021-01-05","country":"US","name":"ABC","number":"9828","add":[{"rngNum":"1","state":"TX","city":"ANDERSON","postal":"77830"},{"rngNum":"2","state":"TX","city":"ANDERSON","postal":"77832"}]}
{"recDt":"2021-01-05","country":"US","name":"ABC","number":"9828","add":[{"rngNum":"3","state":"TX","city":"ANDERSON","postal":"77831"},{"rngNum":"4","state":"TX","city":"ANDERSON","postal":"77834"}]}

I tried the following two commands:

cat $inFile | jq -cn --stream 'fromstream(1|truncate_stream(inputs))' | jq --arg x $x --arg y $y -c '{recDt: .recDt, country: .country, name: .name, number: .number, add: .add[$x|tonumber:$y|tonumber]}' >> $outFile

cat $inFile | jq --arg x $x --arg y $y -c '{recDt: .recDt, country: .country, name: .name, number: .number, add: .add[$x|tonumber:$y|tonumber]}' >> $outFile

Please share any alternatives that are available.

Answer 1

In this response, which calls jq just once, I'm going to assume your computer has enough memory to read the entire JSON. I'll also assume you want to create separate files for each slice, and that you want the JSON to be pretty-printed in each file.

Assuming a chunk size of 2, and that the output files are to be named using the template part-N.json, you could write:

< input.json jq -r --argjson size 2 '
  del(.add) as $object
  | (.add|_nwise($size) | ("\t", $object + {add:.} ))
' | awk '
      /^\t/ {fn++; next}
      { print >> ("part-" fn ".json") }'

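The trick being used here is that a line of jq's pretty-printed output never begins with a tab character: jq indents with spaces by default, and a literal tab cannot appear inside a JSON string. A line consisting of a single tab can therefore serve as a separator between slices. Note that _nwise is an undocumented jq builtin.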
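If pretty-printing is not required, the separator line becomes unnecessary, because jq's -c option emits each slice on a single line. The following is a minimal sketch under that assumption, not a tested replacement for the command above. The function name slice_compact and its three arguments are hypothetical, and it uses jq's documented slice syntax in place of _nwise. Like the command above, it assumes the whole input fits in memory.

```shell
# Hypothetical helper: slice_compact INPUT SIZE PREFIX
# Writes PREFIX-1.json, PREFIX-2.json, ... with one compact object per file.
slice_compact() {
  jq -c --argjson size "$2" '
    del(.add) as $object
    | .add
    | range(0; length; $size) as $i        # start index of each chunk
    | $object + {add: .[$i:$i + $size]}    # documented slice syntax instead of _nwise
  ' "$1" |
  # One output line per slice, so NR can number the files directly.
  # close() keeps the number of simultaneously open files at one.
  awk -v prefix="$3" '{ f = prefix "-" NR ".json"; print > f; close(f) }'
}
```

For example, slice_compact input.json 10000 part would write part-1.json, part-2.json, and so on, each holding at most 10000 elements in .add.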

Answer 2

The following assumes the input JSON is too large to read into memory and therefore uses jq's --stream command-line option.
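To make the streaming form concrete, here is a small illustration of the events that --stream produces and of how truncate_stream and fromstream turn them back into array elements. The sample variable is a tiny stand-in chosen for this illustration, not the real input.

```shell
# Tiny stand-in for the real input (illustration only).
sample='{"name":"ABC","add":[{"rngNum":"1"},{"rngNum":"2"}]}'

# Each event is either [path, leaf-value] or a closing event [path].
printf '%s\n' "$sample" | jq -c --stream .
# [["name"],"ABC"]
# [["add",0,"rngNum"],"1"]
# [["add",0,"rngNum"]]
# [["add",1,"rngNum"],"2"]
# [["add",1,"rngNum"]]
# [["add",1]]
# [["add"]]

# Keep only events under .add, strip the first two path components
# ("add" and the array index), and rebuild one object per array element.
printf '%s\n' "$sample" |
  jq -nc --stream 'fromstream(2 | truncate_stream(inputs | select(.[0][0] == "add")))'
# {"rngNum":"1"}
# {"rngNum":"2"}
```

Because the elements are rebuilt one at a time, the full .add array never has to be held in memory.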

To keep things simple, I'll focus on the "slicing" of the .add array, and won't worry about the other keys, or pretty-printing, and other details, as you can easily adapt the following according to your needs:

< input.json jq -nc --stream --argjson size 2 '
  def regroup(stream; $n):
    foreach (stream, null) as $x ({a:[]};
      if $x == null then .emit = .a
      elif .a|length == $n then .emit = .a | .a = [$x]
      else .emit=null | .a += [$x] end;
      select(.emit).emit);

    regroup(fromstream( 2 | truncate_stream(inputs | select(.[0][0] == "add")) );
            $size)' |
  awk '{fn++; print > (fn ".json")}'

This writes each chunk of the .add array to its own file, with filenames of the form N.json (1.json, 2.json, and so on).
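One possible way to re-attach the other top-level keys is to make two streaming passes over the file: the first collects the top-level scalar values, and the second slices .add and merges each chunk with them. The sketch below assumes that every top-level key other than add holds a scalar, as in the sample input. The function name slice_stream, its arguments, and the part-N.json naming are hypothetical choices for illustration.

```shell
# Hypothetical helper: slice_stream INPUT SIZE PREFIX
# Writes PREFIX-1.json, PREFIX-2.json, ... using two streaming passes over INPUT.
slice_stream() {
  # Pass 1: gather the top-level scalar keys. Only events shaped
  # [[key], value] match, so nothing under .add is collected.
  header=$(jq -nc --stream '
    reduce (inputs | select(length == 2 and (.[0] | length) == 1)) as [$p, $v]
      ({}; setpath($p; $v))' "$1") || return 1

  # Pass 2: rebuild the .add elements one at a time, group them into
  # arrays of at most $size elements, and merge each group with the header.
  jq -nc --stream --argjson size "$2" --argjson header "$header" '
    def regroup(stream; $n):
      foreach (stream, null) as $x ({a: []};
        if $x == null then .emit = .a
        elif (.a | length) == $n then .emit = .a | .a = [$x]
        else .emit = null | .a += [$x] end;
        select(.emit).emit);
    regroup(fromstream(2 | truncate_stream(inputs | select(.[0][0] == "add")));
            $size)
    | select(length > 0)          # skip the empty final group when .add is empty
    | $header + {add: .}' "$1" |
  awk -v prefix="$3" '{ f = prefix "-" NR ".json"; print > f; close(f) }'
}
```

Memory use is bounded by one chunk of .add plus the header, at the cost of reading the input twice.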
