I would like to slice a huge JSON file (~20 GB) into smaller chunks of data based on array size (10000/50000 etc.).
Input:

{"recDt":"2021-01-05","country":"US","name":"ABC","number":"9828","add":[{"evnCd":"O","rngNum":"1","state":"TX","city":"ANDERSON","postal":"77830"},{"evnCd":"O","rngNum":"2","state":"TX","city":"ANDERSON","postal":"77832"},{"evnCd":"O","rngNum":"3","state":"TX","city":"ANDERSON","postal":"77831"},{"evnCd":"O","rngNum":"4","state":"TX","city":"ANDERSON","postal":"77834"}]}
Currently I am running a loop that increments the x/y values to get the desired output, but performance is very slow: each iteration takes 8-20 seconds, depending on the size of the file, to complete the split process. I am using jq 1.6. Are there any alternatives for getting the result below?
Expected output, for a slice of 2 objects per array:

{"recDt":"2021-01-05","country":"US","name":"ABC","number":"9828","add":[{"rngNum":"1","state":"TX","city":"ANDERSON","postal":"77830"},{"rngNum":"2","state":"TX","city":"ANDERSON","postal":"77832"}]}
{"recDt":"2021-01-05","country":"US","name":"ABC","number":"9828","add":[{"rngNum":"3","state":"TX","city":"ANDERSON","postal":"77831"},{"rngNum":"4","state":"TX","city":"ANDERSON","postal":"77834"}]}
Tried with:

cat $inFile | jq -cn --stream 'fromstream(1|truncate_stream(inputs))' | jq --arg x $x --arg y $y -c '{recDt: .recDt, country: .country, name: .name, number: .number, add: .add[$x|tonumber:$y|tonumber]}' >> $outFile

and with:

cat $inFile | jq --arg x $x --arg y $y -c '{recDt: .recDt, country: .country, name: .name, number: .number, add: .add[$x|tonumber:$y|tonumber]}' >> $outFile
Please share if there are any alternatives available.
In this response, which calls jq just once, I'm going to assume your computer has enough memory to read the entire JSON. I'll also assume you want to create separate files for each slice, and that you want the JSON to be pretty-printed in each file.
Assuming a chunk size of 2, and that the output files are to be named using the template part-N.json, you could write:
< input.json jq -r --argjson size 2 '
  del(.add) as $object
  # _nwise is an undocumented jq builtin that splits an array into chunks of at most $size
  | (.add | _nwise($size) | ("\t", $object + {add: .}))
' | awk '
  /^\t/ { fn++; next }
  { print >> "part-" fn ".json" }
'
The trick being used here is that valid JSON cannot contain a raw (unescaped) tab character, so a line beginning with a tab can safely serve as a file separator for awk.
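If jq isn't a hard requirement, the same whole-file approach can be sketched in Python using only the standard library. This is an illustrative alternative, not the jq solution above; the function name, the "part-{}.json" template, and the chunk size are my own choices, and it likewise assumes the whole file fits in memory:

```python
import json

def split_json(in_path, out_template, size):
    """Read the whole JSON object, slice its "add" array into
    chunks of `size`, and write one pretty-printed file per chunk."""
    with open(in_path) as f:
        obj = json.load(f)  # needs enough memory for the entire file
    add = obj.pop("add", [])
    for i in range(0, len(add), size):
        # Copy the non-"add" keys and attach the current slice.
        part = dict(obj, add=add[i : i + size])
        with open(out_template.format(i // size + 1), "w") as out:
            json.dump(part, out, indent=2)

# Example: split_json("input.json", "part-{}.json", 2)
```

For a 20 GB input this trades jq's speed for readability, but avoids the repeated full-file passes of the looped approach.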
The following assumes the input JSON is too large to read into memory and therefore uses jq's --stream
command-line option.
To keep things simple, I'll focus on the "slicing" of the .add array, and won't worry about the other keys, pretty-printing, or other details, as you can easily adapt the following to your needs:
< input.json jq -nc --stream --argjson size 2 '
  # Emit arrays of at most $n consecutive items from the given stream.
  def regroup(stream; $n):
    foreach (stream, null) as $x ({a: []};
      if $x == null then .emit = .a
      elif .a|length == $n then .emit = .a | .a = [$x]
      else .emit = null | .a += [$x]
      end;
      select(.emit).emit);
  regroup(fromstream(2 | truncate_stream(inputs | select(.[0][0] == "add")));
          $size)' |
awk '{fn++; print > fn ".json"}'
This writes the arrays to files with filenames of the form N.json. Per the note above, each file contains just the sliced array, not the other top-level keys.
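For reference, the grouping logic of regroup can be sketched as a plain generator. This is a rough Python equivalent of the jq filter above (modulo empty input, for which the jq version emits one empty array), not part of the jq solution itself:

```python
def regroup(stream, n):
    """Yield lists of at most n consecutive items from stream,
    mirroring the jq regroup filter: full groups as they fill up,
    then the final, possibly shorter, group."""
    chunk = []
    for x in stream:
        chunk.append(x)
        if len(chunk) == n:
            yield chunk
            chunk = []
    if chunk:  # emit the trailing partial group, if any
        yield chunk
```

Because it is a generator, it holds at most n items in memory at a time, which is the same property that makes the streaming jq version suitable for very large inputs.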