jq: How can I pipe objects from array to different files based on data in object?

Question

I have a large array of objects stored in a master JSON file. I want to loop through that array, take each object, and append it to a new file based on a field in the object (in this case, the state name). In other words, given a data set covering many states, I want to split it into one file per state.

I'm using an existing jq expression to keep only the fields I actually need:

{ fipscode: .fipscode, level: .level, polid: .polid, polnum: .polnum, precinctsreporting: .precinctsreporting, precinctsreportingpct: .precinctsreportingpct, precinctstotal: .precinctstotal, raceid: .raceid, runoff: .runoff, statepostal: .statepostal, votecount: .votecount, votepct: .votepct, winner: .winner }

Here's a sample of my input:

[
    { "ballotorder": 2, "candidateid": "9718", "delegatecount": 0, "description": null, "electiondate": "2018-08-28", "electtotal": 0, "electwon": 0, "fipscode": null, "first": "Doug", "id": "3015-polid-64364-state-AZ-1", "incumbent": true, "initialization_data": false, "is_ballot_measure": false, "last": "Ducey", "lastupdated": "2018-08-30T00:01:38.897Z", "level": "state", "national": true, "officeid": "G", "officename": "Governor", "party": "GOP", "polid": "64364", "polnum": "5554", "precinctsreporting": 1488, "precinctsreportingpct": 0.9993000000000001, "precinctstotal": 1489, "raceid": "3015", "racetype": "Primary", "racetypeid": "R", "reportingunitid": "state-AZ-1", "reportingunitname": null, "runoff": false, "seatname": null, "seatnum": null, "statename": "Arizona", "statepostal": "AZ", "test": false, "uncontested": false, "votecount": 355455, "votepct": 0.705493, "winner": true },
    { "ballotorder": 2, "candidateid": "21689", "delegatecount": 0, "description": null, "electiondate": "2018-08-28", "electtotal": 0, "electwon": 0, "fipscode": null, "first": "Ron", "id": "10046-polid-62557-state-FL-1", "incumbent": false, "initialization_data": false, "is_ballot_measure": false, "last": "DeSantis", "lastupdated": "2018-08-29T19:29:50.367Z", "level": "state", "national": true, "officeid": "G", "officename": "Governor", "party": "GOP", "polid": "62557", "polnum": "13918", "precinctsreporting": 5968, "precinctsreportingpct": 1.0, "precinctstotal": 5968, "raceid": "10046", "racetype": "Primary", "racetypeid": "R", "reportingunitid": "state-FL-1", "reportingunitname": null, "runoff": false, "seatname": null, "seatnum": null, "statename": "Florida", "statepostal": "FL", "test": false, "uncontested": false, "votecount": 913997, "votepct": 0.564728, "winner": true },
    { "ballotorder": 2, "candidateid": "45555", "delegatecount": 0, "description": null, "electiondate": "2018-08-28", "electtotal": 0, "electwon": 0, "fipscode": null, "first": "Rex", "id": "38538-polid-67011-state-OK-1", "incumbent": false, "initialization_data": false, "is_ballot_measure": false, "last": "Lawhorn", "lastupdated": "2018-08-29T02:44:44.610Z", "level": "state", "national": true, "officeid": "G", "officename": "Governor", "party": "Lib", "polid": "67011", "polnum": "40784", "precinctsreporting": 1951, "precinctsreportingpct": 1.0, "precinctstotal": 1951, "raceid": "38538", "racetype": "Runoff", "racetypeid": "L", "reportingunitid": "state-OK-1", "reportingunitname": null, "runoff": false, "seatname": null, "seatnum": null, "statename": "Oklahoma", "statepostal": "OK", "test": false, "uncontested": false, "votecount": 379, "votepct": 0.409287, "winner": false }
]

As output, I would expect an Arizona.json file containing only the item(s) from that state, filtered to remove the unwanted fields:

[
  { "fipscode": null, "level": "state", "polid": "64364", "polnum": "5554", "precinctsreporting": 1488, "precinctsreportingpct": 0.9993000000000001, "precinctstotal": 1489, "raceid": "3015", "runoff": false, "statepostal": "AZ", "votecount": 355455, "votepct": 0.705493, "winner": true }
]

...and likewise for the other states involved (Florida.json and Oklahoma.json).

Here's the bash and jq script I have so far:

cat master.json |
jq -cn --stream 'fromstream(1|truncate_stream(inputs))' |
jq -c '.statename as $state | {
    fipscode: .fipscode,
    level: .level,
    polid: .polid,
    polnum: .polnum,
    precinctsreporting: .precinctsreporting,
    precinctsreportingpct: .precinctsreportingpct,
    precinctstotal: .precinctstotal,
    raceid: .raceid,
    runoff: .runoff,
    statepostal: .statepostal,
    votecount: .votecount,
    votepct: .votepct,
    winner: .winner
}'

What I can't figure out is how to intercept each row so I can determine where the output should go. Is this possible?

Answer 1

You can do this with one jq process that splits the input file into individual data items, plus one further jq instance per state that collates that state's items into an array, with bash providing the glue. The following example targets bash 4.2 or newer (it might work with 4.1; I'd need to check).

#!/usr/bin/env bash
case $BASH_VERSION in ''|[123].*|4.[01].*) echo "ERROR: Bash 4.2 required" >&2; exit 1;; esac

input_file=$1
[[ -s $input_file ]] || { echo "Usage: ${0##*/} input-file" >&2; exit 1; }

jq_split_script='
# modify this function to fit your needs
def relevantContentOnly:
  { fipscode, level, polid, polnum, precinctsreporting, precinctsreportingpct, precinctstotal, raceid, runoff, statepostal, votecount, votepct, winner };

.[] | [.statename, (relevantContentOnly | tojson)] | @tsv
'

# Use an associative array to map from state names to output FDs
declare -A out_fds=( )

# Read state / line-of-data pairs from our JQ script...
while IFS=$'\t' read -r state data; do
  # If we don't already have a writer for the current state, start one.
  if [[ ! ${out_fds[$state]} ]]; then
    exec {new_fd}> >(jq -n '[inputs]' >"$state.json")
    out_fds[$state]=$new_fd
  fi
  # Regardless, send the data to the FD we have for this state
  printf '%s\n' "$data" >&${out_fds[$state]}
done < <(jq -rc "$jq_split_script" <"$input_file") # ...running the JQ script above.

# close output FDs, so the JQ instances all flush
# (iterate over the array's values, which are the FD numbers, not its keys)
for fd in "${out_fds[@]}"; do
  exec {fd}>&-
done
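
To make the FD bookkeeping easier to follow in isolation, here is a minimal sketch of the same routing pattern with the jq parts stripped out. It is not the script above: it writes straight to plain files instead of feeding per-state jq collators, the scratch directory and the sample keys and payloads are invented for the demo, and it assumes a bash recent enough to support the {var}> redirection syntax.

```shell
#!/usr/bin/env bash
# Sketch only: route "key<TAB>payload" lines to one file per key,
# keeping a single open file descriptor per key.
outdir=$(mktemp -d)          # hypothetical scratch directory for the demo
declare -A out_fds=( )       # maps key -> file descriptor number

while IFS=$'\t' read -r key payload; do
  # First time we see this key: open its file and remember the FD.
  if [[ -z ${out_fds[$key]:-} ]]; then
    exec {new_fd}>"$outdir/$key.txt"   # bash picks a free FD and stores it in new_fd
    out_fds[$key]=$new_fd
  fi
  # Write the payload to the FD already associated with this key.
  printf '%s\n' "$payload" >&"${out_fds[$key]}"
done < <(printf '%s\t%s\n' Arizona a1 Florida f1 Arizona a2)   # made-up input

# The array's values are the FD numbers; close each one.
for fd in "${out_fds[@]}"; do
  exec {fd}>&-
done
```

After this runs, "$outdir/Arizona.txt" holds the two Arizona payloads in input order and "$outdir/Florida.txt" holds the single Florida one. The real script differs only in that each FD points at a jq process rather than directly at a file.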

Answer 2

Here's a simple solution piggybacking on what you started with:

< master.json jq -cn --stream 'fromstream(1|truncate_stream(inputs))' |
  jq -cr '.statename, {
    fipscode,
    level,
    polid,
    polnum,
    precinctsreporting,
    precinctsreportingpct,
    precinctstotal,
    raceid,
    runoff,
    statepostal,
    votecount,
    votepct,
    winner
}' | while read -r statename && read -r object
do
  echo "$object" >> "$statename.json"
done

Note that this will append the objects to any existing "$statename.json" files.

With your [original] sample data, the above produces Arizona.json, Florida.json, and Oklahoma.json, each containing one compact JSON object per line.
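
Because of the append behaviour, running the pipeline twice would duplicate every object. One way around that, sketched below, is to write into a freshly created directory. The directory and the name/object lines are stand-ins invented for the demo; in real use the printf would be replaced by the jq pipeline above.

```shell
#!/usr/bin/env bash
# Sketch: the same two-line read loop, writing into a new, empty directory
# so a re-run never appends to output left over from an earlier run.
outdir=$(mktemp -d)   # hypothetical; any empty directory would do

# Stand-in for the jq pipeline: alternating state-name / compact-object lines.
printf '%s\n' \
  'Arizona' '{"n":1}' \
  'Florida' '{"n":2}' \
  'Arizona' '{"n":3}' |
while IFS= read -r statename && IFS= read -r object; do
  # printf avoids echo's handling of backslashes and leading dashes.
  printf '%s\n' "$object" >> "$outdir/$statename.json"
done
```

Each file under "$outdir" then holds one object per line for its state, in input order.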

Tweak

If the overhead of running echo in a shell loop for every object is an issue, you could use awk in place of the while loop:

awk '
  fn!="" {print > fn; fn=""; next}
  {fn=$0 ".json";
   if (fns[fn]!=1){fns[fn]=1; print fn > "filenames.txt"}}'
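
For completeness, here is a sketch of how an awk program like this slots in where the while loop was. The input lines are invented stand-ins for the jq output, and the script changes into a scratch directory first so the demo does not touch real files. One difference from the loop is worth knowing: awk's `>` truncates each file the first time it is opened in a run and appends after that, so leftovers from earlier runs are overwritten rather than extended.

```shell
#!/usr/bin/env bash
cd "$(mktemp -d)" || exit 1   # hypothetical scratch directory for the demo

# Stand-in for the jq pipeline: alternating state-name / compact-object lines.
printf '%s\n' 'Arizona' '{"n":1}' 'Florida' '{"n":2}' 'Arizona' '{"n":3}' |
awk '
  # Object line: a target file is pending, so write the line there.
  fn != "" { print > fn; fn = ""; next }
  # Name line: remember the target file and record it once in filenames.txt.
  { fn = $0 ".json"
    if (!(fn in seen)) { seen[fn] = 1; print fn > "filenames.txt" } }'
```

This leaves Arizona.json with two lines, Florida.json with one, and filenames.txt listing each output file once.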

Finale

Since you want these files to contain arrays of objects, you could then use jq -s to achieve the final results. I'd probably collect the filenames within the while loop (naively, e.g. echo "$statename.json" >> filenames.txt), and then use sponge (from the moreutils package) to rewrite each file in place:

sort -u filenames.txt | 
  while read -r fn ; do 
    jq -s . "$fn" | sponge "$fn"
  done
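
If sponge is not installed, a temporary file plus mv gives a similar result. This is only a sketch: the scratch directory and the two sample files are made up for the demo, and the temp-file naming is just one reasonable choice.

```shell
#!/usr/bin/env bash
cd "$(mktemp -d)" || exit 1                        # hypothetical scratch directory
printf '%s\n' '{"n":1}' '{"n":3}' > Arizona.json   # made-up line-delimited input
printf '%s\n' 'Arizona.json' > filenames.txt

sort -u filenames.txt |
while IFS= read -r fn; do
  tmp=$(mktemp "$fn.XXXXXX") || exit 1   # temp file next to the target
  if jq -s . "$fn" > "$tmp"; then
    mv -- "$tmp" "$fn"                   # replace the original only on success
  else
    rm -f -- "$tmp"                      # leave the original untouched on failure
  fi
done
```

Writing to a separate file first matters because redirecting jq's output straight back onto "$fn" would truncate the file before jq had read it.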

solution1
1 ACCEPTED 2018-09-18 16:59:39

solution2
1 2018-09-18 17:42:17