
How to use jq to economically extract a small JSON fragment from near the beginning of a very large monolithic JSON document?

The JSON file in question is quite large (~1.5GB) but has some metadata at a known location (.meta.view.approvals) near the beginning.

How can jq or gojq be used to extract the object at that location without loading the whole file into memory, and without having to wait for the rest of the file to be processed once the item of interest has been extracted?

A generic method is sought, but the specific file I'm interested in is rows.json at https://data.montgomerycountymd.gov/api/views/4mse-ku6q/rows.json My copy was retrieved on Jan 12, 2023; the file size is 1459382170 bytes, and the value of .meta.view.createdAt in the file is 1403103517.

Command-line alternatives to jq, gojq, and jm would also be of interest, provided they are economical with respect to both memory and CPU usage.

Use jq's (or gojq's) streaming parser in conjunction with first/1 as shown below.

This reduces both the execution time and the memory requirements, e.g. compared with using the non-streaming parser: from about 50 seconds to a fraction of a second, and from 4,112MB of RAM (maximum resident set size) to about 3MB.

Notes:

  • jq and gojq do not produce identical results because gojq does not respect the ordering of keys within objects.
  • The performance statistics shown below are for the rows.json file described in the question.
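To illustrate the first note, here is a minimal sketch of the key-ordering difference (the gojq line is guarded in case gojq is not installed):

```shell
# jq preserves the key order of the input object:
echo '{"b":0,"a":1}' | jq -c .
# → {"b":0,"a":1}

# gojq sorts object keys alphabetically, so the same input comes back reordered:
command -v gojq >/dev/null && echo '{"b":0,"a":1}' | gojq -c .
```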

Here is an extract from the transcript showing the command invocations and key performance statistics, on a 3GHz machine.

+ /usr/bin/time -lp gojq -n --stream 'first(fromstream(3|truncate_stream(inputs| select(.[0][0:3] == ["meta","view", "approvals"]) )))' rows.json
user 0.00
sys 0.00
             3702784  maximum resident set size
             1531904  peak memory footprint
+ /usr/bin/time -lp jq -n --stream 'first(fromstream(3|truncate_stream(inputs| select(.[0][0:3] == ["meta","view", "approvals"]) )))' rows.json
user 0.00
sys 0.00
             1990656  maximum resident set size
             1114112  peak memory footprint
+ /usr/bin/time -lp jq .meta.view.approvals rows.json
user 39.90
sys 11.82
          4112465920  maximum resident set size
          6080188416  peak memory footprint
+ /usr/bin/time -lp gojq -n --stream '
  fromstream(3|truncate_stream(inputs | select(.[0][0:3] == ["meta","view", "approvals"]) ))' rows.json
user 495.30
sys 273.72
          7858896896  maximum resident set size
         38385831936  peak memory footprint

The following jm command produces essentially the same results:

/usr/bin/time -lp jm --pointer /meta/view/approvals rows.json
user 0.05
sys 0.07
            13594624  maximum resident set size
             7548928  peak memory footprint

An alternative would be to use first_run/2, defined as follows:

# Emit the first run of the items in the stream for which the condition is truthy
def first_run(stream; condition):
  label $out
  | foreach stream as $x (null;
      ($x|condition) as $y
      | if $y
        then [$x]
        elif . then break $out
        else .
        end;
      if . then .[0] else empty end);
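For example, first_run can take the place of first/1 in the streaming invocation shown earlier, using its `break` to stop reading as soon as the run of matching stream events ends. The following is a sketch of that usage (untested against the 1.5GB file; a tiny stand-in document is piped in place of rows.json):

```shell
# Extract .meta.view.approvals from a streamed JSON document, stopping as
# soon as the contiguous run of matching stream events has been consumed.
echo '{"meta":{"view":{"approvals":[{"reviewable":true}],"z":0}},"data":[]}' |
jq -cn --stream '
  # Emit the first run of the items in the stream for which the condition is truthy
  def first_run(stream; condition):
    label $out
    | foreach stream as $x (null;
        ($x|condition) as $y
        | if $y
          then [$x]
          elif . then break $out
          else .
          end;
        if . then .[0] else empty end);

  fromstream(3|truncate_stream(
    first_run(inputs; .[0][0:3] == ["meta","view","approvals"]) ))'
# → [{"reviewable":true}]
```

Since the streaming parser emits all events for a given path contiguously, first_run passes through exactly the events under .meta.view.approvals and then breaks, so the rest of the input is never parsed.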
