The JSON file in question is quite large (~1.5GB) but has some metadata at a known location (.meta.view.approvals) near the beginning.
How can jq or gojq be used to extract the object at that location without loading the whole file into memory, and without waiting for the rest of the file to be processed once the item of interest has been extracted?
A generic method is sought, but the specific file I'm interested in is rows.json (https://data.montgomerycountymd.gov/api/views/4mse-ku6q/rows.json). My copy was retrieved on Jan 12, 2023; the file size is 1459382170 bytes, and the value of .meta.view.createdAt in the file is 1403103517.
Command-line alternatives to jq, gojq, and jm would also be of interest, provided they are economical with respect to both memory and CPU usage.
Use jq's (or gojq's) streaming parser in conjunction with first/1, as shown below.
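This reduces both the execution time and the memory requirements compared to using the non-streaming parser: in the runs below, from about 50 seconds of CPU time (user + sys) to less than the 0.01 s resolution of the timer, and from 4,112 MB of RAM (maximum resident set size) to under 4 MB.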
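To see why this works, it can help to try the same filter on a small document. The sketch below uses a made-up stand-in for rows.json (the `doc` value is hypothetical; only the key layout mirrors the real file) and assumes jq 1.6 or later is on the PATH. With `--stream`, jq emits one event per leaf, `[path, leaf]`, plus closing events of the form `[path]`; the filter keeps the events under the target path, strips the three-element prefix, and lets `first/1` stop the run as soon as one value has been rebuilt.

```shell
# A tiny stand-in for rows.json (hypothetical data, same key layout).
doc='{"meta":{"view":{"id":"x","approvals":[{"state":"approved"}],"name":"n"}},"data":[[1,2],[3,4]]}'

# Show the events that --stream produces, one per line.
printf '%s\n' "$doc" | jq -c --stream .

# Keep only events under .meta.view.approvals, drop the 3-element path prefix,
# rebuild the value with fromstream, and stop after the first complete result.
approvals=$(printf '%s\n' "$doc" | jq -nc --stream '
  first(fromstream(3 | truncate_stream(
    inputs | select(.[0][0:3] == ["meta","view","approvals"]))))')

printf '%s\n' "$approvals"
```

Because `first/1` terminates the program once the value is emitted, the events for `name` and `data` are never requested from `inputs`.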
Notes:
Here is an extract from the transcript showing the command invocations and key performance statistics, on a 3GHz machine.
    + /usr/bin/time -lp gojq -n --stream 'first(fromstream(3|truncate_stream(inputs| select(.[0][0:3] == ["meta","view", "approvals"]) )))' rows.json
    user 0.00
    sys 0.00
    3702784 maximum resident set size
    1531904 peak memory footprint

    + /usr/bin/time -lp jq -n --stream 'first(fromstream(3|truncate_stream(inputs| select(.[0][0:3] == ["meta","view", "approvals"]) )))' rows.json
    user 0.00
    sys 0.00
    1990656 maximum resident set size
    1114112 peak memory footprint

    /usr/bin/time -lp jq .meta.view.approvals rows.json
    user 39.90
    sys 11.82
    4112465920 maximum resident set size
    6080188416 peak memory footprint

    /usr/bin/time -lp gojq -n --stream '
    fromstream(3|truncate_stream(inputs | select(.[0][0:3] == ["meta","view", "approvals"]) ))' rows.json
    user 495.30
    sys 273.72
    7858896896 maximum resident set size
    38385831936 peak memory footprint

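The fourth invocation above omits first/1, so the streaming parser goes on to process the rest of the file: 495 seconds of user time and about 7.9 GB maximum resident set size.
The following jm command produces essentially the same results as the first/1 invocations: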
/usr/bin/time -lp jm --pointer /meta/view/approvals rows.json
user 0.05
sys 0.07
13594624 maximum resident set size
7548928 peak memory footprint
An alternative to first/1 would be to use first_run/2, defined as follows:
    # Emit the first run of the items in the stream for which the condition is truthy
    def first_run(stream; condition):
      label $out
      | foreach stream as $x (null;
          ($x|condition) as $y
          | if $y
            then [$x]
            elif . then break $out
            else .
            end;
          if . then .[0] else empty end);
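As a rough illustration of how first_run/2 might be used, the sketch below applies it first to a plain stream of numbers and then to the streaming events of a small made-up document (the `doc` value is hypothetical and only mimics the layout of rows.json). It assumes jq 1.6 or later; the shell variable `first_run_def` is just a convenience for reusing the definition.

```shell
# The definition from above, kept in a shell variable for reuse.
first_run_def='
def first_run(stream; condition):
  label $out
  | foreach stream as $x (null;
      ($x|condition) as $y
      | if $y then [$x]
        elif . then break $out
        else .
        end;
      if . then .[0] else empty end);'

# On a plain stream: 1 is skipped, the run 2,4,6 is emitted, and 7 ends the run.
evens=$(jq -nc "$first_run_def"' [first_run(1,2,4,6,7,8; . % 2 == 0)]')
printf '%s\n' "$evens"

# On streaming events: take the first run of events under .meta.view.approvals,
# strip the 3-element path prefix, and rebuild the value.
doc='{"meta":{"view":{"id":"x","approvals":[{"state":"approved"}],"name":"n"}},"data":[[1,2],[3,4]]}'
approvals=$(printf '%s\n' "$doc" | jq -nc --stream "$first_run_def"'
  fromstream(3 | truncate_stream(
    first_run(inputs; .[0][0:3] == ["meta","view","approvals"])))')
printf '%s\n' "$approvals"
```

Here the `break $out` fires on the first event after the run (the one for `name`), so no further events are read from `inputs`.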