
Extract all unique keys from arbitrarily nested JSON data with jq

As the subject states, my goal is to write an all_keys function that extracts all keys from an arbitrarily nested JSON blob, traversing contained arrays and objects as needed, and outputs an array containing the keys, without duplicates.

For instance, given the following input:

[
    {"name": "/", "children": [
      {"name": "/bin", "children": [
        {"name": "/bin/ls", "children": []},
        {"name": "/bin/sh", "children": []}]},
      {"name": "/home", "children": [
        {"name": "/home/stephen", "children": [
          {"name": "/home/stephen/jq", "children": []}]}]}]},
    {"name": "/", "children": [
      {"name": "/bin", "children": [
        {"name": "/bin/ls", "children": []},
        {"name": "/bin/sh", "children": []}]},
      {"name": "/home", "children": [
        {"name": "/home/stephen", "children": [
          {"name": "/home/stephen/jq", "children": []}]}]}]}      
]

The all_keys function should produce this output:

[
  "children",
  "name"
]

To this end, I devised the following function, but it's as slow as it is convoluted, so I was wondering whether you could come up with a more concise and faster way of obtaining the same result.

def all_keys:
    . as $in |
    if type == "object" then
        # add each key, plus every key found beneath its value
        reduce keys[] as $k (
            [];
            . + [$k, ($in[$k] | all_keys)[]]
        ) | unique
    elif type == "array" then (
        # recurse into each element and merge the collected keys
        reduce .[] as $i (
            [];
            . + ($i | all_keys)
        ) | unique
    )
    else
        # scalars contribute no keys
        empty
    end
;
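
Saving that definition together with a trailing all_keys expression in a script file, it can be run with jq's -f option; the file names here are just placeholders:

jq -f all_keys.jq input.json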

For reference, running that function on this 53MB JSON file takes roughly 22 seconds on my Intel T9300@2.50GHz CPU (I know, it's quite ancient but still works fine).

A naive approach would just collect all the keys and get the unique values.

[.. | objects | keys[]] | unique
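
On the sample input above, this collects two keys from each of the 14 objects before deduplicating; a quick sanity check, with input.json standing in for the sample data saved to a file:

jq '[.. | objects | keys[]] | length' input.json
# => 28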

But with data of that size, it's a bit on the slow side, since every key occurrence has to be collected into one large array and then sorted.

We can do a little better. Since we're only trying to determine the distinct keys, a hashmap of some sort would be more efficient, and jq objects can act as one.

reduce (.. | objects | keys[]) as $k ({}; .[$k] = true) | keys
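
The trick is that repeated .[$k] = true assignments collapse duplicates as they happen, and keys at the end returns the (sorted) key set. A minimal illustration of the idiom on a hand-written stream of keys:

jq -n 'reduce ("name", "children", "name") as $k ({}; .[$k] = true)'
# => {"name": true, "children": true}
# appending "| keys" yields ["children", "name"] (keys also sorts)

This also explains why the expected output above comes out sorted.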

I didn't measure the time on this precisely, but it's orders of magnitude faster than the other version: I didn't even wait for the other one to finish, while this one completed well within 10 seconds on my work machine (i5-2400@3.1GHz).

I think you'll find the following variant of the OP's all_keys is actually slightly faster than the version using the .. operator; this is probably to be expected: for jeopardy.json, .. generates 1,731,807 JSON entities in total, whereas there are only 216,930 JSON objects:

def all_keys:
  # collect the distinct outputs of f, using an object as a hashmap
  def uniquely(f): reduce f as $x ({}; .[$x] = true) | keys;
  # emit every key encountered at any nesting level, as a stream
  def rkeys:
    if type == "object" then keys[] as $k | ($k, (.[$k]|rkeys))
    elif type == "array" then .[]|rkeys
    else empty
    end;
  uniquely(rkeys);
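
To compare the two answers head to head, one way is to save each filter with a call to it appended (this variant in one script, the reduce one-liner in another; the file names here are placeholders) and time both:

time jq -f all_keys_rkeys.jq jeopardy.json
time jq -f all_keys_dotdot.jq jeopardy.json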
