Flatten nested JSON using jq

I'd like to flatten a nested JSON object, e.g. {"a":{"b":1}} to {"a.b":1}, in order to ingest it into Solr.

I have 11 TB of JSON files which are both nested and contain dots in field names, meaning that neither Elasticsearch (because of the dots) nor Solr (because of nesting without the _childDocument_ notation) can ingest them as is.

The other solution would be to replace the dots in the field names with underscores and push the data to Elasticsearch, but I have far better experience with Solr, so I prefer the flattening solution (unless Solr can ingest those nested JSON documents as is?).

I will prefer Elasticsearch only if its ingestion process takes far less time than Solr's, because my priority is ingesting as fast as I can (which is why I chose jq instead of scripting it in Python).

Kindly help.

EDIT:

I think examples 3 and 4 here solve this for me: https://lucidworks.com/blog/2014/08/12/indexing-custom-json-data/

I'll try soon.

You can also use the following jq command to flatten nested JSON objects in this manner:

[leaf_paths as $path | {"key": $path | join("."), "value": getpath($path)}] | from_entries

The way it works is: leaf_paths returns a stream of arrays which represent the paths in the given JSON document at which "leaf elements" appear, that is, elements which do not have child elements, such as numbers, strings and booleans. We turn each path in that stream into an object with key and value properties, where key contains the elements of the path array joined by dots into a single string, and value contains the element at that path. Finally, we put the entire thing in an array and run from_entries on it, which transforms an array of {key, value} objects into an object containing those key-value pairs.
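To make the three stages concrete, here is a minimal sketch that runs each stage separately on a small sample document. The sample document and the flatten_entries function name are made up for illustration; only the jq filter itself comes from the answer above.

```shell
# Hypothetical sample document, used only for illustration.
doc='{"a":{"b":1,"c":"x"},"d":2}'

# Stage 1: leaf_paths emits one path array per leaf value.
printf '%s\n' "$doc" | jq -c 'leaf_paths'
# ["a","b"]
# ["a","c"]
# ["d"]

# Stage 2: each path becomes a {key, value} object.
printf '%s\n' "$doc" |
  jq -c 'leaf_paths as $path | {"key": $path | join("."), "value": getpath($path)}'
# {"key":"a.b","value":1}
# {"key":"a.c","value":"x"}
# {"key":"d","value":2}

# Stage 3: collect the objects into an array and convert with from_entries.
# flatten_entries is a hypothetical wrapper name, not part of jq.
flatten_entries() {
  jq -c '[leaf_paths as $path | {"key": $path | join("."), "value": getpath($path)}] | from_entries'
}
printf '%s\n' "$doc" | flatten_entries
# {"a.b":1,"a.c":"x","d":2}
```

The sample deliberately contains no arrays, so the filter behaves the same whether or not the installed jq lets join handle numeric path components.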

This is just a variant of Santiago's jq filter:

. as $in 
| reduce leaf_paths as $path ({};
     . + { ($path | map(tostring) | join(".")): $in | getpath($path) })

It avoids the overhead of constructing and then destructuring the key/value objects.

(If you have access to a version of jq later than jq 1.5, you can omit the "map(tostring)".)

Two important points about both these jq solutions:

  1. Arrays are also flattened. E.g. given {"a": {"b": [0,1,2]}} as input, the output would be:

     { "a.b.0": 0, "a.b.1": 1, "a.b.2": 2 }

  2. If any of the keys in the original JSON contain periods, then key collisions are possible; such collisions will generally result in the loss of a value. This would happen, for example, with the following input:

     {"a.b":0, "a": {"b": 1}}
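Both points are easy to reproduce. The sketch below wraps the reduce-based filter in a shell function (flatten is a hypothetical name chosen for this example) and feeds it the two inputs from the list. For the collision case it only counts the resulting keys, since which of the two values survives depends on the order in which jq visits the paths.

```shell
# flatten is a hypothetical wrapper name; the filter is the reduce variant above.
flatten() {
  jq -c '. as $in
    | reduce leaf_paths as $path ({};
        . + { ($path | map(tostring) | join(".")): $in | getpath($path) })'
}

# Point 1: array indices become path components.
printf '%s\n' '{"a":{"b":[0,1,2]}}' | flatten
# {"a.b.0":0,"a.b.1":1,"a.b.2":2}

# Point 2: two leaves map to the same key "a.b", so only one key remains.
printf '%s\n' '{"a.b":0,"a":{"b":1}}' | flatten | jq 'length'
# 1
```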

Here is a solution that uses tostream, select, join, reduce and setpath:

  reduce ( tostream | select(length==2) | .[0] |= [join(".")] ) as [$p,$v] (
     {}
     ; setpath($p; $v)
  )
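A short sketch of what this filter sees, assuming jq 1.6 or later (so that join accepts the numeric array indices in a path). tostream emits two-element [path, value] events for leaves and one-element [path] events that close a container; select(length==2) keeps only the leaf events, and the update collapses each path into a single dotted key for setpath. The sample input and the flatten_stream name are made up for this example.

```shell
# Hypothetical sample input containing a null leaf and an array.
doc='{"a":{"b":null},"c":[1]}'

# The raw event stream: leaf events have two elements, closing events have one.
printf '%s\n' "$doc" | jq -c 'tostream'
# [["a","b"],null]
# [["a","b"]]
# [["c",0],1]
# [["c",0]]
# [["c"]]

# flatten_stream is a hypothetical wrapper name around the filter above.
flatten_stream() {
  jq -c 'reduce ( tostream | select(length==2) | .[0] |= [join(".")] ) as [$p,$v] (
      {}
      ; setpath($p; $v)
    )'
}
printf '%s\n' "$doc" | flatten_stream
# {"a.b":null,"c.0":1}
```

Because tostream reports every leaf, the null value survives here, whereas leaf_paths, which is built on select, skips null and false leaves.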

As it turns out, curl -XPOST 'http://localhost:8983/solr/flat/update/json/docs' -d @json_file does just this:

{
    "a.b":[1],
    "id":"24e3e780-3a9e-4fa7-9159-fc5294e803cd",
    "_version_":1535841499921514496
}

EDIT 1: This was with Solr 6.0.1, started with bin/solr -e cloud. The collection name is flat; everything else is default (including data-driven-schema, which is also the default).

EDIT 2: The final script I used:

  find . -name '*.json' -exec curl -XPOST 'http://localhost:8983/solr/collection1/update/json/docs' -d @{} \;

EDIT 3: It is also possible to parallelize with xargs and to add the id field with jq:

  find . -name '*.json' -print0 | xargs -0 -n 1 -P 8 -I {} sh -c "cat {} | jq '. + {id: .a.b}' | curl -XPOST 'http://localhost:8983/solr/collection/update/json/docs' -d @-"

where -P is the parallelism factor. I used jq to set an id so that multiple uploads of the same document won't create duplicates in the collection (while I was searching for the optimal value of -P, the repeated uploads created duplicates in the collection).

I've recently written a script called jqg that flattens arbitrarily complex JSON and searches the results using a regex; to simply flatten the JSON, your regex would be '.', which matches everything. Unlike the answers above, the script will handle embedded arrays, false and null values, and can optionally treat empty arrays and objects ([] and {}) as leaf nodes.

$ jq . test/odd-values.json
{
  "one": {
    "start-string": "foo",
    "null-value": null,
    "integer-number": 101
  },
  "two": [
    {
      "two-a": {
        "non-integer-number": 101.75,
        "number-zero": 0
      },
      "true-boolean": true,
      "two-b": {
        "false-boolean": false
      }
    }
  ],
  "three": {
    "empty-string": "",
    "empty-object": {},
    "empty-array": []
  },
  "end-string": "bar"
}

$ jqg . test/odd-values.json
{
  "one.start-string": "foo",
  "one.null-value": null,
  "one.integer-number": 101,
  "two.0.two-a.non-integer-number": 101.75,
  "two.0.two-a.number-zero": 0,
  "two.0.true-boolean": true,
  "two.0.two-b.false-boolean": false,
  "three.empty-string": "",
  "three.empty-object": {},
  "three.empty-array": [],
  "end-string": "bar"
}

jqg was tested using jq 1.6.

Note: I am the author of the jqg script.
