
Process huge GEOJson file with jq

Given a GeoJSON file as follows:

{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {
        "FEATCODE": 15014
      },
      "geometry": {
        "type": "Polygon",
        "coordinates": [
          .....

I want to end up with the following:

{
  "type": "FeatureCollection",
  "features": [
    {
      "tippecanoe" : {"minzoom" : 13},
      "type": "Feature",
      "properties": {
        "FEATCODE": 15014
      },
      "geometry": {
        "type": "Polygon",
        "coordinates": [
          .....

i.e., I have added the tippecanoe object to each feature in the features array.

I can make this work with:

 jq '.features[].tippecanoe.minzoom = 13' <GEOJSON FILE> > <OUTPUT FILE>

This is fine for small files, but processing a large 414MB file seems to take forever, with the processor maxing out and nothing being written to the output file.

Reading further into jq, it appears that the --stream command-line option may help, but I am completely confused as to how to use it for my purposes.

I would be grateful for an example command line that serves my purposes, along with an explanation of what --stream is doing.

A one-pass jq-only approach may require more RAM than is available. If that is the case, then a simple all-jq approach is shown below, together with a more economical approach based on using jq along with awk.

The two approaches are the same except for the reconstitution of the stream of objects into a single JSON document. This step can be accomplished very economically using awk.

In both cases, the large JSON input file with objects of the required form is assumed to be named input.json.

jq-only

# split .features into a stream of individual feature objects
jq -c '.features[]' input.json |
    # update each feature
    jq -c '.tippecanoe.minzoom = 13' |
    # slurp the updated stream and rebuild the FeatureCollection
    jq -c -s '{type: "FeatureCollection", features: .}'

jq and awk

jq -c '.features[]' input.json |
   jq -c '.tippecanoe.minzoom = 13' | awk '
     BEGIN {print "{\"type\": \"FeatureCollection\", \"features\": ["; }
     NR==1 { print; next }
           {print ","; print}
     END   {print "] }";}'

Performance comparison

For comparison, an input file with 10,000,000 objects in .features[] was used. Its size is about 1GB.

u+s (user + system time):

jq-only:              15m 15s
jq-awk:                7m 40s
jq one-pass using map: 6m 53s
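
For reference, the "jq one-pass using map" variant timed above is of the form discussed in the next answer, i.e.:

jq -c '.features |= map(.tippecanoe.minzoom = 13)' input.json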

An alternative solution could be, for example:

jq '.features |= map_values(.tippecanoe.minzoom = 13)'

To test this, I created a sample JSON (in Python) as

d = {'features': [{"type":"Feature", "properties":{"FEATCODE": 15014}} for i in range(0,N)]}

and inspected the execution time as a function of N. Interestingly, while the map_values approach seems to have linear complexity in N, .features[].tippecanoe.minzoom = 13 exhibits quadratic behavior (already for N=50000, the former method finishes in about 0.8 seconds, while the latter needs around 47 seconds).
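
To reproduce this kind of measurement, a rough timing sketch along the following lines can be used (the sample_$N.json file names are illustrative); it generates inputs of increasing size with jq and times both filters:

for N in 10000 20000 50000; do
  # generate a sample input with $N features
  jq -cn --argjson N $N \
    '{features: [range(0; $N) | {type: "Feature", properties: {FEATCODE: 15014}}]}' \
    > sample_$N.json
  time jq '.features |= map_values(.tippecanoe.minzoom = 13)' sample_$N.json > /dev/null
  time jq '.features[].tippecanoe.minzoom = 13' sample_$N.json > /dev/null
done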

Alternatively, one might just do it manually with, e.g., Python:

import json
import sys

# read the entire input file into memory
with open(sys.argv[1], 'r') as F:
    data = json.load(F)

# attach the tippecanoe object to every feature
extra_item = {"minzoom": 13}
for feature in data['features']:
    feature["tippecanoe"] = extra_item

with open(sys.argv[2], 'w') as F:
    F.write(json.dumps(data))
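
Usage, with the script above saved as, say, add_minzoom.py (an illustrative name):

python3 add_minzoom.py input.json output.json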

In this case, map rather than map_values is far faster (*):

.features |= map(.tippecanoe.minzoom = 13)

However, using this approach will still require enough RAM.
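
On Linux, one way to check whether enough RAM is available is to watch jq's peak memory usage with GNU time (the /usr/bin/time binary, not the shell builtin), for example:

/usr/bin/time -v jq '.features |= map(.tippecanoe.minzoom = 13)' input.json > output.json

The "Maximum resident set size" line in the report shows how much memory jq actually needed.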

p.s. If you want to use jq to generate a large file for timing, consider:

# run with, e.g.: jq -nc -f generate.jq > input.json
def N: 1000000;

def data:
   {"features": [range(0;N) | {"type":"Feature", "properties": {"FEATCODE": 15014}}] };

data

(*) Using map, 20s for 100MB, and approximately linear.

Here, based on the work of @nicowilliams at GitHub, is a solution that uses the streaming parser available with jq. The solution is very economical with memory, but is currently quite slow if the input is large.

The solution has two parts: a function for injecting the update into the stream produced using the --stream command-line option; and a function for converting the stream back to JSON in the original form.
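
For orientation: with --stream, jq parses the input incrementally and, instead of building the whole document in memory, emits a stream of [path, leaf-value] events together with closing events of the form [path]. A minimal illustration:

echo '{"a":{"b":1}}' | jq -c --stream '.'

produces:

[["a","b"],1]
[["a","b"]]
[["a"]]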

Invocation:

jq -cnr --stream -f program.jq input.json

program.jq

# inject the given object into the stream produced from "inputs" with the --stream option
def inject(object):
  [object|tostream] as $object
  | 2
  | truncate_stream(inputs)
  | if (.[0]|length == 1) and length == 1
    then $object[]
    else .
    end ;

# Input: the object to be added
# Output: text
def output:
  . as $object
  | ( "[",
      foreach fromstream( inject($object) ) as $o
        (0;
         if .==0 then 1 else 2 end;
         if .==1 then $o else ",", $o end),
      "]" ) ;

{}
| .tippecanoe.minzoom = 13
| output
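
To see why the stream is truncated by 2: the events for the GeoJSON input carry paths beginning ["features", i, ...], and 2 | truncate_stream(inputs) strips that two-element prefix while dropping events whose paths are not longer than 2. As an illustration:

echo '{"features":[{"type":"Feature"}]}' | jq -cn --stream '2|truncate_stream(inputs)'

produces:

[["type"],"Feature"]
[["type"]]

The last event for each truncated feature is a closing event whose path has length 1; that is the case the if in inject detects, splicing in $object's streamed events so that fromstream sees the extra key just before the feature object is closed.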

Generation of test data

def data(N):
  {"features":
    [range(0;N) | {"type":"Feature", "properties": {"FEATCODE": 15014}}] };

Example output

With N=2:

[
{"type":"Feature","properties":{"FEATCODE":15014},"tippecanoe":{"minzoom":13}}
,
{"type":"Feature","properties":{"FEATCODE":15014},"tippecanoe":{"minzoom":13}}
]
