What is the best way to `sort_by` a huge file that doesn't fit in memory with `jq`?

I'm working with a text file that has one JSON object per line, and I want to use jq to select, group_by(key1), and sort_by(key1) the file. The file looks like this:

# /tmp/sample.json
{"key1": "value11", "key2": "value21", "key3": "value31"}
{"key1": "value11", "key2": "value22", "key3": "value32"}
{"key1": "value11", "key2": "value22", "key3": "value32"}
{"key1": "value13", "key2": "value23", "key3": "value33"}
{"key1": "value13", "key2": "value24", "key3": "value34"}
{"key1": "value16", "key2": "value26", "key3": "value36"}
{"key1": "value17", "key2": "value27", "key3": "value37"}
...

I'm running the file through Hadoop MapReduce in a manner similar to this question:

hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -files $HOME/bin/jq,$HOME/proj-map.jq,$HOME/proj-reduce.jq \
    -mapper "./jq -c --from-file=proj-map.jq" \
    -reducer "./jq -ncr --from-file=proj-reduce.jq" \
    -input  /tmp/sample.json \
    -output /tmp/sample.json.output

with

#proj-map.jq
# some transformation
{key1, key2}

and

#proj-reduce.jq
# by @peak -- https://stackoverflow.com/a/45715729/948914
# sort-free stream-oriented variant of group_by/1
# f should always evaluate to a string.
# Output: a stream of arrays, one array per group
def GROUPS_BY(stream; f): reduce stream as $x ({}; .[$x|f] += [$x] ) | .[] ;

GROUPS_BY(inputs|.key1; .) | {key1: .[0], size: length} | (.size|tostring) + "\t" + tostring

The above yields something that I can feed into Unix sort for sorting:

3 {"key1": "value11", "size": 3}
2 {"key1": "value13", "size": 2}
1 {"key1": "value16", "size": 1}
1 {"key1": "value17", "size": 1}

This works. Now, I don't want to rely on Unix sort, and I'm looking for a way to use jq's sort_by(). This looks challenging because, as far as I understand, sort_by() requires an array as input, which implies that the whole array is loaded into memory. Since the file may not fit in memory, I'm looking for a way to use jq's sort_by() without reading the entire file into memory. In particular, I'm interested in an efficient, streaming-style way of sorting, similar to Unix sort or to the streaming group_by() above.

If there is no such way, then the best approach I know of is this answer, which combines jq and Unix sort, as I showed above. Obviously it would be great if sort_by worked like Unix sort, but I don't have the means to find out.

[The following was written before the question was updated to explain that the input consists of multiple JSON entities.]

To simplify things a bit, the following assumes that you have a huge file consisting of a single JSON array. Since, by assumption, this file is too big to read into memory, the first step is to get each of the top-level array elements on a line by itself. That can be done using jq's --stream command-line option, as described in the jq FAQ, e.g. perhaps along the lines of:

jq -cn --stream 'fromstream( inputs|(.[0] |= .[1:]) | select(. != [[]]) )'

The next step is to prefix each of these lines with the "sort by" value, as described in the link included in the question. (That is, jq can easily be used for this.)

Next, run the operating system's sort.
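A minimal sketch of this step, assuming each line has the form `<sort value><TAB><JSON>` as produced by the previous step, and that the sort value itself contains no tab. The function name `sort_decorated` is made up for illustration. GNU sort performs an external merge sort, spilling to temporary files when the input outgrows its buffer, so it does not need the whole file in memory.

```shell
# Sort "<key><TAB><json>" lines by the key, then strip the key again.
# Assumption: the key contains no tab characters.
sort_decorated() {
    # -t sets the field separator to a tab; -k1,1 restricts the key to field 1.
    # LC_ALL=C gives plain byte-order comparison, independent of the locale.
    LC_ALL=C sort -t "$(printf '\t')" -k1,1 | cut -f2-
}

# Example with two decorated lines in the wrong order:
printf 'value13\t{"key1":"value13"}\nvalue11\t{"key1":"value11"}\n' | sort_decorated
```

For a numeric key such as the size column in the question's output, `-k1,1nr` in place of `-k1,1` would sort by descending count.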

Finally, if you really need the result as a single large array, you could use a text-processing tool (e.g. awk).
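One way this last step might look, as a sketch that assumes the sorted stream has exactly one JSON value per line. The function name `lines_to_array` is hypothetical. awk processes one line at a time, so memory use stays flat regardless of file size.

```shell
# Wrap one-JSON-value-per-line input into a single JSON array.
lines_to_array() {
    awk 'BEGIN  { printf "[" }       # open the array once
         NR > 1 { printf "," }       # comma before every element but the first
                { printf "%s", $0 }  # the JSON value itself, unchanged
         END    { print "]" }'       # close the array
}

# Example with two sorted lines:
printf '{"key1":"value11"}\n{"key1":"value13"}\n' | lines_to_array
```

Empty input yields `[]`, which is still valid JSON.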


 