
Convert TSV file to multiple JSON arrays with jq

I would like to use the IMDb open data, but it is served in TSV format, which is not very convenient.

https://datasets.imdbws.com/title.crew.tsv.gz

tconst  directors   writers
tt0000238   nm0349785   \N
tt0000239   nm0349785   \N
tt0000240   \N  \N
tt0000241   nm0349785   \N
tt0000242   nm0617588   nm0617588
tt0000243   nm0349785   \N
tt0000244   nm0349785   \N
tt0000245   \N  \N
tt0000246   nm0617588   \N
tt0000247   nm0002504,nm0005690,nm2156608   nm0000636,nm0002504
tt0000248   nm0808310   \N
tt0000249   nm0808310   \N
tt0000250   nm0005717   \N
tt0000251   nm0177862   \N

I want to convert the TSV data to JSON.

[
  {
    "tconst": "tt0000247",
    "directors": [
      "nm0002504",
      "nm0005690",
      "nm2156608"
    ],
    "writers": [
      "nm0000636",
      "nm0002504"
    ]
  },
  {
    "tconst": "tt0000248",
    "directors": [
      "nm0808310"
    ],
    "writers": [
      "\\N"
    ]
  }
]

I can do this with the following command:

jq -rRs 'split("\n")[1:-1] |
         map([split("\t")[]|split(",")] | {
                 "tconst":.[0][0],
                 "directors":.[1],
                 "writers":.[2]
             }
    )' ./title.crew.tsv > ./title.crew.json

However, the file is very large, and I get out-of-memory errors.

1. How can I split this TSV file into several JSON files, each with 1000 records?

./title.crew.page1.json
./title.crew.page2.json
./title.crew.page3.json

2. How can I exclude empty fields, so that they become an empty array?

"writers": [ "\\N" ] -> "writers": [ ]

UPD (the second question has been solved):

jq -rRs 'split("\n")[1:-1] |
         map([split("\t")[]|split(",")] | 
         .[2] |= if .[0] == "\\N" then [] else . end | {
                 "tconst":.[0][0],
                 "directors":.[1],
                 "writers":.[2]
             }
    )' ./title.crew.tsv > ./title.crew.json

[
  {
    "tconst": "tt0000247",
    "directors": [
      "nm0002504",
      "nm0005690",
      "nm2156608"
    ],
    "writers": [
      "nm0000636",
      "nm0002504"
    ]
  },
  {
    "tconst": "tt0000248",
    "directors": [
      "nm0808310"
    ],
    "writers": []
  }
]

Thanks for your answers.

"they serve it in TSV format, which is not very convenient."

Actually, jq and TSV go extremely well together, and using jq to process TSV files certainly does not require the -s ("slurp") option, which is usually (but by no means always) best avoided.

If your goal were simply to produce a stream of the "tconst" objects, you could process the TSV file line by line. If you wanted to assemble that stream into a single array, you could use jq with the -c option to produce one JSON object per line, and then assemble them using a tool such as awk (i.e., simply adding the opening and closing brackets and the delimiting commas).
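The line-by-line idea can be sketched in Python as well (Python is swapped in here for the jq-plus-awk combination mentioned above, purely for illustration). The helper names `parse_line` and `json_array_pieces` are hypothetical, and the sketch assumes every data line has exactly the three columns of title.crew.tsv. Only one record is held in memory at a time; the brackets and commas are added around the stream.

```python
import json

NULL = "\\N"  # the literal two-character marker \N used for a missing value


def to_list(field):
    """Split a comma-separated field; the \\N marker becomes an empty list."""
    return [] if field == NULL else field.split(",")


def parse_line(line):
    """Turn one TSV data line (assumed: exactly three columns) into a dict."""
    tconst, directors, writers = line.rstrip("\n").split("\t")
    return {
        "tconst": tconst,
        "directors": to_list(directors),
        "writers": to_list(writers),
    }


def json_array_pieces(lines):
    """Yield the text of one JSON array piece by piece, one record per line."""
    yield "["
    for i, line in enumerate(lines):
        # A comma goes before every record except the first.
        yield (",\n" if i else "\n") + json.dumps(parse_line(line))
    yield "\n]\n"


# Possible usage (the header line is skipped first):
#   import sys
#   with open("title.crew.tsv") as f:
#       next(f)
#       sys.stdout.writelines(json_array_pieces(f))
```

Because the pieces are generated lazily, memory use stays flat regardless of the file size, which is the point of avoiding the slurp option.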

In your case, though, it would probably be simplest to split the TSV file first (e.g. using the unix/linux/mac split command, see below) and then process each file along the lines of your jq program. Since your chunks are quite small (1000 objects each), you could even use jq with the -s option, but it's just as easy to use inputs and the -n command-line option instead:

jq -n '[inputs]'

Or you could combine these strategies: split the file into chunks, process each chunk using jq with the -c option to produce a stream, and assemble each such stream into a JSON array.

split

For splitting a file into chunks, see for example:

How to split a large text file into smaller files with equal number of lines?

Split text file into smaller multiple text file using command line

and many others.
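If the split command is not available, the same chunking can be approximated in a few lines of Python. This is only a sketch: the function name `split_tsv` and the `chunkNNNN.tsv` naming scheme are assumptions made for the example. It copies the header line into every chunk, so a jq program that drops the first line (as the one in the question does) can be run on each chunk unchanged.

```python
import os
from itertools import islice


def split_tsv(path, out_dir, lines_per_chunk=1000):
    """Split `path` into out_dir/chunk0001.tsv, chunk0002.tsv, ...

    Every chunk starts with a copy of the header line and holds at most
    `lines_per_chunk` data lines. Returns the list of paths written.
    """
    written = []
    with open(path, encoding="utf-8") as src:
        header = src.readline()
        while True:
            # Read the next batch of data lines without loading the whole file.
            chunk = list(islice(src, lines_per_chunk))
            if not chunk:
                break
            name = os.path.join(out_dir, "chunk%04d.tsv" % (len(written) + 1))
            with open(name, "w", encoding="utf-8") as dst:
                dst.write(header)
                dst.writelines(chunk)
            written.append(name)
    return written
```

Each resulting chunk is small enough to be processed with the slurping jq program from the question.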

If Python is an option for you, how about making use of it? Python's data structures map closely onto JSON. Would you please try:

#!/usr/bin/env python3

import json

ary = []                                        # declare an empty array
with open('./title.crew.tsv') as f:
    header = f.readline().rstrip('\n').split('\t')  # read the header line and split
    for line in f:                              # iterate over the following lines
        body = line.rstrip('\n').split('\t')
        d = {}                                  # empty dictionary
        for i in range(len(header)):
            if ',' in body[i]:                  # if the value contains ","
                b = body[i].split(',')          # then split the value on it
            else:
                b = body[i]
            if b == '\\N':                      # if the value is the literal \N
                b = []                          # then replace with an empty array
            d[header[i]] = b                    # store the value under its column name
        ary.append(d)                           # append the object to the array
print(json.dumps(ary, indent=2))

Output:

[
  {
    "tconst": "tt0000238",
    "directors": "nm0349785",
    "writers": []
  },
  {
    "tconst": "tt0000239",
    "directors": "nm0349785",
    "writers": []
  },
  {
    "tconst": "tt0000240",
    "directors": [],
    "writers": []
  },
<..SNIPPED..>

As Python is a general-purpose programming language, it offers a lot of flexibility in processing the input. It is also easy to split the result into multiple JSON files.
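As a rough illustration of that last point, here is one way the paging could look. The names `read_records` and `write_pages` are hypothetical, and the sketch assumes that every column after the first holds a comma-separated list (true for directors and writers), so single values become one-element arrays as the question asks. Only one page of records is kept in memory at a time.

```python
import json
from itertools import islice


def read_records(lines):
    """Yield one dict per TSV data line; the first line is taken as the header."""
    lines = iter(lines)
    header_line = next(lines, None)
    if header_line is None:  # empty input: nothing to yield
        return
    header = header_line.rstrip("\n").split("\t")
    for line in lines:
        record = dict(zip(header, line.rstrip("\n").split("\t")))
        # Assumption: every column except the first is a list column.
        for key in header[1:]:
            value = record[key]
            record[key] = [] if value == "\\N" else value.split(",")
        yield record


def write_pages(lines, pattern="./title.crew.page{}.json", page_size=1000):
    """Write the records to numbered JSON files; return the number of files."""
    records = read_records(lines)
    page = 0
    while True:
        batch = list(islice(records, page_size))
        if not batch:
            return page
        page += 1
        with open(pattern.format(page), "w", encoding="utf-8") as out:
            json.dump(batch, out, indent=2)


# Possible usage:
#   with open("./title.crew.tsv", encoding="utf-8") as f:
#       write_pages(f)
```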

Since 1000 is a small number in the present context, here's a solution that does not use split; instead, it boils down to a single two-step pipeline.

The first part of the pipeline consists of an invocation of jq with the -c option (for converting the TSV into a stream of JSON arrays, one per chunk); this is described below.

The second part of the pipeline converts this stream of arrays into the desired set of files, one array per file; this part can easily be implemented using awk or a similar tool of your choice, and is not discussed further below.
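For completeness, the second stage could look roughly like the following, with Python standing in for awk. The function name `write_arrays` and the output file pattern are assumptions for this sketch; it expects one JSON array per input line, which is the shape that jq's -c option produces.

```python
import json


def write_arrays(lines, pattern="./title.crew.page{}.json"):
    """Write each non-empty line (one JSON array per line) to its own file.

    Files are numbered from 1. Returns the number of files written.
    """
    count = 0
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines in the stream
        count += 1
        with open(pattern.format(count), "w", encoding="utf-8") as out:
            # Re-serialize so that each file is pretty-printed.
            json.dump(json.loads(line), out, indent=2)
            out.write("\n")
    return count


# Possible usage as the second stage of the pipeline:
#   import sys
#   write_arrays(sys.stdin)
```

Reading the stream line by line means that only one chunk of 1000 records is parsed at any moment.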

program.jq

# Assemble the items in the (possibly empty) stream into a 
# (possibly empty) stream of arrays of length $n or less.
# $n can be any integer greater than 0;
# emit nothing if `stream` is empty.
def assemble(stream; $n):
  # box the input to detect eos
  foreach ((stream|[.]), null) as $item ({};
     (.array|length) as $l
     | if $item == null # eos
       then .emit = (0 < $l and $l < $n)
       else if $l == $n
            then .array = $item
            else .array += $item
            end
       | .emit = (.array|length == $n)
       end;

     if .emit then .array else empty end) ;


def stream:
  inputs
  | split("\t")
  | map_values(if . == "\\N" then "" else . end)
  | map(split(","))
  | { tconst: .[0][0],
      directors: .[1],
      writers:   .[2] };
      
assemble(stream; 1000)

Invocation:

To skip the header, we omit the -n command-line option that would be used if there were no header. Without -n, jq reads the header line as the initial input, which the program ignores, so inputs yields only the remaining lines:

jq -Rc -f program.jq input.tsv
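For readers less familiar with jq's foreach, the grouping performed by assemble can be paraphrased in Python. This is only a sketch of the same behaviour, not a translation of the jq code: full groups of n items are emitted as they fill up, a shorter final group is emitted at the end of the stream, and an empty stream produces nothing.

```python
def assemble(stream, n):
    """Group the items of `stream` into lists of length n.

    The last list may be shorter; an empty stream yields nothing.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == n:
            yield batch  # a full group is ready
            batch = []
    if batch:
        yield batch  # leftover items at the end of the stream


# Example: list(assemble(range(5), 2)) gives [[0, 1], [2, 3], [4]]
```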
