Convert a TSV file to multiple JSON arrays using jq

Question

I want to use the IMDb open data, but it is provided in TSV format, which is not very convenient.

https://datasets.imdbws.com/title.crew.tsv.gz

tconst  directors   writers
tt0000238   nm0349785   \N
tt0000239   nm0349785   \N
tt0000240   \N  \N
tt0000241   nm0349785   \N
tt0000242   nm0617588   nm0617588
tt0000243   nm0349785   \N
tt0000244   nm0349785   \N
tt0000245   \N  \N
tt0000246   nm0617588   \N
tt0000247   nm0002504,nm0005690,nm2156608   nm0000636,nm0002504
tt0000248   nm0808310   \N
tt0000249   nm0808310   \N
tt0000250   nm0005717   \N
tt0000251   nm0177862   \N

I want to convert the TSV data to JSON.

[
  {
    "tconst": "tt0000247",
    "directors": [
      "nm0005690",
      "nm0002504",
      "nm2156608"
    ],
    "writers": [
      "nm0000636",
      "nm0002504"
    ]
  },
  {
    "tconst": "tt0000248",
    "directors": [
      "nm0808310"
    ],
    "writers": [
      "\\N"
    ]
  }
]

I can do this with the following command:

jq -rRs 'split("\n")[1:-1] |
         map([split("\t")[]|split(",")] | {
                 "tconst":.[0][0],
                 "directors":.[1],
                 "writers":.[2]
             }
    )' ./title.crew.tsv > ./title.crew.json

However, the file is very large and I get out-of-memory errors.

1. How can I split this TSV file into several JSON files, each with 1000 records?

./title.crew.page1.json
./title.crew.page2.json
./title.crew.page3.json

2. How can I exclude empty fields? I would like an empty array instead.

"writers": [ "\\N" ] -> "writers": [ ]

UPD (the second question is solved):

jq -rRs 'split("\n")[1:-1] |
         map([split("\t")[]|split(",")] | 
         .[2] |= if .[0] == "\\N" then [] else . end | {
                 "tconst":.[0][0],
                 "directors":.[1],
                 "writers":.[2]
             }
    )' ./title.crew.tsv > ./title.crew.json

[
  {
    "tconst": "tt0000247",
    "directors": [
      "nm0005690",
      "nm0002504",
      "nm2156608"
    ],
    "writers": [
      "nm0000636",
      "nm0002504"
    ]
  },
  {
    "tconst": "tt0000248",
    "directors": [
      "nm0808310"
    ],
    "writers": []
  }
]

Thanks for your answers.

Answer 1

"they provide it in TSV format, which is not very convenient."

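Actually, jq and TSV go together very well, and processing a TSV file with jq certainly does not require the -s ("slurp") option, which is usually (but by no means always) best avoided.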

If your goal were simply to produce a stream of "tconst" objects, you could process the TSV file line by line; if you wanted to assemble that stream into a single array, then you could use jq with the -c option to produce a stream with one JSON object per line, and then assemble them using a tool such as awk (i.e., simply adding the opening and closing brackets and the separating commas).
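The assembly step does not have to be done in awk. As a minimal sketch in Python, assuming the input is the one-object-per-line output of jq -c, the hypothetical helper assemble_array below adds the brackets and commas while reading one line at a time, so the whole array is never held in memory:

```python
import io
import json


def assemble_array(lines, out):
    """Write the JSON texts in `lines` (one per line) to `out` as a single JSON array."""
    out.write("[")
    first = True
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        # The first element follows the bracket; later ones follow a comma.
        out.write("\n" if first else ",\n")
        out.write(line)
        first = False
    out.write("\n]\n")


# Small demonstration with two objects shaped like the ones in the question.
demo = io.StringIO()
assemble_array(
    ['{"tconst":"tt0000238","directors":["nm0349785"],"writers":[]}\n',
     '{"tconst":"tt0000240","directors":[],"writers":[]}\n'],
    demo,
)
records = json.loads(demo.getvalue())  # the result parses as one JSON array
print(len(records), records[0]["tconst"])
```

In a real pipeline the same function could be called as assemble_array(sys.stdin, sys.stdout) at the end of a jq -c command.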

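In your case, though, it would probably be simplest to split the TSV file first (e.g., using the unix/linux/mac split command; see below), and then process each file along the lines of your jq program. Since your chunks are quite small (1000 objects each), you could even use jq with the -s option, but it is just as easy to use inputs with the -n command-line option: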

jq -n '[inputs]'

Or you could combine these strategies: split into chunks, process each chunk using jq with the -c option to produce a stream, and assemble each such stream into a JSON array.

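Splitting

To split a file into chunks, see e.g.:

How can I split a large text file into smaller files with an equal number of lines?

Split text file into smaller multiple text files using command line

and many others.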
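For readers who would rather not depend on the split command, here is a rough Python equivalent. The function name split_tsv and the .partN file naming are made up for this sketch. Unlike plain split, it repeats the header line in every chunk; that is a design choice of this sketch, made so that a jq program which drops the first line (as the one in the question does) can be run unchanged on each chunk.

```python
import itertools
import os
import tempfile


def split_tsv(path, records_per_chunk=1000):
    """Split `path` into numbered chunk files, repeating the header line in each.

    Returns the list of chunk file names that were written.
    """
    base, ext = os.path.splitext(path)
    written = []
    # newline="" keeps the original line endings untouched.
    with open(path, encoding="utf-8", newline="") as src:
        header = src.readline()
        for page in itertools.count(1):
            rows = list(itertools.islice(src, records_per_chunk))
            if not rows:
                break
            name = f"{base}.part{page}{ext}"
            with open(name, "w", encoding="utf-8", newline="") as dst:
                dst.write(header)
                dst.writelines(rows)
            written.append(name)
    return written


# Demonstration on a throwaway file: a header plus three rows, two rows per chunk.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.tsv")
    with open(sample, "w", encoding="utf-8") as f:
        f.write("tconst\tdirectors\twriters\n"
                "tt0000238\tnm0349785\t\\N\n"
                "tt0000239\tnm0349785\t\\N\n"
                "tt0000240\t\\N\t\\N\n")
    chunks = split_tsv(sample, records_per_chunk=2)
    print([os.path.basename(c) for c in chunks])
```

Only one chunk of lines is in memory at any time, so the size of the full file does not matter.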

Answer 2

If Python is an option for you, how about using it? Python's data structures map closely onto JSON. Would you please try:

#!/usr/bin/python

import json

ary = []                                        # declare an empty array
with open('./title.crew.tsv') as f:
    header = f.readline().rstrip().split('\t')  # read the header line and split
    for line in f:                              # iterate the following lines
        body = line.rstrip().split('\t')
        d = {}                                  # empty dictionary
        for i in range(0, len(header)):
            if ',' in body[i]:                  # if the value contains ","
                b = body[i].split(',')          # then split the value on it
            else:
                b = body[i]
            if b == '\\N':                      # if the value is "\N"
                b = []                          # then replace with an empty array
            d[header[i]] = b                    # generate an object
        ary.append(d)                           # append the object to the array
print(json.dumps(ary, indent=2))

Output:

[
  {
    "directors": "nm0349785", 
    "tconst": "tt0000238", 
    "writers": []
  }, 
  {
    "directors": "nm0349785", 
    "tconst": "tt0000239", 
    "writers": []
  }, 
  {
    "directors": [], 
    "tconst": "tt0000240", 
    "writers": []
  }, 
<..SNIPPED..>

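Since Python is a general-purpose programming language, it offers a lot of flexibility in how the input is processed. It is also easy to split the result into multiple JSON files.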
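As one possible way to do that splitting, the sketch below writes 1000 records per file using the title.crew.pageN.json naming from the question. The function names parse_row and write_pages are hypothetical, and the code assumes exactly three columns as in title.crew.tsv. Note that, unlike the script above, it always stores directors and writers as lists, which matches the output asked for in the question.

```python
import itertools
import json
import os
import tempfile


def parse_row(line):
    """Turn one TSV data line into a dict; a `\\N` field becomes an empty list."""
    tconst, directors, writers = line.rstrip("\n").split("\t")

    def ids(field):
        return [] if field == "\\N" else field.split(",")

    return {"tconst": tconst, "directors": ids(directors), "writers": ids(writers)}


def write_pages(tsv_path, page_size=1000):
    """Write <base>.page1.json, <base>.page2.json, ... and return their names."""
    base = os.path.splitext(tsv_path)[0]
    names = []
    with open(tsv_path, encoding="utf-8") as f:
        next(f, None)                    # skip the header line
        records = map(parse_row, f)      # lazy: rows are parsed on demand
        for page in itertools.count(1):
            chunk = list(itertools.islice(records, page_size))
            if not chunk:
                break
            name = f"{base}.page{page}.json"
            with open(name, "w", encoding="utf-8") as out:
                json.dump(chunk, out, indent=2)
            names.append(name)
    return names


# Demonstration on a throwaway file with two rows per page.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "title.crew.tsv")
    with open(sample, "w", encoding="utf-8") as f:
        f.write("tconst\tdirectors\twriters\n"
                "tt0000247\tnm0002504,nm0005690,nm2156608\tnm0000636,nm0002504\n"
                "tt0000248\tnm0808310\t\\N\n"
                "tt0000249\tnm0808310\t\\N\n")
    pages = write_pages(sample, page_size=2)
    print([os.path.basename(p) for p in pages])
```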

Answer 3

Since 1000 is a small number in the present context, here is a solution that does not use split; instead, it boils down to a two-step pipeline.

The first part of the pipeline consists of a call to jq with the -c option (for converting the TSV into a stream of JSON arrays, one per chunk); this is described below.

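The second part of the pipeline converts the stream of arrays into the desired set of files, one array per file; this part can easily be implemented using awk or a similar tool of your choice, and is not covered by the jq program below.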

program.jq

# Assemble the items in the (possibly empty) stream into a 
# (possibly empty) stream of arrays of length $n or less.
# $n can be any integer greater than 0;
# emit nothing if `stream` is empty.
def assemble(stream; $n):
  # box the input to detect eos
  foreach ((stream|[.]), null) as $item ({};
     (.array|length) as $l
     | if $item == null # eos
       then .emit = (0 < $l and $l < $n)
       else if $l == $n
            then .array = $item
            else .array += $item
            end
       | .emit = (.array|length == $n)
       end;

     if .emit then .array else empty end) ;


def stream:
  inputs
  | split("\t")
  | map_values(if . == "\\N" then "" else . end)
  | map(split(","))
  | { tconst: .[0][0],
      directors: .[1],
      writers:   .[2] };
      
assemble(stream; 1000)

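Invocation:

To skip the header, we omit the -n command-line option, which would be used if there were no header. Without -n, jq consumes the first line (the header) as its initial input, so inputs only sees the data rows: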

jq -Rc -f program.jq input.tsv
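The second stage of the pipeline can be written in any tool. Purely as an illustration, here is a Python sketch that takes the lines emitted by the command above (one JSON array per line) and writes each array to its own file. The function name write_arrays and the prefix argument are assumptions of this sketch; the .pageN.json naming follows the question.

```python
import json
import os
import tempfile


def write_arrays(lines, prefix):
    """Write each non-empty line (one JSON array per line) to <prefix>.pageN.json.

    Returns the list of file names that were written.
    """
    names = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        name = f"{prefix}.page{len(names) + 1}.json"
        with open(name, "w", encoding="utf-8") as out:
            # Re-serialising validates the line and pretty-prints it.
            json.dump(json.loads(line), out, indent=2)
        names.append(name)
    return names


# Demonstration with two tiny arrays standing in for jq's output.
with tempfile.TemporaryDirectory() as tmp:
    fake_jq_output = [
        '[{"tconst":"tt0000247","directors":["nm0002504"],"writers":[]}]\n',
        '[{"tconst":"tt0000248","directors":["nm0808310"],"writers":[]}]\n',
    ]
    files = write_arrays(fake_jq_output, os.path.join(tmp, "title.crew"))
    print([os.path.basename(name) for name in files])
```

When used for real, the function would be fed sys.stdin, with the jq command piped into the script, so that only one 1000-record array is in memory at a time.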
