Using jq to convert a TSV file into multiple JSON arrays

Question

I want to use the IMDb open data, but they provide it in TSV format, which is not very convenient.

https://datasets.imdbws.com/title.crew.tsv.gz

tconst  directors   writers
tt0000238   nm0349785   \N
tt0000239   nm0349785   \N
tt0000240   \N  \N
tt0000241   nm0349785   \N
tt0000242   nm0617588   nm0617588
tt0000243   nm0349785   \N
tt0000244   nm0349785   \N
tt0000245   \N  \N
tt0000246   nm0617588   \N
tt0000247   nm0002504,nm0005690,nm2156608   nm0000636,nm0002504
tt0000248   nm0808310   \N
tt0000249   nm0808310   \N
tt0000250   nm0005717   \N
tt0000251   nm0177862   \N

I want to convert the TSV data to JSON.

[
  {
    "tconst": "tt0000247",
    "directors": [
      "nm0005690",
      "nm0002504",
      "nm2156608"
    ],
    "writers": [
      "nm0000636",
      "nm0002504"
    ]
  },
  {
    "tconst": "tt0000248",
    "directors": [
      "nm0808310"
    ],
    "writers": [
      "\\N"
    ]
  }
]

I can do this with the following command:

jq -rRs 'split("\n")[1:-1] |
         map([split("\t")[]|split(",")] | {
                 "tconst":.[0][0],
                 "directors":.[1],
                 "writers":.[2]
             }
    )' ./title.crew.tsv > ./title.crew.json

However, the file turns out to be very large, and I run into out-of-memory errors.

1. How can I split this TSV file into several JSON files, each containing 1000 records?

./title.crew.page1.json
./title.crew.page2.json
./title.crew.page3.json

2. How can I exclude empty fields? They should become an empty array.

"writers": [ "\\N" ] -> "writers": [ ]

UPD (the second question has been solved):

jq -rRs 'split("\n")[1:-1] |
         map([split("\t")[]|split(",")] | 
         .[2] |= if .[0] == "\\N" then [] else . end | {
                 "tconst":.[0][0],
                 "directors":.[1],
                 "writers":.[2]
             }
    )' ./title.crew.tsv > ./title.crew.json

[
  {
    "tconst": "tt0000247",
    "directors": [
      "nm0005690",
      "nm0002504",
      "nm2156608"
    ],
    "writers": [
      "nm0000636",
      "nm0002504"
    ]
  },
  {
    "tconst": "tt0000248",
    "directors": [
      "nm0808310"
    ],
    "writers": []
  }
]

Thank you for your answers.

Answer 1

"they provide it in TSV format, which is not very convenient."

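Actually, jq and TSV go together very well, and processing a TSV file with jq certainly does not require the -s ("slurp") option, which is usually (but by no means always) best avoided.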

If your goal were simply to produce a stream of "tconst" objects, you could process the TSV file line by line; if you wanted to assemble that stream into a single array, you could use jq with the -c option to produce a stream with one JSON object per line, and then assemble them using a tool such as awk (i.e., simply adding the opening and closing brackets and the separating commas).
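
To illustrate that assembly step, here is a minimal sketch in Python rather than awk. The function name wrap_json_lines is made up for this example, and it assumes every non-empty input line is already a complete JSON value, as produced by jq -c. It writes the brackets and commas as it goes, so the whole stream is never held in memory.

```python
import io
import json


def wrap_json_lines(lines, out):
    """Write the JSON values found one per line in `lines` as one JSON array."""
    out.write("[")
    first = True
    for line in lines:
        line = line.strip()
        if not line:              # ignore blank lines
            continue
        if not first:
            out.write(",")        # separating comma between elements
        out.write("\n" + line)
        first = False
    out.write("\n]\n")


# Small in-memory demonstration (hypothetical sample data).
demo = io.StringIO('{"tconst":"tt0000238"}\n{"tconst":"tt0000239"}\n')
buf = io.StringIO()
wrap_json_lines(demo, buf)
print(buf.getvalue())
```

In a real pipeline you would pass sys.stdin and sys.stdout instead of the StringIO objects and pipe the output of jq -c into the script.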

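In your case, however, it would probably be simplest to split the TSV file first (for example, using the unix/linux/mac split command; see below), and then process each file along the lines of your jq program. Since your chunks are quite small (1000 objects each), you could even use jq with the -s option, but it is just as easy to use inputs with the -n command-line option: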

jq -n '[inputs]'

Or you could combine these strategies: split the file into chunks, process each chunk using jq with the -c option to produce a stream, and assemble each such stream into a JSON array.

Splitting

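To split a file into chunks, see for example:

How can I split a large text file into smaller files with an equal number of lines?

Split text file into smaller multiple text file using command line

and many others.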
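
If the split command is not available, the same chunking can be sketched in Python. This is only an illustration: split_tsv and the chunkN.tsv naming are invented for the example, and it assumes you want the header line repeated at the top of every chunk so that a jq program which drops the first line still works on each piece.

```python
import itertools
import os
import tempfile


def split_tsv(path, lines_per_chunk=1000, prefix="chunk"):
    """Split `path` into files of at most `lines_per_chunk` data lines each.

    The header line is copied to the top of every chunk. The chunks are
    written next to the input file and their paths are returned.
    """
    out_paths = []
    directory = os.path.dirname(os.path.abspath(path))
    with open(path, encoding="utf-8") as src:
        header = src.readline()
        for number in itertools.count(1):
            # Read the next block of lines without loading the whole file.
            chunk = list(itertools.islice(src, lines_per_chunk))
            if not chunk:
                break
            out_path = os.path.join(directory, f"{prefix}{number}.tsv")
            with open(out_path, "w", encoding="utf-8") as dst:
                dst.write(header)
                dst.writelines(chunk)
            out_paths.append(out_path)
    return out_paths


# Demonstration on a throwaway file with five data lines (made-up ids).
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "title.crew.tsv")
    with open(sample, "w", encoding="utf-8") as f:
        f.write("tconst\tdirectors\twriters\n")
        for i in range(5):
            f.write(f"tt{i:07d}\t\\N\t\\N\n")
    chunks = split_tsv(sample, lines_per_chunk=2)
    chunk_names = [os.path.basename(p) for p in chunks]
    with open(chunks[-1], encoding="utf-8") as f:
        last_chunk = f.read()
print(chunk_names)
```

Each resulting chunk can then be fed to jq separately.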

Answer 2

If Python is an option for you, how about using it? Python's data structures map very closely onto JSON. Would you please try:

#!/usr/bin/python

import json

ary = []                                        # declare an empty array
with open('./title.crew.tsv') as f:
    header = f.readline().rstrip().split('\t')  # read the header line and split
    for line in f:                              # iterate the following lines
        body = line.rstrip().split('\t')
        d = {}                                  # empty dictionary
        for i in range(0, len(header)):
            if ',' in body[i]:                  # if the value contains ","
                b = body[i].split(',')          # then split the value on it
            else:
                b = body[i]
            if b == '\\N':                      # if the value is "\N"
                b = []                          # then replace with an empty array
            d[header[i]] = b                    # generate an object
        ary.append(d)                           # append the object to the array
print(json.dumps(ary, indent=2))

Output:

[
  {
    "directors": "nm0349785", 
    "tconst": "tt0000238", 
    "writers": []
  }, 
  {
    "directors": "nm0349785", 
    "tconst": "tt0000239", 
    "writers": []
  }, 
  {
    "directors": [], 
    "tconst": "tt0000240", 
    "writers": []
  }, 
<..SNIPPED..>

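Since Python is a general-purpose programming language, it is very flexible in how it handles the input. It is also easy to split the result into multiple JSON files.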
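
As a rough sketch of that splitting, the following reads the TSV lazily and writes 1000 records per file. The helper names (parse_field, read_records, write_pages) are invented for this example. Unlike the script above, it always emits lists for directors and writers, which is an assumption based on the output requested in the question.

```python
import itertools
import json
import os
import tempfile


def parse_field(value):
    """Turn a comma-separated TSV field into a list; the \\N marker becomes []."""
    return [] if value == "\\N" else value.split(",")


def read_records(lines):
    """Yield one dict per data line, skipping the header line."""
    lines = iter(lines)
    next(lines, None)  # drop the header
    for line in lines:
        tconst, directors, writers = line.rstrip("\n").split("\t")
        yield {
            "tconst": tconst,
            "directors": parse_field(directors),
            "writers": parse_field(writers),
        }


def write_pages(records, directory, page_size=1000):
    """Write records to title.crew.page1.json, page2.json, ...; return the paths."""
    paths = []
    for page in itertools.count(1):
        batch = list(itertools.islice(records, page_size))
        if not batch:
            break
        path = os.path.join(directory, f"title.crew.page{page}.json")
        with open(path, "w", encoding="utf-8") as out:
            json.dump(batch, out, indent=2)
        paths.append(path)
    return paths


# Demonstration with three rows from the question and a page size of 2.
sample = [
    "tconst\tdirectors\twriters\n",
    "tt0000247\tnm0002504,nm0005690,nm2156608\tnm0000636,nm0002504\n",
    "tt0000248\tnm0808310\t\\N\n",
    "tt0000249\tnm0808310\t\\N\n",
]
with tempfile.TemporaryDirectory() as tmp:
    pages = write_pages(read_records(sample), tmp, page_size=2)
    with open(pages[0], encoding="utf-8") as f:
        first_page = json.load(f)
    page_names = [os.path.basename(p) for p in pages]
print(page_names)
```

For the real data you would pass an open file object for title.crew.tsv to read_records instead of the sample list, so only one page of records is in memory at a time.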

Answer 3

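Since 1000 is a small number in the present context, here is a solution that does not use split; instead, it boils down to a two-step pipeline.

The first part of the pipeline consists of invoking jq with the -c option (to convert the TSV into a stream of JSON arrays, one per chunk); this is described below.

The second part of the pipeline converts the stream of arrays into the desired set of files, one array per file; this part of the pipeline can easily be implemented using awk or a similar tool of your choice, and is not discussed further here.

program.jq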

# Assemble the items in the (possibly empty) stream into a 
# (possibly empty) stream of arrays of length $n or less.
# $n can be any integer greater than 0;
# emit nothing if `stream` is empty.
def assemble(stream; $n):
  # box the input to detect eos
  foreach ((stream|[.]), null) as $item ({};
     (.array|length) as $l
     | if $item == null # eos
       then .emit = (0 < $l and $l < $n)
       else if $l == $n
            then .array = $item
            else .array += $item
            end
       | .emit = (.array|length == $n)
       end;

     if .emit then .array else empty end) ;


def stream:
  inputs
  | split("\t")
  | map_values(if . == "\\N" then "" else . end)
  | map(split(","))
  | { tconst: .[0][0],
      directors: .[1],
      writers:   .[2] };
      
assemble(stream; 1000)

Invocation:

To skip the header, we omit the -n command-line option, which would be used if there were no header:

jq -Rc -f program.jq input.tsv
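
For the second stage of the pipeline, a Python stand-in for awk might look like the sketch below. It is an assumption-laden example: write_array_files and the file name template are invented here (the template follows the names asked for in the question), and the input is expected to hold exactly one JSON array per line, as jq -c emits.

```python
import io
import json
import os
import tempfile


def write_array_files(lines, directory, template="title.crew.page{}.json"):
    """Write each non-empty line (one JSON array per line) to its own file."""
    paths = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        path = os.path.join(directory, template.format(len(paths) + 1))
        with open(path, "w", encoding="utf-8") as out:
            out.write(line + "\n")
        paths.append(path)
    return paths


# Demonstration with two tiny arrays standing in for the jq output.
stream = io.StringIO(
    '[{"tconst":"tt0000247"},{"tconst":"tt0000248"}]\n'
    '[{"tconst":"tt0000249"}]\n'
)
with tempfile.TemporaryDirectory() as tmp:
    written = write_array_files(stream, tmp)
    names = [os.path.basename(p) for p in written]
    with open(written[1], encoding="utf-8") as f:
        second = json.load(f)
print(names)
```

In practice you would pass sys.stdin as the lines argument and pipe the jq command above into the script.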
