Splitting a large JSON file with jq and awk

Question

I have a large file called

Metadata_01.json

It consists of blocks that follow this structure:

[
 {
  "Participant_id": "P04_00001",
  "no_of_people": "Multiple",
  "apparent_gender": "F",
  "geographic_location": "AUS",
  "ethnicity": "Caucasian",
  "capture_device_used": "iOS 14",
  "camera_orientation": "Portrait",
  "camera_position": "Side View",
  "indoor_outdoor_env": "Indoors",
  "lighting_condition": "Bright",
  "Occluded": 1,
  "category": "Two Person",
  "camera_movement": "Still",
  "action": "No action",
  "indoor_outdoor_in_moving_car_or_train": "Indoor",
  "daytime_nighttime": "Nighttime"
 },
 {
  "Participant_id": "P04_00002",
  "no_of_people": "Single",
  "apparent_gender": "M",
  "geographic_location": "AUS",
  "ethnicity": "Caucasian",
  "capture_device_used": "iOS 14",
  "camera_orientation": "Portrait",
  "camera_position": "Frontal View",
  "indoor_outdoor_env": "Outdoors",
  "lighting_condition": "Bright",
  "Occluded": "None",
  "category": "Animals",
  "camera_movement": "Still",
  "action": "Small action",
  "indoor_outdoor_in_moving_car_or_train": "Outdoor",
  "daytime_nighttime": "Daytime"
 },

And so on... thousands of them.

I am using the following command:

jq -cr '.[]' Metadata_01.json | awk '{print > (NR ".json")}'

And it's kinda doing the expected work.

From the large file structured as shown above, I am getting tons of files named by line number (1.json, 2.json, and so on), each holding a single object on one line.

Instead of those results, I need each JSON file to be named after its "Participant_id" (e.g. P04_00002.json), and I want each file to keep the pretty-printed JSON structure, like this:

{
  "Participant_id": "P04_00002",
  "no_of_people": "Single",
  "apparent_gender": "M",
  "geographic_location": "AUS",
  "ethnicity": "Caucasian",
  "capture_device_used": "iOS 14",
  "camera_orientation": "Portrait",
  "camera_position": "Frontal View",
  "indoor_outdoor_env": "Outdoors",
  "lighting_condition": "Bright",
  "Occluded": "None",
  "category": "Animals",
  "camera_movement": "Still",
  "action": "Small action",
  "indoor_outdoor_in_moving_car_or_train": "Outdoor",
  "daytime_nighttime": "Daytime"
}

What adjustments should I make to the command above to achieve this? Or maybe there's an easier way to do this? Thank you!

Answer 1

What adjustments should I make...?

I'd go with:

jq -cr '.[] | (.Participant_id, .)' Metadata_01.json | awk '
  NR%2==1 {fn="id." $0 ".json"; next} {print >> fn; close(fn); }
'
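To see why this works: jq emits two lines per object, first the raw Participant_id and then the compact object, and the awk program treats odd-numbered lines as file names and even-numbered lines as file contents. The sketch below feeds the same awk program two hand-written pairs in place of jq's output. The sample records are trimmed, and the scratch directory is only there to keep the demo tidy.

```shell
# Work in a scratch directory so the demo leaves no stray files behind.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-in for: jq -cr '.[] | (.Participant_id, .)' Metadata_01.json
printf '%s\n' \
  'P04_00001' '{"Participant_id":"P04_00001","category":"Two Person"}' \
  'P04_00002' '{"Participant_id":"P04_00002","category":"Animals"}' |
awk '
  NR%2==1 { fn = "id." $0 ".json"; next }   # odd line: remember the file name
          { print >> fn; close(fn) }        # even line: append the object, release the handle
'

ls    # lists id.P04_00001.json and id.P04_00002.json
```

The close(fn) call matters with thousands of ids: without it awk keeps every output file open and can hit the per-process limit on open files. Because the program appends with >>, rerunning it in the same directory adds a second copy of each object to the existing files.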

Alternatively, if you can navigate your way around any issues that might arise when escaping quotation marks, you could get awk to call jq:

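jq -cr '.[] | (.Participant_id, .)' Metadata_01.json | awk -v q=$'\'' '
  NR%2==1 {fn = "id." $0 ".json"; next}
  # system() runs the command with sh, which may lack "<<<", so pipe via printf.
  # Caveat: an object containing a single quote will break the quoting.
  {  system( ("printf \"%s\\n\" " q $0 q " | jq . >> \"" fn "\"") ) }
'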
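If the quoting concerns make the awk-calls-jq route unattractive, one alternative is to let the shell do the looping, so that each object reaches jq on standard input and is never pasted into a command line. The following is a sketch under assumptions, not a drop-in replacement: it assumes every object has a string Participant_id that is safe to use as a file name, and the helper name split_pretty is made up for illustration. It writes files named like P04_00002.json (without the id. prefix), as asked in the question.

```shell
# split_pretty is a hypothetical helper name, not part of jq.
# Usage: split_pretty INPUT.json [OUTPUT_DIR]
split_pretty() {
  local input=$1 outdir=${2:-.} obj id
  # jq -c emits one compact object per line; read -r keeps backslashes intact.
  jq -c '.[]' "$input" | while IFS= read -r obj; do
    # Extract the id for the file name (assumed to be file-name safe).
    id=$(printf '%s\n' "$obj" | jq -r '.Participant_id')
    # Re-emit the object pretty-printed, one file per participant.
    printf '%s\n' "$obj" | jq . > "$outdir/$id.json"
  done
}

# Example call:
# split_pretty Metadata_01.json .
```

This starts two jq processes per object, so it trades speed for simplicity; for a few thousand records that is usually tolerable.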

"Big Data"

Of course, if the input file is too large or too slow for jq empty, then you will want to consider alternatives, e.g. jq's --stream option, jstream, or my own jm. For example, if you want the JSON to be pretty-printed in each file:

while read -r json
do
   fn=$(jq -r .Participant_id <<< "$json")
   <<< "$json" jq . > "id.$fn.json"
done < <(jm Metadata_01.json)
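As a sketch of the --stream route mentioned above: a widely used jq idiom, fromstream(1|truncate_stream(inputs)), emits the elements of a top-level array one at a time instead of loading the whole array first. Combined with the awk stage from the first command, that might look as follows. The function name split_streaming is hypothetical, and the output is compact (one line per file), not pretty-printed.

```shell
# split_streaming is a made-up name for illustration.
# Usage: split_streaming INPUT.json
split_streaming() {
  # -n: do not read input implicitly, so `inputs` sees every stream event.
  # --stream: parse incrementally into [path, leaf] events.
  # 1 | truncate_stream(...): drop the array index from each path, so that
  # fromstream reassembles the individual top-level objects.
  jq -cnr --stream '
    fromstream(1 | truncate_stream(inputs))
    | (.Participant_id, .)
  ' "$1" |
  awk '
    NR%2==1 { fn = "id." $0 ".json"; next }   # odd line: file name
            { print >> fn; close(fn) }        # even line: compact object
  '
}

# Example call:
# split_streaming Metadata_01.json
```

The streaming parser keeps memory use low at the cost of speed, so it is mainly worth reaching for when the plain '.[]' filter does not fit in memory.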

Answer 2

I would recommend using PowerShell, since working with objects tends to be easier overall. Fortunately, PowerShell has a ConvertFrom-Json cmdlet that converts the JSON text into a PowerShell object, letting you reference its properties via dot notation (.Participant_id). Then you just have to convert each element back to JSON and write it out. Here I use New-Item to create the file with the output, but piping to Out-File would work as well.

$json = Get-Content -Path '.\Metadata_01.json' -Raw | ConvertFrom-Json 
foreach ($json_object in $json)
{
    New-Item -Path ".\Desktop\" -Name "$($json_object.Participant_id).json" -Value (ConvertTo-Json -InputObject $json_object) -ItemType 'File' -Force
}

The issue you will probably run into is insufficient memory: given the size of that file, this example first loads everything into a variable. There are ways around that, but this is for demonstration purposes.

2 answers

Answer 1: 2, 2022-11-23 21:30:05

Answer 2: 1, accepted, 2022-11-23 19:16:33
