Unnesting non-array JSON in BigQuery

Question

I have data arriving as separate events in JSON form, resembling:

{
"id":1234,
"data":{
    "packet1":{"name":"packet1", "value":1},
    "packet2":{"name":"packet2", "value":2}
     }
}

I'd like to unnest the data so that I essentially have one row per 'packet' (there may be any number of packets).

id	name	value
1234	packet1	1
1234	packet2	2

I've looked at using the UNNEST function with the various JSON functions, but it seems limited to working with arrays. I have not been able to find a way to treat the 'data' field as if it were an array.

At the moment, I cannot change these events to store packets in an array, and ideally the unnesting should happen within BigQuery.

Answer 1

1. Regular expressions

There may be other ways, but you can consider the approach below, which uses regular expressions.

WITH sample_table AS (
  SELECT """{
    "id":1234,
    "data":{
      "packet1":{"name":"packet1", "value":1},
      "packet2":{"name":"packet2", "value":2}
     }
  }""" AS events
)
SELECT JSON_VALUE(events, '$.id') AS id, name, value
  FROM sample_table,
       UNNEST (REGEXP_EXTRACT_ALL(events, r'"name":"(\w+)"')) name WITH offset
  JOIN UNNEST (REGEXP_EXTRACT_ALL(events, r'"value":([0-9.]+)')) value WITH offset
 USING (offset);

2. JavaScript UDF

Alternatively, you might consider the approach below, which uses a JavaScript UDF.

CREATE TEMP FUNCTION extract_pair(json STRING)
RETURNS ARRAY<STRUCT<name STRING, value STRING>>
LANGUAGE js AS """
  // Collect the value of every top-level key (one object per packet).
  const result = [];
  for (const [key, value] of Object.entries(JSON.parse(json))) {
    result.push(value);
  }
  return result;
""";

WITH sample_table AS (
  SELECT """{
    "id":1234,
    "data":{
      "packet1":{"name":"packet1", "value":1},
      "packet2":{"name":"packet2", "value":2}
     }
  }""" AS events
)
SELECT JSON_VALUE(events, '$.id') AS id, obj.*
  FROM sample_table, UNNEST(extract_pair(JSON_QUERY(events, '$.data'))) obj;
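To see what the UDF body does before running it in BigQuery, the same logic can be tried in any JavaScript runtime. This is only a sketch: `extractPair` is a hypothetical standalone name mirroring `extract_pair`, and the input string is assumed to be what `JSON_QUERY(events, '$.data')` passes in.

```javascript
// Standalone version of the UDF body, runnable outside BigQuery.
// extractPair is a hypothetical name mirroring extract_pair above.
function extractPair(json) {
  // Object.values returns the value of every top-level key,
  // i.e. one object per packet, regardless of the key names.
  return Object.values(JSON.parse(json));
}

// Assumed input: the 'data' object from the sample event, as a string.
const data =
  '{"packet1":{"name":"packet1","value":1},' +
  '"packet2":{"name":"packet2","value":2}}';

const rows = extractPair(data);
console.log(rows);
```

Because the function only looks at the values of the top-level keys, it does not depend on how many packets there are or what they are called.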

Answer 2

@Jaytiger's suggestion of unnesting a regex extract led me to the following solution. The example I showed was simplified: there are more fields within the packets. To avoid requiring a separate regex for each field name, I used a regex to split out each individual packet, and then read the resulting JSON.

This iteration doesn't do everything in one step, but it works when looking at just the packets.

with sample_data
AS (SELECT """{"packet1":{"name":"packet1", "value":1},
               "packet2":{"name":"packet2", "value":2}}""" as packets)

select
    json_value('{'||packet||'}', "$.name") name,
    json_value('{'||packet||'}', "$.value") value
from sample_data,
unnest(regexp_extract_all(packets, r'\:{(.*?)\}')) packet
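If the packets carry many fields, another option that avoids a per-field regex is a UDF body that returns each packet as its own JSON string. The sketch below is an assumption, not part of either answer: `packetsAsJson` is a hypothetical name and `unit` is a made-up extra field. In BigQuery the body would sit inside a `CREATE TEMP FUNCTION ... RETURNS ARRAY<STRING> LANGUAGE js` definition, and each unnested element could then be read with `JSON_VALUE`.

```javascript
// Hypothetical helper: re-serialize every packet as a separate JSON string,
// so any number of fields can be read later without listing them here.
function packetsAsJson(json) {
  return Object.values(JSON.parse(json)).map((packet) => JSON.stringify(packet));
}

// Assumed input with an extra, made-up field ("unit") in each packet.
const packets = packetsAsJson(
  '{"packet1":{"name":"packet1","value":1,"unit":"ms"},' +
  '"packet2":{"name":"packet2","value":2,"unit":"ms"}}'
);
console.log(packets);
```

Unlike the regex split, this approach parses the JSON, so it should also cope with packets that contain nested objects.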
