
Unnest non-array JSON in BigQuery

I have data arriving as separate events in JSON form, resembling:

{
"id":1234,
"data":{
    "packet1":{"name":"packet1", "value":1},
    "packet2":{"name":"packet2", "value":2}
     }
}

I'd like to unnest the data so that there is essentially one row per 'packet' (there may be any number of packets).

id    name     value
1234  packet1  1
1234  packet2  2

I've looked at using the UNNEST function with the various JSON functions, but it seems limited to working with arrays. I have not been able to find a way to treat the 'data' field as if it were an array.

At the moment, I cannot change these events to store packets in an array, and ideally the unnesting should happen within BigQuery.

1. Regular expressions

There might be other ways, but you can consider the approach below using regular expressions.

WITH sample_table AS (
  SELECT """{
    "id":1234,
    "data":{
      "packet1":{"name":"packet1", "value":1},
      "packet2":{"name":"packet2", "value":2}
     }
  }""" AS events
)
SELECT JSON_VALUE(events, '$.id') AS id, name, value
  FROM sample_table,
       UNNEST (REGEXP_EXTRACT_ALL(events, r'"name":"(\w+)"')) name WITH offset
  JOIN UNNEST (REGEXP_EXTRACT_ALL(events, r'"value":([0-9.]+)')) value WITH offset
 USING (offset);
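
To make the offset join easier to follow, here is a rough sketch of the same pairing logic in plain JavaScript, runnable outside BigQuery (for example in Node.js). The helper name extractAll is made up for illustration and stands in for REGEXP_EXTRACT_ALL; it is not a BigQuery or JavaScript built-in. Note that the id comes out as a number here, whereas JSON_VALUE returns a string.

```javascript
// Standalone illustration of pairing two regex match lists by position,
// mirroring UNNEST ... WITH OFFSET joined USING (offset).
const events = `{
  "id":1234,
  "data":{
    "packet1":{"name":"packet1", "value":1},
    "packet2":{"name":"packet2", "value":2}
  }
}`;

// Hypothetical helper: collect the first capture group of every match,
// similar in spirit to REGEXP_EXTRACT_ALL.
function extractAll(text, pattern) {
  return [...text.matchAll(pattern)].map((m) => m[1]);
}

const names = extractAll(events, /"name":"(\w+)"/g);
const values = extractAll(events, /"value":([0-9.]+)/g);
const id = JSON.parse(events).id;

// Inner join on offset: keep only positions present in both lists.
const rows = names
  .slice(0, Math.min(names.length, values.length))
  .map((name, offset) => ({ id, name, value: values[offset] }));

console.log(rows);
```

The pairing relies purely on position, so it assumes every packet carries both a name and a value, in the same order. If one packet lacked a value, the later offsets would line up with the wrong packet.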


2. JavaScript UDF

Or, you might consider the JavaScript UDF below.

CREATE TEMP FUNCTION extract_pair(json STRING)
RETURNS ARRAY<STRUCT<name STRING, value STRING>>
LANGUAGE js AS """
  // Collect each packet object, discarding its dynamic key ("packet1", ...).
  const result = [];
  for (const [key, value] of Object.entries(JSON.parse(json))) {
    result.push(value);
  }
  return result;
""";

WITH sample_table AS (
  SELECT """{
    "id":1234,
    "data":{
      "packet1":{"name":"packet1", "value":1},
      "packet2":{"name":"packet2", "value":2}
     }
  }""" AS events
)
SELECT JSON_VALUE(events, '$.id') AS id, obj.*
  FROM sample_table, UNNEST(extract_pair(JSON_QUERY(events, '$.data'))) obj;
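
If it helps to see what the UDF body does on its own, here is a minimal sketch of the same idea in plain JavaScript (runnable in Node.js, for instance). The name extractPair is only an illustrative stand-in for the extract_pair UDF above, and Object.values is used as a shorter equivalent of the loop.

```javascript
// Same idea as the UDF body, written as an ordinary function.
// extractPair is a hypothetical name mirroring the extract_pair UDF.
function extractPair(json) {
  // Object.values drops the dynamic keys ("packet1", "packet2", ...)
  // and keeps only the packet objects, in insertion order.
  return Object.values(JSON.parse(json));
}

// This string plays the role of JSON_QUERY(events, '$.data').
const data =
  '{"packet1":{"name":"packet1","value":1},"packet2":{"name":"packet2","value":2}}';

const packets = extractPair(data);
console.log(packets);
```

Because the whole packet object is returned, any extra fields inside a packet are still present in JavaScript; on the BigQuery side only the fields declared in the RETURNS struct are exposed as columns.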

@Jaytiger's suggestion of unnesting a regex extract led me to the following solution. The example I showed was simplified: there are more fields within the packets. To avoid requiring a separate regex for each field name, I used a regex to split out each individual packet, and then read the JSON.

This iteration doesn't do everything in one step, but it works when looking at just the packets.

with sample_data
AS (SELECT """{"packet1":{"name":"packet1", "value":1},
               "packet2":{"name":"packet2", "value":2}}""" as packets)

select
    json_value('{'||packet||'}', "$.name") name,
    json_value('{'||packet||'}', "$.value") value
from sample_data,
unnest(regexp_extract_all(packets, r'\:{(.*?)\}')) packet
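
For anyone wanting to try the split-then-parse idea outside BigQuery, here is a rough JavaScript sketch of it. The function name splitPackets and the nested sample are made up for illustration, and the JavaScript regex is only an approximation of the RE2 pattern used in the query.

```javascript
const packetsJson = `{"packet1":{"name":"packet1", "value":1},
               "packet2":{"name":"packet2", "value":2}}`;

// Hypothetical helper: capture the text between ":{" and the next "}",
// approximating the pattern used in the query above.
function splitPackets(text) {
  return [...text.matchAll(/:\{(.*?)\}/g)].map((m) => m[1]);
}

// Re-wrap each captured body in braces so it parses as a JSON object.
const parsed = splitPackets(packetsJson).map((body) => JSON.parse(`{${body}}`));
console.log(parsed);

// Limitation: a nested object inside a packet ends the lazy match early,
// so the captured fragment is no longer a complete object.
const nested = '{"p1":{"name":"p1", "meta":{"a":1}, "value":1}}';
const fragments = splitPackets(nested);
console.log(fragments);
```

This works as long as the packets are flat. Since the lazy match stops at the first closing brace, a packet that contains a nested object would be cut short, and in that case the UDF approach is probably the safer choice.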
