Google BigQuery: UNNEST where each different key becomes a column

I have this table with several columns containing dictionaries: payloadKV, metaKV, etc.

[Image: sample table]

I need to unnest each dict and pivot the result so that every key becomes a column and its value lands in the corresponding cell of that row and column. The desired output for the screenshot above would be:

+---------------------+-------+---------------+----------------+---------------------+-----+
|   ingestTimestamp   |  ...  | metadata.Area |  metadata.Cell | metadata.Department | ... |
+---------------------+-------+---------------+----------------+---------------------+-----+
| 2022-03-23 02:34:41 |   ... | MC            |           0010 |                0752 | ... |
| ...                 |   ... | ...           |            ... |                 ... | ... |
+---------------------+-------+---------------+----------------+---------------------+-----+

Each of these dictionaries has an arbitrary number of key/value pairs, possibly hundreds, and I cannot know the key names beforehand, so I need some generic expression to extract them.

I have seen examples of how to extract the desired keys by hardcoding them, but I cannot seem to find a generic way to do it.

This can be done relatively easily with BigQuery PIVOT together with EXECUTE IMMEDIATE, as in the example below:

create temp table temp as (
  select t.* except(payloadKV, metaKV), replace(key, '.', '_') key, value
  from your_table t, unnest(payloadKV)
  union all
  select t.* except(payloadKV, metaKV), replace(key, '.', '_') key, value
  from your_table t, unnest(metaKV)
);

execute immediate (select '''
  select * from temp pivot (any_value(value) for key in (''' || 
  (select string_agg("'" || key || "'", ',' order by key) from (select distinct key from temp))
  || '''))
''')           

If applied to sample data similar to that in your question:

select '2022-03-23 02:34:41' ingestTimestamp, [
    struct('payload.Area' as key, 'MC1' as value), ('payload.Cell', '00101'), ('payload.Department', '07521')] payloadKV, [
    struct('metadata.Area' as key, 'MC' as value),('metadata.Cell', '0010'), ('metadata.Department', '0752')] metaKV
union all
select '2022-03-24 02:34:41' ingestTimestamp, [
    struct('payload.Area' as key, 'MC2' as value), ('payload.Cell', '00102'), ('payload.Department', '07522')] payloadKV, [
    struct('metadata.Area' as key, 'MC3' as value),('metadata.Cell', '00103'), ('metadata.Department', '07523')] metaKV

the output is:

[Image: query output]
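To make the shape of the result easier to picture, here is a minimal sketch that emulates the same unnest, union and pivot steps in plain Python on the sample rows above. It is not BigQuery code; the function names `to_long` and `pivot` are made up for illustration, and where a key repeats within a row it keeps the last value, whereas `any_value` may return any of them.

```python
from collections import defaultdict

# Sample rows copied from the query above; each KV column is a list of (key, value) pairs.
rows = [
    {
        "ingestTimestamp": "2022-03-23 02:34:41",
        "payloadKV": [("payload.Area", "MC1"), ("payload.Cell", "00101"), ("payload.Department", "07521")],
        "metaKV": [("metadata.Area", "MC"), ("metadata.Cell", "0010"), ("metadata.Department", "0752")],
    },
    {
        "ingestTimestamp": "2022-03-24 02:34:41",
        "payloadKV": [("payload.Area", "MC2"), ("payload.Cell", "00102"), ("payload.Department", "07522")],
        "metaKV": [("metadata.Area", "MC3"), ("metadata.Cell", "00103"), ("metadata.Department", "07523")],
    },
]
KV_COLUMNS = ("payloadKV", "metaKV")


def to_long(rows):
    """Mimic the temp table: one record per (other columns, key, value)."""
    long_rows = []
    for row in rows:
        rest = {c: v for c, v in row.items() if c not in KV_COLUMNS}
        for kv_col in KV_COLUMNS:
            for key, value in row[kv_col]:
                # Same replace(key, '.', '_') as in the SQL.
                long_rows.append({**rest, "key": key.replace(".", "_"), "value": value})
    return long_rows


def pivot(long_rows):
    """Mimic PIVOT: group by the remaining columns, one output column per key."""
    keys = sorted({r["key"] for r in long_rows})
    grouped = defaultdict(dict)
    for r in long_rows:
        group = tuple((c, v) for c, v in r.items() if c not in ("key", "value"))
        grouped[group][r["key"]] = r["value"]
    # Keys missing from a group become None, like NULL cells in the pivot.
    return [{**dict(group), **{k: vals.get(k) for k in keys}} for group, vals in grouped.items()]


wide = pivot(to_long(rows))
for row in wide:
    print(row)
```

Each printed row holds ingestTimestamp followed by one column per distinct key: metadata_Area, metadata_Cell, metadata_Department, payload_Area, payload_Cell and payload_Department.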

It's kind of challenging to write such a script. It's possible, although you might need to do a lot of coding and trial and error. If the table is really extensive, you might want to use Python (as suggested by martin weitzmann) to retrieve the column information and generate the script that gets your data.

You can also use BigQuery alone, although you might find it difficult to implement on large tables. Here is my approach; try it if it fits your scenario:

  1. Create our test table with some records
create or replace table `projectid.dataset.table`
(
    id INT64,
    ingestTimeStamp date,
    payloadKV STRUCT<id INT64,json STRING>,
    metaKV STRUCT<id INT64,description STRING>
);

insert into `projectid.dataset.table`(id,ingestTimeStamp,payloadKV,metaKV)values(1,"2022-03-03",(100,'{"kardexid":11,"desc":"d1"}'),(100,"a desc1"));
insert into `projectid.dataset.table`(id,ingestTimeStamp,payloadKV,metaKV)values(2,"2022-03-04",(101,'{"kardexid":22,"desc":"d2"}'),(110,"a desc2"));
insert into `projectid.dataset.table`(id,ingestTimeStamp,payloadKV,metaKV)values(3,"2022-03-05",(102,'{"kardexid":34,"desc":"d3"}'),(120,"a desc3"));
insert into `projectid.dataset.table`(id,ingestTimeStamp,payloadKV,metaKV)values(4,"2022-03-06",(103,'{"kardexid":53,"desc":"d4"}'),(130,"a desc4"));
  2. Let's declare our working variables
declare working_table string;
declare loop_col String;
declare query String;
declare single_col_names String;
declare nested_col_array ARRAY<STRING>;
declare nested_col_string String DEFAULT "";
  3. Set our working variables
# Set columns to work
set working_table = "table";

set single_col_names = (SELECT STRING_AGG(column_name) FROM `projectid.dataset.INFORMATION_SCHEMA.COLUMNS`
where table_name = working_table and data_type not like 'STRUCT%');

set nested_col_array = (SELECT ARRAY_AGG(column_name) FROM `projectid.dataset.INFORMATION_SCHEMA.COLUMNS`
where table_name = working_table and data_type like 'STRUCT%');
  4. Get our nested columns
# Retrieve nested columns
FOR record IN
  (SELECT * FROM unnest(nested_col_array) as col_names)
DO   
    SET loop_col = (SELECT CONCAT(column_name,
                                    ".",
                                    REPLACE(ARRAY_TO_STRING(REGEXP_EXTRACT_ALL(data_type,r'[STRUCT<,INT64 STRING ]+(.+?) '),",")
                                            ,",",
                                            CONCAT(",",column_name,".")))
    FROM `projectid.dataset.INFORMATION_SCHEMA.COLUMNS`
    where table_name = working_table and data_type like 'STRUCT%' and column_name=record.col_names);

    SET nested_col_string = (SELECT CONCAT(nested_col_string,",",loop_col));
END FOR;
  5. We then finalize by building our custom query and running it.
# build & run query
set query = (SELECT FORMAT("select %s%s from `projectid.dataset.table` order by 1",single_col_names,nested_col_string));
EXECUTE IMMEDIATE(query);

Output:

+----+-----------------+------+-----------------------------+------+-------------+
| id | ingestTimeStamp | id_1 | json                        | id_2 | description |
+----+-----------------+------+-----------------------------+------+-------------+
| 1  | 2022-03-03      | 100  | {"kardexid":11,"desc":"d1"} | 100  | a desc1     |
| 2  | 2022-03-04      | 101  | {"kardexid":22,"desc":"d2"} | 110  | a desc2     |
| 3  | 2022-03-05      | 102  | {"kardexid":34,"desc":"d3"} | 120  | a desc3     |
| 4  | 2022-03-06      | 103  | {"kardexid":53,"desc":"d4"} | 130  | a desc4     |
+----+-----------------+------+-----------------------------+------+-------------+

As you can see, a process like this one is quite challenging on BigQuery: you have to parse your struct types to get the names of the inner columns and write a script, and the result is definitely not optimal. For the sake of the question, it can be done, but it's not something I would recommend.
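The struct-parsing step is the part that is easiest to get wrong, so here is a rough Python sketch of the same idea: take the type string of a struct column and expand it into `column.field` references. It assumes the type string looks like `STRUCT<id INT64, json STRING>` and only handles flat structs of scalar fields; `struct_field_names` and `select_list` are hypothetical helper names, not part of any library.

```python
import re


def struct_field_names(data_type):
    """Return the top-level field names of a flat STRUCT<...> type string."""
    match = re.fullmatch(r"STRUCT<(.*)>", data_type.strip())
    if match is None:
        raise ValueError(f"not a STRUCT type: {data_type!r}")
    # Each field looks like "name TYPE"; keep the name only.
    # Assumption: no nested STRUCT/ARRAY fields, which carry commas of their own.
    return [field.split()[0] for field in match.group(1).split(",")]


def select_list(plain_columns, struct_columns):
    """Build a select list such as 'id,payloadKV.id,payloadKV.json'."""
    parts = list(plain_columns)
    for column, data_type in struct_columns.items():
        parts += [f"{column}.{name}" for name in struct_field_names(data_type)]
    return ",".join(parts)


columns = select_list(
    ["id", "ingestTimeStamp"],
    {
        "payloadKV": "STRUCT<id INT64, json STRING>",
        "metaKV": "STRUCT<id INT64, description STRING>",
    },
)
print(f"select {columns} from `projectid.dataset.table` order by 1")
```

Doing the parsing outside SQL keeps the logic readable and testable, which is one reason to prefer a client-side script for this kind of schema inspection.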

When dealing with BigQuery, you usually want queries that consume fewer resources and fetch only what is truly needed. You can use the client libraries to perform fewer operations on the BigQuery side and do the transformations in code, on the data returned by your raw queries.

To create this code I consulted the following documentation; check it out:

In databases, rows are observations and columns describe these observations. You want to describe observations in a non-consistent way; that is a meta level of databases, which are meant to be consistent in exactly this respect.

You can cross join with an unnested array and then pivot, but "you" still need to know the column names in advance.

"You" is either you in person or a bit of code that prepares the SQL statement by gathering the information in advance, which can be an automated solution in Python, for instance. Basically:

  1. Gather the pivot column information using Python + BigQuery: flatten the array and get the distinct metadata.key values
  2. In Python, prepare a SQL statement with a customized pivot clause using the metadata.key information from step 1
  3. Run that statement
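The three steps above can be sketched roughly as follows. This is an illustration under assumptions, not a tested production script: the helper names are invented, each KV column is assumed to be an array of structs with `key` and `value` fields, and `client` is expected to be a `google.cloud.bigquery.Client` created by the caller. The SQL-building functions are plain string handling and can be checked without a BigQuery connection.

```python
import re


def long_format_sql(table, kv_columns):
    """One (other columns, key, value) row per pair, across all KV columns."""
    excluded = ", ".join(kv_columns)
    return "\nunion all\n".join(
        f"select t.* except({excluded}), kv.key, kv.value "
        f"from `{table}` t, unnest(t.{col}) as kv"
        for col in kv_columns
    )


def distinct_keys_sql(table, kv_columns):
    """Step 1: query that returns every distinct key."""
    return f"select distinct key from (\n{long_format_sql(table, kv_columns)}\n) order by key"


def pivot_sql(table, kv_columns, keys):
    """Step 2: pivot statement with one output column per key."""

    def literal(key):
        # Escape backslashes and single quotes for a SQL string literal.
        return "'" + key.replace("\\", "\\\\").replace("'", "\\'") + "'"

    def alias(key):
        # Column aliases: keep letters, digits and underscores only.
        name = re.sub(r"[^A-Za-z0-9_]", "_", key)
        return "_" + name if not name or name[0].isdigit() else name

    in_list = ", ".join(f"{literal(k)} as {alias(k)}" for k in keys)
    return (
        f"select * from (\n{long_format_sql(table, kv_columns)}\n) "
        f"pivot (any_value(value) for key in ({in_list}))"
    )


def run(client, table, kv_columns):
    """Steps 1 to 3, using the client's query(...).result() calls."""
    key_rows = client.query(distinct_keys_sql(table, kv_columns)).result()
    keys = [row["key"] for row in key_rows]
    return client.query(pivot_sql(table, kv_columns, keys)).result()


# Preview of the generated statement for two example keys.
print(pivot_sql("projectid.dataset.table", ["payloadKV", "metaKV"], ["metadata.Area", "payload.Area"]))
```

Note that two different keys can map to the same alias (for example `a.b` and `a_b`), which BigQuery would reject as duplicate column names; a real script would need to detect and resolve such collisions.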
