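How to extract a value as a column from JSON with multiple key-value lists using (a materialized view compatible) SQL?
I have a measurements table. It has a column named measurement (of JSON type) that contains lists of named parameter values.
A sample table with one key-value list called parameters can be defined as follows: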
select 1 id, parse_json('{"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20},{"name":"ccc","value":30}]}') measurement union all
select 2 id, parse_json('{"parameters":[{"name":"aaa","value":11},{"name":"bbb","value":22},{"name":"ccc","value":33}]}') measurement union all
select 3 id, parse_json('{"parameters":[{"name":"aaa","value":111},{"name":"bbb","value":222},{"name":"ccc","value":333}]}') measurement
The same in tabular form:
id | measurement |
---|---|
1 | {"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20},{"name":"ccc","value":30}]} |
2 | {"parameters":[{"name":"aaa","value":11},{"name":"bbb","value":22},{"name":"ccc","value":33}]} |
3 | {"parameters":[{"name":"aaa","value":111},{"name":"bbb","value":222},{"name":"ccc","value":333}]} |
Now I want to extract some values from the list into columns. For example, if I want the parameters aaa and bbb, I would expect output like:
id | aaa | bbb |
---|---|---|
1 | 10 | 20 |
2 | 11 | 22 |
3 | 111 | 222 |
I can achieve this with 4 subqueries. It is already getting complicated, but it is still bearable:
with measurements AS (
select 1 id, parse_json('{"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20},{"name":"ccc","value":30}]}') measurement union all
select 2 id, parse_json('{"parameters":[{"name":"aaa","value":11},{"name":"bbb","value":22},{"name":"ccc","value":33}]}') measurement union all
select 3 id, parse_json('{"parameters":[{"name":"aaa","value":111},{"name":"bbb","value":222},{"name":"ccc","value":333}]}') measurement
),
parameters AS (select id, JSON_QUERY_ARRAY(measurements.measurement.parameters) measurements_list from measurements),
param_values as (select id, JSON_VALUE(ml.name) name, JSON_VALUE(ml.value) value from parameters, parameters.measurements_list ml),
trimmed_values as (select id, case when name="aaa" then value else null end as aaa, case when name="bbb" then value else null end as bbb
from param_values where name in ("aaa", "bbb"))
select id, max(aaa) aaa, max(bbb) bbb from trimmed_values group by id
I can also use a fully featured JSONPath function, as suggested by Mikhail. Then things start to look more manageable:
select id,
bq_data_loader_json.CUSTOM_JSON_VALUE(TO_JSON_STRING(measurement.parameters), '$.[?(@.name=="aaa")].value') aaa,
bq_data_loader_json.CUSTOM_JSON_VALUE(TO_JSON_STRING(measurement.parameters), '$.[?(@.name=="bbb")].value') bbb
from `sap-clm-analytics-dev.ag_experiment.measurements`
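(Because of the external UDF call it is probably less efficient than the CASE-WHEN-GROUP-BY approach, but let's focus on maintainability for now.)
Now I add another key-value list named colors: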
select 1 id, parse_json('{"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20},{"name":"ccc","value":30}], "colors": [{"name": "green", "value": "A"}, {"name": "yellow", "value": "B"}]}') measurement union all
select 2 id, parse_json('{"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20},{"name":"ccc","value":30}], "colors": [{"name": "green", "value": "AA"}, {"name": "yellow", "value": "BB"}]}') measurement union all
select 3 id, parse_json('{"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20},{"name":"ccc","value":30}], "colors": [{"name": "green", "value": "AAA"}, {"name": "yellow", "value": "BBB"}]}') measurement
Let's select the green value from the colors list. The output would then be:
id | aaa | bbb | green |
---|---|---|---|
1 | 10 | 20 | A |
2 | 11 | 22 | AA |
3 | 111 | 222 | AAA |
The JSONPath solution above can simply be extended to cover this case:
select id,
bq_data_loader_json.CUSTOM_JSON_VALUE(TO_JSON_STRING(measurement.parameters), '$.[?(@.name=="aaa")].value') aaa,
bq_data_loader_json.CUSTOM_JSON_VALUE(TO_JSON_STRING(measurement.parameters), '$.[?(@.name=="bbb")].value') bbb,
bq_data_loader_json.CUSTOM_JSON_VALUE(TO_JSON_STRING(measurement.colors), '$.[?(@.name=="green")].value') green
from measurements
With the CASE-WHEN approach, things start to get tricky. The query below is already complicated, and it is simply wrong:
with measurements AS (
select 1 id, parse_json('{"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20},{"name":"ccc","value":30}], "colors": [{"name": "green", "value": "A"}, {"name": "yellow", "value": "B"}]}') measurement union all
select 2 id, parse_json('{"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20},{"name":"ccc","value":30}], "colors": [{"name": "green", "value": "AA"}, {"name": "yellow", "value": "BB"}]}') measurement union all
select 3 id, parse_json('{"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20},{"name":"ccc","value":30}], "colors": [{"name": "green", "value": "AAA"}, {"name": "yellow", "value": "BBB"}]}') measurement),
parameters_colors AS (
select id, JSON_QUERY_ARRAY(measurements.measurement.parameters) parameters_list, JSON_QUERY_ARRAY(measurements.measurement.colors) colors_list from measurements),
param_color_values AS (select id, JSON_VALUE(parameters_list.name) param_name, JSON_VALUE(parameters_list.value) param_value, JSON_VALUE(colors_list.name) color_name, JSON_VALUE(colors_list.value) color_value from parameters_colors, parameters_colors.parameters_list, parameters_colors.colors_list),
trimmed_values AS (select id,
case when param_name="aaa" then param_value else null end as aaa,
case when param_name="bbb" then param_value else null end as bbb,
case when color_name="green" then color_value else null end as green,
from param_color_values where param_name in ("aaa", "bbb") and color_name = "green")
select id, max(aaa) aaa, max(bbb) bbb, max(green) green from trimmed_values group by 1
The wrong result:
id | aaa | bbb | green |
---|---|---|---|
1 | 10 | 20 | A |
2 | 10 | 20 | AA |
3 | 10 | 20 | AAA |
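The Cartesian product in param_color_values is fine, but trimmed_values fills the permutations with nulls incorrectly. Apparently, the "green" value needs a dependent level.
My example can obviously be fixed, but it may become unmaintainable once yet another parameter list is added. So I would like to phrase my question differently.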
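One possible shape of such a fix, as a minimal sketch rather than a verified solution: give every extracted column its own correlated subquery over the list it belongs to, so the parameters and colors lists are never joined to each other. The sketch assumes each name occurs at most once per list (the `limit 1` only guards against duplicates), and it uses a shortened version of the sample data. Whether BigQuery accepts this query shape inside a materialized view has not been checked here.

```sql
with measurements as (
  select 1 id, parse_json('{"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20}], "colors": [{"name": "green", "value": "A"}]}') measurement union all
  select 2 id, parse_json('{"parameters":[{"name":"aaa","value":11},{"name":"bbb","value":22}], "colors": [{"name": "green", "value": "AA"}]}') measurement
)
select id,
  -- each subquery scans only its own list, so no Cartesian product arises
  (select json_value(p.value)
   from unnest(json_query_array(measurement.parameters)) p
   where json_value(p.name) = 'aaa' limit 1) as aaa,
  (select json_value(p.value)
   from unnest(json_query_array(measurement.parameters)) p
   where json_value(p.name) = 'bbb' limit 1) as bbb,
  (select json_value(c.value)
   from unnest(json_query_array(measurement.colors)) c
   where json_value(c.name) = 'green' limit 1) as green
from measurements
```

A missing list or a missing name yields NULL for that column instead of dropping the whole row. Because JSON_VALUE returns STRING, the numeric values come back as strings and would need a cast if numbers are required.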
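What is a maintainable way to extract multiple values from such data structures in SQL?
Ideally, I would like to keep such a query as a BigQuery materialized view. The original data object is huge, so I would like to create a stage in the data pipeline that keeps a curated subset of it, clustered differently. I would like BigQuery to manage the refresh of this object. Materialized views support only a limited set of features. For example, UDFs (such as the CUSTOM_JSON_VALUE function used above) are not supported.
I am leaning towards dropping the materialized view idea in favor of the maintainability of the UDF/JSONPath approach, and organizing the refresh of the extracted dataset myself with scheduled queries.
Am I overlooking any trivial pure-SQL solution that is optionally materialized view compatible and extends easily to more complex cases?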
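"I am leaning towards dropping the materialized view idea in favor of the maintainability of the UDF/JSONPath approach, and organizing the refresh of the extracted dataset myself with scheduled queries."
Consider the approach below (not compatible with materialized views):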
create temp function get_keys(input string) returns array<string> language js as """
return Object.keys(JSON.parse(input));
""";
create temp function get_values(input string) returns array<string> language js as """
return Object.values(JSON.parse(input));
""";
create temp function get_leaves(input string) returns string language js as '''
function flattenObj(obj, parent = '', res = {}){
for(let key in obj){
let propName = parent ? parent + '.' + key : key;
if(typeof obj[key] == 'object'){
flattenObj(obj[key], propName, res);
} else {
res[propName] = obj[key];
}
}
return JSON.stringify(res);
}
return flattenObj(JSON.parse(input));
''';
with temp as (
select id, val, --key, val, --leaves
if(ends_with(key, '.name'), 'name', 'value') type,
regexp_replace(key, r'\.name$|\.value$', '') key
from your_table, unnest([struct(get_leaves(json_extract(to_json_string(measurement), '$')) as leaves)]),
unnest(get_keys(leaves)) key with offset
join unnest(get_values(leaves)) val with offset using(offset)
)
select * from (
select * except(key)
from temp
pivot (any_value(val) for type in ('name', 'value'))
)
pivot (any_value(value) for name in ('aaa', 'bbb', 'ccc', 'green', 'yellow') )
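If applied to the sample data in your question, the output is one row per id with the columns id, aaa, bbb, ccc, green, and yellow.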
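To make the intermediate step concrete, the sketch below calls get_leaves on its own for a single shortened measurement (the literal input is an illustrative assumption, not a row from the question). It shows how the nested lists are flattened into dotted paths before get_keys and get_values split them into parallel arrays.

```sql
create temp function get_leaves(input string) returns string language js as '''
function flattenObj(obj, parent = '', res = {}){
  for(let key in obj){
    let propName = parent ? parent + '.' + key : key;
    if(typeof obj[key] == 'object'){
      flattenObj(obj[key], propName, res);
    } else {
      res[propName] = obj[key];
    }
  }
  return JSON.stringify(res);
}
return flattenObj(JSON.parse(input));
''';
-- flatten one measurement and inspect the resulting JSON string
select get_leaves('{"parameters":[{"name":"aaa","value":10}],"colors":[{"name":"green","value":"A"}]}') as leaves;
-- leaves: {"parameters.0.name":"aaa","parameters.0.value":10,"colors.0.name":"green","colors.0.value":"A"}
```

After the `.name` or `.value` suffix is stripped, both leaves of one list element share a key such as `parameters.0`, which is what lets the first pivot pair each name with its value.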
If the keys are not known in advance, or there are too many of them to manage manually, you can use the dynamic version below:
create temp function get_keys(input string) returns array<string> language js as """
return Object.keys(JSON.parse(input));
""";
create temp function get_values(input string) returns array<string> language js as """
return Object.values(JSON.parse(input));
""";
create temp function get_leaves(input string) returns string language js as '''
function flattenObj(obj, parent = '', res = {}){
for(let key in obj){
let propName = parent ? parent + '.' + key : key;
if(typeof obj[key] == 'object'){
flattenObj(obj[key], propName, res);
} else {
res[propName] = obj[key];
}
}
return JSON.stringify(res);
}
return flattenObj(JSON.parse(input));
''';
create temp table temp as (
select * except(key) from (
select id, val,
if(ends_with(key, '.name'), 'name', 'value') type,
regexp_replace(key, r'\.name$|\.value$', '') key
from your_table, unnest([struct(get_leaves(json_extract(to_json_string(measurement), '$')) as leaves)]),
unnest(get_keys(leaves)) key with offset
join unnest(get_values(leaves)) val with offset using(offset)
)
pivot (any_value(val) for type in ('name', 'value'))
);
execute immediate (select '''
select * from temp
pivot (any_value(value) for name in ("''' || string_agg(distinct name, '","') || '"))'
from temp
);
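For orientation, with the parameters and colors sample data the string passed to `execute immediate` would look roughly like the statement below. This is an illustration, not captured output: the order of the names inside `in (...)` is not guaranteed, because `string_agg(distinct ...)` is used without an `order by`.

```sql
-- statement assembled at run time from the distinct names in temp (name order may differ)
select * from temp
pivot (any_value(value) for name in ("aaa","bbb","ccc","green","yellow"))
```

The column list therefore follows whatever names are present in the data at run time, which is the reason this variant needs scripting and cannot be expressed as a single static query.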