There is a table with measurements. There is a column in this table with the name measurement
( JSON type ). It contains lists of a named parameter values.
A sample table with one key-value list called parameters can be defined as follow:
select 1 id, parse_json('{"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20},{"name":"ccc","value":30}]}') measurement union all
select 2 id, parse_json('{"parameters":[{"name":"aaa","value":11},{"name":"bbb","value":22},{"name":"ccc","value":33}]}') measurement union all
select 3 id, parse_json('{"parameters":[{"name":"aaa","value":111},{"name":"bbb","value":222},{"name":"ccc","value":333}]}') measurement
Same in the table form:
id | measurement |
---|---|
1 | {"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20},{"name":"ccc","value":30}]} |
2 | {"parameters":[{"name":"aaa","value":11},{"name":"bbb","value":22},{"name":"ccc","value":33}]} |
3 | {"parameters":[{"name":"aaa","value":111},{"name":"bbb","value":222},{"name":"ccc","value":333}]} |
Now, I want to extract some values from the list to columns. For example, if I want parameters aaa
and bbb
, I would expect the output like:
id | aaa | bbb |
---|---|---|
1 | 10 | 20 |
2 | 11 | 22 |
3 | 111 | 222 |
I can achieve this using 4 sub-queries. It starts already getting complex, but still bearable:
with measurements AS (
select 1 id, parse_json('{"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20},{"name":"ccc","value":30}]}') measurement union all
select 2 id, parse_json('{"parameters":[{"name":"aaa","value":11},{"name":"bbb","value":22},{"name":"ccc","value":33}]}') measurement union all
select 3 id, parse_json('{"parameters":[{"name":"aaa","value":111},{"name":"bbb","value":222},{"name":"ccc","value":333}]}') measurement
),
parameters AS (select id, JSON_QUERY_ARRAY(measurements.measurement.parameters) measurements_list from measurements),
param_values as (select id, JSON_VALUE(ml.name) name, JSON_VALUE(ml.value) value from parameters, parameters.measurements_list ml),
trimmed_values as (select id, case when name="aaa" then value else null end as aaa, case when name="bbb" then value else null end as bbb
from param_values where name in ("aaa", "bbb"))
select id, max(aaa) aaa, max(bbb) bbb from trimmed_values group by id
I can also use full-featured JSONPath
function as suggested by Mikhail. Then things start looking more manageable:
select id,
bq_data_loader_json.CUSTOM_JSON_VALUE(TO_JSON_STRING(measurement.parameters), '$.[?(@.name=="aaa")].value') aaa,
bq_data_loader_json.CUSTOM_JSON_VALUE(TO_JSON_STRING(measurement.parameters), '$.[?(@.name=="bbb")].value') bbb
from `sap-clm-analytics-dev.ag_experiment.measurements`
(It may be less efficient then the CASE-WHEN-GROUP-BY method because of the external UDF call, but let's focus on maintainability for now).
Now I add another list of key-value pairs named colors
:
select 1 id, parse_json('{"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20},{"name":"ccc","value":30}], "colors": [{"name": "green", "value": "A"}, {"name": "yellow", "value": "B"}]}') measurement union all
select 2 id, parse_json('{"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20},{"name":"ccc","value":30}], "colors": [{"name": "green", "value": "AA"}, {"name": "yellow", "value": "BB"}]}') measurement union all
select 3 id, parse_json('{"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20},{"name":"ccc","value":30}], "colors": [{"name": "green", "value": "AAA"}, {"name": "yellow", "value": "BBB"}]}') measurement
Let's pick values for green
from the list of colors . Then the output will be:
id | aaa | bbb | green |
---|---|---|---|
1 | 10 | 20 | A |
2 | 11 | 22 | AA |
3 | 111 | 222 | AAA |
The JSONPath solution from above can be trivially extended to cover this case:
select id,
bq_data_loader_json.CUSTOM_JSON_VALUE(TO_JSON_STRING(measurement.parameters), '$.[?(@.name=="aaa")].value') aaa,
bq_data_loader_json.CUSTOM_JSON_VALUE(TO_JSON_STRING(measurement.parameters), '$.[?(@.name=="bbb")].value') bbb,
bq_data_loader_json.CUSTOM_JSON_VALUE(TO_JSON_STRING(measurement.colors), '$.[?(@.name=="green")].value') bbb
from measurements
With the CASE-WHEN approach things starts getting tricky. The below query already gets complex and is simply wrong :
with measurements AS (
select 1 id, parse_json('{"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20},{"name":"ccc","value":30}], "colors": [{"name": "green", "value": "A"}, {"name": "yellow", "value": "B"}]}') measurement union all
select 2 id, parse_json('{"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20},{"name":"ccc","value":30}], "colors": [{"name": "green", "value": "AA"}, {"name": "yellow", "value": "BB"}]}') measurement union all
select 3 id, parse_json('{"parameters":[{"name":"aaa","value":10},{"name":"bbb","value":20},{"name":"ccc","value":30}], "colors": [{"name": "green", "value": "AAA"}, {"name": "yellow", "value": "BBB"}]}') measurement),
parameters_colors AS (
select id, JSON_QUERY_ARRAY(measurements.measurement.parameters) parameters_list, JSON_QUERY_ARRAY(measurements.measurement.colors) colors_list from measurements),
param_color_values AS (select id, JSON_VALUE(parameters_list.name) param_name, JSON_VALUE(parameters_list.value) param_value, JSON_VALUE(colors_list.name) color_name, JSON_VALUE(colors_list.value) color_value from parameters_colors, parameters_colors.parameters_list, parameters_colors.colors_list),
trimmed_values AS (select id,
case when param_name="aaa" then param_value else null end as aaa,
case when param_name="bbb" then param_value else null end as bbb,
case when color_name="green" then color_value else null end as green,
from param_color_values where param_name in ("aaa", "bbb") and color_name = "green")
select id, max(aaa) aaaa, max(bbb) bbb, max(green) green from trimmed_values group by 1
Wrong result:
id | aaa | bbb | green |
---|---|---|---|
1 | 10 | 20 | A |
2 | 10 | 20 | AA |
3 | 10 | 20 | AAA |
The cartesian product in param_color_values
is fine, but trimmed_values
incorrectly fill permutations with nulls. Apparently the level of dependency is needed for "green" values.
It would be apparently possible to fix my example, but it probably won't be maintainable after another list of parameters. So, I want to phrase my question differently.
What would be a maintainable way to extract multiple values from such data structures in SQL?
Ideally, I'd like to persist such query as a BigQuery materialized view. The original data object is huge, so I want to create a stage in the data pipeline, which persists a curated subset of it, differently clustered. I want that the BigQuery manages the refreshes of this object. Materialized view has a limited set of functions. For examples, UDFs (like CUSTOM_JSON_PATH) are not supported.
I tend to drop the idea of using the materialized view in favor of the maintainability of the UDF/JSONPath method and organize the refresh of the extracted dataset myself using scheduled queries.
Do I oversee any trivial pure SQL solution, which is optionally materialized-view compatible and easy to extend to more complex cases?
I tend to drop the idea of using the materialized view in favor of the maintainability of the UDF/JSONPath method and organize the refresh of the extracted dataset myself using scheduled queries.
Consider below approach (not compatible with materialized view)
create temp function get_keys(input string) returns array<string> language js as """
return Object.keys(JSON.parse(input));
""";
create temp function get_values(input string) returns array<string> language js as """
return Object.values(JSON.parse(input));
""";
create temp function get_leaves(input string) returns string language js as '''
function flattenObj(obj, parent = '', res = {}){
for(let key in obj){
let propName = parent ? parent + '.' + key : key;
if(typeof obj[key] == 'object'){
flattenObj(obj[key], propName, res);
} else {
res[propName] = obj[key];
}
}
return JSON.stringify(res);
}
return flattenObj(JSON.parse(input));
''';
with temp as (
select id, val, --key, val, --leaves
if(ends_with(key, '.name'), 'name', 'value') type,
regexp_replace(key, r'.name$|.value$', '') key
from your_table, unnest([struct(get_leaves(json_extract(to_json_string(measurement), '$')) as leaves)]),
unnest(get_keys(leaves)) key with offset
join unnest(get_values(leaves)) val with offset using(offset)
)
select * from (
select * except(key)
from temp
pivot (any_value(val) for type in ('name', 'value'))
)
pivot (any_value(value) for name in ('aaa', 'bbb', 'ccc', 'green', 'yellow') )
if applied to sample data in your question - output is
In case if keys are not known in advance or too many to manually manage - you can use below dynamic version
create temp function get_keys(input string) returns array<string> language js as """
return Object.keys(JSON.parse(input));
""";
create temp function get_values(input string) returns array<string> language js as """
return Object.values(JSON.parse(input));
""";
create temp function get_leaves(input string) returns string language js as '''
function flattenObj(obj, parent = '', res = {}){
for(let key in obj){
let propName = parent ? parent + '.' + key : key;
if(typeof obj[key] == 'object'){
flattenObj(obj[key], propName, res);
} else {
res[propName] = obj[key];
}
}
return JSON.stringify(res);
}
return flattenObj(JSON.parse(input));
''';
create temp table temp as (
select * except(key) from (
select id, val,
if(ends_with(key, '.name'), 'name', 'value') type,
regexp_replace(key, r'.name$|.value$', '') key
from your_table, unnest([struct(get_leaves(json_extract(to_json_string(measurement), '$')) as leaves)]),
unnest(get_keys(leaves)) key with offset
join unnest(get_values(leaves)) val with offset using(offset)
)
pivot (any_value(val) for type in ('name', 'value'))
);
execute immediate (select '''
select * from temp
pivot (any_value(value) for name in ("''' || string_agg(distinct name, '","') || '"))'
from temp
);
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.