简体   繁体   中英

Presto unnest/map complex json

I've been on stack for a few hours exploring examples of other presto unnest/map/cast solutions but I can't seem to find one that works for my data.

Here's a sample of my data:

with test_data (id, messy_json) AS (
    VALUES ('TEST_A', JSON '{"issue":[],"problem":[{"category":"math","id":2,"name":"subtraction"},{"category":"math","id":3,"name":"division"},{"category":"english","id":25,"name":"verbs"},{"category":"english","id":27,"name":"grammar"},{"category":"language","id":1,"name":"grammar"}],"version":4}'),
           ('TEST_B', JSON '{"problem":[],"version":4}'),
           ('TEST_C', JSON '{"version": 4, "problem": [], "issue": [null, null, null, null, null, null, null, null, null, null, null]}')
),

The JSON column is semi-unstructured and can hold multiple lvls / doesn't always have every key:value pair as other rows.

I was trying solutions like:

with test_data AS (
select id,
       messy_json        
from larger_tbl),


select 
    id as id,
    json_extract_scalar(test_data, '$.version') as lvl1_version
    json_extract_scalar(lvl2, '$.problem') as lvl2_id
    from test
LEFT JOIN UNNEST(CAST(json_parse(messy_json) AS array(json))) AS x(lvl1) ON TRUE
LEFT JOIN UNNEST(CAST(json_extract(lvl1, '$.problem') AS array(json))) AS y(lvl2) ON TRUE

This gets me cast errors etc. I've tried some variations with

unnest(cast(json_col as map(varchar, map(varchar,varchar)) options too.

My goal is to explode the entire dataset with the retained ID and all keys/multi-lvl keys retained in a long dataset. I appreciate any input/guidance, thanks!

To be honest your current query does not make much sense based on provided sample data and stated goals.

  • UNNEST is used with cross join rather then left join (I use succinct syntax skipping the keyword completely)
  • No need to json_parse(messy_json) in your test data since it is already json (though I assume in the real data it just a varchar field then it should be kept)
  • Root json is a json object, so it can't be casted to array (and there is no need to).
  • problem in json_extract_scalar(lvl2, '$.problem') as lvl2_id should be changed to id since problem array was already unnested
  • To handle cases where the problem array is empty you can use approach with unnesting multiple arrays with a dummy one with 1 element
-- sample data
with test_data (id, messy_json) AS (
    VALUES ('TEST_A', JSON '{"issue":[],"problem":[{"category":"math","id":2,"name":"subtraction"},{"category":"math","id":3,"name":"division"},{"category":"english","id":25,"name":"verbs"},{"category":"english","id":27,"name":"grammar"},{"category":"language","id":1,"name":"grammar"}],"version":4}'),
           ('TEST_B', JSON '{"problem":[],"version":4}'),
           ('TEST_C', JSON '{"version": 4, "problem": [], "issue": [null, null, null, null, null, null, null, null, null, null, null]}')
)

-- query
select id
     , json_extract_scalar(messy_json, '$.version') as lvl1_version
     , json_extract_scalar(lvl2, '$.id')            as lvl2_id
from test_data
   , UNNEST(CAST(json_extract(messy_json, '$.problem') 
       AS array(json)), array[1]) AS y(lvl2, ignored);

Output:

id lvl1_version lvl2_id
TEST_A 4 2
TEST_A 4 3
TEST_A 4 25
TEST_A 4 27
TEST_A 4 1
TEST_B 4 NULL
TEST_C 4 NULL

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM