简体   繁体   中英

Update nested bigquery json element

I have a big-query schema such as this:

visitorId                    INTEGER NULLABLE
visitID                      INTEGER NULLABLE
hits                         RECORD  REPEATED
hits.eventInfo               RECORD  NULLABLE   
hits.eventInfo.eventCategory STRING  NULLABLE   
hits.eventInfo.eventLabel    STRING  NULLABLE   

with sample data as:

visitorId  visitId  hits.eventInfo.eventCategory hits.eventInfo.eventCategory
123456     1        abc                          {"info":"secret", "otherfields":"blah"}
                    lmn                          {"info":"secret", "otherfields":"blah"}
                    xyz                          {"info":"secret", "otherfields":"blah"}
124557     1        abc                          {"info":"secret", "otherfields":"blah"}
                    lmn                          {"info":"secret", "otherfields":"blah"}
                    xyz                          {"info":"secret", "otherfields":"blah"}

I need to remove "info":"secret", only when the eventCategory is "abc" .

I am a big-query newbie. After much hitting and trying I was able to come to this, but unfortunately stuck now.

UPDATE `project.dataset.ga_sessions_20200608`
SET hits = ARRAY(
  SELECT AS STRUCT * REPLACE((REGEXP_REPLACE(eventInfo.eventLabel, r"\"info\":\"[a-z A-Z]*\",", "")) AS eventInfo.eventLabel) from UNNEST(hits) 
)
WHERE (select eventInfo.eventLabel from UNNEST(hits)) LIKE '%info%'

There are two problems here.

  • set part is not working:(
  • subquery in where (subselect) is not giving a scalar output:'(

Any help, pointers will be appreciated.

It's challenging to do this with DML approach ( UPDATE ) as opposed to a batch process that operates over the whole table for a few reasons (some of which you already gave):

  • The subfields within the records that need updating are JSON string-ified
  • The nature of the update is a filter by field key in the string representation
  • The records themselves are repeated type at top level of the table, and contain structs

This puts you in the position of having to reconstruct the entire struct of both hits and eventInfo in order to replace it's one field (which is itself a JSON encoded struct, although BigQuery is sort of blind to this and treats it as a string).

AFAIK BigQuery does not have a function that marshalls a true STRUCT from json, so regex might be the only way to actually eliminate the undesired field in the json-encoded column.

The WHERE condition here can make use of the JSON_EXTRACT_SCALAR standardSQL function to capture the condition that the particular key to be removed actually exists in the json.

So a possible pure BigQuery approach (no UDFs) could look like:

#standardSQL
UPDATE
  `project.dataset.table_DATE`
SET
  --reconstruct hits column as an array
  hits = ARRAY(
  SELECT
       --reconstruct each struct of hits
    AS STRUCT * REPLACE( (
      SELECT
        --reconstruct eventInfo...
        AS STRUCT eventInfo.* REPLACE(
          --with eventLabel replaced with a filtered version
          REGEXP_REPLACE(eventInfo.eventLabel, r"\"secret\":\"[A-Za-z\p{L}]* [A-Za-z\p{L}]*\",", "") AS eventLabel)
        ) AS eventInfo )
  FROM
    UNNEST(hits) AS hits )
WHERE
  --Only for rows where at least one eventLabel's json contains a `secret` key
  EXISTS (SELECT JSON_EXTRACT_SCALAR(h.eventInfo.eventLabel, "$.secret") is not null from unnest(hits) as h)

Note the regex there may need some changes based on the format of the data. Here I'm assuming a two word secret value with possible unicode characters (like a first + last name).

The regex part might be better served by a UDF , although those can be slow for large table.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM