Update nested BigQuery JSON element

I have a BigQuery schema like this:

visitorId                    INTEGER NULLABLE
visitID                      INTEGER NULLABLE
hits                         RECORD  REPEATED
hits.eventInfo               RECORD  NULLABLE   
hits.eventInfo.eventCategory STRING  NULLABLE   
hits.eventInfo.eventLabel    STRING  NULLABLE   

with sample data as:

visitorId  visitId  hits.eventInfo.eventCategory hits.eventInfo.eventLabel
123456     1        abc                          {"info":"secret", "otherfields":"blah"}
                    lmn                          {"info":"secret", "otherfields":"blah"}
                    xyz                          {"info":"secret", "otherfields":"blah"}
124557     1        abc                          {"info":"secret", "otherfields":"blah"}
                    lmn                          {"info":"secret", "otherfields":"blah"}
                    xyz                          {"info":"secret", "otherfields":"blah"}

I need to remove "info":"secret", but only when the eventCategory is "abc".

I am a BigQuery newbie. After a lot of trial and error I got this far, but unfortunately I am stuck now.

UPDATE `project.dataset.ga_sessions_20200608`
SET hits = ARRAY(
  SELECT AS STRUCT * REPLACE((REGEXP_REPLACE(eventInfo.eventLabel, r"\"info\":\"[a-z A-Z]*\",", "")) AS eventInfo.eventLabel) from UNNEST(hits) 
)
WHERE (select eventInfo.eventLabel from UNNEST(hits)) LIKE '%info%'

There are two problems here.

  • The SET part is not working :(
  • The subquery in the WHERE clause (subselect) does not give a scalar output :'(

Any help or pointers will be appreciated.

Doing this with a DML statement (UPDATE), rather than with a batch process that operates over the whole table, is challenging for a few reasons (some of which you already gave):

  • The subfields within the records that need updating are stored as JSON strings
  • The update is a filter by field key inside that string representation
  • The records themselves are a repeated type at the top level of the table, and contain structs

This puts you in the position of having to reconstruct the entire struct of both hits and eventInfo in order to replace its one field (which is itself a JSON-encoded struct, although BigQuery is blind to this and treats it as a string).

AFAIK BigQuery does not have a function that marshals a true STRUCT from JSON, so a regex might be the only way to actually eliminate the undesired field in the JSON-encoded column.

The WHERE condition here can make use of the JSON_EXTRACT_SCALAR standard SQL function to check that the particular key to be removed actually exists in the JSON.
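
As a minimal sketch of that check, the query below builds a one-row stand-in for the table and keeps the row only if at least one hit's eventLabel has an info key. The sample CTE and its values are made up from the sample data in the question; they are not part of the real table. Putting the test in the WHERE of an EXISTS subquery also sidesteps the scalar-subquery error from the question, because a plain subquery over UNNEST(hits) yields one row per hit.

```sql
-- Hypothetical one-row stand-in for the real table (values taken from the question).
WITH sample AS (
  SELECT
    123456 AS visitorId,
    1 AS visitId,
    [
      STRUCT(STRUCT('abc' AS eventCategory,
                    '{"info":"secret", "otherfields":"blah"}' AS eventLabel) AS eventInfo),
      STRUCT(STRUCT('lmn' AS eventCategory,
                    '{"otherfields":"blah"}' AS eventLabel) AS eventInfo)
    ] AS hits
)
SELECT visitorId
FROM sample
WHERE EXISTS (
  -- One row per hit; EXISTS is true as soon as any hit passes the filter.
  SELECT 1
  FROM UNNEST(hits) AS h
  WHERE JSON_EXTRACT_SCALAR(h.eventInfo.eventLabel, '$.info') IS NOT NULL
)
```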

So a possible pure BigQuery approach (no UDFs) could look like:

#standardSQL
UPDATE
  `project.dataset.table_DATE`
SET
  --reconstruct hits column as an array
  hits = ARRAY(
  SELECT
       --reconstruct each struct of hits
    AS STRUCT * REPLACE( (
      SELECT
        --reconstruct eventInfo...
        AS STRUCT eventInfo.* REPLACE(
          --with eventLabel replaced with a filtered version
          REGEXP_REPLACE(eventInfo.eventLabel, r"\"secret\":\"[A-Za-z\p{L}]* [A-Za-z\p{L}]*\",", "") AS eventLabel)
        ) AS eventInfo )
  FROM
    UNNEST(hits) AS hits )
WHERE
  --Only for rows where at least one eventLabel's json contains a `secret` key
  EXISTS (SELECT 1 FROM UNNEST(hits) AS h WHERE JSON_EXTRACT_SCALAR(h.eventInfo.eventLabel, "$.secret") IS NOT NULL)

Note that the regex may need some changes based on the format of the data. Here I'm assuming a two-word secret value with possible Unicode characters (like a first + last name). The query also treats secret as the name of the key to remove; for the sample data in the question, the key would be info.
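
For comparison, here is a sketch of the same idea adapted to the sample data in the question: it removes the info key (with any quoted value) and does so only for hits whose eventCategory is 'abc'. The table name is the one from the question. The regex is an assumption about the data: it expects a plain quoted string value without escaped quotes, and it only cleans up a comma that follows the pair, so it would leave a dangling comma if info were the last key in the object.

```sql
UPDATE `project.dataset.ga_sessions_20200608`
SET hits = ARRAY(
  SELECT AS STRUCT h.* REPLACE (
    (
      -- Rebuild eventInfo, swapping in a cleaned eventLabel.
      SELECT AS STRUCT h.eventInfo.* REPLACE (
        IF(
          h.eventInfo.eventCategory = 'abc',
          -- Drop "info":"<value>" and the comma after it, if present.
          REGEXP_REPLACE(h.eventInfo.eventLabel, r'"info"\s*:\s*"[^"]*"\s*,?\s*', ''),
          h.eventInfo.eventLabel
        ) AS eventLabel
      )
    ) AS eventInfo
  )
  -- Keep the hits in their original order.
  FROM UNNEST(hits) AS h WITH OFFSET AS pos
  ORDER BY pos
)
WHERE EXISTS (
  -- Only touch rows with at least one 'abc' hit that still has an info key.
  SELECT 1
  FROM UNNEST(hits) AS h
  WHERE h.eventInfo.eventCategory = 'abc'
    AND JSON_EXTRACT_SCALAR(h.eventInfo.eventLabel, '$.info') IS NOT NULL
)
```

It is worth trying this on a copy of the table first. In particular, hits whose eventInfo is NULL may come back with a struct of NULL fields instead of a NULL struct, since the inner SELECT AS STRUCT always builds a struct.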

The regex part might be better served by a UDF, although those can be slow for large tables.
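
A rough sketch of what such a UDF could look like, using a JavaScript temporary function that parses the string, deletes the key, and serializes it again. The function name removeJsonKey and its parameter names are made up for illustration. The SELECT only exercises it on a literal taken from the question; wiring it into the UPDATE in place of REGEXP_REPLACE is left as an assumption to verify.

```sql
-- Hypothetical helper: removes one top-level key from a JSON object string.
CREATE TEMP FUNCTION removeJsonKey(json_str STRING, key_name STRING)
RETURNS STRING
LANGUAGE js AS r"""
  if (json_str === null) return null;
  try {
    var obj = JSON.parse(json_str);
    delete obj[key_name];
    return JSON.stringify(obj);
  } catch (e) {
    // Not valid JSON: return the value unchanged.
    return json_str;
  }
""";

SELECT removeJsonKey('{"info":"secret", "otherfields":"blah"}', 'info') AS cleaned;
```

Unlike the regex, this does not depend on where the key sits in the object or on what characters the value contains, but the output is re-serialized, so whitespace between fields is not preserved.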
