Update nested bigquery json element
I have a BigQuery schema such as this:
visitorId INTEGER NULLABLE
visitID INTEGER NULLABLE
hits RECORD REPEATED
hits.eventInfo RECORD NULLABLE
hits.eventInfo.eventCategory STRING NULLABLE
hits.eventInfo.eventLabel STRING NULLABLE
with sample data as:
visitorId  visitId  hits.eventInfo.eventCategory  hits.eventInfo.eventLabel
123456     1        abc                           {"info":"secret", "otherfields":"blah"}
                    lmn                           {"info":"secret", "otherfields":"blah"}
                    xyz                           {"info":"secret", "otherfields":"blah"}
124557     1        abc                           {"info":"secret", "otherfields":"blah"}
                    lmn                           {"info":"secret", "otherfields":"blah"}
                    xyz                           {"info":"secret", "otherfields":"blah"}
I need to remove "info":"secret", only when the eventCategory is "abc".
I am a BigQuery newbie. After much trial and error I got as far as this, but unfortunately I am stuck now.
UPDATE `project.dataset.ga_sessions_20200608`
SET hits = ARRAY(
SELECT AS STRUCT * REPLACE((REGEXP_REPLACE(eventInfo.eventLabel, r"\"info\":\"[a-z A-Z]*\",", "")) AS eventInfo.eventLabel) from UNNEST(hits)
)
WHERE (select eventInfo.eventLabel from UNNEST(hits)) LIKE '%info%'
There are two problems here.
Any help or pointers will be appreciated.
It's challenging to do this with a DML approach (UPDATE) as opposed to a batch process that operates over the whole table, for a few reasons (some of which you already gave):
This puts you in the position of having to reconstruct the entire struct of both hits and eventInfo in order to replace its one field (which is itself a JSON-encoded struct, although BigQuery is sort of blind to this and treats it as a string).
AFAIK BigQuery does not have a function that marshals a true STRUCT from JSON, so regex might be the only way to actually eliminate the undesired field in the JSON-encoded column.
The WHERE condition here can make use of the JSON_EXTRACT_SCALAR standardSQL function to capture the condition that the particular key to be removed actually exists in the JSON.
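As a quick illustration of that presence check, here is a minimal sketch run on literal strings shaped like the sample data in the question (the column aliases key_present and key_absent are made up for the example). JSON_EXTRACT_SCALAR returns NULL when the path does not exist, which is what an IS NOT NULL test can rely on:

```sql
#standardSQL
SELECT
  -- key present: yields the scalar value, here the string secret
  JSON_EXTRACT_SCALAR('{"info":"secret", "otherfields":"blah"}', '$.info') AS key_present,
  -- key absent: yields NULL
  JSON_EXTRACT_SCALAR('{"otherfields":"blah"}', '$.info') AS key_absent
```

One caveat with this kind of check: the function also returns NULL when the value at that path is not a scalar (for example a nested object), so rows like that would be skipped by the filter.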
So a possible pure BigQuery approach (no UDFs) could look like:
#standardSQL
UPDATE
  `project.dataset.table_DATE`
SET
  --reconstruct hits column as an array
  hits = ARRAY(
    SELECT
      --reconstruct each struct of hits
      AS STRUCT * REPLACE( (
        SELECT
          --reconstruct eventInfo...
          AS STRUCT eventInfo.* REPLACE(
            --with eventLabel replaced with a filtered version
            REGEXP_REPLACE(eventInfo.eventLabel, r"\"secret\":\"[A-Za-z\p{L}]* [A-Za-z\p{L}]*\",", "") AS eventLabel)
        ) AS eventInfo )
    FROM
      UNNEST(hits) AS hits )
WHERE
  --Only for rows where at least one eventLabel's json contains a `secret` key
  --(the filter sits in the subquery's WHERE: EXISTS is true as soon as the
  --subquery returns any row, even a row whose only value is FALSE)
  EXISTS (
    SELECT 1
    FROM UNNEST(hits) AS h
    WHERE JSON_EXTRACT_SCALAR(h.eventInfo.eventLabel, "$.secret") IS NOT NULL )
Note the regex there may need some changes based on the format of the data. Here I'm assuming a key named secret whose value is two words with possible unicode characters (like a first + last name); for the sample data in the question the key would be info and the value a single word.
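For the sample data in the question, where the key is info and the removal should only apply when eventCategory is "abc", the same pattern might be adapted along the lines of the sketch below. It has not been run against a real table. The table name is the same placeholder as above, the aliases h and pos are arbitrary, and the regex assumes that the value contains no escaped double quotes and that another key follows info (the trailing comma is required, so an info key in last position is left untouched rather than leaving a dangling comma behind).

```sql
#standardSQL
UPDATE `project.dataset.table_DATE`
SET hits = ARRAY(
  -- rebuild every hit, swapping in a rebuilt eventInfo
  SELECT AS STRUCT h.* REPLACE((
    SELECT AS STRUCT h.eventInfo.* REPLACE(
      -- strip the "info" key only for hits in category abc
      IF(h.eventInfo.eventCategory = 'abc',
         REGEXP_REPLACE(h.eventInfo.eventLabel, r'"info"\s*:\s*"[^"]*"\s*,\s*', ''),
         h.eventInfo.eventLabel) AS eventLabel)
  ) AS eventInfo)
  FROM UNNEST(hits) AS h WITH OFFSET AS pos
  -- keep the hits in their original order
  ORDER BY pos
)
WHERE
  -- only touch rows with at least one abc hit whose label has an info key
  EXISTS (
    SELECT 1
    FROM UNNEST(hits) AS h
    WHERE h.eventInfo.eventCategory = 'abc'
      AND JSON_EXTRACT_SCALAR(h.eventInfo.eventLabel, '$.info') IS NOT NULL
  )
```

As with the query above, a hit whose eventInfo is NULL comes back as a struct of NULL fields rather than a NULL struct, so it may be worth previewing the ARRAY(...) expression in a plain SELECT before running the UPDATE.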
The regex part might be better served by a UDF, although those can be slow for large tables.
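A minimal sketch of what such a UDF could look like, using a JavaScript temporary function. The name remove_json_key and its parameters are hypothetical, chosen for this example only. It parses the string, deletes the key, and serializes the object again, which sidesteps the comma and quoting pitfalls of the regex:

```sql
#standardSQL
CREATE TEMP FUNCTION remove_json_key(json_str STRING, key STRING)
RETURNS STRING
LANGUAGE js AS """
  if (json_str === null) return null;
  try {
    var obj = JSON.parse(json_str);
    delete obj[key];
    return JSON.stringify(obj);
  } catch (e) {
    // not valid JSON: hand the input back unchanged
    return json_str;
  }
""";

-- try it on a literal shaped like the sample data
SELECT remove_json_key('{"info":"secret", "otherfields":"blah"}', 'info') AS cleaned;
-- cleaned: {"otherfields":"blah"}
```

Because the object is re-serialized, the original whitespace inside the JSON string is not preserved. The call could stand in for the REGEXP_REPLACE expression in the query above, at the cost of the slower execution already mentioned.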