SQL arrays: event occurrences with respect to a prior event
I have user event data stored in an array column, as shown below:
Column X
["event A", "event B", "event C", "event D", "event E"]
["event A", "event D", "event N"]
["event C", "event E", "event P"]
["event C", "event E", "event Q"]
I am trying to see, when a particular event occurs, which other events occur after it and how often, based on the sample data above.
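FLATTEN, ARRAY_SLICE and ARRAY_SIZE are the main tools needed here.
The CTE is only there to fake a table of data. Flattening the array loops over its elements, which we alias as a.
We could use a sub-select at this point to look at the next layer, but we can join directly to the result instead, which is what I have done. For each element we take the tail of the array that follows it and flatten that as well; this gives us our pairs, which we can then count.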
WITH data AS (
SELECT parse_json(column1) as array FROM VALUES
( '["event A", "event B", "event C", "event D", "event E"]' ),
( '["event A", "event D", "event N"]' ),
( '["event C", "event E", "event P"]' ),
( '["event C", "event E", "event Q"]' )
)
SELECT
a.value as e_s
,t.value as e_o
,count(*) as frequency
FROM data d,
table(flatten(input=> d.array)) a,
table(flatten(input=> array_slice(d.array, a.index+1, ARRAY_SIZE(d.array)))) t
GROUP BY 1,2
ORDER BY 1,2;
This gives:
E_S E_O FREQUENCY
"event A" "event B" 1
"event A" "event C" 1
"event A" "event D" 2
"event A" "event E" 1
"event A" "event N" 1
"event B" "event C" 1
"event B" "event D" 1
"event B" "event E" 1
"event C" "event D" 1
"event C" "event E" 3
"event C" "event P" 1
"event C" "event Q" 1
"event D" "event E" 1
"event D" "event N" 1
"event E" "event P" 1
"event E" "event Q" 1
A longer version, with each step made explicit one at a time (it reuses the data CTE from the query above):
SELECT f.e_s,
f.e_o,
count(*) as frequency
FROM (
SELECT e.e_s,
t.value as e_o
FROM (
SELECT
d.array,
a.value as e_s,
array_slice(d.array, a.index+1, d.len) as tail
FROM (
SELECT array,
ARRAY_SIZE(array) as len
FROM data
) d,
TABLE(FLATTEN(input=> d.array)) a
) e,
TABLE(FLATTEN(input=> e.tail)) t
) f
GROUP BY 1,2
ORDER BY 1,2;
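Since the question asks about what follows a particular event, the same pair logic can be narrowed to a single starting event. The following is a minimal sketch, not part of the original answer: the column alias events and the literal 'event C' are illustrative choices, and the casts to string are added here only so the filter compares plain text rather than VARIANT values.

```sql
-- Sketch only: same pair logic as the first query, restricted to one starting event.
-- The column alias "events" and the literal 'event C' are illustrative assumptions.
WITH data AS (
    SELECT parse_json(column1) AS events FROM VALUES
        ('["event A", "event B", "event C", "event D", "event E"]'),
        ('["event A", "event D", "event N"]'),
        ('["event C", "event E", "event P"]'),
        ('["event C", "event E", "event Q"]')
)
SELECT
    a.value::string AS e_s,   -- the starting event
    t.value::string AS e_o,   -- an event that occurs later in the same array
    count(*)        AS frequency
FROM data d,
    table(flatten(input => d.events)) a,
    table(flatten(input => array_slice(d.events, a.index + 1, array_size(d.events)))) t
WHERE a.value::string = 'event C'
GROUP BY 1, 2
ORDER BY frequency DESC, e_o;
```

For the sample data this should correspond to the "event C" rows of the result above: event E with a frequency of 3, then event D, event P and event Q with a frequency of 1 each.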
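I wasn't fast enough to beat Simeon, but we ended up with different approaches, so choose whichever suits you best!
I flatten the array into rows in a CTE, join the CTE back to itself, and then summarize the result.
Query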
with flat as (
select *
from test_table,
table (flatten(test_table.col_x)) f
)
select
a.value as E_S,
b.value as E_O,
count(1) as FREQUENCY
from flat a
join flat b on a.seq = b.seq and a.INDEX < b.INDEX
group by a.value, b.value
order by a.value, b.value
Full example
-- create sample table
create or replace transient table test_table
(
col_x array
);
-- insert sample data
insert overwrite into test_table (col_x)
SELECT
parse_json(column1)
FROM
VALUES ('["event A", "event B", "event C", "event D", "event E"]'),
('["event A", "event D", "event N"]'),
('["event C", "event E", "event P"]'),
('["event C", "event E", "event Q"]')
;
with flat as (
select *
from test_table,
table (flatten(test_table.col_x)) f
)
select
a.value as E_S,
b.value as E_O,
count(1) as FREQUENCY
from flat a
join flat b on a.seq = b.seq and a.INDEX < b.INDEX
group by a.value, b.value
order by a.value, b.value
;
Result
+---------+---------+---------+
|E_S |E_O |FREQUENCY|
+---------+---------+---------+
|"event A"|"event B"|1 |
|"event A"|"event C"|1 |
|"event A"|"event D"|2 |
|"event A"|"event E"|1 |
|"event A"|"event N"|1 |
|"event B"|"event C"|1 |
|"event B"|"event D"|1 |
|"event B"|"event E"|1 |
|"event C"|"event D"|1 |
|"event C"|"event E"|3 |
|"event C"|"event P"|1 |
|"event C"|"event Q"|1 |
|"event D"|"event E"|1 |
|"event D"|"event N"|1 |
|"event E"|"event P"|1 |
|"event E"|"event Q"|1 |
+---------+---------+---------+
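To see why the self-join condition works, it may help to look at the flattened rows on their own. The sketch below assumes test_table has been created and loaded as in the full example; the alias event_name is introduced here for illustration. FLATTEN emits a seq value shared by all elements that came from the same input row and a zero-based index for each element's position, so a.seq = b.seq keeps pairs within one array and a.INDEX < b.INDEX keeps only later events.

```sql
-- Sketch only: inspect the rows that the "flat" CTE feeds into the self-join.
-- Assumes test_table exists and is populated as in the full example.
SELECT
    f.seq,                          -- identifies the input row the element came from
    f.index,                        -- zero-based position of the element in its array
    f.value::string AS event_name   -- illustrative alias; cast strips the JSON quotes
FROM test_table,
    table(flatten(test_table.col_x)) f
ORDER BY f.seq, f.index;
```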