How to optimize SQL query for funnel analysis?
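I am trying to run a funnel analysis to see how many users make it from a start event to an end event. Suppose I want to analyze a funnel made up of 3 events: Visit, Sign Up and Activate.
I have a table in which all user events are stored, with columns such as user_id, epoch_utc, event_date, event_name, and the medium through which the user event entered the system. I need to show the users who completed the funnel (the events in the given order), and break the result down by medium, showing at which event users dropped off and how many users reached each step.
I wrote the query following the discussion at this link:
https://popsql.com/sql-templates/marketing/running-a-funnel-analysis
The query is as follows: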
with visit_users as (
    -- first Visit per user within the date range
    select user_id, min(epoch_utc) as min_time, min(medium) as medium, count(event_name) as total_events
    from events_table
    where date_parse(event_date,'%Y-%m-%d') >= date_parse('2022-09-16','%Y-%m-%d')
      and date_parse(event_date,'%Y-%m-%d') <= date_parse('2022-09-22','%Y-%m-%d')
      and event_name = 'Visit'
    group by 1
),
signup_users as (
    -- users whose first Sign Up came after their first Visit
    select su.user_id, su.min_time, su.medium, su.total_events
    from (
        select user_id, min(epoch_utc) as min_time, min(medium) as medium, count(event_name) as total_events
        from (
            select user_id, epoch_utc, medium, event_name
            from events_table
            where date_parse(event_date,'%Y-%m-%d') >= date_parse('2022-09-16','%Y-%m-%d')
              and date_parse(event_date,'%Y-%m-%d') <= date_parse('2022-09-22','%Y-%m-%d')
              and ( event_name = 'Sign Up' )
        )
        group by 1
    ) su, visit_users acu
    where su.user_id = acu.user_id
      and su.min_time > acu.min_time
),
activate_users as (
    -- users whose first Activate came after their first Sign Up
    select icu.user_id, icu.min_time, icu.medium, icu.total_events
    from (
        select user_id, min(epoch_utc) as min_time, min(medium) as medium, count(event_name) as total_events
        from (
            select user_id, epoch_utc, medium, event_name
            from events_table
            where date_parse(event_date,'%Y-%m-%d') >= date_parse('2022-09-16','%Y-%m-%d')
              and date_parse(event_date,'%Y-%m-%d') <= date_parse('2022-09-22','%Y-%m-%d')
              and ( event_name = 'Activate' )
        )
        group by 1
    ) icu, signup_users su
    where icu.user_id = su.user_id
      and icu.min_time > su.min_time
)
select * from (
    -- step 1: every user with a Visit
    select step, medium, count(user_id) as total_users, 0 as total_time, 0 as avg_time, sum(total_events) as total_events
    from (
        select 'Visit' as step, acu.medium, acu.user_id,
               0, acu.total_events
        from visit_users acu
    )
    group by 1, 2
    union
    -- step 2: Visit to Sign Up, with the seconds elapsed between the two
    select step, medium, count(user_id) as total_users, sum(diff) as total_time, (sum(diff) / count(user_id) ) as avg_time, sum(total_events) as total_events
    from (
        select 'Sign Up' as step, su.medium, su.user_id,
               date_diff('second', from_unixtime(acu.min_time/1000) , from_unixtime(su.min_time/1000)) as diff,
               su.total_events
        from visit_users acu, signup_users su
        where acu.user_id = su.user_id
    )
    group by 1, 2
    union
    -- step 3: Sign Up to Activate, with the seconds elapsed between the two
    select step, medium, count(user_id) as total_users, sum(diff) as total_time, (sum(diff) / count(user_id) ) as avg_time, sum(total_events) as total_events
    from (
        select 'Activate' as step, icu.medium, icu.user_id,
               date_diff('second', from_unixtime(su.min_time/1000) , from_unixtime(icu.min_time/1000)) as diff,
               icu.total_events
        from signup_users su, activate_users icu
        where su.user_id = icu.user_id
    )
    group by 1, 2
)
order by step, total_users desc
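Users can now define a funnel at runtime, so instead of these 3 events a funnel may contain more events. Is there a way to optimize the above query so that it works for n events with only slight changes?
We use a Java-based application to generate the query. Is there any way to optimize it?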
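Since the query is generated from Java, one possible direction is to build it from the list of step names instead of hand-writing one CTE per event. The sketch below is only an illustration, not a tested Athena solution: the class FunnelQueryBuilder, the per_user CTE, the t1/m1/c1 column names and the step_no column are all hypothetical names introduced here. It keeps the semantics of the query above (first occurrence of each event, strictly increasing timestamps, min(medium) per event) but computes every step with conditional aggregation in a single grouped pass over events_table.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper (not from the original post): builds an ordered n-step funnel query. */
final class FunnelQueryBuilder {

    /** Wraps a value in single quotes, doubling embedded quotes so it stays one SQL literal. */
    static String quote(String value) {
        return "'" + value.replace("'", "''") + "'";
    }

    /** Steps are event names in funnel order; dates are inclusive yyyy-MM-dd strings. */
    static String build(List<String> steps, String fromDate, String toDate) {
        if (steps.isEmpty()) {
            throw new IllegalArgumentException("a funnel needs at least one step");
        }
        StringBuilder sql = new StringBuilder("with per_user as (\n  select user_id");
        for (int i = 0; i < steps.size(); i++) {
            int n = i + 1;
            String when = "(case when event_name = " + quote(steps.get(i)) + " then ";
            // First timestamp, smallest medium and event count for step n, per user.
            sql.append(",\n    min").append(when).append("epoch_utc end) as t").append(n);
            sql.append(",\n    min").append(when).append("medium end) as m").append(n);
            sql.append(",\n    count").append(when).append("1 end) as c").append(n);
        }
        String names = steps.stream().map(FunnelQueryBuilder::quote)
                .collect(Collectors.joining(", "));
        sql.append("\n  from events_table")
           .append("\n  where date_parse(event_date, '%Y-%m-%d') >= date_parse(")
           .append(quote(fromDate)).append(", '%Y-%m-%d')")
           .append("\n    and date_parse(event_date, '%Y-%m-%d') <= date_parse(")
           .append(quote(toDate)).append(", '%Y-%m-%d')")
           .append("\n    and event_name in (").append(names).append(")")
           .append("\n  group by user_id\n)\n");

        String ordered = "t1 is not null"; // step 1 only needs the event to exist
        for (int i = 0; i < steps.size(); i++) {
            int n = i + 1;
            String diff = "0"; // no previous step to measure against
            if (n > 1) {
                // Step n counts only if every earlier step happened first, in order.
                ordered = (n == 2 ? "" : ordered + " and ") + "t" + (n - 1) + " < t" + n;
                diff = "date_diff('second', from_unixtime(t" + (n - 1)
                        + "/1000), from_unixtime(t" + n + "/1000))";
                sql.append("union all\n");
            }
            sql.append("select ").append(quote(steps.get(i))).append(" as step, ")
               .append(n).append(" as step_no, m").append(n).append(" as medium, ")
               .append("count(*) as total_users, sum(").append(diff).append(") as total_time, ")
               .append("sum(").append(diff).append(") / count(*) as avg_time, ")
               .append("sum(c").append(n).append(") as total_events\n")
               .append("from per_user where ").append(ordered)
               .append(" group by m").append(n).append("\n");
        }
        return sql.append("order by step_no, total_users desc").toString();
    }

    public static void main(String[] args) {
        System.out.println(build(List.of("Visit", "Sign Up", "Activate"),
                "2022-09-16", "2022-09-22"));
    }
}
```

Calling build with a longer list of event names adds one set of t/m/c columns and one union branch per step. Because the event names come from users at runtime, they are escaped through quote rather than concatenated raw. Whether this is actually faster than the CTE-per-step version has not been measured here; Athena may still evaluate per_user once per union branch, so it would need to be checked against real data.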
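I was able to improve performance by creating a CTAS (Create Table As Select) table in Athena. events_table is a temporary table created at runtime using a CTE.
In Athena, the contents of a CTAS table are stored in S3, so I created a temporary CTAS table from events_table using the following query.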
select user_id, epoch_utc, event_name, medium from events_table where date_parse(event_date,'%Y-%m-%d') >= date_parse('2022-09-16','%Y-%m-%d') and date_parse(event_date,'%Y-%m-%d') <= date_parse('2022-09-22','%Y-%m-%d') and event_name in ('Visit','Sign Up','Activate')
Then, in the query from the question, I referenced this CTAS table instead of events_table.
For large volumes of data, performance improved significantly.
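As a rough illustration of how the Java application might wrap that SELECT in a CTAS statement for an arbitrary list of funnel events, here is a minimal sketch. The class FunnelCtasBuilder, the table name funnel_events_tmp and the choice of Parquet as the storage format are assumptions made for this example; the original post does not show the CREATE TABLE part, nor the CTE that defines events_table, so the sketch assumes events_table can be resolved as written.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper: wraps the pre-filtering SELECT in an Athena CTAS statement. */
final class FunnelCtasBuilder {

    /** Wraps a value in single quotes, doubling embedded quotes so it stays one SQL literal. */
    static String quote(String value) {
        return "'" + value.replace("'", "''") + "'";
    }

    static String build(String tableName, List<String> events, String fromDate, String toDate) {
        // The table name is spliced into the statement, so accept only plain identifiers.
        if (!tableName.matches("[a-z_][a-z0-9_]*")) {
            throw new IllegalArgumentException("unexpected table name: " + tableName);
        }
        String names = events.stream().map(FunnelCtasBuilder::quote)
                .collect(Collectors.joining(", "));
        // format = 'PARQUET' is an assumed choice; the post does not say which format was used.
        return "create table " + tableName + " with (format = 'PARQUET') as\n"
             + "select user_id, epoch_utc, event_name, medium\n"
             + "from events_table\n"
             + "where date_parse(event_date, '%Y-%m-%d') >= date_parse(" + quote(fromDate) + ", '%Y-%m-%d')\n"
             + "  and date_parse(event_date, '%Y-%m-%d') <= date_parse(" + quote(toDate) + ", '%Y-%m-%d')\n"
             + "  and event_name in (" + names + ")";
    }

    public static void main(String[] args) {
        System.out.println(build("funnel_events_tmp",
                List.of("Visit", "Sign Up", "Activate"), "2022-09-16", "2022-09-22"));
    }
}
```

The CTAS table already applies the date filter and does not carry the event_date column, so the funnel query that reads from it presumably needs to drop its own event_date predicates. The temporary table and its files in S3 would also have to be cleaned up after the funnel query has run.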