How to optimize SQL query for funnel analysis?

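I am trying to do a funnel analysis to see how many users make it from a start event to an end event. Say I want to analyze a funnel made up of 3 events: Visit, Sign Up, and Activate.

I have a table in which all user events are stored, with columns such as user_id, epoch_utc, event_date, event_name, and the medium through which the user event entered the system. I need to show the users who completed the funnel (the events in the given order), broken down by medium, along with which users dropped off at which event and their counts.

I have written the query following the discussion at this link:

https://popsql.com/sql-templates/marketing/running-a-funnel-analysis

The query is as follows: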

with visit_users as (
    select user_id, min(epoch_utc) as min_time, min(medium) as medium, count(event_name) as total_events from events_table 
    where  date_parse(event_date,'%Y-%m-%d') >= date_parse('2022-09-16','%Y-%m-%d') and date_parse(event_date,'%Y-%m-%d') <= date_parse('2022-09-22','%Y-%m-%d')
    and event_name = 'Visit'
    group by 1
),

signup_users as (
    select su.user_id, su.min_time, su.medium, su.total_events from 
    (
        select user_id, min(epoch_utc) as min_time, min(medium) as medium, count(event_name) as total_events from (
            SELECT user_id, epoch_utc, medium, event_name   FROM events_table 
                where  date_parse(event_date,'%Y-%m-%d') >= date_parse('2022-09-16','%Y-%m-%d') and date_parse(event_date,'%Y-%m-%d') <= date_parse('2022-09-22','%Y-%m-%d')  AND 
                ( event_name = 'Sign Up' ) 
        ) group by 1
    ) su, visit_users acu
    where su.user_id = acu.user_id
    and su.min_time > acu.min_time
),

activate_users as (
    select icu.user_id, icu.min_time, icu.medium, icu.total_events from 
    (
        select user_id, min(epoch_utc) as min_time, min(medium) as medium, count(event_name) as total_events from (
            SELECT user_id, epoch_utc, medium, event_name   FROM events_table 
                where  date_parse(event_date,'%Y-%m-%d') >= date_parse('2022-09-16','%Y-%m-%d') and date_parse(event_date,'%Y-%m-%d') <= date_parse('2022-09-22','%Y-%m-%d')  AND 
                ( event_name = 'Activate' ) 
        ) group by 1
    ) icu, signup_users su
    where icu.user_id = su.user_id
    and icu.min_time > su.min_time
)

select * from (
    select step, medium, count(user_id) as total_users, 0 as total_time, 0 as avg_time, sum(total_events) as total_events from (
    select 'Visit' as step, acu.medium, acu.user_id, 
    0, acu.total_events from visit_users acu
    ) group by 1, 2 
    UNION
    select step, medium, count(user_id) as total_users, sum(diff) as total_time, (sum(diff) / count(user_id) ) as avg_time, sum(total_events) as total_events from ( 
    select 'Sign Up' as step, su.medium, su.user_id, 
    date_diff('second', from_unixtime(acu.min_time/1000) , from_unixtime(su.min_time/1000)) as diff,
    su.total_events
    from visit_users acu, signup_users su
    where acu.user_id = su.user_id
    ) group by 1, 2 
    UNION
    select step, medium, count(user_id) as total_users, sum(diff) as total_time, (sum(diff) / count(user_id) ) as avg_time, sum(total_events) as total_events from ( 
    select 'Activate' as step, icu.medium, icu.user_id, 
    date_diff('second', from_unixtime(su.min_time/1000) , from_unixtime(icu.min_time/1000)) as diff,
    icu.total_events
    from signup_users su, activate_users icu
    where su.user_id = icu.user_id
    ) group by 1, 2
) order by step, total_users desc

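Now the user can define a funnel at runtime, so instead of these 3 events a funnel may contain more events. Is there any approach to optimize the query above so that it works for n events with only slight variations?

We use a Java-based application to generate the query. Is there any way to optimize it?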
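Since the query is generated from Java, one way to extend it to n events is to emit the per-step CTEs in a loop. The following is a minimal sketch, not a definitive implementation. The class name FunnelQueryBuilder, its methods, the step_0 ... step_n CTE names, and the extra step_no and prev_time columns are hypothetical; the table and column names come from the query above. It keeps the same rule as the original (a user's earliest event for a step must come after their earliest event for the previous step), expressed with a join and a HAVING clause, and uses UNION ALL because the step rows cannot collide.

```java
import java.util.List;

// Hypothetical helper: generates the funnel SQL for an ordered list of event names.
final class FunnelQueryBuilder {

    // Quotes a value as a SQL string literal, doubling embedded single quotes.
    static String literal(String value) {
        return "'" + value.replace("'", "''") + "'";
    }

    // The table name is inlined as-is, so it is assumed to come from trusted configuration.
    static String build(String table, String fromDate, String toDate, List<String> steps) {
        if (steps.isEmpty()) {
            throw new IllegalArgumentException("A funnel needs at least one step");
        }
        String dateFilter =
            "date_parse(e.event_date,'%Y-%m-%d') >= date_parse(" + literal(fromDate) + ",'%Y-%m-%d')"
            + " and date_parse(e.event_date,'%Y-%m-%d') <= date_parse(" + literal(toDate) + ",'%Y-%m-%d')";

        StringBuilder ctes = new StringBuilder();
        StringBuilder union = new StringBuilder();
        for (int i = 0; i < steps.size(); i++) {
            boolean first = (i == 0);
            String name = literal(steps.get(i));

            // One CTE per step; every step after the first joins to the previous step's users.
            ctes.append(first ? "with " : ",\n")
                .append("step_").append(i).append(" as (\n")
                .append("    select e.user_id, min(e.epoch_utc) as min_time, ")
                .append(first ? "min(e.epoch_utc)" : "p.min_time").append(" as prev_time,\n")
                .append("           min(e.medium) as medium, count(e.event_name) as total_events\n")
                .append("    from ").append(table).append(" e\n");
            if (!first) {
                ctes.append("    join step_").append(i - 1).append(" p on p.user_id = e.user_id\n");
            }
            ctes.append("    where ").append(dateFilter).append("\n")
                .append("    and e.event_name = ").append(name).append("\n")
                .append(first
                    ? "    group by e.user_id\n"
                    : "    group by e.user_id, p.min_time\n    having min(e.epoch_utc) > p.min_time\n")
                .append(")");

            // One branch per step in the final union; diff is 0 for the first step.
            union.append(first ? "" : "    union all\n")
                 .append("    select ").append(i).append(" as step_no, ").append(name)
                 .append(" as step, medium, user_id,\n")
                 .append("           date_diff('second', from_unixtime(prev_time/1000), from_unixtime(min_time/1000)) as diff,\n")
                 .append("           total_events\n")
                 .append("    from step_").append(i).append("\n");
        }
        return ctes + "\n"
            + "select step_no, step, medium, count(user_id) as total_users, sum(diff) as total_time,\n"
            + "       (sum(diff) / count(user_id)) as avg_time, sum(total_events) as total_events\n"
            + "from (\n" + union + ")\n"
            + "group by 1, 2, 3\n"
            + "order by step_no, total_users desc";
    }
}
```

Calling FunnelQueryBuilder.build("events_table", "2022-09-16", "2022-09-22", List.of("Visit", "Sign Up", "Activate")) would produce a three-step query. The step_no column is an addition so that rows sort in funnel order rather than alphabetically by step name. Event names are inlined as escaped literals here for brevity; a real generator should validate them against the known event names.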

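I was able to improve performance by creating a CTAS (Create Table As Select) table in Athena. events_table is a temporary table created at runtime using a CTE.

In Athena, the contents of a CTAS table are stored in S3, so I created a temporary CTAS table from events_table by applying the following query.

select user_id, epoch_utc, event_name, medium from events_table where date_parse(event_date,'%Y-%m-%d') >= date_parse('2022-09-16','%Y-%m-%d') and date_parse(event_date,'%Y-%m-%d') <= date_parse('2022-09-22','%Y-%m-%d') and event_name in ('Visit','Sign Up','Activate')

Then, in the query from the question, I referenced this CTAS table instead of events_table.

For large volumes of data, performance improved significantly.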
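For a funnel defined at runtime, the CTAS statement itself can also be generated from the list of events. The following is a rough sketch under stated assumptions: the class name FunnelCtasBuilder, the target table name, the S3 location, and the choice of Parquet are all hypothetical and not taken from the answer. Only the SELECT mirrors the query shown above, wrapped in Athena's CREATE TABLE ... WITH (...) AS form.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: builds an Athena CTAS statement that pre-filters the funnel events.
final class FunnelCtasBuilder {

    // Quotes a value as a SQL string literal, doubling embedded single quotes.
    static String literal(String value) {
        return "'" + value.replace("'", "''") + "'";
    }

    // Table names are inlined as-is, so they are assumed to come from trusted configuration.
    static String build(String targetTable, String s3Location, String sourceTable,
                        String fromDate, String toDate, List<String> events) {
        if (events.isEmpty()) {
            throw new IllegalArgumentException("A funnel needs at least one event");
        }
        String inList = events.stream()
            .map(FunnelCtasBuilder::literal)
            .collect(Collectors.joining(", "));
        return "create table " + targetTable + "\n"
            + "with (format = 'PARQUET', external_location = " + literal(s3Location) + ") as\n"
            + "select user_id, epoch_utc, event_name, medium\n"
            + "from " + sourceTable + "\n"
            + "where date_parse(event_date,'%Y-%m-%d') >= date_parse(" + literal(fromDate) + ",'%Y-%m-%d')\n"
            + "and date_parse(event_date,'%Y-%m-%d') <= date_parse(" + literal(toDate) + ",'%Y-%m-%d')\n"
            + "and event_name in (" + inList + ")";
    }
}
```

Note that the CTAS table built this way does not include event_date and already contains only rows in the requested date range, so the date predicates in the funnel query would presumably need to be dropped when it reads from this table.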
