Find the next oldest row in Redshift

I have a table called user_activity in Redshift that has the columns department, user_id, activity_type, activity_id, activity_date.

I'd like to query a daily report of how many days it has been since the last event (of any type). Using CROSS APPLY (SQL Server) or LATERAL JOIN (Postgres 9.3+), I'd do something like...

SELECT d.date, a.last_activity_date
FROM date_table d
CROSS JOIN (
            SELECT DISTINCT user_id FROM activity_table
        ) u
CROSS APPLY (
                SELECT TOP 1 activity_date as last_activity_date
                FROM activity_table
                WHERE user_id = u.user_id AND activity_date <= d.date
                ORDER BY activity_date DESC
            ) a

For now, I write it similar to the below, but it is a bit slow and I am afraid it'll only get slower.

with user_activity as (
    select distinct activity_date, user_id from activity_table
)
select
    d.date, u.user_id,
    max(u.activity_date) as last_activity_date
from date_table d
inner join user_activity u on u.activity_date <= d.date
where d.date between '2020-01-01' and current_date
group by 1, 2

Can someone suggest a good alternative for my needs, or for CROSS APPLY / LATERAL JOIN?

As you are seeing, cross joins and inequality joins slow down as your data grows and are generally not the approach you want in Redshift. This is because these operations multiply the data size when applied to the large tables that are typical in Redshift.

You want to use window functions to perform this type of analysis, but you will need to step back and rethink how you structure the SQL. A MAX(activity_date) window function, partitioned by user_id, ordered by date, and with a frame clause covering all preceding rows, will find the most recent activity prior to any given activity.
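As a rough illustration of that window function, here is a minimal sketch. The inline activity_table rows are made-up sample data and previous_activity_date is a name chosen for this example. It assumes "all preceding rows" means a frame that stops one row before the current one.

```sql
-- Sketch: previous activity date for every activity row of the same user.
-- The inline activity_table rows are hypothetical sample data.
with activity_table as (
    select 1 as user_id, '2000-01-01'::date as activity_date
    union all
    select 1, '2000-01-04'::date
    union all
    select 3, '2000-01-03'::date
)
select
    user_id,
    activity_date,
    max(activity_date) over (
        partition by user_id
        order by activity_date
        -- all rows before the current one, excluding the current row
        rows between unbounded preceding and 1 preceding
    ) as previous_activity_date
from activity_table
order by user_id, activity_date;
```

The first activity of each user has no preceding rows, so previous_activity_date is NULL there. If the same user has several rows on one date, you would likely want to collapse them first, since ties in the ORDER BY make "previous row" ambiguous.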

On its own, this will produce rows only for the user_ids and dates that exist in the data table, and it looks like you want 1 row for each date for each user_id, right? To get that, you need to UNION in a frame of data that has 1 row for each date for each user_id ahead of the window function. You will need NULLs for the other columns so that the column lists match. You will also want the dates in a separate column from activity_date. Now all dates for all user ids will be in the source and the window function will give you the result you want.
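The steps above could be sketched roughly as follows. This is an untested illustration, not a definitive implementation: the inline date_table and activity_table rows are made-up sample data, the CTE names framework, combined and per_day are invented for this example, and the frame ends at the current row on the assumption that an activity on the report date itself should count (matching the `<=` in the question).

```sql
with date_table as (          -- hypothetical sample calendar
    select '2000-01-01'::date as date
    union all select '2000-01-02'::date
    union all select '2000-01-03'::date
    union all select '2000-01-04'::date
),
activity_table as (           -- hypothetical sample activity
    select 1 as user_id, '2000-01-01'::date as activity_date
    union all select 1, '2000-01-03'::date
    union all select 2, '2000-01-02'::date
),
framework as (
    -- one row per calendar date per user, with NULL in place of activity
    select u.user_id, d.date as cal_date, null::date as activity_date
    from date_table d
    cross join (select distinct user_id from activity_table) u
),
combined as (
    select user_id, cal_date, activity_date from framework
    union all
    -- real activity rows; the activity date doubles as the calendar date
    select user_id, activity_date as cal_date, activity_date
    from activity_table
),
per_day as (
    -- collapse to one row per user per date; max() skips the NULL framework rows
    select user_id, cal_date, max(activity_date) as activity_date
    from combined
    group by user_id, cal_date
)
select
    user_id,
    cal_date,
    max(activity_date) over (
        partition by user_id
        order by cal_date
        rows between unbounded preceding and current row
    ) as last_activity_date
from per_day
order by user_id, cal_date;
```

With this sample data, user 2 gets NULL on 2000-01-01 (before their first activity) and 2000-01-02 from then on. Note that activity dates falling outside the calendar would still appear as output rows; if that matters, filter them in an outer query after the window function has run.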

You also ask 'how is this better than the joins?' Well, in the joins you are replicating all the data records by the number of dates, which can get really big. In this approach you just have the original data records plus one row per user_id per date (which is the size of your output), so the working set does not multiply as the number of records per user_id grows.

--- Request to modify asker's code per comments made to their approach ---

Your code is definitely on the right track, as you have removed the massive inequality join of your original. I made 2 comments about it. The first is that I believe you need GROUP BY user_id, date to prevent multiple rows per user_id per date, which would result if there are records for the same user_id on a single date with differing activity_types. This is a simple oversight.

The second is that I intended for you to use UNION ALL, not LEFT JOIN, to combine the actual data and the user_id/date framework. Your approach works fine, but I have found that unioning very large amounts of data is generally faster than joining; you do need to make sure the columns match up. Either way we end up with a data segment with 3 columns: 2 date columns, one of which is NULL for framework rows, and 1 user_id. The difference in performance is likely very small unless you have huge tables.

Since you asked for a rewrite, here it is with both changes. (NOTE: my laptop is in the shop so I don't have ready access to Redshift at the moment and this SQL is untested. If the intent is not clear from this and you need me to debug it, that will be delayed by a few days. I'm keeping your setup methods and SQL structure.)

with date_table as (
    select '2000-01-01'::date as date
    union all
    select '2000-01-02'::date
    union all
    select '2000-01-03'::date
    union all
    select '2000-01-04'::date
    union all
    select '2000-01-05'::date
    union all
    select '2000-01-06'::date
),
users as (
    select 1 as user_id
    union all
    select 2
    union all
    select 3
),
user_activity as (
    select 1 as user_id, '2000-01-01'::date as activity_date
    union all
    select 1 as user_id, '2000-01-04'::date as activity_date
    union all
    select 3 as user_id, '2000-01-03'::date as activity_date
    union all
    select 1 as user_id, '2000-01-05'::date as activity_date
    union all
    select 1 as user_id, '2000-01-06'::date as activity_date
),
user_dates as (
    select d.date, u.user_id
    from date_table d
    cross join users u
),
user_date_activity as (
    -- one row per user per calendar date; max() ignores the NULL framework rows
    select cal_date, user_id,
        lag(max(activity_date), 1) ignore nulls over (partition by user_id order by cal_date) as last_activity_date
    from (
        -- framework rows: every date for every user, with no activity
        select user_id, date as cal_date, NULL::date as activity_date from user_dates
        union all
        -- actual activity rows
        select user_id, activity_date as cal_date, activity_date from user_activity
    ) combined
    group by user_id, cal_date
)
select * from user_date_activity
order by user_id, cal_date

This was my query based on Bill's answer.

with date_table as (
    select '2000-01-01'::date as date
    union all
    select '2000-01-02'::date
    union all
    select '2000-01-03'::date
    union all
    select '2000-01-04'::date
    union all
    select '2000-01-05'::date
    union all
    select '2000-01-06'::date
),
users as (
    select 1 as user_id
    union all
    select 2
    union all
    select 3
),
user_activity as (
    select 1 as user_id, '2000-01-01'::date as activity_date
    union all
    select 1 as user_id, '2000-01-04'::date as activity_date
    union all
    select 3 as user_id, '2000-01-03'::date as activity_date
    union all
    select 1 as user_id, '2000-01-05'::date as activity_date
    union all
    select 1 as user_id, '2000-01-06'::date as activity_date
),
user_dates as (
    select d.date, u.user_id
    from date_table d
    cross join users u
),
user_date_activity as (
    select ud.date, ud.user_id,
        lag(ua.activity_date, 1) ignore nulls over (partition by ud.user_id order by ud.date) as last_activity_date
    from user_dates ud
    left join user_activity ua on ud.date = ua.activity_date and ud.user_id = ua.user_id
)
select * from user_date_activity
order by user_id, date
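
To get from last_activity_date to the "days since" figure the question asks for, one option is Redshift's DATEDIFF. The sketch below uses a made-up two-row stand-in for the query result, with the date column renamed to cal_date for this example; days_since_last_activity is likewise a name invented here.

```sql
-- Sketch: days between the last activity and the report date.
-- The inline rows are hypothetical stand-ins for the query result above.
with user_date_activity as (
    select 1 as user_id,
           '2000-01-03'::date as cal_date,
           '2000-01-01'::date as last_activity_date
    union all
    select 2, '2000-01-03'::date, null::date
)
select
    user_id,
    cal_date,
    last_activity_date,
    -- NULL when the user has no earlier activity
    datediff(day, last_activity_date, cal_date) as days_since_last_activity
from user_date_activity
order by user_id, cal_date;
```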
