How to query rows grouped by matching string column but only count the most recent row for a specific set of keywords?

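I have a table of email events, where each row is keyed by a specific outgoing email record FK and a specific recipient user FK. New records can be inserted into this table at any time, in no particular order, and possibly concurrently from different threads. Here are the relevant columns: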

id (pk), email_id (fk), user_id (fk), event (string/name), created_at

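I am tallying overall event counts for a given email, such as how many emails were sent, how many bounced, and so on. However, I need to ignore certain combinations of email events for a given user, because they become obsolete when a newer event arrives. For example, if a row says that an email was "deferred" for a specific user, but a later event row says "delivered" or "bounced", then only the most recently added row among those related keywords should count, once, as the current state.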

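What would be a good way to do this at read time? I am having trouble because this needs several layers of grouping and I am hitting the limits of my SQL chops. Here is the query I am trying to enhance, as described below: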

select `event`, COUNT(1) as count, COUNT(DISTINCT user_id) as unique_count
from `email_activity`
where `email_id` = 7518
group by `event`

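For most events I want every row counted without any replacement, so grouping by event alone is fine in those cases. For example, "click" or "open" events should simply be added up.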

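However, if the same email_id/user_id pair has any number of "deferred", "bounced", or "delivered" events, I only want to count the one with the most recent created_at date and ignore all older ones.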

Example row set (email_id, event, user_id, created_at):

7518, "click", 25, 1-20-2021
7518, "click", 73, 1-5-2021
7518, "bounced", 45, 1-19-2021
7518, "deferred", 45, 1-17-2021
7518, "delivered", 19, 1-1-2021
7518, "delivered", 25, 1-1-2021
7518, "delivered", 73, 1-1-2021

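So the counts for the queried email 7518 would be:

2 "click", 3 "delivered", and 1 "bounced". The "deferred" row is ignored for user 45 because it is older. Only "bounced", "deferred", and "delivered" events are subject to this "only count the latest" rule; all other event names are always counted.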
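To pin down the rule before looking at SQL, here is a minimal reference sketch in plain Python over the sample rows above. The function name `tally` and the constant `LATEST_ONLY` are illustrative names, not part of the original question, and the dates are rewritten in ISO format so that they compare correctly as strings.

```python
from collections import Counter

# Events governed by the "only count the latest" rule (taken from the question).
LATEST_ONLY = {"deferred", "bounced", "delivered"}

# Sample rows from the question: (email_id, event, user_id, created_at as ISO date).
ROWS = [
    (7518, "click", 25, "2021-01-20"),
    (7518, "click", 73, "2021-01-05"),
    (7518, "bounced", 45, "2021-01-19"),
    (7518, "deferred", 45, "2021-01-17"),
    (7518, "delivered", 19, "2021-01-01"),
    (7518, "delivered", 25, "2021-01-01"),
    (7518, "delivered", 73, "2021-01-01"),
]


def tally(rows, email_id):
    """Count events for one email, keeping only the newest status row per user."""
    counts = Counter()
    latest = {}  # user_id -> (created_at, event) of the newest status row seen
    for eid, event, user_id, created_at in rows:
        if eid != email_id:
            continue
        if event in LATEST_ONLY:
            # ISO dates sort chronologically when compared as strings.
            if user_id not in latest or created_at > latest[user_id][0]:
                latest[user_id] = (created_at, event)
        else:
            counts[event] += 1  # every other event is always counted
    for _, event in latest.values():
        counts[event] += 1
    return dict(counts)


result = tally(ROWS, 7518)
print(result)
```

Any SQL answer can be checked against this reference on small data sets.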

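Solution 1:

If WITH clauses and window functions are available, I prefer to split the table into two parts, give each row an appropriate priority, combine the parts with UNION ALL, and finally aggregate only the rows with the highest priority.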

WITH specific_email_activity AS (
    SELECT * FROM email_activity WHERE email_id = 7518
),
specific_email_activity_with_priority AS (
    SELECT
        *,
        1 AS rank_priority
    FROM
        specific_email_activity
    WHERE
        event NOT IN ('deferred', 'bounced', 'delivered')
    UNION ALL
    SELECT
        *,
        ROW_NUMBER () over (PARTITION BY email_id, user_id ORDER BY created_at DESC) AS rank_priority
    FROM
        specific_email_activity
    WHERE
        event IN ('deferred', 'bounced', 'delivered')
)
SELECT
    email_id,
    event,
    COUNT(*) AS count_event,
    COUNT(DISTINCT user_id) AS unique_count_event
FROM
    specific_email_activity_with_priority
WHERE
    rank_priority = 1
GROUP BY
    email_id,
    event
ORDER BY
    email_id,
    event;

Solution 2:

If you cannot use WITH clauses or window functions, try the following code:

SELECT
    email_id,
    event,
    COUNT(*) AS count_event,
    COUNT(DISTINCT user_id) AS unique_count_event
FROM
    (
        SELECT
            *
        FROM
            email_activity
        WHERE
            email_id = 7518
            AND event NOT IN ('deferred', 'bounced', 'delivered')
        UNION ALL
        SELECT
            *
        FROM
            email_activity
        WHERE
            email_id = 7518
            AND event IN ('deferred', 'bounced', 'delivered')
            AND email_activity.created_at = (
                SELECT
                    MAX(created_at)
                FROM
                    email_activity AS ea
                WHERE
                    ea.email_id = 7518
                    AND ea.event IN ('deferred', 'bounced', 'delivered')
                    AND ea.email_id = email_activity.email_id
                    AND ea.user_id = email_activity.user_id
            )
    ) AS t
GROUP BY
    email_id,
    event
ORDER BY
    email_id,
    event;

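Output of Solutions 1 and 2:

+----------+-----------+-------------+--------------------+
| email_id |   event   | count_event | unique_count_event |
+----------+-----------+-------------+--------------------+
|   7518   | bounced   |      1      |         1          |
+----------+-----------+-------------+--------------------+
|   7518   | click     |      2      |         2          |
+----------+-----------+-------------+--------------------+
|   7518   | delivered |      3      |         3          |
+----------+-----------+-------------+--------------------+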

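Notes:

  • If more than one row has the same user_id and the same created_at, and their events are a combination of 'deferred', 'bounced', or 'delivered', you will get unexpected results. In that case you must decide which of them takes priority, and then modify the code according to that rule.
  • If event can be NULL, you must decide how NULL is handled in the aggregation, and then modify the code according to that rule.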
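As one way to make the first note concrete, the sketch below breaks ties on created_at by preferring the row with the larger id, so the later-inserted row wins. That tie-break rule is an assumption, not something the question specifies. It runs the Solution 1 pattern through Python's built-in sqlite3 module purely so it can be executed anywhere; it assumes the bundled SQLite is 3.25 or newer, which is needed for window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE email_activity(
    id INTEGER PRIMARY KEY, email_id INT, user_id INT,
    event TEXT, created_at TEXT);
-- User 45 has two status rows with the SAME created_at (a tie).
INSERT INTO email_activity(email_id, event, user_id, created_at) VALUES
    (7518, 'click',     25, '2021-01-20'),
    (7518, 'deferred',  45, '2021-01-19'),
    (7518, 'bounced',   45, '2021-01-19'),
    (7518, 'delivered', 19, '2021-01-01');
""")

QUERY = """
WITH other_rows AS (
    SELECT *, 1 AS rank_priority
    FROM email_activity
    WHERE email_id = :email_id
      AND event NOT IN ('deferred', 'bounced', 'delivered')
),
status_rows AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY email_id, user_id
               -- Assumed tie-break: on equal created_at, the larger id wins.
               ORDER BY created_at DESC, id DESC
           ) AS rank_priority
    FROM email_activity
    WHERE email_id = :email_id
      AND event IN ('deferred', 'bounced', 'delivered')
)
SELECT event, COUNT(*) AS count_event, COUNT(DISTINCT user_id) AS unique_count_event
FROM (SELECT * FROM other_rows UNION ALL SELECT * FROM status_rows)
WHERE rank_priority = 1
GROUP BY event
ORDER BY event;
"""

rows = conn.execute(QUERY, {"email_id": 7518}).fetchall()
print(rows)
```

Without the extra `id DESC` key, ROW_NUMBER still picks exactly one of the tied rows, but which one is not defined. Regarding the second note, a row whose event is NULL satisfies neither `IN (...)` nor `NOT IN (...)`, so with filters written this way it would be left out of both branches.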

Sample table creation:

The sample table can be created with the following SQL:

CREATE TABLE IF NOT EXISTS email_activity(id SERIAL PRIMARY KEY, email_id INT, user_id INT, event VARCHAR(16), created_at DATE);
INSERT INTO email_activity(email_id, event, user_id, created_at) VALUES(7518, 'click', 25, '2021-1-20');
INSERT INTO email_activity(email_id, event, user_id, created_at) VALUES(7518, 'click', 73, '2021-1-5');
INSERT INTO email_activity(email_id, event, user_id, created_at) VALUES(7518, 'bounced', 45, '2021-1-19');
INSERT INTO email_activity(email_id, event, user_id, created_at) VALUES(7518, 'deferred', 45, '2021-1-17');
INSERT INTO email_activity(email_id, event, user_id, created_at) VALUES(7518, 'delivered', 19, '2021-1-1');
INSERT INTO email_activity(email_id, event, user_id, created_at) VALUES(7518, 'delivered', 25, '2021-1-1');
INSERT INTO email_activity(email_id, event, user_id, created_at) VALUES(7518, 'delivered', 73,'2021-1-1');
INSERT INTO email_activity(email_id, event, user_id, created_at) VALUES(9999, 'click', 25, '2021-1-20');
INSERT INTO email_activity(email_id, event, user_id, created_at) VALUES(9999, 'click', 73, '2021-1-5');
INSERT INTO email_activity(email_id, event, user_id, created_at) VALUES(9999, 'bounced', 45, '2021-1-19');
INSERT INTO email_activity(email_id, event, user_id, created_at) VALUES(9999, 'deferred', 45, '2021-1-17');
INSERT INTO email_activity(email_id, event, user_id, created_at) VALUES(9999, 'delivered', 19, '2021-1-1');
INSERT INTO email_activity(email_id, event, user_id, created_at) VALUES(9999, 'delivered', 25, '2021-1-1');
INSERT INTO email_activity(email_id, event, user_id, created_at) VALUES(9999, 'delivered', 73,'2021-1-1');

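I did this in a Postgres database as shown below. Step 4 is the main query and can be used directly. I added the first three steps only to make each subquery easier to understand.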

    create table email_event(id serial, email_id integer, user_id integer, event varchar(10), created_at date);
    
    insert into email_event(email_id, event, user_id, created_at) values(7518, 'click', 25, '1-20-2021');
    insert into email_event(email_id, event, user_id, created_at) values(7518, 'click', 73, '1-5-2021');
    insert into email_event(email_id, event, user_id, created_at)values(7518, 'bounced', 45, '1-19-2021');
    insert into email_event(email_id, event, user_id, created_at) values(7518, 'deferred', 45, '1-17-2021');
    insert into email_event(email_id, event, user_id, created_at)values(7518, 'delivered', 19, '1-1-2021');
    insert into email_event(email_id, event, user_id, created_at) values(7518, 'delivered', 25, '1-1-2021');
    insert into email_event(email_id, event, user_id, created_at) values(7518, 'delivered', 73, '1-1-2021');
  1. First, we flag the event category:

     select email_id, user_id,event, created_at, case when event in ('bounced', 'deferred', 'delivered') then 'Y' else 'N' end as pick_latest_flag from email_event;

  2. Then we partition by pick_latest_flag and rank the rows within each email_id, user_id, and flag:

     select a.*, row_number () over (partition by email_id, user_id, pick_latest_flag order by created_at desc) rn from ( select email_id, user_id,event, created_at, case when event in ('bounced', 'deferred', 'delivered') then 'Y' else 'N' end as pick_latest_flag from email_event ) A;

  3. Then we filter the pick_latest_flag records by row number, partitioning by email_id as in step 2 so that rows from different emails are ranked separately:

     select * from ( select a.*, row_number () over (partition by email_id, user_id, pick_latest_flag order by created_at desc) rn from ( select email_id, user_id,event, created_at, case when event in ('bounced', 'deferred', 'delivered') then 'Y' else 'N' end as pick_latest_flag from email_event ) A ) b where pick_latest_flag = 'N' or (pick_latest_flag = 'Y' and rn = 1);

  4. In the last step, we group the rows by email_id and event:

     select email_id, event, count(*) from ( select * from ( select a.*, row_number () over (partition by email_id, user_id, pick_latest_flag order by created_at desc) rn from ( select email_id, user_id,event, created_at, case when event in ('bounced', 'deferred', 'delivered') then 'Y' else 'N' end as pick_latest_flag from email_event ) A ) b where pick_latest_flag = 'N' or (pick_latest_flag = 'Y' and rn = 1) ) c group by email_id,event order by event;

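Assuming you pass email_id as a parameter, the following query should give the expected result for the sample data. Note that it drops every 'deferred' row whenever the email has at least one 'deferred' event, rather than comparing created_at values, and it returns no rows for an email that has no 'deferred' event: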

select sum(case when f.user_id is not null then 1 else 0 end) sum_event, f.event
from (select * from email_event e1
       where exists (select id from email_event e2 where e2.email_id = e1.email_id and e2.event = 'deferred')
        and e1.event <> 'deferred') f
 where f.email_id = 7518 group by f.event;

Output when tested with email_id = 7518:

+-----------+-----------+
| sum_event |   event   |
+-----------+-----------+
|     2     | click     |
+-----------+-----------+
|     1     | bounced   |
+-----------+-----------+
|     3     | delivered |
+-----------+-----------+

SQL

with cte AS
  (select *
   from `email_activity` ea1
   where `email_id` = 7518
     and (`event` not in ('bounced', 'deferred', 'delivered')
      or not exists (select * from `email_activity` ea2
                     where ea2.`email_id` = ea1.`email_id`
                     and ea2.`user_id` = ea1.`user_id`
                     and ea2.`event` IN ('bounced', 'deferred', 'delivered')
                     and ea2.`created_at` > ea1.`created_at`)))
select `event`, COUNT(1) as count, COUNT(DISTINCT user_id) as unique_count
from cte
group by `event`;

Demo

https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=725c38e7068fdd68cfb7315a798bdd7e

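Explanation

The common table expression (CTE) includes all rows whose event type is not 'bounced', 'deferred', or 'delivered' (that is, 'click', unless there are other possibilities I am not aware of). It also includes rows whose event type is in that list, provided the same user has no more recent record with an event type in that list.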
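The NOT EXISTS idea described above can be tried out with the following sketch, which uses Python's built-in sqlite3 module so it runs without a database server. Two details are assumptions about the intended behaviour: the OR branch is wrapped in parentheses so the email_id filter applies to both branches, and the subquery is correlated on email_id as well as user_id so that events from other emails do not influence the result. The extra row for email 9999 is hypothetical test data added to exercise that case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE email_activity("
    "id INTEGER PRIMARY KEY, email_id INT, user_id INT, event TEXT, created_at TEXT)"
)
sample = [
    (7518, "click", 25, "2021-01-20"),
    (7518, "click", 73, "2021-01-05"),
    (7518, "bounced", 45, "2021-01-19"),
    (7518, "deferred", 45, "2021-01-17"),
    (7518, "delivered", 19, "2021-01-01"),
    (7518, "delivered", 25, "2021-01-01"),
    (7518, "delivered", 73, "2021-01-01"),
    # Hypothetical row: a later status event for user 45 on a DIFFERENT email.
    (9999, "delivered", 45, "2021-02-01"),
]
conn.executemany(
    "INSERT INTO email_activity(email_id, event, user_id, created_at) "
    "VALUES (?, ?, ?, ?)",
    sample,
)

QUERY = """
SELECT ea1.event,
       COUNT(*) AS event_count,
       COUNT(DISTINCT ea1.user_id) AS unique_count
FROM email_activity AS ea1
WHERE ea1.email_id = ?
  AND (
        -- Events outside the status list are always kept.
        ea1.event NOT IN ('bounced', 'deferred', 'delivered')
        -- A status row is kept only if no newer status row exists
        -- for the same email and user.
        OR NOT EXISTS (
            SELECT 1
            FROM email_activity AS ea2
            WHERE ea2.email_id = ea1.email_id
              AND ea2.user_id = ea1.user_id
              AND ea2.event IN ('bounced', 'deferred', 'delivered')
              AND ea2.created_at > ea1.created_at
        )
      )
GROUP BY ea1.event
ORDER BY ea1.event
"""

counts = conn.execute(QUERY, (7518,)).fetchall()
print(counts)
```

Because AND binds more tightly than OR in SQL, leaving out the parentheses would let the NOT EXISTS branch match rows from any email.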
