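How to query rows grouped by matching string column but only count the most recent row for a specific set of keywords?
I have a table of email events, where each row is keyed by an fk to a specific outgoing email record and an fk to a specific recipient user. New records can be inserted into this table at any time, in no particular order, possibly even concurrently from different threads. Here are the relevant columns: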
id (pk), email_id (fk), user_id (fk), event (string/name), created_at
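I am computing overall event counts for a given email, such as how many emails were sent, how many bounced, and so on. However, I need to ignore certain combinations of email events for a specific user, because they become obsolete when newer events come in. For example, if a row says the email was "deferred" for a specific user, but a new event row is later inserted saying "delivered" or "bounced", then I want only the most recently added row among those related keywords to count, once, as the current state.
What is a good way to do this at read time? I am having trouble because I need multiple layers of grouping and I am hitting the limits of my SQL chops. This is the query I am trying to enhance, as described below: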
select `event`, COUNT(1) as count, COUNT(DISTINCT user_id) as unique_count
from `email_activity`
where `email_id` = 7518
group by `event`
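For most events I want every row counted with no replacement, so grouping by event alone is fine in those cases. For example, if something is a "click" or "open" event, just add them up.
However, if the same email_id/user_id has any number of "deferred", "bounced", or "delivered" events, I want to count only the one with the most recent created_at date and ignore all older ones.
Example row set (email_id, event, user_id, created_at):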
7518, "click", 25, 1-20-2021
7518, "click", 73, 1-5-2021
7518, "bounced", 45, 1-19-2021
7518, "deferred", 45, 1-17-2021
7518, "delivered", 19, 1-1-2021
7518, "delivered", 25, 1-1-2021
7518, "delivered", 73, 1-1-2021
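So the counts for the queried email 7518 would be:
2 "click", 3 "delivered", and 1 "bounced". The "deferred" row for user 45 is ignored because it is older (only "bounced", "deferred", and "delivered" events are subject to this "count only the latest" rule; all other event names are always counted).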
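The counting rule described above can be sketched outside SQL to make the expected result concrete. This is a minimal illustration in Python, not part of the original question. The names `ROWS`, `LATEST_ONLY`, and `count_events` are hypothetical, and the dates are rewritten in ISO format so that plain string comparison orders them correctly.

```python
from collections import Counter

# Sample rows from the question: (email_id, event, user_id, created_at as ISO date)
ROWS = [
    (7518, "click", 25, "2021-01-20"),
    (7518, "click", 73, "2021-01-05"),
    (7518, "bounced", 45, "2021-01-19"),
    (7518, "deferred", 45, "2021-01-17"),
    (7518, "delivered", 19, "2021-01-01"),
    (7518, "delivered", 25, "2021-01-01"),
    (7518, "delivered", 73, "2021-01-01"),
]

# Status events that supersede one another per (email_id, user_id)
LATEST_ONLY = {"deferred", "bounced", "delivered"}


def count_events(rows, email_id):
    """Count events for one email, keeping only the newest status event per user."""
    counts = Counter()
    latest = {}  # user_id -> (created_at, event) of the newest status row seen so far
    for row_email, event, user_id, created_at in rows:
        if row_email != email_id:
            continue
        if event not in LATEST_ONLY:
            counts[event] += 1  # clicks, opens, etc. are always counted
        elif user_id not in latest or created_at > latest[user_id][0]:
            latest[user_id] = (created_at, event)
    # Each user contributes exactly one status event: the newest one
    for _, event in latest.values():
        counts[event] += 1
    return dict(counts)


print(count_events(ROWS, 7518))  # {'click': 2, 'bounced': 1, 'delivered': 3}
```

Any SQL answer should reproduce these three counts for email 7518.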
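If WITH clauses and window functions are available, I prefer to split the table into two parts, assign each row an appropriate priority, combine them with UNION ALL, and finally aggregate only the rows with the highest priority.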
WITH specific_email_activity AS (
SELECT * FROM email_activity WHERE email_id = 7518
),
specific_email_activity_with_priority AS (
SELECT
*,
1 AS rank_priority
FROM
specific_email_activity
WHERE
event NOT IN ('deferred', 'bounced', 'delivered')
UNION ALL
SELECT
*,
ROW_NUMBER () over (PARTITION BY email_id, user_id ORDER BY created_at DESC) AS rank_priority
FROM
specific_email_activity
WHERE
event IN ('deferred', 'bounced', 'delivered')
)
SELECT
email_id,
event,
COUNT(*) AS count_event,
COUNT(DISTINCT user_id) AS unique_count_event
FROM
specific_email_activity_with_priority
WHERE
rank_priority = 1
GROUP BY
email_id,
event
ORDER BY
email_id,
event;
If you cannot use WITH clauses or window functions, try the following code:
SELECT
email_id,
event,
COUNT(*) AS count_event,
COUNT(DISTINCT user_id) AS unique_count_event
FROM
(
SELECT
*
FROM
email_activity
WHERE
email_id = 7518
AND event NOT IN ('deferred', 'bounced', 'delivered')
UNION ALL
SELECT
*
FROM
email_activity
WHERE
email_id = 7518
AND event IN ('deferred', 'bounced', 'delivered')
AND email_activity.created_at = (
SELECT
MAX(created_at)
FROM
email_activity AS ea
WHERE
email_id = 7518
AND event IN ('deferred', 'bounced', 'delivered')
AND email_id = email_activity.email_id
AND user_id = email_activity.user_id
)
) AS t
GROUP BY
email_id,
event
ORDER BY
email_id,
event;
email_id | event | count_event | unique_count_event |
---|---|---|---|
7518 | bounced | 1 | 1 |
7518 | click | 2 | 2 |
7518 | delivered | 3 | 3 |
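If more than one row shares the same email_id, user_id, and created_at, and their event values are a combination of 'deferred', 'bounced', or 'delivered', you will get unexpected results. In that case, you must decide which of them takes priority when counting, and then modify the code according to that rule.
If event can be NULL, you must decide how NULL is handled in the aggregation, and then modify the code according to that rule.
The sample table can be created with the following SQL: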
CREATE TABLE IF NOT EXISTS email_activity(id SERIAL PRIMARY KEY, email_id INT, user_id INT, event VARCHAR(16), created_at DATE);
INSERT INTO email_activity(email_id, event, user_id, created_at) VALUES(7518, 'click', 25, '2021-1-20');
INSERT INTO email_activity(email_id, event, user_id, created_at) VALUES(7518, 'click', 73, '2021-1-5');
INSERT INTO email_activity(email_id, event, user_id, created_at) VALUES(7518, 'bounced', 45, '2021-1-19');
INSERT INTO email_activity(email_id, event, user_id, created_at) VALUES(7518, 'deferred', 45, '2021-1-17');
INSERT INTO email_activity(email_id, event, user_id, created_at) VALUES(7518, 'delivered', 19, '2021-1-1');
INSERT INTO email_activity(email_id, event, user_id, created_at) VALUES(7518, 'delivered', 25, '2021-1-1');
INSERT INTO email_activity(email_id, event, user_id, created_at) VALUES(7518, 'delivered', 73,'2021-1-1');
INSERT INTO email_activity(email_id, event, user_id, created_at) VALUES(9999, 'click', 25, '2021-1-20');
INSERT INTO email_activity(email_id, event, user_id, created_at) VALUES(9999, 'click', 73, '2021-1-5');
INSERT INTO email_activity(email_id, event, user_id, created_at) VALUES(9999, 'bounced', 45, '2021-1-19');
INSERT INTO email_activity(email_id, event, user_id, created_at) VALUES(9999, 'deferred', 45, '2021-1-17');
INSERT INTO email_activity(email_id, event, user_id, created_at) VALUES(9999, 'delivered', 19, '2021-1-1');
INSERT INTO email_activity(email_id, event, user_id, created_at) VALUES(9999, 'delivered', 25, '2021-1-1');
INSERT INTO email_activity(email_id, event, user_id, created_at) VALUES(9999, 'delivered', 73,'2021-1-1');
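Regarding the tie case noted earlier (several status rows for the same user with the same created_at), one possible rule is to let the row with the higher id win. The sketch below assumes that rule, which the original answer does not prescribe, and runs it with Python's `sqlite3` module (SQLite 3.25 or later is assumed for window functions). The tied rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE email_activity ("
    "id INTEGER PRIMARY KEY, email_id INT, user_id INT, event TEXT, created_at TEXT)"
)
# Hypothetical tie: user 45 gets 'deferred' and 'bounced' on the same date.
conn.executemany(
    "INSERT INTO email_activity (email_id, event, user_id, created_at) VALUES (?, ?, ?, ?)",
    [
        (7518, "deferred", 45, "2021-01-19"),
        (7518, "bounced", 45, "2021-01-19"),
        (7518, "delivered", 19, "2021-01-01"),
    ],
)

QUERY = """
SELECT event, COUNT(*) AS count_event
FROM (
    SELECT
        event,
        ROW_NUMBER() OVER (
            PARTITION BY email_id, user_id
            ORDER BY created_at DESC, id DESC  -- assumed rule: higher id wins a tie
        ) AS rank_priority
    FROM email_activity
    WHERE email_id = ?
      AND event IN ('deferred', 'bounced', 'delivered')
) AS ranked
WHERE rank_priority = 1
GROUP BY event
ORDER BY event
"""

result = conn.execute(QUERY, (7518,)).fetchall()
print(result)  # [('bounced', 1), ('delivered', 1)]
```

Without the `id DESC` tie-breaker, which of the two tied rows receives rank 1 is not guaranteed, so the count could land on either 'deferred' or 'bounced'.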
I did this in a postgres db as shown below. Step 4 is the main query and can be used directly. I only added the first 3 steps to make each subquery easier to understand.
create table email_event(id serial, email_id integer, user_id integer, event varchar(10), created_at date);
insert into email_event(email_id, event, user_id, created_at) values(7518, 'click', 25, '1-20-2021');
insert into email_event(email_id, event, user_id, created_at) values(7518, 'click', 73, '1-5-2021');
insert into email_event(email_id, event, user_id, created_at)values(7518, 'bounced', 45, '1-19-2021');
insert into email_event(email_id, event, user_id, created_at) values(7518, 'deferred', 45, '1-17-2021');
insert into email_event(email_id, event, user_id, created_at)values(7518, 'delivered', 19, '1-1-2021');
insert into email_event(email_id, event, user_id, created_at) values(7518, 'delivered', 25, '1-1-2021');
insert into email_event(email_id, event, user_id, created_at) values(7518, 'delivered', 73, '1-1-2021');
First, we flag the event category:
select email_id, user_id,event, created_at, case when event in ('bounced', 'deferred', 'delivered') then 'Y' else 'N' end as pick_latest_flag from email_event;
Then we partition on pick_latest_flag and rank the rows within each email_id, user_id, and flag.
select a.*, row_number () over (partition by email_id, user_id, pick_latest_flag order by created_at desc) rn from ( select email_id, user_id,event, created_at, case when event in ('bounced', 'deferred', 'delivered') then 'Y' else 'N' end as pick_latest_flag from email_event ) A;
Then we filter the pick_latest_flag records by row number.
select * from ( select a.*, row_number () over (partition by email_id, user_id, pick_latest_flag order by created_at desc) rn from ( select email_id, user_id,event, created_at, case when event in ('bounced', 'deferred', 'delivered') then 'Y' else 'N' end as pick_latest_flag from email_event ) A ) b where pick_latest_flag = 'N' or (pick_latest_flag = 'Y' and rn = 1);
In the final step, we group them by email_id and event:
select email_id, event, count(*) from ( select * from ( select a.*, row_number () over (partition by email_id, user_id, pick_latest_flag order by created_at desc) rn from ( select email_id, user_id,event, created_at, case when event in ('bounced', 'deferred', 'delivered') then 'Y' else 'N' end as pick_latest_flag from email_event ) A ) b where pick_latest_flag = 'N' or (pick_latest_flag = 'Y' and rn = 1) ) c group by email_id,event order by event;
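The flag-and-rank steps can be checked against a table that holds more than one email. The sketch below is an illustration using Python's `sqlite3` module rather than postgres (SQLite 3.25 or later is assumed), with the ranking partitioned by email_id, user_id, and pick_latest_flag as in step 2. The row for email 9999 is hypothetical and was added only to show that a newer status event on another email does not displace the one for email 7518.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE email_event ("
    "id INTEGER PRIMARY KEY, email_id INT, user_id INT, event TEXT, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO email_event (email_id, event, user_id, created_at) VALUES (?, ?, ?, ?)",
    [
        (7518, "click", 25, "2021-01-20"),
        (7518, "click", 73, "2021-01-05"),
        (7518, "bounced", 45, "2021-01-19"),
        (7518, "deferred", 45, "2021-01-17"),
        (7518, "delivered", 19, "2021-01-01"),
        (7518, "delivered", 25, "2021-01-01"),
        (7518, "delivered", 73, "2021-01-01"),
        # Hypothetical extra row: a newer status event for user 45 on another email
        (9999, "delivered", 45, "2021-02-01"),
    ],
)

QUERY = """
WITH flagged AS (
    SELECT email_id, user_id, event, created_at,
           CASE WHEN event IN ('bounced', 'deferred', 'delivered')
                THEN 'Y' ELSE 'N' END AS pick_latest_flag
    FROM email_event
),
ranked AS (
    SELECT email_id, user_id, event, pick_latest_flag,
           ROW_NUMBER() OVER (
               PARTITION BY email_id, user_id, pick_latest_flag
               ORDER BY created_at DESC
           ) AS rn
    FROM flagged
)
SELECT email_id, event, COUNT(*) AS count_event
FROM ranked
WHERE pick_latest_flag = 'N' OR rn = 1  -- keep all 'N' rows, newest 'Y' row only
GROUP BY email_id, event
ORDER BY email_id, event
"""

counts = conn.execute(QUERY).fetchall()
print(counts)
# [(7518, 'bounced', 1), (7518, 'click', 2), (7518, 'delivered', 3), (9999, 'delivered', 1)]
```

If email_id were left out of the PARTITION BY, the 'delivered' row on email 9999 would take rank 1 for user 45 and the 'bounced' row on email 7518 would be dropped.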
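Assuming you pass email_id as a parameter, the following query gives the expected result for the sample data. Note that it returns rows only for emails that have at least one 'deferred' event, and it drops every 'deferred' row rather than comparing created_at per user: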
select sum(case when f.user_id is not null then 1 else 0 end) sum_event, f.event
from (select * from email_event e1
where exists (select id from email_event e2 where e2.email_id = e1.email_id and e2.event = 'deferred')
and e1.event <> 'deferred') f
where f.email_id = 7518 group by f.event;
Output when tested with email_id = 7518:
+-----------+-----------+
| sum_event | event |
+-----------+-----------+
| 2 | click |
+-----------+-----------+
| 1 | bounced |
+-----------+-----------+
| 3 | delivered |
+-----------+-----------+
with cte AS
(select *
from `email_activity` ea1
where `email_id` = 7518
-- parentheses keep the email_id filter applied to both branches of the OR
and (`event` not in ('bounced', 'deferred', 'delivered')
or not exists (select * from `email_activity` ea2
where ea2.`email_id` = ea1.`email_id`
and ea2.`user_id` = ea1.`user_id`
and ea2.`event` IN ('bounced', 'deferred', 'delivered')
and ea2.`created_at` > ea1.`created_at`)))
select `event`, COUNT(1) as count, COUNT(DISTINCT user_id) as unique_count
from cte
group by `event`;
https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=725c38e7068fdd68cfb7315a798bdd7e
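The common table expression (CTE) includes all rows whose event type is not 'bounced', 'deferred', or 'delivered' (that is, 'click', unless there are other possibilities I am not aware of). It also includes rows whose event type is in that list, provided there is no more recent record for the same user with an event type in that list.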
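The NOT EXISTS idea described above can be sketched without a CTE or window functions. The version below is an illustration run through Python's `sqlite3` module, not the original MySQL fiddle. It assumes the intent is to apply the email_id filter to both branches of the OR and to compare only rows of the same email, so it adds explicit parentheses and an email_id correlation. The row for email 9999 is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE email_activity ("
    "id INTEGER PRIMARY KEY, email_id INT, user_id INT, event TEXT, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO email_activity (email_id, event, user_id, created_at) VALUES (?, ?, ?, ?)",
    [
        (7518, "click", 25, "2021-01-20"),
        (7518, "click", 73, "2021-01-05"),
        (7518, "bounced", 45, "2021-01-19"),
        (7518, "deferred", 45, "2021-01-17"),
        (7518, "delivered", 19, "2021-01-01"),
        (7518, "delivered", 25, "2021-01-01"),
        (7518, "delivered", 73, "2021-01-01"),
        # Hypothetical row on another email; it must not affect email 7518
        (9999, "delivered", 45, "2021-02-01"),
    ],
)

QUERY = """
SELECT ea1.event,
       COUNT(*) AS event_count,
       COUNT(DISTINCT ea1.user_id) AS unique_count
FROM email_activity AS ea1
WHERE ea1.email_id = ?
  AND (
        ea1.event NOT IN ('bounced', 'deferred', 'delivered')
        OR NOT EXISTS (              -- no newer status row for this email and user
            SELECT 1
            FROM email_activity AS ea2
            WHERE ea2.email_id = ea1.email_id
              AND ea2.user_id = ea1.user_id
              AND ea2.event IN ('bounced', 'deferred', 'delivered')
              AND ea2.created_at > ea1.created_at
        )
      )
GROUP BY ea1.event
ORDER BY ea1.event
"""

totals = conn.execute(QUERY, (7518,)).fetchall()
print(totals)  # [('bounced', 1, 1), ('click', 2, 2), ('delivered', 3, 3)]
```

Because AND binds more tightly than OR, leaving out the parentheses would let rows from other emails through the NOT EXISTS branch.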