BigQuery SQL - Concatenate two columns if they are on consecutive days

I am looking for a way to adjust this SQL query, running in BigQuery, so that it returns a single total count for Sent EventTypes that occur two or even three days in a row.

SELECT date(EventDate) as EventDate, EventType, count(*) as count FROM `Database.Table`
    where date(EventDate) > DATE_SUB (CURRENT_DATE, INTERVAL 100 DAY)
    Group by 1,2 
    ORDER by 1,2

Response from the above query:

| Row    | EventDate | EventType | count |
| ------ | --------- |-----------|-------|
| 1      | 2019-02-06|  Sent     |    4  |
| 2      | 2019-02-07|  Sent     |    5  |
| 3      | 2019-02-12|  NotSent  |    7  |
| 4      | 2019-02-13|  Bounces  |    22 |
| 5      | 2019-02-14|  Bounces  |    22 |
| 6      | 2019-03-06|  Sent     |    2  |
| 7      | 2019-03-07|  Sent     |    4  |
| 8      | 2019-03-07|  NotSent  |    5  |
| 9      | 2019-03-12|  Bounces  |    7  |
| 10     | 2019-03-13|  Sent     |    22 |
| 11     | 2019-04-05|  Sent     |    2  |

Response I would like to get:

| Row    | EventDate | EventType | count |
| ------ | --------- |-----------|-------|
| 1      | 2019-02-06|  Sent     |    9  |
| 2      | 2019-02-12|  NotSent  |    7  |
| 3      | 2019-02-13|  Bounces  |    22 |
| 4      | 2019-02-14|  Bounces  |    22 |
| 5      | 2019-03-06|  Sent     |    6  |
| 6      | 2019-03-07|  NotSent  |    5  |
| 7      | 2019-03-12|  Bounces  |    7  |
| 8      | 2019-03-13|  Sent     |    22 |
| 9      | 2019-04-05|  Sent     |    2  |

Something along those lines: I want to combine the counts for the EventType 'Sent' on consecutive days into a single row, and show the other EventTypes, such as Bounces and NotSent, without combining them.

I wrote a query that merges consecutive days in the table into a single row.
It gives exactly the output you want.

I think you meant '2019-03-06' in the 5th row, so I fixed it in my dummy data section.

WITH
data AS (
  SELECT CAST('2019-02-06' as date) as EventDate, 4 as count union all
  SELECT CAST('2019-02-07' as date) as EventDate, 5 as count union all
  SELECT CAST('2019-02-12' as date) as EventDate, 7 as count union all
  SELECT CAST('2019-02-13' as date) as EventDate, 22 as count union all
  SELECT CAST('2019-03-06' as date) as EventDate, 2 as count
),
data_with_steps AS (
  SELECT *, 
    IF(DATE_DIFF(EventDate, LAG(EventDate) OVER (ORDER BY EventDate), day) > 2, 1, 0) as new_step
  FROM data
),
data_grouped AS (
  SELECT *, 
    SUM(new_step) OVER (ORDER BY EventDate) as step_group
  FROM data_with_steps
)
SELECT MIN(EventDate) as EventDate, sum(count) as count
FROM data_grouped
GROUP BY step_group

So, how does it work?
First, I calculate the difference in days between each row's date and the previous row's date. If it is more than 2 days, I set the new column new_step to 1, otherwise to 0. (To merge only strictly adjacent days, compare with > 1 instead.)
Then, I calculate the cumulative sum of the new_step column and name it step_group.
The output of the first two steps is:
[image]

In the final step, I group the table by step_group, take the minimum date as the event date, and sum the counts to get the group count.
[image]
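
The intermediate columns can also be inspected directly. The following is a minimal sketch that reuses the dummy data and the first two CTEs from the query above, but selects new_step and step_group instead of aggregating. The expected rows in the trailing comment were worked out by hand from the data, not copied from a BigQuery run.

```sql
-- Sketch: same dummy data and step logic as the query above,
-- but returning the helper columns instead of grouping on them.
WITH
data AS (
  SELECT DATE '2019-02-06' AS EventDate, 4 AS count UNION ALL
  SELECT DATE '2019-02-07', 5 UNION ALL
  SELECT DATE '2019-02-12', 7 UNION ALL
  SELECT DATE '2019-02-13', 22 UNION ALL
  SELECT DATE '2019-03-06', 2
),
data_with_steps AS (
  SELECT *,
    -- LAG is NULL on the first row, so the comparison is NULL and IF falls through to 0
    IF(DATE_DIFF(EventDate, LAG(EventDate) OVER (ORDER BY EventDate), DAY) > 2, 1, 0) AS new_step
  FROM data
)
SELECT *,
  -- running total of new_step: every 1 starts a new group
  SUM(new_step) OVER (ORDER BY EventDate) AS step_group
FROM data_with_steps
ORDER BY EventDate
-- Expected rows (EventDate, count, new_step, step_group):
-- 2019-02-06   4  0  0
-- 2019-02-07   5  0  0
-- 2019-02-12   7  1  1
-- 2019-02-13  22  0  1
-- 2019-03-06   2  1  2
```

Rows that share a step_group value are the ones the final GROUP BY collapses into a single row.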

Edit: To include the other events without grouping them, I added a new version. I think the most intuitive and easiest way to do that is UNION ALL. The updated query merges only the 'Sent' rows and appends the other events unchanged.

WITH
data AS (
  SELECT CAST('2019-02-06' as date) as EventDate, 'Sent' as EventType, 4 as count union all
  SELECT CAST('2019-02-07' as date) as EventDate, 'Sent' as EventType, 5 as count union all
  SELECT CAST('2019-02-12' as date) as EventDate, 'Sent' as EventType, 7 as count union all
  SELECT CAST('2019-02-13' as date) as EventDate, 'Sent' as EventType, 22 as count union all
  SELECT CAST('2019-03-06' as date) as EventDate, 'Sent' as EventType, 2 as count union all
  SELECT CAST('2019-02-12' as date) as EventDate, 'NotSent' as EventType, 7 as count union all
  SELECT CAST('2019-03-07' as date) as EventDate, 'NotSent' as EventType, 5 as count union all
  SELECT CAST('2019-02-13' as date) as EventDate, 'Bounces' as EventType, 22 as count union all
  SELECT CAST('2019-02-14' as date) as EventDate, 'Bounces' as EventType, 22 as count union all
  SELECT CAST('2019-03-12' as date) as EventDate, 'Bounces' as EventType, 7 as count
),
data_with_steps AS (
  SELECT *, 
    IF(DATE_DIFF(EventDate, LAG(EventDate) OVER (ORDER BY EventDate), day) > 2, 1, 0) as new_step
  FROM data
  WHERE EventType = 'Sent'
),
data_grouped AS (
  SELECT *, 
    SUM(new_step) OVER (ORDER BY EventDate) as step_group
  FROM data_with_steps
)
SELECT EventType, MIN(EventDate) as EventDate, sum(count) as count
FROM data_grouped
GROUP BY EventType, step_group

UNION ALL

SELECT EventType, EventDate, count
FROM data
WHERE EventType != 'Sent'
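
As an alternative to UNION ALL, the same result can probably be obtained in a single pass. The sketch below is not part of the original answer. It assumes the sample rows from the question as inline data, partitions the window functions by EventType, and only lets a row join the previous group when the EventType is 'Sent' and the previous date is exactly one day earlier.

```sql
-- Sketch under assumptions: inline copy of the question's sample result.
WITH
data AS (
  SELECT DATE '2019-02-06' AS EventDate, 'Sent' AS EventType, 4 AS count UNION ALL
  SELECT DATE '2019-02-07', 'Sent', 5 UNION ALL
  SELECT DATE '2019-02-12', 'NotSent', 7 UNION ALL
  SELECT DATE '2019-02-13', 'Bounces', 22 UNION ALL
  SELECT DATE '2019-02-14', 'Bounces', 22 UNION ALL
  SELECT DATE '2019-03-06', 'Sent', 2 UNION ALL
  SELECT DATE '2019-03-07', 'Sent', 4 UNION ALL
  SELECT DATE '2019-03-07', 'NotSent', 5 UNION ALL
  SELECT DATE '2019-03-12', 'Bounces', 7 UNION ALL
  SELECT DATE '2019-03-13', 'Sent', 22 UNION ALL
  SELECT DATE '2019-04-05', 'Sent', 2
),
data_with_steps AS (
  SELECT *,
    -- 0 = continue the previous group; 1 = start a new group.
    -- Only 'Sent' rows exactly one day after the previous 'Sent' row continue a group.
    IF(EventType = 'Sent'
       AND DATE_DIFF(EventDate,
                     LAG(EventDate) OVER (PARTITION BY EventType ORDER BY EventDate),
                     DAY) = 1,
       0, 1) AS new_step
  FROM data
),
data_grouped AS (
  SELECT *,
    SUM(new_step) OVER (PARTITION BY EventType ORDER BY EventDate) AS step_group
  FROM data_with_steps
)
SELECT MIN(EventDate) AS EventDate, EventType, SUM(count) AS count
FROM data_grouped
GROUP BY EventType, step_group
ORDER BY 1, 2
```

Traced by hand against the sample rows, this yields the nine rows of the desired table in the question: the 'Sent' pairs on 2019-02-06/07 and 2019-03-06/07 collapse to counts 9 and 6, and every other row stays on its own.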

This is a gaps-and-islands problem. The simplest method is to use row_number() and subtraction to identify the "islands", and then aggregate:

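-- t is the per-day result of the query in the question (eventDate, eventType, count)
select eventType, min(eventDate) as eventDate, sum(count) as count
from (select t.*,
             row_number() over (partition by eventType order by eventDate) as seqnum
      from t
     ) t
-- consecutive dates minus their row number give the same date, which identifies an island
group by eventType, date_sub(eventDate, interval seqnum day)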
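
The subtraction works because, within one eventType, consecutive dates and their row numbers both increase by 1, so the date minus the row number stays constant across an island and changes after every gap. As written, this merges consecutive days for every eventType. A possible adaptation that merges only 'Sent' is sketched below; it assumes the table name and filter from the question, uses BigQuery's DATE_SUB for the date arithmetic, and has not been run against real data.

```sql
-- Sketch under assumptions: `Database.Table` and the 100-day filter come from the question.
WITH t AS (
  SELECT DATE(EventDate) AS eventDate, EventType AS eventType, COUNT(*) AS count
  FROM `Database.Table`
  WHERE DATE(EventDate) > DATE_SUB(CURRENT_DATE(), INTERVAL 100 DAY)
  GROUP BY 1, 2
)
SELECT MIN(eventDate) AS eventDate, eventType, SUM(count) AS count
FROM (
  SELECT t.*,
         ROW_NUMBER() OVER (PARTITION BY eventType ORDER BY eventDate) AS seqnum
  FROM t
) AS numbered
GROUP BY eventType,
         -- 'Sent': island key (date minus row number); other types: the date itself,
         -- so each of their rows stays in its own group
         IF(eventType = 'Sent', DATE_SUB(eventDate, INTERVAL seqnum DAY), eventDate)
ORDER BY 1, 2
```

Because t holds at most one row per eventType and day, grouping the non-'Sent' rows by their own date leaves them unchanged.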
