BigQuery: how to group and count rows within a rolling timestamp window?

I have some experience with MongoDB and I'm learning about BigQuery. I'm trying to perform the following task, and I don't know how to do it using BigQuery's standard SQL.

I have a table with the following data. It contains events that occur on different website urls. Timestamp represents when the given event occurred. For example, the first row means, "event 'xx' occurred on url 'a.html' at 2016-10-18 15:55:16 UTC."

event_id |    url    |          timestamp   
-----------------------------------------------------------
   xx         a.html      2016-10-18 15:55:16 UTC
   xx         a.html      2016-10-19 16:68:55 UTC
   xx         a.html      2016-10-25 20:55:57 UTC
   yy         b.html      2016-10-18 15:58:09 UTC
   yy         a.html      2016-10-18 08:32:43 UTC
   zz         a.html      2016-10-20 04:44:22 UTC
   zz         c.html      2016-10-21 02:12:34 UTC

I want to count the number of times each event occurred on each url over a rolling 3 day window. In other words, I want to be able to say the following:

  • "on the url 'a.html', during the interval [2016-10-18 00:00:00 UTC, 2016-10-21 00:00:00 UTC), event 'xx' occurred twice."

  • "on the url 'a.html', during the interval [2016-10-19 00:00:00 UTC, 2016-10-22 00:00:00 UTC), event 'xx' occurred once."

  • "on the url 'a.html', during the interval [2016-10-20 00:00:00 UTC, 2016-10-23 00:00:00 UTC), event 'xx' occurred zero times." (NOTE: THIS DOES NOT NEED TO BE RETURNED AS A ROW. The absence of this row can imply that the event occurred zero times.)

Some notes: my database contains over 100k rows per day, and the frequency of events varies widely. For example, in 1 day, event 'xx' will occur ~10,000 times and event 'zz' will occur ~0-2 times.

Given my limited SQL knowledge, I didn't want to provide structure for the resulting table, because I figured that might incorrectly limit possible answers. Thanks!

Below is for BigQuery Standard SQL (see Enabling Standard SQL).

I am using ts as a field name (instead of timestamp as it is in your example) and assume this field is of TIMESTAMP data type.

WITH dailyAggregations AS (
  SELECT 
    DATE(ts) AS day, 
    url, 
    event_id, 
    UNIX_SECONDS(TIMESTAMP(DATE(ts))) AS sec, 
    COUNT(1) AS events 
  FROM yourTable
  GROUP BY day, url, event_id, sec
)
SELECT 
  url, event_id, day, events, 
  SUM(events) 
    OVER(PARTITION BY url, event_id ORDER BY sec 
      RANGE BETWEEN 259200 PRECEDING AND CURRENT ROW
  ) AS rolling3daysEvents
FROM dailyAggregations
-- ORDER BY url, event_id, day

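The value 259200 is 3 x 24 x 3600, i.e. three days expressed in seconds; change it to set whatever rolling period you need. Note that a RANGE frame includes both of its endpoints, and sec is always midnight of the row's day, so 259200 PRECEDING also picks up the day exactly three days earlier: the sum then covers four calendar days. Use 172800 (2 x 24 x 3600) if the window should span exactly three calendar days ending on the current row's day.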
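The intervals in the question look forward from each day, [day, day + 3 days), whereas the query above looks backward from each day. A minimal sketch of a forward-looking variant follows, under some assumptions: the sampleEvents rows are inlined only so the query runs without a real table, the time of day on 2016-10-19 is a placeholder (the value in the question is not a valid time), and the names sampleEvents, day_number, window_start, window_end_exclusive and events_in_window are made up for illustration.

```sql
-- Sketch only: a few rows from the question, inlined so the query is self-contained.
-- The time on 2016-10-19 is a placeholder, not the value from the question.
WITH sampleEvents AS (
  SELECT 'xx' AS event_id, 'a.html' AS url, TIMESTAMP '2016-10-18 15:55:16 UTC' AS ts UNION ALL
  SELECT 'xx', 'a.html', TIMESTAMP '2016-10-19 16:00:00 UTC' UNION ALL
  SELECT 'xx', 'a.html', TIMESTAMP '2016-10-25 20:55:57 UTC' UNION ALL
  SELECT 'zz', 'a.html', TIMESTAMP '2016-10-20 04:44:22 UTC'
),
dailyAggregations AS (
  -- One row per (day, url, event_id) with that day's event count.
  SELECT
    DATE(ts) AS day,
    url,
    event_id,
    UNIX_DATE(DATE(ts)) AS day_number,  -- whole days since 1970-01-01
    COUNT(*) AS events
  FROM sampleEvents
  GROUP BY day, url, event_id, day_number
)
SELECT
  url,
  event_id,
  day AS window_start,
  DATE_ADD(day, INTERVAL 3 DAY) AS window_end_exclusive,
  -- Current day plus the two following days = 3 calendar days.
  SUM(events) OVER (
    PARTITION BY url, event_id
    ORDER BY day_number
    RANGE BETWEEN CURRENT ROW AND 2 FOLLOWING
  ) AS events_in_window
FROM dailyAggregations
ORDER BY url, event_id, window_start
```

For event 'xx' on 'a.html' this should give 2 for the window starting 2016-10-18 and 1 for the window starting 2016-10-19, matching the first two statements in the question. A window is only returned for days on which the event occurred at least once, so the [2016-10-20, 2016-10-23) window for 'xx' is simply absent, and so is any window that starts on an event-free day but contains later events. If those are needed, one option is to join against a generated calendar of days first.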
