BigQuery SQL for sliding window aggregate

Hi, I have a table that looks like this:

Date         Customer   Pageviews
2014/03/01   abc          5
2014/03/02   xyz          8
2014/03/03   abc          6

I want page view aggregates reported once per week, where each weekly row covers the past 30 days (that is, a sliding window aggregate with a 30-day window, evaluated every week).

I am using Google BigQuery.

EDIT: Gordon, regarding your comment about "Customer": what I actually need is slightly more complicated, which is why I included Customer in the table above. Every week, I want the number of customers who had more than n pageviews in the preceding 30-day window. Something like this:

Date        Customers with >10 pageviews in 30-day window
2014/02/01  10
2014/02/08  5
2014/02/15  6
2014/02/22  15

However, to keep it simple, I can work out the rest myself if I can just get a sliding window aggregate of pageviews, ignoring customers altogether. Something like this:

Date        Count of pageviews in 30-day window
2014/02/01  50
2014/02/08  55
2014/02/15  65
2014/02/22  75

How about this:

-- Add each week to the three preceding rows to get a 28-day total per login.
SELECT changes + changes1 + changes2 + changes3 changes28days,
       login,
       USEC_TO_TIMESTAMP(week)
FROM (
  SELECT changes,
         -- LAG(changes, k) reads the value k rows earlier for the same login
         LAG(changes, 1) OVER (PARTITION BY login ORDER BY week) changes1,
         LAG(changes, 2) OVER (PARTITION BY login ORDER BY week) changes2,
         LAG(changes, 3) OVER (PARTITION BY login ORDER BY week) changes3,
         login,
         week
  FROM (
    -- Changed files per login per week
    SELECT SUM(payload_pull_request_changed_files) changes,
           UTC_USEC_TO_WEEK(created_at, 1) week,
           actor_attributes_login login
    FROM [publicdata:samples.github_timeline]
    WHERE payload_pull_request_changed_files > 0
    GROUP BY week, login
  )
)
HAVING changes28days > 0

For each user, the inner query counts how many changes they submitted per week. Then LAG() lets us peek into the previous rows to see how many changes they submitted 1, 2, and 3 weeks earlier. Finally, we add those 4 weeks together to see how many changes were submitted in the last 28 days.
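To make the LAG() arithmetic concrete, here is a minimal Python sketch of the same idea, not a translation of the BigQuery query. The function name `four_week_sums`, the integer week indexes, and the sample data are all assumptions made up for illustration.

```python
from collections import defaultdict


def four_week_sums(weekly):
    """weekly maps (login, week_index) -> changes submitted that week.

    Returns (login, week_index) -> changes in that row plus the three
    previous rows for the same login, mirroring LAG(changes, 1..3).
    Rows with fewer than three predecessors are skipped, just as a NULL
    lagged value makes the SQL sum NULL and fails the HAVING filter.
    """
    by_login = defaultdict(list)
    for (login, week), changes in weekly.items():
        by_login[login].append((week, changes))

    totals = {}
    for login, rows in by_login.items():
        rows.sort()  # ORDER BY week inside each login partition
        for i in range(3, len(rows)):
            week = rows[i][0]
            totals[(login, week)] = sum(c for _, c in rows[i - 3:i + 1])
    return totals


# Hypothetical sample data: alice is active for five weeks, bob for two.
sample = {
    ("alice", 1): 2, ("alice", 2): 3, ("alice", 3): 1,
    ("alice", 4): 4, ("alice", 5): 5,
    ("bob", 1): 7, ("bob", 2): 1,
}
result = four_week_sums(sample)
print(result)
```

One thing the sketch makes visible: like LAG(), it steps back by rows, not by calendar weeks. If a login has no row for some week, the three previous rows reach further back than 28 days, so the total only equals a true 28-day sum when every week is present for that login.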

Now you can wrap everything in a new query to keep only the users with changes > X, and count them.
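The wrapping step amounts to a filter followed by a count per week. A minimal Python sketch of that logic, assuming the 28-day totals are already available as a dictionary (the name `users_over_threshold` and the sample totals are hypothetical):

```python
from collections import Counter


def users_over_threshold(totals, x):
    """totals maps (login, week) -> 28-day total for that login and week.

    Returns week -> number of logins whose total is strictly greater
    than x, i.e. a WHERE total > x followed by COUNT(*) GROUP BY week.
    """
    counts = Counter()
    for (login, week), total in totals.items():
        if total > x:
            counts[week] += 1
    return dict(counts)


# Hypothetical 28-day totals for two logins over two weeks.
totals = {
    ("alice", 4): 10, ("alice", 5): 13,
    ("bob", 4): 12, ("bob", 5): 3,
}
per_week = users_over_threshold(totals, 9)
print(per_week)  # {4: 2, 5: 1}
```

Weeks in which nobody exceeds the threshold do not appear in the result at all, which matches what a filtered GROUP BY would return.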

I have created the following "Times" table:

Table Details: Dim_Periods
Schema
Date        TIMESTAMP
Year        INTEGER
Month       INTEGER
day         INTEGER
QUARTER     INTEGER
DAYOFWEEK   INTEGER
MonthStart  TIMESTAMP
MonthEnd    TIMESTAMP
WeekStart   TIMESTAMP
WeekEnd     TIMESTAMP
Back30Days  TIMESTAMP   -- the date 30 days before "Date"
Back7Days   TIMESTAMP   -- the date 7 days before "Date"

and I use a query like this to handle "running sums":

SELECT Date, COUNT(*) AS MovingCNT
FROM (
  -- Calendar rows for the last 5 months, each carrying its window start
  SELECT Date, Back7Days
  FROM DWH.Dim_Periods
  WHERE Date < TIMESTAMP(CURRENT_DATE())
    AND Date >= DATE_ADD(CURRENT_TIMESTAMP(), -5, 'month')
) P
CROSS JOIN EACH (
  SELECT repository_url, repository_created_at
  FROM [publicdata:samples.github_timeline]
) L
-- Keep only the events that fall inside each calendar row's 7-day window
WHERE TIMESTAMP(repository_created_at) >= Back7Days
  AND TIMESTAMP(repository_created_at) <= Date
GROUP EACH BY Date

Note that it can also be used for "month to date", "week to date", "30 days back", and similar aggregations. However, performance is not the best, and the query can take a while on larger data sets because of the Cartesian join. Hope this helps.
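The calendar-table technique boils down to pairing every reporting date with every event and keeping the pairs where the event falls inside that date's window. A minimal Python sketch of that logic, with hypothetical names (`moving_counts`, `window_days`) and made-up sample dates, might look like this:

```python
from datetime import date, timedelta


def moving_counts(report_dates, event_dates, window_days=7):
    """For each reporting date d, count the events e with
    d - window_days <= e <= d (both ends inclusive, as in the
    >= Back7Days AND <= Date filter)."""
    counts = {}
    for d in report_dates:
        start = d - timedelta(days=window_days)  # plays the role of Back7Days
        # Every date is compared with every event: the Cartesian join.
        counts[d] = sum(1 for e in event_dates if start <= e <= d)
    return counts


# Hypothetical weekly reporting dates and event dates.
report_dates = [date(2014, 3, 1) + timedelta(days=7 * i) for i in range(3)]
events = [
    date(2014, 2, 25), date(2014, 3, 1), date(2014, 3, 5),
    date(2014, 3, 8), date(2014, 3, 14),
]
moving = moving_counts(report_dates, events)
for d, n in moving.items():
    print(d, n)
```

Passing `window_days=30` corresponds to joining on Back30Days, and listing one reporting date per week gives the weekly 30-day window from the question. Unlike the SQL, which drops dates with no matching events, this sketch keeps them with a count of 0. The nested scan also shows why the cost grows with (number of dates) x (number of events).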
