
BigQuery: How to merge HLL Sketches over a window function? (Count distinct values over a rolling window)

Example of the relevant table schema:

+---------------------------+-------------------+
| activity_date - TIMESTAMP | user_id - STRING  |
+---------------------------+-------------------+
| 2017-02-22 17:36:08 UTC   | fake_id_i24385787 |
+---------------------------+-------------------+
| 2017-02-22 04:27:08 UTC   | fake_id_234885747 |
+---------------------------+-------------------+
| 2017-02-22 08:36:08 UTC   | fake_id_i24385787 |
+---------------------------+-------------------+

I need to count distinct active users over a rolling time period (90 days) in a large dataset, and I am running into issues due to the size of the dataset.

At first, I attempted to use a window function, similar to the answer here: https://stackoverflow.com/a/27574474

WITH
  daily AS (
  SELECT
    DATE(activity_date) day,
    user_id
  FROM
    `fake-table`)
SELECT
  day,
  SUM(APPROX_COUNT_DISTINCT(user_id)) OVER (ORDER BY day ROWS BETWEEN 89 PRECEDING AND CURRENT ROW) ninty_day_window_apprx
FROM
  daily
GROUP BY
  1
ORDER BY
  1 DESC

However, this computes the distinct number of users per day and then sums those daily counts, so a user who appears on several days within the window is counted more than once. It is therefore not an accurate measure of distinct users over 90 days.

The next thing I tried was the solution at https://stackoverflow.com/a/47659590: concatenating all the distinct user_ids for each window into one delimited string, then splitting it into an array and counting the distinct values within it.
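To make the overcounting concrete, here is a minimal sketch on made-up data (the `events` rows below are hypothetical, not taken from the real table). It uses exact COUNT(DISTINCT ...) rather than the approximate function so the numbers are easy to check by hand: user 'a' is active on both days, so the sum of daily distinct counts is 3 while the true number of distinct users is 2.

```sql
#standardSQL
WITH events AS (
  -- Hypothetical rows for illustration only.
  SELECT 'a' AS user_id, DATE '2019-01-01' AS day UNION ALL
  SELECT 'b', DATE '2019-01-01' UNION ALL
  SELECT 'a', DATE '2019-01-02'
),
daily AS (
  -- Distinct users per day: 2 on the first day, 1 on the second.
  SELECT day, COUNT(DISTINCT user_id) AS daily_users
  FROM events
  GROUP BY day
)
SELECT
  -- Summing per-day distincts counts user 'a' twice.
  (SELECT SUM(daily_users) FROM daily) AS sum_of_daily_distincts,
  -- Counting distincts across both days counts each user once.
  (SELECT COUNT(DISTINCT user_id) FROM events) AS true_distinct_users
```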

WITH daily AS (
  SELECT date(activity_date) day, STRING_AGG(DISTINCT user_id) users
  FROM `fake-table`  
  GROUP BY day
), temp2 AS (
  SELECT
    day, 
    STRING_AGG(users) OVER(ORDER BY UNIX_DATE(day) RANGE BETWEEN 89 PRECEDING AND CURRENT ROW) users
  FROM daily
)

SELECT day, 
  (SELECT APPROX_COUNT_DISTINCT(id) FROM UNNEST(SPLIT(users)) AS id) Unique90Days
FROM temp2

order by 1 desc

However, this quickly ran out of memory on any large input.

Next, I tried using an HLL sketch to represent the distinct IDs in a much smaller value, so that memory would be less of an issue. I thought my problems were solved, but running the query below fails with the error "Function MERGE_PARTIAL is not supported." I tried MERGE as well and got the same error. It only happens when using the window function; creating the sketches for each day's values works fine.

I read through the BigQuery Standard SQL documentation and don't see anything about using HLL_COUNT.MERGE_PARTIAL and HLL_COUNT.MERGE with window functions. Presumably this should take the 90 sketches and combine them into one HLL sketch, representing the distinct values across the 90 original sketches?

WITH
  daily AS (
  SELECT
    DATE(activity_date) day,
    HLL_COUNT.INIT(user_id) sketch
  FROM
    `fake-table`
  GROUP BY
    1
  ORDER BY
    1 DESC),

  rolling AS (
  SELECT
    day,
    HLL_COUNT.MERGE_PARTIAL(sketch) OVER (ORDER BY UNIX_DATE(day) RANGE BETWEEN 89 PRECEDING AND CURRENT ROW) rolling_sketch
    FROM daily)

SELECT
  day,
  HLL_COUNT.EXTRACT(rolling_sketch)
FROM
  rolling
ORDER BY
  1

[Image: error message "Function MERGE_PARTIAL is not supported"]

Any ideas why this error happens or how to work around it?

The query below is for BigQuery Standard SQL and does what you want using a window function. It collects the daily sketches into an array with ARRAY_AGG as the window function, then merges each array with HLL_COUNT.MERGE in a scalar subquery:

#standardSQL
SELECT day,
  (SELECT HLL_COUNT.MERGE(sketch) FROM UNNEST(rolling_sketch_arr) sketch)  rolling_sketch
FROM (
  SELECT day, 
    ARRAY_AGG(ids_sketch) OVER(ORDER BY UNIX_DATE(day) RANGE BETWEEN 89 PRECEDING AND CURRENT ROW) rolling_sketch_arr 
  FROM (
    SELECT day, HLL_COUNT.INIT(id) ids_sketch
    FROM `project.dataset.table`
    GROUP BY day
  )
)

You can test and play with the above using [totally] dummy data, as in the example below:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 id, DATE '2019-01-01' day UNION ALL
  SELECT 2, '2019-01-01' UNION ALL
  SELECT 3, '2019-01-01' UNION ALL
  SELECT 1, '2019-01-02' UNION ALL
  SELECT 4, '2019-01-02' UNION ALL
  SELECT 2, '2019-01-03' UNION ALL
  SELECT 3, '2019-01-03' UNION ALL
  SELECT 4, '2019-01-03' UNION ALL
  SELECT 5, '2019-01-03' UNION ALL
  SELECT 1, '2019-01-04' UNION ALL
  SELECT 4, '2019-01-04' UNION ALL
  SELECT 2, '2019-01-05' UNION ALL
  SELECT 3, '2019-01-05' UNION ALL
  SELECT 5, '2019-01-05' UNION ALL
  SELECT 6, '2019-01-05' 
)
SELECT day,
  (SELECT HLL_COUNT.MERGE(sketch) FROM UNNEST(rolling_sketch_arr) sketch)  rolling_sketch
FROM (
  SELECT day, 
    ARRAY_AGG(ids_sketch) OVER(ORDER BY UNIX_DATE(day) RANGE BETWEEN 2 PRECEDING AND CURRENT ROW) rolling_sketch_arr 
  FROM (
    SELECT day, HLL_COUNT.INIT(id) ids_sketch
    FROM `project.dataset.table`
    GROUP BY day
  )
)
-- ORDER BY day

with the result:

Row day         rolling_sketch   
1   2019-01-01  3    
2   2019-01-02  4    
3   2019-01-03  5    
4   2019-01-04  5    
5   2019-01-05  6    
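If the goal is to keep a rolling sketch per day, as in the question's MERGE_PARTIAL and EXTRACT query, the same ARRAY_AGG workaround appears to carry over. The following is only a sketch under stated assumptions: the `events` CTE stands in for the real table, its first three rows are the ones shown in the question, and the fourth row is made up to give the window a second day. The CTE names `daily`, `windowed`, and `rolling` are illustrative.

```sql
#standardSQL
WITH events AS (
  -- First three rows come from the example table in the question;
  -- the last row is a made-up addition for illustration.
  SELECT TIMESTAMP '2017-02-22 17:36:08 UTC' AS activity_date, 'fake_id_i24385787' AS user_id UNION ALL
  SELECT TIMESTAMP '2017-02-22 04:27:08 UTC', 'fake_id_234885747' UNION ALL
  SELECT TIMESTAMP '2017-02-22 08:36:08 UTC', 'fake_id_i24385787' UNION ALL
  SELECT TIMESTAMP '2017-02-23 09:00:00 UTC', 'fake_id_i24385787'
),
daily AS (
  -- One HLL sketch per calendar day.
  SELECT DATE(activity_date) AS day, HLL_COUNT.INIT(user_id) AS sketch
  FROM events
  GROUP BY 1
),
windowed AS (
  -- ARRAY_AGG is allowed as a window function, so gather the
  -- sketches of the trailing 90 days into an array.
  SELECT day,
    ARRAY_AGG(sketch) OVER (ORDER BY UNIX_DATE(day) RANGE BETWEEN 89 PRECEDING AND CURRENT ROW) AS sketches
  FROM daily
),
rolling AS (
  -- Merge each array into a single sketch with an ordinary
  -- (non-window) aggregate inside a scalar subquery.
  SELECT day,
    (SELECT HLL_COUNT.MERGE_PARTIAL(s) FROM UNNEST(sketches) AS s) AS rolling_sketch
  FROM windowed
)
SELECT day, HLL_COUNT.EXTRACT(rolling_sketch) AS unique_90_day_users
FROM rolling
ORDER BY day
```

Keeping `rolling_sketch` as a sketch rather than a number means it can be stored and merged again later; when only the count is needed, HLL_COUNT.MERGE as used in the answer above is the shorter route.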

Combine HLL_COUNT.INIT and HLL_COUNT.MERGE. This solution uses a 90-day cross join with GENERATE_ARRAY(1, 90) instead of OVER. Each daily sketch is repeated for 90 values of i and regrouped under DATE_SUB(date, INTERVAL i DAY), so every date_grp merges the sketches of the 90 days that follow it.

#standardSQL
SELECT DATE_SUB(date, INTERVAL i DAY) date_grp
 , HLL_COUNT.MERGE(sketch) unique_90_day_users
 , HLL_COUNT.MERGE(DISTINCT IF(i<31,sketch,null)) unique_30_day_users
 , HLL_COUNT.MERGE(DISTINCT IF(i<8,sketch,null)) unique_7_day_users
FROM (
  SELECT DATE(creation_date) date, HLL_COUNT.INIT(owner_user_id) sketch
  FROM `bigquery-public-data.stackoverflow.posts_questions` 
  WHERE EXTRACT(YEAR FROM creation_date)=2017
  GROUP BY 1
), UNNEST(GENERATE_ARRAY(1, 90)) i
GROUP BY 1
ORDER BY date_grp
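For a window that ends on the reported date rather than starting after it, one possible variation shifts each daily sketch forward with DATE_ADD and starts the offsets at 0. This is a hedged sketch on made-up data with a 3-day window so it stays small; the `activity` rows and the names `window_end`, `unique_3_day_users`, and `unique_1_day_users` are assumptions for illustration, not part of the answer above.

```sql
#standardSQL
WITH activity AS (
  -- Dummy (id, day) rows, made up for illustration only.
  SELECT 1 AS id, DATE '2019-01-01' AS day UNION ALL
  SELECT 2, DATE '2019-01-01' UNION ALL
  SELECT 1, DATE '2019-01-02' UNION ALL
  SELECT 3, DATE '2019-01-02' UNION ALL
  SELECT 4, DATE '2019-01-03' UNION ALL
  SELECT 2, DATE '2019-01-04'
),
daily AS (
  -- One HLL sketch per day.
  SELECT day, HLL_COUNT.INIT(id) AS sketch
  FROM activity
  GROUP BY day
)
SELECT
  -- A sketch for day d is copied to window_end = d, d+1, d+2, so each
  -- window_end receives the sketches of itself and the 2 days before it.
  DATE_ADD(day, INTERVAL i DAY) AS window_end,
  HLL_COUNT.MERGE(sketch) AS unique_3_day_users,
  -- Shorter windows reuse the same join by nulling out larger offsets.
  HLL_COUNT.MERGE(IF(i < 1, sketch, NULL)) AS unique_1_day_users
FROM daily, UNNEST(GENERATE_ARRAY(0, 2)) AS i
GROUP BY window_end
ORDER BY window_end
```

Note that the fan-out also produces `window_end` values after the last day with data; those rows cover only part of a window and can be filtered out if they are not wanted.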
