繁体   English   中英

BigQuery:如何在窗口函数上合并HLL草图? (在滚动窗口中计数不同的值)

[英]BigQuery: How to merge HLL Sketches over a window function? (Count distinct values over a rolling window)

相关表格示例示例:

+---------------------------+-------------------+
| activity_date - TIMESTAMP | user_id - STRING  |
+---------------------------+-------------------+
| 2017-02-22 17:36:08 UTC   | fake_id_i24385787 |
+---------------------------+-------------------+
| 2017-02-22 04:27:08 UTC   | fake_id_234885747 |
+---------------------------+-------------------+
| 2017-02-22 08:36:08 UTC   | fake_id_i24385787 |
+---------------------------+-------------------+

我需要在一个滚动的时间段内(90天)计算一个大型数据集上活跃的独立用户的数量,并且由于数据集的大小而出现问题。

首先,我尝试使用窗口函数,类似于此处的答案。 https://stackoverflow.com/a/27574474

WITH
  daily AS (
  SELECT
    DATE(activity_date) day,
    user_id
  FROM
    `fake-table`)
SELECT
  day,
  SUM(APPROX_COUNT_DISTINCT(user_id)) OVER (ORDER BY day ROWS BETWEEN 89 PRECEDING AND CURRENT ROW) ninty_day_window_apprx
FROM
  daily
GROUP BY
  1
ORDER BY
  1 DESC

但是,这导致每天获得不同数量的用户,然后将这些数量进行汇总-但是,如果不同用户出现多次,则它们可以在窗口中重复。 因此,这并不是对90天内不同用户的真实准确衡量。

我尝试的下一件事是使用以下解决方案https://stackoverflow.com/a/47659590-将每个窗口的所有不同的user_id连接到一个数组,然后计算其中的不同。

WITH daily AS (
  SELECT date(activity_date) day, STRING_AGG(DISTINCT user_id) users
  FROM `fake-table`  
  GROUP BY day
), temp2 AS (
  SELECT
    day, 
    STRING_AGG(users) OVER(ORDER BY UNIX_DATE(day) RANGE BETWEEN 89 PRECEDING AND CURRENT ROW) users
  FROM daily
)

SELECT day, 
  (SELECT APPROX_COUNT_DISTINCT(id) FROM UNNEST(SPLIT(users)) AS id) Unique90Days
FROM temp2

order by 1 desc

但是,这很快就用光了所有大容量的内存。

接下来是使用HLL草图以较小的值表示不同的ID,因此内存将不再是问题。 我以为我的问题已解决,但是运行以下命令时出现错误:错误仅是“不支持MERGE_PARTIAL函数”。 我也尝试了MERGE,并遇到了相同的错误。 仅在使用窗口功能时发生。 为每天的价值创建草图效果很好。

我通读了BigQuery Standard SQL文档,但没有看到关于带有窗口函数的HLL_COUNT.MERGE_PARTIAL和HLL_COUNT.MERGE的任何信息。 大概应该采用90个草图并将它们组合为一个HLL草图,代表90个原始草图之间的不同值?

WITH
  daily AS (
  SELECT
    DATE(activity_date) day,
    HLL_COUNT.INIT(user_id) sketch
  FROM
    `fake-table`
  GROUP BY
    1
  ORDER BY
    1 DESC),

  rolling AS (
  SELECT
    day,
    HLL_COUNT.MERGE_PARTIAL(sketch) OVER (ORDER BY UNIX_DATE(day) RANGE BETWEEN 89 PRECEDING AND CURRENT ROW) rolling_sketch
    FROM daily)

SELECT
  day,
  HLL_COUNT.EXTRACT(rolling_sketch)
FROM
  rolling
ORDER BY
  1 

“错误图像-不支持功能MERGE_PARTIAL”

任何想法为什么会发生此错误或如何调整?

以下是适用于BigQuery Standard SQL的信息,它确实可以使用window函数来实现您想要的功能

#standardSQL
SELECT day,
  (SELECT HLL_COUNT.MERGE(sketch) FROM UNNEST(rolling_sketch_arr) sketch)  rolling_sketch
FROM (
  SELECT day, 
    ARRAY_AGG(ids_sketch) OVER(ORDER BY UNIX_DATE(day) RANGE BETWEEN 89 PRECEDING AND CURRENT ROW) rolling_sketch_arr 
  FROM (
    SELECT day, HLL_COUNT.INIT(id) ids_sketch
    FROM `project.dataset.table`
    GROUP BY day
  )
)

您可以使用[全部]伪数据来测试,玩游戏,如下例所示

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 id, DATE '2019-01-01' day UNION ALL
  SELECT 2, '2019-01-01' UNION ALL
  SELECT 3, '2019-01-01' UNION ALL
  SELECT 1, '2019-01-02' UNION ALL
  SELECT 4, '2019-01-02' UNION ALL
  SELECT 2, '2019-01-03' UNION ALL
  SELECT 3, '2019-01-03' UNION ALL
  SELECT 4, '2019-01-03' UNION ALL
  SELECT 5, '2019-01-03' UNION ALL
  SELECT 1, '2019-01-04' UNION ALL
  SELECT 4, '2019-01-04' UNION ALL
  SELECT 2, '2019-01-05' UNION ALL
  SELECT 3, '2019-01-05' UNION ALL
  SELECT 5, '2019-01-05' UNION ALL
  SELECT 6, '2019-01-05' 
)
SELECT day,
  (SELECT HLL_COUNT.MERGE(sketch) FROM UNNEST(rolling_sketch_arr) sketch)  rolling_sketch
FROM (
  SELECT day, 
    ARRAY_AGG(ids_sketch) OVER(ORDER BY UNIX_DATE(day) RANGE BETWEEN 2 PRECEDING AND CURRENT ROW) rolling_sketch_arr 
  FROM (
    SELECT day, HLL_COUNT.INIT(id) ids_sketch
    FROM `project.dataset.table`
    GROUP BY day
  )
)
-- ORDER BY day

结果

Row day         rolling_sketch   
1   2019-01-01  3    
2   2019-01-02  4    
3   2019-01-03  5    
4   2019-01-04  5    
5   2019-01-05  6    

结合HLL_COUNT.INITHLL_COUNT.MERGE 此解决方案使用90天的交叉联接与GENERATE_ARRAY(1, 90)而不是OVER

#standardSQL
SELECT DATE_SUB(date, INTERVAL i DAY) date_grp
 , HLL_COUNT.MERGE(sketch) unique_90_day_users
 , HLL_COUNT.MERGE(DISTINCT IF(i<31,sketch,null)) unique_30_day_users
 , HLL_COUNT.MERGE(DISTINCT IF(i<8,sketch,null)) unique_7_day_users
FROM (
  SELECT DATE(creation_date) date, HLL_COUNT.INIT(owner_user_id) sketch
  FROM `bigquery-public-data.stackoverflow.posts_questions` 
  WHERE EXTRACT(YEAR FROM creation_date)=2017
  GROUP BY 1
), UNNEST(GENERATE_ARRAY(1, 90)) i
GROUP BY 1
ORDER BY date_grp

在此处输入图片说明

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM