How to get the cumulative total of users while ignoring users who already appeared on a previous day? Using BigQuery

Question

I want to calculate cumulative users per day, but users who already appeared on a previous day should not be counted again.

date_key      user_id
2022-01-01     001
2022-01-01     002
2022-01-02     001
2022-01-02     003
2022-01-03     002
2022-01-03     003
2022-01-04     002
2022-01-04     004

On a daily basis we get:

date_key     total_user
2022-01-01      2
2022-01-02      2
2022-01-03      2
2022-01-04      2

If we simply calculate a running total of these daily counts, we get 2, 4, 6, 8 for each day. The goal is to get a table like this:

date_key     total_user
2022-01-01      2
2022-01-02      3
2022-01-03      3
2022-01-04      4
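
For reference, the naive running total described above could be written roughly as follows. This is only a sketch: it assumes the table is named t1, that date_key is a DATE column, and the alias naive_cumulative is made up for illustration.

```sql
-- Sum the per-day distinct counts: a user seen on several days is counted
-- once per day, so the sample data yields 2, 4, 6, 8 instead of 2, 3, 3, 4.
select date_key,
       sum(total_user) over (order by date_key) as naive_cumulative
from (
  select date_key, count(distinct user_id) as total_user
  from t1
  group by date_key
)
order by date_key;
```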

I'm using the query below to get the result, but since the data is really huge, the query takes forever to complete.

select b.date_key,count(distinct a.user_id) total_user
from t1 a
join t1 b 
   on b.date_key >= a.date_key 
   and date_trunc(a.date_key,month) = date_trunc(b.date_key,month)
group by 1
order by 1

And yes, the calculation should reset when the month changes.

By the way, I'm using Google BigQuery.

Answer 1

Number each user's appearances within a month by order of date. Count only the ones seen for the first time:

with data as (
    -- rn = 1 marks a user's first appearance within the month
    select *,
        row_number() over (partition by date_trunc(date_key, month), user_id
                           order by date_key) as rn
    from T
)
-- count first-time users per day, then accumulate within each month
select date_key,
    sum(count(case when rn = 1 then 1 end)) -- or countif(rn = 1)
        over (partition by date_trunc(date_key, month)
              order by date_key) as cum_monthly_users
from data
group by date_key;

https://dbfiddle.uk/?rdbms=postgres_14&fiddle=dc426d79a7786fc8a5b25a22f0755e27
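
To try this idea in the BigQuery console without a real table, the question's sample rows can be inlined as a CTE. This is a minimal sketch, not the answer's exact query: it assumes date_key is a DATE and user_id a STRING, the CTE names sample, numbered and daily are made up for illustration, and the per-day aggregation is split into its own step for readability.

```sql
with sample as (
  select date '2022-01-01' as date_key, '001' as user_id union all
  select date '2022-01-01', '002' union all
  select date '2022-01-02', '001' union all
  select date '2022-01-02', '003' union all
  select date '2022-01-03', '002' union all
  select date '2022-01-03', '003' union all
  select date '2022-01-04', '002' union all
  select date '2022-01-04', '004'
),
numbered as (
  -- rn = 1 marks the first row for each user within a calendar month
  select date_key, user_id,
         row_number() over (partition by date_trunc(date_key, month), user_id
                            order by date_key) as rn
  from sample
),
daily as (
  -- number of users seen for the first time on each day
  select date_key, countif(rn = 1) as new_users
  from numbered
  group by date_key
)
-- running total of first-time users, restarting every month
select date_key,
       sum(new_users) over (partition by date_trunc(date_key, month)
                            order by date_key) as cum_monthly_users
from daily
order by date_key;
```

On the sample rows the per-day counts of new users are 2, 1, 0, 1, so the running total should come out as 2, 3, 3, 4, matching the table the question asks for. A day with no new users (2022-01-03 here) still appears in the output because every date is kept in the daily step.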

Answer 2

1. Cumulative total of users, ignoring users who already appeared on a previous day

2. The calculation should reset when the month changes

3. The data is really huge

Consider the approach below:

select date_key, 
  ( select hll_count.merge(u) 
    from unnest(users) u
  ) as total_user
from (
  select date_key, date_trunc(date(date_key), month) year_month,
    array_agg(users) over(partition by date_trunc(date(date_key), month) order by date_key) users
  from (
    select date_key, hll_count.init(user_id) users
    from your_table
    group by date_key
  )
)

If applied to the sample data in your question, the output is as expected.

Note: not only are requirements #1 and #2 above met, with output as expected, but here we also use HyperLogLog++ functions, which effectively address #3 above.

HLL++ functions are approximate aggregate functions. Approximate aggregation typically requires less memory than exact aggregation functions, like COUNT(DISTINCT), but also introduces statistical error. This makes HLL++ functions appropriate for large data streams for which linear memory usage is impractical, as well as for data that is already approximate.
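
To get a feel for the size of that statistical error on your own data, one option is to compare exact and approximate monthly counts side by side, ideally on a sample small enough for the exact count to finish. A sketch, assuming your_table has a DATE column date_key and a STRING column user_id; the aliases are illustrative only.

```sql
select
  date_trunc(date_key, month) as month_start,
  -- exact distinct count: precise, but memory-heavy on huge inputs
  count(distinct user_id) as exact_users,
  -- HLL++ estimate: build a sketch, then extract its cardinality
  hll_count.extract(hll_count.init(user_id)) as approx_users
from your_table
group by month_start
order by month_start;
```

If the gap between the two columns is too large for your use case, hll_count.init also accepts an optional precision argument (10 to 24, default 15), which trades more memory per sketch for a tighter estimate.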

Answer 1: score 1, accepted, 2022-08-30 10:26:38

Answer 2: score 0, 2022-08-30 17:33:50
