How to get cumulative total users, ignoring users who already appeared on a previous day, using BigQuery?
I want to calculate the cumulative number of users per day, but users who already appeared on a previous day should not be counted again.
date_key user_id
2022-01-01 001
2022-01-01 002
2022-01-02 001
2022-01-02 003
2022-01-03 002
2022-01-03 003
2022-01-04 002
2022-01-04 004
On a daily basis we get:
date_key total_user
2022-01-01 2
2022-01-02 2
2022-01-03 2
2022-01-04 2
If we simply take a running total of those daily counts, we get 2, 4, 6, 8. The goal is to get a table like this:
date_key total_user
2022-01-01 2
2022-01-02 3
2022-01-03 3
2022-01-04 4
I'm using the query below to get the result, but since the data is really huge, it takes forever to complete:
select b.date_key,count(distinct a.user_id) total_user
from t1 a
join t1 b
on b.date_key >= a.date_key
and date_trunc(a.date_key,month) = date_trunc(b.date_key,month)
group by 1
order by 1
And yes, the calculation should reset when the month changes.
By the way, I'm using Google BigQuery.
Number each user's appearances in date order within each month, then count only the rows where the user is seen for the first time:
with data as (
select *,
row_number() over (partition by date_trunc(date_key, month), user_id
order by date_key) as rn
from T
)
select date_key,
sum(count(case when rn = 1 then 1 end)) -- or countif(rn = 1)
over (partition by date_trunc(date_key, month)
order by date_key) as cum_monthly_users
from data
group by date_key;
https://dbfiddle.uk/?rdbms=postgres_14&fiddle=dc426d79a7786fc8a5b25a22f0755e27
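As a sanity check on the "count only first appearances" logic, here is a minimal pure-Python sketch that applies the same idea to the sample rows from the question. It is not BigQuery code; the function name `cumulative_new_users` and the assumption that `date_key` is an ISO `YYYY-MM-DD` string are illustrative only.

```python
from collections import defaultdict


def cumulative_new_users(rows):
    """rows: iterable of (date_key, user_id), date_key as 'YYYY-MM-DD'.

    Returns [(date_key, cumulative_users)], with the running total
    restarting at the beginning of every month.
    """
    seen = defaultdict(set)         # month -> users already counted this month
    new_per_day = defaultdict(int)  # date_key -> users seen for the first time

    # ISO date strings sort chronologically, so a plain sort is enough.
    for date_key, user_id in sorted(rows):
        new_per_day[date_key] += 0  # keep days that bring no new users
        month = date_key[:7]
        if user_id not in seen[month]:  # equivalent of rn = 1
            seen[month].add(user_id)
            new_per_day[date_key] += 1

    result = []
    running, current_month = 0, None
    for date_key in sorted(new_per_day):
        if date_key[:7] != current_month:  # month changed: reset the total
            current_month, running = date_key[:7], 0
        running += new_per_day[date_key]
        result.append((date_key, running))
    return result


sample_rows = [
    ("2022-01-01", "001"), ("2022-01-01", "002"),
    ("2022-01-02", "001"), ("2022-01-02", "003"),
    ("2022-01-03", "002"), ("2022-01-03", "003"),
    ("2022-01-04", "002"), ("2022-01-04", "004"),
]

for date_key, total_user in cumulative_new_users(sample_rows):
    print(date_key, total_user)
```

On the sample data this prints 2, 3, 3, 4 for the four days, which is the table the question asks for.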
- cumulative total users, ignoring users who already appeared on a previous day
- the calculation should reset when the month changes
- the data is really huge
Consider the approach below:
select date_key,
( select hll_count.merge(u)
from unnest(users) u
) as total_user
from (
select date_key, date_trunc(date(date_key), month) year_month,
array_agg(users) over(partition by date_trunc(date(date_key), month) order by date_key) users
from (
select date_key, hll_count.init(user_id) users
from your_table
group by date_key
)
)
If applied to the sample data in your question, the output matches the expected table above.
Note: not only are the first two requirements above [obviously] met, but the HyperLogLog++ functions used here also effectively address the third.
HLL++ functions are approximate aggregate functions. Approximate aggregation typically requires less memory than exact aggregation functions, like COUNT(DISTINCT), but also introduces statistical error. This makes HLL++ functions appropriate for large data streams for which linear memory usage is impractical, as well as for data that is already approximate.
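The shape of the query above (one sketch per day, then a merge of all sketches from the start of the month up to each day) can be illustrated with a small Python sketch. Exact Python sets stand in for the HLL++ sketches here, so this version is exact rather than approximate and does not save memory; it only shows the "build per day, merge cumulatively" structure. The function name `cumulative_by_merging` is hypothetical.

```python
from collections import defaultdict


def cumulative_by_merging(rows):
    """rows: iterable of (date_key, user_id), date_key as 'YYYY-MM-DD'."""
    # Step 1: one "sketch" per day (stand-in for hll_count.init ... group by date_key).
    daily = defaultdict(set)
    for date_key, user_id in rows:
        daily[date_key].add(user_id)

    # Step 2: merge the daily sketches cumulatively within each month
    # (stand-in for array_agg(...) over (...) followed by hll_count.merge).
    result = []
    merged, current_month = set(), None
    for date_key in sorted(daily):
        if date_key[:7] != current_month:  # new month: start from an empty sketch
            current_month, merged = date_key[:7], set()
        merged |= daily[date_key]
        result.append((date_key, len(merged)))
    return result


sample_rows = [
    ("2022-01-01", "001"), ("2022-01-01", "002"),
    ("2022-01-02", "001"), ("2022-01-02", "003"),
    ("2022-01-03", "002"), ("2022-01-03", "003"),
    ("2022-01-04", "002"), ("2022-01-04", "004"),
]

for date_key, total_user in cumulative_by_merging(sample_rows):
    print(date_key, total_user)
```

The point of the design is that each day's rows are reduced once to a small per-day summary, and only those summaries are combined afterwards, instead of self-joining the raw table as in the original query.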