BigQuery missing rows with SUM OVER PARTITION BY
TL;DR:
Given this table:
WITH subscriptions AS (SELECT TIMESTAMP("2020-11-01") as date, "premium" as product, 50 as diff
UNION ALL SELECT TIMESTAMP("2020-11-01"), "basic", 100
UNION ALL SELECT TIMESTAMP("2020-11-02"), "basic", -10
UNION ALL SELECT TIMESTAMP("2020-11-03"), "premium", 20
UNION ALL SELECT TIMESTAMP("2020-11-03"), "basic", 40
)
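How can I get a table that also contains the missing date/product combinations (here: 2020-11-02 - premium), with a fallback value of 0 for diff?
Ideally this should work for any number of products. A list of all products can be obtained like this: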
SELECT ARRAY_AGG(DISTINCT product) FROM subscriptions
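I want to be able to get the subscription count per day, either for all products or only for some of them.
I think the easiest way to achieve this is to prepare a table that looks like this: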
|------------|---------|-------|
| date       | product | total |
|------------|---------|-------|
| 2020-11-01 | premium | 100   |
|------------|---------|-------|
| 2020-11-01 | basic   | 50    |
|------------|---------|-------|
With this table, I can easily group by date and product, or by date only, and sum up the totals.
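As a rough sketch of that roll-up: the CTE name `daily_totals` is hypothetical and its two rows simply mirror the example table above, so treat this as an illustration rather than the real schema.

```sql
-- Hypothetical prepared table: one row per date and product.
WITH daily_totals AS (
  SELECT TIMESTAMP("2020-11-01") AS date, "premium" AS product, 100 AS total
  UNION ALL SELECT TIMESTAMP("2020-11-01"), "basic", 50
)
-- Subscriptions per day across all products.
SELECT date, SUM(total) AS total
FROM daily_totals
-- To look at some products only, add a filter such as:
-- WHERE product IN ("premium")
GROUP BY date
ORDER BY date
-- With the two example rows this returns a single row: 2020-11-01, 150
```

Grouping by `date, product` instead of `date` gives the per-product view from the same table.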
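Before I get to that result table, I have already generated a table with the subscription difference per day and product, that is, how many new subscribers each product gained and how many it lost.
That table looks like this: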
|------------|---------|------|
| date       | product | diff |
|------------|---------|------|
| 2020-11-01 | premium | 50   |
|------------|---------|------|
| 2020-11-01 | basic   | -20  |
|------------|---------|------|
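That is, on November 1 the total number of premium users increased by 50 and the total number of basic users decreased by 20.
The problem is that this intermediate table has no row for a date on which a product did not change; see the example below.
When I started out there was no product column; I only had the date and diff columns.
To get from the second table to the first, I used this query, which worked perfectly: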
WITH subscriptions AS (SELECT TIMESTAMP("2020-11-01") as date, 150 as diff
UNION ALL SELECT TIMESTAMP("2020-11-02"), -10
UNION ALL SELECT TIMESTAMP("2020-11-03"), 60
)
SELECT
*,
SUM(diff) OVER (ORDER BY date) as total_subscriptions
FROM subscriptions
ORDER BY date
But when I add the product column and try to calculate the running total per day and product, some data points are missing.
WITH subscriptions AS (SELECT TIMESTAMP("2020-11-01") as date, "premium" as product, 50 as diff
UNION ALL SELECT TIMESTAMP("2020-11-01"), "basic", 100
UNION ALL SELECT TIMESTAMP("2020-11-02"), "basic", -10
UNION ALL SELECT TIMESTAMP("2020-11-03"), "premium", 20
UNION ALL SELECT TIMESTAMP("2020-11-03"), "basic", 40
)
SELECT
*,
SUM(diff) OVER (PARTITION BY product ORDER BY date) as total_subscriptions
FROM subscriptions
ORDER BY date
——
|------------|---------|-------|
| date       | product | total |
|------------|---------|-------|
| 2020-11-01 | basic   | 100   |
|------------|---------|-------|
| 2020-11-01 | premium | 50    |
|------------|---------|-------|
| 2020-11-02 | basic   | 90    |
|------------|---------|-------|
| 2020-11-03 | basic   | 130   |
|------------|---------|-------|
| 2020-11-03 | premium | 70    |
|------------|---------|-------|
If I now display the total number of subscriptions per day, I get:
150 -> 90 -> 200
But I expect:
150 -> 140 -> 200
The same happens with the total number of premium subscriptions per day:
50 -> 0 -> 70
But I expect:
50 -> 50 -> 70
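For reference, the all-products series above can be reproduced by rolling the windowed result up per day. This is only a sketch of how the numbers arise; the CTE name `running` is an assumption made for illustration.

```sql
WITH subscriptions AS (
  SELECT TIMESTAMP("2020-11-01") AS date, "premium" AS product, 50 AS diff
  UNION ALL SELECT TIMESTAMP("2020-11-01"), "basic", 100
  UNION ALL SELECT TIMESTAMP("2020-11-02"), "basic", -10
  UNION ALL SELECT TIMESTAMP("2020-11-03"), "premium", 20
  UNION ALL SELECT TIMESTAMP("2020-11-03"), "basic", 40
),
-- Running total per product; only dates that exist for a product get a row.
running AS (
  SELECT
    *,
    SUM(diff) OVER (PARTITION BY product ORDER BY date) AS total_subscriptions
  FROM subscriptions
)
-- Per-day roll-up: 2020-11-02 only sees the basic row (90), because
-- premium has no row for that date and so contributes nothing.
SELECT date, SUM(total_subscriptions) AS total
FROM running
GROUP BY date
ORDER BY date
-- Returns 150, 90, 200 for the three days
```

The window function itself is fine; the dip on 2020-11-02 comes purely from the missing premium row.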
I believe the best way to solve this is to add the missing date/product combinations.
How can I do that?
WITH subscriptions AS (SELECT TIMESTAMP("2020-11-01") as date, "premium" as product, 50 as diff
UNION ALL SELECT TIMESTAMP("2020-11-01"), "basic", 100
UNION ALL SELECT TIMESTAMP("2020-11-02"), "basic", -10
UNION ALL SELECT TIMESTAMP("2020-11-03"), "premium", 20
UNION ALL SELECT TIMESTAMP("2020-11-03"), "basic", 40
),
dates AS (
SELECT *
FROM UNNEST(GENERATE_TIMESTAMP_ARRAY('2020-11-01 00:00:00', '2020-11-03 00:00:00', INTERVAL 1 DAY)) as date
),
products AS (
SELECT DISTINCT product FROM subscriptions
)
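-- IFNULL supplies the fallback value of 0 for date/product combinations without a row
SELECT dates.date, products.product, IFNULL(subscriptions.diff, 0) AS diff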
FROM dates
CROSS JOIN products
LEFT JOIN subscriptions
ON subscriptions.date = dates.date AND subscriptions.product = products.product
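Building on the gap-filled rows above, the running total per product can then be computed on top. A minimal sketch, assuming the same sample data; the CTE name `filled` is introduced here for illustration only, and IFNULL makes the 0 fallback explicit.

```sql
WITH subscriptions AS (
  SELECT TIMESTAMP("2020-11-01") AS date, "premium" AS product, 50 AS diff
  UNION ALL SELECT TIMESTAMP("2020-11-01"), "basic", 100
  UNION ALL SELECT TIMESTAMP("2020-11-02"), "basic", -10
  UNION ALL SELECT TIMESTAMP("2020-11-03"), "premium", 20
  UNION ALL SELECT TIMESTAMP("2020-11-03"), "basic", 40
),
dates AS (
  SELECT date
  FROM UNNEST(GENERATE_TIMESTAMP_ARRAY(
    TIMESTAMP("2020-11-01"), TIMESTAMP("2020-11-03"), INTERVAL 1 DAY)) AS date
),
products AS (
  SELECT DISTINCT product FROM subscriptions
),
-- One row per date/product combination, with diff = 0 where nothing changed.
filled AS (
  SELECT dates.date, products.product, IFNULL(subscriptions.diff, 0) AS diff
  FROM dates
  CROSS JOIN products
  LEFT JOIN subscriptions
    ON subscriptions.date = dates.date AND subscriptions.product = products.product
)
SELECT
  date,
  product,
  diff,
  SUM(diff) OVER (PARTITION BY product ORDER BY date) AS total_subscriptions
FROM filled
ORDER BY date, product
-- basic:   100, 90, 130
-- premium:  50, 50,  70
```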
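If I follow you correctly, one approach is to generate a fixed list of dates for the period you want and cross join it with the list of products. That gives you all possible combinations. You can then bring in the subscriptions table with a left join and finally compute the window sum: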
-- assumes the subscriptions CTE from the question is in scope
select dt, p.product, sum(s.diff) over(partition by p.product order by dt) total
from unnest(generate_timestamp_array(
timestamp('2020-11-01'),
timestamp('2020-11-03'),
interval 1 day)
) dt
cross join (
select 'basic' product
union all select 'premium'
) p
left join subscriptions s on s.product = p.product and s.date = dt
We can make the query more generic by generating the date range and the product list dynamically:
-- d0 holds the first and last date; dt is each generated day in between
select dt, p.product, sum(s.diff) over(partition by p.product order by dt) total
from (select min(date) min_dt, max(date) max_dt from subscriptions) d0
cross join unnest(generate_timestamp_array(d0.min_dt, d0.max_dt, interval 1 day)) dt
cross join (select distinct product from subscriptions) p
left join subscriptions s on s.product = p.product and s.date = dt
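To check that the generic version yields the per-day totals the question expects, it can be wrapped in a CTE and summed per day. This is a sketch with the question's sample data; the CTE name `running` and the added COALESCE (so a product with no change on its first day starts at 0 rather than NULL) are assumptions on top of the query above.

```sql
with subscriptions as (
  select timestamp('2020-11-01') as date, 'premium' as product, 50 as diff
  union all select timestamp('2020-11-01'), 'basic', 100
  union all select timestamp('2020-11-02'), 'basic', -10
  union all select timestamp('2020-11-03'), 'premium', 20
  union all select timestamp('2020-11-03'), 'basic', 40
),
-- running total per product over the full date x product grid
running as (
  select
    dt,
    p.product,
    sum(coalesce(s.diff, 0)) over(partition by p.product order by dt) total
  from (select min(date) min_dt, max(date) max_dt from subscriptions) d0
  cross join unnest(generate_timestamp_array(d0.min_dt, d0.max_dt, interval 1 day)) dt
  cross join (select distinct product from subscriptions) p
  left join subscriptions s on s.product = p.product and s.date = dt
)
-- all products per day; add "where product = 'premium'" for one product only
select dt, sum(total) total
from running
group by dt
order by dt
-- returns 150, 140, 200 for the three days
```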
-- Try this: I build a list of products, add a "Total" entry to that list, and join it with your table to get the data as per your requirement.
WITH subscriptions AS (SELECT TIMESTAMP("2020-11-01") as date, "premium" as product, 50 as diff
UNION ALL SELECT TIMESTAMP("2020-11-01"), "basic", 100
UNION ALL SELECT TIMESTAMP("2020-11-02"), "basic", -10
UNION ALL SELECT TIMESTAMP("2020-11-03"), "premium", 20
UNION ALL SELECT TIMESTAMP("2020-11-03"), "basic", 40
),
product_name as (
Select product from subscriptions group by 1
union all
Select "Total" as product
)
Select date
,product
,total_subscriptions
from (
Select a.date
,a.product
,diff
,SUM(diff) OVER (PARTITION BY a.product ORDER BY a.date) as total_subscriptions
from
(
Select date,a.product
from product_name A
join subscriptions B
on 1=1
where a.product !='Total'
group by 1,2
) A
left join subscriptions B
on A.product = B.product
and A.date = B.date
group by 1,2,3
) group by 1,2,3
union all
Select date
,product
,total_subscriptions
from
(
Select date,a.product
,diff
,SUM(diff) OVER (PARTITION BY a.product ORDER BY date) as total_subscriptions
from product_name A
join subscriptions B
on 1=1
where a.product ='Total'
group by 1,2,3
) group by 1,2,3
order by 1,2