Combining overlapping date ranges without using a cross join in BigQuery
Suppose I have this dataset:
create schema if not exists dbo;
create table if not exists dbo.player_history(team_id INT, player_id INT, active_from TIMESTAMP, active_to TIMESTAMP);
truncate table dbo.player_history;
INSERT INTO dbo.player_history VALUES(1,1,'2020-01-01', '2020-01-08');
INSERT INTO dbo.player_history VALUES(1,2,'2020-06-01', '2020-09-08');
INSERT INTO dbo.player_history VALUES(1,3,'2020-06-10', '2020-10-01');
INSERT INTO dbo.player_history VALUES(1,4,'2020-02-01', '2020-02-15');
INSERT INTO dbo.player_history VALUES(1,5,'2021-01-01', '2021-01-08');
INSERT INTO dbo.player_history VALUES(1,6,'2021-01-02', '2022-06-08');
INSERT INTO dbo.player_history VALUES(1,7,'2021-01-03', '2021-06-08');
INSERT INTO dbo.player_history VALUES(1,8,'2021-01-04', '2021-06-08');
INSERT INTO dbo.player_history VALUES(1,9,'2020-01-02', '2021-02-05');
INSERT INTO dbo.player_history VALUES(1,10,'2020-10-01', '2021-04-08');
INSERT INTO dbo.player_history VALUES(1,11,'2020-11-01', '2021-05-08');
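I want to merge the overlapping date ranges so that I can identify the "islands" during which at least one player was active. I can get that result with a cross join and a correlated subquery, like this: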
with data_set as (
SELECT
a.team_id
, a.active_from
, ARRAY_AGG(b.active_to ORDER BY b.active_to DESC LIMIT 1)[SAFE_OFFSET(0)] AS active_to
FROM dbo.player_history a
LEFT JOIN dbo.player_history b
on a.team_id = b.team_id
where a.active_from between b.active_from and b.active_to
group by 1,2
)
select team_id
, min(active_from) as active_from
, active_to
from data_set
group by 1,3
order by active_from, active_to
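This gives me the result I want, but the approach is not feasible for larger datasets, and BigQuery advises against joining this way. Looking at the execution plan, it is mainly the join that causes the slowness. Is there a more efficient way to achieve the desired output?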
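You can use partitioned tables to get better performance when processing large amounts of data. A partitioned table splits one large table into smaller partitions, which can improve query performance because a query can scan only the partitions it needs. Time-unit partitioning is based on a TIMESTAMP, DATE, or DATETIME column.
One option is shown in the following example. The query creates a partitioned table and loads the data at the same time. The initial load may take some time, but later access to the partitioned table will be much faster.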
CREATE TABLE
mydataset.newtable (transaction_id INT64, transaction_date DATE)
PARTITION BY
transaction_date
AS SELECT transaction_id, transaction_date FROM mydataset.mytable
Then run the query:
SELECT transaction_id, transaction_date FROM mydataset.newtable
Where transaction_date between start_date and finish_date
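Note that partitioned tables have some limitations related to the use of cached results.
You can also check this documentation for points to consider to get the best performance when writing queries.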
A very fast query that returns, for each team, the list of periods during which at least one player was active:
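create temporary function test(a array<date>, b array<date>)
returns array<struct<a date, b date>>
language js
as """
// a: range starts sorted ascending; b: the matching range ends.
var out = [];
var start = a[0];
var end = a[0];
for (var i = 0; i < a.length; i++) {
  if (a[i] <= end) {
    // Range i overlaps the current island: extend the end if it reaches further.
    if (end < b[i]) end = b[i];
  } else {
    // Gap found: close the current island and start a new one.
    out.push({"a": start, "b": end});
    start = a[i];
    end = b[i];
  }
}
out.push({"a": start, "b": end});
return out;
""";
-- The table columns are TIMESTAMP, so cast them to DATE to match the function signature.
select team_id,
  test(array_agg(date(active_from) order by active_from),
       array_agg(date(active_to) order by active_from)) as active_periods
from dbo.player_history
group by 1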
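To try the sweep outside BigQuery, here is a minimal standalone sketch of the same idea in plain JavaScript. It is not the UDF itself: the `mergeRanges` name, the `from`/`to` field names, and the sample ranges are made up for illustration, and dates are assumed to be ISO-8601 strings so that string comparison matches date order. Unlike the UDF, it sorts its own input instead of relying on `order by active_from`.

```javascript
// Hypothetical sample ranges (not the table from the question).
// ISO-8601 date strings compare correctly as plain strings.
const ranges = [
  { from: "2020-01-01", to: "2020-01-08" },
  { from: "2020-02-01", to: "2020-02-15" },
  { from: "2020-01-05", to: "2020-01-20" },
  { from: "2020-02-15", to: "2020-03-01" },
];

function mergeRanges(input) {
  // Sort by start so that a single left-to-right pass is enough.
  const sorted = [...input].sort((x, y) =>
    x.from < y.from ? -1 : x.from > y.from ? 1 : 0
  );
  const out = [];
  for (const r of sorted) {
    const last = out[out.length - 1];
    if (last && r.from <= last.to) {
      // Overlaps (or touches) the open island: extend its end if needed.
      if (r.to > last.to) last.to = r.to;
    } else {
      // Gap before this range: start a new island.
      out.push({ from: r.from, to: r.to });
    }
  }
  return out;
}

console.log(mergeRanges(ranges));
```

With these made-up ranges, the first and third merge into one island and the second and fourth into another, because a range that starts exactly where the previous one ends is treated as touching, as in the UDF's `<=` comparison.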
If your players are active for only a few years on average, this query returns a list of all dates on which a team consisted of one player or fewer.
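with tbl_lst as (
  -- The table columns are TIMESTAMP, so cast to DATE before expanding each range into days.
  select team_id,
    date_diff(date(active_to), date(active_from), day) as days_active,
    generate_date_array(date(active_from), date(active_to), interval 1 day) as day_list
  from dbo.player_history
)
select team_id, day, sum(active_players) as active_players
from (
  -- One row per team and day with the number of players active on that day.
  select team_id, day, count(1) as active_players
  from tbl_lst, unnest(tbl_lst.day_list) as day
  group by 1, 2
  union all
  -- Zero rows for every day in the team's span, so days with no player still appear.
  select team_id, day, 0
  from (
    select team_id, min(date(active_from)) as team_start, max(date(active_from)) as team_end
    from dbo.player_history
    group by 1
  ), unnest(generate_date_array(team_start, team_end, interval 1 day)) day
)
group by 1, 2
having active_players < 2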
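The following query needs 16 stages and is slower, but it returns the number of active players for each time interval. It joins two tables; data_set holds only the boundary dates of the ranges, so ten years of data gives at most about 3,650 rows.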
# Generate the list of all distinct boundary dates
with dates as (
  select active_from as start_date from dbo.player_history
  union distinct select active_to from dbo.player_history
  # union distinct select timestamp("2050-01-01")
),
# Pair each boundary with the next one to form consecutive intervals
data_set as (
  select start_date, lead(start_date) over (order by start_date) as end_date
  from dates
)
# Count the players that are active throughout each interval
select team_id, start_date, end_date,
  count(player_id) as active_player,
  string_agg(cast(player_id as string)) as player_list
from dbo.player_history
right join data_set
  on active_from <= start_date and active_to >= end_date
group by 1, 2, 3
having active_player < 2
order by start_date
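The interval-counting idea can also be checked on a small in-memory example. The following is only a sketch in plain JavaScript under stated assumptions: the `countActive` function, the field names, and the sample rows are hypothetical, dates are ISO-8601 strings, and there is a single team. It skips the final open-ended row that `lead` would produce with a NULL end date.

```javascript
// Hypothetical sample rows (not the table from the question).
// ISO-8601 date strings compare correctly as plain strings.
const history = [
  { player: 1, from: "2020-01-01", to: "2020-01-10" },
  { player: 2, from: "2020-01-05", to: "2020-01-20" },
  { player: 3, from: "2020-02-01", to: "2020-02-10" },
];

function countActive(rows) {
  // Distinct, sorted boundary dates (the role of the "dates" CTE).
  const bounds = [...new Set(rows.flatMap((r) => [r.from, r.to]))].sort();
  const out = [];
  // Pair each boundary with the next one (the role of the "data_set" CTE).
  for (let i = 0; i + 1 < bounds.length; i++) {
    const start = bounds[i];
    const end = bounds[i + 1];
    // Players whose range covers the whole interval (the join condition).
    const players = rows
      .filter((r) => r.from <= start && r.to >= end)
      .map((r) => r.player);
    out.push({ start, end, active: players.length, players });
  }
  return out;
}

// Keep only intervals with fewer than two active players (the HAVING clause).
console.log(countActive(history).filter((row) => row.active < 2));
```

The nested filter makes this quadratic in the worst case, which is acceptable for a demonstration but mirrors why the join-based SQL needs more stages than the single-pass merge.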