Combining overlapping date ranges without using a cross join in BigQuery

Suppose I have this dataset:

create schema if not exists dbo;
create table if not exists dbo.player_history(team_id INT, player_id INT, active_from TIMESTAMP, active_to TIMESTAMP);
truncate table dbo.player_history;
INSERT INTO dbo.player_history VALUES(1,1,'2020-01-01', '2020-01-08');
INSERT INTO dbo.player_history VALUES(1,2,'2020-06-01', '2020-09-08');
INSERT INTO dbo.player_history VALUES(1,3,'2020-06-10', '2020-10-01');
INSERT INTO dbo.player_history VALUES(1,4,'2020-02-01', '2020-02-15');
INSERT INTO dbo.player_history VALUES(1,5,'2021-01-01', '2021-01-08');
INSERT INTO dbo.player_history VALUES(1,6,'2021-01-02', '2022-06-08');
INSERT INTO dbo.player_history VALUES(1,7,'2021-01-03', '2021-06-08');
INSERT INTO dbo.player_history VALUES(1,8,'2021-01-04', '2021-06-08');
INSERT INTO dbo.player_history VALUES(1,9,'2020-01-02', '2021-02-05');
INSERT INTO dbo.player_history VALUES(1,10,'2020-10-01', '2021-04-08');
INSERT INTO dbo.player_history VALUES(1,11,'2020-11-01', '2021-05-08');

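I want to merge overlapping date ranges so that I can identify the "islands" during which at least one player is active. I can get that result with a self-join and a correlated aggregate, like this: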

with data_set as (
SELECT 
    a.team_id
    , a.active_from
    , ARRAY_AGG(b.active_to ORDER BY b.active_to DESC LIMIT 1)[SAFE_OFFSET(0)] AS active_to
FROM dbo.player_history a
LEFT JOIN dbo.player_history b
    on a.team_id = b.team_id
where a.active_from between b.active_from and b.active_to
group by 1,2
)

select team_id
    , min(active_from) as active_from
    , active_to
from data_set
group by 1,3
order by active_from, active_to

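This gives me the result I want, but the approach is not feasible for larger datasets, and BigQuery recommends against joining this way. Looking at the execution plan, the join is the main cause of the slowness. Is there a more efficient way to produce the same output?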

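You can use partitioned tables to get better performance when working with large amounts of data. A partitioned table splits a large table into smaller partitions, which can improve query performance because a query only needs to read the partitions it filters on. Time-unit partitioning is based on a TIMESTAMP, DATE, or DATETIME column.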

One option could be:

  1. Create a partitioned table
  2. Load the data into the partitioned table
  3. Run the query

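Here is an example.

This query creates a partitioned table and loads the data at the same time. The initial load may take some time, but queries against the partitioned table will be much faster afterwards.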

CREATE TABLE
  mydataset.newtable (transaction_id INT64, transaction_date DATE)
PARTITION BY
  transaction_date
AS SELECT transaction_id, transaction_date FROM mydataset.mytable

Then run the query:

SELECT transaction_id, transaction_date FROM mydataset.newtable
Where transaction_date between start_date and finish_date

There are some limitations when using partitioned tables, as they use results saved in the cache.

You can also consult the BigQuery documentation on query performance for points to consider when writing queries.

Here is a very fast query that returns, for each team, the list of periods during which at least one player was active:

create temporary function test(a array<date>,b array<date>)
returns array<struct<a date,b date>>
language js
as """
var out=[];
var start=a[0];
var end=a[0];
for(var i=0;i<a.length;i++)
{
if(a[i]<=end) {if(end<b[i]) end=b[i]}
else {
    var tmp={"a":start,"b":end};
    out.push(tmp);
    start=a[i];
    end=b[i];
 }
}
out.push({"a":start,"b":end});

return out;
""";

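-- active_from/active_to are TIMESTAMP in the question's table; the UDF takes DATE arrays, so cast first
select team_id,
  test(array_agg(date(active_from) order by active_from),
       array_agg(date(active_to) order by active_from)) as active_periods
from dbo.player_history
group by 1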
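
The single pass inside the UDF can be tried outside BigQuery. The sketch below is a standalone JavaScript version for Node, not the answer's exact code: `mergeRanges` and the `{from, to}` object shape are illustrative names, and dates are assumed to be ISO `YYYY-MM-DD` strings so that plain string comparison orders them correctly.

```javascript
// Merge overlapping [from, to] ranges in one pass over the ranges sorted by start.
// Assumption: dates are ISO "YYYY-MM-DD" strings, which compare correctly as strings.
function mergeRanges(ranges) {
  if (ranges.length === 0) return [];
  const sorted = [...ranges].sort((x, y) =>
    x.from < y.from ? -1 : x.from > y.from ? 1 : 0
  );
  const islands = [];
  let current = { from: sorted[0].from, to: sorted[0].to };
  for (const r of sorted.slice(1)) {
    if (r.from <= current.to) {
      // Range starts inside the current island: extend the island if it ends later.
      if (r.to > current.to) current.to = r.to;
    } else {
      // Gap found: close the current island and start a new one.
      islands.push(current);
      current = { from: r.from, to: r.to };
    }
  }
  islands.push(current);
  return islands;
}

// Four of the question's rows (players 1, 2, 3 and 4).
const sampleRanges = [
  { from: "2020-01-01", to: "2020-01-08" },
  { from: "2020-06-01", to: "2020-09-08" },
  { from: "2020-06-10", to: "2020-10-01" },
  { from: "2020-02-01", to: "2020-02-15" },
];
console.log(mergeRanges(sampleRanges));
```

For these four rows the sketch yields three islands: 2020-01-01 to 2020-01-08, 2020-02-01 to 2020-02-15, and 2020-06-01 to 2020-10-01. Because each range is visited once after sorting, the cost grows with the sort rather than with the number of range pairs, which is what makes this approach cheaper than the self-join.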

Your result shows start dates that fall inside a previous period, which is confusing.

If your players are only active for a few years on average, this query lists every date on which a team had one player or fewer:

-- active_from/active_to are TIMESTAMP in the question's table; cast to DATE for the date functions
with tbl_lst as (
Select team_id,date_diff(date(active_to),date(active_from),day) as active_days,
generate_date_array(date(active_from), date(active_to), INTERVAL 1 DAY) as day_list
from dbo.player_history )
SELECT team_id,day,sum(active_players) as active_players
FROM (
-- one row per team and day with the number of players active on that day
SELECT team_id,day,count(1) as active_players
from tbl_lst,unnest(tbl_lst.day_list) as day
group by 1,2
Union ALL 
-- zero-fill every day so that days without any player also appear
Select team_id, day,0 from
(Select team_id,min(date(active_from)) as team_START,max(date(active_from)) as team_END 
from dbo.player_history
group by 1),
unnest(generate_date_array(team_START, team_END, INTERVAL 1 DAY)) day
)
group by 1,2
having active_players<2 
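
The day-expansion idea behind this query can be sketched in standalone JavaScript for Node. This is an illustration only: `daysBetween`, `sparseDays`, and the `{from, to}` shape are hypothetical names, dates are assumed to be ISO `YYYY-MM-DD` strings interpreted as UTC, and a single team is assumed.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// List every calendar day from `from` to `to`, inclusive.
// Date-only ISO strings are parsed as UTC, so daylight saving does not shift days.
function daysBetween(from, to) {
  const days = [];
  const last = Date.parse(to);
  for (let t = Date.parse(from); t <= last; t += DAY_MS) {
    days.push(new Date(t).toISOString().slice(0, 10));
  }
  return days;
}

// Return the days on which at most `maxPlayers` players were active.
function sparseDays(ranges, maxPlayers = 1) {
  if (ranges.length === 0) return [];
  const counts = new Map();
  // Zero-fill from the earliest start to the latest start, as the query does.
  let first = ranges[0].from;
  let last = ranges[0].from;
  for (const r of ranges) {
    if (r.from < first) first = r.from;
    if (r.from > last) last = r.from;
  }
  for (const d of daysBetween(first, last)) counts.set(d, 0);
  // Add one for every day a player is active.
  for (const r of ranges) {
    for (const d of daysBetween(r.from, r.to)) {
      counts.set(d, (counts.get(d) || 0) + 1);
    }
  }
  return [...counts]
    .filter(([, n]) => n <= maxPlayers)
    .map(([day, n]) => ({ day, activePlayers: n }))
    .sort((x, y) => (x.day < y.day ? -1 : x.day > y.day ? 1 : 0));
}
```

The memory cost is one entry per team and day, which is why this only works well when the ranges span a few years at most.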

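The following query needs 16 stages and is slower, but it returns the number of active players for each time interval. It joins two tables; data_set contains only the boundary dates of the intervals, so for 10 years of data it has at most about 3650 rows.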

#generate a list of all dates
with dates as (
SELECT active_from as start_date from dbo.player_history 
Union ALL SELECT active_to  from dbo.player_history 
#Union ALL Select timestamp("2050-01-01")
),
# add next date to this list
data_set as ( 
SELECT Distinct start_date, lead(start_date) over (order by start_date) as end_date
from dates
order by 1
)

# count player at each time
Select team_id, start_date,end_date,
count(player_id) as active_player,
string_agg(cast(player_id as string)) as player_list

from dbo.player_history 
RIGHT JOIN
data_set
on active_from<=start_date and active_to>=end_date
group by 1,2,3
having active_player<2
order by start_date
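
The interval counting can also be sketched in standalone JavaScript for Node. This is a simplified illustration, not a translation of the query: `countPerInterval` and the `{id, from, to}` shape are hypothetical names, boundary dates are deduplicated before pairing, a single team is assumed, and dates are assumed to be ISO `YYYY-MM-DD` strings.

```javascript
// For each pair of consecutive boundary dates, count the players whose
// range covers the whole interval between them.
function countPerInterval(players) {
  // Distinct, sorted list of every start and end date.
  // The default sort is lexicographic, which is correct for ISO date strings.
  const bounds = [...new Set(players.flatMap((p) => [p.from, p.to]))].sort();
  const intervals = [];
  for (let i = 0; i + 1 < bounds.length; i++) {
    const start = bounds[i];
    const end = bounds[i + 1];
    const ids = players
      .filter((p) => p.from <= start && p.to >= end)
      .map((p) => p.id);
    intervals.push({
      start,
      end,
      activePlayers: ids.length,
      playerList: ids.join(","),
    });
  }
  return intervals;
}

// Four of the question's rows (players 1, 4, 2 and 3).
const samplePlayers = [
  { id: 1, from: "2020-01-01", to: "2020-01-08" },
  { id: 4, from: "2020-02-01", to: "2020-02-15" },
  { id: 2, from: "2020-06-01", to: "2020-09-08" },
  { id: 3, from: "2020-06-10", to: "2020-10-01" },
];
console.log(countPerInterval(samplePlayers));
```

For these four rows the sketch produces seven intervals. Two of them (2020-01-08 to 2020-02-01 and 2020-02-15 to 2020-06-01) have no active player, and only 2020-06-10 to 2020-09-08 has two.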

