How do I build an events table from three separate tables showing incremental change over time?
I'm trying to build a dataset that shows incremental change over time for some product attributes. The data is in AWS Athena in three separate tables that each store different attributes, and each table can be updated independently at different times. tbl1 can be joined to tbl2, and tbl2 can be joined to tbl3. There is always a one-to-one relationship between the tables, so in this example tbl1.id=1 will only ever relate to tbl2.id=2, and tbl2.id=2 will only relate to tbl3.id=3:
tbl1
| id | updated_at | bool |
| 1 | 2019-09-10 06:00 | True |
| 1 | 2020-08-05 10:00 | False |
| 1 | 2020-09-03 15:00 | True |
tbl2
| id | tbl1_id | updated_at | desc |
| 2 | 1 | 2019-09-10 06:00 | thing 1 |
tbl3
| id | tbl2_id | updated_at | value |
| 3 | 2 | 2019-09-10 06:00 | 100 |
| 3 | 2 | 2019-09-19 09:00 | 50 |
| 3 | 2 | 2019-12-02 11:00 | 20 |
I'm trying to write a query that joins this data into a single table with a row for each incremental update. From the above tables, there was the initial insert on 2019-09-10, then four other changes made across tbl1 and tbl3, so it should end up as five rows that look like:
| tbl1_id | tbl1_updated_at | bool | tbl2_id | tbl2_updated_at | desc | tbl3_id | tbl3_updated_at | value |
| 1 | 2019-09-10 06:00 | True | 2 | 2019-09-10 06:00 | thing 1 | 3 | 2019-09-10 06:00 | 100 |
| 1 | 2019-09-10 06:00 | True | 2 | 2019-09-10 06:00 | thing 1 | 3 | 2019-09-19 09:00 | 50 |
| 1 | 2019-09-10 06:00 | True | 2 | 2019-09-10 06:00 | thing 1 | 3 | 2019-12-02 11:00 | 20 |
| 1 | 2020-08-05 10:00 | False | 2 | 2019-09-10 06:00 | thing 1 | 3 | 2019-12-02 11:00 | 20 |
| 1 | 2020-09-03 15:00 | True | 2 | 2019-09-10 06:00 | thing 1 | 3 | 2019-12-02 11:00 | 20 |
I started with the idea of joining everything together and using some WHERE clauses like:
select
    *
from
    tbl1
    left join tbl2 on tbl1.id = tbl2.tbl1_id
    left join tbl3 on tbl2.id = tbl3.tbl2_id
where
    ???
But I couldn't get it working, and I'm not sure this approach would even work. Perhaps there are some window functions that would do it? It feels like it should be possible to do this in SQL, but after two days of trying I'm completely at a loss as to how!
This is quite complicated. It would be simpler if you had the tbl1 id in all the tables.
In any case, the idea is to union all the columns together along with the tbl1 id and updated_at. Then aggregate, so there is one row per id and updated_at timestamp.
Finally, use last_value() with the ignore nulls option to get the most recent value that is populated:
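with t as (
    -- Collapse all changes to one row per tbl1 id and change timestamp.
    -- descr stands in for the question's desc column (DESC is a reserved word).
    select id, updated_at, max(bool) as bool, max(descr) as descr, max(value) as value
    from (
        -- Changes to tbl1.
        select tbl1.id, tbl1.updated_at, tbl1.bool, null as descr, null as value
        from tbl1
        union all
        -- Changes to tbl2, keyed by the tbl1 id.
        select tbl2.tbl1_id, tbl2.updated_at, null, tbl2.descr, null
        from tbl2
        union all
        -- Changes to tbl3, keyed by the tbl1 id via tbl2; the change time is tbl3's.
        select tbl2.tbl1_id, tbl3.updated_at, null, null, tbl3.value
        from tbl2
        join tbl3 on tbl2.id = tbl3.tbl2_id
    ) u
    group by id, updated_at
)
-- Carry the latest non-null value of each attribute forward in time.
-- Presto/Trino place IGNORE NULLS after the argument list, before OVER.
select id, updated_at,
    last_value(bool) ignore nulls over (partition by id order by updated_at) as bool,
    last_value(descr) ignore nulls over (partition by id order by updated_at) as descr,
    last_value(value) ignore nulls over (partition by id order by updated_at) as value
from t
order by id, updated_at;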
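The query above returns one shared updated_at per row, whereas the desired output in the question keeps each table's own id and updated_at. One possible extension, offered as a sketch rather than a tested solution, is to carry those columns through the same union, aggregate, and forward-fill steps. The names events, merged, and event_at are made up for illustration, descr is assumed to be the name of tbl2's text column, and IGNORE NULLS is written in the Presto/Trino position (after the argument list), which is an assumption about the Athena engine version in use.

```sql
with events as (
    -- One row per change in any table, keyed by the tbl1 id and the change time.
    select tbl1.id as tbl1_id, tbl1.updated_at as event_at,
           tbl1.updated_at as tbl1_updated_at, tbl1.bool as bool,
           null as tbl2_id, null as tbl2_updated_at, null as descr,
           null as tbl3_id, null as tbl3_updated_at, null as value
    from tbl1
    union all
    select tbl2.tbl1_id, tbl2.updated_at,
           null, null,
           tbl2.id, tbl2.updated_at, tbl2.descr,
           null, null, null
    from tbl2
    union all
    select tbl2.tbl1_id, tbl3.updated_at,
           null, null,
           null, null, null,
           tbl3.id, tbl3.updated_at, tbl3.value
    from tbl2
    join tbl3 on tbl2.id = tbl3.tbl2_id
),
merged as (
    -- Changes that share a timestamp collapse into a single row.
    select tbl1_id, event_at,
           max(tbl1_updated_at) as tbl1_updated_at, max(bool) as bool,
           max(tbl2_id) as tbl2_id, max(tbl2_updated_at) as tbl2_updated_at, max(descr) as descr,
           max(tbl3_id) as tbl3_id, max(tbl3_updated_at) as tbl3_updated_at, max(value) as value
    from events
    group by tbl1_id, event_at
)
-- Forward-fill every column with its latest non-null value as of event_at.
select tbl1_id,
       last_value(tbl1_updated_at) ignore nulls over (partition by tbl1_id order by event_at) as tbl1_updated_at,
       last_value(bool) ignore nulls over (partition by tbl1_id order by event_at) as bool,
       last_value(tbl2_id) ignore nulls over (partition by tbl1_id order by event_at) as tbl2_id,
       last_value(tbl2_updated_at) ignore nulls over (partition by tbl1_id order by event_at) as tbl2_updated_at,
       last_value(descr) ignore nulls over (partition by tbl1_id order by event_at) as descr,
       last_value(tbl3_id) ignore nulls over (partition by tbl1_id order by event_at) as tbl3_id,
       last_value(tbl3_updated_at) ignore nulls over (partition by tbl1_id order by event_at) as tbl3_updated_at,
       last_value(value) ignore nulls over (partition by tbl1_id order by event_at) as value
from merged
order by tbl1_id, event_at;
```

On the sample data there are five distinct change timestamps for tbl1 id 1, so this should produce five rows, with each table's columns holding their last known values at that point in time.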