How do I select columns whenever they change?
I'm trying to create a slowly changing dimension (Type 2 dimension) and am a bit lost on how to logically write it out. Say that we have a source table with a grain of Person | Country | Department | Login Time. I want to create this dimension table with Person | Country | Department | Eff Start Time | Eff End Time.
Data could look like this:
Person | Country | Department | Login Time
------------------------------------------
Bob | CANADA | Marketing | 2009-01-01
Bob | CANADA | Marketing | 2009-02-01
Bob | USA | Marketing | 2009-03-01
Bob | USA | Sales | 2009-04-01
Bob | MEX | Product | 2009-05-01
Bob | MEX | Product | 2009-06-01
Bob | MEX | Product | 2009-07-01
Bob | CANADA | Marketing | 2009-08-01
What I want in the Type 2 dimension would look like this:
Person | Country | Department | Eff Start time | Eff End Time
------------------------------------------------------------------
Bob | CANADA | Marketing | 2009-01-01 | 2009-03-01
Bob | USA | Marketing | 2009-03-01 | 2009-04-01
Bob | USA | Sales | 2009-04-01 | 2009-05-01
Bob | MEX | Product | 2009-05-01 | 2009-08-01
Bob | CANADA | Marketing | 2009-08-01 | NULL
Assume that Bob's name, Country, and Department haven't been updated since 2009-08-01, so the end time is left as NULL.
What function would work best here? This is on Netezza, which uses a flavor of Postgres.
Obviously GROUP BY would not work here, because the same grouping recurs later on (I added Bob | CANADA | Marketing as the last row to show this).
EDIT
Including a hash column on Person, Country, and Department would make sense, correct? Thinking of using logic like:
SELECT PERSON, COUNTRY, DEPARTMENT
FROM table t1
WHERE person = person
  AND t1.hash <> hash_function(person, country, department)
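One way to sketch that change-detection idea without a self-join is with the lag() window function, comparing each row's tracked attributes against the previous row for the same person. This is an untested sketch against the sample table (here called so, as in the answer below); comparing the raw columns is logically equivalent to comparing hashes of them, so a stored hash column is purely an optimization for wide attribute sets.

```sql
-- Sketch: flag rows whose tracked attributes differ from the prior row.
select
    person, country, department, login_time,
    case
        when lag(country)    over (partition by person order by login_time) = country
         and lag(department) over (partition by person order by login_time) = department
        then 0
        else 1  -- first row per person, or a changed attribute
    end is_change
from so;
```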
create table so (
person varchar(32)
,country varchar(32)
,department varchar(32)
,login_time date
) distribute on random;
insert into so values ('Bob','CANADA','Marketing','2009-01-01');
insert into so values ('Bob','CANADA','Marketing','2009-02-01');
insert into so values ('Bob','USA','Marketing','2009-03-01');
insert into so values ('Bob','USA','Sales','2009-04-01');
insert into so values ('Bob','MEX','Product','2009-05-01');
insert into so values ('Bob','MEX','Product','2009-06-01');
insert into so values ('Bob','MEX','Product','2009-07-01');
insert into so values ('Bob','CANADA','Marketing','2009-08-01');
/* ************************************************************************** */
with prm as ( --Create an ordinal primary key.
select
*
,row_number() over (
partition by person
order by login_time
) rwn
from
so
), chn as ( --Chain events to their previous and next event.
select
cur.rwn
,cur.person
,cur.country
,cur.department
,cur.login_time cur_login
,case
when
cur.country = prv.country
and cur.department = prv.department
then 1
else 0
end prv_equal
,case
when
(
cur.country = nxt.country
and cur.department = nxt.department
) or nxt.rwn is null --No next record should be equivalent to matching.
then 1
else 0
end nxt_equal
,case prv_equal
when 0 then cur_login
else null
end eff_login_start_sparse
,case
when eff_login_start_sparse is null
then max(eff_login_start_sparse) over (
partition by cur.person
order by rwn
rows unbounded preceding --The secret sauce.
)
else eff_login_start_sparse
end eff_login_start
,case nxt_equal
when 0 then cur_login
else null
end eff_login_end
from
prm cur
left outer join prm nxt on
cur.person = nxt.person
and cur.rwn + 1 = nxt.rwn
left outer join prm prv on
cur.person = prv.person
and cur.rwn - 1 = prv.rwn
), grp as ( --Group by login starts.
select
person
,country
,department
,eff_login_start
,max(eff_login_end) eff_login_end
from
chn
group by
person
,country
,department
,eff_login_start
), led as ( --Change the effective end to be the next start, if desired.
select
person
,country
,department
,eff_login_start
,case
when eff_login_end is null
then null
else
lead(eff_login_start) over (
partition by person
order by eff_login_start
)
end eff_login_end
from
grp
)
select * from led order by eff_login_start;
This code returns the following table.
PERSON | COUNTRY | DEPARTMENT | EFF_LOGIN_START | EFF_LOGIN_END
--------+---------+------------+-----------------+---------------
Bob | CANADA | Marketing | 2009-01-01 | 2009-03-01
Bob | USA | Marketing | 2009-03-01 | 2009-04-01
Bob | USA | Sales | 2009-04-01 | 2009-05-01
Bob | MEX | Product | 2009-05-01 | 2009-08-01
Bob | CANADA | Marketing | 2009-08-01 |
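For comparison, the same result can be sketched more compactly by replacing the two self-joins with lag()/lead() and a running sum that numbers each contiguous block. This is an untested alternative sketch against the same so table, not part of the original answer:

```sql
with chg as ( -- 1 where country/department differ from the previous row
    select
        person, country, department, login_time,
        case
            when lag(country)    over (partition by person order by login_time) = country
             and lag(department) over (partition by person order by login_time) = department
            then 0 else 1
        end is_change
    from so
), grp as (   -- running sum of change flags numbers each contiguous block
    select
        *,
        sum(is_change) over (
            partition by person
            order by login_time
            rows unbounded preceding
        ) blk
    from chg
), agg as (   -- one row per block: earliest login is the effective start
    select person, country, department, blk,
           min(login_time) eff_login_start
    from grp
    group by person, country, department, blk
)
select
    person, country, department, eff_login_start,
    lead(eff_login_start) over (
        partition by person
        order by eff_login_start
    ) eff_login_end
from agg
order by eff_login_start;
```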
I must have solved this four or five times in the past few years and keep neglecting to write it down formally. I'm glad to have the chance to do it, so this is a great question.
When attempting this, I like writing down the problem in matrix form. Here's the input, presuming that all values have the same key in the SCD.
Cv | Ce
----|----
A | 10
A | 11
B | 14
C | 16
D | 18
D | 25
D | 34
A | 40
Where Cv is the value that we'll need to compare against (again, presuming that the key value for the SCD is equal in this data; we'll be partitioning over the key value the entire time, so it's irrelevant to the solution) and Ce is the event time.
First, we need an ordinal primary key. I've designated this Ck in the table. This will allow us to join the table to itself to get the previous and next events. I've called these columns Pk (previous key), Nk (next key), Pv, and Nv.
Cv | Ce | Ck | Pk | Pv | Nk | Nv |
----|----|----|----|----|----|----|
A | 10 | 1 | | | 2 | A |
A | 11 | 2 | 1 | A | 3 | B |
B | 14 | 3 | 2 | A | 4 | C |
C | 16 | 4 | 3 | B | 5 | D |
D | 18 | 5 | 4 | C | 6 | D |
D | 25 | 6 | 5 | D | 7 | D |
D | 34 | 7 | 6 | D | 8 | A |
A | 40 | 8 | 7 | D | | |
Now we need some columns to see if we're at the beginning or end of a contiguous event block. I'll call these Pc and Nc, for contiguous. Pc is defined as Pv = Cv => true, where 1 represents true and 0 represents false. Nc is defined similarly, except that the null case defaults to true (we'll see why in a minute).
Cv | Ce | Ck | Pk | Pv | Nk | Nv | Pc | Nc |
----|----|----|----|----|----|----|----|----|
A | 10 | 1 | | | 2 | A | 0 | 1 |
A | 11 | 2 | 1 | A | 3 | B | 1 | 0 |
B | 14 | 3 | 2 | A | 4 | C | 0 | 0 |
C | 16 | 4 | 3 | B | 5 | D | 0 | 0 |
D | 18 | 5 | 4 | C | 6 | D | 0 | 1 |
D | 25 | 6 | 5 | D | 7 | D | 1 | 1 |
D | 34 | 7 | 6 | D | 8 | A | 1 | 0 |
A | 40 | 8 | 7 | D | | | 0 | 1 |
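In SQL terms, Pc and Nc for this single-key example could be sketched with lag()/lead() standing in for the Pk/Nk self-joins; the table name events(cv, ce) is hypothetical:

```sql
select
    cv, ce,
    case when lag(cv) over (order by ce) = cv
         then 1 else 0 end pc,                      -- previous value matches
    case when lead(cv) over (order by ce) = cv
           or lead(cv) over (order by ce) is null   -- no next row defaults to 1
         then 1 else 0 end nc
from events;
```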
Now you can start to see how the 1,1 combination of Pc,Nc is a completely useless record. We know this intuitively, since Bob's Mex/Product combination on the 6th row is pretty much useless information when building an SCD.
So let's get rid of the useless information. I'll add two new columns here: an almost-complete effective start time called Sn and an actually-complete effective end time called Ee. Sn is populated with Ce when Pc is 0, and Ee is populated with Ce when Nc is 0.
Cv | Ce | Ck | Pk | Pv | Nk | Nv | Pc | Nc | Sn | Ee |
----|----|----|----|----|----|----|----|----|----|----|
A | 10 | 1 | | | 2 | A | 0 | 1 | 10 | |
A | 11 | 2 | 1 | A | 3 | B | 1 | 0 | | 11 |
B | 14 | 3 | 2 | A | 4 | C | 0 | 0 | 14 | 14 |
C | 16 | 4 | 3 | B | 5 | D | 0 | 0 | 16 | 16 |
D | 18 | 5 | 4 | C | 6 | D | 0 | 1 | 18 | |
D | 25 | 6 | 5 | D | 7 | D | 1 | 1 | | |
D | 34 | 7 | 6 | D | 8 | A | 1 | 0 | | 34 |
A | 40 | 8 | 7 | D | | | 0 | 1 | 40 | |
This looks really close, but we still have the problem that we can't group by Cv (person/country/department). What we need is for Sn to populate all those nulls with the previous value of Sn. You could join this table to itself on rwn < rwn and get the maximum, but I'm going to be lazy and use Netezza's analytic functions and the rows unbounded preceding clause. It's a shortcut to the method I just described. So we're going to create another column called Es, effective start, defined as follows.
case
when Sn is null
then max(Sn) over (
partition by k --key value of the SCD
order by Ck
rows unbounded preceding
)
else Sn
end Es
With that definition, we get this.
Cv | Ce | Ck | Pk | Pv | Nk | Nv | Pc | Nc | Sn | Ee | Es |
----|----|----|----|----|----|----|----|----|----|----|----|
A | 10 | 1 | | | 2 | A | 0 | 1 | 10 | | 10 |
A | 11 | 2 | 1 | A | 3 | B | 1 | 0 | | 11 | 10 |
B | 14 | 3 | 2 | A | 4 | C | 0 | 0 | 14 | 14 | 14 |
C | 16 | 4 | 3 | B | 5 | D | 0 | 0 | 16 | 16 | 16 |
D | 18 | 5 | 4 | C | 6 | D | 0 | 1 | 18 | | 18 |
D | 25 | 6 | 5 | D | 7 | D | 1 | 1 | | | 18 |
D | 34 | 7 | 6 | D | 8 | A | 1 | 0 | | 34 | 18 |
A | 40 | 8 | 7 | D | | | 0 | 1 | 40 | | 40 |
The rest is trivial. Group by Es and grab the max of Ee to obtain this table.
Cv | Es | Ee |
----|----|----|
A | 10 | 11 |
B | 14 | 14 |
C | 16 | 16 |
D | 18 | 34 |
A | 40 | |
If you want to populate the effective end time with the next start, join the table again to itself or use the lead() window function to grab it.
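The lead() variant of that step appears in the led CTE above; the self-join alternative could be sketched like this, where grouped(cv, es, ee) is a hypothetical table holding the grouped result and the next start is the minimum Es greater than the current row's:

```sql
select
    cur.cv,
    cur.es,
    case
        when cur.ee is null then null   -- still-open record keeps its NULL end
        else min(nxt.es)                -- earliest start after this one
    end ee
from grouped cur
left outer join grouped nxt on nxt.es > cur.es
group by cur.cv, cur.es, cur.ee;
```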