
How do I select columns whenever they change?

I'm trying to create a slowly changing dimension (type 2 dimension) and am a bit lost on how to logically write it out. Say that we have a source table with a grain of Person | Country | Department | Login Time. I want to create this dimension table with Person | Country | Department | Eff Start time | Eff End Time.

Data could look like this:

Person | Country | Department | Login Time
------------------------------------------
Bob    | CANADA  | Marketing  | 2009-01-01
Bob    | CANADA  | Marketing  | 2009-02-01
Bob    | USA     | Marketing  | 2009-03-01
Bob    | USA     | Sales      | 2009-04-01
Bob    | MEX     | Product    | 2009-05-01
Bob    | MEX     | Product    | 2009-06-01
Bob    | MEX     | Product    | 2009-07-01
Bob    | CANADA  | Marketing  | 2009-08-01

What I want in the Type 2 dimension would look like this:

Person | Country | Department | Eff Start time | Eff End Time
------------------------------------------------------------------
Bob    | CANADA  | Marketing  | 2009-01-01     | 2009-03-01
Bob    | USA     | Marketing  | 2009-03-01     | 2009-04-01
Bob    | USA     | Sales      | 2009-04-01     | 2009-05-01
Bob    | MEX     | Product    | 2009-05-01     | 2009-08-01
Bob    | CANADA  | Marketing  | 2009-08-01     | NULL 

Assume that Bob's name, Country, and Department haven't been updated since 2009-08-01, so the end time is left as NULL.

What function would work best here? This is on Netezza, which uses a flavor of Postgres.

Obviously GROUP BY would not work here because the same grouping can reappear later on (I added Bob | CANADA | Marketing as the last row to show this).
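
For example, a straight GROUP BY like the one below (logins standing in for the source table name) would merge the two separate CANADA | Marketing stints into one row:

SELECT person, country, department,
       MIN(login_time) AS eff_start,
       MAX(login_time) AS eff_end
FROM logins
GROUP BY person, country, department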

EDIT

Including a hash column on Person, Country, and Department would make sense, correct? I'm thinking of using logic along the lines of:

SELECT t1.person, t1.country, t1.department
FROM table t1
JOIN table t2
  ON t1.person = t2.person  -- t2 being the prior record for the same person
WHERE t1.hash <> hash_function(t2.person, t2.country, t2.department)
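
Or could I skip the stored hash entirely and compare each row to the previous one with lag()? A rough, untested sketch (logins again standing in for the source table; Netezza's lead()/lag() analytic functions assumed):

SELECT person, country, department, login_time,
       CASE
         WHEN LAG(country || '|' || department) OVER (
                PARTITION BY person
                ORDER BY login_time
              ) = country || '|' || department
           THEN 0
         ELSE 1 -- differs from the previous row, or there is no previous row
       END AS changed
FROM logins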

Answer

create table so (
  person varchar(32)
  ,country varchar(32)
  ,department varchar(32)
  ,login_time date
) distribute on random;

insert into so values ('Bob','CANADA','Marketing','2009-01-01');
insert into so values ('Bob','CANADA','Marketing','2009-02-01');
insert into so values ('Bob','USA','Marketing','2009-03-01');
insert into so values ('Bob','USA','Sales','2009-04-01');
insert into so values ('Bob','MEX','Product','2009-05-01');
insert into so values ('Bob','MEX','Product','2009-06-01');
insert into so values ('Bob','MEX','Product','2009-07-01');
insert into so values ('Bob','CANADA','Marketing','2009-08-01');

/* ************************************************************************** */

with prm as ( --Create an ordinal primary key.
  select
    *
    ,row_number() over (
      partition by person
      order by login_time
    ) rwn
  from
    so
), chn as ( --Chain events to their previous and next event.
  select
    cur.rwn
    ,cur.person
    ,cur.country
    ,cur.department
    ,cur.login_time cur_login
    ,case
      when
        cur.country = prv.country
        and cur.department = prv.department
        then 1
      else 0
    end prv_equal
    ,case
      when
        (
          cur.country = nxt.country
          and cur.department = nxt.department
        ) or nxt.rwn is null --No next record should be equivalent to matching.
        then 1
      else 0
    end nxt_equal
    ,case prv_equal
      when 0 then cur_login
      else null
    end eff_login_start_sparse
    ,case
      when eff_login_start_sparse is null
        then max(eff_login_start_sparse) over (
          partition by cur.person
          order by rwn
          rows unbounded preceding --The secret sauce.
        )
      else eff_login_start_sparse
    end eff_login_start
    ,case nxt_equal
      when 0 then cur_login
      else null
    end eff_login_end
  from
    prm cur
    left outer join prm nxt on
      cur.person = nxt.person
      and cur.rwn + 1 = nxt.rwn
    left outer join prm prv on
      cur.person = prv.person
      and cur.rwn - 1 = prv.rwn
), grp as ( --Group by login starts.
  select
    person
    ,country
    ,department
    ,eff_login_start
    ,max(eff_login_end) eff_login_end
  from
    chn
  group by
    person
    ,country
    ,department
    ,eff_login_start
), led as ( --Change the effective end to be the next start, if desired.
  select
    person
    ,country
    ,department
    ,eff_login_start
    ,case
      when eff_login_end is null
        then null
      else
        lead(eff_login_start) over (
          partition by person
          order by eff_login_start
        )
    end eff_login_end
  from
    grp
)
select * from led order by eff_login_start;

This code returns the following table.

 PERSON | COUNTRY | DEPARTMENT | EFF_LOGIN_START | EFF_LOGIN_END
--------+---------+------------+-----------------+---------------
 Bob    | CANADA  | Marketing  | 2009-01-01      | 2009-03-01
 Bob    | USA     | Marketing  | 2009-03-01      | 2009-04-01
 Bob    | USA     | Sales      | 2009-04-01      | 2009-05-01
 Bob    | MEX     | Product    | 2009-05-01      | 2009-08-01
 Bob    | CANADA  | Marketing  | 2009-08-01      |

Explanation

I must have solved this four or five times in the past few years and keep neglecting to write it down formally. I'm glad to have the chance to do it, so this is a great question.

When attempting this, I like writing down the problem in matrix form. Here's the input, presuming that all values have the same key in the SCD.

 Cv | Ce
----|----
 A  | 10
 A  | 11
 B  | 14
 C  | 16
 D  | 18
 D  | 25
 D  | 34
 A  | 40

Where Cv is the value that we'll need to compare against (again, presuming that the key value for the SCD is equal in this data; we'll be partitioning over the key value the entire time, so it's irrelevant to the solution) and Ce is the event time.

First, we need an ordinal primary key. I've designated this Ck in the table. This will allow us to join the table to itself to get the previous and next events. I've called these columns Pk (previous key), Nk (next key), Pv, and Nv. A SQL sketch of this self-join follows the table below.

 Cv | Ce | Ck | Pk | Pv | Nk | Nv |
----|----|----|----|----|----|----|
 A  | 10 | 1  |    |    | 2  | A  |
 A  | 11 | 2  | 1  | A  | 3  | B  |
 B  | 14 | 3  | 2  | A  | 4  | C  |
 C  | 16 | 4  | 3  | B  | 5  | D  |
 D  | 18 | 5  | 4  | C  | 6  | D  |
 D  | 25 | 6  | 5  | D  | 7  | D  |
 D  | 34 | 7  | 6  | D  | 8  | A  |
 A  | 40 | 8  | 7  | D  |    |    |
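
In SQL, this self-join might look something like the sketch below, written in the abstract notation (t is a hypothetical two-column table of Cv and Ce; in the full query above it's the prm CTE joined to itself as cur, prv, and nxt):

with prm as ( --Generate the ordinal key Ck.
  select
    Cv
    ,Ce
    ,row_number() over (order by Ce) Ck
  from
    t
)
select
  cur.Cv
  ,cur.Ce
  ,cur.Ck
  ,prv.Ck Pk
  ,prv.Cv Pv
  ,nxt.Ck Nk
  ,nxt.Cv Nv
from
  prm cur
  left outer join prm prv on cur.Ck - 1 = prv.Ck
  left outer join prm nxt on cur.Ck + 1 = nxt.Ck
order by cur.Ck;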

Now we need some columns to see if we're at the beginning or end of a contiguous event block. I'll call these Pc and Nc, for contiguous. Pc is defined as Pv = Cv => true. 1 represents true and 0 represents false. Nc is defined similarly, except that the null case defaults to true (we'll see why in a minute). A sketch of these flags follows the table below.

 Cv | Ce | Ck | Pk | Pv | Nk | Nv | Pc | Nc |
----|----|----|----|----|----|----|----|----|
 A  | 10 | 1  |    |    | 2  | A  | 0  | 1  |
 A  | 11 | 2  | 1  | A  | 3  | B  | 1  | 0  |
 B  | 14 | 3  | 2  | A  | 4  | C  | 0  | 0  |
 C  | 16 | 4  | 3  | B  | 5  | D  | 0  | 0  |
 D  | 18 | 5  | 4  | C  | 6  | D  | 0  | 1  |
 D  | 25 | 6  | 5  | D  | 7  | D  | 1  | 1  |
 D  | 34 | 7  | 6  | D  | 8  | A  | 1  | 0  |
 A  | 40 | 8  | 7  | D  |    |    | 0  | 1  |
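
Pc and Nc are the prv_equal and nxt_equal columns in the full query. A sketch in the abstract notation, building on the self-join above (t is still the hypothetical Cv/Ce table):

with prm as (
  select Cv, Ce, row_number() over (order by Ce) Ck from t
)
select
  cur.Cv
  ,cur.Ce
  ,case
    when prv.Cv = cur.Cv then 1
    else 0
  end Pc
  ,case
    when nxt.Cv = cur.Cv or nxt.Ck is null --No next record counts as a match.
      then 1
    else 0
  end Nc
from
  prm cur
  left outer join prm prv on cur.Ck - 1 = prv.Ck
  left outer join prm nxt on cur.Ck + 1 = nxt.Ck
order by cur.Ck;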

Now you can start to see how the 1,1 combination of Pc,Nc is a completely useless record. We know this intuitively, since Bob's Mex/Product combination on the 6th row is pretty much useless information when building an SCD.

So let's get rid of the useless information. I'll add two new columns here: an almost-complete effective start time called Sn and an actually-complete effective end time called Ee. Sn is populated with Ce when Pc is 0, and Ee is populated with Ce when Nc is 0. A sketch of these two columns follows the table below.

 Cv | Ce | Ck | Pk | Pv | Nk | Nv | Pc | Nc | Sn | Ee |
----|----|----|----|----|----|----|----|----|----|----|
 A  | 10 | 1  |    |    | 2  | A  | 0  | 1  | 10 |    |
 A  | 11 | 2  | 1  | A  | 3  | B  | 1  | 0  |    | 11 |
 B  | 14 | 3  | 2  | A  | 4  | C  | 0  | 0  | 14 | 14 |
 C  | 16 | 4  | 3  | B  | 5  | D  | 0  | 0  | 16 | 16 |
 D  | 18 | 5  | 4  | C  | 6  | D  | 0  | 1  | 18 |    |
 D  | 25 | 6  | 5  | D  | 7  | D  | 1  | 1  |    |    |
 D  | 34 | 7  | 6  | D  | 8  | A  | 1  | 0  |    | 34 |
 A  | 40 | 8  | 7  | D  |    |    | 0  | 1  | 40 |    |
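
Sn and Ee are the eff_login_start_sparse and eff_login_end columns in the full query. A sketch in the abstract notation, layered on the Pc/Nc flags from the previous step:

with prm as (
  select Cv, Ce, row_number() over (order by Ce) Ck from t
), flg as ( --Pc and Nc from the previous step.
  select
    cur.Cv
    ,cur.Ce
    ,cur.Ck
    ,case when prv.Cv = cur.Cv then 1 else 0 end Pc
    ,case when nxt.Cv = cur.Cv or nxt.Ck is null then 1 else 0 end Nc
  from
    prm cur
    left outer join prm prv on cur.Ck - 1 = prv.Ck
    left outer join prm nxt on cur.Ck + 1 = nxt.Ck
)
select
  Cv
  ,Ce
  ,case Pc when 0 then Ce else null end Sn --Start of a contiguous block.
  ,case Nc when 0 then Ce else null end Ee --End of a contiguous block.
from
  flg
order by Ck;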

That table looks really close, but we still have the problem that we can't group by Cv (person/country/department). What we need is for Sn to populate all those nulls with the previous value of Sn. You could join this table to itself on rwn < rwn and get the maximum, but I'm going to be lazy and use Netezza's analytic functions and the rows unbounded preceding clause. It's a shortcut to the method I just described. So we're going to create another column called Es, effective start, defined as follows.

case
  when Sn is null
    then max(Sn) over (
      partition by k --key value of the SCD
      order by Ck
      rows unbounded preceding
    )
  else Sn
end Es

With that definition, we get this.

 Cv | Ce | Ck | Pk | Pv | Nk | Nv | Pc | Nc | Sn | Ee | Es |
----|----|----|----|----|----|----|----|----|----|----|----|
 A  | 10 | 1  |    |    | 2  | A  | 0  | 1  | 10 |    | 10 |
 A  | 11 | 2  | 1  | A  | 3  | B  | 1  | 0  |    | 11 | 10 |
 B  | 14 | 3  | 2  | A  | 4  | C  | 0  | 0  | 14 | 14 | 14 |
 C  | 16 | 4  | 3  | B  | 5  | D  | 0  | 0  | 16 | 16 | 16 |
 D  | 18 | 5  | 4  | C  | 6  | D  | 0  | 1  | 18 |    | 18 |
 D  | 25 | 6  | 5  | D  | 7  | D  | 1  | 1  |    |    | 18 |
 D  | 34 | 7  | 6  | D  | 8  | A  | 1  | 0  |    | 34 | 18 |
 A  | 40 | 8  | 7  | D  |    |    | 0  | 1  | 40 |    | 40 |

The rest is trivial. Group by Es and grab the max of Ee to obtain this table (a sketch of this grouping step follows the table).

 Cv | Es | Ee |
----|----|----|
 A  | 10 | 11 |
 B  | 14 | 14 |
 C  | 16 | 16 |
 D  | 18 | 34 |
 A  | 40 |    |
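
In the abstract notation that grouping step might look like this (chained is a stand-in for the result of the previous steps, with Es already gap-filled; in the full query this is the grp CTE):

select
  Cv
  ,Es
  ,max(Ee) Ee
from
  chained
group by
  Cv
  ,Es
order by Es;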

If you want to populate the effective end time with the next start, join the table again to itself or use the lead() window function to grab it.
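
In the abstract notation the lead() variant might look like this (grouped is a stand-in for the grouped table above; in the full query this is the led CTE):

select
  Cv
  ,Es
  ,case
    when Ee is null then null --The still-open current record stays open.
    else lead(Es) over (order by Es)
  end Ee
from
  grouped
order by Es;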
