How do I select columns whenever they change?

Question

Data could look like this:

Person | Country | Department | Login Time
------------------------------------------
Bob    | CANADA  | Marketing  | 2009-01-01
Bob    | CANADA  | Marketing  | 2009-02-01
Bob    | USA     | Marketing  | 2009-03-01
Bob    | USA     | Sales      | 2009-04-01
Bob    | MEX     | Product    | 2009-05-01
Bob    | MEX     | Product    | 2009-06-01
Bob    | MEX     | Product    | 2009-07-01
Bob    | CANADA  | Marketing  | 2009-08-01

What I want in the Type 2 dimension would look like this:

Person | Country | Department | Eff Start time | Eff End Time
------------------------------------------------------------------
Bob    | CANADA  | Marketing  | 2009-01-01     | 2009-03-01
Bob    | USA     | Marketing  | 2009-03-01     | 2009-04-01
Bob    | USA     | Sales      | 2009-04-01     | 2009-05-01
Bob    | MEX     | Product    | 2009-05-01     | 2009-08-01
Bob    | CANADA  | Marketing  | 2009-08-01     | NULL

Assume that Bob's Country and Department haven't changed since 2009-08-01, so the effective end time of the last row is left as NULL.

What function would work best here? This is on Netezza, which uses a flavor of Postgres.

Obviously GROUP BY would not work here, because the same grouping can recur later on (I added Bob | CANADA | Marketing as the last row to show this).

EDIT

Would it make sense to include a hash column over Person, Country, and Department? I'm thinking of logic along these lines:

SELECT PERSON, COUNTRY, DEPARTMENT
FROM table t1
where 
    person = person 
    AND t1.hash <> hash_function(person, country, department)

Answer 1

Answer

create table so (
  person varchar(32)
  ,country varchar(32)
  ,department varchar(32)
  ,login_time date
) distribute on random;

insert into so values ('Bob','CANADA','Marketing','2009-01-01');
insert into so values ('Bob','CANADA','Marketing','2009-02-01');
insert into so values ('Bob','USA','Marketing','2009-03-01');
insert into so values ('Bob','USA','Sales','2009-04-01');
insert into so values ('Bob','MEX','Product','2009-05-01');
insert into so values ('Bob','MEX','Product','2009-06-01');
insert into so values ('Bob','MEX','Product','2009-07-01');
insert into so values ('Bob','CANADA','Marketing','2009-08-01');

/* ************************************************************************** */

with prm as ( --Create an ordinal primary key.
  select
    *
    ,row_number() over (
      partition by person
      order by login_time
    ) rwn
  from
    so
), chn as ( --Chain events to their previous and next event.
  select
    cur.rwn
    ,cur.person
    ,cur.country
    ,cur.department
    ,cur.login_time cur_login
    ,case
      when
        cur.country = prv.country
        and cur.department = prv.department
        then 1
      else 0
    end prv_equal
    ,case
      when
        (
          cur.country = nxt.country
          and cur.department = nxt.department
        ) or nxt.rwn is null --No next record should be equivalent to matching.
        then 1
      else 0
    end nxt_equal
    ,case prv_equal
      when 0 then cur_login
      else null
    end eff_login_start_sparse
    ,case
      when eff_login_start_sparse is null
        then max(eff_login_start_sparse) over (
          partition by cur.person
          order by cur.rwn
          rows unbounded preceding --The secret sauce.
        )
      else eff_login_start_sparse
    end eff_login_start
    ,case nxt_equal
      when 0 then cur_login
      else null
    end eff_login_end
  from
    prm cur
    left outer join prm nxt on
      cur.person = nxt.person
      and cur.rwn + 1 = nxt.rwn
    left outer join prm prv on
      cur.person = prv.person
      and cur.rwn - 1 = prv.rwn
), grp as ( --Group by login starts.
  select
    person
    ,country
    ,department
    ,eff_login_start
    ,max(eff_login_end) eff_login_end
  from
    chn
  group by
    person
    ,country
    ,department
    ,eff_login_start
), led as ( --Change the effective end to be the next start, if desired.
  select
    person
    ,country
    ,department
    ,eff_login_start
    ,case
      when eff_login_end is null
        then null
      else
        lead(eff_login_start) over (
          partition by person
          order by eff_login_start
        )
    end eff_login_end
  from
    grp
)
select * from led order by eff_login_start;

This code returns the following table.

 PERSON | COUNTRY | DEPARTMENT | EFF_LOGIN_START | EFF_LOGIN_END
--------+---------+------------+-----------------+---------------
 Bob    | CANADA  | Marketing  | 2009-01-01      | 2009-03-01
 Bob    | USA     | Marketing  | 2009-03-01      | 2009-04-01
 Bob    | USA     | Sales      | 2009-04-01      | 2009-05-01
 Bob    | MEX     | Product    | 2009-05-01      | 2009-08-01
 Bob    | CANADA  | Marketing  | 2009-08-01      |

Explanation

I must have solved this four or five times in the past few years and keep neglecting to write it down formally. I'm glad to have the chance to do it, so this is a great question.

When attempting this, I like writing down the problem in matrix form. Here's the input, presuming that all values have the same key in the SCD.

 Cv | Ce
----|----
 A  | 10
 A  | 11
 B  | 14
 C  | 16
 D  | 18
 D  | 25
 D  | 34
 A  | 40

Here Cv is the value that we'll need to compare against and Ce is the event time. (Again, this presumes that the key value for the SCD is the same for every row in this data; we'll be partitioning over the key value the entire time, so it's irrelevant to the solution.)

First, we need an ordinal primary key. I've designated this Ck in the table. This will allow us to join the table to itself to get the previous and next events. I've called these columns Pk (previous key), Nk (next key), Pv, and Nv.

 Cv | Ce | Ck | Pk | Pv | Nk | Nv |
----|----|----|----|----|----|----|
 A  | 10 | 1  |    |    | 2  | A  |
 A  | 11 | 2  | 1  | A  | 3  | B  |
 B  | 14 | 3  | 2  | A  | 4  | C  |
 C  | 16 | 4  | 3  | B  | 5  | D  |
 D  | 18 | 5  | 4  | C  | 6  | D  |
 D  | 25 | 6  | 5  | D  | 7  | D  |
 D  | 34 | 7  | 6  | D  | 8  | A  |
 A  | 40 | 8  | 7  | D  |    |    |
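
As an aside, the Pv and Nv columns can also be produced without a self-join. The following is only a minimal sketch, assuming the so table from the Answer section and assuming lag() and lead() are available on your Netezza release. The aliases cv, ce, ck, pv and nv and the '|' separator are illustrative choices, not part of the original answer.

```sql
--Sketch only: previous and next comparison values via lag()/lead().
--Assumes login_time is unique per person, so the ordering is unambiguous.
--Assumes '|' never occurs in country or department (hypothetical separator).
select
  person
  ,country || '|' || department as cv
  ,login_time as ce
  ,row_number() over (
    partition by person
    order by login_time
  ) as ck
  ,lag(country || '|' || department) over (
    partition by person
    order by login_time
  ) as pv --Null on the first row of each person.
  ,lead(country || '|' || department) over (
    partition by person
    order by login_time
  ) as nv --Null on the last row of each person.
from
  so
order by
  person
  ,login_time;
```

Pk and Nk are not selected here because they are simply ck - 1 and ck + 1 wherever a neighbouring row exists.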

Now we need some columns to tell whether we're at the beginning or end of a contiguous event block. I'll call these Pc and Nc, for contiguous. Pc is 1 (true) when Pv = Cv and 0 (false) otherwise, including when there is no previous row. Nc is defined the same way against Nv, except that a missing next row defaults to 1 (we'll see why in a minute).

 Cv | Ce | Ck | Pk | Pv | Nk | Nv | Pc | Nc |
----|----|----|----|----|----|----|----|----|
 A  | 10 | 1  |    |    | 2  | A  | 0  | 1  |
 A  | 11 | 2  | 1  | A  | 3  | B  | 1  | 0  |
 B  | 14 | 3  | 2  | A  | 4  | C  | 0  | 0  |
 C  | 16 | 4  | 3  | B  | 5  | D  | 0  | 0  |
 D  | 18 | 5  | 4  | C  | 6  | D  | 0  | 1  |
 D  | 25 | 6  | 5  | D  | 7  | D  | 1  | 1  |
 D  | 34 | 7  | 6  | D  | 8  | A  | 1  | 0  |
 A  | 40 | 8  | 7  | D  |    |    | 0  | 1  |
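
For the Bob data, the two flags might be computed as below. This is a hedged sketch rather than the answer's own code: it swaps the self-joins for lag() and lead(), assumes the so table from the Answer section, and assumes country and department are never null (a null would compare as "changed"). The aliases pc and nc are illustrative.

```sql
--Sketch only: contiguity flags against the previous and next row.
select
  person
  ,country
  ,department
  ,login_time
  ,case
    when country = lag(country) over (partition by person order by login_time)
      and department = lag(department) over (partition by person order by login_time)
      then 1
    else 0 --Also covers the first row, where lag() returns null.
  end as pc
  ,case
    when lead(login_time) over (partition by person order by login_time) is null
      then 1 --No next record counts as contiguous.
    when country = lead(country) over (partition by person order by login_time)
      and department = lead(department) over (partition by person order by login_time)
      then 1
    else 0
  end as nc
from
  so
order by
  person
  ,login_time;
```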

Now you can start to see that a row with the 1,1 combination of Pc,Nc carries no information. We know this intuitively: Bob's MEX/Product login on the 6th row falls in the middle of a block, so it tells us nothing when building an SCD.

So let's get rid of the useless information. I'll add two new columns here: an almost-complete effective start time called Sn and an actually-complete effective end time called Ee. Sn is populated with Ce when Pc is 0, and Ee is populated with Ce when Nc is 0.

 Cv | Ce | Ck | Pk | Pv | Nk | Nv | Pc | Nc | Sn | Ee |
----|----|----|----|----|----|----|----|----|----|----|
 A  | 10 | 1  |    |    | 2  | A  | 0  | 1  | 10 |    |
 A  | 11 | 2  | 1  | A  | 3  | B  | 1  | 0  |    | 11 |
 B  | 14 | 3  | 2  | A  | 4  | C  | 0  | 0  | 14 | 14 |
 C  | 16 | 4  | 3  | B  | 5  | D  | 0  | 0  | 16 | 16 |
 D  | 18 | 5  | 4  | C  | 6  | D  | 0  | 1  | 18 |    |
 D  | 25 | 6  | 5  | D  | 7  | D  | 1  | 1  |    |    |
 D  | 34 | 7  | 6  | D  | 8  | A  | 1  | 0  |    | 34 |
 A  | 40 | 8  | 7  | D  |    |    | 0  | 1  | 40 |    |

This looks really close, but we still can't group by Cv (person/country/department), because the same value can start more than one block. What we need is for the nulls in Sn to be filled with the most recent non-null value of Sn. You could join this table to itself on every earlier rwn and take the maximum, but I'm going to be lazy and use Netezza's analytic functions with the rows unbounded preceding clause, which is a shortcut for the method I just described. So we're going to create another column called Es, effective start, defined as follows.

case
  when Sn is null
    then max(Sn) over (
      partition by k --key value of the SCD
      order by Ck
      rows unbounded preceding
    )
  else Sn
end Es

With that definition, we get this.

 Cv | Ce | Ck | Pk | Pv | Nk | Nv | Pc | Nc | Sn | Ee | Es |
----|----|----|----|----|----|----|----|----|----|----|----|
 A  | 10 | 1  |    |    | 2  | A  | 0  | 1  | 10 |    | 10 |
 A  | 11 | 2  | 1  | A  | 3  | B  | 1  | 0  |    | 11 | 10 |
 B  | 14 | 3  | 2  | A  | 4  | C  | 0  | 0  | 14 | 14 | 14 |
 C  | 16 | 4  | 3  | B  | 5  | D  | 0  | 0  | 16 | 16 | 16 |
 D  | 18 | 5  | 4  | C  | 6  | D  | 0  | 1  | 18 |    | 18 |
 D  | 25 | 6  | 5  | D  | 7  | D  | 1  | 1  |    |    | 18 |
 D  | 34 | 7  | 6  | D  | 8  | A  | 1  | 0  |    | 34 | 18 |
 A  | 40 | 8  | 7  | D  |    |    | 0  | 1  | 40 |    | 40 |
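
For comparison, the self-join alternative mentioned above could look roughly like this. It is a minimal illustration, not the answer's method: it assumes the so table from the Answer section, assumes lag() is available, and joins on prv.ck <= cur.ck (rather than strictly less than) so that a row's own Sn is included and no case expression is needed around the max. The names flg, ck, sn and es are illustrative.

```sql
--Sketch only: fill the sparse start (sn) by self-join instead of a window frame.
with flg as (
  select
    person
    ,login_time
    ,row_number() over (
      partition by person
      order by login_time
    ) as ck
    ,case
      when country = lag(country) over (partition by person order by login_time)
        and department = lag(department) over (partition by person order by login_time)
        then null --Same block as the previous row: no new start.
      else login_time
    end as sn
  from
    so
)
select
  cur.person
  ,cur.ck
  ,cur.sn
  ,max(prv.sn) as es --Latest start at or before this row; max() skips nulls.
from
  flg cur
  join flg prv on
    cur.person = prv.person
    and prv.ck <= cur.ck
group by
  cur.person
  ,cur.ck
  ,cur.sn
order by
  cur.person
  ,cur.ck;
```

Taking the maximum works only because login times increase with ck, so the largest non-null Sn seen so far is always the most recent one. The join compares every row with all earlier rows of the same person, which is why the window version is usually the cheaper choice.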

The rest is trivial. Group by Cv and Es and grab the max of Ee to obtain this table.

 Cv | Es | Ee |
----|----|----|
 A  | 10 | 11 |
 B  | 14 | 14 |
 C  | 16 | 16 |
 D  | 18 | 34 |
 A  | 40 |    |

If you want to populate the effective end time with the next start, join the table again to itself or use the lead() window function to grab it.
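
If only the start-to-next-start form is needed, the whole pipeline can be condensed. The sketch below uses a different technique from the answer (a "gaps and islands" running sum over a change flag), so treat it as an assumption-laden alternative rather than the method described above. It assumes the so table from the Answer section, that lag() and lead() are available, that login_time is unique per person, and that country and department are never null. The names flg, blk, is_start, blk_id, eff_start and eff_end are illustrative.

```sql
--Sketch only: number each contiguous block, then take its first login.
with flg as ( --Flag the first row of every contiguous block.
  select
    person
    ,country
    ,department
    ,login_time
    ,case
      when country = lag(country) over (partition by person order by login_time)
        and department = lag(department) over (partition by person order by login_time)
        then 0
      else 1
    end as is_start
  from
    so
), blk as ( --A running sum of the flags gives each block its own id.
  select
    person
    ,country
    ,department
    ,login_time
    ,sum(is_start) over (
      partition by person
      order by login_time
      rows unbounded preceding
    ) as blk_id
  from
    flg
), grp as ( --One row per block, starting at its earliest login.
  select
    person
    ,country
    ,department
    ,min(login_time) as eff_start
  from
    blk
  group by
    person
    ,country
    ,department
    ,blk_id
)
select
  person
  ,country
  ,department
  ,eff_start
  ,lead(eff_start) over (
    partition by person
    order by eff_start
  ) as eff_end --Null for the block that is still current.
from
  grp
order by
  person
  ,eff_start;
```

On the sample data this should give the same five rows as the table in the Answer section. Unlike the answer's query, it never computes the last login inside each block, so it is not a substitute if you need the Ee values shown above.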

solution1
1 2016-03-24 15:53:06
