
Partition by consecutive dates

I have a table with three columns: X, Y and Z, where X is the identifier. I want a row number partitioned by column Y that only keeps counting while the dates in Z are consecutive, and restarts at 1 after a gap. For example, I have this table:

   X    Y   Z 
   A    1   1-jan
   A    1   2-jan
   A    1   3-jan
   B    3   1-jan
   B    3   2-jan
   A    1   5-jan

The result should look like this:

   X    Y   Z      rn
   A    1   1-jan  1
   A    1   2-jan  2
   A    1   3-jan  3
   B    3   1-jan  1
   B    3   2-jan  2
   A    1   5-jan  1

The code I am using right now:

  select X, Y, Z, ROW_NUMBER() over (partition by Y order by Z) as rn

I am getting this as my result (This is not the result I want):

   X    Y   Z      rn
   A    1   1-jan  1
   A    1   2-jan  2
   A    1   3-jan  3
   B    3   1-jan  1
   B    3   2-jan  2
   A    1   5-jan  4  <---- 4-jan is missing, so this should not be row 4. It should restart as row 1

  

You first need to create data that can be used to partition your table.

The below uses LAG() to determine if a row is a "new partition", then SUM() OVER () to propagate that flag forward and make a "partition id", then finally uses ROW_NUMBER() with that identifier.

WITH
  gap_marker AS
(
  SELECT
    yourTable.*,
    IIF(
      LAG(z) OVER (PARTITION BY y ORDER BY z)
      =
      DATEADD(day, -1, z), 
      0,
      1
    )
      AS new_date_range
  FROM
    yourTable
), 
  date_range_partition AS
(
  SELECT
    gap_marker.*,
    SUM(new_date_range) OVER (PARTITION BY y ORDER BY z)   AS date_range_id
  FROM
    gap_marker
)
SELECT
  x, y, z,
  ROW_NUMBER() OVER (PARTITION BY y, date_range_id ORDER BY z)   AS rn
FROM
  date_range_partition
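
To see how the flag turns into a partition id, it can help to look at the intermediate columns. The sketch below inlines the question's sample rows as a CTE named yourTable (the year 2022 and the date type are assumptions, since the question only shows "1-jan" style values) and selects new_date_range next to its running sum.

```sql
-- Sample rows from the question; the year 2022 is an assumption.
WITH
  yourTable AS
(
  SELECT x, y, CAST(z AS date) AS z
  FROM (VALUES
    ('A', 1, '2022-01-01'), ('A', 1, '2022-01-02'), ('A', 1, '2022-01-03'),
    ('B', 3, '2022-01-01'), ('B', 3, '2022-01-02'), ('A', 1, '2022-01-05')
  ) AS v(x, y, z)
),
  gap_marker AS
(
  SELECT
    yourTable.*,
    -- 1 when the previous date is missing or is not the day before, else 0
    IIF(LAG(z) OVER (PARTITION BY y ORDER BY z) = DATEADD(day, -1, z), 0, 1) AS new_date_range
  FROM yourTable
)
SELECT
  x, y, z,
  new_date_range,
  -- running total of the flags: increases by 1 at the start of every run
  SUM(new_date_range) OVER (PARTITION BY y ORDER BY z) AS date_range_id
FROM gap_marker
ORDER BY y, z;

-- x  y  z           new_date_range  date_range_id
-- A  1  2022-01-01  1               1
-- A  1  2022-01-02  0               1
-- A  1  2022-01-03  0               1
-- A  1  2022-01-05  1               2
-- B  3  2022-01-01  1               1
-- B  3  2022-01-02  0               1
```

Rows that share the same (y, date_range_id) pair form one run of consecutive dates, which is why ROW_NUMBER() restarts at 1 for 2022-01-05.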

Alternatively, you could calculate an amount to deduct from the current rn, so that it resets to 1 whenever a date is skipped.

WITH
  enumerated AS
(
  SELECT
    yourTable.*,
    ROW_NUMBER() OVER (PARTITION BY y ORDER BY z)   AS rn,
    DATEDIFF(
      day,
      LAG(z) OVER (PARTITION BY y ORDER BY z),
      z
    )
      AS delta
  FROM
    yourTable
)
SELECT
  x, y, z,
  rn - MAX(IIF(delta = 1, 0, rn - 1)) OVER (PARTITION BY y ORDER BY z) AS rn
FROM
  enumerated
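
The deduction is easier to follow when it is shown as its own column. This is a sketch on the question's sample rows (again assuming dates in 2022); the column name deduction is only used here for illustration.

```sql
-- Sample rows from the question; the year 2022 is an assumption.
WITH
  yourTable AS
(
  SELECT x, y, CAST(z AS date) AS z
  FROM (VALUES
    ('A', 1, '2022-01-01'), ('A', 1, '2022-01-02'), ('A', 1, '2022-01-03'),
    ('B', 3, '2022-01-01'), ('B', 3, '2022-01-02'), ('A', 1, '2022-01-05')
  ) AS v(x, y, z)
),
  enumerated AS
(
  SELECT
    yourTable.*,
    ROW_NUMBER() OVER (PARTITION BY y ORDER BY z) AS rn,
    DATEDIFF(day, LAG(z) OVER (PARTITION BY y ORDER BY z), z) AS delta
  FROM yourTable
)
SELECT
  x, y, z, rn, delta,
  -- rn - 1 at the start of each run, carried forward as a running maximum
  MAX(IIF(delta = 1, 0, rn - 1)) OVER (PARTITION BY y ORDER BY z) AS deduction
FROM enumerated
ORDER BY y, z;

-- x  y  z           rn  delta  deduction
-- A  1  2022-01-01  1   NULL   0
-- A  1  2022-01-02  2   1      0
-- A  1  2022-01-03  3   1      0
-- A  1  2022-01-05  4   2      3
-- B  3  2022-01-01  1   NULL   0
-- B  3  2022-01-02  2   1      0
```

Subtracting deduction from rn gives 1, 2, 3, 1 for y = 1 and 1, 2 for y = 3.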

Finally, you could use DATEDIFF() if your rows are always whole days apart. Window functions can be used to work out what you should compare the current row against, and avoid ROW_NUMBER() altogether.

WITH
  check_previous AS
(
  SELECT
    yourTable.*,
    IIF(
      LAG(z) OVER (PARTITION BY y ORDER BY z)
      =
      DATEADD(day, -1, z), 
      NULL,
      z
    )
      AS new_base_date
  FROM
    yourTable
)
SELECT
  x, y, z,
  DATEDIFF(
    day,
    MAX(new_base_date) OVER (PARTITION BY y ORDER BY z),
    z
  ) + 1
    AS rn
FROM
  check_previous

Demo of all three: https://dbfiddle.uk/K8x8gOqh

Supposing that column Z is a date column, you could try the following:

SELECT X, Y, Z,
  ROW_NUMBER() OVER (PARTITION BY X, GRP ORDER BY Z) AS RN
FROM
(
  SELECT *,
    DATEDIFF(DAY, ROW_NUMBER() OVER (PARTITION BY X ORDER BY Z), Z) AS GRP
  FROM table_name
) T
ORDER BY X, Z
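
The grouping works because, within a run of consecutive dates, the date and the row number both increase by one per row, so their difference stays constant. The sketch below shows the same idea in a slightly different form from the query above: it subtracts the row number from the date with DATEADD(), which gives a readable anchor date per run. The column name grp_anchor, the inlined sample rows and the year 2022 are assumptions for illustration.

```sql
-- Sample rows from the question; the year 2022 is an assumption.
WITH
  table_name AS
(
  SELECT x, y, CAST(z AS date) AS z
  FROM (VALUES
    ('A', 1, '2022-01-01'), ('A', 1, '2022-01-02'), ('A', 1, '2022-01-03'),
    ('B', 3, '2022-01-01'), ('B', 3, '2022-01-02'), ('A', 1, '2022-01-05')
  ) AS v(x, y, z)
)
SELECT
  x, y, z,
  ROW_NUMBER() OVER (PARTITION BY x ORDER BY z) AS seq,
  -- date minus row number: identical for every row of one consecutive run
  DATEADD(day, -1 * ROW_NUMBER() OVER (PARTITION BY x ORDER BY z), z) AS grp_anchor
FROM table_name
ORDER BY x, z;

-- x  y  z           seq  grp_anchor
-- A  1  2022-01-01  1    2021-12-31
-- A  1  2022-01-02  2    2021-12-31
-- A  1  2022-01-03  3    2021-12-31
-- A  1  2022-01-05  4    2022-01-01
-- B  3  2022-01-01  1    2021-12-31
-- B  3  2022-01-02  2    2021-12-31
```

The anchor changes only at 2022-01-05, where a day was skipped, so partitioning by x and the anchor restarts the numbering there.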

If the datatype of column Z is not date, you can generate the groups of consecutive values as follows:

SELECT X, Y, Z,
  ROW_NUMBER() OVER (PARTITION BY X, GRP ORDER BY Z) AS RN
FROM
(
  SELECT *,
    CAST(SUBSTRING(Z, 0, CHARINDEX('-', Z)) AS INT) - 
     ROW_NUMBER() OVER (PARTITION BY X ORDER BY SUBSTRING(Z, CHARINDEX('-', Z)+1, LEN(Z)), CAST(SUBSTRING(Z, 0, CHARINDEX('-', Z)) AS INT)) AS GRP
  FROM table_name2
) T
ORDER BY X, MONTH(SUBSTRING(Z, CHARINDEX('-', Z)+1, LEN(Z))+' 1 1'), CAST(SUBSTRING(Z, 0, CHARINDEX('-', Z)) AS INT)

See a demo.

I solved this problem using PostgreSQL. Extract the logic and convert it into your SQL dialect.

DDL and sample data:

create table demo
(
x varchar(10) not null,
y int not null,
z date);

insert into demo(x,y,z) values
('A',1,'2022-01-01'),
('A',1,'2022-01-02'),
('A',1,'2022-01-03'),
('B',3,'2022-01-01'),
('B',3,'2022-01-02'),
('A',1,'2022-01-05');

query:

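with staging_data as (
select x,y,z,
-- days since the previous date for the same x,y; the first row falls back to z-1, giving diff = 1
z - coalesce(lag(z) over(partition by x,y order by z),z-1::INT) as diff
from demo)
,grouped_data as (
select x,y,z,
-- running count of gaps (diff <> 1) gives each run of consecutive dates its own id
sum(case when diff = 1 then 0 else 1 end) over(partition by x,y order by z) as grp
from staging_data)
select
x,y,z,row_number() over(partition by x,y,grp order by z)
from grouped_data
order by x,z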

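For SQL Server, replace z-1::INT with DATEADD(day, -1, z), and replace the date subtraction (z - ...) with DATEDIFF(day, ..., z), because SQL Server does not allow the minus operator on date values. Hopefully these changes will make it work in SQL Server.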

output:

x|y|z         |row_number|
-+-+----------+----------+
A|1|2022-01-01|         1|
A|1|2022-01-02|         2|
A|1|2022-01-03|         3|
A|1|2022-01-05|         1|
B|3|2022-01-01|         1|
B|3|2022-01-02|         2|
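
As a rough sketch of how the diff step might be translated to SQL Server, the query below uses DATEADD() for the fallback value and DATEDIFF() for the subtraction. It is not part of the original answer: the demo rows are inlined as a CTE, and the ORDER BY z inside LAG() is added because SQL Server requires it.

```sql
-- Same rows as the demo table above, inlined so the query runs on its own.
WITH
  demo AS
(
  SELECT x, y, CAST(z AS date) AS z
  FROM (VALUES
    ('A', 1, '2022-01-01'), ('A', 1, '2022-01-02'), ('A', 1, '2022-01-03'),
    ('B', 3, '2022-01-01'), ('B', 3, '2022-01-02'), ('A', 1, '2022-01-05')
  ) AS v(x, y, z)
)
SELECT
  x, y, z,
  -- days since the previous date; the first row of each (x, y) falls back to z - 1 day
  DATEDIFF(
    day,
    COALESCE(LAG(z) OVER (PARTITION BY x, y ORDER BY z), DATEADD(day, -1, z)),
    z
  ) AS diff
FROM demo
ORDER BY x, z;

-- x  y  z           diff
-- A  1  2022-01-01  1
-- A  1  2022-01-02  1
-- A  1  2022-01-03  1
-- A  1  2022-01-05  2
-- B  3  2022-01-01  1
-- B  3  2022-01-02  1
```

A diff other than 1 marks the first row after a gap.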
