RANK() OVER PARTITION with RANK resetting

How can I get a RANK that restarts at partition change? I have this table:

ID    Date        Value  
1     2015-01-01  1  
2     2015-01-02  1 <redundant  
3     2015-01-03  2  
4     2015-01-05  2 <redundant  
5     2015-01-06  1  
6     2015-01-08  1 <redundant  
7     2015-01-09  1 <redundant  
8     2015-01-10  2  
9     2015-01-11  3  
10    2015-01-12  3 <redundant  

and I'm trying to delete all the rows where the Value has not changed from the previous entry (marked with <redundant). I've tried using cursors, but that takes too long, as the table has ~50 million rows.

I've also tried using RANK:

SELECT ID, Date, Value,
RANK() over(partition by Value order by Date ASC) Rank
FROM DataLogging 
ORDER BY Date ASC 

but I get:

ID    Date        Value  Rank   (Rank)
1     2015-01-01  1      1      (1)
2     2015-01-02  1      2      (2)
3     2015-01-03  2      1      (1)
4     2015-01-05  2      2      (2)
5     2015-01-06  1      3      (1)
6     2015-01-08  1      4      (2)
7     2015-01-09  1      5      (3)
8     2015-01-10  2      3      (1)
9     2015-01-11  3      1      (1)
10    2015-01-12  3      2      (2)

In parentheses is the Rank I want, so that I can keep the rows with Rank = 1 and delete the rest of the rows.

EDIT: I've accepted the answer that seemed the easiest to write, but unfortunately none of the answers runs fast enough for deleting the rows. In the end I've decided to use the CURSOR after all. I've split the data into chunks of about 250k rows. The cursor runs through a batch and deletes its rows in ~11 minutes, while the answers below, with DELETE, take ~35 minutes per batch of 250k rows.

Here is a somewhat convoluted way to do it:

WITH CTE AS
(
    SELECT  *, 
            ROW_NUMBER() OVER(ORDER BY [Date]) RN1,
            ROW_NUMBER() OVER(PARTITION BY Value ORDER BY [Date]) RN2
    FROM dbo.YourTable
), CTE2 AS
(
    SELECT *, ROW_NUMBER() OVER(PARTITION BY Value, RN1 - RN2 ORDER BY [Date]) N
    FROM CTE
)
SELECT *
FROM CTE2
ORDER BY ID;

The results are:

╔════╦════════════╦═══════╦═════╦═════╦═══╗
║ ID ║    Date    ║ Value ║ RN1 ║ RN2 ║ N ║
╠════╬════════════╬═══════╬═════╬═════╬═══╣
║  1 ║ 2015-01-01 ║     1 ║   1 ║   1 ║ 1 ║
║  2 ║ 2015-01-02 ║     1 ║   2 ║   2 ║ 2 ║
║  3 ║ 2015-01-03 ║     2 ║   3 ║   1 ║ 1 ║
║  4 ║ 2015-01-05 ║     2 ║   4 ║   2 ║ 2 ║
║  5 ║ 2015-01-06 ║     1 ║   5 ║   3 ║ 1 ║
║  6 ║ 2015-01-08 ║     1 ║   6 ║   4 ║ 2 ║
║  7 ║ 2015-01-09 ║     1 ║   7 ║   5 ║ 3 ║
║  8 ║ 2015-01-10 ║     2 ║   8 ║   3 ║ 1 ║
║  9 ║ 2015-01-11 ║     3 ║   9 ║   1 ║ 1 ║
║ 10 ║ 2015-01-12 ║     3 ║  10 ║   2 ║ 2 ║
╚════╩════════════╩═══════╩═════╩═════╩═══╝

To delete the rows you don't want, keep the CTE definitions and replace the final SELECT with:

DELETE FROM CTE2
WHERE N > 1;
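Why the RN1 - RN2 trick works may be easier to see outside SQL. The following is a minimal Python sketch, not the answer author's code: it recomputes RN1, RN2, and N for the sample rows from the question. The function name `island_row_numbers` and the tuple layout are assumptions made for illustration. Within one unbroken run of equal values, both row numbers increase by 1 per row, so their difference stays constant and identifies the run.

```python
# Sample (id, date, value) rows copied from the question.
rows = [
    (1, "2015-01-01", 1), (2, "2015-01-02", 1), (3, "2015-01-03", 2),
    (4, "2015-01-05", 2), (5, "2015-01-06", 1), (6, "2015-01-08", 1),
    (7, "2015-01-09", 1), (8, "2015-01-10", 2), (9, "2015-01-11", 3),
    (10, "2015-01-12", 3),
]


def island_row_numbers(rows):
    """Return {id: N}, where N restarts at 1 whenever the value changes."""
    ordered = sorted(rows, key=lambda r: r[1])  # ORDER BY [Date]
    seen_per_value = {}   # running count per value       -> RN2
    seen_per_island = {}  # running count per (value, RN1 - RN2) -> N
    result = {}
    for rn1, (row_id, _date, value) in enumerate(ordered, start=1):
        rn2 = seen_per_value[value] = seen_per_value.get(value, 0) + 1
        # RN1 - RN2 is constant inside a run of equal consecutive values.
        island = (value, rn1 - rn2)
        n = seen_per_island[island] = seen_per_island.get(island, 0) + 1
        result[row_id] = n
    return result


n_by_id = island_row_numbers(rows)
# Rows with N > 1 are the ones the DELETE above would remove.
redundant_ids = sorted(i for i, n in n_by_id.items() if n > 1)
print(redundant_ids)
```

The ids printed here correspond to the rows marked <redundant in the question.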

If you want to delete the rows, I would suggest you use lag():

with todelete as (
      select t.*, lag(value) over (order by date) as prev_value
      from t
     )
delete from todelete
    where value = prev_value;

I'm not quite sure what rank() has to do with the problem.

EDIT:

To see the rows that are not deleted, using the same logic:

with todelete as (
      select t.*, lag(value) over (order by date) as prev_value
      from t
     )
select *
from todelete
where value <> prev_value or prev_value is null;

The where clause is just the inverse of the where clause in the first query, taking NULL values into account.
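The lag() logic can be tried on the question's sample data with a small script. This is a sketch under stated assumptions, not the answer's T-SQL: it uses Python's built-in sqlite3 module with an in-memory database, and it assumes the bundled SQLite is version 3.25 or newer (the first release with window functions). SQLite does not delete through a CTE the way the T-SQL above does, so the sketch deletes by id with an IN subquery instead. The table name `t` follows the answer; the column layout follows the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, date TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        (1, "2015-01-01", 1), (2, "2015-01-02", 1), (3, "2015-01-03", 2),
        (4, "2015-01-05", 2), (5, "2015-01-06", 1), (6, "2015-01-08", 1),
        (7, "2015-01-09", 1), (8, "2015-01-10", 2), (9, "2015-01-11", 3),
        (10, "2015-01-12", 3),
    ],
)

# Ids of rows whose value equals the previous row's value (ordered by date).
# The first row has prev_value NULL, so "value = prev_value" never matches it.
redundant_sql = """
    SELECT id
    FROM (SELECT id, value, LAG(value) OVER (ORDER BY date) AS prev_value
          FROM t)
    WHERE value = prev_value
"""

conn.execute(f"DELETE FROM t WHERE id IN ({redundant_sql})")
remaining = [row[0] for row in conn.execute("SELECT id FROM t ORDER BY date")]
print(remaining)
```

Only the first row of each run of equal values should remain. Nothing here says anything about performance on 50 million rows; it only checks the logic.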

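select * 
from  ( -- lag(Value) returns NULL for the first row; a default of 0 would
        -- wrongly hide a first row whose Value is 0
        select ID, Date, Value, lag(Value) over (order by ID) as ValueLag 
        from DataLogging ) tt
where ValueLag is null or ValueLag <> Value  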

If the order is by Date, use over (order by Date) instead.

This should show you the good and the bad rows. It is based on ID; if you need Date, revise accordingly.
It may look like a long way around, but it should be pretty efficient.

declare @tt table  (id tinyint, val tinyint);
insert into @tt values 
( 1, 1),
( 2, 1),
( 3, 2),
( 4, 2),
( 5, 1),
( 6, 1),
( 7, 1),
( 8, 2),
( 9, 3),
(10, 3);

select id, val, LAG(val) over (order by id) as lagVal
from @tt;

-- find the good
select id, val 
from ( select id, val, LAG(val) over (order by id) as lagVal
       from @tt 
     ) tt
where  lagVal is null or lagVal <> val 

-- select the bad 
select tt.id, tt.val 
  from @tt tt
  left join ( select id, val 
                from ( select id, val, LAG(val) over (order by id) as lagVal
                         from @tt 
                     ) ttt
               where   ttt.lagVal is null or ttt.lagVal <> ttt.val 
            ) tttt 
    on tttt.id = tt.id 
 where tttt.id is null

This is interesting, so I thought I'd jump in. Unfortunately, creating a solution with RANK() (or rather, ROW_NUMBER()) without first transforming the data looks unattainable. In an attempt to transform the data, I came up with this solution, which uses one ROW_NUMBER():

;WITH Ordered AS
(
    SELECT ROW_NUMBER() OVER (ORDER BY [Date]) AS [Row], *
    FROM DataLogging
),
Final AS
(
    SELECT
        o1.*, NULLIF(o1.Value - ISNULL(o2.Value, o1.Value - 1), 0) [Change]
    FROM
        Ordered o1
        LEFT JOIN Ordered o2 ON
            o1.[Row] = o2.[Row] + 1
)
SELECT * FROM Final

In the last column, Change, the value will be NULL if Value did not change from the previous row (otherwise it holds the difference).

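So to do the delete, keep the CTE definitions and change the final select to the following (Change exists only in the Final CTE, not in DataLogging, and this assumes ID uniquely identifies a row):

DELETE FROM DataLogging
WHERE ID IN (SELECT ID FROM Final WHERE [Change] IS NULL)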

Edit: Lag would work here too, but I was visualizing the solution as I went along and completely forgot about it.
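The NULLIF/ISNULL expression in the Final CTE is compact, so here is a small Python sketch of what it computes for the question's sample values. It is only an illustration: the function name `change_column` is hypothetical, and `None` stands in for SQL NULL.

```python
# Value column from the question, already ordered by Date.
values = [1, 1, 2, 2, 1, 1, 1, 2, 3, 3]


def change_column(values):
    """Mimic NULLIF(o1.Value - ISNULL(o2.Value, o1.Value - 1), 0) per row."""
    changes = []
    for i, current in enumerate(values):
        # ISNULL(o2.Value, o1.Value - 1): the first row has no predecessor,
        # so a stand-in that differs by 1 keeps the first row from being NULL.
        previous = values[i - 1] if i > 0 else current - 1
        diff = current - previous
        # NULLIF(diff, 0): a zero difference becomes NULL (None here).
        changes.append(diff if diff != 0 else None)
    return changes


changes = change_column(values)
print(changes)
```

Rows where the entry is `None` are the ones the delete would target.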

Worked for my case, thanks! I had to fetch each reports_to change for an employee relative to the previous reports_to value and effdt. In other words, fetch the row with the minimum effective date for each reports_to change for an employee.

with tocheck as (
      select T.emplid, T.reports_to, T.effdt,
             lag(reports_to) over (order by effdt) as prev_value
      from PS_JOB t
     )
select *
from tocheck
where reports_to <> prev_value or prev_value is null;

added changes further as p 进一步增加了变化p

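Notice: the technical posts on this site follow the CC BY-SA 4.0 license. If you repost, please credit this site's URL or the original address. For any questions, contact: yoyou2525@163.com.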

 