Aggregating neighboring rows with partitioning

I have a huge data set on MS SQL Server 2012 that needs a special aggregation. Here is a sample of the data:

Key PartitionID StartTime                   Duration    Name
1   1           23/05/2019 18:18:28.125     1           X   
2   1           23/05/2019 18:18:28.480     2           Y   
3   1           23/05/2019 18:18:29.622     1           X   
4   1           23/05/2019 18:18:32.513     2           X   
5   2           23/05/2019 18:21:13.973     3           X   
6   2           23/05/2019 18:21:14.945     4           X   
7   2           23/05/2019 18:21:21.949     5           X   
8   2           23/05/2019 18:21:30.871     2           X   
9   2           23/05/2019 18:21:35.710     4           X   
10  2           23/05/2019 18:21:48.550     1           X   
11  2           23/05/2019 18:22:00.144     3           X   
12  2           23/05/2019 18:22:01.094     6           X   
13  2           23/05/2019 18:22:03.354     1           X   
14  3           23/05/2019 18:24:44.219     6           X   
15  3           23/05/2019 18:24:46.076     1           Y   
16  3           23/05/2019 18:24:52.399     4           X   
17  3           23/05/2019 18:25:03.620     6           X   
18  3           23/05/2019 18:25:11.208     1           X   
19  3           23/05/2019 18:25:12.616     4           X   
20  3           23/05/2019 18:25:28.019     6           X   
21  3           23/05/2019 18:25:31.384     2           Y   
21  3           23/05/2019 18:25:32.334     2           Y   
21  3           23/05/2019 18:25:33.344     2           X   

I have to create a new column, CalculatedID, that splits the data into sets based on Name. Neighboring rows with the same Name must get the same CalculatedID, while two runs of the same Name that are separated by a different Name must get different CalculatedIDs. As the expected result shows, the numbering restarts in each PartitionID.

The result should be similar to this:

Key PartitionID StartTime                   Duration    Name    CalculatedID
1   1           23/05/2019 18:18:28.125     1           X       1
2   1           23/05/2019 18:18:28.480     2           Y       2
3   1           23/05/2019 18:18:29.622     1           X       3
4   1           23/05/2019 18:18:32.513     2           X       3
5   2           23/05/2019 18:21:13.973     3           X       1
6   2           23/05/2019 18:21:14.945     4           X       1
7   2           23/05/2019 18:21:21.949     5           X       1
8   2           23/05/2019 18:21:30.871     2           X       1
9   2           23/05/2019 18:21:35.710     4           X       1
10  2           23/05/2019 18:21:48.550     1           X       1
11  2           23/05/2019 18:22:00.144     3           X       1
12  2           23/05/2019 18:22:01.094     6           X       1
13  2           23/05/2019 18:22:03.354     1           X       1
14  3           23/05/2019 18:24:44.219     6           X       1
15  3           23/05/2019 18:24:46.076     1           Y       2
16  3           23/05/2019 18:24:52.399     4           X       3
17  3           23/05/2019 18:25:03.620     6           X       3
18  3           23/05/2019 18:25:11.208     1           X       3
19  3           23/05/2019 18:25:12.616     4           X       3
20  3           23/05/2019 18:25:28.019     6           X       3
21  3           23/05/2019 18:25:31.384     2           Y       4
21  3           23/05/2019 18:25:32.334     2           Y       4
21  3           23/05/2019 18:25:33.344     2           X       5

I would really like to avoid looping through the data, since the sets easily exceed 10M.

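This can be done using a common table expression with LAG to get the previous value of Name for each row, based on the values of PartitionID and StartTime, and then SUM as a window function to get a cumulative sum of the rows where the previous Name is different than the current Name. The first row of each partition has no previous row, so LAG returns NULL, the comparison is not true, and the row counts as 1. That is why the numbering starts at 1 in every partition.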
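The idea can be sketched outside the database to see why it works. The Python snippet below is only an illustration of the logic, not the T-SQL solution itself; the `rows` list is a hypothetical miniature of the sample data, and `calculated_ids` is a made-up helper name.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical miniature of the sample data: (PartitionID, StartTime, Name).
rows = [
    (1, "18:18:28.125", "X"),
    (1, "18:18:28.480", "Y"),
    (1, "18:18:29.622", "X"),
    (1, "18:18:32.513", "X"),
    (2, "18:21:13.973", "X"),
    (2, "18:21:14.945", "X"),
]

def calculated_ids(rows):
    """Return each row extended with its CalculatedID."""
    result = []
    # Equivalent of PARTITION BY PartitionID ORDER BY StartTime.
    ordered = sorted(rows, key=itemgetter(0, 1))
    for _, group in groupby(ordered, key=itemgetter(0)):
        prev_name = None  # LAG(Name) yields NULL on the first row of a partition
        running = 0       # running SUM over the "name changed" flags
        for partition_id, start_time, name in group:
            running += 0 if name == prev_name else 1
            result.append((partition_id, start_time, name, running))
            prev_name = name
    return result

for row in calculated_ids(rows):
    print(row)
```

Each row contributes 1 when its Name differs from the previous one and 0 otherwise, so the running total only increases at the start of a new run of equal names.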

First, create and populate a sample table (please save us this step in your future questions):

DECLARE @T AS TABLE
(
    [Key] int,
    PartitionID int,
    StartTime datetime,
    Duration int,   
    Name char(1)
)

INSERT INTO @T ([Key] ,PartitionID, StartTime, Duration, Name) VALUES
(1 , 1, '2019-05-23T18:18:28.125', 1, 'X'),   
(2 , 1, '2019-05-23T18:18:28.480', 2, 'Y'),   
(3 , 1, '2019-05-23T18:18:29.622', 1, 'X'),   
(4 , 1, '2019-05-23T18:18:32.513', 2, 'X'),   
(5 , 2, '2019-05-23T18:21:13.973', 3, 'X'),   
(6 , 2, '2019-05-23T18:21:14.945', 4, 'X'),   
(7 , 2, '2019-05-23T18:21:21.949', 5, 'X'),   
(8 , 2, '2019-05-23T18:21:30.871', 2, 'X'),   
(9 , 2, '2019-05-23T18:21:35.710', 4, 'X'),   
(10, 2, '2019-05-23T18:21:48.550', 1, 'X'),   
(11, 2, '2019-05-23T18:22:00.144', 3, 'X'),   
(12, 2, '2019-05-23T18:22:01.094', 6, 'X'),   
(13, 2, '2019-05-23T18:22:03.354', 1, 'X'),   
(14, 3, '2019-05-23T18:24:44.219', 6, 'X'),   
(15, 3, '2019-05-23T18:24:46.076', 1, 'Y'),   
(16, 3, '2019-05-23T18:24:52.399', 4, 'X'),   
(17, 3, '2019-05-23T18:25:03.620', 6, 'X'),   
(18, 3, '2019-05-23T18:25:11.208', 1, 'X'),   
(19, 3, '2019-05-23T18:25:12.616', 4, 'X'),   
(20, 3, '2019-05-23T18:25:28.019', 6, 'X'),   
(21, 3, '2019-05-23T18:25:31.384', 2, 'Y'),   
(21, 3, '2019-05-23T18:25:32.334', 2, 'Y'),   
(21, 3, '2019-05-23T18:25:33.344', 2, 'X')

The common table expression:

;WITH CTE AS
(
    SELECT  [Key] ,PartitionID, StartTime, Duration, Name,
            LAG(Name) OVER(PARTITION BY PartitionID ORDER BY StartTime) As PrevName
    FROM @T
)

The query (it must directly follow the common table expression, as part of the same statement):

SELECT  [Key] ,PartitionID, StartTime, Duration, Name, 
        SUM(IIF(Name = PrevName, 0, 1)) OVER(PARTITION BY PartitionID ORDER BY StartTime) As CalculatedId
FROM CTE
ORDER BY [Key]

Results:

Key PartitionID StartTime               Duration    Name    CalculatedId
1   1           23.05.2019 18:18:28     1           X       1
2   1           23.05.2019 18:18:28     2           Y       2
3   1           23.05.2019 18:18:29     1           X       3
4   1           23.05.2019 18:18:32     2           X       3
5   2           23.05.2019 18:21:13     3           X       1
6   2           23.05.2019 18:21:14     4           X       1
7   2           23.05.2019 18:21:21     5           X       1
8   2           23.05.2019 18:21:30     2           X       1
9   2           23.05.2019 18:21:35     4           X       1
10  2           23.05.2019 18:21:48     1           X       1
11  2           23.05.2019 18:22:00     3           X       1
12  2           23.05.2019 18:22:01     6           X       1
13  2           23.05.2019 18:22:03     1           X       1
14  3           23.05.2019 18:24:44     6           X       1
15  3           23.05.2019 18:24:46     1           Y       2
16  3           23.05.2019 18:24:52     4           X       3
17  3           23.05.2019 18:25:03     6           X       3
18  3           23.05.2019 18:25:11     1           X       3
19  3           23.05.2019 18:25:12     4           X       3
20  3           23.05.2019 18:25:28     6           X       3
21  3           23.05.2019 18:25:31     2           Y       4
21  3           23.05.2019 18:25:32     2           Y       4
21  3           23.05.2019 18:25:33     2           X       5  
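For readers without a SQL Server instance at hand, the same pattern can be cross-checked with Python's built-in sqlite3 module. This is a sketch under assumptions: SQLite (version 3.25 or later, for window functions) stands in for SQL Server, the table is a regular table named T rather than a table variable, only the first six sample rows are loaded, and CASE replaces IIF for portability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE T ([Key] INTEGER, PartitionID INTEGER, StartTime TEXT, Name TEXT)"
)
# First six rows of the sample data (Duration omitted, it plays no role here).
conn.executemany(
    "INSERT INTO T ([Key], PartitionID, StartTime, Name) VALUES (?, ?, ?, ?)",
    [
        (1, 1, "2019-05-23T18:18:28.125", "X"),
        (2, 1, "2019-05-23T18:18:28.480", "Y"),
        (3, 1, "2019-05-23T18:18:29.622", "X"),
        (4, 1, "2019-05-23T18:18:32.513", "X"),
        (5, 2, "2019-05-23T18:21:13.973", "X"),
        (6, 2, "2019-05-23T18:21:14.945", "X"),
    ],
)

# The CTE and the final SELECT form one single statement.
query = """
WITH CTE AS (
    SELECT [Key], PartitionID, StartTime, Name,
           LAG(Name) OVER (PARTITION BY PartitionID ORDER BY StartTime) AS PrevName
    FROM T
)
SELECT [Key], PartitionID, Name,
       SUM(CASE WHEN Name = PrevName THEN 0 ELSE 1 END)
           OVER (PARTITION BY PartitionID ORDER BY StartTime) AS CalculatedId
FROM CTE
ORDER BY PartitionID, StartTime
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

The final ORDER BY uses PartitionID and StartTime instead of [Key] because the sample data repeats the key 21, so [Key] alone would not fix the order of those rows.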
