I have an SQL database and I have to identify certain 'groups' of rows based on an identifier.
Basically, I have one column with an identifier and another column with the time difference between rows. The table is ordered by these values, as shown in this example:
ID | timedifference |
---|---|
A | 21 |
A | 30 |
A | 60 |
A | 50 |
B | 32 |
B | 120 |
B | 20 |
C | 124 |
C | 10 |
I want to group the rows that belong together under the same identifier, with the group identifier incrementing whenever one of the following conditions is met: the ID differs from the previous row, or timedifference is greater than 44. The expected result is:
ID | timedifference | GroupID |
---|---|---|
A | 21 | 1 |
A | 30 | 1 |
A | 60 | 2 |
A | 50 | 3 |
B | 32 | 4 |
B | 120 | 5 |
B | 20 | 5 |
C | 124 | 6 |
C | 10 | 6 |
You can use SQL window functions to access the preceding row. However, you need to provide a rule for ordering the query results. You say that
"Basically, I have one column with an identifier and another column with the time difference between rows. The table is ordered by these values, as shown in this example",
but, as pointed out in the comments, this is not the case in your example listing.
Creating Order
For this answer, I assume that there is a well-defined way to order the rows in your example: I added a column `item_order` to the table.
id | time_difference | item_order |
---|---|---|
A | 21 | 0 |
A | 30 | 1 |
A | 60 | 2 |
A | 50 | 3 |
B | 32 | 4 |
B | 120 | 5 |
B | 20 | 6 |
C | 124 | 7 |
C | 10 | 8 |
Accessing Preceding Row
SQL window functions let you access rows outside the current row of a query result. The `LAG()` window function gives you access to the preceding row in your ordered result set (i.e. the "window"). The `OVER (ORDER BY item_order ASC)` clause defines this window and its order.
For instance, this query:
SELECT
    time_difference,
    LAG(time_difference, 1) OVER (ORDER BY item_order) AS "previous_row_time_difference",
    item_order
FROM test_table
ORDER BY item_order
This results in:
time_difference | previous_row_time_difference | item_order |
---|---|---|
21 | NULL | 0 |
30 | 21 | 1 |
60 | 30 | 2 |
50 | 60 | 3 |
32 | 50 | 4 |
120 | 32 | 5 |
20 | 120 | 6 |
124 | 20 | 7 |
10 | 124 | 8 |
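To try the `LAG()` query locally, here is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database. The choice of SQLite is an assumption (the question does not name a database engine), and window functions need SQLite 3.25 or newer. The helper names `build_test_table` and `previous_time_differences` are hypothetical; the table and column names follow the example above.

```python
import sqlite3

# Sample data from the example above: (id, time_difference, item_order).
ROWS = [
    ("A", 21, 0), ("A", 30, 1), ("A", 60, 2), ("A", 50, 3),
    ("B", 32, 4), ("B", 120, 5), ("B", 20, 6),
    ("C", 124, 7), ("C", 10, 8),
]

# Same query as above: LAG() reads time_difference from the preceding row.
LAG_QUERY = """
SELECT time_difference,
       LAG(time_difference, 1) OVER (ORDER BY item_order) AS previous_row_time_difference,
       item_order
FROM test_table
ORDER BY item_order
"""


def build_test_table():
    """Create an in-memory SQLite database that holds test_table (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE test_table (id TEXT, time_difference INTEGER, item_order INTEGER)"
    )
    conn.executemany("INSERT INTO test_table VALUES (?, ?, ?)", ROWS)
    return conn


def previous_time_differences():
    """Return (time_difference, previous_row_time_difference, item_order) tuples."""
    conn = build_test_table()
    try:
        return conn.execute(LAG_QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in previous_time_differences():
        print(row)
```

The first row has no predecessor, so its `previous_row_time_difference` comes back as `None`, which is how `sqlite3` represents SQL `NULL`.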
Comparing Current with Preceding Row
You can use a SQL `CASE` expression to check whether one of your conditions is met:
SELECT id,
       time_difference,
       CASE
           WHEN LAG(id, 1) OVER (ORDER BY item_order ASC) IS NULL  -- true only for the very first row
             OR LAG(id, 1) OVER (ORDER BY item_order ASC) != id    -- is the ID value different from the previous row
             OR time_difference > 44                               -- is time_difference bigger than 44
           THEN 1
           ELSE 0
       END AS "has_different_group_from_preceding_row"
FROM test_table
ORDER BY item_order
This returns 1 in the `has_different_group_from_preceding_row` column for every row that meets one of your conditions:
id | time_difference | has_different_group_from_preceding_row |
---|---|---|
A | 21 | 1 |
A | 30 | 0 |
A | 60 | 1 |
A | 50 | 1 |
B | 32 | 1 |
B | 120 | 1 |
B | 20 | 0 |
C | 124 | 1 |
C | 10 | 0 |
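The flag query above can be checked the same way. The sketch below again assumes SQLite (3.25 or newer) through Python's `sqlite3` module, and passes the threshold of 44 as a bound parameter so it is easy to experiment with other values. The function name `group_start_flags` and the `:threshold` parameter are hypothetical, not part of the original query.

```python
import sqlite3

# Sample data from the example above: (id, time_difference, item_order).
ROWS = [
    ("A", 21, 0), ("A", 30, 1), ("A", 60, 2), ("A", 50, 3),
    ("B", 32, 4), ("B", 120, 5), ("B", 20, 6),
    ("C", 124, 7), ("C", 10, 8),
]

FLAG_QUERY = """
SELECT id,
       time_difference,
       CASE
           WHEN LAG(id, 1) OVER (ORDER BY item_order) IS NULL   -- very first row
             OR LAG(id, 1) OVER (ORDER BY item_order) != id     -- ID changed
             OR time_difference > :threshold                    -- gap above threshold
           THEN 1
           ELSE 0
       END AS has_different_group_from_preceding_row
FROM test_table
ORDER BY item_order
"""


def group_start_flags(threshold=44):
    """Return (id, time_difference, flag) tuples; flag is 1 where a new group starts."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE test_table (id TEXT, time_difference INTEGER, item_order INTEGER)"
        )
        conn.executemany("INSERT INTO test_table VALUES (?, ?, ?)", ROWS)
        return conn.execute(FLAG_QUERY, {"threshold": threshold}).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in group_start_flags():
        print(row)
```

With a very large threshold, only the ID changes (and the first row) are flagged, which is a quick way to confirm that the two conditions act independently.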
Create Group ID
Finally, we need to add the incrementing group counter as the `group_id` column. One option is to sum all values of `has_different_group_from_preceding_row` from the first row up to and including the current row.
For this, we add `ROW_NUMBER() OVER (ORDER BY item_order ASC) AS "row_num"` to the query above and turn it into a common table expression using `WITH`:
-- common table expression (CTE) that flags every row which starts a new group
WITH detect_difference_between_rows AS (
    SELECT id,
           time_difference,
           CASE
               WHEN LAG(id, 1) OVER (ORDER BY item_order ASC) IS NULL  -- true only for the very first row
                 OR LAG(id, 1) OVER (ORDER BY item_order ASC) != id    -- is the ID value different from the previous row
                 OR time_difference > 44                               -- is time_difference bigger than 44
               THEN 1
               ELSE 0
           END AS "has_different_group_from_preceding_row",
           ROW_NUMBER() OVER (ORDER BY item_order ASC) AS "row_num"    -- create the row number
    FROM test_table
)
SELECT id,
       time_difference,
       -- correlated subquery: sum `has_different_group_from_preceding_row` over all rows up to the current one
       (
           SELECT SUM(has_different_group_from_preceding_row)
           FROM detect_difference_between_rows AS sdbr2
           WHERE sdbr2.row_num <= sdbr1.row_num
       ) AS "group_id"
FROM detect_difference_between_rows AS sdbr1
ORDER BY sdbr1.row_num;  -- order the final result explicitly; an ORDER BY inside the CTE would not guarantee it
id | time_difference | group_id |
---|---|---|
A | 21 | 1 |
A | 30 | 1 |
A | 60 | 2 |
A | 50 | 3 |
B | 32 | 4 |
B | 120 | 5 |
B | 20 | 5 |
C | 124 | 6 |
C | 10 | 6 |
Further Considerations
Updating the `group_id` column: The SQL query above does not write a `group_id` column to the original table; it is only a `SELECT`. Running an `UPDATE` on a table from a subquery on that very same table is tricky. There is a way to update a table from a join, but this requires a primary key to match rows in the join. One workaround would be to create a temporary table and `INSERT` the records from the `SELECT` above.
Performance: Summing over all preceding rows with a correlated subquery will be time-consuming on large tables, because each row triggers another scan of the rows before it. I assume that `group_id` is populated only once. Window functions are part of the SQL standard and should be supported by most databases, but I would not be surprised if other SQL dialects have a more performant way to solve this problem.
Column Naming: The column name `ID` (or `id`) is usually the primary key of a table. In this example, the `ID` column is used for something that is more like a category ('A', 'B', etc.). I would change the name accordingly. Otherwise this may cause issues for developers, as they will assume that `ID` is a unique identifier for each row/record.
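On the performance point above, one commonly used alternative to the correlated subquery is a running sum computed with `SUM() OVER`. Window functions cannot be nested, so the flag is computed in a CTE first and summed in the outer query. This is a sketch under assumptions, not the method from the answer: it uses SQLite (3.25 or newer) through Python's `sqlite3` module, and the names `flagged`, `starts_new_group`, and `running_group_ids` are hypothetical.

```python
import sqlite3

# Sample data from the example above: (id, time_difference, item_order).
ROWS = [
    ("A", 21, 0), ("A", 30, 1), ("A", 60, 2), ("A", 50, 3),
    ("B", 32, 4), ("B", 120, 5), ("B", 20, 6),
    ("C", 124, 7), ("C", 10, 8),
]

RUNNING_SUM_QUERY = """
WITH flagged AS (
    -- Step 1: flag each row that starts a new group.
    SELECT id,
           time_difference,
           item_order,
           CASE
               WHEN LAG(id, 1) OVER (ORDER BY item_order) IS NULL
                 OR LAG(id, 1) OVER (ORDER BY item_order) != id
                 OR time_difference > 44
               THEN 1
               ELSE 0
           END AS starts_new_group
    FROM test_table
)
-- Step 2: the running sum of the flags is the group counter.
SELECT id,
       time_difference,
       SUM(starts_new_group) OVER (
           ORDER BY item_order
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS group_id
FROM flagged
ORDER BY item_order
"""


def running_group_ids():
    """Return (id, time_difference, group_id) tuples computed with a running sum."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE test_table (id TEXT, time_difference INTEGER, item_order INTEGER)"
        )
        conn.executemany("INSERT INTO test_table VALUES (?, ?, ?)", ROWS)
        return conn.execute(RUNNING_SUM_QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in running_group_ids():
        print(row)
```

For the sample data this yields the same `group_id` values as the correlated subquery, while letting the database compute the counter as a single window aggregate over the ordered rows instead of running one subquery per row.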