
SQL create identifiers based on conditions

I have an SQL database and I have to identify certain 'groups' of rows based on an identifier.

Basically, I have one column with another identifier and one column with the time difference between rows. The table is ordered by these values, as shown in this example:

ID timedifference
A 21
A 30
A 60
A 50
B 32
B 120
B 20
C 124
C 10

I want to group the rows that belong together under the same group identifier, and use a clause so that the group identifier changes when either of the following conditions is met:

  • timedifference > 44 OR ID value is different from the previous row

This should result in the following table:

ID timedifference GroupID
A 21 1
A 30 1
A 60 2
A 50 3
B 32 4
B 120 5
B 20 5
C 124 6
C 10 6

You can use SQL window functions to access the preceding row. However, you need to provide a rule for ordering the query results. You say that

Basically i have one column with another identifier and a column with time difference between the rows. The table is ordered by these values as show in this example

But, as pointed out in the comments, this is not the case in your example listing (for instance, A 60 comes before A 50).

Creating Order

For this answer, I assume that there is a well-defined way to order the rows in your example, so I added a column item_order to the table.

id time_difference item_order
A 21 0
A 30 1
A 60 2
A 50 3
B 32 4
B 120 5
B 20 6
C 124 7
C 10 8
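
If you want to reproduce the steps below, here is a minimal sketch that builds this sample table. It assumes SQLite through Python's built-in sqlite3 module, which is only my choice for a self-contained example; the question does not name a database. The column types are assumptions as well.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the real one.
conn = sqlite3.connect(":memory:")

# test_table and item_order follow the naming used in this answer;
# the column types are guesses based on the sample data.
conn.execute(
    "CREATE TABLE test_table ("
    " item_order INTEGER PRIMARY KEY,"
    " id TEXT NOT NULL,"
    " time_difference INTEGER NOT NULL)"
)

rows = [
    (0, "A", 21), (1, "A", 30), (2, "A", 60), (3, "A", 50),
    (4, "B", 32), (5, "B", 120), (6, "B", 20),
    (7, "C", 124), (8, "C", 10),
]
conn.executemany(
    "INSERT INTO test_table (item_order, id, time_difference) VALUES (?, ?, ?)",
    rows,
)

# Read the rows back in the explicit order defined by item_order.
ordered = conn.execute(
    "SELECT id, time_difference, item_order FROM test_table ORDER BY item_order"
).fetchall()
for row in ordered:
    print(row)
```

Making item_order the primary key is a design choice of this sketch: it guarantees that the ordering is unambiguous and gives each row a unique key.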

Accessing Preceding Row

SQL window functions let you access rows outside the current row of a query result. The LAG() window function gives you access to the preceding row in your ordered result set (i.e. the "window"). The OVER (ORDER BY item_order ASC) clause defines this window and its order.

For instance

SELECT
    time_difference,
    LAG(time_difference, 1) OVER (ORDER BY item_order) AS "previous_row_time_difference",
    item_order
FROM test_table
ORDER BY item_order

Will result in:

| time_difference | previous_row_time_difference | item_order |
|:--- |:--- |:--- |
| 21 | NULL | 0 |
| 30 | 21 | 1 |
| 60 | 30 | 2 |
| 50 | 60 | 3 |
| 32 | 50 | 4 |
| 120 | 32 | 5 |
| 20 | 120 | 6 |
| 124 | 20 | 7 |
| 10 | 124 | 8 |

Comparing Current with Preceding Row

You can use a SQL CASE expression to check if your condition is met:

SELECT id,
       time_difference,
       CASE
           WHEN LAG(id, 1) OVER ( ORDER BY item_order ASC) IS NULL -- this will be true in the very first row
                   OR LAG(id, 1) OVER ( ORDER BY item_order ASC) != id -- is the ID value different from the previous row
                   OR time_difference > 44 -- is time_difference bigger than 44
               THEN 1
               ELSE 0
       END AS "has_different_group_from_preceding_row"
FROM test_table
ORDER BY item_order

This returns 1 in the has_different_group_from_preceding_row column for any row which meets your condition:

id time_difference has_different_group_from_preceding_row
A 21 1
A 30 0
A 60 1
A 50 1
B 32 1
B 120 1
B 20 0
C 124 1
C 10 0

Create Group ID

Finally, we need to add the incrementing group counter in the group_id column. One option is to sum the values of has_different_group_from_preceding_row over the current row and all preceding rows.

For this we add ROW_NUMBER() OVER (ORDER BY item_order ASC) AS "row_num" to the query above and turn it into a common table expression (CTE) using WITH.

-- CTE that flags every row where a new group starts
WITH detect_difference_between_rows AS (
    SELECT id,
           time_difference,
           CASE
               WHEN LAG(id, 1) OVER (ORDER BY item_order ASC) IS NULL -- this will be true in the very first row
                       OR LAG(id, 1) OVER (ORDER BY item_order ASC) != id -- is the ID value different from the previous row
                       OR time_difference > 44 -- is time_difference bigger than 44
                   THEN 1
                   ELSE 0
           END AS "has_different_group_from_preceding_row",
           ROW_NUMBER() OVER (ORDER BY item_order ASC) AS "row_num" -- create the row number
    FROM test_table
    -- no ORDER BY here: the final order is set by the outer query
)

SELECT id, time_difference,
       -- correlated subquery: sum `has_different_group_from_preceding_row` over this row and all preceding rows
       (
           SELECT SUM(has_different_group_from_preceding_row)
           FROM detect_difference_between_rows sdbr2
           WHERE sdbr2.row_num <= sdbr1.row_num
       ) AS "group_id"
FROM detect_difference_between_rows AS sdbr1
ORDER BY sdbr1.row_num;

This returns:

id time_difference group_id
A 21 1
A 30 1
A 60 2
A 50 3
B 32 4
B 120 5
B 20 5
C 124 6
C 10 6
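
As an alternative to the correlated subquery, the same counter can be written as a running total with SUM() as a window function. The following is a sketch under assumptions, not a drop-in for your database: it uses SQLite (3.25 or newer, for window function support) through Python's sqlite3 module, and the names flagged and starts_new_group are made up for this example.

```python
import sqlite3

# Assumption: in-memory SQLite with the sample data from this answer.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE test_table ("
    " item_order INTEGER PRIMARY KEY, id TEXT NOT NULL,"
    " time_difference INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO test_table (item_order, id, time_difference) VALUES (?, ?, ?)",
    [(0, "A", 21), (1, "A", 30), (2, "A", 60), (3, "A", 50),
     (4, "B", 32), (5, "B", 120), (6, "B", 20),
     (7, "C", 124), (8, "C", 10)],
)

# "flagged" and "starts_new_group" are hypothetical names for this sketch.
query = """
WITH flagged AS (
    SELECT id,
           time_difference,
           item_order,
           CASE
               WHEN LAG(id) OVER (ORDER BY item_order) IS NULL   -- first row
                    OR LAG(id) OVER (ORDER BY item_order) != id  -- id changed
                    OR time_difference > 44                      -- large gap
               THEN 1
               ELSE 0
           END AS starts_new_group
    FROM test_table
)
SELECT id,
       time_difference,
       -- running total of the flags up to and including the current row
       SUM(starts_new_group) OVER (
           ORDER BY item_order
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS group_id
FROM flagged
ORDER BY item_order
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

The running total avoids the ROW_NUMBER() helper column and lets the database compute all group IDs in a single ordered pass instead of re-summing the preceding rows for every row.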

Further Considerations

  • Updating the group_id column: The SQL query above does not update a group_id column in the original table; it is only a SELECT. Running an UPDATE on a table from a subquery on that very same table is tricky. There is a way to update a table from a join, but this requires a primary key to match rows in the join. One workaround would be to create a temporary table and INSERT the records from the SELECT above.

  • Performance: The correlated subquery re-sums the preceding rows for every row, so it will be time consuming on large tables. I assume that the group_id is populated only once. Window functions are part of the SQL standard and should be supported by most databases. But I would not be surprised if other SQL dialects have a more performant way to solve this problem.

  • Column naming: The column name ID (or "id") is usually the primary key of a table. In this example, the ID column is used for something that is more like a category ('A', 'B', etc.). I would change the name accordingly. Otherwise this may cause issues for developers, as they will assume that ID is a unique identifier for each row/record.
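
To illustrate the temporary-table workaround from the first point, here is a rough sketch. It rests on several assumptions that are not in the question: SQLite (3.25 or newer) through Python's sqlite3 module, a unique item_order column to match rows, a group_id column already present on test_table, and the hypothetical names computed_groups and starts_new_group.

```python
import sqlite3

# Assumption: in-memory SQLite; test_table already has an empty group_id column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE test_table ("
    " item_order INTEGER PRIMARY KEY, id TEXT NOT NULL,"
    " time_difference INTEGER NOT NULL, group_id INTEGER)"
)
conn.executemany(
    "INSERT INTO test_table (item_order, id, time_difference) VALUES (?, ?, ?)",
    [(0, "A", 21), (1, "A", 30), (2, "A", 60), (3, "A", 50),
     (4, "B", 32), (5, "B", 120), (6, "B", 20),
     (7, "C", 124), (8, "C", 10)],
)

# Step 1: materialise the computed group ids in a temporary table,
# keyed by item_order ("computed_groups" is a hypothetical name).
conn.execute("""
CREATE TEMP TABLE computed_groups AS
SELECT item_order,
       SUM(starts_new_group) OVER (
           ORDER BY item_order
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS group_id
FROM (
    SELECT item_order,
           CASE
               WHEN LAG(id) OVER (ORDER BY item_order) IS NULL
                    OR LAG(id) OVER (ORDER BY item_order) != id
                    OR time_difference > 44
               THEN 1
               ELSE 0
           END AS starts_new_group
    FROM test_table
) AS flagged
""")

# Step 2: copy the values back, matching rows on the unique item_order key.
conn.execute("""
UPDATE test_table
SET group_id = (
    SELECT cg.group_id
    FROM computed_groups AS cg
    WHERE cg.item_order = test_table.item_order
)
""")
conn.commit()

stored = conn.execute(
    "SELECT id, time_difference, group_id FROM test_table ORDER BY item_order"
).fetchall()
for row in stored:
    print(row)
```

Splitting the work into two statements keeps the UPDATE from reading and writing the same table through a window function, at the cost of a short-lived extra table.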
