SQL create identifiers based on conditions

Question

I have an SQL database and I have to identify certain 'groups' of rows based on an identifier.

Basically, I have one column with another identifier and a column with the time difference between the rows. The table is ordered by these values, as shown in this example:

ID	timedifference
A	21
A	30
A	60
A	50
B	32
B	120
B	20
C	124
C	10

I want to group the rows that belong together under the same identifier, using a rule so that the identifier value changes when one of the following conditions is met:

timedifference > 44 OR the ID value is different from the previous row

This should result in the following table:

ID	timedifference	GroupID
A	21	1
A	30	1
A	60	2
A	50	3
B	32	4
B	120	5
B	20	5
C	124	6
C	10	6

Answer 1

You can use SQL Window Functions to access the preceding row. However, you need to provide a rule on how to order the query results. You say that

Basically, I have one column with another identifier and a column with the time difference between the rows. The table is ordered by these values, as shown in this example

But, as pointed out in the comments, this is not the case in your example listing.

Creating Order

For this answer, I assume that there is a well-defined way to order the rows in your example. To make that order explicit, I added a column item_order to the table.

id	time_difference	item_order
A	21	0
A	30	1
A	60	2
A	50	3
B	32	4
B	120	5
B	20	6
C	124	7
C	10	8
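
To follow along, the example table can be reproduced locally. The following is a minimal sketch, not part of the original setup: it uses Python's built-in sqlite3 module with an in-memory database, and the column types and the PRIMARY KEY on item_order are my assumptions.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the real one.
conn = sqlite3.connect(":memory:")

# Assumption: column types are guessed; item_order is the explicit sort key.
conn.execute(
    "CREATE TABLE test_table ("
    " item_order INTEGER PRIMARY KEY,"
    " id TEXT NOT NULL,"
    " time_difference INTEGER NOT NULL)"
)

# The rows from the question, in the order they were listed.
rows = [
    ("A", 21), ("A", 30), ("A", 60), ("A", 50),
    ("B", 32), ("B", 120), ("B", 20),
    ("C", 124), ("C", 10),
]

# item_order is simply the position of each row in the listing (0-based).
conn.executemany(
    "INSERT INTO test_table (item_order, id, time_difference) VALUES (?, ?, ?)",
    [(position, id_, diff) for position, (id_, diff) in enumerate(rows)],
)

stored = conn.execute(
    "SELECT id, time_difference, item_order FROM test_table ORDER BY item_order"
).fetchall()
```

In a real table, item_order would more likely be a timestamp or an auto-incrementing key; anything that orders the rows unambiguously will do.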

Accessing Preceding Row

SQL window functions let you access rows outside the current row of a query result. The LAG() window function gives you access to the preceding row in your ordered result set (i.e. the "window"). The OVER (ORDER BY item_order ASC) clause defines this window and its order.

For instance

SELECT
    time_difference,
    LAG(time_difference, 1) OVER (ORDER BY item_order) AS "previous_row_time_difference",
    item_order
FROM test_table
ORDER BY item_order

This results in:

time_difference	previous_row_time_difference	item_order
21	NULL	0
30	21	1
60	30	2
50	60	3
32	50	4
120	32	5
20	120	6
124	20	7
10	124	8

Comparing Current with Preceding Row

You can use a SQL CASE expression to check if your condition is met:

SELECT id,
       time_difference,
       CASE
           WHEN LAG(id, 1) OVER ( ORDER BY item_order ASC) IS NULL -- this will be true in the very first row
                   OR LAG(id, 1) OVER ( ORDER BY item_order ASC) != id -- is the ID value different from the previous row
                   OR time_difference > 44 -- is time_difference bigger than 44
               THEN 1
               ELSE 0
       END AS "has_different_group_from_preceding_row"
FROM test_table
ORDER BY item_order

This returns 1 in the has_different_group_from_preceding_row column for any row which meets your condition:

id	time_difference	has_different_group_from_preceding_row
A	21	1
A	30	0
A	60	1
A	50	1
B	32	1
B	120	1
B	20	0
C	124	1
C	10	0

Create Group ID

Finally, we need to add the incrementing group counter in the group_id column. One option is to sum the values of has_different_group_from_preceding_row over the current row and all rows before it.

For this, we add ROW_NUMBER() OVER (ORDER BY item_order ASC) AS "row_num" to the query above and turn it into a common table expression (a named subquery) using WITH.

-- common table expression that flags every row which starts a new group
WITH detect_difference_between_rows AS (
    SELECT id,
           time_difference,
           CASE
               WHEN LAG(id, 1) OVER (ORDER BY item_order ASC) IS NULL -- this will be true in the very first row
                       OR LAG(id, 1) OVER (ORDER BY item_order ASC) != id -- is the ID value different from the previous row
                       OR time_difference > 44 -- is time_difference bigger than 44
                   THEN 1
                   ELSE 0
           END AS "has_different_group_from_preceding_row",
           ROW_NUMBER() OVER (ORDER BY item_order ASC) AS "row_num" -- create the row number
    FROM test_table
)

SELECT id, time_difference,
       -- correlated subquery: sum `has_different_group_from_preceding_row` over this row and all preceding rows
       (
           SELECT SUM(has_different_group_from_preceding_row)
           FROM detect_difference_between_rows AS ddbr2
           WHERE ddbr2.row_num <= ddbr1.row_num
       ) AS "group_id"
FROM detect_difference_between_rows AS ddbr1
ORDER BY ddbr1.row_num; -- sort the final result; an ORDER BY inside the CTE would not guarantee this

id	time_difference	group_id
A	21	1
A	30	1
A	60	2
A	50	3
B	32	4
B	120	5
B	20	5
C	124	6
C	10	6
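
As an alternative to the correlated subquery, the running total can also be written with SUM() as a window function. The following is a minimal sketch rather than the query used above: Python's sqlite3 module serves only as a harness to make the SQL runnable, it assumes a SQLite build with window-function support (3.25 or later), and the alias is_new_group and the column types are illustrative assumptions.

```python
import sqlite3

# Assumption: in-memory SQLite database with the example rows from above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE test_table (item_order INTEGER PRIMARY KEY, id TEXT, time_difference INTEGER)"
)
conn.executemany(
    "INSERT INTO test_table VALUES (?, ?, ?)",
    [(0, "A", 21), (1, "A", 30), (2, "A", 60), (3, "A", 50), (4, "B", 32),
     (5, "B", 120), (6, "B", 20), (7, "C", 124), (8, "C", 10)],
)

RUNNING_SUM_QUERY = """
WITH flagged AS (
    SELECT id,
           time_difference,
           item_order,
           CASE
               WHEN LAG(id) OVER (ORDER BY item_order) IS NULL   -- very first row
                    OR LAG(id) OVER (ORDER BY item_order) <> id  -- id changed
                    OR time_difference > 44                      -- gap too large
               THEN 1
               ELSE 0
           END AS is_new_group  -- hypothetical alias, 1 where a new group starts
    FROM test_table
)
SELECT id,
       time_difference,
       -- running total of the flags up to and including the current row
       SUM(is_new_group) OVER (
           ORDER BY item_order
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS group_id
FROM flagged
ORDER BY item_order
"""

grouped = conn.execute(RUNNING_SUM_QUERY).fetchall()
for row in grouped:
    print(row)
```

For the example data, this should produce the same group_id values as the table above, without a self-join on the row number.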

Further Considerations

Updating group_id column: The SQL query above does not update the group_id column of the original table; it is only a SELECT. Running an UPDATE on a table from a subquery on that very same table is tricky. There is a way to update a table from a join, but this requires a primary key to match rows in the join. One workaround would be to create a temporary table and INSERT the records from the SELECT above.
Performance: The summation over preceding rows will be time consuming, because the subquery is evaluated again for every row. I assume that the group_id is populated only once. Window functions are part of the SQL standard and should be supported across most DBs, but I would not be surprised if other SQL dialects have a more performant way to solve this problem.
Column Naming: The column name ID (or "id") is usually the primary key of a table. In this example, the ID column is used for something that is more like a category ('A', 'B', etc.). I would change the name accordingly. Otherwise this may cause issues for developers, as they will assume that ID is a unique identifier for each row/record.
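
The temporary-table workaround for updating group_id could look roughly like the sketch below. It is an illustration under several assumptions, not a tested recipe for your database: it runs against SQLite through Python's sqlite3 module, it assumes item_order is a unique key, the table name computed_groups and the alias is_new_group are hypothetical, and it computes the running total with SUM() as a window function in place of the correlated subquery.

```python
import sqlite3

# Assumption: in-memory SQLite database; group_id starts out empty (NULL).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE test_table ("
    " item_order INTEGER PRIMARY KEY, id TEXT,"
    " time_difference INTEGER, group_id INTEGER)"
)
conn.executemany(
    "INSERT INTO test_table (item_order, id, time_difference) VALUES (?, ?, ?)",
    [(0, "A", 21), (1, "A", 30), (2, "A", 60), (3, "A", 50), (4, "B", 32),
     (5, "B", 120), (6, "B", 20), (7, "C", 124), (8, "C", 10)],
)

# Step 1: materialise the computed group ids in a temporary table
# (computed_groups is a hypothetical name).
conn.execute("""
    CREATE TEMP TABLE computed_groups AS
    SELECT item_order,
           -- item_order is unique, so the default window frame is a running total
           SUM(is_new_group) OVER (ORDER BY item_order) AS group_id
    FROM (
        SELECT item_order,
               CASE
                   WHEN LAG(id) OVER (ORDER BY item_order) IS NULL
                        OR LAG(id) OVER (ORDER BY item_order) <> id
                        OR time_difference > 44
                   THEN 1
                   ELSE 0
               END AS is_new_group
        FROM test_table
    ) AS flagged
""")

# Step 2: copy the values back, matching rows on the unique item_order key.
conn.execute("""
    UPDATE test_table
    SET group_id = (
        SELECT cg.group_id
        FROM computed_groups AS cg
        WHERE cg.item_order = test_table.item_order
    )
""")
conn.commit()

updated = conn.execute(
    "SELECT id, time_difference, group_id FROM test_table ORDER BY item_order"
).fetchall()
```

Splitting the work into two statements keeps the UPDATE from reading the table it is modifying through a window function, at the cost of one extra temporary table.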
