Performant way to self-join and filter by revised rows

Question

I'm trying to select all rows in this table, with the constraint that revised rows are selected instead of the originals. If a row has a revision, that revision is selected instead of the row; if there are multiple revisions, the one with the highest revision number is preferred.

I think an example table, output, and query will explain this better:

Table:

+----+-------+-------------+-----------------+-------------+
| id | value | original_id | revision_number | is_revision |
+----+-------+-------------+-----------------+-------------+
|  1 | abcd  | null        | null            |           0 |
|  2 | zxcv  | null        | null            |           0 |
|  3 | qwert | null        | null            |           0 |
|  4 | abd   | 1           | 1               |           1 |
|  5 | abcde | 1           | 2               |           1 |
|  6 | zxcvb | 2           | 1               |           1 |
|  7 | poiu  | null        | null            |           0 |
+----+-------+-------------+-----------------+-------------+

Desired Output:

+----+-------+-------------+-----------------+
| id | value | original_id | revision_number |
+----+-------+-------------+-----------------+
|  3 | qwert | null        | null            |
|  5 | abcde | 1           | 2               |
|  6 | zxcvb | 2           | 1               |
|  7 | poiu  | null        | null            |
+----+-------+-------------+-----------------+
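
To reproduce the example, the sample data can be loaded with something like the script below. The column types are assumptions (the question does not state them); only the column names and row values come from the table above.

```sql
-- Sample schema; the column types are assumed, not taken from the question.
CREATE TABLE responses (
    id              INT         NOT NULL PRIMARY KEY,
    value           VARCHAR(50) NOT NULL,
    original_id     INT         NULL,   -- id of the row this one revises
    revision_number INT         NULL,   -- 1, 2, ... within one original_id
    is_revision     TINYINT     NOT NULL DEFAULT 0
);

-- The seven rows from the example table.
INSERT INTO responses (id, value, original_id, revision_number, is_revision)
VALUES
    (1, 'abcd',  NULL, NULL, 0),
    (2, 'zxcv',  NULL, NULL, 0),
    (3, 'qwert', NULL, NULL, 0),
    (4, 'abd',   1,    1,    1),
    (5, 'abcde', 1,    2,    1),
    (6, 'zxcvb', 2,    1,    1),
    (7, 'poiu',  NULL, NULL, 0);
```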

View called revisions_max:

SELECT 
    responses.original_id AS original_id,
    MAX(responses.revision_number) AS revision
FROM
    responses
WHERE
    original_id IS NOT NULL
GROUP BY responses.original_id

My Current Query:

SELECT
    responses.*
FROM
    responses
WHERE
    id NOT IN (
        SELECT
            original_id
        FROM
            revisions_max
    )
AND
    is_revision = 0

UNION

SELECT
    responses.*
FROM
    responses
INNER JOIN revisions_max ON revisions_max.original_id = responses.original_id
    AND revisions_max.revision = responses.revision_number

This query works, but it takes 0.06 seconds to run on a table of only 2,000 rows. The table will quickly grow to tens or hundreds of thousands of rows. The second SELECT (the one below the UNION) is what takes most of the time.

What can I do to improve this query's performance?

Answer 1

The approach I would take with any other DBMS is to use NOT EXISTS:

SELECT  r1.*
FROM    responses AS r1
WHERE   NOT EXISTS
        (   SELECT  1
            FROM    responses AS r2
            WHERE   r2.original_id = COALESCE(r1.original_id, r1.id)
            AND     r2.revision_number > COALESCE(r1.revision_number, 0)
        );

This removes any row for which a higher revision number exists for the same id (or original_id, if it is populated). However, in MySQL, LEFT JOIN/IS NULL will perform better than NOT EXISTS ¹. As such I would rewrite the above as:

SELECT  r1.*
FROM    responses AS r1
        LEFT JOIN responses AS r2
            ON r2.original_id = COALESCE(r1.original_id, r1.id)
            AND r2.revision_number > COALESCE(r1.revision_number, 0)
WHERE   r2.id IS NULL;

Example on DBFiddle
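
To see what the join condition is doing, it can help to turn the anti-join into a plain inner join and list each superseded row next to the newer revision that knocks it out. This is only an illustrative sketch; the aliases superseded_id and superseded_by are made up for the example.

```sql
-- Same join condition as above, but as an INNER JOIN, so every match
-- (a row plus a newer revision of the same original) is listed.
SELECT  r1.id AS superseded_id,
        r2.id AS superseded_by
FROM    responses AS r1
        INNER JOIN responses AS r2
            ON r2.original_id = COALESCE(r1.original_id, r1.id)
            AND r2.revision_number > COALESCE(r1.revision_number, 0)
ORDER BY r1.id, r2.id;

-- With the sample data from the question this lists
-- (1, 4), (1, 5), (2, 6) and (4, 5).
```

Rows 1, 2 and 4 appear on the left-hand side, so they are the ones the LEFT JOIN/IS NULL version drops, leaving 3, 5, 6 and 7 as in the desired output.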

I realise that you have said that you don't want to use LEFT JOIN and check for nulls, but I don't see that there is a better solution.

¹ At least this was the case historically; I don't actively use MySQL, so I don't keep up to date with developments in the optimiser.
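
Since both forms probe r2 by original_id and then compare revision_number, a composite index on those two columns is a plausible thing to try as the table grows. The index name below is hypothetical, and whether the optimiser actually uses it depends on the MySQL version and the data, so it is worth checking the plan with EXPLAIN rather than assuming.

```sql
-- Hypothetical index name; the column order follows the join condition
-- (equality on original_id first, then the range on revision_number).
CREATE INDEX idx_responses_original_revision
    ON responses (original_id, revision_number);

-- Inspect the plan to see whether r2 is reached through the new index.
EXPLAIN
SELECT  r1.*
FROM    responses AS r1
        LEFT JOIN responses AS r2
            ON r2.original_id = COALESCE(r1.original_id, r1.id)
            AND r2.revision_number > COALESCE(r1.revision_number, 0)
WHERE   r2.id IS NULL;
```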

Answer 2

How about using COALESCE()?

SELECT COALESCE(y.id, x.id)                           AS id,
       COALESCE(y.value, x.value)                     AS value,
       COALESCE(y.original_id, x.original_id)         AS original_id,
       COALESCE(y.revision_number, x.revision_number) AS revision_number
FROM   responses x
       LEFT JOIN (SELECT r1.*
                  FROM   responses r1
                         INNER JOIN (SELECT responses.original_id          AS original_id,
                                            Max(responses.revision_number) AS revision
                                     FROM   responses
                                     WHERE  original_id IS NOT NULL
                                     GROUP  BY responses.original_id) rev
                                 ON r1.original_id = rev.original_id
                                    AND r1.revision_number = rev.revision) y
              ON x.id = y.original_id
WHERE  y.id IS NOT NULL
        OR x.original_id IS NULL;
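
If the revisions_max view from the question already exists, the innermost derived table could presumably be swapped for it. The sketch below assumes the view is defined exactly as shown in the question, so its columns are original_id and revision.

```sql
-- Same query shape as above, with the inner GROUP BY replaced by the
-- revisions_max view (assumed to expose original_id and revision).
SELECT COALESCE(y.id, x.id)                           AS id,
       COALESCE(y.value, x.value)                     AS value,
       COALESCE(y.original_id, x.original_id)         AS original_id,
       COALESCE(y.revision_number, x.revision_number) AS revision_number
FROM   responses AS x
       LEFT JOIN (SELECT r1.*
                  FROM   responses AS r1
                         INNER JOIN revisions_max AS rev
                                 ON r1.original_id = rev.original_id
                                    AND r1.revision_number = rev.revision) AS y
              ON x.id = y.original_id
WHERE  y.id IS NOT NULL          -- the original has a latest revision: show it
        OR x.original_id IS NULL; -- or the row is an original with no revision
```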

solution1
1 2017-08-01 16:19:24

solution2
1 ACCEPTED 2017-08-01 16:29:03
