I'm trying to select all rows in this table, with the constraint that revised rows are selected instead of the original ones. So, if a row has a revision, that revision is selected instead of the row; if there are multiple revisions, the one with the highest revision number is preferred.
I think an example table, output, and query will explain this better:
Table:
+----+-------+-------------+-----------------+-------------+
| id | value | original_id | revision_number | is_revision |
+----+-------+-------------+-----------------+-------------+
| 1  | abcd  | null        | null            | 0           |
| 2  | zxcv  | null        | null            | 0           |
| 3  | qwert | null        | null            | 0           |
| 4  | abd   | 1           | 1               | 1           |
| 5  | abcde | 1           | 2               | 1           |
| 6  | zxcvb | 2           | 1               | 1           |
| 7  | poiu  | null        | null            | 0           |
+----+-------+-------------+-----------------+-------------+
Desired Output:
+----+-------+-------------+-----------------+
| id | value | original_id | revision_number |
+----+-------+-------------+-----------------+
| 3  | qwert | null        | null            |
| 5  | abcde | 1           | 2               |
| 6  | zxcvb | 2           | 1               |
| 7  | poiu  | null        | null            |
+----+-------+-------------+-----------------+
View called revisions_max:
SELECT
    responses.original_id AS original_id,
    MAX(responses.revision_number) AS revision
FROM
    responses
WHERE
    original_id IS NOT NULL
GROUP BY responses.original_id
My Current Query:
SELECT
    responses.*
FROM
    responses
WHERE
    id NOT IN (
        SELECT
            original_id
        FROM
            revisions_max
    )
    AND
    is_revision = 0
UNION
SELECT
    responses.*
FROM
    responses
    INNER JOIN revisions_max ON revisions_max.original_id = responses.original_id
        AND revisions_max.revision = responses.revision_number
This query works, but it takes 0.06 seconds to run on a table of only 2000 rows. The table will quickly grow to tens or hundreds of thousands of rows. The query under the UNION is what takes most of the time.
What can I do to improve this query's performance?
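For anyone who wants to reproduce the setup without a MySQL server, here is a minimal sketch that loads the sample rows into an in-memory SQLite database through Python's built-in sqlite3 module and runs the view plus the UNION query. SQLite is an assumption made purely for convenience: it confirms which rows come back, but says nothing about MySQL timings. The helper names load_fixture and run_union are hypothetical, and the join uses the view's `revision` alias.

```python
import sqlite3

# Sample rows copied from the table above:
# (id, value, original_id, revision_number, is_revision)
ROWS = [
    (1, "abcd", None, None, 0),
    (2, "zxcv", None, None, 0),
    (3, "qwert", None, None, 0),
    (4, "abd", 1, 1, 1),
    (5, "abcde", 1, 2, 1),
    (6, "zxcvb", 2, 1, 1),
    (7, "poiu", None, None, 0),
]

VIEW_SQL = """
CREATE VIEW revisions_max AS
SELECT original_id, MAX(revision_number) AS revision
FROM responses
WHERE original_id IS NOT NULL
GROUP BY original_id
"""

UNION_SQL = """
SELECT responses.*
FROM responses
WHERE id NOT IN (SELECT original_id FROM revisions_max)
  AND is_revision = 0
UNION
SELECT responses.*
FROM responses
INNER JOIN revisions_max
        ON revisions_max.original_id = responses.original_id
       AND revisions_max.revision = responses.revision_number
ORDER BY 1
"""


def load_fixture():
    """Create the responses table and the revisions_max view in memory."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE responses ("
        " id INTEGER PRIMARY KEY, value TEXT, original_id INTEGER,"
        " revision_number INTEGER, is_revision INTEGER)"
    )
    conn.executemany("INSERT INTO responses VALUES (?, ?, ?, ?, ?)", ROWS)
    conn.execute(VIEW_SQL)
    return conn


def run_union(conn):
    """Return the rows selected by the UNION query, ordered by id."""
    return conn.execute(UNION_SQL).fetchall()


if __name__ == "__main__":
    for row in run_union(load_fixture()):
        print(row)
```

On the sample data this returns the rows with ids 3, 5, 6 and 7, matching the desired output.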
The approach I would take with any other DBMS is to use NOT EXISTS:
SELECT r1.*
FROM responses AS r1
WHERE NOT EXISTS
    ( SELECT 1
      FROM responses AS r2
      WHERE r2.original_id = COALESCE(r1.original_id, r1.id)
      AND r2.revision_number > COALESCE(r1.revision_number, 0)
    );
This removes any row for which a higher revision number exists for the same id (or original_id if it is populated). However, in MySQL, LEFT JOIN/IS NULL will perform better than NOT EXISTS [1]. As such, I would rewrite the above as:
SELECT r1.*
FROM responses AS r1
    LEFT JOIN responses AS r2
        ON r2.original_id = COALESCE(r1.original_id, r1.id)
        AND r2.revision_number > COALESCE(r1.revision_number, 0)
WHERE r2.id IS NULL;
I realise that you have said that you don't want to use LEFT JOIN and check for nulls, but I don't see that there is a better solution.
[1] At least this was the case historically; I don't actively use MySQL, so I don't keep up to date with developments in the optimiser.
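To check that the two forms select the same rows, here is a minimal sketch that runs both against the question's sample data in an in-memory SQLite database via Python's sqlite3 module. This is an assumption for convenience only: SQLite's optimiser is not MySQL's, so it verifies the results and not the relative speed. The composite index on (original_id, revision_number) is also my assumption rather than something from the question; it is the usual candidate for this join predicate, and the index name is hypothetical.

```python
import sqlite3

# Sample rows from the question:
# (id, value, original_id, revision_number, is_revision)
ROWS = [
    (1, "abcd", None, None, 0),
    (2, "zxcv", None, None, 0),
    (3, "qwert", None, None, 0),
    (4, "abd", 1, 1, 1),
    (5, "abcde", 1, 2, 1),
    (6, "zxcvb", 2, 1, 1),
    (7, "poiu", None, None, 0),
]

NOT_EXISTS_SQL = """
SELECT r1.id
FROM responses AS r1
WHERE NOT EXISTS (
    SELECT 1
    FROM responses AS r2
    WHERE r2.original_id = COALESCE(r1.original_id, r1.id)
      AND r2.revision_number > COALESCE(r1.revision_number, 0)
)
ORDER BY r1.id
"""

LEFT_JOIN_SQL = """
SELECT r1.id
FROM responses AS r1
LEFT JOIN responses AS r2
       ON r2.original_id = COALESCE(r1.original_id, r1.id)
      AND r2.revision_number > COALESCE(r1.revision_number, 0)
WHERE r2.id IS NULL
ORDER BY r1.id
"""


def build_db():
    """Create and populate the responses table in memory."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE responses ("
        " id INTEGER PRIMARY KEY, value TEXT, original_id INTEGER,"
        " revision_number INTEGER, is_revision INTEGER)"
    )
    conn.executemany("INSERT INTO responses VALUES (?, ?, ?, ?, ?)", ROWS)
    # Assumed index (hypothetical name) covering both join conditions on r2.
    conn.execute(
        "CREATE INDEX idx_responses_orig_rev"
        " ON responses (original_id, revision_number)"
    )
    return conn


def selected_ids(conn, sql):
    """Return the ids selected by the given query."""
    return [row[0] for row in conn.execute(sql)]


if __name__ == "__main__":
    conn = build_db()
    print(selected_ids(conn, NOT_EXISTS_SQL))  # [3, 5, 6, 7]
    print(selected_ids(conn, LEFT_JOIN_SQL))   # [3, 5, 6, 7]
```

Both queries keep only the rows that have no higher revision, so they agree on ids 3, 5, 6 and 7. Whether the index helps on your MySQL version is worth confirming with EXPLAIN on the real table.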
How about using COALESCE()?
SELECT COALESCE(y.id, x.id) AS id,
       COALESCE(y.value, x.value) AS value,
       COALESCE(y.original_id, x.original_id) AS original_id,
       COALESCE(y.revision_number, x.revision_number) AS revision_number
FROM responses x
LEFT JOIN (SELECT r1.*
           FROM responses r1
           INNER JOIN (SELECT responses.original_id AS original_id,
                              MAX(responses.revision_number) AS revision
                       FROM responses
                       WHERE original_id IS NOT NULL
                       GROUP BY responses.original_id) rev
                   ON r1.original_id = rev.original_id
                  AND r1.revision_number = rev.revision) y
       ON x.id = y.original_id
WHERE y.id IS NOT NULL
   OR x.original_id IS NULL;
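As a quick check that this returns the desired output, here is a minimal sketch that runs the query on the question's sample rows in an in-memory SQLite database through Python's sqlite3 module. SQLite stands in for MySQL as an assumption, so this confirms the result set only, not the timing. The helper name latest_rows is hypothetical, and an ORDER BY is added just to make the output deterministic.

```python
import sqlite3

# Sample rows from the question:
# (id, value, original_id, revision_number, is_revision)
ROWS = [
    (1, "abcd", None, None, 0),
    (2, "zxcv", None, None, 0),
    (3, "qwert", None, None, 0),
    (4, "abd", 1, 1, 1),
    (5, "abcde", 1, 2, 1),
    (6, "zxcvb", 2, 1, 1),
    (7, "poiu", None, None, 0),
]

COALESCE_SQL = """
SELECT COALESCE(y.id, x.id) AS id,
       COALESCE(y.value, x.value) AS value,
       COALESCE(y.original_id, x.original_id) AS original_id,
       COALESCE(y.revision_number, x.revision_number) AS revision_number
FROM responses x
LEFT JOIN (SELECT r1.*
           FROM responses r1
           INNER JOIN (SELECT original_id,
                              MAX(revision_number) AS revision
                       FROM responses
                       WHERE original_id IS NOT NULL
                       GROUP BY original_id) rev
                   ON r1.original_id = rev.original_id
                  AND r1.revision_number = rev.revision) y
       ON x.id = y.original_id
WHERE y.id IS NOT NULL
   OR x.original_id IS NULL
ORDER BY 1
"""


def latest_rows():
    """Load the sample data and return the rows the query selects."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE responses ("
        " id INTEGER PRIMARY KEY, value TEXT, original_id INTEGER,"
        " revision_number INTEGER, is_revision INTEGER)"
    )
    conn.executemany("INSERT INTO responses VALUES (?, ?, ?, ?, ?)", ROWS)
    return conn.execute(COALESCE_SQL).fetchall()


if __name__ == "__main__":
    for row in latest_rows():
        print(row)
```

Each original row is replaced by its highest revision when one exists, and rows that are themselves revisions are filtered out by the WHERE clause, so the result is the four rows with ids 3, 5, 6 and 7.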