Optimizing MySQL select distinct order by limit safely

Question

I have a problematic query that I know how to make faster, but the faster form is technically invalid SQL, and there is no guarantee that it will keep working correctly in the future.

The original, slow query looks like this:

SELECT sql_no_cache DISTINCT r.field_1 value
FROM table_middle m
JOIN table_right r on r.id = m.id
WHERE ((r.field_1) IS NOT NULL) 
AND (m.kind IN ('partial')) 
ORDER BY r.field_1 
LIMIT 26

This takes about 37 seconds. Explain output:

+----+-------------+-------+--------+-----------------------+---------------+---------+---------+-----------------------------------------------------------+
| id | select_type | table | type   | possible_keys         | key           | key_len | rows    | Extra                                                     |
+----+-------------+-------+--------+-----------------------+---------------+---------+---------+-----------------------------------------------------------+
|  1 | SIMPLE      | r     | range  | PRIMARY,index_field_1 | index_field_1 | 9       | 1544595 | Using where; Using index; Using temporary; Using filesort |
|  1 | SIMPLE      | m     | eq_ref | PRIMARY,index_kind    | PRIMARY       | 4       |       1 | Using where; Distinct                                     |
+----+-------------+-------+--------+-----------------------+---------------+---------+---------+-----------------------------------------------------------+

The faster version looks like this; the ORDER BY clause is pushed down into a subquery, and the outer query applies DISTINCT and LIMIT to the subquery's result:

SELECT sql_no_cache DISTINCT value 
FROM (
  SELECT r.field_1 value
  FROM table_middle m
  JOIN table_right r ON r.id = m.id
  WHERE ((r.field_1) IS NOT NULL) 
  AND (m.kind IN ('partial')) 
  ORDER BY r.field_1 
) t
LIMIT 26

This takes about 2.7 seconds. Explain output:

+----+-------------+------------+--------+-----------------------+------------+---------+---------+-----------------------------------------------------------+
| id | select_type | table      | type   | possible_keys         | key        | key_len | rows    | Extra                                                     |
+----+-------------+------------+--------+-----------------------+------------+---------+---------+-----------------------------------------------------------+
|  1 | PRIMARY     | <derived2> | ALL    | NULL                  | NULL       | NULL    | 1346348 | Using temporary                                           |
|  2 | DERIVED     | m          | ref    | PRIMARY,index_kind    | index_kind | 99      | 1539558 | Using where; Using index; Using temporary; Using filesort |
|  2 | DERIVED     | r          | eq_ref | PRIMARY,index_field_1 | PRIMARY    | 4       |       1 | Using where                                               |
+----+-------------+------------+--------+-----------------------+------------+---------+---------+-----------------------------------------------------------+

There are three million rows in table_right and table_middle, and all mentioned columns are individually indexed. The query should be read as having an arbitrary where clause - it's dynamically generated. The query can't be rewritten in any way that prevents the where clause being easily replaced, and similarly the indexes can't be changed - MySQL doesn't support enough indexes for the number of potential filter field combinations.

Has anyone seen this problem before - specifically, select / distinct / order by / limit performing very poorly - and is there another way to write the same query with good performance that doesn't rely on unspecified implementation behaviour?

(AFAIK MariaDB, for example, ignores order by in a subquery because it should not logically affect the set-theoretic semantics of the query.)

For the more incredulous

Here's how you can create a version of the database for yourself. This Ruby script should output a SQL script that you can run with the mysql command-line client:

#!/usr/bin/env ruby
# Prints a SQL script that builds two 3-million-row test tables.
puts "create database testy;"
puts "use testy;"
# field_2 is bigint because x*1000 exceeds the signed INT range for large x.
puts "create table table_right(id int(11) not null primary key, field_0 int(11), field_1 int(11), field_2 bigint);"
puts "create table table_middle(id int(11) not null primary key, field_0 int(11), field_1 int(11), field_2 bigint);"
# One transaction around all inserts, so the load is committed once.
puts "begin;"
3_000_000.times do |x|
  puts "insert into table_middle values (#{x},#{x*10},#{x*100},#{x*1000});"
  puts "insert into table_right values (#{x},#{x*10},#{x*100},#{x*1000});"
end
puts "commit;"

Indexes aren't important for reproducing the effect. The script above is untested; it's an approximation of a pry session I had when reproducing the problem manually.

Replace the m.kind in ('partial') condition with m.field_1 > 0 or something similar that's trivially true. Observe the large difference in performance between the two techniques, and how the sorting semantics are preserved (tested using MySQL 5.5). The unreliability of those semantics is, of course, precisely the reason I'm asking the question.

Answer 1

Please provide SHOW CREATE TABLE. In the absence of that, I will guess that these are missing and may be useful:

m:  (kind, id)
r:  (field_1, id)
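
As a sketch, those two guessed indexes could be added as shown here. The index names are made up for illustration, and whether the indexes help depends on the actual table definitions, which the question doesn't show. If the tables are InnoDB with id as the primary key, the existing single-column indexes already carry id implicitly, so the gain may be small.

```sql
-- Hypothetical index names; the column lists are the ones guessed above.
ALTER TABLE table_middle ADD INDEX idx_kind_id (kind, id);
ALTER TABLE table_right  ADD INDEX idx_field_1_id (field_1, id);
```

Re-running EXPLAIN on the original query afterwards would show whether the optimizer picks them up.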

You can turn off MariaDB's ignoring of the subquery's ORDER BY.
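
Independently of that setting, the ordering can be kept well defined by moving ORDER BY to the outer query, so that nothing depends on how the derived table happens to be sorted. This is only a sketch using the question's table and column names; it has not been timed, so it may or may not keep the speed of the subquery version.

```sql
-- The inner query only filters and joins; the dynamically generated
-- WHERE clause can still be swapped in here unchanged.
SELECT DISTINCT value
FROM (
  SELECT r.field_1 AS value
  FROM table_middle m
  JOIN table_right r ON r.id = m.id
  WHERE r.field_1 IS NOT NULL
    AND m.kind IN ('partial')
) t
-- Ordering is applied to the final result, so it is guaranteed.
ORDER BY value
LIMIT 26;
```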
