How can I speed up this query that joins a table on itself?

Question

We have a 'users' table that holds information about our users. One of the fields in this table is called 'query'. I am trying to SELECT the user IDs of all users that have the same query. So my output should look like this:

user1_id    user2_id    common_query
   43          2            "foo"
   117         433          "bar"
   1           119          "baz"
   1           52           "qux"

Unfortunately, I can't get this query to finish in under an hour (the users table is pretty big). This is my current query:

SELECT u1.id,
       u2.id,
       u1.query
FROM users u1
INNER JOIN users u2
        ON u1.query = u2.query
       AND u1.id <> u2.id

My explain:

+----+-------------+-------+-------+----------------------+----------------------+---------+---------------------------------+----------+--------------------------+
| id | select_type | table | type  | possible_keys        | key                  | key_len | ref                             | rows     | Extra                    |
+----+-------------+-------+-------+----------------------+----------------------+---------+---------------------------------+----------+--------------------------+
|  1 | SIMPLE      | u1    | index | index_users_on_query | index_users_on_query | 768     | NULL                            | 10905267 | Using index              |
|  1 | SIMPLE      | u2    | ref   | index_users_on_query | index_users_on_query | 768     | u1.query                        |       11 | Using where; Using index |
+----+-------------+-------+-------+----------------------+----------------------+---------+---------------------------------+----------+--------------------------+

As you can see from the explain, the users table is indexed on query and the index appears to be used in my SELECT. I'm wondering why the 'rows' column for table u2 has a value of 11, and not 1. Is there anything I can do to speed this query up? Is my '<>' comparison within the join bad practice? Also, the id field is the primary key.

Answer 1

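The main driver of the query is the equality on the query field, provided it is indexed. The <> on id is probably not very selective; the explain shows u2 being read with type 'ref', i.e. looked up through the index by the equality on query, with the id comparison applied afterwards as a filter ('Using where').

The following applies only if 'query' is not indexed.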

If id is the primary key you could just do this:

CREATE INDEX index_1  ON users (query);

Such an index covers the query: assuming InnoDB, a secondary index also stores the primary key, so both query and id can be read from the index and the join can be answered without touching the table rows.
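
Before creating anything, it may be worth confirming which indexes already exist and whether the join is served from the index alone. A minimal sketch, assuming MySQL and the table and column names from the question:

```sql
-- List the indexes currently defined on the table.
SHOW INDEX FROM users;

-- Re-check the plan; 'Using index' in the Extra column for both
-- u1 and u2 means the join is answered from the index alone.
EXPLAIN
SELECT u1.id, u2.id, u1.query
FROM users u1
INNER JOIN users u2
        ON u1.query = u2.query
       AND u1.id <> u2.id;
```

In the question's explain both rows already show 'Using index', which suggests the existing index_users_on_query is already covering.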

Answer 2

My biggest concern is the key_len, which indicates that MySQL must compare up to 768 bytes in order to look up each index entry.

For this query, a hash index on query could be much faster (as it would involve substantially shorter comparisons, at the cost of calculating hashes and being unable to sort records using that index). Whether you get one depends on the storage engine: MEMORY and NDB support hash indexes, while InnoDB and MyISAM accept USING HASH but build a BTREE index instead.

ALTER TABLE users ADD INDEX (query) USING HASH

You might also consider making this a composite on (query, id) so that MySQL need not scan into the record itself to test the <> criterion.
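
A sketch of the composite index, and of one way to get hash-like short comparisons on an engine without hash indexes. The index names, the query_crc column, and the use of CRC32 are assumptions for illustration and not part of the original schema; the checksum column is a substitute technique, not a native hash index:

```sql
-- Composite index: query and id are both stored in the index.
ALTER TABLE users ADD INDEX index_users_on_query_and_id (query, id);

-- Hypothetical pseudo-hash column: a 4-byte checksum of query.
ALTER TABLE users ADD COLUMN query_crc INT UNSIGNED;
UPDATE users SET query_crc = CRC32(query);
ALTER TABLE users ADD INDEX index_users_on_query_crc (query_crc);

-- Join on the short checksum first, then compare the full string
-- to discard rows whose checksums collide.
SELECT u1.id, u2.id, u1.query
FROM users u1
INNER JOIN users u2
        ON u1.query_crc = u2.query_crc
       AND u1.query = u2.query
       AND u1.id <> u2.id;
```

The checksum column has to be kept in sync on writes (for example with a trigger), and the final comparison on query needs the row itself, so whether this is a net win depends on the data and should be measured.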

Answer 3

How many queries do you have? You could add a table UsersInQueries:

id   queryId   userId
0      5         453   
1      23        732 
2      15        761

Then select from this table and group by queryId.
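
A possible shape for this, assuming a separate lookup table for the distinct query strings. The queries table, the index names, and all column types are assumptions; only UsersInQueries and its three columns come from the answer:

```sql
-- Hypothetical lookup table: one row per distinct query string.
-- Adjust the length and character set to match users.query.
CREATE TABLE queries (
    id    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    query VARCHAR(255) NOT NULL,
    UNIQUE KEY index_queries_on_query (query)
);

-- Link table from the answer: which user has which query.
CREATE TABLE UsersInQueries (
    id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    queryId INT UNSIGNED NOT NULL,
    userId  INT UNSIGNED NOT NULL,
    KEY index_users_in_queries_on_query_id (queryId, userId)
);

-- Queries shared by more than one user, with the number of users.
SELECT queryId, COUNT(*) AS user_count
FROM UsersInQueries
GROUP BY queryId
HAVING COUNT(*) > 1;
```

The grouping then compares 4-byte integers rather than the long query strings.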

Answer 4

If you only have up to two users per query, you could do this instead:

select query, min(id) as FirstID, max(id) as SecondId
from users
group by query
having count(*) > 1

If you have more than two users with the same query, can you explain why you would want all pairs of such users?
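
If more than two users can share a query, two variations may help; both are sketches against the question's schema. The first lists each group once instead of every pair, the second keeps pairs but emits each one a single time by replacing <> with <:

```sql
-- One row per shared query, with all of its user ids.
-- GROUP_CONCAT output is truncated at group_concat_max_len
-- (1024 bytes by default), so very large groups may be cut off.
SELECT query,
       GROUP_CONCAT(id ORDER BY id) AS user_ids,
       COUNT(*) AS user_count
FROM users
GROUP BY query
HAVING COUNT(*) > 1;

-- Each pair once: (a, b) is returned but (b, a) is not.
SELECT u1.id    AS user1_id,
       u2.id    AS user2_id,
       u1.query AS common_query
FROM users u1
INNER JOIN users u2
        ON u1.query = u2.query
       AND u1.id < u2.id;
```

With <>, every pair appears twice, so the second form returns half as many rows.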

4 answers

solution1
1 2012-11-19 19:10:50

solution2
1 ACCEPTED 2012-11-19 19:22:05

solution3
0 2012-11-19 19:10:50

solution4
0 2012-11-19 19:21:20
