Help me turn a SUBQUERY into a JOIN

Two tables.

emails: id (int10) | ownership (int10)

messages: emailid (int10) indexed | message (mediumtext)

Subquery (which is terrible in MySQL):

SELECT COUNT(*) FROM messages WHERE message LIKE '%word%' AND emailid IN (SELECT id FROM emails WHERE ownership = 32)


The usage here is that I run a search on emails (obviously simplified in the sample above) that generates a list of, say, 3,000 email IDs. I then want to search messages, because I need to do a text match, but only against the messages belonging to those 3,000 emails.

The query against messages is expensive (message is not indexed), but this is fine because it would only ever be checking against a few rows.

Ideas:

i) A join. My attempts at this so far have not worked and have resulted in full table scans of the messages table (i.e. the emailid index is not used).
ii) A temporary table. This could work, I think.
iii) Cache the IDs in the client and run 2 queries. This does work. Not elegant.
iv) A subquery. MySQL subqueries run the 2nd query each time, so this does not work. Maybe fixed in MySQL 6.
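For reference, idea iii) might look roughly like the sketch below. It is only an illustration: Python's built-in sqlite3 module stands in for MySQL so that it runs anywhere, the sample rows are made up, and count_matches is a hypothetical helper name, not something from the question.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; table and column names follow the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emails (id INTEGER PRIMARY KEY, ownership INTEGER);
    CREATE TABLE messages (emailid INTEGER, message TEXT);
    CREATE INDEX messages_emailid ON messages (emailid);
    INSERT INTO emails VALUES (1, 32), (2, 32), (3, 7);
    INSERT INTO messages VALUES
        (1, 'a word here'), (2, 'nothing'), (3, 'word again');
""")

def count_matches(conn, ownership, needle):
    # Query 1: the cheap lookup that produces the id list.
    ids = [row[0] for row in conn.execute(
        "SELECT id FROM emails WHERE ownership = ?", (ownership,))]
    if not ids:
        return 0  # MySQL rejects an empty IN () list, so stop early
    # Query 2: text match restricted to those ids, one placeholder per id.
    placeholders = ",".join("?" * len(ids))
    sql = ("SELECT COUNT(*) FROM messages "
           f"WHERE emailid IN ({placeholders}) AND message LIKE ?")
    return conn.execute(sql, (*ids, f"%{needle}%")).fetchone()[0]

print(count_matches(conn, 32, "word"))
```

With several thousand ids the IN list gets long, so in practice the second query may need to be sent in chunks, depending on the driver's and server's limits.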

OK, here is what I have so far. These are the actual field names (I had simplified a bit in the question).

The query:

SELECT COUNT(*) FROM ticket LEFT JOIN ticket_subject 
ON (ticket_subject.ticketid = ticket.id) 
WHERE category IN (1) 
AND ticket_subject.subject LIKE "%about%"

The results:

1   SIMPLE  ticket  ref     PRIMARY,category    category    4   const   28874    
1   SIMPLE  ticket_subject  eq_ref  PRIMARY     PRIMARY     4   deskpro.ticket.id   1   Using where

It takes 0.41 seconds and returns a COUNT(*) of 113.
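A side note on the query above: because the WHERE clause filters on ticket_subject.subject, tickets without a subject row produce a NULL subject, fail the LIKE test, and are dropped, so the LEFT JOIN returns the same count as a plain JOIN. The sketch below checks this on made-up rows, using Python's sqlite3 module as a stand-in for MySQL (an assumption made only so the example is self-contained).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ticket (id INTEGER PRIMARY KEY, category INTEGER);
    CREATE TABLE ticket_subject (ticketid INTEGER PRIMARY KEY, subject TEXT);
    INSERT INTO ticket VALUES (1, 1), (2, 1), (3, 1), (4, 2);
    -- ticket 3 deliberately has no subject row
    INSERT INTO ticket_subject VALUES
        (1, 'all about joins'), (2, 'other'), (4, 'about time');
""")

# Same query as in the question, with the join type left as a parameter.
QUERY = """
    SELECT COUNT(*) FROM ticket {join} ticket_subject
    ON (ticket_subject.ticketid = ticket.id)
    WHERE category IN (1)
    AND ticket_subject.subject LIKE '%about%'
"""

left_count = conn.execute(QUERY.format(join="LEFT JOIN")).fetchone()[0]
inner_count = conn.execute(QUERY.format(join="JOIN")).fetchone()[0]
print(left_count, inner_count)
```

Both counts come out the same here, which is why writing JOIN instead of LEFT JOIN does not change the result for this query.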

Running:

SELECT COUNT(*) FROM ticket WHERE category IN (1)

Takes 0.01 seconds and finds 33,000 results.

Running:

SELECT COUNT(*) FROM ticket_subject WHERE subject LIKE "%about%"

Takes 0.14 seconds and finds 1,300 results.

Both the ticket table and ticket_subject table have 300,000 rows.

There is an index on ticket_subject.ticketid and ticket.category.

I realise now that using the LIKE syntax was a mistake, as it has been a bit of a red herring about FULLTEXT. This is not the issue. The issue is:

1) Table A - very fast query, run on index. 0.001 seconds.
2) Table B - moderate to slow query, no index - does full table scan. 0.1 seconds.

Both of these results are fine. The problem is I have to JOIN them and the search takes 0.3 seconds, which to me makes no sense: the slow part of the combined query on Table B should be quicker, because we are now only searching over a fraction of that table. That is, it should not be doing a full table scan, because the field being JOINed on is indexed.

Remember to take advantage of Boolean short-circuit evaluation:

SELECT COUNT(*) 
FROM messages 
join emails ON emails.id = messages.emailid
WHERE ownership = 32 AND message LIKE '%word%'

This filters by ownership before it evaluates the LIKE predicate. Always put your cheaper expressions on the left.

Also, I agree with @Martin Smith and @MJB that you should consider using MySQL's FULLTEXT indexing to make this faster.


Re your comment and additional information, here's some analysis:

explain SELECT COUNT(*) FROM ticket WHERE category IN (1)\G

           id: 1
  select_type: SIMPLE
        table: ticket
         type: ref
possible_keys: category
          key: category
      key_len: 4
          ref: const
         rows: 1
        Extra: Using index

The note "Using index" is a good thing to see, because it means MySQL can satisfy the query just by reading the index data structure, without even touching the table's data. This is certain to run very fast.

explain SELECT COUNT(*) FROM ticket_subject WHERE subject LIKE '%about%'\G

           id: 1
  select_type: SIMPLE
        table: ticket_subject
         type: ALL
possible_keys: NULL        <---- no possible keys
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 1
        Extra: Using where

This shows that there are no possible keys that can benefit the wildcard LIKE predicate. It uses the condition in the WHERE clause, but it has to evaluate it by running a table scan.
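One way to picture why a leading wildcard defeats an ordinary index: the index keeps its keys in sorted order, which helps when you know how a value starts and not at all when the word can sit anywhere. The sketch below uses a sorted Python list as a rough stand-in for a B-tree index. It is an analogy only, not how MySQL is implemented, and the sample strings and function names are made up.

```python
from bisect import bisect_left

# A sorted list is a rough stand-in for the ordered keys of a B-tree index.
index = sorted(["about us", "about you", "all about joins",
                "billing", "talk about it", "zebra"])

def prefix_lookup(prefix):
    """LIKE 'prefix%': binary-search to the first candidate, then read
    neighbours until the prefix stops matching."""
    hits, probes = [], 0
    pos = bisect_left(index, prefix)
    while pos < len(index):
        probes += 1  # entries examined after the binary search
        if not index[pos].startswith(prefix):
            break
        hits.append(index[pos])
        pos += 1
    return hits, probes

def substring_lookup(word):
    """LIKE '%word%': sort order says nothing about where matches are,
    so every entry has to be examined."""
    hits = [key for key in index if word in key]
    return hits, len(index)

print(prefix_lookup("about"))
print(substring_lookup("about"))
```

The prefix search stops as soon as the sorted keys move past the prefix, while the substring search has to look at every entry, which is the table scan that EXPLAIN reports as type: ALL.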

explain SELECT COUNT(*) FROM ticket LEFT JOIN ticket_subject 
ON (ticket_subject.ticketid = ticket.id) 
WHERE category IN (1) 
AND ticket_subject.subject LIKE '%about%'\G

           id: 1
  select_type: SIMPLE
        table: ticket
         type: ref
possible_keys: PRIMARY,category
          key: category
      key_len: 4
          ref: const
         rows: 1
        Extra: Using index

           id: 1
  select_type: SIMPLE
        table: ticket_subject
         type: ref
possible_keys: ticketid
          key: ticketid
      key_len: 4
          ref: test.ticket.id
         rows: 1
        Extra: Using where

Likewise, accessing the ticket table is quick, but that's spoiled by the table-scan incurred by the LIKE condition.

ALTER TABLE ticket_subject ENGINE=MyISAM;

CREATE FULLTEXT INDEX ticket_subject_fulltext ON ticket_subject(subject);

explain SELECT COUNT(*) FROM ticket JOIN ticket_subject  
ON (ticket_subject.ticketid = ticket.id)  
WHERE category IN (1)  AND MATCH(ticket_subject.subject) AGAINST('about')

           id: 1
  select_type: SIMPLE
        table: ticket
         type: ref
possible_keys: PRIMARY,category
          key: category
      key_len: 4
          ref: const
         rows: 1
        Extra: Using index

           id: 1
  select_type: SIMPLE
        table: ticket_subject
         type: fulltext
possible_keys: ticketid,ticket_subject_fulltext
          key: ticket_subject_fulltext          <---- now it uses an index
      key_len: 0
          ref: 
         rows: 1
        Extra: Using where

You're never going to make LIKE perform well. See my presentation Practical Full-Text Search in MySQL.


Re your comment: Okay, I've done some experiments on a dataset of similar size (the Users and Badges tables in the Stack Overflow data dump :-). Here's what I found:

select count(*) from users
where reputation > 50000

+----------+
| count(*) |
+----------+
|       37 |
+----------+
1 row in set (0.00 sec)

That's really fast, because I have an index on the reputation column.

           id: 1
  select_type: SIMPLE
        table: users
         type: range
possible_keys: users_reputation_userid_displayname
          key: users_reputation_userid_displayname
      key_len: 4
          ref: NULL
         rows: 37
        Extra: Using where; Using index

select count(*) from badges
where badges.creationdate like '%06-24%'

+----------+
| count(*) |
+----------+
|     1319 |
+----------+
1 row in set, 1 warning (0.63 sec)

That's as expected, since the table has 700k rows, and it has to do a table-scan. Now let's do the join:

select count(*) from users join badges using (userid)
where users.reputation > 50000 and badges.creationdate like '%06-24%'

+----------+
| count(*) |
+----------+
|       19 |
+----------+
1 row in set, 1 warning (0.03 sec)

That doesn't seem so bad. Here's the explain report:

           id: 1
  select_type: SIMPLE
        table: users
         type: range
possible_keys: PRIMARY,users_reputation_userid_displayname
          key: users_reputation_userid_displayname
      key_len: 4
          ref: NULL
         rows: 37
        Extra: Using where; Using index

           id: 1
  select_type: SIMPLE
        table: badges
         type: ref
possible_keys: badges_userid
          key: badges_userid
      key_len: 8
          ref: testpattern.users.UserId
         rows: 1
        Extra: Using where

This does seem like it's using indexes intelligently for the join, and it helps that I have a compound index including userid and reputation. Remember that MySQL can generally use only one index per table in a given query, so it's important to define the right compound indexes for the queries you need to run.


Re your comment: OK, I've tried this with reputation > 5000, reputation > 500, and reputation > 50. These should match a much larger set of users.

select count(*) from users join badges using (userid)
where users.reputation > 5000 and badges.creationdate like '%06-24%'

+----------+
| count(*) |
+----------+
|      194 |
+----------+
1 row in set, 1 warning (0.27 sec)

select count(*) from users join badges using (userid)
where users.reputation > 500 and badges.creationdate like '%06-24%'

+----------+
| count(*) |
+----------+
|      624 |
+----------+
1 row in set, 1 warning (0.93 sec)

select count(*) from users join badges using (userid)
where users.reputation > 50 and badges.creationdate like '%06-24%'

+----------+
| count(*) |
+----------+
|     1067 |
+----------+
1 row in set, 1 warning (1.72 sec)

The explain report is the same in all cases, but if the query finds more matching rows in the Users table, then it naturally has to evaluate the LIKE predicate against a lot more matching rows in the Badges table.
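To make that scaling concrete, here is a toy nested-loop join in Python that counts how many times the LIKE-style test runs. Everything in it is hypothetical (1,000 random users with 5 badges each, the join_count helper), and it models only the shape of the work, not MySQL's actual executor or timings.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical data: 1,000 users with a reputation, each owning 5 badges.
users = {uid: random.randint(1, 100_000) for uid in range(1000)}
badges_by_user = {uid: [f"2010-{random.randint(1, 12):02d}-24"] * 5
                  for uid in users}

def join_count(min_reputation, needle):
    matches = like_checks = 0
    # Outer loop: stands in for the index range scan on reputation.
    for uid, reputation in users.items():
        if reputation <= min_reputation:
            continue
        # Inner loop: stands in for the badges lookup by userid.
        for created in badges_by_user[uid]:
            like_checks += 1  # one substring test per joined row
            if needle in created:
                matches += 1
    return matches, like_checks

for threshold in (50_000, 5_000, 500):
    print(threshold, join_count(threshold, "06-24"))
```

Lowering the threshold lets more users through the outer loop, and the number of substring tests grows in step, which matches the way the timings above climb as the reputation filter is loosened.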

It's true that there is some cost to doing a join. It's a little surprising that it's so dramatically expensive. But this can be mitigated if you use indexes.

I know you said you have a query that can't use an index, but perhaps it's time to consider creating a redundant column with some transformed version of the data of your original column, so you can index it. In the example above, I might create a column creationdate_day and populate it from DAYOFYEAR(creationdate).


Here's what I mean:

ALTER TABLE Badges ADD COLUMN creationdate_day SMALLINT;
UPDATE Badges SET creationdate_day = DAYOFYEAR(creationdate);
CREATE INDEX badge_creationdate_day ON Badges(creationdate_day);

select count(*) from users join badges using (userid)
where users.reputation > 50 and badges.creationdate_day = dayofyear('2010-06-24')

+----------+
| count(*) |
+----------+
|     1067 |
+----------+
1 row in set, 1 warning (0.01 sec)  <---- not too shabby!

Here's the explain report:

           id: 1
  select_type: SIMPLE
        table: badges
         type: ref
possible_keys: badges_userid,badge_creationdate_day
          key: badge_creationdate_day    <---- here is our new index
      key_len: 3
          ref: const
         rows: 1318
        Extra: Using where

           id: 1
  select_type: SIMPLE
        table: users
         type: eq_ref
possible_keys: PRIMARY,users_reputation_userid_displayname
          key: PRIMARY
      key_len: 8
          ref: testpattern.badges.UserId
         rows: 1
        Extra: Using where
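One caveat with the day-of-year column: the original LIKE '%06-24%' matched June 24 of any year, but in a leap year June 24 is day 176 rather than 175, so an equality test on DAYOFYEAR can miss those rows. The sketch below compares that column with a hypothetical creationdate_md column holding 'MM-DD' text. It uses Python's sqlite3 module with strftime in place of MySQL's DAYOFYEAR (in MySQL, DATE_FORMAT(creationdate, '%m-%d') would be the likely equivalent), and the four sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE badges (userid INTEGER, creationdate TEXT);
    INSERT INTO badges VALUES
        (1, '2008-06-24'), (2, '2009-06-24'), (3, '2010-06-24'), (4, '2010-06-23');
    -- day-of-year column, as in the answer (strftime('%j') plays DAYOFYEAR)
    ALTER TABLE badges ADD COLUMN creationdate_day INTEGER;
    UPDATE badges SET creationdate_day = CAST(strftime('%j', creationdate) AS INTEGER);
    -- hypothetical alternative: store 'MM-DD' text instead
    ALTER TABLE badges ADD COLUMN creationdate_md TEXT;
    UPDATE badges SET creationdate_md = strftime('%m-%d', creationdate);
    CREATE INDEX badge_creationdate_md ON badges (creationdate_md);
""")

# Equality on day-of-year: misses the 2008 (leap year) row.
by_day = conn.execute(
    "SELECT COUNT(*) FROM badges WHERE creationdate_day = "
    "CAST(strftime('%j', '2010-06-24') AS INTEGER)").fetchone()[0]
# Equality on month-day text: matches June 24 in every year.
by_md = conn.execute(
    "SELECT COUNT(*) FROM badges WHERE creationdate_md = '06-24'").fetchone()[0]
print(by_day, by_md)
```

Either column can be indexed in the same way. Which one is right depends on whether the search means "this calendar date in any year" or "the Nth day of the year".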

SELECT COUNT(*)
FROM messages
JOIN emails ON emails.id = messages.emailid
WHERE message LIKE '%word%'
AND ownership = 32

The problem, though, is the '%word%' pattern: it will always require a scan of message. You might want to look into full-text search if you are using MyISAM.

I think this is what you are looking for:

select count(*)
from messages m
  inner join emails e
    on e.id = m.emailid
where m.message like '%word%'
  and e.ownership = 32

Hard to tell for sure how it will perform. If the full table scan (FTS) is caused by the leading wildcard on the word, then doing it this way won't solve the problem. But the good news is that the join may limit the records in the messages table that you have to look at.
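As a quick sanity check that the join counts the same rows as the original subquery, the sketch below runs both against a few made-up rows. It uses Python's sqlite3 module rather than MySQL, so it says nothing about speed, only about the result. The two forms agree because emails.id is unique, so each message joins to at most one email.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emails (id INTEGER PRIMARY KEY, ownership INTEGER);
    CREATE TABLE messages (emailid INTEGER, message TEXT);
    CREATE INDEX messages_emailid ON messages (emailid);
    INSERT INTO emails VALUES (1, 32), (2, 32), (3, 7);
    INSERT INTO messages VALUES
        (1, 'a word here'), (1, 'second word'), (2, 'nothing'), (3, 'word again');
""")

# The subquery form from the question.
subquery_sql = """
    SELECT COUNT(*) FROM messages
    WHERE message LIKE '%word%'
    AND emailid IN (SELECT id FROM emails WHERE ownership = 32)
"""
# The join form from this answer.
join_sql = """
    select count(*)
    from messages m
      inner join emails e
        on e.id = m.emailid
    where m.message like '%word%'
      and e.ownership = 32
"""
subquery_result = conn.execute(subquery_sql).fetchone()[0]
join_result = conn.execute(join_sql).fetchone()[0]
print(subquery_result, join_result)
```

If emails.id were not unique, the join could count a message more than once, and the two queries would no longer be interchangeable.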

Is it possible for you to turn the join the other way around? It seems that the second query is the less expensive one, and since the whole thing is a simple join, you want to run the less expensive query first to narrow the data set as much as possible, and then join to your more expensive query.

Note: the technical posts on this site are published under the CC BY-SA 4.0 license. If you repost them, please cite this site's URL or the original address.

 