MySQL: Avoid Temporary/Filesort Caused by GROUP BY Clause

I've got a fairly simple query that displays the number of subscribed email addresses alongside the number of unsubscribed ones, grouped by client.

The query:

SELECT
    client_id,
    COUNT(CASE WHEN subscribed = 1 THEN subscribed END) AS subs,
    COUNT(CASE WHEN subscribed = 0 THEN subscribed END) AS unsubs
FROM
    contacts_emailAddresses
LEFT JOIN contacts ON contacts.id = contacts_emailAddresses.contact_id
GROUP BY
    client_id

Schema of relevant tables follows. contacts_emailAddresses is a junction table between contacts (which has the client_id) and emailAddresses (which is not actually used in this query).

CREATE TABLE `contacts` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `firstname` varchar(255) NOT NULL DEFAULT '',
  `middlename` varchar(255) NOT NULL DEFAULT '',
  `lastname` varchar(255) NOT NULL DEFAULT '',
  `gender` varchar(5) DEFAULT NULL,
  `client_id` mediumint(10) unsigned DEFAULT NULL,
  `datasource` varchar(10) DEFAULT NULL,
  `external_id` int(10) unsigned DEFAULT NULL,
  `created` timestamp NULL DEFAULT NULL,
  `trash` tinyint(1) NOT NULL DEFAULT '0',
  `updated` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  KEY `client_id` (`client_id`),
  KEY `external_id combo` (`client_id`,`datasource`,`external_id`),
  KEY `trash` (`trash`),
  KEY `lastname` (`lastname`),
  KEY `firstname` (`firstname`),
  CONSTRAINT `contacts_ibfk_1` FOREIGN KEY (`client_id`) REFERENCES `clients` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=14742974 DEFAULT CHARSET=utf8 ROW_FORMAT=COMPACT

CREATE TABLE `contacts_emailAddresses` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `contact_id` int(10) unsigned NOT NULL,
  `emailAddress_id` int(11) unsigned DEFAULT NULL,
  `primary` tinyint(1) unsigned NOT NULL DEFAULT '0',
  `subscribed` tinyint(1) unsigned NOT NULL DEFAULT '1',
  `modified` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  KEY `contact_id` (`contact_id`),
  KEY `subscribed` (`subscribed`),
  KEY `combo` (`contact_id`,`emailAddress_id`) USING BTREE,
  KEY `emailAddress_id` (`emailAddress_id`) USING BTREE,
  CONSTRAINT `contacts_emailAddresses_ibfk_1` FOREIGN KEY (`contact_id`) REFERENCES `contacts` (`id`),
  CONSTRAINT `contacts_emailAddresses_ibfk_2` FOREIGN KEY (`emailAddress_id`) REFERENCES `emailAddresses` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=24700918 DEFAULT CHARSET=utf8

Here's the EXPLAIN:

+----+-------------+-------------------------+--------+---------------+---------+---------+-------------------------------------------+----------+---------------------------------+
| id | select_type | table                   | type   | possible_keys | key     | key_len | ref                                       | rows     | Extra                           |
+----+-------------+-------------------------+--------+---------------+---------+---------+-------------------------------------------+----------+---------------------------------+
| 1  | SIMPLE      | contacts_emailAddresses | ALL    | NULL          | NULL    | NULL    | NULL                                      | 10176639 | Using temporary; Using filesort |
| 1  | SIMPLE      | contacts                | eq_ref | PRIMARY       | PRIMARY | 4       | icarus.contacts_emailAddresses.contact_id | 1        |                                 |
+----+-------------+-------------------------+--------+---------------+---------+---------+-------------------------------------------+----------+---------------------------------+
2 rows in set (0.08 sec)

The problem here clearly is the GROUP BY clause, as I can remove the JOIN (and the items that depend on it) and the performance is still terrible (40+ seconds). There are 10 million records in contacts_emailAddresses, about 12 million records in contacts, and 10–15 client records for the grouping.

From the docs:

Temporary tables can be created under conditions such as these:

If there is an ORDER BY clause and a different GROUP BY clause, or if the ORDER BY or GROUP BY contains columns from tables other than the first table in the join queue, a temporary table is created.

DISTINCT combined with ORDER BY may require a temporary table.

If you use the SQL_SMALL_RESULT option, MySQL uses an in-memory temporary table, unless the query also contains elements (described later) that require on-disk storage.

I'm obviously not combining the GROUP BY with an ORDER BY, and I have tried multiple things to ensure that the GROUP BY is on a column that should be properly placed in the join queue (including rewriting the query to put contacts in the FROM and join to contacts_emailAddresses instead), all to no avail.

Any suggestions for performance tuning would be much appreciated!

I think the only real shot you have at getting away from a "Using temporary; Using filesort" operation (given the current schema, the current query, and the specified resultset) would be to use correlated subqueries in the SELECT list.

SELECT c.client_id
     , (SELECT IFNULL(SUM(es.subscribed=1),0)
          FROM contacts_emailAddresses es
          JOIN contacts cs
            ON cs.id = es.contact_id
         WHERE cs.client_id = c.client_id
       ) AS subs
     , (SELECT IFNULL(SUM(eu.subscribed=0),0)
          FROM contacts_emailAddresses eu
          JOIN contacts cu
            ON cu.id = eu.contact_id
         WHERE cu.client_id = c.client_id
       ) AS unsubs
  FROM contacts c
 GROUP BY c.client_id

This may run quicker than the original query, or it may not. Those correlated subqueries are going to get run for each row returned by the outer query. If that outer query is returning a boatload of rows, that's a whole boatload of subquery executions.

Here's the output from an EXPLAIN:


id  select_type        table type  possible_keys                       key        key_len  ref   Extra
--  ------------------ ----- ----- ----------------------------------- ---------- ------- ------ ------------------------
 1  PRIMARY            c     index (NULL)                              client_id  5       (NULL) Using index
 3  DEPENDENT SUBQUERY cu    ref   PRIMARY,client_id,external_id combo client_id  5       func   Using where; Using index
 3  DEPENDENT SUBQUERY eu    ref   contact_id,combo                    contact_id 4       cu.id  Using where
 2  DEPENDENT SUBQUERY cs    ref   PRIMARY,client_id,external_id combo client_id  5       func   Using where; Using index
 2  DEPENDENT SUBQUERY es    ref   contact_id,combo                    contact_id 4       cs.id  Using where

For optimum performance of this query, we'd really like to see "Using index" in the Extra column of the EXPLAIN output for the eu and es tables. But to get that, we'd need a suitable index, one with contact_id as the leading column and including the subscribed column. For example:

CREATE INDEX cemail_IX2 ON contacts_emailAddresses (contact_id, subscribed);

With the new index available, EXPLAIN output shows MySQL will use the new index:


id  select_type        table type  possible_keys                       key        key_len ref    Extra                     
--  ------------------ ----- ----- ----------------------------------- ---------- ------- ------ ------------------------
 1  PRIMARY            c     index (NULL)                              client_id  5       (NULL) Using index
 3  DEPENDENT SUBQUERY cu    ref   PRIMARY,client_id,external_id combo client_id  5       func   Using where; Using index
 3  DEPENDENT SUBQUERY eu    ref   contact_id,combo,cemail_IX2         cemail_IX2 4       cu.id  Using where; Using index
 2  DEPENDENT SUBQUERY cs    ref   PRIMARY,client_id,external_id combo client_id  5       func   Using where; Using index
 2  DEPENDENT SUBQUERY es    ref   contact_id,combo,cemail_IX2         cemail_IX2 4       cs.id  Using where; Using index
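Since the correlated subqueries run once per row returned by the outer query, one way to keep that count small might be to drive the outer query from the clients table (referenced by the contacts_ibfk_1 foreign key, with only 10–15 rows) rather than from contacts. This is only a sketch under that assumption, not something tested against the schema above. Note that the result can differ slightly: clients with no contacts would appear with zero counts, and contacts with a NULL client_id would not be counted at all.

```sql
-- Sketch: one outer row per client, so no GROUP BY is needed on the outer query.
-- Assumes clients.id is the column referenced by contacts.client_id.
SELECT cl.id AS client_id
     , (SELECT IFNULL(SUM(es.subscribed = 1), 0)
          FROM contacts cs
          JOIN contacts_emailAddresses es
            ON es.contact_id = cs.id
         WHERE cs.client_id = cl.id
       ) AS subs
     , (SELECT IFNULL(SUM(eu.subscribed = 0), 0)
          FROM contacts cu
          JOIN contacts_emailAddresses eu
            ON eu.contact_id = cu.id
         WHERE cu.client_id = cl.id
       ) AS unsubs
  FROM clients cl;
```

Whether this beats the original query would need to be checked with EXPLAIN and real timings; each subquery still has to visit every matching row for its client.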

NOTES

This is the kind of problem where introducing a little redundancy can improve performance. (Just like we do in a traditional data warehouse.)

For optimum performance, what we'd really like is to have the client_id column available on the contacts_emailAddresses table, without a need to JOIN to the contacts table.

In the current schema, the foreign key relationship to the contacts table gets us the client_id (or rather, the JOIN operation in the original query is what gets it for us). If we could avoid that JOIN operation entirely, we could satisfy the query from a single index, using the index to do the aggregation, and avoiding the overhead of the "Using temporary; Using filesort" and JOIN operations.

With the client_id column available, we'd create a covering index like...

... ON contacts_emailAddresses (client_id, subscribed)

Then, we'd have a blazingly fast query...

SELECT e.client_id
     , SUM(e.subscribed=1) AS subs
     , SUM(e.subscribed=0) AS unsubs
  FROM contacts_emailAddresses e
GROUP BY e.client_id

That would get us a "Using index" in the query plan, and the query plan for this resultset just doesn't get any better than that.

But that would require a change to your schema, so it doesn't really answer your question.
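For reference, the schema change described above might look roughly like the following. This is a sketch only: the index name cemail_IX3 is made up for illustration, the column type is simply copied from contacts.client_id, and nothing here keeps the redundant column in sync afterwards (that would need application code or triggers, which are not shown).

```sql
-- 1. Add the redundant column, using the same type as contacts.client_id.
ALTER TABLE contacts_emailAddresses
  ADD COLUMN client_id MEDIUMINT UNSIGNED DEFAULT NULL;

-- 2. Backfill it from contacts (a single large UPDATE; on 10 million rows
--    it may be preferable to do this in batches).
UPDATE contacts_emailAddresses e
  JOIN contacts c
    ON c.id = e.contact_id
   SET e.client_id = c.client_id;

-- 3. Covering index for the GROUP BY query (hypothetical index name).
CREATE INDEX cemail_IX3
    ON contacts_emailAddresses (client_id, subscribed);
```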



Without the client_id column, the best we're likely to do is a query like the one Gordon posted in his answer (though you still need to add the GROUP BY c.client_id to get the specified result). The index Gordon recommended will be of benefit...

... ON contacts_emailAddresses(contact_id, subscribed)

With that index defined, the standalone index on contact_id is redundant. The new index will be a suitable replacement to support the existing foreign key constraint. (The index on just contact_id could be dropped.)
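Dropping the redundant index could be done along these lines. The sketch assumes the composite index on (contact_id, subscribed) has already been created, so that the foreign key on contact_id still has an index with contact_id as its leading column.

```sql
-- Drop the single-column index; the composite index with contact_id as its
-- leading column can serve the foreign key and contact_id lookups instead.
ALTER TABLE contacts_emailAddresses DROP INDEX contact_id;

-- Confirm which indexes remain on the table.
SHOW INDEX FROM contacts_emailAddresses;
```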


Another approach would be to do the aggregation on the "big" table first, before doing the JOIN, since it's the driving table for the outer join. Actually, since that foreign key column is defined as NOT NULL, and there's a foreign key, it's not really an "outer" join at all.

SELECT c.client_id
     , SUM(s.subs) AS subs
     , SUM(s.unsubs) AS unsubs 
  FROM ( SELECT e.contact_id
              , SUM(e.subscribed=1) AS subs
              , SUM(e.subscribed=0) AS unsubs
           FROM contacts_emailAddresses e
          GROUP BY e.contact_id
       ) s
 JOIN contacts c
   ON c.id = s.contact_id
GROUP BY c.client_id

Again, we need an index with contact_id as the leading column and including the subscribed column, for best performance. (The plan for s should show "Using index".) Unfortunately, that's still going to materialize a fairly sizable resultset (derived table s) as a temporary MyISAM table, and that temporary table isn't going to be indexed.
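The inner JOIN in that query relies on the earlier observation that every contacts_emailAddresses row has a matching contacts row. If there is any doubt (for example, if data was ever loaded with foreign key checks disabled), a one-off check like this sketch could confirm it before swapping the LEFT JOIN for a JOIN. It scans the whole table, so it is not something to run casually on a busy server.

```sql
-- Count junction rows whose contact_id has no matching contacts row.
-- With the foreign key enforced, this should return 0.
SELECT COUNT(*) AS orphaned_rows
  FROM contacts_emailAddresses e
  LEFT JOIN contacts c
    ON c.id = e.contact_id
 WHERE c.id IS NULL;
```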
