How can I optimize a query which depends on both COUNT and GROUP BY?

I have a query whose purpose is to generate statistics on how many times a musical work (track) has been downloaded from a site over different periods (by month, by quarter, by year, etc.). The query operates on the tables entityusage, entityusage_file and track.

To get the number of downloads for tracks belonging to a specific album, I would run the following query:

select 
    date_format(eu.updated, '%Y-%m-%d') as p, count(eu.id) as c
from        entityusage as eu
inner join  entityusage_file as euf 
        ON  euf.entityusage_id = eu.id
inner join  track as t 
        ON t.id = euf.track_id
where
    t.album_id = '0054a47e-b594-407b-86df-3be078b4e7b7'
        and entitytype = 't'
        and action = 1
group by date_format(eu.updated, '%Y%m%d')

I need to set entitytype = 't' because entityusage can hold downloads of other entities as well (if entitytype = 'a', an entire album was downloaded, and entityusage_file then holds all the tracks which the album "translated" into at the point of download).

This query takes 40-50 seconds. I've been trying to optimize it for a while, but I have the feeling that I'm approaching this the wrong way.

This is one of 4 similar queries which must run to generate a report. The report should preferably be able to finish while a user waits for it. Right now, I'm looking at 3-4 minutes. That's a long time to wait.

Can this query be optimised further with indexes, or do I need to take another approach to get this job done?

CREATE TABLE `entityusage` (
  `id` char(36) NOT NULL,
  `title` varchar(255) DEFAULT NULL,
  `entitytype` varchar(5) NOT NULL,
  `entityid` char(36) NOT NULL,
  `externaluser` int(10) NOT NULL,
  `action` tinyint(1) NOT NULL,
  `updated` datetime NOT NULL,
  PRIMARY KEY (`id`),
  KEY `e` (`entityid`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

CREATE TABLE `entityusage_file` (
  `id` char(36) NOT NULL,
  `entityusage_id` char(36) NOT NULL,
  `track_id` char(36) NOT NULL,
  `file_id` char(36) NOT NULL,
  `type` varchar(3) NOT NULL,
  `quality` int(1) NOT NULL,
  `size` int(20) NOT NULL,
  `updated` datetime NOT NULL,
  PRIMARY KEY (`id`),
  KEY `file_id` (`file_id`),
  KEY `entityusage_id` (`entityusage_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE `track` (
  `id` char(36) NOT NULL,
  `album_id` char(36) NOT NULL,
  `number` int(3) NOT NULL DEFAULT '0',
  `title` varchar(255) DEFAULT NULL,
  `updated` datetime NOT NULL DEFAULT '2000-01-01 00:00:00',
  PRIMARY KEY (`id`),
  KEY `album` (`album_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 CHECKSUM=1 DELAY_KEY_WRITE=1 ROW_FORMAT=DYNAMIC;

An EXPLAIN on the query gives me the following:

+------+-------------+-------+--------+----------------+----------------+---------+------------------------------+---------+----------------------------------------------+
| id   | select_type | table | type   | possible_keys  | key            | key_len | ref                          | rows    | Extra                                        |
+------+-------------+-------+--------+----------------+----------------+---------+------------------------------+---------+----------------------------------------------+
|    1 | SIMPLE      | eu    | ALL    | NULL           | NULL           | NULL    | NULL                         | 7832817 | Using where; Using temporary; Using filesort |
|    1 | SIMPLE      | euf   | ref    | entityusage_id | entityusage_id | 108     | func                         |       1 | Using index condition                        |
|    1 | SIMPLE      | t     | eq_ref | PRIMARY,album  | PRIMARY        | 108     | trackerdatabase.euf.track_id |       1 | Using where                                  |
+------+-------------+-------+--------+----------------+----------------+---------+------------------------------+---------+----------------------------------------------+

This is your query:

select date_format(eu.updated, '%Y-%m-%d') as p, count(eu.id) as c
from entityusage eu join
     entityusage_file euf
     on euf.entityusage_id = eu.id join
     track t 
     on t.id = euf.track_id
where t.album_id = '0054a47e-b594-407b-86df-3be078b4e7b7' and
      eu.entitytype = 't' and
      eu.action = 1
group by date_format(eu.updated, '%Y%m%d');

I would suggest indexes on track(album_id, id), entityusage_file(track_id, entityusage_id), and entityusage(id, entitytype, action).
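As a minimal sketch, the suggested indexes could be created as follows. The index names are hypothetical (not from the original answer), and the statements assume the table definitions posted in the question.

```sql
-- Hypothetical index names; the column lists are the ones suggested above.
-- Lets the lookup by album start from track.
ALTER TABLE track
  ADD INDEX album_id_id (album_id, id);

-- Lets the join go from a track to its usage rows without a table scan.
ALTER TABLE entityusage_file
  ADD INDEX track_entityusage (track_id, entityusage_id);

-- Covers the two filter columns alongside the join column.
ALTER TABLE entityusage
  ADD INDEX id_type_action (id, entitytype, action);
```

Re-running EXPLAIN after adding them should show whether the optimizer now starts from track rather than scanning entityusage.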

Assuming that entityusage_file is mostly a many:many mapping table, see this for tips on improving it. Note that it calls for getting rid of the id column and creating a pair of 2-column indexes, one of which is the PRIMARY KEY(track_id, entityusage_id). Since your table has a few extra columns, that link does not cover everything.
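A rough sketch of that restructuring is shown below. It assumes that each (track_id, entityusage_id) pair occurs at most once in entityusage_file, which the question does not confirm; if the pair is not unique, adding the primary key will fail. The index name entityusage_track is hypothetical, and the extra columns (file_id, type, quality, size, updated) are left untouched.

```sql
-- Sketch only: assumes (track_id, entityusage_id) is unique in this table.
ALTER TABLE entityusage_file
  DROP PRIMARY KEY,
  DROP COLUMN id,
  ADD PRIMARY KEY (track_id, entityusage_id),
  -- Replace the single-column index with the reverse 2-column index.
  DROP INDEX entityusage_id,
  ADD INDEX entityusage_track (entityusage_id, track_id);
```

On a table this size the rebuild is likely to take a while, so it is worth trying on a copy first.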

The UUIDs could be shrunk from 108 bytes to 36, then to 16 by going to BINARY(16) and using a compression function. Many exist (including a builtin pair in version 8.0); here's mine.
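For illustration, the 108 bytes come from CHAR(36) in a utf8 column (up to 3 bytes per character), which matches the key_len of 108 in the EXPLAIN output. The sketch below shows the builtin MySQL 8.0 pair, UUID_TO_BIN and BIN_TO_UUID, plus a common equivalent for older versions; it is not the author's own function, which is not reproduced here.

```sql
-- MySQL 8.0+: pack a 36-character UUID string into 16 bytes, and unpack it.
SELECT UUID_TO_BIN('0054a47e-b594-407b-86df-3be078b4e7b7') AS packed;
SELECT BIN_TO_UUID(UUID_TO_BIN('0054a47e-b594-407b-86df-3be078b4e7b7')) AS round_trip;

-- Older versions: strip the dashes and unhex the remaining 32 hex digits.
-- This yields the same 16 bytes as UUID_TO_BIN when no swap flag is used.
SELECT UNHEX(REPLACE('0054a47e-b594-407b-86df-3be078b4e7b7', '-', '')) AS packed_legacy;
```

Switching the id columns to BINARY(16) would have to be done consistently across all three tables, since the join columns need matching types for the indexes to be used.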

To explain one thing... The query execution should have started with track (on the assumption that '0054a47e-b594-407b-86df-3be078b4e7b7' is very selective). The hangup was that there was no index to get from there to the next table. Gordon's suggested indexes include one.

date_format(eu.updated, '%Y-%m-%d') and date_format(eu.updated, '%Y%m%d') can be simplified to DATE(eu.updated). (No significant performance change.)
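With that simplification, the query from the question might look like the sketch below. It only swaps the two date_format calls for DATE() and qualifies the filter columns with eu; nothing else is changed.

```sql
-- Same query as in the question, using DATE() in both SELECT and GROUP BY.
SELECT DATE(eu.updated) AS p,
       COUNT(eu.id)     AS c
FROM entityusage AS eu
INNER JOIN entityusage_file AS euf ON euf.entityusage_id = eu.id
INNER JOIN track AS t ON t.id = euf.track_id
WHERE t.album_id = '0054a47e-b594-407b-86df-3be078b4e7b7'
  AND eu.entitytype = 't'
  AND eu.action = 1
GROUP BY DATE(eu.updated);
```

Using the same expression in both places also avoids the mismatch between the '%Y-%m-%d' and '%Y%m%d' formats in the original.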

(The other Answers and Comments cover a number of issues; I won't repeat them here.)

Because the GROUP BY operation is on an expression involving a function, MySQL can't use an index to optimize that operation. It's going to require a "Using filesort" operation.

I believe the indexes that Gordon suggested are the best bets, given the current table definitions. But even with those indexes, the "tall post" is the eu table: chunking through and sorting all those rows.

To get more reasonable performance, you may need to introduce a "precomputed results" table. It's going to be expensive to generate the counts for everything... but we can pay that price ahead of time...

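-- NOTE: this assumes entityusage carries a track_id column. The schema
-- posted in the question does not show one, so the track id may have to
-- come from entityusage_file.track_id via a join instead.
CREATE TABLE usage_track_by_day
( updated_dt DATE NOT NULL
, PRIMARY KEY (track_id, updated_dt)
)
AS
SELECT eu.track_id
     , DATE(eu.updated) AS updated_dt
     , SUM(IF(eu.action = 1,1,0)) AS cnt  -- count only rows with action = 1
  FROM entityusage eu
 WHERE eu.track_id IS NOT NULL
   AND eu.updated IS NOT NULL
 GROUP
    BY eu.track_id
     , DATE(eu.updated);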

An index ON entityusage (track_id,updated,action) may benefit performance.

Then, we could write a query against the new "precomputed results" table, with a better shot at reasonable performance.
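Such a query might look like the following sketch. It assumes the usage_track_by_day table described above, with columns track_id, updated_dt and cnt. Note that the precomputed counts as defined do not filter on entitytype = 't', so that filter would have to be built into the table if it matters for the report.

```sql
-- Daily download counts for one album, read from the precomputed table.
SELECT u.updated_dt AS p,
       SUM(u.cnt)   AS c
FROM track AS t
INNER JOIN usage_track_by_day AS u ON u.track_id = t.id
WHERE t.album_id = '0054a47e-b594-407b-86df-3be078b4e7b7'
GROUP BY u.updated_dt;
```

Monthly or yearly figures could be derived from the same table by grouping on an expression such as DATE_FORMAT(u.updated_dt, '%Y-%m') instead of the raw date.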

The "precomputed results" table would get stale, and would need to be periodically refreshed.
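One possible way to refresh it, sketched under the same assumptions as the table definition (in particular, a track_id column on entityusage), is to recompute only the most recent days and overwrite those rows. The one-day window is an arbitrary choice for illustration.

```sql
-- Recompute yesterday and today; REPLACE overwrites rows that share the
-- primary key (track_id, updated_dt) and inserts the ones that are new.
REPLACE INTO usage_track_by_day (updated_dt, track_id, cnt)
SELECT DATE(eu.updated)
     , eu.track_id
     , SUM(IF(eu.action = 1, 1, 0))
  FROM entityusage eu
 WHERE eu.updated >= CURDATE() - INTERVAL 1 DAY
   AND eu.track_id IS NOT NULL
 GROUP
    BY eu.track_id
     , DATE(eu.updated);
```

This could be run from a scheduled job; how often depends on how fresh the report needs to be.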

This isn't necessarily the best solution to the issue, but it's a technique we can use in data warehouse/data mart applications. This lets us churn through lots of detail rows to get counts one time, and then save those counts for fast access.

Can you try this? I can't really test it without some sample data from you. In this case the query looks first in table track and then joins the other tables.

SELECT
    date_format(eu.updated, '%Y-%m-%d') AS p
    , count(eu.id) AS c
FROM track AS t
INNER JOIN entityusage_file AS euf ON t.id = euf.track_id
INNER JOIN entityusage AS eu ON euf.entityusage_id = eu.id
WHERE
    t.album_id = '0054a47e-b594-407b-86df-3be078b4e7b7'
        AND eu.entitytype = 't'
        AND eu.action = 1
GROUP BY date_format(eu.updated, '%Y%m%d');
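One caveat: for inner joins, the MySQL optimizer is free to pick its own join order, so listing track first does not by itself guarantee that it is read first. If the intent is to force that order, a hedged variant uses the STRAIGHT_JOIN modifier, which makes MySQL join the tables in the order they appear in the FROM clause. Whether this helps here is untested.

```sql
-- Force the join order track -> entityusage_file -> entityusage.
SELECT STRAIGHT_JOIN
    date_format(eu.updated, '%Y-%m-%d') AS p
    , count(eu.id) AS c
FROM track AS t
INNER JOIN entityusage_file AS euf ON t.id = euf.track_id
INNER JOIN entityusage AS eu ON euf.entityusage_id = eu.id
WHERE
    t.album_id = '0054a47e-b594-407b-86df-3be078b4e7b7'
        AND eu.entitytype = 't'
        AND eu.action = 1
GROUP BY date_format(eu.updated, '%Y-%m-%d');
```

Forcing the order only pays off if there is an index leading from track into entityusage_file, such as one starting with track_id.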
