
MYSQL SUM() with GROUP BY and LIMIT

I have this table:

CREATE TABLE `votes` (
  `item_id` int(10) unsigned NOT NULL,
  `user_id` int(10) unsigned NOT NULL,
  `vote` tinyint(4) NOT NULL DEFAULT '0',
  PRIMARY KEY (`item_id`,`user_id`),
  KEY `FK_vote_user` (`user_id`),
  KEY `vote` (`vote`),
  KEY `item` (`item_id`),
  CONSTRAINT `FK_vote_item` FOREIGN KEY (`item_id`) REFERENCES `items` (`id`) ON UPDATE CASCADE,
  CONSTRAINT `FK_vote_user` FOREIGN KEY (`user_id`) REFERENCES `users` (`id`) ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

And this simple select:

SELECT 
  `a`.`item_id`, `a`.`sum`
FROM
  (SELECT 
    `item_id`, SUM(vote) AS `sum` 
  FROM
    `votes` 
  GROUP BY `item_id`) AS a 
ORDER BY `a`.`sum` DESC
LIMIT 10

Right now, with only 250 rows, there isn't a problem, but the query uses a filesort. The vote column holds -1, 0, or 1. Will this still perform well when the table has millions of rows?

If I rewrite it as a simpler query without the subquery, then Using temporary appears in the EXPLAIN output.

EXPLAIN gives (the query completes in 0.00170 s):

id select_type table      type  possible_keys key     key_len ref  rows Extra
1  PRIMARY     <derived2> ALL   NULL          NULL    NULL    NULL 33   Using filesort
2  DERIVED     votes      index NULL          PRIMARY 8       NULL 250

No, this won't be efficient with millions of rows.

You'll have to create a supporting aggregate table that stores the vote totals per item:

CREATE TABLE item_votes
        (
        item_id INT NOT NULL PRIMARY KEY,
        votes INT NOT NULL,
        upvotes INT UNSIGNED NOT NULL,
        downvotes INT UNSIGNED NOT NULL,
        KEY (votes),
        KEY (upvotes),
        KEY (downvotes)
        )

and update it each time a vote is cast:

INSERT
INTO    item_votes (item_id, votes, upvotes, downvotes)
VALUES  (
        $item_id,
        CASE WHEN $upvote THEN 1 ELSE -1 END,
        CASE WHEN $upvote THEN 1 ELSE 0 END,
        CASE WHEN $upvote THEN 0 ELSE 1 END
        )
ON DUPLICATE KEY UPDATE
        votes = votes + VALUES(upvotes) - VALUES(downvotes),
        upvotes = upvotes + VALUES(upvotes),
        downvotes = downvotes + VALUES(downvotes)
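The same maintain-on-write pattern can be sketched in Python, using SQLite's ON CONFLICT upsert as a stand-in for MySQL's ON DUPLICATE KEY UPDATE (the `cast_vote` helper and in-memory database are illustrative, not part of the original answer):

```python
import sqlite3

# Aggregate table kept current on every write, so reads never need to scan votes.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE item_votes (
    item_id   INTEGER PRIMARY KEY,
    votes     INTEGER NOT NULL,   -- net score, can go negative
    upvotes   INTEGER NOT NULL,
    downvotes INTEGER NOT NULL)""")

def cast_vote(item_id, upvote):
    """Record one vote and fold it into the per-item totals in a single statement."""
    conn.execute(
        """INSERT INTO item_votes (item_id, votes, upvotes, downvotes)
           VALUES (?, ?, ?, ?)
           ON CONFLICT(item_id) DO UPDATE SET
               votes     = votes     + excluded.votes,
               upvotes   = upvotes   + excluded.upvotes,
               downvotes = downvotes + excluded.downvotes""",
        (item_id, 1 if upvote else -1, 1 if upvote else 0, 0 if upvote else 1))

cast_vote(1, True)
cast_vote(1, True)
cast_vote(1, False)
cast_vote(2, True)
print(conn.execute("""SELECT item_id, votes, upvotes, downvotes
                      FROM item_votes
                      ORDER BY votes DESC, item_id DESC""").fetchall())
# [(2, 1, 1, 0), (1, 1, 2, 1)]
```

SQLite's `excluded.*` plays the role of MySQL's `VALUES(...)`: the row that would have been inserted had there been no conflict.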

then select the top 10:

SELECT  *
FROM    item_votes
ORDER BY
        votes DESC, item_id DESC
LIMIT   10

efficiently, using an index.

But will this be performant when this table has millions of rows?

No, it won't.

If I make it a simpler query without a subquery, then Using temporary appears.

Probably because the planner turns it into the query you posted: it needs to calculate the sum before it can return the results in the correct order.

To grab the top-voted questions quickly, you need to cache the result. Add a score field to your items table, maintain it (e.g. using triggers), and index it. You'll then be able to grab the top 10 scores with an index scan.
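A minimal sketch of this trigger-maintained score column, again using SQLite syntax as a stand-in for MySQL (the trigger and index names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (id INTEGER PRIMARY KEY, score INTEGER NOT NULL DEFAULT 0);
CREATE TABLE votes (item_id INTEGER NOT NULL, user_id INTEGER NOT NULL,
                    vote INTEGER NOT NULL,
                    PRIMARY KEY (item_id, user_id));
CREATE INDEX idx_items_score ON items(score);

-- Keep items.score in sync whenever a vote row is inserted.
CREATE TRIGGER votes_after_insert AFTER INSERT ON votes BEGIN
    UPDATE items SET score = score + NEW.vote WHERE id = NEW.item_id;
END;
""")
conn.execute("INSERT INTO items (id) VALUES (1), (2)")
conn.execute("INSERT INTO votes VALUES (1, 10, 1), (1, 11, 1), (2, 10, -1)")

# The hot read path touches only the indexed score column, never the votes table.
top = conn.execute("SELECT id, score FROM items ORDER BY score DESC LIMIT 10").fetchall()
print(top)  # [(1, 2), (2, -1)]
```

A production version would also need triggers for vote updates and deletes so the cached score never drifts from the underlying rows.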

First, you don't need the subquery, so you can rewrite your query as:

SELECT `item_id`, SUM(vote) AS `sum` 
FROM `votes`
GROUP BY `item_id`
ORDER BY `sum` DESC
LIMIT 10

Second, you can build an index on votes(item_id, vote). The group by then becomes an index scan. It will take longer as the table grows, but it should be manageable for reasonable data sizes.
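The effect of such a covering index can be sketched with SQLite (the table and index names are illustrative; the alias `total` replaces `sum` from the query above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE votes (
    item_id INTEGER NOT NULL, user_id INTEGER NOT NULL, vote INTEGER NOT NULL,
    PRIMARY KEY (item_id, user_id))""")
# Composite index holding everything the aggregate query needs.
conn.execute("CREATE INDEX idx_item_vote ON votes(item_id, vote)")
conn.executemany("INSERT INTO votes VALUES (?, ?, ?)",
                 [(1, 10, 1), (1, 11, -1), (2, 10, 1), (2, 11, 1)])

query = """SELECT item_id, SUM(vote) AS total
           FROM votes GROUP BY item_id
           ORDER BY total DESC LIMIT 10"""
# The plan typically shows the GROUP BY satisfied from idx_item_vote,
# leaving only the final ORDER BY as a sort step.
print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())
print(conn.execute(query).fetchall())  # [(2, 2), (1, 0)]
```

The grouping reads index entries in item_id order, so no temporary table is needed for the aggregation itself; the sort on the computed total is what remains.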

Finally, with this query structure, you still need a filesort for the final order by. Whether that is efficient depends on the number of items. If each item has, on average, only one or two votes, it may take some time. If you have a fixed set of items and there are only a few hundred or thousand, it should not be a performance bottleneck, even as the data grows.

If this summary is something you really need quickly, then a trigger with a summary table (as explained in another answer) provides a faster retrieval method.

Notice: the technical posts on this site are licensed under CC BY-SA 4.0; if you repost, please credit this site or the original source. For questions contact: yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM