Getting match counts in a query on a large table is very slow

Question

I have a MySQL table "items" with 2 integer fields: seid and tiid.
The table has about 35,000,000 records, so it is very large.

seid  tiid  
-----------
1     1  
2     2  
2     3  
2     4  
3     4  
4     1  
4     2

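The table has a primary key on both fields, an index on seid, and an index on tiid.

Someone enters one or more tiid values, and I want to get the seid(s) with the most matches.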

For example, when someone enters 1,2,3, I want to get seid 2 and 4 as the result. They both have 2 matches on the tiid values.
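
To make the example reproducible, here is a minimal sketch of the table and the sample rows shown above. The column order of the primary key, the index names (seid_idx, tiid_idx), and the storage engine are assumptions, since the question does not state them.

```sql
-- Hypothetical DDL: key column order, index names, and engine are assumptions.
CREATE TABLE items (
  seid INT NOT NULL,
  tiid INT NOT NULL,
  PRIMARY KEY (seid, tiid),
  KEY seid_idx (seid),
  KEY tiid_idx (tiid)
) ENGINE = InnoDB;

-- The seven sample rows from the question.
INSERT INTO items (seid, tiid) VALUES
  (1, 1), (2, 2), (2, 3), (2, 4), (3, 4), (4, 1), (4, 2);

-- Number of matching tiid values per seid for the input 1,2,3.
SELECT seid, COUNT(*) AS c
  FROM items
 WHERE tiid IN (1, 2, 3)
 GROUP BY seid;
-- seid 1 has 1 match, seid 2 and seid 4 have 2 each; seid 3 has none and does not appear.
```

With this key order the separate index on seid duplicates the leading column of the primary key, so it is probably redundant.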

My query so far:

SELECT COUNT(*) as c, seid
  FROM items
 WHERE tiid IN (1,2,3) 
GROUP BY seid
HAVING c = (SELECT COUNT(*) as c
              FROM items
             WHERE tiid IN (1,2,3) 
          GROUP BY seid
          ORDER BY c DESC 
             LIMIT 1)

But because the table is so large, this query is extremely slow.

Does anyone know how to build a better query for this purpose?

Answer 1

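Your query requires going through the large table twice. Caching the grouped counts in a temporary table, as below, might help halve the time taken, though it seems unlikely to help much beyond that.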

DROP temporary table if exists TMP_COUNTED;

create temporary table TMP_COUNTED
select seid, COUNT(*) as C
from items
where tiid in (1,2,3)
group by seid;

CREATE INDEX IX_TMP_COUNTED on TMP_COUNTED(C);

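-- MySQL cannot refer to a temporary table twice in the same query,
-- so read the maximum into a user variable first.
SELECT MAX(C) INTO @max_c FROM TMP_COUNTED;

SELECT *
FROM TMP_COUNTED
WHERE C = @max_c;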

Answer 2

So I found 2 solutions. The first one:

SELECT c,GROUP_CONCAT(CAST(seid AS CHAR)) as seid_list 
FROM (
    SELECT COUNT(*) as c, seid FROM items 
    WHERE tiid IN (1,2,3) 
    GROUP BY seid ORDER BY c DESC
) T1 
GROUP BY c 
ORDER BY c DESC
LIMIT 1;
+---+-----------+
| c | seid_list |
+---+-----------+
| 2 | 2,4       | 
+---+-----------+

Edit:

EXPLAIN SELECT c,GROUP_CONCAT(CAST(seid AS CHAR)) as seid_list  FROM (     SELECT COUNT(*) as c, seid FROM items      WHERE tiid IN (1,2,3)      GROUP BY seid ORDER BY c DESC ) T1  GROUP BY c  ORDER BY c DESC LIMIT 1;
+----+-------------+------------+-------+------------------+---------+---------+------+------+-----------------------------------------------------------+
| id | select_type | table      | type  | possible_keys    | key     | key_len | ref  | rows | Extra                                                     |
+----+-------------+------------+-------+------------------+---------+---------+------+------+-----------------------------------------------------------+
|  1 | PRIMARY     | <derived2> | ALL   | NULL             | NULL    | NULL    | NULL |    3 | Using filesort                                            | 
|  2 | DERIVED     | items      | range | PRIMARY,tiid_idx | PRIMARY | 4       | NULL |    4 | Using where; Using index; Using temporary; Using filesort | 
+----+-------------+------------+-------+------------------+---------+---------+------+------+-----------------------------------------------------------+

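Re-edit:

The first solution has a problem: with billions of rows the result field could become too big (GROUP_CONCAT output is also truncated at group_concat_max_len, 1024 bytes by default). So here is another solution, which avoids the "double rainbow" effect (running the grouped query twice) by applying the classical max memorization/check with a MySQL variable: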

SELECT c,seid
  FROM (
   SELECT c,seid,CASE WHEN @mmax<=c THEN @mmax:=c ELSE 0 END 'mymax'
     FROM (
       SELECT COUNT(*) as c, seid FROM items WHERE tiid IN (1,2,3)
        GROUP BY seid
        ORDER BY c DESC
    ) res1
   ,(SELECT @mmax:=0) initmax
   ORDER BY c DESC
 ) res2 WHERE mymax>0;
+---+------+
| c | seid |
+---+------+
| 2 |    4 | 
| 2 |    2 | 
+---+------+

EXPLAIN:

+----+-------------+------------+--------+------------------+---------+---------+------+------+-----------------------------------------------------------+
| id | select_type | table      | type   | possible_keys    | key     | key_len | ref  | rows | Extra                                                     |
+----+-------------+------------+--------+------------------+---------+---------+------+------+-----------------------------------------------------------+
|  1 | PRIMARY     | <derived2> | ALL    | NULL             | NULL    | NULL    | NULL |    3 | Using where                                               | 
|  2 | DERIVED     | <derived4> | system | NULL             | NULL    | NULL    | NULL |    1 | Using filesort                                            | 
|  2 | DERIVED     | <derived3> | ALL    | NULL             | NULL    | NULL    | NULL |    3 |                                                           | 
|  4 | DERIVED     | NULL       | NULL   | NULL             | NULL    | NULL    | NULL | NULL | No tables used                                            | 
|  3 | DERIVED     | items      | range  | PRIMARY,tiid_idx | PRIMARY | 4       | NULL |    4 | Using where; Using index; Using temporary; Using filesort | 
+----+-------------+------------+--------+------------------+---------+---------+------+------+-----------------------------------------------------------+
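
For readers on MySQL 8.0 or later, the same "compute the counts once, then keep only the maximum" idea can be sketched with a window function instead of a user variable. This is a swapped-in technique and not part of the answer above; it assumes window-function support, which the MySQL versions of that time did not have.

```sql
-- Assumes MySQL 8.0+ (window functions); not the original answer's method.
SELECT seid, c
  FROM (
    SELECT seid,
           COUNT(*) AS c,
           -- Rank the groups by their match count; ties share rank 1.
           RANK() OVER (ORDER BY COUNT(*) DESC) AS rnk
      FROM items
     WHERE tiid IN (1, 2, 3)
     GROUP BY seid
  ) ranked
 WHERE rnk = 1;
```

On the sample data from the question this should return seid 2 and seid 4, each with c = 2. Assigning user variables inside expressions (as in @mmax:=c) is deprecated in MySQL 8.0, which is one reason to prefer this form there.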

Answer 3

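Precompute the counts for all unique tiid values and store them.

Refresh these counts hourly, daily, or weekly. Or try to keep the counts correct by updating them as the data changes. This removes the need to do the counting at query time. Counting is always slow.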
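
A minimal sketch of such a stored count, assuming a hypothetical summary table named tiid_counts (the table name, its columns, and the refresh schedule are not from the answer):

```sql
-- Hypothetical summary table; name and columns are assumptions.
CREATE TABLE tiid_counts (
  tiid INT NOT NULL PRIMARY KEY,
  cnt  INT UNSIGNED NOT NULL
) ENGINE = InnoDB;

-- Periodic refresh (for example from a scheduled job): recompute every count.
REPLACE INTO tiid_counts (tiid, cnt)
SELECT tiid, COUNT(*)
  FROM items
 GROUP BY tiid;

-- Incremental alternative: bump the count whenever a row with tiid = 3
-- is inserted into items.
INSERT INTO tiid_counts (tiid, cnt)
VALUES (3, 1)
ON DUPLICATE KEY UPDATE cnt = cnt + 1;
```

The periodic refresh does not delete rows for tiid values that no longer occur in items, so a real job would need to handle that. Per-tiid totals on their own also do not say which seid matches the most values of an arbitrary tiid set, so the grouped query is still needed for that part.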

Answer 4

I have a table called product_category with a composite primary key made up of 2 unsigned integer fields and no other secondary indexes:

create table product_category
(
prod_id int unsigned not null,
cat_id mediumint unsigned not null,
primary key (cat_id, prod_id) -- note the clustered composite index !!
)
engine = innodb;

The table currently has 125 million rows:

select count(*) as c from product_category;
c
=
125,524,947

with the following indexes/cardinalities:

show indexes from product_category;

Table              Non_unique   Key_name    Seq_in_index    Column_name Collation   Cardinality
=====              ==========   ========    ============    =========== =========   ===========
product_category    0            PRIMARY                1    cat_id      A           1162276
product_category    0            PRIMARY                2    prod_id     A           125525826

If I run a query similar to yours (first run with nothing cached and cold/empty buffers):

select 
 prod_id, count(*) as c
from
 product_category 
where 
  cat_id between 1600 and 2000 -- using between to include a wider range of data
group by
 prod_id 
having c = (
  select count(*) as c from product_category 
  where cat_id between 1600 and 2000
  group by  prod_id order by c desc limit 1
)
order by prod_id;

I get the following results:

(cold run)
+---------+---+
| prod_id | c |
+---------+---+
|   34957 | 4 |
|  717812 | 4 |
|  816612 | 4 |
|  931111 | 4 |
+---------+---+
4 rows in set (0.18 sec)

(2nd run)
+---------+---+
| prod_id | c |
+---------+---+
|   34957 | 4 |
|  717812 | 4 |
|  816612 | 4 |
|  931111 | 4 |
+---------+---+
4 rows in set (0.14 sec)

The explain plan is as follows:

+----+-------------+------------------+-------+---------------+---------+---------+------+--------+-----------------------------------------------------------+
| id | select_type | table            | type  | possible_keys | key     | key_len | ref  | rows   | Extra                                                     |
+----+-------------+------------------+-------+---------------+---------+---------+------+--------+-----------------------------------------------------------+
|  1 | PRIMARY     | product_category | range | PRIMARY       | PRIMARY | 3       | NULL | 194622 | Using where; Using index; Using temporary; Using filesort |
|  2 | SUBQUERY    | product_category | range | PRIMARY       | PRIMARY | 3       | NULL | 194622 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+------------------+-------+---------------+---------+---------+------+--------+-----------------------------------------------------------+

If I run regilero's query:

SELECT c,prod_id
  FROM (
   SELECT c,prod_id,CASE WHEN @mmax<=c THEN @mmax:=c ELSE 0 END 'mymax'
     FROM (
       SELECT COUNT(*) as c, prod_id FROM product_category WHERE
        cat_id between 1600 and 2000
        GROUP BY prod_id
        ORDER BY c DESC
    ) res1
   ,(SELECT @mmax:=0) initmax
   ORDER BY c DESC
 ) res2 WHERE mymax>0;

I get the following results:

(cold) 
+---+---------+
| c | prod_id |
+---+---------+
| 4 |  931111 |
| 4 |   34957 |
| 4 |  717812 |
| 4 |  816612 |
+---+---------+
4 rows in set (0.17 sec)

(2nd run)
+---+---------+
| c | prod_id |
+---+---------+
| 4 |   34957 |
| 4 |  717812 |
| 4 |  816612 |
| 4 |  931111 |
+---+---------+
4 rows in set (0.13 sec)

The explain plan is as follows:

+----+-------------+------------------+--------+---------------+---------+---------+------+--------+-----------------------------------------------------------+
| id | select_type | table            | type   | possible_keys | key     | key_len | ref  | rows   | Extra                                                     |
+----+-------------+------------------+--------+---------------+---------+---------+------+--------+-----------------------------------------------------------+
|  1 | PRIMARY     | <derived2>       | ALL    | NULL          | NULL    | NULL    | NULL |  92760 | Using where                                               |
|  2 | DERIVED     | <derived4>       | system | NULL          | NULL    | NULL    | NULL |      1 | Using filesort                                            |
|  2 | DERIVED     | <derived3>       | ALL    | NULL          | NULL    | NULL    | NULL |  92760 |                                                           |
|  4 | DERIVED     | NULL             | NULL   | NULL          | NULL    | NULL    | NULL |   NULL | No tables used                                            |
|  3 | DERIVED     | product_category | range  | PRIMARY       | PRIMARY | 3       | NULL | 194622 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+------------------+--------+---------------+---------+---------+------+--------+-----------------------------------------------------------+

Finally, trying cyberkiwi's approach:

drop procedure if exists cyberkiwi_variant;

delimiter #

create procedure cyberkiwi_variant()
begin

create temporary table tmp engine=memory
 select prod_id, count(*) as c from
 product_category where cat_id between 1600 and 2000
 group by prod_id order by c desc; 

select max(c) into @max from tmp;

select * from tmp where c = @max;

drop temporary table if exists tmp;

end#

delimiter ;

call cyberkiwi_variant();

I get the following results:

(cold and 2nd run)
+---------+---+
| prod_id | c |
+---------+---+
|  816612 | 4 |
|  931111 | 4 |
|   34957 | 4 |
|  717812 | 4 |
+---------+---+
4 rows in set (0.14 sec)

The explain plan is as follows:

+----+-------------+------------------+-------+---------------+---------+---------+------+--------+-----------------------------------------------------------+
| id | select_type | table            | type  | possible_keys | key     | key_len | ref  | rows   | Extra                                                     |
+----+-------------+------------------+-------+---------------+---------+---------+------+--------+-----------------------------------------------------------+
|  1 | SIMPLE      | product_category | range | PRIMARY       | PRIMARY | 3       | NULL | 194622 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+------------------+-------+---------------+---------+---------+------+--------+-----------------------------------------------------------+

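So all of the approaches tested seem to have roughly the same runtime, between 0.14 and 0.18 seconds, which seems pretty efficient to me considering the size of the table and the number of rows queried.

Hope this helps - http://dev.mysql.com/doc/refman/5.0/en/innodb-index-types.html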

Answer 5

If I understand your requirement, you could try something like this:

select seid, tiid, count(*) from items where tiid in (1,2,3)
group by seid, tiid
order by seid

5 answers

Answer 1
2 2011-01-12 21:48:57

Answer 2
2 accepted 2011-01-12 23:07:13

Answer 3
1 2011-01-12 21:44:43

Answer 4
1 2011-01-13 06:27:57

Answer 5
0 2011-01-12 21:40:56
