
MySQL inefficient query on Large data set

We have a MySQL table that looks something like this (insignificant columns removed):

CREATE TABLE `my_data` (
  `auto_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `created_ts` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `updated_ts` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  `data_txt` varchar(256) CHARACTER SET utf8 NOT NULL,
  `issued_ts` timestamp NULL DEFAULT NULL,
  `account_id` int(11) NOT NULL,
  PRIMARY KEY (`auto_id`),
  KEY `account_issued_idx` (`account_id`,`issued_ts`),
  KEY `account_issued_created_idx` (`account_id`,`issued_ts`,`created_ts`),
  KEY `account_created_idx` (`account_id`,`created_ts`),
  KEY `issued_idx` (`issued_ts`)
) ENGINE=InnoDB;

We have approximately 900M rows in the table, with one account_id accounting for more than 65% of those rows. I'm being asked to write queries across date ranges for both created_ts and issued_ts that depend upon the account_id, which appears to have a 1:1 functional dependence on the auto increment key.

A typical query would look like this:

SELECT * 
FROM my_data 
WHERE account_id = 1 AND 
      created_ts > TIMESTAMP('2012-01-01') AND 
      created_ts <= TIMESTAMP('2012-01-21') 
ORDER BY created_ts DESC LIMIT 100;

An EXPLAIN on the query shows this:

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: my_data
         type: range
possible_keys: account_issued_idx, account_issued_created_idx, account_created_idx
          key: account_issued_created_idx
      key_len: 8
          ref: NULL
         rows: 365314721
        Extra: Using where

The problem is that the query takes far too long and is eventually killed. I've let it run a couple of times, and it brings down the database host because the OS (Linux) runs out of swap space.

I've researched the issue repeatedly and have tried breaking the query up into uncorrelated subqueries, forcing indexes, using an explicit SELECT clause, and limiting the window of the date range, but the result is the same: poor performance (too slow) and too taxing on the host (it invariably dies).

My question(s) are:

  1. Is it possible that a query can be formulated to slice the data into date ranges and perform acceptably for a real-time call (< 1s)?

  2. Are there optimizations that I'm missing, or that might help, to get the performance I'm being asked to deliver?

Any other suggestions, hints, or thoughts are welcomed.

Thanks

It seems MySQL uses the wrong index for this query; try forcing another:

SELECT * 
FROM my_data FORCE INDEX (`account_created_idx`)
WHERE account_id = 1 AND 
      created_ts > TIMESTAMP('2012-01-01') AND 
      created_ts <= TIMESTAMP('2012-01-21') 
ORDER BY created_ts DESC LIMIT 100;

This question is getting on in years. Still, there's a good answer.

The key to your struggle lies in your words "insignificant columns removed". There aren't any insignificant columns when you do SELECT * .... ORDER BY X DESC LIMIT N. That's because the entire result set has to be picked up and shuffled. When you ask for all the columns in a complex table, that's a lot of data.

You have a good index for the WHERE clause. It would also be good for the ORDER BY clause if that didn't say DESC in it.
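As an aside beyond the original answer: on MySQL 8.0 and later, which honor DESC in index definitions (earlier versions parse it but ignore it), a hypothetical index shaped for both the filter and the descending sort could look like this; the deferred-join approach below avoids the issue either way.

-- Hypothetical descending index; assumes MySQL 8.0+, where descending key parts are honored.
ALTER TABLE my_data
  ADD KEY account_created_desc_idx (account_id, created_ts DESC);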

What you want is a deferred join. Start by retrieving just the IDs of the rows you need.

        SELECT auto_id
          FROM my_data
         WHERE account_id = 1 AND 
              created_ts > TIMESTAMP('2012-01-01') AND 
              created_ts <= TIMESTAMP('2012-01-21') 
     ORDER BY created_ts DESC
        LIMIT 100

This will give you the list of auto_id values for the rows you need. To order this list, MySQL only has to shuffle the id and timestamp values. That's a lot less data to handle.

Then you JOIN that list of IDs to your main table and grab the results.

SELECT a.*
  FROM my_data a
  JOIN (
             SELECT auto_id
               FROM my_data
              WHERE account_id = 1 AND 
                    created_ts > TIMESTAMP('2012-01-01') AND 
                    created_ts <= TIMESTAMP('2012-01-21') 
           ORDER BY created_ts DESC
              LIMIT 100
       ) b ON a.auto_id = b.auto_id
 ORDER BY a.created_ts DESC

Try this. It will probably save you a lot of time.

If you know a priori that both auto_id and created_ts are monotone increasing, then you can do even better. Your subquery can contain

      ORDER BY auto_id DESC
         LIMIT 100

That will reduce the data you need to shuffle even further.
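Putting those pieces together, the deferred join under that monotonicity assumption might look like this (a sketch, not verified against your data):

SELECT a.*
  FROM my_data a
  JOIN (
           SELECT auto_id
             FROM my_data
            WHERE account_id = 1 AND
                  created_ts > TIMESTAMP('2012-01-01') AND
                  created_ts <= TIMESTAMP('2012-01-21')
         ORDER BY auto_id DESC  -- sort on the primary key instead of created_ts
            LIMIT 100
       ) b ON a.auto_id = b.auto_id
 ORDER BY a.created_ts DESC;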

Pro tip: avoid SELECT * in production systems; instead enumerate the columns you actually need. There are lots of reasons for this.
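For instance, assuming the application only consumes the timestamps and the data column (an assumption for illustration), the query would list just those:

-- Only the columns actually needed, instead of SELECT *:
SELECT auto_id, created_ts, issued_ts, data_txt
  FROM my_data
 WHERE account_id = 1 AND
       created_ts > TIMESTAMP('2012-01-01') AND
       created_ts <= TIMESTAMP('2012-01-21')
 ORDER BY created_ts DESC
 LIMIT 100;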

Try MariaDB (or MySQL 5.6), as their optimizer can do this faster. I have been using it for some months, and for some queries like yours it's 1000% faster.

You need Index Condition Pushdown: http://kb.askmonty.org/en/index-condition-pushdown/
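To check whether that optimization is available and enabled on your server (it is on by default in MariaDB 5.3+ and MySQL 5.6+; verify against your version), something like this should work:

-- Inspect the optimizer flags; look for index_condition_pushdown=on.
SELECT @@optimizer_switch;

-- Enable it for the current session if it is off.
SET optimizer_switch = 'index_condition_pushdown=on';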

Do not use functions in the comparison. Calculate the timestamps and use the computed values; otherwise you can't use the index to compare created_ts, and that's the field that will filter millions of rows out of the result set.
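A minimal sketch of that suggestion, with the range bounds written as precomputed literals instead of TIMESTAMP() calls:

-- Range bounds as plain literals computed ahead of time:
SELECT *
  FROM my_data
 WHERE account_id = 1 AND
       created_ts > '2012-01-01 00:00:00' AND
       created_ts <= '2012-01-21 00:00:00'
 ORDER BY created_ts DESC
 LIMIT 100;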

I'm not sure why MySQL uses the (obviously) not-best index. Besides forcing the index, can you try the EXPLAIN plan on this variation:

SELECT * 
FROM my_data 
WHERE account_id = 1 AND 
      created_ts > TIMESTAMP('2012-01-01') AND 
      created_ts <= TIMESTAMP('2012-01-21') 
ORDER BY account_id
       , created_ts DESC 
LIMIT 100;
