
Are full count queries really so slow on large MySQL InnoDB tables?

We have large tables with millions of entries. A full count is pretty slow; see the output below. Is this common for a MySQL InnoDB table? Is there no way to speed it up? Even with the query cache it's still "slow". I also wonder why the count on the "communication" table with 2.8 mio entries is slower than the count on "transaction" with 4.5 mio entries.

I know that it's much faster with a WHERE clause. I just want to know if the bad performance is normal.

We are using Amazon RDS MySQL 5.7 on an m4.xlarge (4 CPU, 16 GB RAM, 500 GB storage). I've also tried bigger instances with more CPU and RAM, but there was no big change in the query times.

mysql> SELECT COUNT(*) FROM transaction;
+----------+
| COUNT(*) |
+----------+
|  4569880 |
+----------+
1 row in set (1 min 37.88 sec)

mysql> SELECT COUNT(*) FROM transaction;
+----------+
| count(*) |
+----------+
|  4569880 |
+----------+
1 row in set (1.44 sec)

mysql> SELECT COUNT(*) FROM communication;
+----------+
| count(*) |
+----------+
|  2821486 |
+----------+
1 row in set (2 min 19.28 sec)

This is the downside of using a database storage engine that supports multi-version concurrency control (MVCC).

InnoDB allows your query to be isolated in a transaction, without blocking other concurrent clients who are reading and writing rows of data. Those concurrent updates don't affect the view of data your transaction has.

But what is the count of rows in the table, given that many rows are in the process of being added or deleted while you're doing the count? The answer is fuzzy.

Your transaction shouldn't be able to "see" row versions that were created after your transaction started. Likewise, your transaction should still count rows that someone else deleted, if they did so after your transaction started.

The answer is that when you do a SELECT COUNT(*) (or any other type of query that needs to examine many rows), InnoDB has to visit every row to find which version of that row, if any, is visible to your transaction's view of the database, and count it if it's visible.
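The visibility rule can be observed with two client connections. This is only a sketch: it assumes the default REPEATABLE READ isolation level, autocommit enabled in session B, and a hypothetical id column and value on the transaction table from the question.

```sql
-- Session A: open a consistent snapshot, then count.
START TRANSACTION WITH CONSISTENT SNAPSHOT;
SELECT COUNT(*) FROM transaction;        -- call the result N

-- Session B (a different connection, autocommit assumed on):
DELETE FROM transaction WHERE id = 1;    -- hypothetical id; commits immediately

-- Session A again: the snapshot still contains the deleted row,
-- so InnoDB must check each row version and still reports N.
SELECT COUNT(*) FROM transaction;
COMMIT;

-- Session A, after its transaction ended: a new snapshot is used,
-- so the delete from session B is now reflected in the count.
SELECT COUNT(*) FROM transaction;
```

Both sessions get a correct answer for their own snapshot, which is why InnoDB cannot keep one stored row count for everyone.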

In a storage engine that doesn't support transactions or concurrent updates, like MyISAM, the engine keeps the total count of rows as metadata for the table. Because it can't have multiple threads updating rows concurrently, the total count of rows is less fuzzy. So when you request SELECT COUNT(*) from a MyISAM table, it just returns the count of rows it has in memory. This shortcut doesn't help if you do SELECT COUNT(*) with a WHERE clause to count some subset of rows; in that case it has to actually count them.

In general, most people find that InnoDB's support for concurrent updates is worth a lot, and they are willing to sacrifice the optimization of SELECT COUNT(*).

In addition to what Bill says...

Smallest index

InnoDB picks the 'smallest' index for doing COUNT(*). It could be that all of the indexes of communication are bigger than the smallest index of transaction, hence the time difference. When judging the size of a secondary index, include the PRIMARY KEY column(s), since InnoDB stores them in every secondary index entry:

PRIMARY KEY(id),   -- INT (4 bytes)
INDEX(flag),       -- TINYINT (1 byte)
INDEX(name),       -- VARCHAR(255) (? bytes)

For measuring size, the PRIMARY KEY is big, since it includes (due to clustering) all the columns of the table. INDEX(flag) is "5 bytes": 1 byte for flag plus 4 bytes for the id it carries. INDEX(name) probably averages a few dozen bytes. SELECT COUNT(*) will clearly pick INDEX(flag).

Apparently transaction has a 'small' index, but communication does not.
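One way to check this guess, sketched under a few assumptions: the schema name mydb is hypothetical, InnoDB persistent statistics are enabled (the default in 5.7), and your account may read mysql.innodb_index_stats. The sizes are estimates taken from statistics, not exact measurements.

```sql
-- Which index does the optimizer choose for the full count?
-- Look at the "key" column of the output.
EXPLAIN SELECT COUNT(*) FROM communication;

-- Approximate on-disk size of every index of both tables.
-- For stat_name = 'size', stat_value is the number of pages in the index.
SELECT table_name,
       index_name,
       stat_value * @@innodb_page_size AS approx_bytes
FROM   mysql.innodb_index_stats
WHERE  database_name = 'mydb'            -- hypothetical schema name
  AND  table_name IN ('transaction', 'communication')
  AND  stat_name = 'size'
ORDER  BY table_name, approx_bytes;
```

If the smallest index of communication turns out to be much larger than the smallest index of transaction, that would be consistent with the timings in the question.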

TEXT / BLOB columns are sometimes stored "off-record". When they are, they do not count in the size of the PK index.

Query Cache

If the "Query cache" is turned on, the second run of a query may be immensely faster than the first. But that is only true if there were no changes to the table in the meantime. Since any change to the table invalidates all QC entries for that table, the QC is rarely useful in production systems. By "faster" I mean on the order of 0.001 seconds, not 1.44 seconds.

The difference between 1m38s and 1.44s is probably due to what was cached in the buffer_pool, the general caching area for InnoDB. The first run probably found none of the 'smallest' index in RAM, so it did a lot of I/O, taking 98 seconds to fetch all 4.5M rows of that index. The second run found all that data cached in the buffer_pool, so it ran at CPU speed (no I/O), hence much faster.
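To see whether a given run was served from disk or from the buffer_pool, you could compare InnoDB's read counters before and after the query. This is a rough sketch: the counters are server-wide, so other traffic on the instance will blur the numbers.

```sql
-- How big is the buffer pool?
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- Is the query cache involved at all?
SHOW VARIABLES LIKE 'query_cache_type';

-- Note the counters, run the count, then read the counters again.
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_requests';  -- logical read requests
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads';          -- reads that had to go to disk

SELECT SQL_NO_CACHE COUNT(*) FROM transaction;               -- bypass the query cache

SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_requests';
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads';
```

A large jump in Innodb_buffer_pool_reads points to a cold index that was read from disk. A jump only in Innodb_buffer_pool_read_requests points to a run served from RAM.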

Good Enough

In situations like this, I question the necessity of doing the COUNT(*) at all. Notice how you said "2.8 mio entries", as if 2 significant digits were "good enough". If you are displaying the count to users in a UI, won't that be "good enough"? If so, one solution to the performance problem is to do the count once a day and store it somewhere. This would allow instantaneous access to a "good enough" value.
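A minimal sketch of the once-a-day approach. The table row_count_cache and its columns are hypothetical names, and the refresh statement is assumed to be run daily by whatever scheduler you already have (a cron job or a MySQL event, for example).

```sql
-- Hypothetical helper table holding one cached count per table.
CREATE TABLE row_count_cache (
  table_name VARCHAR(64)     NOT NULL PRIMARY KEY,
  row_count  BIGINT UNSIGNED NOT NULL,
  counted_at DATETIME        NOT NULL
) ENGINE=InnoDB;

-- Daily refresh: pay for the slow full count once, then overwrite the cached row.
REPLACE INTO row_count_cache (table_name, row_count, counted_at)
SELECT 'communication', COUNT(*), NOW() FROM communication;

-- What the UI reads instead of running COUNT(*): a primary-key lookup.
SELECT row_count, counted_at
FROM   row_count_cache
WHERE  table_name = 'communication';
```

Storing counted_at alongside the value lets the UI say how old the number is.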

There are other techniques. One is to keep a counter continuously updated, either from application code or with some form of Summary Table.
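As one possible form of such a counter, here is a sketch that uses triggers. Triggers are a choice made for this illustration, not something the answer prescribes, and the table and trigger names are hypothetical. On RDS with binary logging enabled, creating triggers may also require log_bin_trust_function_creators to be set in the parameter group.

```sql
-- Hypothetical single-row summary table.
CREATE TABLE communication_counter (
  id        TINYINT UNSIGNED NOT NULL PRIMARY KEY,
  row_count BIGINT UNSIGNED  NOT NULL
) ENGINE=InnoDB;

-- Seed it once with the (slow) real count.
INSERT INTO communication_counter (id, row_count)
SELECT 1, COUNT(*) FROM communication;

-- Keep it in step with every insert and delete.
CREATE TRIGGER communication_count_ins AFTER INSERT ON communication
FOR EACH ROW
  UPDATE communication_counter SET row_count = row_count + 1 WHERE id = 1;

CREATE TRIGGER communication_count_del AFTER DELETE ON communication
FOR EACH ROW
  UPDATE communication_counter SET row_count = row_count - 1 WHERE id = 1;

-- Reading the count is now a primary-key lookup.
SELECT row_count FROM communication_counter WHERE id = 1;
```

The trade-off is that every insert and delete now also updates the same counter row, so heavy concurrent writers will queue on it. Rows written between the seed and the trigger creation are missed, so do the setup during a quiet period or re-seed afterwards.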

Throwing hardware at it

You already found that changing the hardware did not help.

  • The 98s was as fast as any of RDS's I/O offerings can run.
  • The 1.44s was as fast as any one RDS CPU can run.
  • MySQL (and its variants) do not use more than one CPU per query.
  • You had enough RAM that the entire 'small' index stayed in the buffer_pool until your second SELECT COUNT(*). (Too little RAM would have made the second run very slow as well.)
