Why does greater-than versus equals make a difference in MySQL SELECT?

I have a large MyISAM table. It's approaching 1 million rows. It's basically a list of items and some information about them.

There are two indices:

  • primary: the item ID
  • date (date) and col (int).

I run two queries:

SELECT * FROM table WHERE date = '2011-02-01' AND col < 5 LIMIT 10

SELECT * FROM table WHERE date < '2011-02-01' AND col < 5 LIMIT 10

The first one finishes in ~0.0005 seconds and the second in ~0.05 seconds. That is a 100x difference. Is it wrong for me to expect both of these to run at roughly the same speed? I must not be understanding the indices very well. How can I speed up the second query?

Regardless of MySQL, it boils down to basic algorithm theory.

Greater-than and less-than operations on a large set are slower than identity operations. With a large data set, an ideal data structure for determining less-than or greater-than is a self-balancing tree (binary or n-ary). On a self-balancing tree, the worst case to locate the boundary of a less/greater search is O(log n); reading the matching entries then costs time proportional to how many match.

The ideal data structure for identity lookup is a hash table. The performance of hash tables is generally O(1), i.e. constant time. A hash table, however, is not good for greater/less comparisons.
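The contrast between the two structures can be sketched in Python. This is only an illustration of the general argument, not of how MySQL is implemented: a dict stands in for a hash index, a sorted list searched with bisect stands in for a balanced tree, and the keys, row names, and helper functions are all made up for the example.

```python
import bisect

# Hypothetical data: 1,000 integer keys, 0..999.
keys = list(range(1000))

# Hash-style structure: a dict gives average O(1) identity lookup.
hash_index = {k: f"row{k}" for k in keys}

# Tree-style stand-in: a sorted list searched with bisect
# (O(log n) to find a position, like descending a balanced tree).
sorted_index = sorted(keys)

def equal_hash(k):
    """Identity lookup: a single hash probe."""
    return hash_index.get(k)

def less_than_sorted(k):
    """Range lookup: binary-search the boundary, then take everything before it."""
    pos = bisect.bisect_left(sorted_index, k)
    return sorted_index[:pos]

def less_than_hash(k):
    """A hash has no ordering, so a range query has to test every key."""
    return sorted(x for x in hash_index if x < k)

print(equal_hash(42))                                # row42
print(len(less_than_sorted(42)))                     # 42
print(less_than_hash(42) == less_than_sorted(42))    # True
```

Both range functions return the same keys, but the sorted structure finds the boundary in O(log n) and then reads only the matches, while the hash version has to visit all 1,000 keys.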

Generally a well-balanced tree performs only slightly worse than a hash table (which is how Haskell gets away with using a tree for hashtables).

Thus, regardless of what MySQL does, it's no surprise that < or > is slower than =.

Old answer below:

Because the first one uses '=', it is like a hash table lookup (particularly if your index is a hash table), so it will be faster than the second one, which might work better with a tree-like index.

Since MySQL allows you to configure the index format, you can try changing that, but I'm rather sure the first will always run faster than the second.

I'm assuming you have an index on the date column. The first query uses the index; the second query probably does a linear scan (at least over part of the data). A direct fetch is always faster than a linear scan.

MySQL stores its indexes in a BTREE by default. No hashing in general.

The short answer for the performance difference is that the < form evaluates more nodes than the = form.

The index that you've got on there (date, col) stores the values roughly like a phone book:

2011-01-01, col=1, row_ptr
2011-01-01, col=2, row_ptr
2011-01-01, col=3, row_ptr
etc...
2011-02-01, col=1, row_ptr
2011-02-01, col=2, row_ptr
2011-02-01, col=3, row_ptr
etc...
2011-02-02, col=1, row_ptr
2011-02-02, col=2, row_ptr
etc...

...in ascending sorted tree nodes of size B, where (2011-01-01, col=1) < (2011-01-01, col=2) < (2011-01-02, col=1).
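The ordering above is the same lexicographic ordering Python applies to tuples, so a small sketch can show it. The entries and row pointer names below are hypothetical; only the (date, col) key order comes from the description above.

```python
from datetime import date

# Hypothetical index entries: (date, col, row_ptr). Row pointers are made up.
entries = [
    (date(2011, 2, 1), 2, "ptr_e"),
    (date(2011, 1, 1), 2, "ptr_b"),
    (date(2011, 2, 2), 1, "ptr_g"),
    (date(2011, 1, 1), 1, "ptr_a"),
    (date(2011, 2, 1), 1, "ptr_d"),
]

# Tuples compare element by element: date first, then col breaks ties.
index = sorted(entries)

for d, col, ptr in index:
    print(d.isoformat(), col, ptr)
# 2011-01-01 1 ptr_a
# 2011-01-01 2 ptr_b
# 2011-02-01 1 ptr_d
# 2011-02-01 2 ptr_e
# 2011-02-02 1 ptr_g
```

All entries for one date sit next to each other, ordered by col within that date. Entries with a small col are therefore scattered across the whole index, one small group per date.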

Your question is essentially asking the difference between:

  1. Find all phone numbers with last name 'Smith' and first name starting with 'A'.
  2. Find all phone numbers that come before 'Smith' and have first name starting with 'A'.

It should be obvious why #1 is so much faster than #2.
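To put rough numbers on the phone-book comparison, here is a minimal Python model that counts how many index entries each query shape has to look at. It assumes made-up data (60 days, col values 0 to 99 on every day), leaves out LIMIT, and uses a sorted list with bisect in place of a real B-tree, so it says nothing about what MySQL's optimizer actually chooses.

```python
import bisect
from datetime import date, timedelta

# Hypothetical data: 60 days from 2011-01-01, col values 0..99 on each day.
start = date(2011, 1, 1)
index = sorted(
    (start + timedelta(days=d), col)
    for d in range(60)
    for col in range(100)
)

def equality_query(target, col_limit):
    """date = target AND col < col_limit: one contiguous slice of the index."""
    # (target,) sorts before every (target, col), so this is the first entry for target.
    lo = bisect.bisect_left(index, (target,))
    hi = bisect.bisect_left(index, (target, col_limit))
    return index[lo:hi], hi - lo          # (matching keys, entries examined)

def range_query(target, col_limit):
    """date < target AND col < col_limit: walk every earlier entry and filter on col."""
    end = bisect.bisect_left(index, (target,))
    matches = [entry for entry in index[:end] if entry[1] < col_limit]
    return matches, end                   # (matching keys, entries examined)

target = date(2011, 2, 1)
eq_rows, eq_examined = equality_query(target, 5)
rg_rows, rg_examined = range_query(target, 5)

print(len(eq_rows), eq_examined)   # 5 5
print(len(rg_rows), rg_examined)   # 155 3100
```

With the date fixed, the col < 5 condition narrows the search to one small contiguous slice. With a range on date, the matching col values are spread across every earlier date, so the col condition can only be applied as a filter while walking the entries.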

There are also considerations of memory/disk transfer efficiency and heap allocations (= does WAY fewer transfers than <) that account for a not-insignificant amount of time but depend largely on the distribution of the data and the specific location of the (2011-02-01, col=min(col)) key record.

The first one performs a seek over the data, whereas the second one goes for a scan. Scans are always costlier than seeks, hence the time difference.

It's like this: a scan means running through all the pages of the book, whereas a seek jumps directly to a page number.

Hope this might help.


 