Does the order of fields in a WHERE clause affect performance in MySQL?

I have two indexed fields in a table: type and userid (individual indexes, not a composite).

The type field's values are very limited (let's say only 0 or 1), so 50% of the table's records have the same type. The userid values, on the other hand, come from a much larger set, so the number of records with the same userid is small.

Will either of these queries run faster than the other?

select * from table where type=1 and userid=5
select * from table where userid=5 and type=1

Also, if neither field were indexed, would that change the behavior?

SQL was designed to be a declarative language, not a procedural one. So the query optimizer should not consider the order of the WHERE clause predicates in determining how to apply them.

I'm probably going to waaaay over-simplify the following discussion of an SQL query optimizer. I wrote one years ago, along these lines (it was tons of fun!). If you really want to dig into modern query optimization, see Dan Tow's SQL Tuning, from O'Reilly.

In a simple SQL query optimizer, the SQL statement first gets compiled into a tree of relational algebra operations. These operations each take one or more tables as input and produce another table as output. Scan is a sequential scan that reads a table in from the database. Sort produces a sorted table. Select produces a table whose rows are selected from another table according to some selection condition. Project produces a table with only certain columns of another table. Cross Product takes two tables and produces an output table composed of every conceivable pairing of their rows.
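As a rough illustration of these operators (not how any real engine implements them), each one can be sketched as a small Python function that takes rows in and hands rows out. The function names, the dict-per-row representation, and the sample data are all assumptions made up for this sketch.

```python
def scan(table):
    """Scan: read every row of a stored table, one after another."""
    yield from table

def sort(rows, key):
    """Sort: produce the input rows in sorted order."""
    return sorted(rows, key=key)

def select(rows, predicate):
    """Select: keep only the rows that satisfy the selection condition."""
    return (row for row in rows if predicate(row))

def project(rows, columns):
    """Project: keep only the named columns of each row."""
    return ({c: row[c] for c in columns} for row in rows)

def cross_product(left, right):
    """Cross Product: pair every left row with every right row."""
    right = list(right)  # the right input is re-read for each left row
    return ({**l, **r} for l in left for r in right)

# Hypothetical sample tables; column names carry a prefix to avoid clashes.
employees = [{"E.id": 1, "E.name": "Ann"}, {"E.id": 2, "E.name": "Bob"}]
departments = [{"D.name": "Sales"}, {"D.name": "Ops"}, {"D.name": "IT"}]

pairs = list(cross_product(scan(employees), scan(departments)))
names = list(project(select(scan(employees), lambda r: r["E.id"] == 2),
                     ["E.name"]))
print(len(pairs), names)
```

Because every operator consumes rows and produces rows, they nest freely, which is what lets an optimizer rearrange the tree.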

Confusingly, the SQL SELECT clause is compiled into a relational algebra Project, while the WHERE clause turns into a relational algebra Select. The FROM clause turns into one or more Joins, each taking two tables in and producing one table out. There are other relational algebra operations involving set union, intersection, difference, and membership, but let's keep this simple.

This tree really needs to be optimized. For example, if you have:

select E.name, D.name 
from Employee E, Department D 
where E.id = 123456 and E.dept_id = D.dept_id

with 5,000 employees in 500 departments, executing an unoptimized tree will blindly produce all possible combinations of one Employee and one Department (a Cross Product) and then Select out just the one combination that was needed. The Scan of Employee will produce a 5,000 record table, the Scan of Department will produce a 500 record table, the Cross Product of those two tables will produce a 2,500,000 record table, and the Select on E.id will take that 2,500,000 record table and discard all but one, the record that was wanted.
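To make the row counts concrete, here is a minimal sketch of the unoptimized plan in Python. The table contents are invented (ids are chosen so that employee 123456 exists, and departments are assigned round-robin); only the sizes, 5,000 and 500, come from the example above.

```python
from itertools import product

# Hypothetical data: 5,000 employees spread over 500 departments.
employees = [{"id": 120000 + i, "name": f"emp{i}", "dept_id": i % 500}
             for i in range(5000)]
departments = [{"dept_id": d, "name": f"dept{d}"} for d in range(500)]

rows_built = 0
result = []
# Cross Product first, then Select: every pairing is built before filtering.
for e, d in product(employees, departments):
    rows_built += 1
    if e["id"] == 123456 and e["dept_id"] == d["dept_id"]:
        result.append((e["name"], d["name"]))

print(rows_built, result)
```

The loop builds 5,000 x 500 pairings to keep a single one, which is the waste the optimizations described next are meant to remove.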

[Real query processors will try not to materialize all of these intermediate tables in memory, of course.]

So the query optimizer walks the tree and applies various optimizations. One is to break up each Select into a chain of Selects, one for each of the original Select's top level conditions, the ones and-ed together. (This is called "conjunctive normal form".) Then the individual smaller Selects are moved around in the tree and merged with other relational algebra operations to form more efficient ones.

In the above example, the optimizer first pushes the Select on E.id = 123456 down below the expensive Cross Product operation. This means the Cross Product just produces 500 rows (one for each pairing of that employee with a department). Then the top level Select for E.dept_id = D.dept_id filters out the 499 unwanted rows. Not bad.
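The effect of pushing the Select down can be sketched the same way. This is an illustrative Python version with made-up table contents (only the sizes 5,000 and 500 and the id 123456 come from the example), not a description of any particular engine.

```python
# Hypothetical data: 5,000 employees spread over 500 departments.
employees = [{"id": 120000 + i, "name": f"emp{i}", "dept_id": i % 500}
             for i in range(5000)]
departments = [{"dept_id": d, "name": f"dept{d}"} for d in range(500)]

# Select on E.id runs first, below the Cross Product.
selected = [e for e in employees if e["id"] == 123456]

# The Cross Product now pairs that single employee with each department.
pairs = [(e, d) for e in selected for d in departments]

# The top level Select keeps the pairing whose dept_id values match.
result = [(e["name"], d["name"]) for e, d in pairs
          if e["dept_id"] == d["dept_id"]]

print(len(selected), len(pairs), len(result))
```

The Employee table is still scanned in full here; the saving is in the size of the Cross Product, which shrinks from 2,500,000 rows to 500.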

If there's an index on Employee's id field, then the optimizer can combine the Scan of Employee with the Select on E.id = 123456 to form a fast index Lookup. This means that only one Employee row is read into memory from disk instead of 5,000. Things are looking up.

The final major optimization is to take the Select on E.dept_id = D.dept_id and combine it with the Cross Product. This turns it into a relational algebra Equijoin operation. This doesn't do much by itself. But if there's an index on Department.dept_id, then the lower level sequential Scan of Department feeding the Equijoin can be turned into a very fast index Lookup of our one employee's Department record.
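A loose sketch of the fully optimized plan, with Python dicts standing in for the two indexes. That substitution is an assumption made for brevity: a real database index is typically a B-tree on disk, and the table contents below are invented.

```python
# Hypothetical data: 5,000 employees spread over 500 departments.
employees = [{"id": 120000 + i, "name": f"emp{i}", "dept_id": i % 500}
             for i in range(5000)]
departments = [{"dept_id": d, "name": f"dept{d}"} for d in range(500)]

# Built once, ahead of time, the way an index is maintained by the database.
employee_by_id = {e["id"]: e for e in employees}
department_by_dept_id = {d["dept_id"]: d for d in departments}

# Index Lookup on Employee.id replaces Scan + Select.
employee = employee_by_id[123456]

# The Equijoin becomes an index Lookup on Department.dept_id.
department = department_by_dept_id[employee["dept_id"]]

result = (employee["name"], department["name"])
print(result)
```

At query time only two rows are touched, one per lookup, instead of millions of pairings.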

Lesser optimizations involve pushing Project operations down. If the top level of your query just needs E.name and D.name, and the conditions need E.id, E.dept_id, and D.dept_id, then the Scan operations don't have to build intermediate tables with all the other columns, saving space during the query execution. We've turned a horribly slow query into two index lookups and not much else.

Getting closer to the original question, let's say you've got:

select E.name 
from Employee E 
where E.age > 21 and E.state = 'Delaware'

The unoptimized relational algebra tree, when executed, would Scan in the 5,000 employees and produce, say, the 126 in Delaware who are older than 21. The query optimizer also has some rough idea of the values in the database. It might know that the E.state column has the 14 states that the company has locations in, and something about the E.age distribution. So first it sees if either field is indexed. If E.state is, it makes sense to use that index to pick out just the small number of employees the query processor suspects are in Delaware, based on its last computed statistics. If only E.age is, the query processor likely decides that it's not worth it, since 96% of all employees are 22 and older. So if E.state is indexed, our query processor breaks up the Select and merges the E.state = 'Delaware' condition with the Scan to turn it into a much more efficient Index Scan.

Let's say in this example that there are no indexes on E.state and E.age. The combined Select operation takes place after the sequential Scan of Employee. Does it make a difference which condition in the Select is done first? Probably not a great deal. The query processor might leave them in the original order in the SQL statement, or it might be a bit more sophisticated and look at the expected expense. From the statistics, it would again find that the E.state = 'Delaware' condition should be more highly selective, so it would reverse the conditions and do that first, so that there are only 126 E.age > 21 comparisons instead of 5,000. Or it might realize that string equality comparisons are much more expensive than integer compares and leave the order alone.
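The effect of evaluation order on an unindexed Select can be sketched by counting comparisons. The data below is invented (14 evenly used states and ages cycling from 18 to 67), so the counts will not match the 126 in the example; the helper name count_comparisons is also just for this sketch.

```python
STATES = ["Delaware"] + [f"State{n}" for n in range(1, 14)]  # 14 states

# Hypothetical employees with evenly spread states and ages.
employees = [{"state": STATES[i % 14], "age": 18 + i % 50}
             for i in range(5000)]

def is_delaware(row):
    return row["state"] == "Delaware"

def over_21(row):
    return row["age"] > 21

def count_comparisons(rows, first, second):
    """Evaluate `first AND second` with short-circuiting, counting each test."""
    first_checks = second_checks = matches = 0
    for row in rows:
        first_checks += 1
        if first(row):
            second_checks += 1  # only reached when the first test passed
            if second(row):
                matches += 1
    return first_checks, second_checks, matches

state_first = count_comparisons(employees, is_delaware, over_21)
age_first = count_comparisons(employees, over_21, is_delaware)
print("state first:", state_first)
print("age first:  ", age_first)
```

Both orders return the same rows. Putting the more selective condition first only reduces how often the second condition has to be evaluated, and whether that matters depends on how expensive each comparison is.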

At any rate, all this is very complex and your syntactic condition order is very unlikely to make a difference. I wouldn't worry about it unless you have a real performance problem and your database vendor uses the condition order as a hint.

Most query optimizers use the order in which conditions appear as a hint. If everything else is equal, they will follow that order.

However, many things can override that:

  • the second field has an index and the first does not
  • there are statistics to suggest that the second field is more selective
  • the second field is easier to search (varchar(max) vs int)

So (and this is true for all SQL optimization questions) unless you observe a performance issue, it's better to optimize for clarity, not for (imagined) performance.

It shouldn't in your small example. The query optimizer should do the right thing. You can check for sure by adding explain to the front of the query. MySQL will tell you how it's joining things together and how many rows it needs to search in order to do the join. For example:

explain select * from table where type=1 and userid=5
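If you want to try the same kind of check without a MySQL server at hand, here is a rough sketch using Python's built-in sqlite3 module. Note the assumptions: this is SQLite, not MySQL, so the planner and the output format differ (SQLite's closest equivalent is EXPLAIN QUERY PLAN), and the table name records and the sample data are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (id INTEGER PRIMARY KEY, type INTEGER, userid INTEGER)"
)
# Hypothetical data: type is 0 or 1, userid takes 1,000 distinct values.
conn.executemany(
    "INSERT INTO records (type, userid) VALUES (?, ?)",
    [(i % 2, i % 1000) for i in range(10000)],
)
# Two individual indexes, not a composite, as in the question.
conn.execute("CREATE INDEX idx_type ON records (type)")
conn.execute("CREATE INDEX idx_userid ON records (userid)")
conn.execute("ANALYZE")  # gather statistics for the planner

queries = [
    "SELECT * FROM records WHERE type = 1 AND userid = 5",
    "SELECT * FROM records WHERE userid = 5 AND type = 1",
]
plans, results = [], []
for q in queries:
    # The last column of each EXPLAIN QUERY PLAN row is the readable detail.
    plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + q)]
    plans.append(plan)
    results.append(sorted(conn.execute(q).fetchall()))
    print(q, "->", plan)
```

Comparing the two printed plans shows whether the planner treats the two orderings differently for this data. The same approach applies to MySQL by running explain on each version of the query.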

If the fields were not indexed, the behavior would probably change.


 