SQL Query Joining tables and having clause

I am writing a SQL query to pull the home run totals in the age-35 season for players who have hit over 500 home runs in their career.

SELECT b.playerID, b.yearID, b.HR
FROM batting b
JOIN master m ON m.playerID = b.playerID
WHERE b.yearID - m.birthYear = '35'
HAVING SUM(b.HR) > 500

This query times out while executing. I have successfully written a query that returns a player's home run total in a specific age season, and another that returns the players in the 500 home run club.

When I try to combine them, something makes the query time out, and I cannot determine why.

Here is a query that works well:

SELECT b.playerID, b.yearID, b.HR
FROM batting b
JOIN MASTER M ON b.playerID = m.playerID
WHERE b.yearID - m.birthYear = 35 AND b.yearID = 2015
ORDER BY b.HR DESC

Now, if only I could restrict this result to players who have hit 500 career home runs, so that it shows the 2015 HR totals of 500 home run hitters only.

The most likely explanation is that the execution plan chosen by the optimizer isn't as efficient as it could be.

What we're not seeing is which indexes are available on these tables.

One thing that stands out about the query is this:

WHERE b.yearID - m.birthYear = '35'

MySQL is going to take every row from master with a given playerID and match it to every row from batting with that same playerID (because of the equality join predicate)

  ON m.playerID = b.playerID

And then MySQL has to take that set of combined rows and calculate this expression

    b.yearID - m.birthYear

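And then take the result of that and compare it to '35'. Because the comparison is made against an expression computed from columns of both tables, an index on yearID cannot be used to narrow down the rows from batting.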

Assuming that the playerID column is unique on the master table, we'd prefer to see the query predicates written in a form that can take advantage of an index on batting that has leading columns of (playerID, yearID):

 SELECT b.playerid
      , b.yearid
      , SUM(b.hr) AS hr
   FROM master m
   JOIN batting b
     ON b.playerid = m.playerid
    AND b.yearid   = m.birthyear + 35
  GROUP BY b.playerid, b.yearid
HAVING SUM(b.hr) > 500
 ORDER BY SUM(b.hr) DESC

To get one row returned for each player, you are going to need a GROUP BY clause. And to get the total home runs, you are going to need a SUM() aggregate in the SELECT list.

For optimal performance of the query, you'd want a covering index

... ON batting (playerid, yearid, hr)
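Spelled out as a full statement, the index definition might look like the sketch below. The index name batting_player_year_hr is hypothetical; only the table and column list come from the answer.

```sql
-- Hypothetical index name; the column order (playerid, yearid, hr) is the one
-- suggested above, so the join can seek on playerid + yearid and read hr
-- straight from the index without visiting the table rows.
CREATE INDEX batting_player_year_hr
    ON batting (playerid, yearid, hr);
```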

If playerid is not unique on the master table, then the query isn't going to guarantee the value you expect for SUM(b.hr); the value could be double, triple, etc. of what is expected.
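One way to check that assumption, as a quick sketch, is to look for playerid values that occur more than once in master. If this returns no rows, the join cannot multiply batting rows.

```sql
-- Lists any playerid that appears more than once in master.
-- An empty result means playerid is unique in the current data
-- (it does not prove that a UNIQUE or PRIMARY KEY constraint exists).
SELECT m.playerid
     , COUNT(*) AS cnt
  FROM master m
 GROUP BY m.playerid
HAVING COUNT(*) > 1;
```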

Use EXPLAIN to see the execution plan.
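For example, prefixing the join with EXPLAIN shows which indexes MySQL considers and picks for each table. This is only a sketch using the table and column names from the question; the output depends entirely on the indexes that exist in your schema.

```sql
-- Look at the "key" and "rows" columns of the output: a NULL key on batting
-- together with a large row estimate suggests that no suitable index is used.
EXPLAIN
SELECT b.playerid
     , b.yearid
     , b.hr
  FROM master m
  JOIN batting b
    ON b.playerid = m.playerid
   AND b.yearid   = m.birthyear + 35;
```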

Also beware of implicit datatype conversions that can have a negative impact on an execution plan. We're assuming that the datatype of the playerid column in both tables matches, and that the datatype of the yearid and birthyear columns is numeric.
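Those datatype assumptions can be checked directly. A minimal sketch, assuming the tables are named as in the question:

```sql
-- Compare the Type reported for playerID in both tables, and confirm that
-- yearID and birthYear are numeric types rather than character types.
SHOW COLUMNS FROM master;
SHOW COLUMNS FROM batting;
```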

EDIT

My original answer was focused on the reasons your query "timed out", and I missed the specification for the result you want to achieve:

Return players that have a career HR total over 500, and return the total HR for a specific year for each of those players.

(I'll set aside a discussion of appropriately determining the "year" in which a player turns 35 years of age, and use the criteria from the original query.)

One approach is to use conditional aggregation. Use an expression that returns HR when a condition is TRUE, and otherwise returns 0 or NULL. And then wrap that expression in a SUM aggregate in the SELECT list.

If we want to return only rows for players with a career HR total over 500 that also have at least one row in batting for the specified year...

 SELECT b.playerid
      , MAX(IF(b.yearid = m.birthyear + 35,b.yearid,NULL)) AS yearid
      , SUM(IF(b.yearid = m.birthyear + 35, b.hr, 0)) AS year_hr
   FROM master m
   JOIN batting b
     ON b.playerid = m.playerid
  GROUP BY b.playerid
HAVING SUM(b.hr) > 500
   AND MAX(IF(b.yearid = m.birthyear + 35,b.yearid,NULL)) IS NOT NULL
 ORDER BY ... 

To return rows for every player that has a career HR total over 500, even when there are no rows in batting for the specified yearid, we can tweak the query to omit the second condition in the HAVING clause, and use the expression m.birthyear + 35 in the SELECT list

 SELECT b.playerid
      , MAX(m.birthyear + 35) AS yearid
      , SUM(IF(b.yearid = m.birthyear + 35, b.hr, 0)) AS year_hr
   FROM master m
   JOIN batting b
     ON b.playerid = m.playerid
  GROUP BY b.playerid
HAVING SUM(b.hr) > 500
 ORDER BY ... 

Note that players that have a career HR total of exactly 500 will be excluded, because the HAVING condition uses > rather than >=.
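To see how the conditional aggregation behaves, here is a small self-contained sketch. The tables master_demo and batting_demo and every value in them are invented for illustration (they are not real player data); the SELECT is the second query above pointed at those toy tables, with a concrete ORDER BY added.

```sql
-- Toy tables and values, invented for illustration only.
CREATE TEMPORARY TABLE master_demo (
  playerid  VARCHAR(10) PRIMARY KEY,
  birthyear INT
);
CREATE TEMPORARY TABLE batting_demo (
  playerid VARCHAR(10),
  yearid   INT,
  hr       INT
);

INSERT INTO master_demo VALUES ('a', 1970), ('b', 1980), ('c', 1960);
INSERT INTO batting_demo VALUES
  ('a', 2004, 300), ('a', 2005, 40), ('a', 2006, 200),  -- career total 540
  ('b', 2015, 30),  ('b', 2016, 20),                     -- career total 50
  ('c', 1990, 501);                                      -- career total 501, no age-35 row

-- Career total is checked in HAVING; the age-35 season total is picked out
-- by the IF() inside SUM().
SELECT b.playerid
     , MAX(m.birthyear + 35) AS yearid
     , SUM(IF(b.yearid = m.birthyear + 35, b.hr, 0)) AS year_hr
  FROM master_demo m
  JOIN batting_demo b
    ON b.playerid = m.playerid
 GROUP BY b.playerid
HAVING SUM(b.hr) > 500
 ORDER BY b.playerid;
```

With this toy data the query returns player a with yearid 2005 and year_hr 40, and player c with yearid 1995 and year_hr 0. Player b is filtered out by the HAVING clause because the career total is only 50.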
