Most efficient way to select one row from the "many" table of a one-to-many relationship in MySQL

Question

Let's say I have the following data in two tables with a one-to-many relationship, city and person:

SELECT city.*, person.* FROM city, person WHERE city.city_id = person.person_city_id;
+---------+-------------+-----------+-------------+----------------+
| city_id | city_name   | person_id | person_name | person_city_id |
+---------+-------------+-----------+-------------+----------------+
|       1 | chicago     |         1 | charles     |              1 |
|       1 | chicago     |         2 | celia       |              1 |
|       1 | chicago     |         3 | curtis      |              1 |
|       1 | chicago     |         4 | chauncey    |              1 |
|       2 | new york    |         5 | nathan      |              2 |
|       3 | los angeles |         6 | luke        |              3 |
|       3 | los angeles |         7 | louise      |              3 |
|       3 | los angeles |         8 | lucy        |              3 |
|       3 | los angeles |         9 | larry       |              3 |
+---------+-------------+-----------+-------------+----------------+
9 rows in set (0.00 sec)
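
To reproduce the data above, a minimal setup sketch follows. Only the table names, column names, and rows come from the output shown; the column types, the primary keys, and the foreign key are assumptions.

```sql
-- Assumed schema: types and key definitions are guesses, not from the original post.
CREATE TABLE city (
    city_id   INT PRIMARY KEY,
    city_name VARCHAR(50) NOT NULL
);

CREATE TABLE person (
    person_id      INT PRIMARY KEY,
    person_name    VARCHAR(50) NOT NULL,
    person_city_id INT NOT NULL,
    FOREIGN KEY (person_city_id) REFERENCES city (city_id)
);

-- Rows copied from the joined output above.
INSERT INTO city (city_id, city_name) VALUES
    (1, 'chicago'),
    (2, 'new york'),
    (3, 'los angeles');

INSERT INTO person (person_id, person_name, person_city_id) VALUES
    (1, 'charles',  1),
    (2, 'celia',    1),
    (3, 'curtis',   1),
    (4, 'chauncey', 1),
    (5, 'nathan',   2),
    (6, 'luke',     3),
    (7, 'louise',   3),
    (8, 'lucy',     3),
    (9, 'larry',    3);
```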

And I want to select a single record from person for each unique city using some particular logic. For example:

SELECT city.*, person.* FROM city, person WHERE city.city_id = person.person_city_id
GROUP BY city_id ORDER BY person_name DESC
;

The implication here is that within each city, I want to get the lexicographically greatest person_name, e.g.:

+---------+-------------+-----------+-------------+----------------+
| city_id | city_name   | person_id | person_name | person_city_id |
+---------+-------------+-----------+-------------+----------------+
|       2 | new york    |         5 | nathan      |              2 |
|       3 | los angeles |         6 | luke        |              3 |
|       1 | chicago     |         3 | curtis      |              1 |
+---------+-------------+-----------+-------------+----------------+

The actual output I get, however, is:

+---------+-------------+-----------+-------------+----------------+
| city_id | city_name   | person_id | person_name | person_city_id |
+---------+-------------+-----------+-------------+----------------+
|       2 | new york    |         5 | nathan      |              2 |
|       3 | los angeles |         6 | luke        |              3 |
|       1 | chicago     |         1 | charles     |              1 |
+---------+-------------+-----------+-------------+----------------+

I understand that the reason for this discrepancy is that MySQL first performs the GROUP BY, then it does the ORDER BY. This is unfortunate for me, as I want the GROUP BY to apply my selection logic when choosing which record it keeps.

I can work around this by using some nested SELECT statements:

SELECT c.*, p.* FROM city c,
    ( SELECT p_inner.* FROM
        ( SELECT * FROM person ORDER BY person_city_id, person_name DESC ) p_inner
        GROUP BY person_city_id ) p
    WHERE c.city_id = p.person_city_id;
+---------+-------------+-----------+-------------+----------------+
| city_id | city_name   | person_id | person_name | person_city_id |
+---------+-------------+-----------+-------------+----------------+
|       1 | chicago     |         3 | curtis      |              1 |
|       2 | new york    |         5 | nathan      |              2 |
|       3 | los angeles |         6 | luke        |              3 |
+---------+-------------+-----------+-------------+----------------+

This seems like it would be terribly inefficient when the person table grows arbitrarily large. I assume the inner SELECT statements don't know about the outermost WHERE filters. Is this true?

What is the accepted best approach for doing what is effectively an ORDER BY before the GROUP BY?

Answer 1

The usual way to do this (in MySQL) is to join your table to itself.

First, to get the greatest person_name per city (i.e. per person_city_id in the person table):

SELECT p.*
FROM person p
LEFT JOIN person p2
 ON p.person_city_id = p2.person_city_id
 AND p.person_name < p2.person_name
WHERE p2.person_name IS NULL

This joins person to itself within each person_city_id (your GROUP BY variable), and also pairs the rows up such that p2's person_name is greater than p's person_name.

Since it's a left join, if there's a p.person_name for which there is no greater p2.person_name (within that same city), then p2.person_name will be NULL. These are precisely the "greatest" person_name values per city.
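
To see the pairing at work, here is a small sketch that runs the same join without the IS NULL filter, restricted to Chicago (city 1 in the question's sample data). The column aliases p_name and p2_name are made up for illustration.

```sql
-- Same self-join as above, but with the pairs left visible instead of filtered.
SELECT p.person_name  AS p_name,
       p2.person_name AS p2_name
FROM person p
LEFT JOIN person p2
  ON p.person_city_id = p2.person_city_id
 AND p.person_name < p2.person_name
WHERE p.person_city_id = 1
ORDER BY p.person_name, p2.person_name;

-- With the question's sample data this gives:
--   celia    | charles
--   celia    | chauncey
--   celia    | curtis
--   charles  | chauncey
--   charles  | curtis
--   chauncey | curtis
--   curtis   | NULL      <- no greater name exists, so this row survives the IS NULL filter
```

Every name except curtis is paired with at least one greater name, so only curtis is left once WHERE p2.person_name IS NULL is applied.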

So to join your other information (from city) to it, just do another join:

SELECT c.*,p.*
FROM person p
LEFT JOIN person p2
 ON p.person_city_id = p2.person_city_id
 AND p.person_name < p2.person_name
LEFT JOIN city c                           -- add in city table
 ON p.person_city_id = c.city_id           -- add in city table
WHERE p2.person_name IS NULL               -- ORDER BY c.city_id if you like

Answer 2

Your "solution" is not valid SQL, but it works in MySQL. You can't be sure, however, that it won't break with a future change in the query optimizer code. It can be slightly improved to have just one level of nesting (still not valid SQL):

-- Option 1
SELECT 
       c.*
     , p.* 
FROM 
      city AS c
  JOIN
      ( SELECT * 
        FROM person 
        ORDER BY person_city_id
               , person_name DESC 
      ) AS p
    ON  c.city_id = p.person_city_id
GROUP BY p.person_city_id
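
Whether a server accepts a query like Option 1 depends on its SQL mode. Since MySQL 5.7.5 the ONLY_FULL_GROUP_BY mode is enabled by default, and with it a query that selects nonaggregated columns not determined by the GROUP BY columns should be rejected (error 1055). A minimal sketch for checking this on your own server:

```sql
-- Show the SQL modes active for the current session.
SELECT @@SESSION.sql_mode;

-- Illustration only: this replaces the session's modes with ONLY_FULL_GROUP_BY alone.
-- After it, Option 1 is expected to fail instead of returning an arbitrary row per group.
SET SESSION sql_mode = 'ONLY_FULL_GROUP_BY';
```

The setting lasts only for the current connection, so it is a cheap way to test whether a query relies on the nonstandard GROUP BY behavior.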

Another way (valid SQL syntax, works in other DBMSs, too) is to use a subquery that selects the alphabetically last name for every city, and then join:

-- Option 2
SELECT 
       c.*
     , p.* 
FROM 
      city AS c
  JOIN
      ( SELECT person_city_id
             , MAX(person_name) AS person_name 
        FROM person 
        GROUP BY person_city_id
      ) AS pmax
    ON  c.city_id = pmax.person_city_id
  JOIN 
      person AS p
    ON  p.person_city_id = pmax.person_city_id
    AND p.person_name = pmax.person_name

Another way is the self join (of the table person), with the < trick that @mathematical_coffee describes.

-- Option 3: see @mathematical_coffee's answer (Answer 1)

Yet another way is to use a LIMIT 1 subquery for the join of city with person:

-- Option 4
SELECT 
       c.*
     , p.* 
FROM 
      city AS c
  JOIN
      person AS p
    ON
      p.person_id =
      ( SELECT person_id
        FROM person AS pm 
        WHERE pm.person_city_id = c.city_id
        ORDER BY person_name DESC
        LIMIT 1
      )

This runs a subquery (on table person) for every city, and it will be efficient if you have a (person_city_id, person_name) index with the InnoDB engine or a (person_city_id, person_name, person_id) index with the MyISAM engine.
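
The indexes mentioned above could be created roughly as follows. The index names are made up, and the sketch assumes person_id is the primary key of person; pick the one statement that matches your storage engine.

```sql
-- InnoDB: secondary indexes implicitly carry the primary key (assumed here to be person_id),
-- so two columns are enough for the subquery to be answered from the index.
CREATE INDEX idx_person_city_name
    ON person (person_city_id, person_name);

-- MyISAM: add person_id explicitly so the subquery does not have to read the table rows.
CREATE INDEX idx_person_city_name_id
    ON person (person_city_id, person_name, person_id);
```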

There is one major difference between these options:

Options 2 and 3 will return all tied results (if two or more persons in a city share the name that is alphabetically last, then all of them will be shown).

Options 1 and 4 will return one result per city, even if there are ties. You can choose which one by altering the ORDER BY clause.
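
As a sketch of what altering the ORDER BY clause can look like, here is Option 4 with a second sort key. Preferring the lowest person_id among tied names is an arbitrary choice made for illustration.

```sql
SELECT c.*, p.*
FROM city AS c
JOIN person AS p
  ON p.person_id =
     ( SELECT pm.person_id
       FROM person AS pm
       WHERE pm.person_city_id = c.city_id
       ORDER BY pm.person_name DESC   -- alphabetically last name first
              , pm.person_id ASC      -- tie-break: lowest id among equal names
       LIMIT 1
     );
```

Without the second key, which of the tied rows comes back is left to the server.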

Which option is more efficient also depends on the distribution of your data, so the best way is to try them all, check their execution plans, and find the best indexes for each one. An index on (person_city_id, person_name) will most likely be good for any of those queries.

By distribution I mean:

Do you have few cities with many persons per city? (I would think that options 2 and 4 would behave better in this case.)
Or many cities with few persons per city? (Option 3 may be better with such data.)
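
To compare the options on your own data, you can prefix each query with EXPLAIN. A minimal sketch using Option 2:

```sql
EXPLAIN
SELECT c.*, p.*
FROM city AS c
JOIN ( SELECT person_city_id
            , MAX(person_name) AS person_name
       FROM person
       GROUP BY person_city_id
     ) AS pmax
  ON c.city_id = pmax.person_city_id
JOIN person AS p
  ON p.person_city_id = pmax.person_city_id
 AND p.person_name = pmax.person_name;
```

In the output, the key column shows which index (if any) is used for each table and the rows column gives the estimated number of rows examined. Repeating this for each option, with and without the (person_city_id, person_name) index, should show which one suits your distribution.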

Answer 1
1 accepted 2012-02-06 00:28:11

Answer 2
0 2012-02-06 00:35:11