Filter a one-to-many query by requiring all of many meet criteria
Imagine the following tables:
create table boxes(
    id int,
    name text,
    ...
);
create table thingsinboxes(
    id int,
    box_id int,
    thing enum('apple','banana','orange')
);
And the tables look like:

boxes:
id | name
1  | orangesOnly
2  | orangesOnly2
3  | orangesBananas
4  | misc

thingsinboxes:
id | box_id | thing
1  | 1      | orange
2  | 1      | orange
3  | 2      | orange
4  | 3      | orange
5  | 3      | banana
6  | 4      | orange
7  | 4      | apple
8  | 4      | banana
How do I select the boxes that contain at least one orange and nothing that isn't an orange?
How does this scale, assuming I have several hundred thousand boxes and possibly a million things in boxes?
I'd like to keep this all in SQL if possible, rather than post-processing the result set with a script.
I'm using both Postgres and MySQL, so subqueries are probably bad, given that MySQL doesn't optimize subqueries (pre version 6, anyway).
SELECT b.*
FROM boxes b JOIN thingsinboxes t ON (b.id = t.box_id)
GROUP BY b.id
HAVING COUNT(DISTINCT t.thing) = 1 AND SUM(t.thing = 'orange') > 0;
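One portability note on the GROUP BY query above: SUM(t.thing = 'orange') relies on MySQL treating a comparison as 0 or 1, and PostgreSQL has no SUM over a boolean. Below is a minimal sketch of a variant that uses CASE instead and lists every selected column in GROUP BY, which should be accepted by both. It is run here against the question's sample data in SQLite, through Python's sqlite3 module, purely so it can be executed as-is. Assumptions: SQLite stands in for the real servers, thing is stored as plain text because SQLite has no ENUM, and the name PORTABLE_SQL is made up for this sketch.

```python
import sqlite3

# Sample data from the question. Assumption: SQLite stands in for MySQL/Postgres,
# and `thing` is TEXT because SQLite has no ENUM type.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE boxes (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE thingsinboxes (id INTEGER PRIMARY KEY, box_id INT, thing TEXT);
    INSERT INTO boxes VALUES
        (1, 'orangesOnly'), (2, 'orangesOnly2'),
        (3, 'orangesBananas'), (4, 'misc');
    INSERT INTO thingsinboxes VALUES
        (1, 1, 'orange'), (2, 1, 'orange'), (3, 2, 'orange'), (4, 3, 'orange'),
        (5, 3, 'banana'), (6, 4, 'orange'), (7, 4, 'apple'), (8, 4, 'banana');
""")

# The inner join guarantees each grouped box holds at least one thing, so
# "zero non-orange things" implies "at least one orange".
PORTABLE_SQL = """
    SELECT b.id, b.name
    FROM boxes b JOIN thingsinboxes t ON b.id = t.box_id
    GROUP BY b.id, b.name
    HAVING SUM(CASE WHEN t.thing <> 'orange' THEN 1 ELSE 0 END) = 0
    ORDER BY b.id
"""

orange_only = conn.execute(PORTABLE_SQL).fetchall()
print(orange_only)  # [(1, 'orangesOnly'), (2, 'orangesOnly2')]
```

Boxes 3 and 4 drop out because each has at least one row where thing is not 'orange'.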
Here's another solution that does not use GROUP BY:
SELECT DISTINCT b.*
FROM boxes b
JOIN thingsinboxes t1
ON (b.id = t1.box_id AND t1.thing = 'orange')
LEFT OUTER JOIN thingsinboxes t2
ON (b.id = t2.box_id AND t2.thing != 'orange')
WHERE t2.box_id IS NULL;
As always, before you draw conclusions about the scalability or performance of a query, you have to try it with a realistic data set and measure the performance.
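To make "try it and measure" a bit more concrete, here is a rough sketch of a timing harness. Everything about the data is an assumption: the number of boxes, the mix of things, the index name idx_box, and the use of SQLite through Python's sqlite3 module. Timings from SQLite say nothing about MySQL or Postgres, so treat this only as the shape of the experiment and rerun the same two queries against your real server with realistic volumes. The GROUP BY query is written with CASE so that it also runs outside MySQL.

```python
import random
import sqlite3
import time

def build_db(n_boxes=2000, seed=0):
    """Create an in-memory database with synthetic boxes (hypothetical sizes)."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE boxes (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE thingsinboxes (id INTEGER PRIMARY KEY, box_id INT, thing TEXT);
        CREATE INDEX idx_box ON thingsinboxes (box_id);
    """)
    conn.executemany("INSERT INTO boxes VALUES (?, ?)",
                     [(i, f"box{i}") for i in range(1, n_boxes + 1)])
    # Each box gets 1 to 5 things; oranges are deliberately over-represented.
    things = [(box_id, rng.choice(["apple", "banana", "orange", "orange"]))
              for box_id in range(1, n_boxes + 1)
              for _ in range(rng.randint(1, 5))]
    conn.executemany(
        "INSERT INTO thingsinboxes (box_id, thing) VALUES (?, ?)", things)
    return conn

GROUP_BY_SQL = """
    SELECT b.id FROM boxes b JOIN thingsinboxes t ON b.id = t.box_id
    GROUP BY b.id
    HAVING SUM(CASE WHEN t.thing <> 'orange' THEN 1 ELSE 0 END) = 0
"""
ANTI_JOIN_SQL = """
    SELECT DISTINCT b.id FROM boxes b
    JOIN thingsinboxes t1 ON b.id = t1.box_id AND t1.thing = 'orange'
    LEFT OUTER JOIN thingsinboxes t2
      ON b.id = t2.box_id AND t2.thing <> 'orange'
    WHERE t2.box_id IS NULL
"""

def timed(conn, sql):
    """Run a query once; return the sorted box ids and elapsed seconds."""
    start = time.perf_counter()
    rows = conn.execute(sql).fetchall()
    return sorted(r[0] for r in rows), time.perf_counter() - start

conn = build_db()
ids_group, secs_group = timed(conn, GROUP_BY_SQL)
ids_anti, secs_anti = timed(conn, ANTI_JOIN_SQL)
print(f"GROUP BY:  {len(ids_group)} boxes in {secs_group:.4f}s")
print(f"anti-join: {len(ids_anti)} boxes in {secs_anti:.4f}s")
```

Checking that both queries return the same ids before comparing their timings is a cheap guard against benchmarking a query that is fast only because it is wrong.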
I think Bill Karwin's query is just fine; however, if a relatively small proportion of boxes contain oranges, you should be able to speed things up by using an index on the thing field:
SELECT b.*
FROM boxes b JOIN thingsinboxes t1 ON (b.id = t1.box_id)
WHERE t1.thing = 'orange'
AND NOT EXISTS (
SELECT 1
FROM thingsinboxes t2
WHERE t2.box_id = b.id
AND t2.thing <> 'orange'
)
GROUP BY b.id;
The WHERE NOT EXISTS subquery will only be run once per orange thing, so it's not too expensive provided there aren't many oranges.
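A small sketch of the index idea, again using SQLite through Python's sqlite3 module only so that it runs as-is. The index names idx_things_thing and idx_things_box are made up for this example, thing is stored as text because SQLite has no ENUM, and whether a given planner actually uses the index on thing depends on the engine and the data, so check the plan on your own server rather than relying on what SQLite reports here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE boxes (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE thingsinboxes (id INTEGER PRIMARY KEY, box_id INT, thing TEXT);
    -- Hypothetical index names: one to find the oranges, one for the
    -- per-box lookup inside NOT EXISTS.
    CREATE INDEX idx_things_thing ON thingsinboxes (thing);
    CREATE INDEX idx_things_box ON thingsinboxes (box_id);
    INSERT INTO boxes VALUES
        (1, 'orangesOnly'), (2, 'orangesOnly2'),
        (3, 'orangesBananas'), (4, 'misc');
    INSERT INTO thingsinboxes VALUES
        (1, 1, 'orange'), (2, 1, 'orange'), (3, 2, 'orange'), (4, 3, 'orange'),
        (5, 3, 'banana'), (6, 4, 'orange'), (7, 4, 'apple'), (8, 4, 'banana');
""")

NOT_EXISTS_SQL = """
    SELECT b.id, b.name
    FROM boxes b JOIN thingsinboxes t1 ON b.id = t1.box_id
    WHERE t1.thing = 'orange'
      AND NOT EXISTS (
        SELECT 1 FROM thingsinboxes t2
        WHERE t2.box_id = b.id AND t2.thing <> 'orange'
      )
    GROUP BY b.id, b.name
    ORDER BY b.id
"""

rows = conn.execute(NOT_EXISTS_SQL).fetchall()
print(rows)  # [(1, 'orangesOnly'), (2, 'orangesOnly2')]

# SQLite's own view of the plan; the last column of each row is the description.
plan = conn.execute("EXPLAIN QUERY PLAN " + NOT_EXISTS_SQL).fetchall()
for step in plan:
    print(step[-1])
```

On MySQL or Postgres the equivalent check would be to run EXPLAIN on the same SELECT and see which indexes the planner picks.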