Eliminate duplicates before union
I need to run a query that selects two columns from a big table (3M+ rows; with two columns selected, the result set is around 6-7M values) and returns a list. I used UNION to merge the columns into one list and to eliminate duplicates. The problem is that I can't return the result in one query; I need to partition it, so I applied a LIMIT ?,? to the subqueries, which the application layer sets via prepared statements.
SELECT val
FROM
(
(SELECT fs.smr as val
FROM `fr_search` as fs
ORDER BY val LIMIT ?,?)
UNION
(SELECT fs.dmr as val
FROM `fr_search` as fs
ORDER BY val LIMIT ?,?)
) as vals
GROUP BY val
The problem: the UNION eliminates the duplicates, but only after the LIMIT is applied. So if the two subqueries return 100 + 100 = 200 rows and most of them are duplicates, I get back fewer than 200 rows. How can I apply a limit to such a query so that it returns a specific number of rows? (If I apply the LIMIT after the subqueries, the query takes more than two minutes to run, so that does not solve the problem.)
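The effect described above can be reproduced on a toy table. The sketch below is only an illustration under assumptions: it uses Python's sqlite3 module as a stand-in for MySQL, invents ten rows of overlapping data, and wraps each UNION branch in a derived table because SQLite does not accept parenthesised branches.

```python
import sqlite3

# Toy stand-in for fr_search (assumption: the real table lives in MySQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fr_search (smr INTEGER, dmr INTEGER)")
# smr holds 1..10 and dmr holds 2..11, so the two columns overlap heavily.
conn.executemany(
    "INSERT INTO fr_search (smr, dmr) VALUES (?, ?)",
    [(i, i + 1) for i in range(1, 11)],
)


def limited_union(offset, count):
    """Apply LIMIT to each column first, then UNION, as in the question."""
    sql = """
        SELECT val FROM (
            SELECT val FROM
                (SELECT smr AS val FROM fr_search ORDER BY val LIMIT ?, ?)
            UNION
            SELECT val FROM
                (SELECT dmr AS val FROM fr_search ORDER BY val LIMIT ?, ?)
        )
        ORDER BY val
    """
    params = (offset, count, offset, count)
    return [row[0] for row in conn.execute(sql, params)]


# 5 + 5 rows were requested, but the duplicates collapse after the LIMIT.
page = limited_union(0, 5)
print(page)  # [1, 2, 3, 4, 5, 6]
```

Ten rows go into the UNION and only six come out, which is the "fewer rows than requested" behaviour the question describes.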
You don't actually need a subquery for this. The following will work for the first 100 rows:
(SELECT DISTINCT fs.smr as val
FROM `fr_search` as fs
ORDER BY val
LIMIT 100
)
UNION
(SELECT DISTINCT fs.dmr as val
FROM `fr_search` as fs
ORDER BY val
LIMIT 100
)
ORDER BY val
LIMIT 100;
However, once you start putting in an offset, it gets more complicated. For the next 100 rows:
(SELECT DISTINCT fs.smr as val
FROM `fr_search` as fs
ORDER BY val
LIMIT 200
)
UNION
(SELECT DISTINCT fs.dmr as val
FROM `fr_search` as fs
ORDER BY val
LIMIT 200
)
ORDER BY val
LIMIT 100, 100;
The problem is that you don't know where the second set will come from.
If you actually need to page through the result set, I would suggest that you store it in a temporary table and page off of the temporary table.
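A minimal sketch of the temporary-table idea, again using sqlite3 as a stand-in for MySQL. The table name tmp_vals, the helper fetch_page, and the sample rows are all hypothetical names chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fr_search (smr INTEGER, dmr INTEGER)")
conn.executemany(
    "INSERT INTO fr_search (smr, dmr) VALUES (?, ?)",
    [(i, i + 1) for i in range(1, 11)],
)

# Pay the cost of the deduplicating UNION once and keep the result.
# tmp_vals is a hypothetical name.
conn.execute("""
    CREATE TEMPORARY TABLE tmp_vals AS
    SELECT smr AS val FROM fr_search
    UNION
    SELECT dmr AS val FROM fr_search
""")


def fetch_page(page, page_size):
    """Return one page (1-based) of the already deduplicated values."""
    offset = (page - 1) * page_size
    rows = conn.execute(
        "SELECT val FROM tmp_vals ORDER BY val LIMIT ?, ?",
        (offset, page_size),
    )
    return [row[0] for row in rows]


print(fetch_page(1, 4))  # [1, 2, 3, 4]
print(fetch_page(3, 4))  # [9, 10, 11]
```

In MySQL a TEMPORARY table is visible only to the session that created it, so this approach assumes every page is fetched over the same connection.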
Query optimisation always has two parts to the solution, and it is sometimes an iterative process of trying, measuring and comparing.
The best query is most likely the straightforward and simple one:
SELECT v.val
FROM (
SELECT fs.smr as val
FROM `fr_search` as fs
UNION
SELECT fs.dmr as val
FROM `fr_search` as fs
) as v
ORDER BY v.val LIMIT ?,?;
In order to run efficiently, you'll want 2 indexes:
one on fr_search.smr
another on fr_search.dmr
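Creating the two single-column indexes could look like the sketch below. The index names are hypothetical, and sqlite3 is used only so that the statements can be executed here; the same CREATE INDEX syntax is also accepted by MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fr_search (smr INTEGER, dmr INTEGER)")

# One index per column, so each half of the UNION can read its values
# in sorted order. Index names are hypothetical.
conn.execute("CREATE INDEX idx_fr_search_smr ON fr_search (smr)")
conn.execute("CREATE INDEX idx_fr_search_dmr ON fr_search (dmr)")

# PRAGMA index_list is SQLite-specific; column 1 of each row is the name.
index_names = sorted(
    row[1] for row in conn.execute("PRAGMA index_list('fr_search')")
)
print(index_names)  # ['idx_fr_search_dmr', 'idx_fr_search_smr']
```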
If the optimiser cannot handle the above, then try using index hints to force it to use the indexes.
In an extreme pinch you could try forcing the issue with the following:
SELECT v.val
FROM (
-- MySQL requires parentheses around a UNION branch with its own ORDER BY / LIMIT
(SELECT DISTINCT fs.smr as val
FROM `fr_search` as fs
ORDER BY fs.smr LIMIT ?)
UNION
(SELECT DISTINCT fs.dmr as val
FROM `fr_search` as fs
ORDER BY fs.dmr LIMIT ?)
) as v
ORDER BY v.val LIMIT ?,?;
Note that your substitutions (assuming pages of 100) should be as follows, in placeholder order. The final LIMIT ?,? takes the offset first and the row count second:
Page 1: 100, 100, 0, 100
Page 2: 200, 200, 100, 100
Page 3: 300, 300, 200, 100
Page 4: 400, 400, 300, 100
etc.
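The reason is that you need to cater for a possible imbalance in the cross-column ordering favouring either column. On page 4, for example, all of the first 400 values in the combined ordering could come from a single column, so each inner query has to return its first 400 distinct values.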
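The substitution pattern can be computed per page. The sketch below is an illustration under assumptions: page_params and fetch_page are hypothetical helpers, sqlite3 stands in for MySQL, and each UNION branch is wrapped in a derived table because SQLite does not accept parenthesised branches. It relies on the `LIMIT offset, row_count` argument order, so the tuple is (inner limit, inner limit, offset, count); page 4 with pages of 100 gives (400, 400, 300, 100).

```python
import sqlite3


def page_params(page, page_size=100):
    """Placeholder values for a 1-based page: (inner, inner, offset, count)."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    inner_limit = page * page_size          # each column may supply every row
    outer_offset = (page - 1) * page_size   # rows already shown on earlier pages
    return (inner_limit, inner_limit, outer_offset, page_size)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fr_search (smr INTEGER, dmr INTEGER)")
conn.executemany(
    "INSERT INTO fr_search (smr, dmr) VALUES (?, ?)",
    [(i, i + 1) for i in range(1, 11)],
)

PAGED_SQL = """
    SELECT v.val FROM (
        SELECT val FROM
            (SELECT DISTINCT smr AS val FROM fr_search ORDER BY val LIMIT ?)
        UNION
        SELECT val FROM
            (SELECT DISTINCT dmr AS val FROM fr_search ORDER BY val LIMIT ?)
    ) AS v
    ORDER BY v.val LIMIT ?, ?
"""


def fetch_page(page, page_size):
    """Run the forced query for one page of the toy data."""
    rows = conn.execute(PAGED_SQL, page_params(page, page_size))
    return [row[0] for row in rows]


print(page_params(4))    # (400, 400, 300, 100)
print(fetch_page(2, 5))  # [6, 7, 8, 9, 10]
```

Note that the inner limits grow with the page number, so later pages read more rows from each column than earlier ones.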
You have two options:
You can SELECT DISTINCT in the inner and outer queries:
SELECT DISTINCT val
FROM
(
(SELECT DISTINCT fs.smr as val
FROM `fr_search` as fs)
UNION ALL
(SELECT DISTINCT fs.dmr as val
FROM `fr_search` as fs)
) as vals
ORDER BY val LIMIT ?,?;
Or you can group by in your inner queries too, before then grouping in the outer query:
SELECT val
FROM
(
(SELECT fs.smr as val
FROM `fr_search` as fs
GROUP BY fs.smr)
UNION ALL
(SELECT fs.dmr as val
FROM `fr_search` as fs
GROUP BY fs.dmr)
) as vals
GROUP BY val
ORDER BY val LIMIT ?,?;
Both will do essentially the same thing in this particular scenario. However, in both you should use UNION ALL, so that the UNION part doesn't do deduplication work on its own, and you are explicit about how you want your records grouped. I would also move the LIMIT clause to the outer query.
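To check that the two variants agree, both can be run against the same toy data. This is a sketch under assumptions: sqlite3 stands in for MySQL, the four sample rows are invented, and the parentheses around the UNION ALL branches are dropped because SQLite does not accept them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fr_search (smr INTEGER, dmr INTEGER)")
# Invented rows with duplicates inside each column and across both columns.
conn.executemany(
    "INSERT INTO fr_search (smr, dmr) VALUES (?, ?)",
    [(1, 2), (1, 3), (2, 3), (4, 1)],
)

DISTINCT_SQL = """
    SELECT DISTINCT val FROM (
        SELECT DISTINCT smr AS val FROM fr_search
        UNION ALL
        SELECT DISTINCT dmr AS val FROM fr_search
    ) AS vals
    ORDER BY val LIMIT ?, ?
"""

GROUP_BY_SQL = """
    SELECT val FROM (
        SELECT smr AS val FROM fr_search GROUP BY smr
        UNION ALL
        SELECT dmr AS val FROM fr_search GROUP BY dmr
    ) AS vals
    GROUP BY val
    ORDER BY val LIMIT ?, ?
"""


def run(sql, offset, count):
    """Execute one variant and return the page as a plain list."""
    return [row[0] for row in conn.execute(sql, (offset, count))]


print(run(DISTINCT_SQL, 0, 3))  # [1, 2, 3]
print(run(GROUP_BY_SQL, 0, 3))  # [1, 2, 3]
```

Because the LIMIT sits on the outer query, each page always contains the requested number of rows until the values run out.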