Eliminate duplicates before union

I need to run a query that selects two columns from a big table (3M+ rows; with both columns selected, the result set is around 6-7M values) and returns them as a single list. I used UNION to merge the columns into one list and to eliminate duplicates. The problem is that I can't return the result in one query; I need to partition it, so I applied a LIMIT ?,? to the subqueries, which the application layer sets via prepared statements.

SELECT val
FROM 
(
    (SELECT fs.smr as val
    FROM `fr_search` as fs
    ORDER BY val LIMIT ?,?)

    UNION

    (SELECT fs.dmr as val
    FROM `fr_search` as fs
    ORDER BY val LIMIT ?,?)
) as vals
GROUP BY val

The problem: the UNION eliminates the duplicates, but only after the LIMIT has been applied. That means if the two subqueries return 100 + 100 = 200 rows and most of them are duplicates, I get back fewer than 200 rows. How can I apply a limit to such a query so that it returns a specific number of rows? (If I apply the LIMIT after the subqueries, the query takes more than two minutes to run, so that does not solve the problem.)
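The effect described above can be reproduced on a small scale. The following is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for MySQL, and the ten rows of data are hypothetical. SQLite does not accept the parenthesised form used in the question, so each limited branch is wrapped in a subquery, which should behave the same way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fr_search (smr INTEGER, dmr INTEGER)")
# Hypothetical data: the two columns hold largely the same values.
conn.executemany(
    "INSERT INTO fr_search (smr, dmr) VALUES (?, ?)",
    [(i, i + 1) for i in range(1, 11)],
)

# LIMIT is applied inside each branch, UNION removes duplicates afterwards.
limit_then_union = """
    SELECT val FROM (SELECT smr AS val FROM fr_search ORDER BY val LIMIT ? OFFSET ?)
    UNION
    SELECT val FROM (SELECT dmr AS val FROM fr_search ORDER BY val LIMIT ? OFFSET ?)
    ORDER BY val
"""
rows = [r[0] for r in conn.execute(limit_then_union, (3, 0, 3, 0))]

# Each branch returns 3 rows, but two of them overlap,
# so fewer than 3 + 3 rows come back.
print(len(rows), rows)
```

Here the branches yield 1, 2, 3 and 2, 3, 4, so the merged page has four rows instead of six.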

You don't actually need a subquery for this. The following will work for the first 100 rows:

 (SELECT DISTINCT fs.smr as val
  FROM `fr_search` as fs
  ORDER BY val
  LIMIT 100
 )
 UNION
 (SELECT DISTINCT fs.dmr as val
  FROM `fr_search` as fs
  ORDER BY val
  LIMIT 100
 )
 ORDER BY val
 LIMIT 100;

However, once you start adding an offset, it gets more complicated. For the next 100 rows:

 (SELECT DISTINCT fs.smr as val
  FROM `fr_search` as fs
  ORDER BY val
  LIMIT 200
 )
 UNION
 (SELECT DISTINCT fs.dmr as val
  FROM `fr_search` as fs
  ORDER BY val
  LIMIT 200
 )
 ORDER BY val
 LIMIT 100, 100;

The problem is that you don't know which subquery the second set of rows will come from, so each one has to return its first 200 rows.

If you actually need to page through the result set, I would suggest that you store it in a temporary table and page off of the temporary table.
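A minimal sketch of the temporary-table approach follows, again using Python's sqlite3 module as a stand-in for MySQL. The names tmp_vals, idx_tmp_vals, build_distinct_values and fetch_page are hypothetical, and the sample data is made up; in MySQL the equivalent would be a CREATE TEMPORARY TABLE ... SELECT on the same connection.

```python
import sqlite3

def build_distinct_values(conn):
    """Materialise the de-duplicated union once, so paging never repeats it."""
    conn.execute("DROP TABLE IF EXISTS tmp_vals")
    conn.execute(
        """
        CREATE TEMP TABLE tmp_vals AS
        SELECT smr AS val FROM fr_search
        UNION
        SELECT dmr AS val FROM fr_search
        """
    )
    # Index the temporary table so each page is an ordered index range scan.
    conn.execute("CREATE INDEX temp.idx_tmp_vals ON tmp_vals (val)")

def fetch_page(conn, page, page_size):
    """Return one page (1-based) of distinct values in sorted order."""
    cur = conn.execute(
        "SELECT val FROM tmp_vals ORDER BY val LIMIT ? OFFSET ?",
        (page_size, (page - 1) * page_size),
    )
    return [row[0] for row in cur]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fr_search (smr INTEGER, dmr INTEGER)")
conn.executemany(
    "INSERT INTO fr_search (smr, dmr) VALUES (?, ?)",
    [(i, i + 1) for i in range(1, 11)],  # hypothetical sample rows
)

build_distinct_values(conn)
pages = [fetch_page(conn, p, 4) for p in (1, 2, 3)]
print(pages)
```

The expensive union runs once; every page after that is full-sized until the data runs out. The trade-off is that a temporary table only lives as long as the connection that created it.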

Query optimisation always has two parts, and it is sometimes an iterative process of trying, measuring and comparing.

  1. Write a good (and of course accurate) query that the engine can run efficiently.
  2. Ensure the appropriate indexes are available so the optimiser can choose a good execution plan.

The best query is most likely the simple, straightforward one:

SELECT  v.val
FROM    (
        SELECT  fs.smr as val
        FROM    `fr_search` as fs
        UNION
        SELECT  fs.dmr as val
        FROM    `fr_search` as fs
        ) as v
ORDER BY v.val LIMIT ?,?;

In order to run efficiently, you'll want 2 indexes:

  • one on fr_search.smr
  • the other on fr_search.dmr

If the optimiser cannot handle the above, then try using index hints to force it to use the indexes.

In an extreme pinch you could try forcing the issue with the following:

SELECT  v.val
FROM    (
        -- Parentheses keep each ORDER BY/LIMIT attached to its own SELECT
        (SELECT  DISTINCT fs.smr as val
        FROM    `fr_search` as fs
        ORDER BY fs.smr LIMIT ?)
        UNION
        (SELECT  DISTINCT fs.dmr as val
        FROM    `fr_search` as fs
        ORDER BY fs.dmr LIMIT ?)
        ) as v
ORDER BY v.val LIMIT ?,?;

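Note that your substitutions (assuming pages of 100) should be as follows. The first two values are the inner limits; the last two are the outer offset and row count, in that order, because MySQL reads LIMIT ?,? as offset, row count:

Page 1: 100, 100, 0, 100
Page 2: 200, 200, 100, 100
Page 3: 300, 300, 200, 100
Page 4: 400, 400, 300, 100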
etc.

The reason is that you need to cater for a possible imbalance in the ordering across the two columns, which may favour either one. So, for example, for page 4:

  • Get the top 400 distinct values from each column, ordered by that column.
  • Return rows 301 to 400 of the merged data.
  • These could all be the last 100 rows of one of the subqueries. But it's more likely that about 50 rows come from each subquery, from somewhere above its 150-row mark.
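The reasoning above can be checked on a small scale. The sketch below uses Python's sqlite3 module as a stand-in for MySQL; page_params is a hypothetical helper, the data is made up and deliberately skewed, and it assumes the outer placeholders bind as offset and then row count, as MySQL's two-argument LIMIT does. It compares the "top N from each column" query against the plain union for several pages.

```python
import sqlite3

def page_params(page, page_size=100):
    """Bind values for the two inner LIMIT ? and the outer LIMIT ?,? (offset, count)."""
    top_n = page * page_size          # each branch must cover every row up to this page
    offset = (page - 1) * page_size   # rows to skip in the merged result
    return (top_n, top_n, offset, page_size)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fr_search (smr INTEGER, dmr INTEGER)")
# Hypothetical skewed data: smr holds 1..60, dmr holds 41..100.
conn.executemany(
    "INSERT INTO fr_search (smr, dmr) VALUES (?, ?)",
    [(i, i + 40) for i in range(1, 61)],
)

# SQLite needs the limited branches wrapped in subqueries.
limited = """
    SELECT val FROM (
        SELECT val FROM (SELECT DISTINCT smr AS val FROM fr_search ORDER BY val LIMIT ?)
        UNION
        SELECT val FROM (SELECT DISTINCT dmr AS val FROM fr_search ORDER BY val LIMIT ?)
    ) ORDER BY val LIMIT ?, ?
"""
full = """
    SELECT val FROM (
        SELECT smr AS val FROM fr_search
        UNION
        SELECT dmr AS val FROM fr_search
    ) ORDER BY val LIMIT ?, ?
"""

def run(sql, params):
    return [row[0] for row in conn.execute(sql, params)]

for page in range(1, 11):
    params = page_params(page, page_size=10)
    # The last two values (offset, count) also drive the reference query.
    assert run(limited, params) == run(full, params[2:])

page4 = run(limited, page_params(4, page_size=10))
print(page4)
```

With this data, page 4 comes entirely from the smr column, which is the imbalance the inner limits have to allow for.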

You have two options:

You can SELECT DISTINCT in the inner and outer queries:

SELECT DISTINCT val
FROM 
(
    (SELECT DISTINCT fs.smr as val
    FROM `fr_search` as fs)

    UNION ALL

    (SELECT DISTINCT fs.dmr as val
    FROM `fr_search` as fs)
) as vals
ORDER BY val LIMIT ?,?;

Or you can GROUP BY in the inner queries too, and then GROUP BY again in the outer query:

SELECT val
FROM 
(
    (SELECT fs.smr as val
    FROM `fr_search` as fs
    GROUP BY fs.smr)

    UNION ALL

    (SELECT fs.dmr as val
    FROM `fr_search` as fs
    GROUP BY fs.dmr)
) as vals
GROUP BY val
ORDER BY val LIMIT ?,?;

Both will do essentially the same thing in this particular scenario. However, in both you should use UNION ALL, so that the UNION does no de-duplication work of its own and you are explicit about how you want your records grouped. I would also move the LIMIT clause to the outer query.
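As a quick sanity check that the two options return the same pages, here is a sketch using Python's sqlite3 module as a stand-in for MySQL. The five sample rows are hypothetical, and the parentheses around the branches are dropped because SQLite does not accept them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fr_search (smr INTEGER, dmr INTEGER)")
# Hypothetical rows with duplicates both within and across the columns.
conn.executemany(
    "INSERT INTO fr_search (smr, dmr) VALUES (?, ?)",
    [(1, 2), (1, 3), (2, 3), (4, 4), (5, 1)],
)

distinct_sql = """
    SELECT DISTINCT val FROM (
        SELECT DISTINCT smr AS val FROM fr_search
        UNION ALL
        SELECT DISTINCT dmr AS val FROM fr_search
    ) ORDER BY val LIMIT ?, ?
"""
group_by_sql = """
    SELECT val FROM (
        SELECT smr AS val FROM fr_search GROUP BY smr
        UNION ALL
        SELECT dmr AS val FROM fr_search GROUP BY dmr
    ) GROUP BY val
    ORDER BY val LIMIT ?, ?
"""

def page(sql, offset, count):
    return [row[0] for row in conn.execute(sql, (offset, count))]

first_distinct = page(distinct_sql, 0, 3)
first_grouped = page(group_by_sql, 0, 3)
second_distinct = page(distinct_sql, 3, 3)
second_grouped = page(group_by_sql, 3, 3)
print(first_distinct, second_distinct)
```

Identical results say nothing about speed; which form the optimiser handles better on a 3M-row table is something to measure on the real data.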
