
Performance issues with UNION of large tables

I have seven large tables that can hold anywhere between 100 and 1 million rows at any time. I'll call them LargeTable1, LargeTable2, LargeTable3, LargeTable4 ... LargeTable7. These tables are mostly static: there are no updates or new inserts. They change only once every two weeks or once a month, when they are truncated and a new batch of records is inserted into each.

All these tables have three fields in common: Headquarter, Country and File. Headquarter and Country are numbers in the format '000', though in two of these tables they are parsed as int due to other system requirements.

I have another, much smaller table called Headquarters with the information for each headquarter. This table has very few entries: at most 1000, actually.

Now, I need to create a stored procedure that returns all the headquarters that appear in the large tables but are either absent from the Headquarters table or have been deleted (rows in this table are deleted logically: it has a DeletionDate field to check this).

This is the query I've tried:

CREATE PROCEDURE deletedHeadquarters
AS
BEGIN
    DECLARE @headquartersFiles TABLE
    (
        headquarter int,
        countryFile varchar(MAX)
    );

    SET NOCOUNT ON

    INSERT INTO @headquartersFiles
    SELECT headquarter, CONCAT(country, ' (', file, ')')
    FROM
    (
        SELECT DISTINCT CONVERT(int, headquarter) as headquarter,
                        CONVERT(int, country) as country,
                        file
        FROM            LargeTable1     
        UNION
        SELECT DISTINCT headquarter,
                        country,
                        file
        FROM            LargeTable2
        UNION
        SELECT DISTINCT headquarter,
                        country,
                        file
        FROM            LargeTable3
        UNION
        SELECT DISTINCT headquarter,
                        country,
                        file
        FROM            LargeTable4
        UNION
        SELECT DISTINCT headquarter,
                        country,
                        file
        FROM            LargeTable5
        UNION
        SELECT DISTINCT headquarter,
                        country,
                        file
        FROM            LargeTable6
        UNION
        SELECT DISTINCT headquarter,
                        country,
                        file
        FROM            LargeTable7
    ) TC

    SELECT  RIGHT('000' + CAST(st.headquarter AS VARCHAR(3)), 3) as headquarter,
            MAX(s.deletionDate) as deletionDate,
            STUFF
            (
                (SELECT DISTINCT ', ' + st2.countryFile
                FROM @headquartersFiles st2
                WHERE st2.headquarter = st.headquarter
                FOR XML PATH('')),
                1,
                2,
                ''
            ) countryFile
    FROM    @headquartersFiles as st
    LEFT JOIN headquarters s ON CONVERT(int, s.headquarter) = st.headquarter
    WHERE   s.headquarter IS NULL
       OR   s.deletionDate IS NOT NULL
    GROUP BY st.headquarter

END

This stored procedure's performance isn't good enough for our application. It currently takes around 50 seconds to complete, with the following row counts for each table (just to give you an idea of the sizes):

  • LargeTable1: 1516666 rows
  • LargeTable2: 645740 rows
  • LargeTable3: 1950121 rows
  • LargeTable4: 779336 rows
  • LargeTable5: 1100999 rows
  • LargeTable6: 16499 rows
  • LargeTable7: 24454 rows

What can I do to improve performance? I've tried the following, without much difference:

  • Inserting into the local table in batches, excluding the headquarters I've already inserted and then updating the countryFile field for those that are repeated
  • Creating a view for that UNION query
  • Creating indexes on the headquarter field of the LargeTables

I've also thought about inserting these missing headquarters into a permanent table after the LargeTables change, but the Headquarters table can change more often, and I would rather not have to change its module to keep this table up to date. But if it's the best possible alternative, I'd go for it.

Thanks

Take this filter:

LEFT JOIN headquarters s ON CONVERT(int, s.headquarter) = st.headquarter
WHERE   s.headquarter IS NULL
   OR   s.deletionDate IS NOT NULL

Add it to each individual query in the union, then insert into @headquartersFiles.

It might seem like this adds a lot more filters, but it will actually speed things up because you are filtering before the rows are processed by the union.

Also take out all your DISTINCTs. It probably won't speed things up, but they are redundant because you are doing a UNION and not a UNION ALL.

I'd try doing the filtering with each individual table first. You just need to account for the fact that a headquarter might appear in one table, but not another. You can do this like so:
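A minimal sketch of that suggestion for the first two tables, using the column names from the question. It assumes LargeTable1 is one of the tables that store the codes as text (as the question's CONVERT calls suggest) and that LargeTable2 stores them as int; the other five branches would follow the same pattern. `[file]` is bracketed because FILE is a reserved word in T-SQL.

```sql
DECLARE @headquartersFiles TABLE
(
    headquarter int,
    countryFile varchar(MAX)
);

INSERT INTO @headquartersFiles (headquarter, countryFile)
SELECT headquarter, CONCAT(country, ' (', [file], ')')
FROM
(
    -- Each branch drops rows that have an active headquarter before the UNION runs.
    SELECT CONVERT(int, lt.headquarter) AS headquarter,
           CONVERT(int, lt.country)     AS country,
           lt.[file]
    FROM   LargeTable1 lt
    LEFT JOIN headquarters s
           ON CONVERT(int, s.headquarter) = CONVERT(int, lt.headquarter)
    WHERE  s.headquarter IS NULL
       OR  s.deletionDate IS NOT NULL

    UNION  -- UNION already removes duplicates, so no DISTINCT is needed

    SELECT lt.headquarter, lt.country, lt.[file]
    FROM   LargeTable2 lt
    LEFT JOIN headquarters s
           ON CONVERT(int, s.headquarter) = lt.headquarter
    WHERE  s.headquarter IS NULL
       OR  s.deletionDate IS NOT NULL

    -- Add one more UNION branch per remaining table (LargeTable3 to LargeTable7).
) TC;
```

With the filter inside every branch, the final SELECT in the procedure no longer needs its own WHERE clause; it only needs the join to pick up deletionDate.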

SELECT
    headquarter
FROM
(

    SELECT DISTINCT
        LT.headquarter,          -- qualified: both tables have a headquarter column
        'table1' AS large_table
    FROM
        LargeTable1 LT
    LEFT OUTER JOIN Headquarters HQ ON HQ.headquarter = LT.headquarter
    WHERE
        HQ.headquarter IS NULL OR
        HQ.deletionDate IS NOT NULL
    UNION ALL
    SELECT DISTINCT
        LT.headquarter,
        'table2' AS large_table
    FROM
        LargeTable2 LT
    LEFT OUTER JOIN Headquarters HQ ON HQ.headquarter = LT.headquarter
    WHERE
        HQ.headquarter IS NULL OR
        HQ.deletionDate IS NOT NULL
    UNION ALL
    ...
) SQ
GROUP BY headquarter
HAVING COUNT(*) = 7

With seven large tables, that makes sure the headquarter was flagged as missing or deleted in every one of them.

Table variables often perform poorly at this size because SQL Server does not generate statistics for them. Instead of a table variable, try a temp table, and if headquarter + country + file is unique in the temp table, add a unique constraint to the temp table definition (this creates a unique index; declare it CLUSTERED or use a primary key if you want it clustered). You can also create indexes on a temp table after creating it, but for various reasons SQL Server may ignore them.

Edit: as it turns out, you can in fact create indexes on table variables, even non-unique ones in SQL Server 2014+.

Secondly, try not to use functions in your joins or WHERE clauses. Doing so often causes performance problems.

Do the filtering at each step. But first, modify the headquarters table so it has the right type for what you need, along with an index:
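As a rough sketch of the temp table suggestion, the table variable from the question could be declared like this. The varchar(200) length is an assumption made for illustration, since a varchar(MAX) column cannot be part of an index key, and the sample rows are made up.

```sql
-- Temp table in place of the @headquartersFiles table variable.
-- varchar(200) is an assumed length; varchar(MAX) cannot be an index key column.
CREATE TABLE #headquartersFiles
(
    headquarter int          NOT NULL,
    countryFile varchar(200) NOT NULL,
    UNIQUE CLUSTERED (headquarter, countryFile)
);

-- In the real procedure this would be loaded from the UNION query.
INSERT INTO #headquartersFiles (headquarter, countryFile)
VALUES (1, '34 (fileA)'),   -- made-up sample rows
       (1, '44 (fileB)'),
       (2, '34 (fileA)');

-- Lookups by headquarter can now seek on the clustered index,
-- and the optimizer has statistics for the table.
SELECT headquarter, countryFile
FROM   #headquartersFiles
WHERE  headquarter = 1;

DROP TABLE #headquartersFiles;
```

Leaving the constraint unnamed avoids name clashes when several sessions create the same temp table at once.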

alter table headquarters add headquarter_int as (cast(headquarter as int));
create index idx_headquarters_int on headquarters(headquarter_int);

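SELECT DISTINCT headquarter, country, [file]   -- FILE is a reserved word in T-SQL
FROM LargeTable5 lt5
WHERE NOT EXISTS (SELECT 1
                  FROM headquarters s
                  -- a headquarter is excluded only if an active (not deleted) row exists
                  WHERE s.headquarter_int = lt5.headquarter and s.deletiondate is null
                 );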

Then, you want an index on LargeTable5(headquarter, country, file).

This should take less than 5 seconds to run. If so, then construct the full query, making sure that the types in the correlated subquery match and that you have the right index on each table. Use UNION to remove duplicates between the tables.
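One way the full query might be assembled, shown for two tables only. This is a sketch that assumes the headquarter_int computed column and its index already exist, that LargeTable1 stores the codes as text, and that LargeTable5 stores them as int. The subquery looks for an active row (deletionDate IS NULL), so a headquarter is returned when it is missing from headquarters or exists only as deleted, which is what the question asks for.

```sql
-- Add one UNION branch per remaining large table, following the same pattern.
SELECT CONVERT(int, lt.headquarter) AS headquarter,
       CONVERT(int, lt.country)     AS country,
       lt.[file]                    -- bracketed: FILE is a reserved word in T-SQL
FROM   LargeTable1 lt
WHERE  NOT EXISTS (SELECT 1
                   FROM   headquarters s
                   WHERE  s.headquarter_int = CONVERT(int, lt.headquarter)
                     AND  s.deletionDate IS NULL)
UNION  -- removes duplicates within and between the tables
SELECT lt.headquarter,
       lt.country,
       lt.[file]
FROM   LargeTable5 lt
WHERE  NOT EXISTS (SELECT 1
                   FROM   headquarters s
                   WHERE  s.headquarter_int = lt.headquarter   -- int to int, no conversion
                     AND  s.deletionDate IS NULL);
```

The CONVERT on LargeTable1 still runs once per row, so a matching computed column on that table may be worth testing if its branch turns out to be the slow one.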

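The real answer is to create a separate INSERT statement for each table, with the caveat that the data being inserted must not already exist in the destination table.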
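A hedged sketch of that approach for the first two tables, using a temp table as the destination. The varchar(200) length and the assumption that LargeTable1 stores its codes as text while LargeTable2 stores them as int are illustrative, not taken from the original post.

```sql
CREATE TABLE #headquartersFiles
(
    headquarter int          NOT NULL,
    countryFile varchar(200) NOT NULL   -- assumed length
);

-- First table: the destination is still empty, so only de-duplicate within the table.
INSERT INTO #headquartersFiles (headquarter, countryFile)
SELECT DISTINCT
       CONVERT(int, lt.headquarter),
       CONCAT(CONVERT(int, lt.country), ' (', lt.[file], ')')
FROM   LargeTable1 lt;

-- Every later table: skip rows that are already in the destination.
INSERT INTO #headquartersFiles (headquarter, countryFile)
SELECT DISTINCT src.headquarter, src.countryFile
FROM  (SELECT lt.headquarter,
              CONCAT(lt.country, ' (', lt.[file], ')') AS countryFile
       FROM   LargeTable2 lt) src
WHERE  NOT EXISTS (SELECT 1
                   FROM   #headquartersFiles t
                   WHERE  t.headquarter = src.headquarter
                     AND  t.countryFile = src.countryFile);
```

The same second statement would be repeated for LargeTable3 to LargeTable7, and the filter against headquarters can be added to each statement as well.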
