Performance issues with UNION of large tables
I have seven large tables, each of which can hold anywhere from 100 to 1 million rows at any time. I'll call them LargeTable1, LargeTable2, LargeTable3, LargeTable4 ... LargeTable7. These tables are mostly static: there are no updates or new inserts. They change only once every two weeks or once a month, when they are truncated and a new batch of records is inserted into each.
All these tables have three fields in common: Headquarter, Country and File. Headquarter and Country are numbers in the format '000', though in two of these tables they are stored as int because of other system requirements.
I have another, much smaller table called Headquarters with the information for each headquarter. This table has very few entries: at most 1000.
Now I need to create a stored procedure that returns all the headquarters that appear in the large tables but are either absent from the Headquarters table or have been deleted (rows in this table are deleted logically: it has a DeletionDate field to check this).
This is the query I've tried:
CREATE PROCEDURE deletedHeadquarters
AS
BEGIN
    DECLARE @headquartersFiles TABLE
    (
        headquarter int,          -- named to match the references below
        countryFile varchar(MAX)
    );

    SET NOCOUNT ON

    -- Collect every distinct (headquarter, country, file) from the seven tables.
    -- FILE is a reserved word in T-SQL, so the column is bracketed.
    INSERT INTO @headquartersFiles
    SELECT headquarter, CONCAT(country, ' (', [file], ')')
    FROM
    (
        SELECT DISTINCT CONVERT(int, headquarter) as headquarter,
                        CONVERT(int, country) as country,
                        [file]
        FROM LargeTable1
        UNION
        SELECT DISTINCT headquarter, country, [file]
        FROM LargeTable2
        UNION
        SELECT DISTINCT headquarter, country, [file]
        FROM LargeTable3
        UNION
        SELECT DISTINCT headquarter, country, [file]
        FROM LargeTable4
        UNION
        SELECT DISTINCT headquarter, country, [file]
        FROM LargeTable5
        UNION
        SELECT DISTINCT headquarter, country, [file]
        FROM LargeTable6
        UNION
        SELECT DISTINCT headquarter, country, [file]
        FROM LargeTable7
    ) TC

    -- Keep only headquarters that are missing or logically deleted.
    SELECT RIGHT('000' + CAST(st.headquarter AS VARCHAR(3)), 3) as headquarter,
           MAX(s.deletionDate) as deletionDate,
           STUFF
           (
               (SELECT DISTINCT ', ' + st2.countryFile
                FROM @headquartersFiles st2
                WHERE st2.headquarter = st.headquarter
                FOR XML PATH('')),
               1,
               2,   -- remove the leading ', '
               ''
           ) countryFile
    FROM @headquartersFiles as st
    LEFT JOIN headquarters s ON CONVERT(int, s.headquarter) = st.headquarter
    WHERE s.headquarter IS NULL
       OR s.deletionDate IS NOT NULL
    GROUP BY st.headquarter
END
This stored procedure's performance isn't good enough for our application. It currently takes around 50 seconds to complete, with the following total rows for each table (just to give you an idea of the sizes):
What can I do to improve performance? I've tried the following, without much difference:
I've also thought about inserting these missing headquarters into a permanent table after the LargeTables change, but the Headquarters table can change more often, and I would rather not have to change its module to keep these things tidy and updated. But if it's the best possible alternative, I'd go for it.
Thanks
Take this filter:
LEFT JOIN headquarters s ON CONVERT(int, s.headquarter) = st.headquarter
WHERE s.headquarter IS NULL
OR s.deletionDate IS NOT NULL
And add it to each individual query in the union, then insert into @headquartersFiles.
It might seem like this adds a lot more filters, but it will actually speed things up because you are filtering before you start processing the union.
Also take out all your DISTINCTs. It probably won't speed things up, but they are redundant because you are doing a UNION and not a UNION ALL.
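A minimal sketch of what that could look like, showing only two of the seven branches. It assumes LargeTable1 stores the codes as '000' strings and LargeTable2 stores them as int (the question does not say which tables use which type), and it declares the table variable with a headquarter column for illustration:

```sql
DECLARE @headquartersFiles TABLE
(
    headquarter int,
    countryFile varchar(MAX)
);

INSERT INTO @headquartersFiles (headquarter, countryFile)
SELECT TC.headquarter, CONCAT(TC.country, ' (', TC.[file], ')')
FROM
(
    -- Branch for a table that stores the codes as strings (assumed)
    SELECT CONVERT(int, lt.headquarter) AS headquarter,
           CONVERT(int, lt.country)     AS country,
           lt.[file]
    FROM LargeTable1 lt
    LEFT JOIN headquarters s
           ON CONVERT(int, s.headquarter) = CONVERT(int, lt.headquarter)
    WHERE s.headquarter IS NULL
       OR s.deletionDate IS NOT NULL

    UNION   -- UNION already removes duplicates, so no DISTINCT is needed

    -- Branch for a table that stores the codes as int (assumed)
    SELECT lt.headquarter, lt.country, lt.[file]
    FROM LargeTable2 lt
    LEFT JOIN headquarters s
           ON CONVERT(int, s.headquarter) = lt.headquarter
    WHERE s.headquarter IS NULL
       OR s.deletionDate IS NOT NULL
) TC;
```

Each branch now discards rows for live headquarters before the UNION has to sort or hash them, which is where the saving would come from if most headquarters are still present.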
I'd try doing the filtering on each individual table first. You just need to account for the fact that a headquarter might appear in one table but not another. You can do this like so:
SELECT
    headquarter
FROM
    (
    SELECT DISTINCT
        LT.headquarter,   -- qualified: both LT and HQ have a headquarter column
        'table1' AS large_table
    FROM
        LargeTable1 LT
    LEFT OUTER JOIN Headquarters HQ ON HQ.headquarter = LT.headquarter
    WHERE
        HQ.headquarter IS NULL OR
        HQ.deletionDate IS NOT NULL
    UNION ALL
    SELECT DISTINCT
        LT.headquarter,
        'table2' AS large_table
    FROM
        LargeTable2 LT
    LEFT OUTER JOIN Headquarters HQ ON HQ.headquarter = LT.headquarter
    WHERE
        HQ.headquarter IS NULL OR
        HQ.deletionDate IS NOT NULL
    UNION ALL
    ...
    ) SQ
GROUP BY headquarter
HAVING COUNT(*) = 5
That would make sure that it's missing from all five tables.
Table variables often perform badly because SQL Server does not generate statistics for them. Try a temp table instead, and if headquarter + country + file is unique in the temp table, add a unique constraint in the temp table definition (this creates a unique index; it is nonclustered by default, so declare it UNIQUE CLUSTERED or use a primary key if you want a clustered index). You can create indexes on a temp table after creating it, but for various reasons SQL Server may ignore them.
Edit: as it turns out, you can in fact create indexes on table variables, even non-unique ones in SQL Server 2014+.
Secondly, try not to use functions in your joins or WHERE clauses; doing so often causes performance problems.
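As a rough sketch of the temp-table suggestion, assuming the three shared columns are kept separate and that the file column fits in varchar(100) (its real type is not given in the question). The second statement shows the inline index syntax for table variables, with a made-up index name:

```sql
-- Column types are assumptions; adjust them to the real schema.
CREATE TABLE #headquartersFiles
(
    headquarter int          NOT NULL,
    country     int          NOT NULL,
    [file]      varchar(100) NOT NULL,
    -- Left unnamed so that concurrent sessions do not collide on a constraint name.
    UNIQUE CLUSTERED (headquarter, country, [file])
);

-- SQL Server 2014 or later: a table variable can declare an index inline.
DECLARE @headquartersFiles TABLE
(
    headquarter int          NOT NULL,
    countryFile varchar(200) NOT NULL,   -- assumed length
    INDEX ix_headquarter NONCLUSTERED (headquarter)
);
```

Unlike the table variable, the temp table gets column statistics, so the optimizer has a better chance of picking a sensible join strategy against headquarters.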
Do the filtering at each step. But first, modify the headquarters table so it has the right type for what you need, along with an index:
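-- Computed int column so the join does not need a function call
alter table headquarters add headquarter_int as (cast(headquarter as int));
create index idx_headquarters_int on headquarters(headquarter_int);

-- Keep rows with no live (non-deleted) headquarters row,
-- i.e. the headquarter is missing or logically deleted
SELECT DISTINCT headquarter, country, [file]
FROM LargeTable5 lt5
WHERE NOT EXISTS (SELECT 1
                  FROM headquarters s
                  WHERE s.headquarter_int = lt5.headquarter and s.deletiondate is null
                 );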
Then, you want an index on LargeTable5(headquarter, country, file).
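For illustration, that index could be created like this; the index name is made up, and [file] is bracketed because FILE is a reserved word in T-SQL:

```sql
-- Covers all three columns the query reads, so the base table need not be touched.
CREATE INDEX idx_LargeTable5_headquarter_country_file
    ON LargeTable5 (headquarter, country, [file]);
```

Since the large tables are truncated and reloaded in batches, it may be cheaper to create the index after each load than to maintain it during the inserts.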
This should take less than 5 seconds to run. If so, then construct the full query, being sure that the types in the correlated subquery match and that you have the right index on the full table. Use union to remove duplicates between the tables.
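A sketch of how the full query might be assembled, showing two of the seven branches. It assumes the computed column headquarter_int exists on headquarters, that LargeTable1 stores the codes as '000' strings and that LargeTable5 stores them as int; none of those type details are confirmed by the question:

```sql
-- Branch for a table with string codes (assumed): convert once so the types match
SELECT CONVERT(int, lt1.headquarter) AS headquarter,
       CONVERT(int, lt1.country)     AS country,
       lt1.[file]
FROM LargeTable1 lt1
WHERE NOT EXISTS (SELECT 1
                  FROM headquarters s
                  WHERE s.headquarter_int = CONVERT(int, lt1.headquarter)
                    AND s.deletiondate IS NULL)   -- no live headquarters row

UNION   -- removes duplicates within and between the tables

-- Branch for a table with int codes (assumed): no conversion needed
SELECT lt5.headquarter, lt5.country, lt5.[file]
FROM LargeTable5 lt5
WHERE NOT EXISTS (SELECT 1
                  FROM headquarters s
                  WHERE s.headquarter_int = lt5.headquarter
                    AND s.deletiondate IS NULL);
```

The remaining tables would follow whichever of the two patterns matches their column types.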
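The real answer is to create a separate INSERT statement for each table, with the caveat that the data being inserted must not already exist in the destination table.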
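One way to read that suggestion, sketched with a hypothetical temp table as the destination and only the first two large tables shown. The column types, and which table needs conversion, are assumptions:

```sql
CREATE TABLE #headquartersFiles
(
    headquarter int          NOT NULL,
    country     int          NOT NULL,
    [file]      varchar(100) NOT NULL   -- assumed type
);

-- First table: the destination is empty, so only DISTINCT is needed.
INSERT INTO #headquartersFiles (headquarter, country, [file])
SELECT DISTINCT CONVERT(int, lt.headquarter), CONVERT(int, lt.country), lt.[file]
FROM LargeTable1 lt;

-- Later tables: skip combinations that are already in the destination.
INSERT INTO #headquartersFiles (headquarter, country, [file])
SELECT DISTINCT lt.headquarter, lt.country, lt.[file]
FROM LargeTable2 lt
WHERE NOT EXISTS (SELECT 1
                  FROM #headquartersFiles t
                  WHERE t.headquarter = lt.headquarter
                    AND t.country     = lt.country
                    AND t.[file]      = lt.[file]);
```

The same NOT EXISTS pattern would repeat for LargeTable3 through LargeTable7, and the headquarters filter from the other answers could be added to each statement.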