Distinct and Inner join when millions of records exist
My SQL query is like this:
INSERT INTO staging.lps_data
(
col1
,col2
,col3
,col4
,col5
)
SELECT DISTINCT
col1
,col2
,col3
,col4
,col5
FROM tbl1 r WITH ( NOLOCK )
INNER JOIN tbl2 p WITH ( NOLOCK ) ON p.col1= r.col1
INNER JOIN tbl3 l WITH ( NOLOCK ) ON l.col2 = r.col2
WHERE r.col1 NOT IN ( 'Foreclosure Deed',
'Foreclosure Deed - Judicial',
'Foreclosure RESPA',
'Foreclosure Vendor Assignment Review',
'Foreclosure Stop',
'Foreclosure Screenprints Other',
'Foreclosure Sale Audit',
'Foreclosure Property Preservation',
'Foreclosure Acquisition',
'Foreclosure Notices Attorney Certification' )
AND ( r.col1 LIKE 'foreclosure%'
OR r.col1 = 'Vesting CT'
);
My tbl1 contains 100 million records, tbl2 contains 100 million records, and tbl3 contains 1,000 million records. I went through the estimated execution plan; most of the cost shows up in the Distinct. Note: I have applied proper indexing on the tables.
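As a side note, the estimated plan's cost percentages can be misleading; a common way in SQL Server to measure the real I/O and CPU of the statement (sketched here, to be run around the actual query) is:

```sql
-- Report logical reads and CPU/elapsed time for each statement that follows
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- ... run the INSERT ... SELECT DISTINCT query here ...

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

The logical-read counts in the Messages tab show which table (or index) is actually driving the cost.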
I tried to solve this using a batch process, something like below:
INSERT INTO TEMP1
SELECT SK_ID from tbl1 r where ( r.processname LIKE 'foreclosure%' OR r.processname = 'Vesting CT')
EXCEPT
SELECT SK_ID from tbl1 r where r.processname IN ( 'Foreclosure Deed','Foreclosure Deed - Judicial', -- IN, not NOT IN: EXCEPT must subtract the excluded names
'Foreclosure RESPA',
'Foreclosure Vendor Assignment Review',
'Foreclosure Stop',
'Foreclosure Screenprints Other',
'Foreclosure Sale Audit',
'Foreclosure Property Preservation',
'Foreclosure Acquisition',
'Foreclosure Notices Attorney Certification' )
-- Load data into staging table in batch mode
DECLARE @STARTID BIGINT=1, @LASTID BIGINT, @ENDID BIGINT;
DECLARE @SPLITCONFIG BIGINT = 1000; -- process 1000 records per batch
SELECT @LASTID = MAX(ID) FROM TEMP1
WHILE @STARTID < @LASTID
BEGIN
IF (@STARTID + @SPLITCONFIG > @LASTID)
SET @ENDID = @LASTID + 1 -- +1 so the final ID is included, since the batch filter uses SK.ID < @ENDID
ELSE
SET @ENDID = @STARTID + @SPLITCONFIG
INSERT INTO staging.lps_data
( col1
,col2
,col3
,col4
,col5)
SELECT DISTINCT
col1
,col2
,col3
,col4
,col5
FROM tbl1 r WITH (NOLOCK)
INNER JOIN TEMP1 SK WITH(NOLOCK) ON (r.SK_ID=SK.SK_ID AND SK.ID >=@STARTID AND SK.ID < @ENDID)
INNER JOIN tbl2 p WITH (NOLOCK) ON p.refinfoidentifier = r.refinfoidentifier
INNER JOIN tbl3 l WITH (NOLOCK) ON l.loaninfoidentifier = r.loaninfoidentifier
SET @STARTID = @ENDID
END
With the first approach my server crashed with out-of-memory; with the second approach I was able to process all the records in 4 hours.
Please suggest anything else I can do to complete this process in less than an hour.
Not sure what indexing you have in your table, but try changing your SELECT to use ROW_NUMBER() instead of DISTINCT, like:
SELECT
col1
,col2
,col3
,col4
,col5 FROM
(
SELECT
col1
,col2
,col3
,col4
,col5
,ROW_NUMBER() OVER(PARTITION BY col1, col2, col3, col4, col5 ORDER BY r.col1) AS rn -- partition by all selected columns so rn = 1 keeps one row per distinct combination
FROM tbl1 r
INNER JOIN tbl2 p ON p.col1= r.col1
INNER JOIN tbl3 l ON l.col2 = r.col2
WHERE
r.col1 NOT IN ( 'Foreclosure Deed',
'Foreclosure Deed - Judicial',
'Foreclosure RESPA',
'Foreclosure Vendor Assignment Review',
'Foreclosure Stop',
'Foreclosure Screenprints Other',
'Foreclosure Sale Audit',
'Foreclosure Property Preservation',
'Foreclosure Acquisition',
'Foreclosure Notices Attorney Certification' )
AND ( r.col1 LIKE 'foreclosure%' OR r.col1 = 'Vesting CT')) xxx
WHERE rn = 1;
Try inserting the strings of the NOT IN into a NEW_TABLE, left join it with tbl1, and filter with WHERE the NEW_TABLE column IS NULL (best using an ID or integer instead of strings), or use NOT EXISTS (SELECT 1 FROM NEW_TABLE WHERE ...).
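A sketch of that idea, assuming a hypothetical lookup table NEW_TABLE(name) and the column names from the question:

```sql
-- Hypothetical lookup table holding the excluded process names
CREATE TABLE NEW_TABLE (name VARCHAR(100) PRIMARY KEY);

INSERT INTO NEW_TABLE (name) VALUES
('Foreclosure Deed'), ('Foreclosure Deed - Judicial'), ('Foreclosure RESPA'),
('Foreclosure Vendor Assignment Review'), ('Foreclosure Stop'),
('Foreclosure Screenprints Other'), ('Foreclosure Sale Audit'),
('Foreclosure Property Preservation'), ('Foreclosure Acquisition'),
('Foreclosure Notices Attorney Certification');

-- NOT EXISTS replaces the long NOT IN list; a seek on NEW_TABLE's primary key
-- does the exclusion instead of ten string comparisons per row
SELECT DISTINCT r.col1, r.col2, r.col3, r.col4, r.col5
FROM tbl1 r
INNER JOIN tbl2 p ON p.col1 = r.col1
INNER JOIN tbl3 l ON l.col2 = r.col2
WHERE (r.col1 LIKE 'foreclosure%' OR r.col1 = 'Vesting CT')
  AND NOT EXISTS (SELECT 1 FROM NEW_TABLE n WHERE n.name = r.col1);
```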
Bye,
Igor
So what is the DISTINCT for when there cannot be duplicates? Remove it and your query should be way faster.
As you say the columns are unique in the tables, there will certainly be indexes on them, so it is very likely the tables themselves don't get read, only the indexes, as they already contain all the required data. This is as good as it can get. I see no means for optimization here.
(Of course, with large tables and indexes one can always think about partitioning to get faster data access.)
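A minimal sketch of such partitioning in SQL Server, assuming tbl1 has a BIGINT key SK_ID and using purely illustrative boundary values:

```sql
-- Illustrative range partition function; the boundaries are made up for the example
CREATE PARTITION FUNCTION pf_sk_range (BIGINT)
AS RANGE LEFT FOR VALUES (25000000, 50000000, 75000000);

-- Map every partition to the same filegroup for simplicity
CREATE PARTITION SCHEME ps_sk_range
AS PARTITION pf_sk_range ALL TO ([PRIMARY]);

-- Creating the clustered index on the scheme partitions the table by SK_ID
CREATE CLUSTERED INDEX cix_tbl1_sk ON tbl1 (SK_ID) ON ps_sk_range (SK_ID);
```

In practice the boundaries would be chosen from the real key distribution, and separate filegroups would be used if the goal is to spread I/O.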