
Distinct and Inner join when millions of records exist

My SQL query looks like this:

 INSERT INTO staging.lps_data
    (
          col1
          ,col2
          ,col3
          ,col4
          ,col5

    )
            SELECT DISTINCT
                    col1
          ,col2
          ,col3
          ,col4
          ,col5
            FROM    tbl1 r WITH ( NOLOCK )
                    INNER JOIN tbl2 p WITH ( NOLOCK ) ON p.col1= r.col1
                    INNER JOIN tbl3 l WITH ( NOLOCK ) ON l.col2 = r.col2
            WHERE   r.col1 NOT IN ( 'Foreclosure Deed',
                                           'Foreclosure Deed - Judicial',
                                           'Foreclosure RESPA',
                                           'Foreclosure Vendor Assignment Review',
                                           'Foreclosure Stop',
                                           'Foreclosure Screenprints Other',
                                           'Foreclosure Sale Audit',
                                           'Foreclosure Property Preservation',
                                           'Foreclosure Acquisition',
                                           'Foreclosure Notices Attorney Certification' )
                    AND ( r.col1  LIKE 'foreclosure%'
                          OR r.col1  = 'Vesting CT'
                        );

tbl1 contains 100 million records, tbl2 contains 100 million records, and tbl3 contains 1,000 million records. I went through the estimated execution plan, and most of the cost shows up in the Distinct operator. Note: I have applied proper indexing on the tables.

I tried to solve this with a batch process, something like the following:

       -- Collect the keys of the rows to load: foreclosure-related processes
       -- minus the excluded process names
       INSERT INTO TEMP1 (SK_ID)
       SELECT SK_ID
       FROM   tbl1 r
       WHERE  ( r.processname LIKE 'foreclosure%'
                OR r.processname = 'Vesting CT' )
       EXCEPT
       SELECT SK_ID
       FROM   tbl1 r
       WHERE  r.processname IN ( 'Foreclosure Deed',
                'Foreclosure Deed - Judicial',
                'Foreclosure RESPA',
                'Foreclosure Vendor Assignment Review',
                'Foreclosure Stop',
                'Foreclosure Screenprints Other',
                'Foreclosure Sale Audit',
                'Foreclosure Property Preservation',
                'Foreclosure Acquisition',
                'Foreclosure Notices Attorney Certification' );


       -- Load data into the staging table in batches
       DECLARE @STARTID BIGINT = 1, @LASTID BIGINT, @ENDID BIGINT;
       DECLARE @SPLITCONFIG BIGINT = 1000; -- Process 1000 records per batch

       -- Exclusive upper bound, so the row with the highest ID is also processed
       SELECT @LASTID = MAX(ID) + 1 FROM TEMP1;

       WHILE @STARTID < @LASTID
       BEGIN
           -- Each batch covers the half-open ID range [@STARTID, @ENDID)
           IF (@STARTID + @SPLITCONFIG > @LASTID)
               SET @ENDID = @LASTID;
           ELSE
               SET @ENDID = @STARTID + @SPLITCONFIG;

           INSERT INTO staging.lps_data
               ( col1
               ,col2
               ,col3
               ,col4
               ,col5 )
           SELECT DISTINCT
                   col1
                   ,col2
                   ,col3
                   ,col4
                   ,col5
           FROM    tbl1 r WITH (NOLOCK)
                   INNER JOIN TEMP1 SK WITH (NOLOCK) ON (r.SK_ID = SK.SK_ID AND SK.ID >= @STARTID AND SK.ID < @ENDID)
                   INNER JOIN tbl2 p WITH (NOLOCK) ON p.refinfoidentifier = r.refinfoidentifier
                   INNER JOIN tbl3 l WITH (NOLOCK) ON l.loaninfoidentifier = r.loaninfoidentifier;

           -- The next batch starts where this one ended
           SET @STARTID = @ENDID;
       END

With the first approach my server crashed with an out-of-memory error. With the second approach I was able to process all the records in 4 hours.

Please suggest anything else I can do to complete this process in less than an hour.

I'm not sure what indexing you have on your tables, but try changing your SELECT to use ROW_NUMBER() instead of DISTINCT, like this:

        SELECT 
       col1
      ,col2
      ,col3
      ,col4
      ,col5 FROM
      (
        SELECT 
       col1
      ,col2
      ,col3
      ,col4
      ,col5
      ,ROW_NUMBER() OVER(PARTITION BY col1, col2, col3, col4, col5 ORDER BY r.col1) as rn
        FROM    tbl1 r
                INNER JOIN tbl2 p ON p.col1= r.col1
                INNER JOIN tbl3 l ON l.col2 = r.col2
    WHERE
    r.col1 NOT IN ( 'Foreclosure Deed',
                                       'Foreclosure Deed - Judicial',
                                       'Foreclosure RESPA',
                                       'Foreclosure Vendor Assignment Review',
                                       'Foreclosure Stop',
                                       'Foreclosure Screenprints Other',
                                       'Foreclosure Sale Audit',
                                       'Foreclosure Property Preservation',
                                       'Foreclosure Acquisition',
                                       'Foreclosure Notices Attorney Certification' )
                AND ( r.col1  LIKE 'foreclosure%' OR r.col1  = 'Vesting CT')) xxx
      WHERE rn = 1;
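
A small self-contained sketch of the ROW_NUMBER() pattern may help here. It uses a hypothetical table variable `@rows` as a stand-in for the joined result, so the names and sample values are assumptions, not the asker's data. The point it illustrates is that ROW_NUMBER() only removes duplicates when the window is partitioned by every column that defines a duplicate; without PARTITION BY, `rn = 1` keeps a single row of the whole result.

```sql
-- Hypothetical stand-in for the joined result set.
DECLARE @rows TABLE (col1 VARCHAR(50), col2 INT, col3 INT);

INSERT INTO @rows (col1, col2, col3)
VALUES ('Foreclosure Hold', 1, 10),
       ('Foreclosure Hold', 1, 10),   -- exact duplicate of the row above
       ('Vesting CT',       2, 20);

-- Number the rows inside each group of identical values, then keep the first.
SELECT col1, col2, col3
FROM (
    SELECT col1, col2, col3,
           ROW_NUMBER() OVER (PARTITION BY col1, col2, col3
                              ORDER BY (SELECT NULL)) AS rn
    FROM @rows
) AS numbered
WHERE rn = 1;
```

With the sample rows above this returns two rows, one per distinct combination. Note that the engine still has to group the rows by the partition columns, so this is not guaranteed to be cheaper than DISTINCT; comparing the execution plans is the only reliable way to tell.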


Try inserting the strings from the NOT IN list into a NEW_TABLE and LEFT JOIN it to tbl1, filtering on WHERE NEW_TABLE.col1 IS NULL (ideally using an ID or integer instead of strings), or use NOT EXISTS (SELECT 1 FROM NEW_TABLE WHERE ...).
Bye,
Igor
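
A sketch of what that suggestion could look like. The temp table name `#excluded_process` is hypothetical, and the queries assume `tbl1` with `col1` and `col2` as in the question; only the filtering on tbl1 is shown, not the joins to tbl2 and tbl3.

```sql
-- Hypothetical lookup table holding the excluded process names.
CREATE TABLE #excluded_process (col1 VARCHAR(100) NOT NULL PRIMARY KEY);

INSERT INTO #excluded_process (col1)
VALUES ('Foreclosure Deed'),
       ('Foreclosure Deed - Judicial'),
       ('Foreclosure RESPA'),
       ('Foreclosure Vendor Assignment Review'),
       ('Foreclosure Stop'),
       ('Foreclosure Screenprints Other'),
       ('Foreclosure Sale Audit'),
       ('Foreclosure Property Preservation'),
       ('Foreclosure Acquisition'),
       ('Foreclosure Notices Attorney Certification');

-- Variant 1: NOT EXISTS against the lookup table.
SELECT r.col1, r.col2
FROM   tbl1 AS r
WHERE  (r.col1 LIKE 'foreclosure%' OR r.col1 = 'Vesting CT')
  AND  NOT EXISTS (SELECT 1
                   FROM   #excluded_process AS x
                   WHERE  x.col1 = r.col1);

-- Variant 2: LEFT JOIN, keeping only the rows that found no match.
SELECT r.col1, r.col2
FROM   tbl1 AS r
       LEFT JOIN #excluded_process AS x ON x.col1 = r.col1
WHERE  x.col1 IS NULL
  AND  (r.col1 LIKE 'foreclosure%' OR r.col1 = 'Vesting CT');
```

Both variants should return the same rows. Whether either beats a literal NOT IN list depends on the plan the optimizer picks, so treat this as something to measure rather than a guaranteed win.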

  1. You are saying that col1+col2 is unique in tbl1. That means if we select these two columns from tbl1 alone, we get no duplicates.
  2. Then you use col1 to join tbl2. In tbl2 the pair col1+col3 is unique, so the join produces no duplicates either. We get unique col1+col2+col3.
  3. Then you use col2 to join tbl3. Here col2+col4+col5 is unique, so again the join produces no duplicates. We get unique col1+col2+col3+col4+col5.

So what is the DISTINCT for when there cannot be any duplicates? Remove it and your query should be much faster.
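
Before dropping DISTINCT, the uniqueness assumptions in the three steps above can be checked directly. This is a sketch that assumes the column combinations named there; each query should return no rows if the corresponding assumption holds.

```sql
-- Any row returned is a (col1, col2) pair that occurs more than once in tbl1.
SELECT col1, col2, COUNT(*) AS occurrences
FROM   tbl1
GROUP BY col1, col2
HAVING COUNT(*) > 1;

-- Same check for the (col1, col3) pair in tbl2.
SELECT col1, col3, COUNT(*) AS occurrences
FROM   tbl2
GROUP BY col1, col3
HAVING COUNT(*) > 1;

-- Same check for the (col2, col4, col5) combination in tbl3.
SELECT col2, col4, col5, COUNT(*) AS occurrences
FROM   tbl3
GROUP BY col2, col4, col5
HAVING COUNT(*) > 1;
```

On tables of this size these checks are expensive in their own right, so they are better run once, off-peak, than as part of the load. If unique constraints or unique indexes already exist on these combinations, the checks are unnecessary.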

As you say the columns are unique in the tables, there will certainly be indexes on them, so it is very likely that the tables themselves are not read at all, only the indexes, since they already contain all the required data. This is as good as it gets; I see no further room for optimization here.
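
For reference, the kind of indexes this assumes might look like the sketch below. The index names are hypothetical, and the key columns follow the uniqueness claims discussed above rather than anything the question states about its actual indexes.

```sql
-- Hypothetical unique indexes matching the assumed uniqueness of each table.
-- Each index contains every column the query needs from its table,
-- so the query can be answered from the indexes alone.
CREATE UNIQUE INDEX ux_tbl1_col1_col2      ON tbl1 (col1, col2);
CREATE UNIQUE INDEX ux_tbl2_col1_col3      ON tbl2 (col1, col3);
CREATE UNIQUE INDEX ux_tbl3_col2_col4_col5 ON tbl3 (col2, col4, col5);
```

Creating a unique index fails if the data contains duplicates, so this also acts as a check on the uniqueness assumption.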

(Of course, with large tables and indexes one can always consider partitioning to speed up data access.)


 