简体   繁体   中英

Optimize while loop for large table In SQL Server 2008/12

Table Name [dbo].[SourceData] has 19 millions rows.

I am running while loop against this table and based on match criteria it will load data into another table. While loop is taking longer than ever.

Sample code is below. Sourcedata table has seqno which is unique identity column (primary key). Also Firstname, lastname, address, emailaddress have individual NC index.

create table #holdscore
(
     seqno bigint, 
     associatedseq bigint, 
     scrore int, 
     status varchar(20),
     customerid varchar(30)

     CONSTRAINT [PK_SourceScores] 
         PRIMARY KEY CLUSTERED (seqno ASC, associatedseq ASC) 
) 

Create table #loop 
(
     seqno bigint primary key clustered, 
     Flag varchar(1) NULL
)

Insert #loop (seqno)
   select distinct TOP 1000 seqno     
   from [dbo].[SourceData] 
   order by seqno

Declare @seqno bigint
Declare @firstname Nvarchar(100)
Declare @lastname Nvarchar(100)
Declare @phonenum nvarchar(100)
Declare @emailadd Nvarchar(100)
Declare @Address Nvarchar(250)
Declare @MiddleName nvarchar(50)
Declare @CCExpYYMM nvarchar(4)
Declare @CCLastFour nvarchar(4)

While ((select count(*) from  #Loop where flag is null)>0)
Begin 
    Select top 1 @seqno = seqno from #Loop  where flag is null

    Select @firstname = [FirstName],
           @lastname = [LastName],
           @phonenum = [PhoneNorm],
           @emailadd = [EmailAddress],
           @Address = [AddressNorm],
           @MiddleName = [MiddleName],
           @CCExpYYMM  = [CCExpYYMM],
           @CCLastFour = [CCLastFour]
     from  [dbo].[SourceData]
     where seqno = @seqno

     INSERT #holdscore
         select 
             orginalseqno, associatedseq, score, 
             case when score >= 80 Then 'Match'
                  when score < 80 Then 'Review' 
             end as Status,  
             customerid 
         from
             (select 
                  @seqno orginalseqno, seqno as associatedseq,
                  customerid,
                  case
                      when [FirstName] = @firstname 
                       and [LastName] = @lastname 
                       and [PhoneNorm] = @phonenum 
                       and [EmailAddress] = @emailadd
                       and [AddressNorm] = @Address 
                       and [MiddleName] = @MiddleName 
                       and [CCExpYYMM]  = @CCExpYYMM  
                       and [CCLastFour] = @CCLastFour THEN '100'

                     when [FirstName] = @firstname 
                      and [LastName] = @lastname 
                      and [PhoneNorm] = @phonenum 
                      and [EmailAddress] = @emailadd
                      and [AddressNorm] = @Address 
                      and [MiddleName] = @MiddleName 
                      and [CCExpYYMM]  = @CCExpYYMM THEN '99'

                    when [FirstName] = @firstname and [LastName]=@lastname and [PhoneNorm]=@phonenum and [EmailAddress]=@emailadd
        and [AddressNorm] = @Address and [MiddleName] = @MiddleName and [CCLastFour] = @CCLastFour THEN '99'
    WHEN [FirstName]=@firstname and [LastName]=@lastname and [PhoneNorm]=@phonenum and [EmailAddress]=@emailadd
        and [AddressNorm] = @Address and [MiddleName] = @MiddleName                             Then '98'
    WHEN [FirstName]=@firstname and [LastName]=@lastname and [PhoneNorm]=@phonenum and [EmailAddress]=@emailadd
        and [AddressNorm] = @Address                                                            Then '93'
    WHEN [FirstName]=@firstname and [LastName]=@lastname and [PhoneNorm]=@phonenum and [EmailAddress]=@emailadd  Then '83'
    WHEN [FirstName]=@firstname and [LastName]=@lastname and [PhoneNorm]=@phonenum                               Then '68'
    WHEN [FirstName]=@firstname and [LastName]=@lastname and [EmailAddress]=@emailadd                            Then '63'
    WHEN [FirstName]=@firstname and [LastName]=@lastname and [PhoneNorm]=@phonenum and [AddressNorm] = @Address  Then '78'
    WHEN [FirstName]=@firstname and [LastName]=@lastname and [EmailAddress]=@emailadd and [AddressNorm] = @Address  Then '73'
    WHEN [FirstName]=@firstname and [LastName]=@lastname and [AddressNorm] = @Address                               Then '58'
    WHEN [FirstName]=@firstname and [PhoneNorm]=@phonenum and [EmailAddress]=@emailadd and [AddressNorm] = @Address and [MiddleName] = @MiddleName  Then '73'
    WHEN [LastName]=@lastname and [PhoneNorm]=@phonenum and [EmailAddress]=@emailadd and [AddressNorm] = @Address and [MiddleName] = @MiddleName  THEN '75'
    WHEN [LastName]=@lastname and [PhoneNorm]=@phonenum and [EmailAddress]=@emailadd and [AddressNorm] = @Address                                 Then '70'
    WHEN [LastName]=@lastname and [PhoneNorm]=@phonenum and [EmailAddress]=@emailadd                                                              THEN '60' 
    END AS Score
  From   [dbo].[SourceData]

  )A 
  where A.Score is not null
  OPTION (MAXDOP 8)

Update #Loop
set Flag = 'Y'
where seqno =@seqno and Flag is null


end

For 1000 unique seqno it takes more than 1 hour to complete. I need to compare 19 million rows with one another and load it into table. Please help me to make this process faster. So that I can load data into timely manner. SSIS will also work.

build on this

select s1.seqno as orginalseqno, s2,seqno as associatedseq, 100, 'Match', s2.customerid
 from [SourceData] as s1
 join [SourceData] as s2
   on s2.[FirstName]    = s1.firstname 
  and s2.[LastName]     = s1.lastname 
  and s2.[PhoneNorm]    = s1.phonenum 
  and s2.[EmailAddress] = s1.emailadd
  and s2.[AddressNorm]  = s1.Address 
  and s2.[MiddleName]   = s1.MiddleName 
  and s2.[CCExpYYMM]    = s1.CCExpYYMM  
  and s2.[CCLastFour]   = s1.CCLastFour 

From there go down in score and left join to the insert table so you can avoid inserting data that is already present with a higher score. In general don't try and build complex queries that eliminate higher score unless it is a very simple query like 99 is s2.[CCLastFour] <> s1.CCLastFour.

My answer is very much like Frisbee's (with UNION ALLs between each score group of tests) so I won't bother posting the SQL. What I will add though is that while this is the solution you probably want, even this set-based approach is going to be a very beefy query when run over a 19 million row table. As far as I can tell, you're trying to find degrees of association or similarity between the people in your table. You want to compare each person with every other person if I understand rightly. If the match on name and address and DOB (or whatever) score them 100, make the next test slightly less rigorous and assign a lower score and so on. As the tests become weaker, the self join becomes more and more like a cross join - you'll get more hits. If you have a low degree of cardinality (lots of repeating values) in the columns you're testing, you could end up generating many millions (or billions, or even trillions) of rows. Be careful to only test for associations that are going to return results of practical value. For (an extreme) example if you tested similarity based on sex alone you'd end up effectively with two 9.5 million row cross joins.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM