The table [dbo].[SourceData] has 19 million rows. I am running a WHILE loop against this table, and based on match criteria it loads data into another table. The loop is taking far too long. Sample code is below. The SourceData table has seqno, which is a unique identity column (primary key). FirstName, LastName, Address, and EmailAddress each have their own nonclustered index.
create table #holdscore
(
seqno bigint,
associatedseq bigint,
score int,
status varchar(20),
customerid varchar(30),
CONSTRAINT [PK_SourceScores]
PRIMARY KEY CLUSTERED (seqno ASC, associatedseq ASC)
)
Create table #loop
(
seqno bigint primary key clustered,
Flag varchar(1) NULL
)
Insert #loop (seqno)
select distinct TOP 1000 seqno
from [dbo].[SourceData]
order by seqno
Declare @seqno bigint
Declare @firstname Nvarchar(100)
Declare @lastname Nvarchar(100)
Declare @phonenum nvarchar(100)
Declare @emailadd Nvarchar(100)
Declare @Address Nvarchar(250)
Declare @MiddleName nvarchar(50)
Declare @CCExpYYMM nvarchar(4)
Declare @CCLastFour nvarchar(4)
While ((select count(*) from #Loop where flag is null)>0)
Begin
Select top 1 @seqno = seqno from #Loop where flag is null
Select @firstname = [FirstName],
@lastname = [LastName],
@phonenum = [PhoneNorm],
@emailadd = [EmailAddress],
@Address = [AddressNorm],
@MiddleName = [MiddleName],
@CCExpYYMM = [CCExpYYMM],
@CCLastFour = [CCLastFour]
from [dbo].[SourceData]
where seqno = @seqno
INSERT #holdscore
select
orginalseqno, associatedseq, score,
case when score >= 80 Then 'Match'
when score < 80 Then 'Review'
end as Status,
customerid
from
(select
@seqno orginalseqno, seqno as associatedseq,
customerid,
case
when [FirstName] = @firstname
and [LastName] = @lastname
and [PhoneNorm] = @phonenum
and [EmailAddress] = @emailadd
and [AddressNorm] = @Address
and [MiddleName] = @MiddleName
and [CCExpYYMM] = @CCExpYYMM
and [CCLastFour] = @CCLastFour THEN '100'
when [FirstName] = @firstname
and [LastName] = @lastname
and [PhoneNorm] = @phonenum
and [EmailAddress] = @emailadd
and [AddressNorm] = @Address
and [MiddleName] = @MiddleName
and [CCExpYYMM] = @CCExpYYMM THEN '99'
when [FirstName] = @firstname and [LastName]=@lastname and [PhoneNorm]=@phonenum and [EmailAddress]=@emailadd
and [AddressNorm] = @Address and [MiddleName] = @MiddleName and [CCLastFour] = @CCLastFour THEN '99'
WHEN [FirstName]=@firstname and [LastName]=@lastname and [PhoneNorm]=@phonenum and [EmailAddress]=@emailadd
and [AddressNorm] = @Address and [MiddleName] = @MiddleName Then '98'
WHEN [FirstName]=@firstname and [LastName]=@lastname and [PhoneNorm]=@phonenum and [EmailAddress]=@emailadd
and [AddressNorm] = @Address Then '93'
WHEN [FirstName]=@firstname and [LastName]=@lastname and [PhoneNorm]=@phonenum and [EmailAddress]=@emailadd Then '83'
-- CASE returns the first branch that matches, so the more specific tests (78, 73) must precede the looser ones (68, 63)
WHEN [FirstName]=@firstname and [LastName]=@lastname and [PhoneNorm]=@phonenum and [AddressNorm] = @Address Then '78'
WHEN [FirstName]=@firstname and [LastName]=@lastname and [EmailAddress]=@emailadd and [AddressNorm] = @Address Then '73'
WHEN [FirstName]=@firstname and [LastName]=@lastname and [PhoneNorm]=@phonenum Then '68'
WHEN [FirstName]=@firstname and [LastName]=@lastname and [EmailAddress]=@emailadd Then '63'
WHEN [FirstName]=@firstname and [LastName]=@lastname and [AddressNorm] = @Address Then '58'
WHEN [FirstName]=@firstname and [PhoneNorm]=@phonenum and [EmailAddress]=@emailadd and [AddressNorm] = @Address and [MiddleName] = @MiddleName Then '73'
WHEN [LastName]=@lastname and [PhoneNorm]=@phonenum and [EmailAddress]=@emailadd and [AddressNorm] = @Address and [MiddleName] = @MiddleName THEN '75'
WHEN [LastName]=@lastname and [PhoneNorm]=@phonenum and [EmailAddress]=@emailadd and [AddressNorm] = @Address Then '70'
WHEN [LastName]=@lastname and [PhoneNorm]=@phonenum and [EmailAddress]=@emailadd THEN '60'
END AS Score
From [dbo].[SourceData]
)A
where A.Score is not null
OPTION (MAXDOP 8)
Update #Loop
set Flag = 'Y'
where seqno =@seqno and Flag is null
end
For 1000 unique seqno values it takes more than an hour to complete. I need to compare 19 million rows with one another and load the results into a table. Please help me make this process faster so that I can load the data in a timely manner. An SSIS solution would also work.
Build on this:
select s1.seqno as orginalseqno, s2.seqno as associatedseq, 100 as score, 'Match' as status, s2.customerid
from [dbo].[SourceData] as s1
join [dbo].[SourceData] as s2
on s2.[FirstName] = s1.[FirstName]
and s2.[LastName] = s1.[LastName]
and s2.[PhoneNorm] = s1.[PhoneNorm]
and s2.[EmailAddress] = s1.[EmailAddress]
and s2.[AddressNorm] = s1.[AddressNorm]
and s2.[MiddleName] = s1.[MiddleName]
and s2.[CCExpYYMM] = s1.[CCExpYYMM]
and s2.[CCLastFour] = s1.[CCLastFour]
From there, work down through the scores, left joining to the table you are inserting into so that you avoid inserting a pair that is already present with a higher score. In general, don't try to build complex predicates that exclude the higher-scoring matches, unless the exclusion is very simple, such as s2.[CCLastFour] <> s1.[CCLastFour] for the 99 tier.
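As a rough sketch of that next step (untested against the real table, and only one way the tiers might be chained), the 99 tier for "everything matches except CCLastFour" could look like the following. It assumes the #holdscore table from the question already holds the score-100 pairs, and it uses NOT EXISTS as the anti-join, which does the same job here as a LEFT JOIN with an IS NULL check.

```sql
-- Tier 99: every column matches except CCLastFour.
-- Assumption: #holdscore (from the question) already contains the score-100 pairs.
-- Values are inserted positionally, in table order:
-- seqno, associatedseq, score, status, customerid.
INSERT INTO #holdscore
SELECT s1.seqno, s2.seqno, 99, 'Match', s2.customerid
FROM [dbo].[SourceData] AS s1
JOIN [dbo].[SourceData] AS s2
  ON  s2.[FirstName]    = s1.[FirstName]
  AND s2.[LastName]     = s1.[LastName]
  AND s2.[PhoneNorm]    = s1.[PhoneNorm]
  AND s2.[EmailAddress] = s1.[EmailAddress]
  AND s2.[AddressNorm]  = s1.[AddressNorm]
  AND s2.[MiddleName]   = s1.[MiddleName]
  AND s2.[CCExpYYMM]    = s1.[CCExpYYMM]
-- Skip any pair that was already stored by a higher-scoring tier.
WHERE NOT EXISTS (
    SELECT 1
    FROM #holdscore AS h
    WHERE h.seqno = s1.seqno
      AND h.associatedseq = s2.seqno
);
```

The anti-join also covers rows where CCLastFour is NULL on one side, which a plain <> comparison would silently drop. Each lower tier would repeat the pattern with one fewer join condition.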
My answer is very much like Frisbee's (with UNION ALLs between each score group of tests), so I won't bother posting the SQL. What I will add is that while this is probably the solution you want, even this set-based approach is going to be a very beefy query when run over a 19 million row table.
As far as I can tell, you're trying to find degrees of association or similarity between the people in your table. If I understand rightly, you want to compare each person with every other person: if they match on name, address, and DOB (or whatever), score them 100, then make the next test slightly less rigorous and assign a lower score, and so on.
As the tests become weaker, the self join becomes more and more like a cross join, so you'll get more hits. If the columns you're testing have low cardinality (lots of repeating values), you could end up generating many millions (or billions, or even trillions) of rows. Be careful to test only for associations that are going to return results of practical value. For an extreme example, if you tested similarity based on sex alone, you'd effectively end up with two 9.5 million row cross joins.
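One way to check this before running a tier is to count how many pairs it would produce. The query below is a sketch, assuming the column names from the question and using the (FirstName, LastName, PhoneNorm) tier as an example. A group of n rows with identical values joins to itself as n * n pairs, self-pairs included, which is also what the original loop produces.

```sql
-- Estimate the number of pairs a self-join on
-- (FirstName, LastName, PhoneNorm) would return, without running the join.
SELECT SUM(g.n * g.n) AS estimated_pairs,   -- n rows per group -> n * n pairs
       MAX(g.n)       AS largest_group
FROM (
    SELECT COUNT_BIG(*) AS n                -- bigint, so n * n does not overflow int
    FROM [dbo].[SourceData]
    -- An equality join never matches NULLs, but GROUP BY would lump them
    -- into one group, so exclude them from the estimate.
    WHERE [FirstName] IS NOT NULL
      AND [LastName]  IS NOT NULL
      AND [PhoneNorm] IS NOT NULL
    GROUP BY [FirstName], [LastName], [PhoneNorm]
) AS g;
```

For scale, the sex-only example would give two groups of about 9.5 million rows each, or roughly 9 x 10^13 pairs per group. If estimated_pairs for a tier is anywhere near that range, the tier is probably too weak to be worth materializing.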