
Full Join using a fuzzy join but with distinct matching

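We have a table of expected payments and a table of payments made. We need to match each payment made to a single expected payment, but we allow a window of +-3 days for the payment to arrive. On top of that, the match must be one-to-one.

So imagine I have a table of expected payments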

2020-10-01
2020-10-04
2020-10-05
2020-10-20

and payments made

2020-10-02
2020-10-06
2020-10-07

The result I want is

Expected      Made
2020-10-01    2020-10-02
2020-10-04    2020-10-06
2020-10-05    2020-10-07
2020-10-20

If the payment on the 6th were removed, the result would be

Expected      Made
2020-10-01    2020-10-02
2020-10-04    2020-10-07
2020-10-05
2020-10-20

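So whether the expected payment on the 5th matches the payment on the 7th depends on whether that payment has already been matched to the 4th. Whether the 4th matches the 7th depends, in turn, on whether the 4th has been matched to the 6th.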
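
To make the dependency described above concrete, here is a minimal T-SQL sketch that loads the sample data and walks both lists in date order, pairing each expected date with the earliest unused payment inside its +-3 day window. The temp tables `#expected` and `#payment`, the column names and the table variable `@result` are assumptions made for illustration, and the sketch assumes dates are unique on each side. It is a row-by-row reference for checking results, not a claim about performance on millions of rows.

```sql
-- Hypothetical temp tables holding the sample data from the question
drop table if exists #expected;
drop table if exists #payment;
create table #expected(expd date primary key);
create table #payment(payd date primary key);

insert into #expected(expd) values ('20201001'), ('20201004'), ('20201005'), ('20201020');
insert into #payment(payd) values ('20201002'), ('20201006'), ('20201007');

declare @result table(expd date null, payd date null);
declare @e date = (select min(expd) from #expected);
declare @p date = (select min(payd) from #payment);

-- Walk both lists in date order until both are exhausted
while @e is not null or @p is not null
begin
    if @e is null or @p < dateadd(day, -3, @e)
    begin
        -- payment is too early for every remaining expected date: unmatched payment
        insert into @result(expd, payd) values (null, @p);
        select @p = min(payd) from #payment where payd > @p;
    end
    else if @p is null or @p > dateadd(day, 3, @e)
    begin
        -- no payment is left inside the window: unmatched expected date
        insert into @result(expd, payd) values (@e, null);
        select @e = min(expd) from #expected where expd > @e;
    end
    else
    begin
        -- payment falls inside the window: pair them and advance both sides
        insert into @result(expd, payd) values (@e, @p);
        select @e = min(expd) from #expected where expd > @e;
        select @p = min(payd) from #payment where payd > @p;
    end
end

select expd as Expected, payd as Made
from @result
order by coalesce(expd, payd);
```

With the sample rows above this should reproduce the first result table; deleting the 2020-10-06 row from `#payment` before the loop should reproduce the second. Payments with no expected date in range come out with an empty Expected column, which is the "full join" side of the requirement.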

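I currently achieve this by doing a full join on the match and then iterating over it recursively to clear out records that are duplicated on either side. Unfortunately, because the data in this case has millions of rows, it takes about 40 minutes to churn through.

I am wondering whether there is a better way, or a built-in join I have not come across, to achieve this idea of distinct matching.

Your problem sounds odd, so I am posting an odd solution that uses a subquery instead of a join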

Select T1.Dated ExpectationDate, (
    Select TOP 1 T2.Dated
    From PaymentTable T2
    Where T2.Dated>=T1.Dated
    Order By T2.Dated
) PaymentDate
From ExpectedTable T1

It is best to create an index on the date column of PaymentTable. This query took barely 80 seconds on 3.2 million records.
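
As a hedged sketch of the index suggestion, the statement below creates a nonclustered index on `PaymentTable(Dated)`; the index name `IX_PaymentTable_Dated` is an assumption. It is followed by a variant of the same subquery restricted to the +-3 day window from the question. Like the query above, the variant does not enforce one-to-one matching, so the same payment can be returned for several expected dates.

```sql
-- Hypothetical index name; PaymentTable and Dated come from the query above
create index IX_PaymentTable_Dated on PaymentTable(Dated);

-- Same correlated subquery, limited to payments within +-3 days of the expected date
select T1.Dated as ExpectationDate,
(
    select top (1) T2.Dated
    from PaymentTable as T2
    where T2.Dated >= dateadd(day, -3, T1.Dated)
      and T2.Dated <= dateadd(day, 3, T1.Dated)
    order by T2.Dated
) as PaymentDate
from ExpectedTable as T1;
```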

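..same logic as your current approach: iterate over the expected dates and, for each one, find any matching payment date that is later than the last payment date used. This is a contrived example (based on simplified requirements) and it may not fit the full specification (many complications are hidden behind the uniqueness of the dates... do not be fooled by how fast the example runs).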

/*
drop table if exists expected
go
drop table if exists payment
go
*/


--expected
create table expected(expd date primary key clustered)
go

declare @d date = isnull((select max(expd) from expected), '20100101')
begin try
    insert into expected(expd)
    select dateadd(day, abs(checksum(newid()))%5, @d)
end try
begin catch
    --ignore errors, .... just a couple sample rows
end catch
go 5000
----------------

--payment
create table payment(payd date primary key clustered)
go

declare @d date = isnull((select max(payd) from payment), '20100101')
begin try
    insert into payment(payd)
    select dateadd(day, abs(checksum(newid()))%10, @d)
end try
begin catch
    --ignore errors, .... just a couple sample rows
end catch
go 5000



--max recursion == number of rows in expected - 1 (1 is the anchor)
declare @maxrecursion int = (select count(*) from expected)-1 /* -1==anchor*/;

--maximum difference between consecutive days in expected ..
--this is not needed if there is sequential column (with no breaks) in expected which could be used for the recursion
--eg rowA has id=1, rowB has id=2, rowC has id=3....rowX has id=100...recursion = recursive id + 1

declare @maxexpdaysdiff int = (
    select max(daysdiff)
    from 
    ( 
        select datediff(day, expd, lead(expd, 1, expd) over(order by expd)) as daysdiff
        from expected
    ) as ed
);



--recursion, rows of expected
with cte
as
(
    --minimum expected date and any matching payment date
    select top (1) e.expd, lead(e.expd, 1) over(order by e.expd) as recexpd, pp.payd, pp.payd as lastusedpayd, 1 as rownum
    from expected as e
    outer apply(select top (1) p.payd from payment as p where p.payd >= e.expd and p.payd < dateadd(day, 3+1, e.expd) order by p.payd) as pp
    order by e.expd 
    union all
    --recursive part (next row of expected and any matching payment which was not assigned as lastusedpayd)
    select cte.recexpd, lexp.leadexpd, mp.payd, isnull(mp.payd, cte.lastusedpayd), cte.rownum+1
    from cte --cte carries next expected date::recexpd, this is not needed if there is another/sequential column to be used
    --next expected row/date for the recursion
    outer apply
    (
        --get the next expected date, after cte.recexpd to continue the recursion (like using lead() in recursion)
        select l.leadexpd
        from
        (
            --rows between cte.recexpd and cte.recexpd+@maxexpdaysdiff (there is always a row..except for the last expected date)
            select e.expd as leadexpd, row_number() over (order by e.expd) as ern
            from expected as e
            where e.expd > cte.recexpd
            and e.expd <= dateadd(day, @maxexpdaysdiff, cte.recexpd)
        ) as l 
        where l.ern = 1
    ) as lexp
    --matching payment, if any
    outer apply
    ( 
        select lmp.payd
        from
        (
            --payments between cte.recexpd and cte.recexpd + 3 days
            --..but exclude all payments before the payment which was lastly used:: cte.lastusedpayd
            select p.payd, row_number() over(order by p.payd) as prn
            from payment as p
            where p.payd >= cte.recexpd
            and p.payd < dateadd(day, 3+1, cte.recexpd)
            and (p.payd > cte.lastusedpayd or cte.lastusedpayd is null)
        
        ) as lmp
        where lmp.prn=1
    ) as mp
    where cte.rownum <=  @maxrecursion--safeguard recursion.    
)
select expd, payd --,*
from cte
option(maxrecursion 0);

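Edit: Another approach (educational) is to store the payment date that was assigned/selected for the previous row/expected date, and to use that value when selecting the payment date for each new expected date.

In short:

SELECT expecteddate --> SELECT paymentdate WHERE paymentdate > last_used_payment_date --> SET last_used_payment_date = SELECTed paymentdate (if any)

Although the logic is the same as in the recursive approach, it can be implemented without an explicit T-SQL recursive CTE, using just a plain select and a scalar function.

If you are familiar with (or have heard of) the "quirky update", then the following could be called a "quirky select".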

--the scalar function
create or alter function dbo.lastusedpaymentdate(@date date = null)
returns date
as
begin
    if @date is null
    begin
        return (cast(session_context(N'paymentdate') as date)); 
    end

    exec sp_set_session_context @key=N'paymentdate', @value=@date;
    return(@date);
end
go

--the query:
--reset session variable for each execution <-- this is not really needed for a single execution of the select in a session
exec sp_set_session_context @key=N'paymentdate', @value=null;
--..a select
select e.expd, p.payd, dbo.lastusedpaymentdate(p.payd) as lastusedpaymentdate
from expected as e with(index(1))
outer apply
(
    select top (1) pm.payd
    from payment as pm
    where pm.payd >= e.expd
    and pm.payd < dateadd(day, 3+1, e.expd)
    and pm.payd > isnull(dbo.lastusedpaymentdate(default), '19000101')
    order by pm.payd
) as p
order by e.expd;

  1. select e.expd from expected ⇛ SELECT expecteddate
  2. outer apply(..pm.payd > isnull(dbo.lastusedpaymentdate(default), '19000101')) ⇛ SELECT paymentdate WHERE paymentdate > last_used_payment_date
  3. select ..dbo.lastusedpaymentdate(p.payd).. ⇛ SET last_used_payment_date = SELECTed paymentdate (if any)

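The quirk is that the base/main table (i.e. the expected dates) must be traversed in the order dictated by the business logic (earliest date first, then every other date in sequence). This goes against SQL fundamentals (declare what, not how), and the order of execution cannot be assumed or taken for granted.

For the simple example, the query uses apply() for the selection and the main table has a clustered index, which presumably enforces the retrieval order... but the execution order still cannot be treated as "always guaranteed". A more defensive approach would be to throw every hint at the query... in an attempt to stop it from deviating from its "expected" behavior (as an exercise in futility):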

...............
from expected as e with(index(1), forcescan)
outer apply
(
..........
) as p
order by e.expd
option(force order, maxdop 1, use hint ('DISABLE_BATCH_MODE_ADAPTIVE_JOINS'));

If you have to rely on a volatile, constantly changing session context key, forcing the execution plan may be the best option.

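The main point of the approach, however, is the simple scalar function: when called with a non-NULL argument, it stores the argument value in the session key and returns it; when called with NULL, it returns the current value of that session key. For the retrieval of the payment date in the outer apply, the function is called with default/NULL (so the session key is not set), whereas in the select part the function is called with p.payd as its argument. p.payd derives from the outer apply: if p.payd has a value (== there is a matching payment date), the scalar function sets the session key; if there is no match (p.payd == NULL), the scalar function simply returns the value of the session key --> a simple synergy between T-SQL parts.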
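
The read/write behavior described above can be tried in isolation with a few standalone calls. This is a minimal sketch that assumes the function dbo.lastusedpaymentdate from the script above has already been created in the current database; the column aliases and the sample date are illustrative only.

```sql
-- Clear the session key first, as the query above does
exec sp_set_session_context @key = N'paymentdate', @value = null;

-- default/NULL argument: only reads the session key
select dbo.lastusedpaymentdate(default) as read_before_any_match;

-- non-NULL argument: stores the date in the session key and returns it
select dbo.lastusedpaymentdate('20201002') as stored_payment_date;

-- default/NULL argument again: reads whatever the previous call stored
select dbo.lastusedpaymentdate(default) as read_after_match;

-- explicit NULL, as happens when the outer apply finds no payment: reads, does not overwrite
select dbo.lastusedpaymentdate(null) as read_when_no_match;
```

Running the four selects in this order is a quick way to confirm, before trusting the full query, that a NULL p.payd leaves the stored date untouched.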

