Full Join using a fuzzy join but with distinct matching

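We have a table of expected payments and a table of payments made. We need to be able to match a payment to a single expected payment, but we allow a window of +-3 days for the payment to be made. On top of that, it should be a one-to-one match.

So imagine I have a table of expected payments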

2020-10-01
2020-10-04
2020-10-05
2020-10-20

and payments made

2020-10-02
2020-10-06
2020-10-07

The result I want is

Expected      Made
2020-10-01    2020-10-02
2020-10-04    2020-10-06
2020-10-05    2020-10-07
2020-10-20

If the payment on the 6th were removed, the result would be

Expected      Made
2020-10-01    2020-10-02
2020-10-04    2020-10-07
2020-10-05
2020-10-20

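So the match between the 5th and the payment on the 7th depends on whether that payment was already matched to the 4th. And the match between the 4th and the 7th depends on whether the 4th was matched to the 6th.

I currently achieve this by doing a full join on the matches and then iterating over it recursively to clear out records that are duplicated on either side. Unfortunately, since the data in this case runs to millions of rows, it takes about 40 minutes to churn through.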
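The fuzzy full join described above might look like the following. This is only a sketch of the problem, not the asker's actual query: the temp tables `#expected` and `#payment` and their columns `expd` and `payd` are assumed names, loaded with the sample dates from the question.

```sql
-- assumed names: #expected(expd) and #payment(payd), holding the sample dates
create table #expected (expd date primary key);
create table #payment (payd date primary key);

insert into #expected (expd) values ('20201001'), ('20201004'), ('20201005'), ('20201020');
insert into #payment (payd) values ('20201002'), ('20201006'), ('20201007');

-- fuzzy full join: every payment within +-3 days of an expected date is paired with it
select e.expd, p.payd
from #expected as e
full outer join #payment as p
    on p.payd between dateadd(day, -3, e.expd) and dateadd(day, 3, e.expd)
order by e.expd, p.payd;
```

On this sample the payment of 2020-10-02 pairs with three different expected dates (the 1st, the 4th and the 5th), which is why a clean-up pass is needed before the result is one-to-one.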

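I was wondering whether there is a better approach, or a built-in join I have not come across, that implements this concept of distinct matching.

Your problem sounded odd, so I am posting an odd solution that uses a subquery instead of a join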

Select T1.Dated ExpectationDate, (
    Select TOP 1 T2.Dated
    From PaymentTable T2
    Where T2.Dated>=T1.Dated
    Order By T2.Dated
) PaymentDate
From ExpectedTable T1

It is better to create an index on the date column of PaymentTable. This query takes barely 80 seconds on 3.2 million records.
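A sketch of such an index, assuming the `PaymentTable` and `Dated` names from the query above; the index name is made up for illustration.

```sql
-- hypothetical index name; an index on Dated should let the
-- TOP 1 ... ORDER BY Dated subquery seek to the first qualifying row
create index IX_PaymentTable_Dated on PaymentTable (Dated);
```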

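..same logic as your current approach: iterate over the expected dates and, for each one, find any matching payment date that is greater than the last used payment date. This is a contrived example (based on simplified requirements) and it may not fit the complete specification (many complications are hidden behind the uniqueness of the dates... do not be fooled by how fast the example executes).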

/*
drop table if exists expected
go
drop table if exists payment
go
*/


--expected
create table expected(expd date primary key clustered)
go

declare @d date = isnull((select max(expd) from expected), '20100101')
begin try
    insert into expected(expd)
    select dateadd(day, abs(checksum(newid()))%5, @d)
end try
begin catch
    --ignore errors, .... just a couple sample rows
end catch
go 5000
----------------

--payment
create table payment(payd date primary key clustered)
go

declare @d date = isnull((select max(payd) from payment), '20100101')
begin try
    insert into payment(payd)
    select dateadd(day, abs(checksum(newid()))%10, @d)
end try
begin catch
    --ignore errors, .... just a couple sample rows
end catch
go 5000



--max recursion == number of rows in expected - 1 (1 is the anchor)
declare @maxrecursion int = (select count(*) from expected)-1 /* -1==anchor*/;

--maximum difference between consecutive days in expected ..
--this is not needed if there is sequential column (with no breaks) in expected which could be used for the recursion
--eg rowA has id=1, rowB has id=2, rowC has id=3....rowX has id=100...recursion = recursive id + 1

declare @maxexpdaysdiff int = (
    select max(daysdiff)
    from 
    ( 
        select datediff(day, expd, lead(expd, 1, expd) over(order by expd)) as daysdiff
        from expected
    ) as ed
);



--recursion, rows of expected
with cte
as
(
    --minimum expected date and any matching payment date
    select top (1) e.expd, lead(e.expd, 1) over(order by e.expd) as recexpd, pp.payd, pp.payd as lastusedpayd, 1 as rownum
    from expected as e
    outer apply(select top (1) p.payd from payment as p where p.payd >= e.expd and p.payd < dateadd(day, 3+1, e.expd) order by p.payd) as pp
    order by e.expd 
    union all
    --recursive part (next row of expected and any matching payment which was not assigned as lastusedpayd)
    select cte.recexpd, lexp.leadexpd, mp.payd, isnull(mp.payd, cte.lastusedpayd), cte.rownum+1
    from cte --cte carries next expected date::recexpd, this is not needed if there is another/sequential column to be used
    --next expected row/date for the recursion
    outer apply
    (
        --get the next expected date, after cte.recexpd to continue the recursion (like using lead() in recursion)
        select l.leadexpd
        from
        (
            --rows between cte.recexpd and cte.recexpd+@maxexpdaysdiff (there is always a row..except for the last expected date)
            select e.expd as leadexpd, row_number() over (order by e.expd) as ern
            from expected as e
            where e.expd > cte.recexpd
            and e.expd <= dateadd(day, @maxexpdaysdiff, cte.recexpd)
        ) as l 
        where l.ern = 1
    ) as lexp
    --matching payment, if any
    outer apply
    ( 
        select lmp.payd
        from
        (
            --payments between cte.recexpd and cte.recexpd + 3 days
            --..but exclude all payments before the payment which was lastly used:: cte.lastusedpayd
            select p.payd, row_number() over(order by p.payd) as prn
            from payment as p
            where p.payd >= cte.recexpd
            and p.payd < dateadd(day, 3+1, cte.recexpd)
            and (p.payd > cte.lastusedpayd or cte.lastusedpayd is null)
        
        ) as lmp
        where lmp.prn=1
    ) as mp
    where cte.rownum <=  @maxrecursion--safeguard recursion.    
)
select expd, payd --,*
from cte
option(maxrecursion 0);
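
The walk that the recursive query performs can be easier to follow when written as an explicit loop. The following is a minimal sketch, assuming the `expected` and `payment` tables created above; the cursor, the variables and the `@result` table variable are illustrative names, not part of the original answer. A row-by-row cursor is unlikely to perform well on millions of rows, so treat it as a reference for the logic only. Like the query above, it only looks forward from each expected date (same day up to 3 days later).

```sql
-- illustrative only: same greedy walk as the recursive cte, one expected date at a time
declare @result table (expd date primary key, payd date null);
declare @expd date, @payd date, @lastused date = null;

declare c cursor local fast_forward for
    select expd from expected order by expd;

open c;
fetch next from c into @expd;

while @@fetch_status = 0
begin
    -- earliest payment inside the window that is later than the last used payment
    set @payd = (
        select min(p.payd)
        from payment as p
        where p.payd >= @expd
        and p.payd < dateadd(day, 3+1, @expd)
        and (@lastused is null or p.payd > @lastused)
    );

    insert into @result (expd, payd) values (@expd, @payd);

    -- remember the payment only when one was actually assigned
    if @payd is not null set @lastused = @payd;

    fetch next from c into @expd;
end

close c;
deallocate c;

select expd, payd
from @result
order by expd;
```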

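Edit: another approach (educational) is to store the payment date that was assigned/selected for the previous row/expected date and to use that value when selecting the payment date for each new expected date.

In short:

SELECT expecteddate --> SELECT paymentdate WHERE paymentdate > last_used_payment_date --> SET last_used_payment_date = SELECTed paymentdate (if any)

Although the logic is the same as in the recursive approach, it can be implemented without an explicit T-SQL recursive CTE, using just a plain select and a scalar function.

If you are familiar with (or have heard of) the "quirky update", then the following could be called a "quirky select".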

--the scalar function
create or alter function dbo.lastusedpaymentdate(@date date = null)
returns date
as
begin
    if @date is null
    begin
        return (cast(session_context(N'paymentdate') as date)); 
    end

    exec sp_set_session_context @key=N'paymentdate', @value=@date;
    return(@date);
end
go

--the query:
--reset session variable for each execution <-- this is not really needed for a single execution of the select in a session
exec sp_set_session_context @key=N'paymentdate', @value=null;
--..a select
select e.expd, p.payd, dbo.lastusedpaymentdate(p.payd) as lastusedpaymentdate
from expected as e with(index(1))
outer apply
(
    select top (1) pm.payd
    from payment as pm
    where pm.payd >= e.expd
    and pm.payd < dateadd(day, 3+1, e.expd)
    and pm.payd > isnull(dbo.lastusedpaymentdate(default), '19000101')
    order by pm.payd
) as p
order by e.expd;

  1. select e.expd from expected ⇛ SELECT expecteddate
  2. outer apply(..pm.payd > isnull(dbo.lastusedpaymentdate(default), '19000101')) ⇛ SELECT paymentdate WHERE paymentdate > last_used_payment_date
  3. select ..dbo.lastusedpaymentdate(p.payd).. ⇛ SET last_used_payment_date = SELECTed paymentdate (if any)

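The quirk is that the base/main table (i.e. the expected dates) must be traversed in order, according to the business logic (earliest date first, all other dates after it). This goes against SQL fundamentals (declare what, not how), and the order of execution cannot be "assumed/taken for granted".

For the simple example, the query uses apply() for the selection and the main table has a clustered index, which presumably enforces the order of retrieval... but the order of execution still cannot be treated as "always guaranteed". A more defensive approach would be to throw everything at the query... in an attempt to stop it from deviating from its "expected" behaviour (as an exercise in futility):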

...............
from expected as e with(index(1), forcescan)
outer apply
(
..........
) as p
order by e.expd
option(force order, maxdop 1, use hint ('DISABLE_BATCH_MODE_ADAPTIVE_JOINS'));

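If a volatile, constantly changing session context key has to be relied upon, forcing the execution plan is probably the best option.

However, the point of the approach is the simple scalar function: when called with a non-NULL argument it stores the argument value in the session key and returns it, and when called with NULL it returns the current value of that session key. To retrieve the payment date inside the outer apply, the function is called with default/NULL (the session key is not set), while in the select list the function is called with p.payd as its argument. p.payd derives from the outer apply: if p.payd has a value (== there is a matching payment date) the scalar function sets the session key, and if there is no match (p.payd == NULL) the scalar function simply returns the value of the session key --> synergy between simple T-SQL parts.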
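
The set/get behaviour described above can be checked by calling the function on its own. A small sketch, assuming `dbo.lastusedpaymentdate` has been created as shown earlier; the literal date is arbitrary and the column aliases are illustrative.

```sql
-- clear the session key first
exec sp_set_session_context @key = N'paymentdate', @value = null;

-- called with default (NULL): only reads the key
select dbo.lastusedpaymentdate(default) as before_set;

-- called with a date: stores it in the key and returns it
select dbo.lastusedpaymentdate(cast('20201002' as date)) as stored;

-- called with default again: reads back whatever the previous call stored
select dbo.lastusedpaymentdate(default) as after_set;
```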

