
Full Join using a fuzzy join but with distinct matching

We have a table of expected payments and a table of payments made. We need to match each payment made with a single expected payment, allowing a ±3-day window for the payment to be made. On top of that, the match must be one-to-one.

So imagine I have a table of expected payments:

2020-10-01
2020-10-04
2020-10-05
2020-10-20

and payments:

2020-10-02
2020-10-06
2020-10-07

The result I want is:

Expected      Made
2020-10-01    2020-10-02
2020-10-04    2020-10-06
2020-10-05    2020-10-07
2020-10-20

and if the payment of the 6th is removed, the result would be:

Expected      Made
2020-10-01    2020-10-02
2020-10-04    2020-10-07
2020-10-05
2020-10-20

So whether the 5th matches the payment on the 7th depends on whether that payment was already matched with the 4th. Likewise, whether the 4th matches the 7th depends on whether the 4th was matched with the 6th.

I've currently achieved this by doing a full join on the fuzzy match and then recursively iterating over the result to clean out repeated records from both sides. Unfortunately, as the data in this case runs to hundreds of millions of rows, it takes about 40 minutes to churn through.

I'm wondering if there is a better way, or a built-in join that I've not come across, to achieve this concept of distinct matching.

Your question sounds strange and I am posting a strange solution: use a subquery instead of a join.

Select T1.Dated ExpectationDate, (
    Select TOP 1 T2.Dated
    From PaymentTable T2
    Where T2.Dated>=T1.Dated
    Order By T2.Dated
) PaymentDate
From ExpectedTable T1

It's better to create an index on the date column of PaymentTable. This query took barely 80 seconds on 3.2M records.
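To see what this correlated subquery returns on the sample dates from the question, here is a small Python emulation. It is only a sketch: `first_payment_on_or_after` is a hypothetical helper that mimics the `TOP 1 ... WHERE T2.Dated>=T1.Dated ORDER BY T2.Dated` lookup, not part of the answer. As written, the query has no upper bound on the payment date and does not track which payments were already used, so on this data the payment of the 6th comes back for both the 4th and the 5th.

```python
from bisect import bisect_left
from datetime import date

# Sample data from the question.
expected = [date(2020, 10, 1), date(2020, 10, 4), date(2020, 10, 5), date(2020, 10, 20)]
payments = sorted([date(2020, 10, 2), date(2020, 10, 6), date(2020, 10, 7)])


def first_payment_on_or_after(day, sorted_payments):
    """Mimic: SELECT TOP 1 Dated FROM PaymentTable WHERE Dated >= day ORDER BY Dated."""
    i = bisect_left(sorted_payments, day)
    return sorted_payments[i] if i < len(sorted_payments) else None


# One independent lookup per expected date, exactly like the subquery.
rows = [(e, first_payment_on_or_after(e, payments)) for e in expected]

for e, p in rows:
    print(e, p)
```

Adding the 3-day upper bound is a one-line change to the lookup, but the one-to-one requirement needs state carried from one expected date to the next.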

This uses the same logic as your current approach: iterate through the expected dates and, for each one, find any matching payment date that is greater than the last used payment date. It is a contrived example (based on the simplified requirement) and it might not fit the complete specs well (lots of intricacies are hidden underneath the uniqueness of the dates; do not get tricked just because the example executes fast).

/*
drop table if exists expected
go
drop table if exists payment
go
*/


--expected
create table expected(expd date primary key clustered)
go

declare @d date = isnull((select max(expd) from expected), '20100101')
begin try
    insert into expected(expd)
    select dateadd(day, abs(checksum(newid()))%5, @d)
end try
begin catch
    --ignore errors, .... just a couple sample rows
end catch
go 5000
----------------

--payment
create table payment(payd date primary key clustered)
go

declare @d date = isnull((select max(payd) from payment), '20100101')
begin try
    insert into payment(payd)
    select dateadd(day, abs(checksum(newid()))%10, @d)
end try
begin catch
    --ignore errors, .... just a couple sample rows
end catch
go 5000



--max recursion == number of rows in expected - 1 (1 is the anchor)
declare @maxrecursion int = (select count(*) from expected)-1 /* -1==anchor*/;

--maximum difference between consecutive days in expected ..
--this is not needed if there is sequential column (with no breaks) in expected which could be used for the recursion
--eg rowA has id=1, rowB has id=2, rowC has id=3....rowX has id=100...recursion = recursive id + 1

declare @maxexpdaysdiff int = (
    select max(daysdiff)
    from 
    ( 
        select datediff(day, expd, lead(expd, 1, expd) over(order by expd)) as daysdiff
        from expected
    ) as ed
);



--recursion, rows of expected
with cte
as
(
    --minimum expected date and any matching payment date
    select top (1) e.expd, lead(e.expd, 1) over(order by e.expd) as recexpd, pp.payd, pp.payd as lastusedpayd, 1 as rownum
    from expected as e
    outer apply(select top (1) p.payd from payment as p where p.payd >= e.expd and p.payd < dateadd(day, 3+1, e.expd) order by p.payd) as pp
    order by e.expd 
    union all
    --recursive part (next row of expected and any matching payment which was not assigned as lastusedpayd)
    select cte.recexpd, lexp.leadexpd, mp.payd, isnull(mp.payd, cte.lastusedpayd), cte.rownum+1
    from cte --cte carries next expected date::recexpd, this is not needed if there is another/sequential column to be used
    --next expected row/date for the recursion
    outer apply
    (
        --get the next expected date, after cte.recexpd to continue the recursion (like using lead() in recursion)
        select l.leadexpd
        from
        (
            --rows between cte.recexpd and cte.recexpd+@maxexpdaysdiff (there is always a row..except for the last expected date)
            select e.expd as leadexpd, row_number() over (order by e.expd) as ern
            from expected as e
            where e.expd > cte.recexpd
            and e.expd <= dateadd(day, @maxexpdaysdiff, cte.recexpd)
        ) as l 
        where l.ern = 1
    ) as lexp
    --matching payment, if any
    outer apply
    ( 
        select lmp.payd
        from
        (
            --payments between cte.recexpd and cte.recexpd + 3 days
            --..but exclude all payments before the payment which was lastly used:: cte.lastusedpayd
            select p.payd, row_number() over(order by p.payd) as prn
            from payment as p
            where p.payd >= cte.recexpd
            and p.payd < dateadd(day, 3+1, cte.recexpd)
            and (p.payd > cte.lastusedpayd or cte.lastusedpayd is null)
        
        ) as lmp
        where lmp.prn=1
    ) as mp
    where cte.rownum <=  @maxrecursion--safeguard recursion.    
)
select expd, payd --,*
from cte
option(maxrecursion 0);
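The logic of the recursive query above can be sketched procedurally in Python. This is an illustration under stated assumptions, not a translation of the T-SQL: `match_payments` and `window_days` are hypothetical names, the window is taken as [expected, expected + 3 days] as in the `outer apply` filters, and dates are assumed unique, as the primary keys enforce.

```python
from datetime import date, timedelta


def match_payments(expected, payments, window_days=3):
    """Pair each expected date with the earliest unused payment in
    [expected, expected + window_days]; each payment is used at most once."""
    payments = sorted(payments)
    result = []
    i = 0  # index of the first payment that is neither used nor skipped
    for exp in sorted(expected):
        # Payments before this expected date cannot match any later one either.
        while i < len(payments) and payments[i] < exp:
            i += 1
        if i < len(payments) and payments[i] <= exp + timedelta(days=window_days):
            result.append((exp, payments[i]))
            i += 1  # consume the payment so it cannot be matched again
        else:
            result.append((exp, None))
    return result


expected = [date(2020, 10, 1), date(2020, 10, 4), date(2020, 10, 5), date(2020, 10, 20)]
payments = [date(2020, 10, 2), date(2020, 10, 6), date(2020, 10, 7)]

for exp, pay in match_payments(expected, payments):
    print(exp, pay)
```

On the sample data this reproduces both result tables from the question, with and without the payment of the 6th. The single forward pointer plays the role of `lastusedpayd` in the CTE.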

EDIT: Another approach (educational) would be to store the payment date that was assigned/selected for the previous expected date and use that value when selecting the payment date for each new expected date.

In short:

SELECT expecteddate --> SELECT paymentdate WHERE paymentdate > last_used_payment_date --> SET last_used_payment_date = SELECTed paymentdate (if any)

Although the logic is identical to the recursive approach, it can be implemented without an explicit T-SQL recursive CTE, using just a normal select and a scalar function.

If you are familiar with (or have ever heard of) the "quirky update", then the following could potentially be a "quirky select".

--the scalar function
create or alter function dbo.lastusedpaymentdate(@date date = null)
returns date
as
begin
    if @date is null
    begin
        return (cast(session_context(N'paymentdate') as date)); 
    end

    exec sp_set_session_context @key=N'paymentdate', @value=@date;
    return(@date);
end
go

--the query:
--reset session variable for each execution <-- this is not really needed for a single execution of the select in a session
exec sp_set_session_context @key=N'paymentdate', @value=null;
--..a select
select e.expd, p.payd, dbo.lastusedpaymentdate(p.payd) as lastusedpaymentdate
from expected as e with(index(1))
outer apply
(
    select top (1) pm.payd
    from payment as pm
    where pm.payd >= e.expd
    and pm.payd < dateadd(day, 3+1, e.expd)
    and pm.payd > isnull(dbo.lastusedpaymentdate(default), '19000101')
    order by pm.payd
) as p
order by e.expd;

  1. select e.expd from expected ⇛ SELECT expecteddate
  2. outer apply(..pm.payd > isnull(dbo.lastusedpaymentdate(default), '19000101')) ⇛ SELECT paymentdate WHERE paymentdate > last_used_payment_date
  3. select ..dbo.lastusedpaymentdate(p.payd).. ⇛ SET last_used_payment_date = SELECTed paymentdate (if any)

The quirkiness stems from the fact that the base/main table (i.e. expected dates) must be traversed in order, according to the business logic (earliest date first, then all other dates in order). This goes against SQL fundamentals (declare what, not how), and the order of execution cannot be assumed or taken for granted.

In the simple example, the query selects with an apply(), and the main table has a clustered index to presumably enforce order of retrieval, but the order of execution still cannot be taken as 'always guaranteed'. A more defensive approach would be to throw everything at the query in an attempt to stop it from deviating from its "expected" behavior (as an exercise in futility):

...............
from expected as e with(index(1), forcescan)
outer apply
(
..........
) as p
order by e.expd
option(force order, maxdop 1, use hint ('DISABLE_BATCH_MODE_ADAPTIVE_JOINS'));

Forcing an execution plan might be the best option if one had to exploit a volatile, ever-changing session context key.

Nevertheless, the focal point of the approach is the simple scalar function. When called with a non-null parameter, it stores its parameter value in a session key and returns it; when called with a NULL parameter, it returns the current value of that session key. For the retrieval of the payment date in the outer apply, the function is called with default/NULL (session key is not set), while in the select part, the function is called with p.payd as parameter. p.payd derives from the outer apply: if p.payd has a value (i.e. there is a matching payment date), the scalar function sets the session key; if there is no match (p.payd is NULL), the scalar function simply returns the value of the session key --> synergy among simplistic T-SQL parts.
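The get-or-set behavior of the scalar function can be emulated in Python to make the data flow visible. This is a sketch only: the `session_context` dict stands in for SQL Server's session context, and `last_used_payment_date` and `quirky_select` are hypothetical names that mirror the function and the query. The explicit loop over sorted dates makes visible the traversal order that the SQL version can only hope for.

```python
from datetime import date, timedelta

# Stand-in for the session key N'paymentdate'.
session_context = {"paymentdate": None}


def last_used_payment_date(value=None):
    """With a value: store it in the session key and return it.
    With None: return whatever the session key currently holds."""
    if value is None:
        return session_context["paymentdate"]
    session_context["paymentdate"] = value
    return value


def quirky_select(expected, payments, window_days=3):
    session_context["paymentdate"] = None  # reset before each execution
    rows = []
    for exp in sorted(expected):  # earliest expected date first
        floor = last_used_payment_date()  # read the key (the "default" call)
        if floor is None:
            floor = date(1900, 1, 1)
        candidates = [
            p for p in payments
            if exp <= p <= exp + timedelta(days=window_days) and p > floor
        ]
        pay = min(candidates) if candidates else None
        # Writes the key when pay is set, otherwise just reads it back.
        rows.append((exp, pay, last_used_payment_date(pay)))
    return rows


expected = [date(2020, 10, 1), date(2020, 10, 4), date(2020, 10, 5), date(2020, 10, 20)]
payments = [date(2020, 10, 2), date(2020, 10, 6), date(2020, 10, 7)]
rows = quirky_select(expected, payments)

for row in rows:
    print(*row)
```

For the last expected date there is no matching payment, so the third column simply repeats the previously stored value (the 7th), which is the read-only branch of the function.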
