
Matching records based on multiple criteria (from large dataset)

I have a list of bank accounts in my database. We want to try to group these accounts based on household. We match on three criteria:

  • SSN
  • Customer number (this is an arbitrary number from the bank)
  • "Address string" (it's basically the street address plus zip)

If any one of these three things matches between two accounts, the two accounts should be put in the same group.

This can't be done with SQL joins, as far as I understand. I'm also at a loss for how to do it programmatically. We have millions of accounts in our database and the number grows by roughly 150K each month, so it isn't practical to go through every single record and say, "Okay, do a SELECT * WHERE ssn = (this account's SSN)", because it would take forever.

I know this is kind of a vague and open-ended question, but are there any suggestions on how to proceed? I don't care what language(s) you use in your answer, if you use any.

In my honest opinion, your best bet is to implement a one-to-many or many-to-many relationship for household-to-account.

I can think of two ways of doing something like this. The first (and probably not the best solution) is to add a column to the account table to store the household. Personally, I would stay away from this if at all possible.

The second is to create a "household" table to store the household PK, and a household cross-reference table to store the household FK and the account FK.
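As a rough illustration of that second layout, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and indexes are all assumptions made up for the example (the original post does not give a schema), and a production database would need its own types and constraints.

```python
import sqlite3

# Hypothetical schema: every table and column name here is illustrative only.
SCHEMA = """
CREATE TABLE account (
    account_id      INTEGER PRIMARY KEY,
    ssn             TEXT,
    customer_number TEXT,
    address_string  TEXT
);
CREATE TABLE household (
    household_id INTEGER PRIMARY KEY
);
-- Cross-reference table: one row per (household, account) link.
CREATE TABLE household_account (
    household_id INTEGER NOT NULL REFERENCES household(household_id),
    account_id   INTEGER NOT NULL REFERENCES account(account_id),
    PRIMARY KEY (household_id, account_id)
);
-- Indexes on the three match criteria so lookups avoid full table scans.
CREATE INDEX idx_account_ssn      ON account(ssn);
CREATE INDEX idx_account_customer ON account(customer_number);
CREATE INDEX idx_account_address  ON account(address_string);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the account, household, and cross-reference tables."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_schema(conn)
conn.execute("INSERT INTO account VALUES (1, '111-11-1111', 'C1', '1 MAIN ST 12345')")
conn.execute("INSERT INTO household (household_id) VALUES (10)")
conn.execute("INSERT INTO household_account VALUES (10, 1)")
```

Keeping the link in its own table means an account can be moved to another household by changing one row, without touching the account record itself.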

Then I would create a process with whatever programming language you're using (hopefully it's object oriented, so you can create an "object" that you can use for the next part and in the future as well).

Once the database is set up, I would build a "method" that accepts an account, does a comparison by SSN, customer number, and address, and returns a list of similar account IDs (this could be very useful and might make your initial process go quicker) and/or a list of households that similar accounts might belong to.
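A minimal sketch of such a method, again in Python with sqlite3. The function name `find_similar_accounts` and the `account` table layout are assumptions for illustration. It returns only direct matches on the three criteria, not accounts linked through a chain of other accounts.

```python
import sqlite3


def find_similar_accounts(conn, account_id):
    """Return ids of other accounts sharing SSN, customer number, or address string."""
    row = conn.execute(
        "SELECT ssn, customer_number, address_string FROM account WHERE account_id = ?",
        (account_id,),
    ).fetchone()
    if row is None:
        return []
    ssn, customer_number, address_string = row
    # In SQL, NULL = NULL is never true, so missing values do not match each other.
    rows = conn.execute(
        """
        SELECT account_id FROM account
        WHERE account_id <> ?
          AND (ssn = ? OR customer_number = ? OR address_string = ?)
        ORDER BY account_id
        """,
        (account_id, ssn, customer_number, address_string),
    ).fetchall()
    return [r[0] for r in rows]


# Hypothetical sample data for the sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (account_id INTEGER PRIMARY KEY, ssn TEXT, "
    "customer_number TEXT, address_string TEXT)"
)
conn.executemany(
    "INSERT INTO account VALUES (?, ?, ?, ?)",
    [
        (1, "111", "C1", "1 MAIN ST 12345"),
        (2, "111", "C2", "9 OAK AVE 54321"),  # same SSN as account 1
        (3, "333", "C3", "1 MAIN ST 12345"),  # same address string as account 1
        (4, "444", "C4", "7 ELM RD 99999"),   # shares nothing with the others
    ],
)
print(find_similar_accounts(conn, 1))  # [2, 3]
```

With an index on each of the three columns, each call is a handful of index lookups rather than a scan of the whole table, which matters at millions of rows.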

THIS is the part that would worry me: there may be situations where accounts linked by address belong to a different household than accounts linked by customer number. For example, a "child" (one customer number) whose parents have separated has an account set up by each parent (two accounts with most likely different addresses), as well as their OWN accounts, and so on. I would personally come up with some sort of business logic to limit the returned household to only 1 household.

At this point, having the list of similar accounts and a single household that at least one of the similar accounts is a part of, you can update those specific accounts with that household ID.

I would then set up logic to loop through every account in the table and run it through the process. Yes, this is going to be expensive, but you should only have to do it once.

After that, as accounts are entered, set up a process for the entry to automatically find and place accounts into households.

Depending on your front end, this may or may not be a simple process.
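One possible shape for that entry-time step, sketched in Python with in-memory dictionaries standing in for indexed database lookups. The class name `HouseholdIndex` and the "lowest household id wins" rule are placeholders invented for the example; the real tie-breaking rule is the business logic mentioned above.

```python
class HouseholdIndex:
    """Hypothetical in-memory index from criteria values to household ids."""

    def __init__(self):
        self._by_value = {}     # (field, value) -> household_id
        self._next_id = 1
        self.household_of = {}  # account_id -> household_id

    @staticmethod
    def _keys(ssn, customer_number, address):
        pairs = (("ssn", ssn), ("customer", customer_number), ("address", address))
        # Skip missing or empty values so blanks never link accounts together.
        return [pair for pair in pairs if pair[1]]

    def add_account(self, account_id, ssn, customer_number, address):
        """Place a new account; return (household_id, candidate_household_ids)."""
        keys = self._keys(ssn, customer_number, address)
        candidates = sorted({self._by_value[k] for k in keys if k in self._by_value})
        if candidates:
            household_id = candidates[0]  # placeholder rule: lowest id wins
        else:
            household_id = self._next_id
            self._next_id += 1
        self.household_of[account_id] = household_id
        for key in keys:
            self._by_value.setdefault(key, household_id)
        return household_id, candidates


index = HouseholdIndex()
index.add_account(1, "111", "C1", "1 MAIN ST 12345")
index.add_account(2, "222", "C2", "9 OAK AVE 54321")
# Account 3 shares an SSN with account 1 and an address with account 2.
print(index.add_account(3, "111", "C3", "9 OAK AVE 54321"))  # (1, [1, 2])
```

When more than one candidate household comes back, as for account 3 here, the account could be flagged for manual review instead of being placed silently.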

Regardless, I would also develop a process/user interface that would allow the user (preferably a customer service rep) to remove/move accounts between households.

This is a start, just bouncing ideas off you.

Well, I don't see any way around having each record inspect every other record to see if it's in the same household. The only efficiency I see is that you can skip the inspection if the record already is in a household. In pseudocode:

set household = null on all records
currentHousehold = 1

foreach record a
    // skip records that already belong to a household
    if a.household is null
        a.household = currentHousehold
        foreach record b
            if b.household is null
                if a and b meet criteria 1 (same SSN)
                   or a and b meet criteria 2 (same customer number)
                   or a and b meet criteria 3 (same address string)
                    b.household = currentHousehold
                end if
            end if
        next b

        currentHousehold++
    end if
next a

The assumption is that you add a household column to the table, which you can group on. I indicate that current household values should be cleared; this is in case some of the data changes.

If you can intercept any possible change to your criteria fields, then you can find that record's new household then and there. In that case, household values can stay put and the script would only have to find households for new records (or just do that when the record is added, if you can). If you have that kind of control, then you should only need to place each record in a household once: in one pass for existing records, and afterwards when a record is added or its criteria fields are modified.
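For the one-off pass over existing records, one alternative to comparing every record with every other record is a union-find (disjoint-set) structure. This is a different technique from the nested loops in this answer, offered only as a sketch. It reads each account once, looks up each criteria value in a dictionary, and also links accounts that are connected only through a chain (A shares an SSN with B, and B shares an address with C). The function name and the tuple layout are assumptions for the example.

```python
from collections import defaultdict


def group_households(accounts):
    """Group accounts that share any criteria value, directly or through a chain.

    accounts: iterable of (account_id, ssn, customer_number, address_string).
    Returns a sorted list of sorted account-id lists, one list per household.
    """
    parent = {}

    def find(x):
        # Walk up to the root, shortening the path as we go (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_seen = {}  # (field, value) -> first account id seen with that value
    for account_id, ssn, customer_number, address in accounts:
        parent.setdefault(account_id, account_id)
        for key in (("ssn", ssn), ("customer", customer_number), ("address", address)):
            if not key[1]:
                continue  # missing values never link accounts
            if key in first_seen:
                union(first_seen[key], account_id)
            else:
                first_seen[key] = account_id

    households = defaultdict(list)
    for account_id in parent:
        households[find(account_id)].append(account_id)
    return sorted(sorted(members) for members in households.values())


# Hypothetical sample data for the sketch.
sample = [
    (1, "111", "C1", "A"),
    (2, "111", "C2", "B"),  # same SSN as 1
    (3, "333", "C2", "C"),  # same customer number as 2, so linked to 1 as well
    (4, "444", "C4", "D"),
    (5, "555", "C5", "D"),  # same address string as 4
]
print(group_households(sample))  # [[1, 2, 3], [4, 5]]
```

Because each account triggers at most three dictionary lookups and three union operations, the work grows roughly in proportion to the number of accounts rather than its square. Whether chained links should really form a single household is still a business decision.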
