
Creating multiple indexes for table join to accommodate fuzzy matching

I'm trying to match user-provided postal address data to an address reference dataset. I want to index both datasets and join on the indexed field. In a perfect world, the join would use a key consisting of the full address (e.g., WHERE REF_ADDR = INPUT_ADDR would compare 100 W Main St, Springfield, OH 45502 = 100 W Main St, Springfield, OH 45502). Of course, addresses are rarely perfect, so I have a script that accommodates differences using fuzzy logic. However, because this script is very slow, I want to reduce the number of candidates from the reference dataset that the matching process is attempted against. To find all potential candidates, I intend to create an indexed key, derived from individual address components, to be used for joining. The problem is that one key alone will not capture all the possible candidates. I would likely need to create multiple indexed keys in order to capture them all.

For example, an indexed key of the form 100 WMNST 455 for the address 100 W Main St, Springfield, OH 45502 will work most of the time, but any number of address errors will not be caught by such a key. To accommodate all the potential errors that the matching process recognizes, I would likely need to implement at least several indexed keys for joining.
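To make the key format concrete, here is a minimal sketch of how a key like 100 WMNST 455 might be derived. The rule used (house number, then directional, consonant skeleton of the street name and suffix abbreviation, then the first three ZIP digits) is only inferred from the single example above; the function names and the small lookup tables are hypothetical.

```python
import re

# Hypothetical lookup tables; a real system would need far fuller lists.
DIRECTIONALS = {"N", "S", "E", "W", "NE", "NW", "SE", "SW"}
SUFFIXES = {"STREET": "ST", "ST": "ST", "AVENUE": "AVE", "AVE": "AVE",
            "ROAD": "RD", "RD": "RD"}


def consonant_skeleton(word):
    """Keep the first letter and drop later vowels: MAIN -> MN."""
    word = word.upper()
    return word[:1] + re.sub(r"[AEIOU]", "", word[1:])


def primary_key(street_line, zip_code):
    """Build a key of the form '<number> <dir><name><suffix> <zip3>'."""
    tokens = re.findall(r"[A-Z0-9]+", street_line.upper())
    number = tokens[0] if tokens and tokens[0].isdigit() else ""
    rest = tokens[1:] if number else tokens
    # Peel off a leading directional and a trailing suffix when present.
    direction = rest.pop(0) if rest and rest[0] in DIRECTIONALS else ""
    suffix = SUFFIXES[rest.pop()] if rest and rest[-1] in SUFFIXES else ""
    name = "".join(consonant_skeleton(t) for t in rest)
    return f"{number} {direction}{name}{suffix} {zip_code[:3]}"


print(primary_key("100 W Main St", "45502"))  # reference form of the address
print(primary_key("100 W MAIN", "45502"))     # same address, suffix missing
```

The second call yields a different key from the first because the suffix is missing, which is exactly the weakness of relying on a single key.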

I'm wondering if anyone has recommendations for handling this issue. The reference dataset consists of 40M records, and the user-provided address data is typically around 10,000 records. Would it be more effective to simply use LIKE and OR queries on the address fields, as opposed to the method I'm proposing? It is not unusual to encounter the following variations within the latter dataset (all of which the script accommodates):

Address: 100 W MAIN
City: 
Zip: 45502

Address: 100 MAIN ST
City: SPNGFLD
Zip:

Address: 100 W MAIN STREET
City: SPRINGFIELD
Zip: 54502

Address: 100 MAIN
City: NORTHRIDGE
Zip: 45502
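The multi-key idea can be sketched against the four variations above. This is an illustration under stated assumptions, not a tested design for 40M rows: the three key definitions (first three ZIP digits, sorted ZIP digits to survive a transposition, first two letters of the city), the table and column names, and the tiny DROP list are all hypothetical, and SQLite stands in for whatever database is actually in use.

```python
import re
import sqlite3

# Tokens ignored when building the street-name part (hypothetical, incomplete).
DROP = {"N", "S", "E", "W", "ST", "STREET", "AVE", "AVENUE", "RD", "ROAD"}


def skeleton(word):
    """Keep the first letter and drop later vowels: MAIN -> MN."""
    return word[:1] + re.sub(r"[AEIOU]", "", word[1:])


def blocking_keys(address, city, zip_code):
    """Return three keys; a key is None when a component it needs is missing."""
    tokens = re.findall(r"[A-Z0-9]+", address.upper())
    number = tokens[0] if tokens and tokens[0].isdigit() else ""
    rest = tokens[1:] if number else tokens
    name = "".join(skeleton(t) for t in rest if t not in DROP)
    base = f"{number} {name}"
    city = city.strip().upper()
    zip_code = zip_code.strip()
    return (
        f"{base} {zip_code[:3]}" if zip_code else None,              # ZIP prefix
        f"{base} {''.join(sorted(zip_code))}" if zip_code else None,  # transposed ZIP
        f"{base} {city[:2]}" if city else None,                       # city prefix
    )


conn = sqlite3.connect(":memory:")
for table in ("ref", "inp"):
    conn.execute(
        f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, address TEXT, city TEXT,"
        " zip TEXT, k_zip3 TEXT, k_zipsort TEXT, k_city TEXT)"
    )
# One index per key column on the large reference table.
for col in ("k_zip3", "k_zipsort", "k_city"):
    conn.execute(f"CREATE INDEX ref_{col} ON ref ({col})")


def load(table, rows):
    for row in rows:
        conn.execute(
            f"INSERT INTO {table} (address, city, zip, k_zip3, k_zipsort, k_city)"
            " VALUES (?, ?, ?, ?, ?, ?)",
            (*row, *blocking_keys(*row)),
        )


load("ref", [
    ("100 W Main St", "Springfield", "45502"),
    ("200 E High St", "Springfield", "45505"),
])
load("inp", [
    ("100 W MAIN", "", "45502"),
    ("100 MAIN ST", "SPNGFLD", ""),
    ("100 W MAIN STREET", "SPRINGFIELD", "54502"),
    ("100 MAIN", "NORTHRIDGE", "45502"),
])

# Each branch is a plain equality join on one key; UNION removes duplicate pairs.
CANDIDATES_SQL = """
SELECT i.id, r.id FROM inp i JOIN ref r ON r.k_zip3 = i.k_zip3
UNION
SELECT i.id, r.id FROM inp i JOIN ref r ON r.k_zipsort = i.k_zipsort
UNION
SELECT i.id, r.id FROM inp i JOIN ref r ON r.k_city = i.k_city
ORDER BY 1, 2
"""
candidates = conn.execute(CANDIDATES_SQL).fetchall()
print(candidates)
```

In this toy data, every input variation is paired with the 100 W Main St reference row through at least one key, and none is paired with the High St row. Missing components produce NULL keys, which never satisfy an equality join, so an empty ZIP or city cannot create spurious candidates. The slow fuzzy script would then run only on the returned pairs.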

Depending on which DB system you are using, you should first check whether any built-in functionality can be used. For example, if you are working on SQL Server, the options I can think of are "Change Data Capture", "Full-Text Search", "Filtered Index", etc. But if you want to develop your own solution that can be implemented on any DB system, then this might interest you.

You have asked for indexing suggestions, but to me that is not the right question: you will be left with very few options as the data in the table grows and/or your search criteria become more complex. If the schema design itself is not scalable, you will not be able to make further performance improvements later in extreme data cases.

I created a design to implement a so-called "Google-like search" in our project, where matching text suggestions come up as the user starts typing. The user can also control, through a setting, which type of search is performed.

By that I mean "Exact Match", "Similar Match", "Starts With A", "Ends With A", or "Contains A".

In your case, an address is the kind of data where an exact match rarely happens. So I guess you can skip that, but if you want to implement it, it can be done with some changes. You can customize the design depending on the sophistication and complexity you want to handle. Here's the concept.

We will need 5 tables.

[Image: search expression tables, description]

Now the question is: how does this schema help or improve your fuzzy search?

Notice that each table has ONLY 2 columns, of INTEGER and/or STRING type. We can have a clustered index on each table that includes both columns.

Because we have separated the data by accuracy, you can give the user the option to choose how accurate the data they want to access should be. This will reduce the search load and also let you batch your search operations.
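The tiering idea can be illustrated roughly as follows. The actual table layout is not reproduced in the text here, so this sketch does not attempt to rebuild it: it runs the match types against a single hypothetical table in SQLite, from strictest to loosest, and stops at the first tier that returns rows. The table, the sample values, and the function name are assumptions, and "Similar Match" is left out because the text does not say how it is computed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE street (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO street (name) VALUES (?)",
    [("MAIN",), ("MAINE",), ("W MAIN",), ("GERMAINE",)],
)

# Tiers ordered from strictest to loosest; each turns a term into a LIKE pattern.
# Note: a term containing % or _ would be treated as a wildcard here.
TIERS = [
    ("exact", lambda t: t),
    ("starts_with", lambda t: t + "%"),
    ("ends_with", lambda t: "%" + t),
    ("contains", lambda t: "%" + t + "%"),
]


def tiered_search(term, max_tier="contains"):
    """Return (tier, names) for the first tier that yields rows, up to max_tier."""
    for tier, pattern in TIERS:
        rows = conn.execute(
            "SELECT name FROM street WHERE name LIKE ? ORDER BY name",
            (pattern(term),),
        ).fetchall()
        if rows:
            return tier, [r[0] for r in rows]
        if tier == max_tier:
            break  # the caller did not ask for looser tiers
    return None, []


print(tiered_search("MAIN"))
print(tiered_search("MAI"))
print(tiered_search("RMA", max_tier="starts_with"))
```

The max_tier argument plays the role of the user-controlled accuracy setting: looser and typically more expensive tiers are only queried when the caller allows them and the stricter ones found nothing.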

If this is something you want to go for, then let me know. Creating the dummy data and coming up with performance numbers is not a big deal. I can help with coming up with a final design that may work for you.

[Image: search expression tables, example]
