T-SQL: matching words when inserting data into a table

I have a web application through which a user can upload an Excel file containing a list of destinations. When the file is uploaded, I read the rows and insert them into a SQL Server database.
On the SQL Server side, I have to match each uploaded destination against a list of destinations stored in a table. Because the list in the database is the reference, the matching must be accurate.

Here is an example of a destination from the database and a destination uploaded by the user, which have to be matched to each other:

  • From database: United Kingdom - Mobile - O2
  • Uploaded by user: United Kingdom - O2 Mobile

What is the best way to make the matching more accurate?

I don't think this problem can be solved using T-SQL alone. Unfortunately, T-SQL has no good built-in algorithms for fuzzy matching. SOUNDEX is not very relevant to this problem, and neither is full-text search.

I would recommend a very good library written in C#: http://anastasiosyal.com/post/2009/01/11/Beyond-SoundEx-Functions-for-Fuzzy-Searching-in-MS-SQL-Server . It implements many string metric algorithms and can be imported as CLR functions in SQL Server. It can have performance issues with large amounts of data.
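To illustrate what a string metric does with the example destinations, here is a minimal sketch in Python. It is not the API of the library linked above; it uses the standard-library `difflib.SequenceMatcher` as a stand-in metric. The function names `normalize` and `similarity` are hypothetical, and sorting the words before comparing is an assumption made so that word order does not affect the score.

```python
import re
from difflib import SequenceMatcher

def normalize(text):
    # Lowercase, keep only alphanumeric words, and sort them
    # so that "O2 Mobile" and "Mobile - O2" normalize identically.
    return " ".join(sorted(re.findall(r"[a-z0-9]+", text.lower())))

def similarity(a, b):
    # Ratio between 0 and 1; 1.0 means the normalized strings are identical.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

print(similarity("United Kingdom - Mobile - O2", "United Kingdom - O2 Mobile"))
```

A score like this still needs a cut-off chosen by testing against real data before it can be treated as an accurate match.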

I can also recommend, especially because you are importing data, creating an SSIS package. In a package you can use the Fuzzy Lookup Transformation block to identify similarities: http://msdn.microsoft.com/en-us/magazine/cc163731.aspx . I use it to identify duplicates, based on similarity, in a table with more than 1 million records. In both cases you will have to run some tests to determine the similarity percentage that gives accurate matching for your business case.

I have solved lots of problems like this. Split the database data into relevant columns (Country, Device, Brand) in a temp table. Split the user input data (Excel) into the same columns (Country, Device, Brand) before you import it into the database, then import the Excel data into a temp table. After that you can adjust your matching any way you want.
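A rough sketch of the splitting step, done in Python before the import, might look like this. Everything here is an assumption for illustration: the function name `split_destination`, the `DEVICE_WORDS` vocabulary, and the rule that the text before the first hyphen is the country (which would break for country names that contain a hyphen).

```python
import re

# Hypothetical vocabulary of device types; extend it for real data.
DEVICE_WORDS = {"mobile", "fixed", "landline"}

def split_destination(text):
    """Split a destination string into (country, device, brand)."""
    parts = [p.strip() for p in text.split("-")]
    # Assumption: the first hyphen-separated segment is the country.
    country = " ".join(parts[0].split()).lower()
    # Classify the remaining words as device or brand, ignoring their order.
    words = re.findall(r"\w+", " ".join(parts[1:]).lower())
    device = " ".join(w for w in words if w in DEVICE_WORDS)
    brand = " ".join(w for w in words if w not in DEVICE_WORDS)
    return country, device, brand

print(split_destination("United Kingdom - Mobile - O2"))
print(split_destination("United Kingdom - O2 Mobile"))
```

Once both sides are in the same three columns, the match becomes an ordinary equality join on Country, Device and Brand.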

You need to define a matching algorithm. If it is based on counting the words that match, regardless of the order in which they occur, here is one:

declare @t table(field varchar(200))
insert into @t values('United Kingdom - Mobile - O2')
declare @upload varchar(200) = ' United   Kingdom  -  O2    Mobile noise'

-- Find matching words, regardless of their order.
-- Characters listed in @IgnoreChars are replaced with spaces before the upload is split into words.
declare @IgnoreChars varchar(50) = char(13)+char(10)+char(9)+'-.,'
select t.field,
    MatchedWords = SUM(CASE WHEN m.WordFoundAt=0 THEN 0 ELSE 1 END),
    TotalWords = COUNT(*)
from @t t
    CROSS APPLY dbo.str_split(dbo.str_translate(@upload, @IgnoreChars, REPLICATE(' ', LEN(@IgnoreChars))), ' ') w
    OUTER APPLY (SELECT WordFoundAt = CHARINDEX(w.id, t.field)) m
where w.id <> ''
group by t.field

Result:

field                         MatchedWords  TotalWords
United Kingdom - Mobile - O2  4             5

The functions str_translate and str_split are not built-in, but I don't know how to post them here since attachments are not allowed.
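Since the two helper functions are not shown, the same word-counting idea can be sketched in Python for comparison. This is an approximation, not the author's functions: `count_matching_words` is a hypothetical name, `str.translate` and `str.split` stand in for str_translate and str_split, and lowercasing both sides is an assumption that mimics a case-insensitive SQL Server collation.

```python
def count_matching_words(reference, upload, ignore_chars="\r\n\t-.,"):
    # Replace every ignored character with a space (the role of str_translate).
    table = str.maketrans(ignore_chars, " " * len(ignore_chars))
    # split() without arguments drops empty strings (str_split plus the w.id <> '' filter).
    words = upload.lower().translate(table).split()
    # Substring test, like CHARINDEX(...) <> 0 in the query above.
    matched = sum(1 for w in words if w in reference.lower())
    return matched, len(words)

print(count_matching_words("United Kingdom - Mobile - O2",
                           " United   Kingdom  -  O2    Mobile noise"))
# prints (4, 5)
```

As in the T-SQL version, the substring test means a short uploaded word can match inside a longer reference word, so comparing whole words may be preferable when accuracy matters.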
