Find matching persons from one table with another (MS SQL Server)

I have two tables:

Table "Person"

ID          FirstName  LastName
----------- ---------- ----------
1           Janez      Novak
2           Matija     Špacapan
3           Francka    Joras

Table "UserList"

ID    FullName
----- --------------------
1     Andrej Novak
2     Novak Peter Janez
3     Jana Novak
4     Andrej Kosir
5     Jan Balon
6     Francka Joras
7     France Joras

As a result, the query must return the IDs from both tables for every pair where both FirstName and LastName from table Person occur in FullName in table UserList. The first name and last name must match exactly. FullName in table UserList can include a middle name, which should be ignored.

Match: Janez Novak = Janez Novak OR Novak Janez OR Janez Peter Novak

Not a match: Janez Novak <> Janeza Novak OR Jjanez Novak

Wanted results:

ID   FirstName  LastName  ID   FullName
---- ---------- --------- ---- -------------------
1    Janez      Novak     2    Novak Peter Janez
3    Francka    Joras     6    Francka Joras

This is my query:

SELECT 
    A.ID
    ,A.FirstName
    ,A.LastName
    ,B.ID
    ,B.FullName
FROM    
    dbo.UserList B
    CROSS JOIN dbo.Person A 
WHERE   
    (
    -- wrap every word of FullName in quotes so that only whole words can match
    CHARINDEX('"'+A.FirstName+'"', '"'+REPLACE(B.FullName,' ','"')+'"') > 0
    AND CHARINDEX('"'+A.LastName+'"', '"'+REPLACE(B.FullName,' ','"')+'"') > 0
    )

The query works fine when the tables are small.

But my tables hold about 400k records in "Person" and 14k records in "UserList".

Is my approach OK, or is there a more efficient way to do this? Thank you.


Your schema is broken :p

There are various heuristics for doing the matching, but I expect you'll be able to find counterexamples that break whatever you try. For example, what about these four people: Peter Smith, Pete Smith, Peter Smithson, and Pete Smithson?

Here's a %LIKE% approach, which I'd expect to be slow.

SELECT p.ID, p.FirstName, p.LastName, u.ID, u.FullName,
    CASE WHEN COUNT(*) OVER (PARTITION BY p.ID) > 1 THEN 0 ELSE 1 END AS MatchIsUnique
FROM Person p
    INNER JOIN UserList u
        -- the spaces stop 'Janez' from matching 'Janeza'; only "First ... Last" order is covered
        ON u.FullName LIKE p.FirstName + ' %'
        AND u.FullName LIKE '% ' + p.LastName

Here's a string manipulation approach based on the assumption that the space character is the delimiter.

SELECT p.ID, p.FirstName, p.LastName, u.ID, u.FullName,
    CASE WHEN COUNT(*) OVER (PARTITION BY p.ID) > 1 THEN 0 ELSE 1 END AS MatchIsUnique
FROM Person p
    INNER JOIN UserList u
        -- first word of FullName (the appended space covers single-word names)
        ON p.FirstName = LEFT(u.FullName, CHARINDEX(' ', u.FullName + ' ') - 1)
        -- last word of FullName
        AND p.LastName = RIGHT(u.FullName, CHARINDEX(' ', REVERSE(u.FullName) + ' ') - 1)

Probably also quite slow. Maybe you could speed it up by adding

  • LEFT(FullName, CHARINDEX(' ', FullName + ' ') - 1)
  • RIGHT(FullName, CHARINDEX(' ', REVERSE(FullName) + ' ') - 1)

as computed columns and indexing them.
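
The computed-column idea could look roughly like the sketch below. The column names FirstWord and LastWord and the index name IX_UserList_FirstLast are made up for illustration, and the sketch assumes FullName is a bounded nvarchar(n) short enough to be used as an index key. It only covers names stored in "First [Middle] Last" order, and it has not been measured against tables of the sizes mentioned in the question.

```sql
-- Hypothetical names: FirstWord, LastWord, IX_UserList_FirstLast.
-- GO is the batch separator used by SSMS/sqlcmd; the new columns must exist
-- before a later batch can reference them.
ALTER TABLE dbo.UserList ADD
    FirstWord AS LEFT(FullName, CHARINDEX(' ', FullName + ' ') - 1),
    LastWord  AS RIGHT(FullName, CHARINDEX(' ', REVERSE(FullName) + ' ') - 1);
GO

CREATE INDEX IX_UserList_FirstLast ON dbo.UserList (FirstWord, LastWord);
GO

-- The join is now a plain equality comparison on indexed columns.
SELECT p.ID, p.FirstName, p.LastName, u.ID AS UserID, u.FullName
FROM dbo.Person AS p
    INNER JOIN dbo.UserList AS u
        ON u.FirstWord = p.FirstName
        AND u.LastWord = p.LastName;
```

To also catch "Last [Middle] First" order, a second SELECT with the two columns swapped could be combined with this one using UNION.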

Create tables

create table persons (
  id int IDENTITY(1,1) PRIMARY KEY,
  FirstName nvarchar(32) NOT NULL,
  LastName nvarchar(32) NOT NULL
);

create table users (
  id int IDENTITY(1,1) PRIMARY KEY,
  FullName nvarchar(32) NOT NULL
);

Sample data

INSERT INTO persons (FirstName, LastName)
values
('Janez','Novak'),
('Matija','Špacapan'),
('Francka','Joras');

INSERT INTO users (FullName)
VALUES
('Andrej Novak'),
('Novak Peter Janez'),
('Jana Novak'),
('Andrej Kosir'),
('Jan Balon'),
('Francka Joras'),
('France Joras'),

/* --EDIT: added sample data for wildcard testing-- */
('Franckas Joras'), -- added 's' after firstname
('Francka AJoras'), -- added 'A' before lastname
('Franckas AJoras'), -- both above
('Francka Jr. Joras'), -- added just midname
('Franckas Jr. Joras'); -- added 's' after firstname & added midname as well

Query (matching names)

SELECT p.id, p.FirstName, p.LastName, u.id as user_id, u.FullName
FROM persons p, users u
WHERE
  -- EDIT
  /* changed wildcards (added spaces on both sides)
  + added 2 more conditions without wildcards */
  u.FullName LIKE CONCAT(p.FirstName, ' % ', p.LastName)
  OR
  u.FullName LIKE CONCAT(p.LastName, ' % ', p.FirstName)
  OR
  u.FullName LIKE CONCAT(p.FirstName, ' ', p.LastName)
  OR
  u.FullName LIKE CONCAT(p.LastName, ' ', p.FirstName)

Output

[image: SO-72348127]

EDIT: output with new sample data (for wildcard testing)

[image: SO-72348127 (2)]

Running example: SQL Fiddle

The example link above is for MySQL, but the code also works on SQL Server.

One method you could try is to split the full names into rows and then compare, selecting only those where both first and last name match:

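select m.id Id, m.firstname FirstName, m.lastname LastName,
  u.id, u.fullname FullName
from userlist u
cross apply String_Split(u.fullname, ' ') s
cross apply (
    select *
    from person p
    where p.firstname = s.value or p.lastname = s.value
) m
-- group per (user, person) pair so words matched by different persons are not counted together
group by u.id, u.fullname, m.id, m.firstname, m.lastname
-- require one word equal to the first name and one word equal to the last name
having Max(case when m.firstname = s.value then 1 else 0 end) = 1
   and Max(case when m.lastname = s.value then 1 else 0 end) = 1;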

Output:

[image]
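
A variant of the same split idea, sketched under some assumptions: break each FullName into words once, then join Person to those words using only equality predicates, which leaves the optimizer free to pick hash or merge joins instead of comparing every pair of rows. The CTE name Tokens and the alias UserID are made up for this sketch. STRING_SPLIT needs SQL Server 2016 or later with database compatibility level 130 or higher, and the query is untested against tables of the sizes mentioned in the question.

```sql
-- Hypothetical names: Tokens, UserID, Token.
;WITH Tokens AS (
    -- one row per distinct word of each FullName; empty strings come from doubled spaces
    SELECT DISTINCT u.ID AS UserID, s.value AS Token
    FROM dbo.UserList AS u
    CROSS APPLY STRING_SPLIT(u.FullName, ' ') AS s
    WHERE s.value <> ''
)
SELECT p.ID, p.FirstName, p.LastName, u.ID AS UserID, u.FullName
FROM dbo.Person AS p
    -- some word of the full name equals the first name
    INNER JOIN Tokens AS tf
        ON tf.Token = p.FirstName
    -- another word of the same full name equals the last name
    INNER JOIN Tokens AS tl
        ON tl.UserID = tf.UserID
        AND tl.Token = p.LastName
    INNER JOIN dbo.UserList AS u
        ON u.ID = tf.UserID;
```

Word order and middle names do not matter here, because any word may match either name. On the sample rows from the question this should pair Person 1 with UserList 2 and Person 3 with UserList 6. One caveat: a person whose first and last name are identical would match a FullName that contains that word only once. For large tables it may help to store the words in an indexed temporary table instead of a CTE.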
