How to get an accurate JOIN using fuzzy matching in Oracle

I'm trying to join a set of county names from one table with county names in another table. The issue is that the county names in the two tables are not normalized. The two sets differ in count, and the names do not always follow the same pattern. For instance, the county 'SAINT JOHNS' in "Table A" may be represented as 'ST JOHNS' in "Table B". We cannot predict a common pattern for them.

That means we cannot use an "equal to" ( = ) condition when joining. So I'm trying to join them using the JARO_WINKLER_SIMILARITY function in Oracle. My left outer join condition would be something like:

Table_A.State = Table_B.State 
AND UTL_MATCH.JARO_WINKLER_SIMILARITY(Table_A.County_Name,Table_B.County_Name)>=80

I settled on a threshold of 80 after some testing of the results, and it seemed to be optimal. The issue is that I'm getting a set of false positives when joining. For instance, if there are counties with similar names in the same state ('BARRY' and 'BAY', for example), they will be matched if the measure is >= 80. This creates an inaccurate set of joined data. Can anyone please suggest a workaround?

Thanks, DAV

Can you please help me build a query that will look up Table_A for each record in Table B/C/D, and match it against the county name in A with the highest-ranked similarity that is >= 80?

Oracle Setup:

CREATE TABLE official_words ( word ) AS
  SELECT 'SAINT JOHNS' FROM DUAL UNION ALL
  SELECT 'MONTGOMERY' FROM DUAL UNION ALL
  SELECT 'MONROE' FROM DUAL UNION ALL
  SELECT 'SAINT JAMES' FROM DUAL UNION ALL
  SELECT 'BOTANY BAY' FROM DUAL;

CREATE TABLE words_to_match ( word ) AS
  SELECT 'SAINT JOHN' FROM DUAL UNION ALL
  SELECT 'ST JAMES' FROM DUAL UNION ALL
  SELECT 'MONTGOMERY BAY' FROM DUAL UNION ALL
  SELECT 'MONROE ST' FROM DUAL;

Query:

SELECT *
FROM   (
  SELECT wtm.word,
         ow.word AS official_word,
         UTL_MATCH.JARO_WINKLER_SIMILARITY( wtm.word, ow.word ) AS similarity,
         ROW_NUMBER() OVER ( PARTITION BY wtm.word ORDER BY UTL_MATCH.JARO_WINKLER_SIMILARITY( wtm.word, ow.word ) DESC ) AS rn
  FROM   words_to_match wtm
         INNER JOIN
         official_words ow
         ON ( UTL_MATCH.JARO_WINKLER_SIMILARITY( wtm.word, ow.word )>=80 )
)
WHERE rn = 1;

Output:

WORD           OFFICIAL_WO SIMILARITY         RN
-------------- ----------- ---------- ----------
MONROE ST      MONROE              93          1
MONTGOMERY BAY MONTGOMERY          94          1
SAINT JOHN     SAINT JOHNS         98          1
ST JAMES       SAINT JAMES         80          1
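
The same pattern could be adapted to the tables in the question. The following is only a sketch: it assumes Table_A and Table_B each have State and County_Name columns (as the question's join condition suggests), and that ( State, County_Name ) identifies a single row in Table_A. It keeps the question's LEFT OUTER JOIN, so Table_A rows with no candidate scoring at least 80 are still returned, with NULL in the matched columns.

```sql
-- Sketch: keep every Table_A row and attach only its single best
-- Table_B candidate within the same state (if any scores >= 80).
SELECT state, county_name, matched_name, similarity
FROM   (
  SELECT a.state,
         a.county_name,
         b.county_name AS matched_name,
         UTL_MATCH.JARO_WINKLER_SIMILARITY( a.county_name, b.county_name ) AS similarity,
         -- Number the candidates for each Table_A county, best score first.
         ROW_NUMBER() OVER (
           PARTITION BY a.state, a.county_name
           ORDER BY UTL_MATCH.JARO_WINKLER_SIMILARITY( a.county_name, b.county_name ) DESC
         ) AS rn
  FROM   Table_A a
         LEFT OUTER JOIN
         Table_B b
         ON (     b.state = a.state
              AND UTL_MATCH.JARO_WINKLER_SIMILARITY( a.county_name, b.county_name ) >= 80 )
)
WHERE rn = 1;
```

Because only the top-ranked candidate survives, a weaker match such as 'BAY' is dropped whenever a stronger candidate exists for the same county. It can still appear when it is the only candidate above the threshold.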

Using some made-up test data inline (you would use your own TABLE_A and TABLE_B in place of the first two with clauses, and begin at with matches as ... ):

with table_a (state, county_name) as
     ( select 'A', 'ST JOHNS' from dual union all
       select 'A', 'BARRY' from dual union all
       select 'B', 'CHEESECAKE' from dual union all
       select 'B', 'WAFFLES' from dual union all
       select 'C', 'UMBRELLAS' from dual )
   , table_b (state, county_name) as
     ( select 'A', 'SAINT JOHNS' from dual union all
       select 'A', 'SAINT JOANS' from dual union all
       select 'A', 'BARRY' from dual union all
       select 'A', 'BARRIERS' from dual union all
       select 'A', 'BANANA' from dual union all
       select 'A', 'BANOFFEE' from dual union all
       select 'B', 'CHEESE' from dual union all
       select 'B', 'CHIPS' from dual union all
       select 'B', 'CHICKENS' from dual union all
       select 'B', 'WAFFLING' from dual union all
       select 'B', 'KITTENS' from dual union all
       select 'C', 'PUPPIES' from dual union all
       select 'C', 'UMBRIA' from dual union all
       select 'C', 'UMBRELLAS' from dual )
   , matches as
     ( select a.state, a.county_name, b.county_name as matched_name
            , utl_match.jaro_winkler_similarity(a.county_name,b.county_name) as score
       from   table_a a
              join table_b b on b.state = a.state  )
   , ranked_matches as
     ( select m.*
            , rank() over (partition by m.state, m.county_name order by m.score desc) as ranking
       from   matches m
       where  score > 50 )
select rm.state, rm.county_name, rm.matched_name, rm.score
from   ranked_matches rm
where  ranking = 1
order by 1,2;

Results:

STATE COUNTY_NAME MATCHED_NAME      SCORE
----- ----------- ------------ ----------
A     BARRY       BARRY               100
A     ST JOHNS    SAINT JOHNS          80
B     CHEESECAKE  CHEESE               92
B     WAFFLES     WAFFLING             86
C     UMBRELLAS   UMBRELLAS           100

The idea is that matches computes all scores, ranked_matches ranks them within ( state , county_name ), and the final query picks all the top scorers (i.e. filters on ranking = 1 ).

You may still get some duplicates, as there is nothing to stop two different fuzzy matches from scoring the same.
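
If exactly one match per county is needed, one option is to swap rank() for row_number() and add tie-breakers to the ordering. The sketch below is an illustration rather than part of the original answer. The choice of tie-breakers (smallest UTL_MATCH.EDIT_DISTANCE first, then alphabetical order) and the edit_dist alias are assumptions, and the two similar candidate names are used only to give the ordering something to decide.

```sql
with table_a (state, county_name) as
     ( select 'A', 'ST JOHNS' from dual )
   , table_b (state, county_name) as
     ( select 'A', 'SAINT JOHNS' from dual union all
       select 'A', 'SAINT JOANS' from dual )
   , matches as
     ( select a.state, a.county_name, b.county_name as matched_name
            , utl_match.jaro_winkler_similarity(a.county_name, b.county_name) as score
            -- assumed secondary measure, used only to break ties
            , utl_match.edit_distance(a.county_name, b.county_name) as edit_dist
       from   table_a a
              join table_b b on b.state = a.state )
   , ranked_matches as
     ( select m.*
            -- row_number() never assigns the same position twice, so at most
            -- one row per (state, county_name) ends up with ranking = 1
            , row_number() over ( partition by m.state, m.county_name
                                  order by m.score desc, m.edit_dist, m.matched_name ) as ranking
       from   matches m
       where  m.score > 50 )
select rm.state, rm.county_name, rm.matched_name, rm.score
from   ranked_matches rm
where  rm.ranking = 1;
```

The final matched_name ordering is arbitrary, but it makes the result repeatable between runs. Whether edit distance is a sensible second criterion depends on the data.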
