Match records with a combination of regex and lookup

I want to match personal records between two tables using the following logic:

  1. Regex match on last name, allowing minor variations, summarized by the following regex for a given last name: grepl("LNAME(.r|-| [ivx]|.*)", last_name, ignore.case = TRUE). The function fuzzyjoin::regex_*_join was suggested, but I'm not sure how to use it if the name isn't static.

  2. Match on first name based on the nicknames list. That is, match any name in nicknames[[fname]], or just fname if that entry is empty. This should also be case-insensitive.

  3. Exact match on city, not case-sensitive.
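The three rules above can be sketched as small base R predicates. This is only an illustration of the logic, not a join: the helper names lname_matches, fname_matches and city_matches are hypothetical, and the last name is assumed to contain no regex metacharacters because it is pasted into the pattern unescaped. Note that the `.*` alternative in the pattern can match the empty string, so the regex effectively tests whether the candidate contains the last name anywhere.

```r
# Nickname lookup copied from the question's example data (keys are lower-case).
nicknames <- list(
  john   = c("john", "jon", "johnny"),
  thomas = c("thomas", "tom", "tommy"),
  james  = c("james", "jamie", "jim")
)

# Rule 1: build the regex from the last name instead of hard-coding "LNAME".
# Assumes `lname` is a single plain string without regex metacharacters.
lname_matches <- function(lname, candidates) {
  pattern <- paste0(lname, "(.r|-| [ivx]|.*)")
  grepl(pattern, candidates, ignore.case = TRUE)
}

# Rule 2: accept any listed nickname, or only the name itself if it has no entry.
# `fname` must be a single string; `candidates` may be a vector.
fname_matches <- function(fname, candidates) {
  variants <- nicknames[[tolower(fname)]]
  if (is.null(variants)) variants <- tolower(fname)
  tolower(candidates) %in% variants
}

# Rule 3: exact, case-insensitive comparison.
city_matches <- function(city, candidates) {
  tolower(city) == tolower(candidates)
}

lname_matches("SMITH", c("Smith Jr.", "Black III", "Blacksmith"))  # TRUE FALSE TRUE
fname_matches("JOHN", c("Jon", "Tom", "JOHNNY"))                   # TRUE FALSE TRUE
fname_matches("Alice", c("alice", "Al"))                           # TRUE FALSE
city_matches("NEW YORK", c("New York", "Newark"))                  # TRUE FALSE
```

The "Blacksmith" case shows that the pattern as written also accepts names that merely contain the last name; anchoring it with `^` would be one way to tighten it if that is not intended.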

Right now I'm just iterating through df1 and implementing this logic by hand, but my data set is large and it's taking way too long. The manual implementation also doesn't lend itself to parallelization, which is a concern as I will want to optimize this in the future. There has to be a smarter way of doing this.

Example data:

library(tibble)

df1 <- tibble("lname1"=c("SMITH","BLACK","MILLER"),
              "fname1"=c("JOHN","THOMAS","JAMES"),
              "city"=c("NEW YORK","LOS ANGELES","SEATTLE"),
              "id1"=c("aaaa","bbbb","cccc"),
              "misc1"=c("bla","ble","bla"))

df2 <- tibble("lname2"=c("Smith Jr.","Black III","Miller-Muller","Smith"),
              "fname2"=c("Jon","Tom","Jamie","John"),
              "city"=c("New York","Los Angeles","Seattle","New York"),
              "id2"=c("1111","2222","3333","4444"),
              "misc2"=c("bonk","bzdonk","boom","bam"))

nicknames <- list("john"=c("john","jon","johnny"), 
                  "thomas"=c("thomas","tom","tommy"),
                  "james"=c("james","jamie","jim"))

Expected output:

expected_output <- tibble("id1"=c("aaaa","aaaa","bbbb","cccc"),
                          "id2"=c("1111","4444","2222","3333"),
                          "lname1"=c("SMITH","SMITH","BLACK","MILLER"),
                          "fname1"=c("JOHN","JOHN","THOMAS","JAMES"),
                          "lname2"=c("Smith Jr.","Smith","Black III","Miller-Muller"),
                          "fname2"=c("Jon","John","Tom","Jamie"),
                          "city"=c("New York","New York","Los Angeles","Seattle"),
                          "misc1"=c("bla","bla","ble","bla"),
                          "misc2"=c("bonk","bam","bzdonk","boom"))

# A tibble: 4 x 9
  id1   id2   lname1      fname1 lname2        fname2 city        misc1 misc2 
  <chr> <chr> <chr>       <chr>  <chr>         <chr>  <chr>       <chr> <chr> 
1 aaaa  1111  SMITH       JOHN   Smith Jr.     Jon    New York    bla   bonk  
2 aaaa  4444  SMITH       JOHN   Smith         John   New York    bla   bam   
3 bbbb  2222  BLACK       THOMAS Black III     Tom    Los Angeles ble   bzdonk
4 cccc  3333  MILLER      JAMES  Miller-Muller Jamie  Seattle     bla   boom  

Answer (covers the last-name regex match):

fuzzyjoin::regex_right_join(
  df2, df1, by = c(lname2 = "lname1"),
  ignore_case = TRUE)
# # A tibble: 4 x 10
#   lname2        fname2 city.x      id2   misc2  lname1 fname1 city.y      id1   misc1
#   <chr>         <chr>  <chr>       <chr> <chr>  <chr>  <chr>  <chr>       <chr> <chr>
# 1 Smith Jr.     Jon    New York    1111  bonk   SMITH  JOHN   NEW YORK    aaaa  bla  
# 2 Smith         John   New York    4444  bam    SMITH  JOHN   NEW YORK    aaaa  bla  
# 3 Black III     Tom    Los Angeles 2222  bzdonk BLACK  THOMAS LOS ANGELES bbbb  ble  
# 4 Miller-Muller Jamie  Seattle     3333  boom   MILLER JAMES  SEATTLE     cccc  bla  

I didn't want to assume any resolution for city.x vs city.y; while it's clear visually that they're good, I'll let you work through that.
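For the remaining two rules, one possible approach is to turn the nickname list into a long lookup table, do an ordinary equi-join on lower-cased city and first-name variant, and only then apply the per-row last-name regex to the much smaller set of candidates. The following is a minimal sketch of that idea in base R with merge(), not the fuzzyjoin call above. The columns fname_key, variant and city_key and the intermediate names are introduced here for illustration, the misc columns are left out for brevity, and the city spelling is taken from df2 to mirror the expected output.

```r
df1 <- data.frame(
  lname1 = c("SMITH", "BLACK", "MILLER"),
  fname1 = c("JOHN", "THOMAS", "JAMES"),
  city   = c("NEW YORK", "LOS ANGELES", "SEATTLE"),
  id1    = c("aaaa", "bbbb", "cccc"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  lname2 = c("Smith Jr.", "Black III", "Miller-Muller", "Smith"),
  fname2 = c("Jon", "Tom", "Jamie", "John"),
  city   = c("New York", "Los Angeles", "Seattle", "New York"),
  id2    = c("1111", "2222", "3333", "4444"),
  stringsAsFactors = FALSE
)
nicknames <- list(
  john   = c("john", "jon", "johnny"),
  thomas = c("thomas", "tom", "tommy"),
  james  = c("james", "jamie", "jim")
)

# Long lookup table: one row per (canonical first name, accepted variant).
nick_tbl <- data.frame(
  fname_key = rep(names(nicknames), lengths(nicknames)),
  variant   = unlist(nicknames, use.names = FALSE),
  stringsAsFactors = FALSE
)

# Lower-cased join keys make rules 2 and 3 case-insensitive.
df1$fname_key <- tolower(df1$fname1)
df1$city_key  <- tolower(df1$city)
df2$variant   <- tolower(df2$fname2)
df2$city_key  <- tolower(df2$city)

# First names without a nickname entry should match only themselves.
missing  <- setdiff(unique(df1$fname_key), nick_tbl$fname_key)
nick_tbl <- rbind(
  nick_tbl,
  data.frame(fname_key = missing, variant = missing, stringsAsFactors = FALSE)
)

# Expand df1 by accepted variants, then equi-join on city and first name.
left <- merge(df1, nick_tbl, by = "fname_key")
cand <- merge(left, df2, by = c("city_key", "variant"))

# Rule 1: the regex differs per row, so evaluate it row by row on the candidates.
keep <- vapply(seq_len(nrow(cand)), function(i) {
  pattern <- paste0(cand$lname1[i], "(.r|-| [ivx]|.*)")
  grepl(pattern, cand$lname2[i], ignore.case = TRUE)
}, logical(1))

# Both inputs have a `city` column, so merge() suffixes them; keep df2's spelling.
result <- cand[keep, c("id1", "id2", "lname1", "fname1", "lname2", "fname2", "city.y")]
names(result)[names(result) == "city.y"] <- "city"
result <- result[order(result$id1, result$id2), ]
rownames(result) <- NULL
print(result)
```

On the example data this yields the four id1/id2 pairs from the expected output. Because the two joins are plain key matches, the row-wise regex only runs on pairs that already agree on city and first name, which is usually far fewer than all combinations of rows.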
