Match records with a combination of regex and lookup
I want to match personal records between two tables using the following logic:
Regex match on last name, allowing minor variations, summarized by the following regex for a given last name:
grepl("LNAME(.r|-| [ivx]|.*)", last_name, ignore.case = TRUE)
The function fuzzyjoin::regex_*_join was suggested, but I'm not sure how to use it when the name isn't static.
Match on first name based on the nicknames list: match any name in nicknames[[fname]], or just fname if that entry is empty. This should also be case-insensitive.
Exact match on city, not case-sensitive.
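As a minimal sketch of the three rules above applied to a single pair of records, in base R: the function names (lname_matches, fname_matches, city_matches, records_match) are hypothetical and only for illustration, and the last name is assumed to contain no regex metacharacters.

```r
# Nickname lookup from the example data, keyed by lower-case first name.
nicknames <- list(
  john   = c("john", "jon", "johnny"),
  thomas = c("thomas", "tom", "tommy"),
  james  = c("james", "jamie", "jim")
)

# Rule 1: substitute the actual last name for LNAME in the pattern.
# Assumes the name contains no regex metacharacters.
lname_matches <- function(lname1, lname2) {
  grepl(paste0(lname1, "(.r|-| [ivx]|.*)"), lname2, ignore.case = TRUE)
}

# Rule 2: accept any listed nickname, or the name itself if there is no entry.
fname_matches <- function(fname1, fname2, nicknames) {
  variants <- nicknames[[tolower(fname1)]]
  if (length(variants) == 0) variants <- fname1
  tolower(fname2) %in% tolower(variants)
}

# Rule 3: exact city match, ignoring case.
city_matches <- function(city1, city2) tolower(city1) == tolower(city2)

# Hypothetical wrapper combining the three rules for one pair of records.
records_match <- function(l1, f1, c1, l2, f2, c2, nicknames) {
  lname_matches(l1, l2) && fname_matches(f1, f2, nicknames) && city_matches(c1, c2)
}

records_match("SMITH", "JOHN", "NEW YORK", "Smith Jr.", "Jon", "New York", nicknames)     # TRUE
records_match("SMITH", "JOHN", "NEW YORK", "Black III", "Tom", "Los Angeles", nicknames)  # FALSE
```

Note that because the pattern is unanchored and one of its alternatives is .*, it matches any last name that contains LNAME; prefixing the pattern with ^ would restrict it to names that start with LNAME.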
Right now I'm just iterating through df1 and implementing this logic by hand, but my data set is large and it's taking far too long. The manual implementation also doesn't lend itself to parallelization, which is a concern because I will want to optimize this in the future. There has to be a smarter way of doing this.
Example data:
library(tibble)

df1 <- tibble("lname1"=c("SMITH","BLACK","MILLER"),
              "fname1"=c("JOHN","THOMAS","JAMES"),
              "city"=c("NEW YORK","LOS ANGELES","SEATTLE"),
              "id1"=c("aaaa","bbbb","cccc"),
              "misc1"=c("bla","ble","bla"))
df2 <- tibble("lname2"=c("Smith Jr.","Black III","Miller-Muller","Smith"),
              "fname2"=c("Jon","Tom","Jamie","John"),
              "city"=c("New York","Los Angeles","Seattle","New York"),
              "id2"=c("1111","2222","3333","4444"),
              "misc2"=c("bonk","bzdonk","boom","bam"))
nicknames <- list("john"=c("john","jon","johnny"),
                  "thomas"=c("thomas","tom","tommy"),
                  "james"=c("james","jamie","jim"))
Expected output:
expected_output <- tibble("id1"=c("aaaa","aaaa","bbbb","cccc"),
                          "id2"=c("1111","4444","2222","3333"),
                          "lname1"=c("SMITH","SMITH","BLACK","MILLER"),
                          "fname1"=c("JOHN","JOHN","THOMAS","JAMES"),
                          "lname2"=c("Smith Jr.","Smith","Black III","Miller-Muller"),
                          "fname2"=c("Jon","John","Tom","Jamie"),
                          "city"=c("New York","New York","Los Angeles","Seattle"),
                          "misc1"=c("bla","bla","ble","bla"),
                          "misc2"=c("bonk","bam","bzdonk","boom"))
# A tibble: 4 x 9
  id1   id2   lname1 fname1 lname2        fname2 city        misc1 misc2
  <chr> <chr> <chr>  <chr>  <chr>         <chr>  <chr>       <chr> <chr>
1 aaaa  1111  SMITH  JOHN   Smith Jr.     Jon    New York    bla   bonk
2 aaaa  4444  SMITH  JOHN   Smith         John   New York    bla   bam
3 bbbb  2222  BLACK  THOMAS Black III     Tom    Los Angeles ble   bzdonk
4 cccc  3333  MILLER JAMES  Miller-Muller Jamie  Seattle     bla   boom
fuzzyjoin::regex_right_join(
  df2, df1, by = c(lname2 = "lname1"),
  ignore_case = TRUE)
# # A tibble: 4 x 10
#   lname2        fname2 city.x      id2   misc2  lname1 fname1 city.y      id1   misc1
#   <chr>         <chr>  <chr>       <chr> <chr>  <chr>  <chr>  <chr>       <chr> <chr>
# 1 Smith Jr.     Jon    New York    1111  bonk   SMITH  JOHN   NEW YORK    aaaa  bla
# 2 Smith         John   New York    4444  bam    SMITH  JOHN   NEW YORK    aaaa  bla
# 3 Black III     Tom    Los Angeles 2222  bzdonk BLACK  THOMAS LOS ANGELES bbbb  ble
# 4 Miller-Muller Jamie  Seattle     3333  boom   MILLER JAMES  SEATTLE     cccc  bla
I didn't want to assume any resolution for city.x vs city.y; while it's visually clear that they agree here, I'll let you work through that.
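One possible way to handle the city and first-name rules, sketched here in base R rather than with fuzzyjoin so that it runs without extra packages: join on a lower-cased city key first, then keep only the rows whose last and first names pass. The names city_key, cand, and matched are hypothetical, the misc columns are omitted for brevity, and each lname1 is assumed to be safe to use directly as a regex.

```r
# Example data from the question, as base data frames (misc columns omitted).
df1 <- data.frame(
  lname1 = c("SMITH", "BLACK", "MILLER"),
  fname1 = c("JOHN", "THOMAS", "JAMES"),
  city   = c("NEW YORK", "LOS ANGELES", "SEATTLE"),
  id1    = c("aaaa", "bbbb", "cccc"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  lname2 = c("Smith Jr.", "Black III", "Miller-Muller", "Smith"),
  fname2 = c("Jon", "Tom", "Jamie", "John"),
  city   = c("New York", "Los Angeles", "Seattle", "New York"),
  id2    = c("1111", "2222", "3333", "4444"),
  stringsAsFactors = FALSE
)
nicknames <- list(
  john   = c("john", "jon", "johnny"),
  thomas = c("thomas", "tom", "tommy"),
  james  = c("james", "jamie", "jim")
)

# Step 1: exact, case-insensitive city match via a lower-cased key.
# city_key is a hypothetical helper column; merge() keeps city.x and city.y.
df1$city_key <- tolower(df1$city)
df2$city_key <- tolower(df2$city)
cand <- merge(df1, df2, by = "city_key")

# Step 2: row-wise last-name check, using lname1 as the pattern.
lname_ok <- mapply(
  function(pattern, x) grepl(pattern, x, ignore.case = TRUE),
  cand$lname1, cand$lname2, USE.NAMES = FALSE
)

# Step 3: first name must be a listed nickname, or the name itself if no entry.
fname_ok <- mapply(
  function(f1, f2) {
    variants <- nicknames[[tolower(f1)]]
    if (length(variants) == 0) variants <- f1
    tolower(f2) %in% tolower(variants)
  },
  cand$fname1, cand$fname2, USE.NAMES = FALSE
)

matched <- cand[lname_ok & fname_ok,
                c("id1", "id2", "lname1", "fname1", "lname2", "fname2")]
matched
```

Joining on city first keeps the number of candidate pairs small before the more expensive regex and nickname checks run. The same two filters could in principle be applied to the regex_right_join result above, comparing tolower(city.x) with tolower(city.y).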