I have to do softmatch in one column of data frame with the given input string, like
col <- c("John Collingson","J Collingson","Dummy Name1","Dummy Name2")
inputText <- "J Collingson"
#Vice-Versa
inputText <- "John Collingson"
I want to retrieve both "John Collingson" & "J Collingson" from the provided colname "col"
Kindly help
agrep
is definitely a quick and easy base R solution if you have just a bit of data. If this is just a toy example of a larger data frame, you may be interested in a more durable tool. In the past month, learning about the Levenshtein distance noted by @PaulHiemstra (also in these different questions ) led me to theRecordLinkage package. The vignettes leave me wanting more examples of the "soft" or fuzzy" matches, particularly across more than 1 field, but the basic answer to your question could be somthing like:
library(RecordLinkage)
col <- data.frame(names1 = c("John Collingson","J Collingson","Dummy Name1","Dummy Name2"))
inputText <- data.frame(names2 = c("J Collingson"))
g1 <- compare.linkage(inputText, col, strcmp = T)
g2 <- epiWeights(g1)
getPairs(g2, min.weight=0.6)
# id names2 Weight
# 1 1 J Collingson
# 2 2 J Collingson 1.000
# 3
# 4 1 J Collingson
# 5 1 John Collingson 0.815
inputText2 <- data.frame(names2 = c("Jon Collinson"))
g1 <- compare.linkage(inputText2, col, strcmp = T)
g2 <- epiWeights(g1)
getPairs(g2, min.weight=0.6)
# id names2 Weight
# 1 1 Jon Collinson
# 2 1 John Collingson 0.9644444
# 3
# 4 1 Jon Collinson
# 5 2 J Collingson 0.7924825
Please start with compare.linkage() or compare.dedup()-- RLBigDataLinkage() or RLBigDataDedup() for large data sets. Hope this helps.
It seems that agrep
is the function you are looking for. It does Approximate String Matching (Fuzzy Matching)
. It returns the closest match to the input pattern according to some distance measure, ie the generalized Levenshtein edit distance. See ?agrep
for more details.
agrep("J Collingson", col, value = TRUE)
[1] "John Collingson" "J Collingson"
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.