match two vectors by similar characters/strings in R

Question

I have two vectors, like

v1<-c("yellow", "red", "orange", "blue", "green")
v2<-c("blues", "redx", "grean")

and I want to match them, i.e., to link each element of v1 with the most similar element of v2, so that the result is

> df
      v1    v2
1 yellow  <NA>
2    red  redx
3 orange  <NA>
4   blue blues
5  green grean

The following code gives the expected result, but only because it has been manually tailored to do so:

df<-data.frame(v1,v2=rep(NA,5))

for (i in 1:nrow(df)) {
  
  ag<-agrep(df[i,1], v2, ignore.case = T, value = T)
  
  if (length(ag)==0) {df[i,2]<-NA}
  else if (length(ag)==1) {df[i,2]<-ag}
  else {df[i,2]<-ag[1]}
  
}

The problem is that agrep(df[2,1], v2, max.distance = 0.00001, ignore.case = T, value = T) returns "redx" "grean", even with max.distance set to 0.00001.

That's why I have the if conditions, but they don't guarantee that the most similar element is selected.

How can I overcome this issue?

Thank you in advance

Answer 1

Maybe the following can solve your problem. It uses stringdistmatrix from the stringdist package, which builds a length(v1) by length(v2) distance matrix and can therefore become a memory problem if the vectors v1 and v2 are large.

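# df is the data frame from the question: data.frame(v1, v2 = rep(NA, 5))
d <- stringdist::stringdistmatrix(v1, v2, method = "osa")
# (row, col) pairs at distance 1; keeping them paired avoids mixing up matches
idx <- which(d == 1, arr.ind = TRUE)
df$v2[idx[, 1]] <- v2[idx[, 2]]

df
#      v1    v2
#1 yellow  <NA>
#2    red  redx
#3 orange  <NA>
#4   blue blues
#5  green grean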
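
If the full distance matrix is a concern, one possible alternative is amatch from the same stringdist package, which returns for each element of v1 the position of its closest element in v2, or NA when nothing lies within maxDist. This is only a sketch: the cutoff maxDist = 1 and the name res are assumptions made for this example, and the stringdist package must be installed.

```r
v1 <- c("yellow", "red", "orange", "blue", "green")
v2 <- c("blues", "redx", "grean")

# Position in v2 of the closest match for each element of v1 (OSA distance),
# NA when no element of v2 is within maxDist edits. maxDist = 1 is an assumed cutoff.
pos <- stringdist::amatch(v1, v2, method = "osa", maxDist = 1)

# Indexing v2 with NA positions yields NA, so unmatched rows stay empty
res <- data.frame(v1, v2 = v2[pos], stringsAsFactors = FALSE)
res
#      v1    v2
#1 yellow  <NA>
#2    red  redx
#3 orange  <NA>
#4   blue blues
#5  green grean
```

Raising maxDist makes the matching more permissive; when several candidates tie, the first one in v2 is returned.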

Answer 2

You could try:

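s <- which(adist(v1, v2) <= 1, arr.ind = TRUE) # 1 is the maximum allowed change
# start from a full-length NA vector so the result always has length(v1) entries
data.frame(v1, v2 = replace(rep(NA_character_, length(v1)), s[, 1], v2[s[, 2]]))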
      v1    v2
1 yellow  <NA>
2    red  redx
3 orange  <NA>
4   blue blues
5  green grean
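
To make sure the most similar element is chosen when several candidates fall within the threshold, one option is to take the minimum distance per row explicitly. The sketch below uses only base R; the cutoff max_dist = 1 and the names best, best_dist and res are assumptions made for this example. Because adist compares whole strings, it should also avoid the substring matches that agrep can return (such as "red" matching inside "grean").

```r
v1 <- c("yellow", "red", "orange", "blue", "green")
v2 <- c("blues", "redx", "grean")

max_dist <- 1  # assumed cutoff: largest edit distance still accepted as a match

# length(v1) x length(v2) matrix of whole-string edit distances
d <- adist(v1, v2, ignore.case = TRUE)

# For each element of v1, the column of the closest element of v2 (first one on ties)
best <- apply(d, 1, which.min)
best_dist <- d[cbind(seq_along(v1), best)]

# Keep the closest candidate only if it is within the cutoff
res <- data.frame(
  v1,
  v2 = ifelse(best_dist <= max_dist, v2[best], NA_character_),
  stringsAsFactors = FALSE
)
res
```

With the vectors from the question this gives the same table as above, and each row gets at most one match, the nearest one.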

solution1
0 2021-06-02 19:59:09

solution2
0 ACCEPTED 2021-06-02 20:06:28