
Compare two columns and identify character difference in R

I am trying to write a program that examines two columns of text and identifies single errors. For example:

 1    2  
bat  bad  
tap  ta  
tap  tape  

I would like the program to compare column one against column two, and to print the character difference.

Here's an approach using the stringdist package.

# Your data sample, plus a couple of extra rows
dat = data.frame(x=c(1,'bat','tap','tap','tapes','tapped'), 
                 y=c(2,'bad','ta','tape','tapes','tapas'))

dat
       x     y
1      1     2
2    bat   bad
3    tap    ta
4    tap  tape
5  tapes tapes
6 tapped tapas

library(stringdist)

# Distance methods available in stringdist
dist.methods = c("osa", "lv", "dl", "hamming", "lcs", "qgram",
                 "cosine", "jaccard", "jw", "soundex")

# Try all the methods with the sample data
sapply(dist.methods, function(m) stringdist(dat[,1],dat[,2], method=m))
     osa lv dl hamming lcs qgram    cosine   jaccard         jw soundex
[1,]   1  1  1       1   2     2 1.0000000 1.0000000 1.00000000       1
[2,]   1  1  1       1   2     2 0.3333333 0.5000000 0.22222222       0
[3,]   1  1  1     Inf   1     1 0.1835034 0.3333333 0.11111111       1
[4,]   1  1  1     Inf   1     1 0.1339746 0.2500000 0.08333333       0
[5,]   0  0  0       0   0     0 0.0000000 0.0000000 0.00000000       0
[6,]   3  3  3     Inf   5     5 0.3318469 0.5000000 0.30000000       1

Or, using adist, as suggested by @thelatemail:

apply(dat, 1, function(d) adist(d[1], d[2]))
 [1] 1 1 1 1 0 3 

adist uses the Levenshtein distance, equivalent to the lv method above. This is probably the method you want.
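Since the question asks for the difference itself and not only its size, it may help that base R's adist can also report which edit operations it used. The following is a minimal sketch, assuming plain character vectors x and y holding the two columns (header row left out); edit_ops is a hypothetical helper name, not part of any package. With counts = TRUE, adist attaches a "trafos" attribute that encodes each alignment as a string of M (match), I (insertion), D (deletion) and S (substitution).

```r
# Sample columns as character vectors (the "1"/"2" header row is left out)
x <- c("bat", "tap", "tap", "tapes", "tapped")
y <- c("bad", "ta", "tape", "tapes", "tapas")

# adist() is vectorised and returns the full length(x)-by-length(y) matrix;
# the row-by-row comparisons sit on its diagonal.
d <- diag(adist(x, y))

# edit_ops() is a hypothetical helper: it returns the edit-operation string
# for one pair, using M = match, I = insertion, D = deletion, S = substitution.
edit_ops <- function(a, b) {
  res <- adist(a, b, counts = TRUE)
  attr(res, "trafos")[1, 1]
}

ops <- mapply(edit_ops, x, y, USE.NAMES = FALSE)

print(data.frame(x, y, distance = d, ops, stringsAsFactors = FALSE))
```

Rows with a distance of 1 are the single-error cases, and the position of the non-M letter in ops shows roughly where the edit happened.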

For explanations of the different distance methods, see this web page.

Here is the code; I think this is what you are expecting.

df
  one  two
  bat  bad
  tap   ta
  tap tape

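getDiff <- function(dataframe) {
  result <- character(nrow(dataframe))
  for (i in seq_len(nrow(dataframe))) {
    # Split both strings into single characters (as.character guards against factors)
    str1 <- unlist(strsplit(as.character(dataframe[i, "one"]), split = ""))
    str2 <- unlist(strsplit(as.character(dataframe[i, "two"]), split = ""))
    # Start with the whole first string, then drop each leading character that matches
    retstr <- str1
    for (j in seq_along(str1)) {
      if (j <= length(str2) && str1[j] == str2[j]) {
        retstr <- str1[-seq_len(j)]
      } else {
        break
      }
    }
    # What remains of column one after the common prefix
    result[i] <- paste(retstr, collapse = "")
  }
  return(result)
}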

getDiff(df)


results:
 "t" "p" "" 

I don't know whether there is a built-in function to do this, but maybe this will be helpful.
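The loop above only reports what is left of column one after the common prefix, so a character added in column two (tap vs tape) shows up as an empty string. One possible variation, sketched here under the assumption that a position-by-position comparison is acceptable, is to report both sides of every mismatch. pos_diff is a hypothetical helper name introduced for illustration.

```r
# pos_diff() is a hypothetical helper: it compares two strings position by
# position and returns one row per position where they differ.
pos_diff <- function(a, b) {
  ca <- strsplit(a, "")[[1]]
  cb <- strsplit(b, "")[[1]]
  n <- max(length(ca), length(cb))
  # Pad the shorter string with NA so both vectors have n positions
  length(ca) <- n
  length(cb) <- n
  # A position differs if either side is missing or the characters are unequal
  differs <- is.na(ca) | is.na(cb) | ca != cb
  idx <- which(differs)
  data.frame(pos = idx, from = ca[idx], to = cb[idx], stringsAsFactors = FALSE)
}

print(pos_diff("bat", "bad"))   # one row: position 3, "t" in column one, "d" in column two
print(pos_diff("tap", "ta"))    # position 3 is present only in column one
print(pos_diff("tap", "tape"))  # position 4 is present only in column two
```

Because the comparison is purely positional, an insertion or deletion in the middle of a word (for example tap vs trap) marks every later position as different; an edit-distance alignment handles that case better.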

