Compare two columns and identify character difference in R
I am trying to write a program that examines two columns of text and identifies single-character errors. For example:
1 2
bat bad
tap ta
tap tape
I would like the program to compare column one against column two and print the character difference.
Here's an approach using the stringdist package.
# Your data sample, plus a couple of extra rows
dat = data.frame(x=c(1,'bat','tap','tap','tapes','tapped'),
y=c(2,'bad','ta','tape','tapes','tapas'))
dat
x y
1 1 2
2 bat bad
3 tap ta
4 tap tape
5 tapes tapes
6 tapped tapas
library(stringdist)
# Distance methods available in stringdist
dist.methods = c("osa", "lv", "dl", "hamming", "lcs", "qgram",
"cosine", "jaccard", "jw", "soundex")
# Try all the methods with the sample data
sapply(dist.methods, function(m) stringdist(dat[,1],dat[,2], method=m))
     osa lv dl hamming lcs qgram    cosine   jaccard         jw soundex
[1,]   1  1  1       1   2     2 1.0000000 1.0000000 1.00000000       1
[2,]   1  1  1       1   2     2 0.3333333 0.5000000 0.22222222       0
[3,]   1  1  1     Inf   1     1 0.1835034 0.3333333 0.11111111       1
[4,]   1  1  1     Inf   1     1 0.1339746 0.2500000 0.08333333       0
[5,]   0  0  0       0   0     0 0.0000000 0.0000000 0.00000000       0
[6,]   3  3  3     Inf   5     5 0.3318469 0.5000000 0.30000000       1
Or, using adist, as suggested by @thelatemail:
apply(dat, 1, function(d) adist(d[1], d[2]))
[1] 1 1 1 1 0 3
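Since the question is about single errors, the per-row distance can be stored as a column and used as a filter. This is only a minimal sketch in base R: the column name dist and the object single_error are made up for illustration, and the stray header row (1, 2) from the sample above is left out.

```r
# Sample data from above, without the "1"/"2" header row
dat <- data.frame(x = c("bat", "tap", "tap", "tapes", "tapped"),
                  y = c("bad", "ta", "tape", "tapes", "tapas"),
                  stringsAsFactors = FALSE)

# adist(x, y) on whole vectors returns the full x-by-y matrix,
# so compute one distance per row instead
dat$dist <- mapply(function(a, b) adist(a, b)[1, 1],
                   dat$x, dat$y, USE.NAMES = FALSE)

# Rows that differ by exactly one insertion, deletion, or substitution
single_error <- dat[dat$dist == 1, ]
print(single_error)
```

The distance only says how many edits separate the two strings, not which characters are involved.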
adist uses the Levenshtein distance, equivalent to the lv method above. This is probably the method you want.
For explanations of the different distance methods, see this web page.
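If you also want to know which kind of edit was made, and not just how many, adist can report that when called with counts = TRUE. The sketch below assumes the three rows from the question; the variable name ops is just illustrative.

```r
x <- c("bat", "tap", "tap")
y <- c("bad", "ta", "tape")

# With counts = TRUE, adist attaches a "trafos" attribute: one letter per
# alignment position, where M = match, S = substitution,
# I = insertion, and D = deletion (going from x to y)
ops <- mapply(function(a, b) attr(adist(a, b, counts = TRUE), "trafos")[1, 1],
              x, y, USE.NAMES = FALSE)

print(data.frame(x, y, ops))
```

The position of the non-M letter in each transcript tells you where the strings diverge, for example a substitution at the third character for bat versus bad.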
Here is the code; I think this is what you are expecting.
df
one two
bat bad
tap ta
tap tape
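getDiff <- function(dataframe) {
  result <- character(nrow(dataframe))
  for (i in seq_len(nrow(dataframe))) {
    # Split both strings into single characters (as.character guards against factors)
    str1 <- unlist(strsplit(as.character(dataframe[i, "one"]), split = ""))
    str2 <- unlist(strsplit(as.character(dataframe[i, "two"]), split = ""))
    # Start with the whole string, then drop the prefix shared with column two
    retstr <- str1
    for (j in seq_along(str1)) {
      if (j <= length(str2) && str1[j] == str2[j]) {
        retstr <- str1[-seq_len(j)]
      } else {
        break
      }
    }
    result[i] <- paste(retstr, collapse = "")
  }
  return(result)
}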
getDiff(df)
results:
"t" "p" ""
I don't know whether there is a built-in function that does this, but maybe this will be helpful.
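A variation on the same idea, in case it is useful: strip the shared prefix from both columns, so you can see what each side has where they diverge. This is a rough sketch, not a built-in; the helper name diff_after_prefix and the output column names are invented here.

```r
# Hypothetical helper: what remains of a and b after their common prefix
diff_after_prefix <- function(a, b) {
  ca <- strsplit(a, "")[[1]]
  cb <- strsplit(b, "")[[1]]
  n <- min(length(ca), length(cb))
  # Compare only the positions that exist in both strings
  same <- ca[seq_len(n)] == cb[seq_len(n)]
  # k = length of the shared prefix
  k <- if (all(same)) n else which(!same)[1] - 1
  c(one_rest = substr(a, k + 1, nchar(a)),
    two_rest = substr(b, k + 1, nchar(b)))
}

df <- data.frame(one = c("bat", "tap", "tap"),
                 two = c("bad", "ta", "tape"),
                 stringsAsFactors = FALSE)

# One row of remainders per input row
rest <- t(mapply(diff_after_prefix, df$one, df$two, USE.NAMES = FALSE))
print(cbind(df, rest))
```

For the sample data this leaves "t" against "d", "p" against an empty string, and an empty string against "e". It only looks at the first point of divergence, so it suits single errors better than strings with several differences.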