How to find the % match/similarity between cells in a table in R?

I have a bunch of sequences in a table (e.g. TCGATCGATCGA) and I want to find those that are 90% matches. I am looking at the RecordLinkage package and its levenshteinSim function. I know I can manually import each of the sequences and compare them, but I have over 1,000 sequences, so how would I get it to automatically compare each row against every other row?

The same function is covered in Mako212's link, but I want to add some explanation since I use this package from time to time and find it quite useful. We will use the levenshteinSim() function from the RecordLinkage package.

Package:

install.packages("RecordLinkage")
library(RecordLinkage)

Find the 90% matches:

data <- c("tcgartyu", "tcgart", "tckael", "tcgatcgatc", "tcgatcgatcg")

# Similarity of each sequence to the reference string, between 0 and 1
matches <- levenshteinSim("tcgatcgatcga", data)
round(matches, 2)
# [1] 0.42 0.42 0.25 0.83 0.92

# TRUE for the sequences that are at least a 90% match
matches_90 <- matches >= 0.9
matches_90
# [1] FALSE FALSE FALSE FALSE  TRUE

So with this function you can identify the rows that are 90% matches (or greater, as in my example). You can then use those similarity scores however you need.

Please note that the str1 and str2 arguments of the levenshteinSim() function need to be character vectors.

For more information, see https://cran.r-project.org/package=RecordLinkage .
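The question also asks how to compare every row against every other row automatically. One possible sketch uses base R's adist(), which returns the full matrix of Levenshtein distances, and then normalises each distance by the length of the longer string in the pair, which is how levenshteinSim() is documented to scale its result. The seqs vector and the pairs_90 name are made up for illustration; in practice seqs would be the sequence column of your own table.

```r
# Hypothetical example data; replace with the sequence column of your table
seqs <- c("tcgatcgatcga", "tcgatcgatcg", "tcgatcgatc", "tckael")

# Pairwise Levenshtein distances (adist() is in base R's utils package)
d <- adist(seqs)

# Normalise by the longer string of each pair to get a 0-1 similarity
len <- nchar(seqs)
sim <- 1 - d / outer(len, len, pmax)
dimnames(sim) <- list(seqs, seqs)

# Keep each pair once (upper triangle) when its similarity is at least 0.9
hits <- which(sim >= 0.9 & upper.tri(sim), arr.ind = TRUE)
pairs_90 <- data.frame(
  seq1 = seqs[hits[, "row"]],
  seq2 = seqs[hits[, "col"]],
  similarity = sim[hits]
)
pairs_90
```

For 1,000 sequences this builds a 1000 x 1000 matrix, which should be manageable in memory; for much larger tables a chunked approach would probably be needed.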

I would recommend you look at the stringdist package. Specifically, its stringdist() function gives you a numeric output describing how far one string is from another. You should be able to play around with thresholds to suit your purposes.

https://cran.r-project.org/web/packages/stringdist/stringdist.pdf
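As a rough sketch of the thresholding idea, assuming the stringdist package is installed: stringdist() with method = "lv" returns the Levenshtein distance, which can be turned into a 0-1 similarity by dividing by the longer string's length. The seqs vector, the close_matches name, and the 0.9 cutoff (taken from the question) are illustrative choices, not part of the package.

```r
# Assumes the package is installed: install.packages("stringdist")
library(stringdist)

# Hypothetical example data; replace with your own sequences
seqs <- c("tcgatcgatcga", "tcgatcgatcg", "tcgatcgatc", "tckael")

# Levenshtein distance from the first sequence to every sequence
d <- stringdist(seqs[1], seqs, method = "lv")

# Convert to a 0-1 similarity by dividing by the longer length of each pair
sim <- 1 - d / pmax(nchar(seqs[1]), nchar(seqs))

# Sequences that meet the chosen threshold
close_matches <- seqs[sim >= 0.9]
close_matches
```

Other values of method (for example "hamming" for equal-length sequences) may suit sequence data better, so the threshold is worth tuning against a few known matches.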

Best, Mostafa
