
How to return the actual match from a fuzzy string match in R

I have a large number of DNA sequences and need to find the ones that contain a certain string. There is some variation in the target, so I am using fuzzy matching. I'd like to get the actual matched substring rather than the whole sequence, but agrep doesn't do this. Does anyone know of a package that does?

Example data frame RepeatAlusSequencesdf:

>chr1:61695-62229      aattccaagagtattattgcaccaaaaggcatggacttaaaattcttgatacatgatttcaaaatattttctttaaggtttgaatcagtctatattccctccagcagcgtataaaagtgccaatttctctgatccttagccagtttgggtaataataattgtaaaacttttttttctttttttttgagacagagtctccctctgtcgccaggctgaagtgcagtggcgcaatctcggctcactgcaacctccgcctcccggggtcaagctattctcctgcctcagcctcccaagtagctgggactacaggcatgcaccaccatgcccagctaatttttgttatttttagtagagatggagtttccccatgttggacaggatggtctcgatctcttgacctcgtgatccaccctcctcggcctcccaaagtgctgggataacaggcgtgaacaaccatgcccggcctgtaaaactttttcctaatttaacagaaaaataatagtattatattttatcatatttctttgatttcta

>chr1:101718-102194   taaaaataaatgtattaagtatgaacaacaaaaaagctagtaaaggttgaacaacaactatccttaggaaagtggaaataatgtattaataaatatgaaagcaggctagccacggtgactcacatctgtaatcccagcactttgggaggctgaggcaggcagatcacctgaggtcaggagttccagaccagcctggccaacatggtgaaatcttgtctctcctacaaatacaaaaactagccaggcttggttgtgcactcctgtaattcgagctacttgggaggctgaggcaggagaatctcttgaacctgagaggcagaggttgcagtgagccaagatcatgccactgcactccagctggggcaacagagtgacactccatctcaaaataaataaataagaaagcagaaactaataaactagaaaacagaaacatagaactaatttataaatcaaagcactatgccttgaaaaga

The code I used:

RepeatAlusSequencesdfMatch <- RepeatAlusSequencesdf[agrep("aacctcaaagactggcctca", RepeatAlusSequencesdf[,2],ignore.case = TRUE, max.distance = 0.3), ]

What I'd like returned:

aacctcaaagactggcctca
aacctcattgactggcctca

rather than the whole sequence.

There might be a specialized package to do this, but this works: I create a vector of all substrings of the long sequence that have the same length as the string you are looking to match. I then use agrep to identify the substrings that match.

#long string
s1<-"aattccaagagtattattgcaccaaaaggcatggacttaaaattcttgatacatgatttcaaaatattttctttaaggtttgaatcagtctatattccctccagcagcgtataaaagtgccaatttctctgatccttagccagtttgggtaataataattgtaaaacttttttttctttttttttgagacagagtctccctctgtcgccaggctgaagtgcagtggcgcaatctcggctcactgcaacctccgcctcccggggtcaagctattctcctgcctcagcctcccaagtagctgggactacaggcatgcaccaccatgcccagctaatttttgttatttttagtagagatggagtttccccatgttggacaggatggtctcgatctcttgacctcgtgatccaccctcctcggcctcccaaagtgctgggataacaggcgtgaacaaccatgcccggcctgtaaaactttttcctaatttaacagaaaaataatagtattatattttatcatatttctttgatttcta"
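my.string <- "aacctcaaagactggcctca"
# All substrings of s1 with the same length as my.string
starts <- seq_len(nchar(s1) - nchar(my.string) + 1)
substrings <- substring(s1, starts, starts + nchar(my.string) - 1)
# Keep the substrings that approximately match my.string
agrep(my.string, substrings, ignore.case = TRUE, max.distance = 0.35, value = TRUE)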

[1] "caccaaaaggcatggactta" "accaaaaggcatggacttaa"
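With a generous max.distance, agrep can return several windows that only loosely resemble the target. One possible refinement, sketched here and not part of the original answer, is to score every window with adist (generalized Levenshtein distance, in base R's utils package) and keep only the closest one per sequence. The helper name best_window, its max_edits cutoff of 3, and the short test sequences are all assumptions made up for illustration.

```r
# Hypothetical helper (not from the original answer): return the window of
# `text` that is closest to `pattern`, or NA if every window needs more than
# `max_edits` edits. The cutoff of 3 is an arbitrary illustrative choice.
best_window <- function(pattern, text, max_edits = 3) {
  n <- nchar(pattern)
  if (nchar(text) < n) return(NA_character_)
  # All windows of `text` with the same length as `pattern`
  starts <- seq_len(nchar(text) - n + 1)
  windows <- substring(text, starts, starts + n - 1)
  # Edit distance from the pattern to each window
  d <- as.vector(adist(pattern, windows, ignore.case = TRUE))
  if (min(d) > max_edits) return(NA_character_)
  # On ties, which.min() keeps the leftmost window
  windows[which.min(d)]
}

# Made-up sequences for illustration only
seqs <- c(
  "ttttaacctcaaagactggcctcagggg",   # contains an exact copy of the target
  "ccccaacctcattgactggcctcatttt",   # contains the target with two substitutions
  "acgtacgtacgtacgtacgtacgtacgt"    # unrelated sequence
)
target <- "aacctcaaagactggcctca"

matches <- vapply(seqs, function(s) best_window(target, s), character(1),
                  USE.NAMES = FALSE)
print(matches)
```

For a data frame like the one in the question, the same call could be applied to the sequence column, for example with vapply over RepeatAlusSequencesdf[, 2]. Base R's aregexec combined with regmatches can also return matched text directly, although the extent of an approximate match is then chosen by the matching engine and need not have the same length as the pattern.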
