Aligning Multiple Files in R by Pairwise Alignment

I have 15 protein sequences in FASTA format in one file. I have to align them pairwise, both globally and locally, and then generate a 15x15 distance score matrix to construct a dendrogram.

But when I do, a sequence is not aligned with itself (i.e. A with A) and I get an NA result. Moreover, AxB gives a score of 12131 but BxA gives NA. Thus R cannot construct a phylogenetic tree.

What should I do?

I'm using this script for the loop, but it only reads one way:

for (i in 1:150) { 
  globalpwa<-pairwiseAlignment(toString(ProtDF[D[1,i],2]) 
                              ,toString(ProtDF[D[2,i],2]),
                              substitutionMatrix = "BLOSUM62",
                              gapOpening = 0,
                              gapExtension = -2,
                              scoreOnly=FALSE,
                              type="global")
  ScoreX[i]<-c(globalpwa@score)   
  nameSeq1[i]<-c(as.character(ProtDF[D[1,i],1]))
  nameSeq2[i]<-c(as.character(ProtDF[D[2,i],1]))
}

I used an example FASTA file: protein sequences of RPS29 in fungi.

ProtDF <-
c(OQS54945.1 = "MINDLKVRKDVEKSKAHCHVKPFGKGSRACERCASHRGHNRKYGMNLCRRCLHTNAKILGFTSFD", 
XP_031008245.1 = "KHTESPVEPARRDNLKTAIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKVCDGHTDSSYDGSEF", 
TVY80688.1 = "MSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKAADIGFVKHR", 
TVY57447.1 = "LPFLKIRVEPARRDNLKPAIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKVCVDAMGTLEPRASSPEL", 
TVY47820.1 = "EPARRDNLKTTIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKAADIGFVK", 
TVY37154.1 = "AISKLNSRPQRPISTTPRKAKHTKSLVEPARRDNLKTAIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKHR", 
TVY29458.1 = "KHTESPVEPARRDNLKTAIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKVCDGHTDSSYDGSEF", 
TVY14147.1 = "MSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKVCDGWIGTLEL", 
`sp|Q6CPG3.1|RS29_KLULA` = "MAHENVWYSHPRKFGKGSRQCRISGSHSGLIRKYGLNIDRQSFREKANDIGFYKYR", 
`sp|Q8SS73.1|RS29_ENCCU` = "MSFEPSGPHSHRKPFGKGSRSCVSCYTFRGIIRKLMMCRRCFREYAGDIGFAIYD", 
`sp|O74329.3|RS29_SCHPO` = "MAHENVWFSHPRKYGKGSRQCAHTGRRLGLIRKYGLNISRQSFREYANDIGFVKYR", 
TPX23066.1 = "MTHESVFYSRPRNYGKGSRQCRVCAHKAGLIRKYGLLVCRQCFREKSQDIGFVKYR", 
`sp|Q6FWE3.1|RS29_CANGA` = "MAHENVWFSHPRRFGKGSRQCRVCSSHTGLIRKYDLNICRQCFRERASDIGFNKYR", 
`sp|Q6BY86.1|RS29_DEBHA` = "MAHESVWFSHPRNFGKGSRQCRVCSSHSGLIRKYDLNICRQCFRERASDIGFNKFR", 
XP_028490553.1 = "MSHESVWNSRPRSYGKGSRSCRVCKHSAGLIRKYDLNLCRQCFREKAKDIGFNKFR"
)

So you got it right to use combn. As you said, you need a distance score matrix for the dendrogram, so it is better to store the values in a matrix. See below: I simply named the rows and columns of the matrix after the FASTA names and slotted in the pairwise values.

library(Biostrings)
# you can also read in your file,
# e.g. ProtDF <- readAAStringSet("fasta")

ProtDF <- AAStringSet(ProtDF)

# all unordered pairs of sequences; here we just use the names
D <- combn(names(ProtDF), 2)

# make the pairwise matrix, named after the sequences
mat <- matrix(NA, ncol = length(ProtDF), nrow = length(ProtDF))
rownames(mat) <- names(ProtDF)
colnames(mat) <- names(ProtDF)

# loop through the columns of D (one pair per column)
for (idx in seq_len(ncol(D))) {
  i <- D[1, idx]
  j <- D[2, idx]
  globalpwa <- pairwiseAlignment(ProtDF[i],
                                 ProtDF[j],
                                 substitutionMatrix = "BLOSUM62",
                                 gapOpening = 0,
                                 gapExtension = -2,
                                 scoreOnly = FALSE,
                                 type = "global")
  # combn() yields each pair only once, so fill both triangles
  mat[i, j] <- score(globalpwa)
  mat[j, i] <- score(globalpwa)
}

# if you need to make diagonal zero
diag(mat) <- 0

# plot
# note: these are alignment scores (higher = more similar),
# while as.dist() reads them as distances
plot(hclust(as.dist(mat)))
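
One caveat with the plot above: pairwise alignment scores are similarities (higher means more alike), whereas as.dist() and hclust() expect dissimilarities. Here is a minimal sketch of one possible conversion, using made-up scores rather than the real alignments. The helper name score_to_dist and the max-minus-score rescaling are assumptions for illustration only; other normalizations may suit real data better.

```r
# Toy symmetric similarity scores (hypothetical values, not real alignment output)
ids <- c("A", "B", "C")
sim <- matrix(c(NA, 90, 20,
                90, NA, 30,
                20, 30, NA),
              nrow = 3, byrow = TRUE,
              dimnames = list(ids, ids))

# Invert the scale so that the best-scoring pair gets the smallest distance
score_to_dist <- function(sim) {
  d <- max(sim, na.rm = TRUE) - sim
  diag(d) <- 0          # a sequence is at distance zero from itself
  as.dist(d)
}

d  <- score_to_dist(sim)
hc <- hclust(d)         # A and B (highest score) are merged first
```

With this rescaling the pair with the highest score ends up closest in the tree, which is usually what is wanted from a dendrogram built on alignment scores.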


An alternative method, if you're interested, using the same example as above:

library(DECIPHER)

ProtDF <- c(OQS54945.1 = "MINDLKVRKDVEKSKAHCHVKPFGKGSRACERCASHRGHNRKYGMNLCRRCLHTNAKILGFTSFD", 
            XP_031008245.1 = "KHTESPVEPARRDNLKTAIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKVCDGHTDSSYDGSEF", 
            TVY80688.1 = "MSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKAADIGFVKHR", 
            TVY57447.1 = "LPFLKIRVEPARRDNLKPAIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKVCVDAMGTLEPRASSPEL", 
            TVY47820.1 = "EPARRDNLKTTIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKAADIGFVK", 
            TVY37154.1 = "AISKLNSRPQRPISTTPRKAKHTKSLVEPARRDNLKTAIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKHR", 
            TVY29458.1 = "KHTESPVEPARRDNLKTAIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKVCDGHTDSSYDGSEF", 
            TVY14147.1 = "MSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKVCDGWIGTLEL", 
            `sp|Q6CPG3.1|RS29_KLULA` = "MAHENVWYSHPRKFGKGSRQCRISGSHSGLIRKYGLNIDRQSFREKANDIGFYKYR", 
            `sp|Q8SS73.1|RS29_ENCCU` = "MSFEPSGPHSHRKPFGKGSRSCVSCYTFRGIIRKLMMCRRCFREYAGDIGFAIYD", 
            `sp|O74329.3|RS29_SCHPO` = "MAHENVWFSHPRKYGKGSRQCAHTGRRLGLIRKYGLNISRQSFREYANDIGFVKYR", 
            TPX23066.1 = "MTHESVFYSRPRNYGKGSRQCRVCAHKAGLIRKYGLLVCRQCFREKSQDIGFVKYR", 
            `sp|Q6FWE3.1|RS29_CANGA` = "MAHENVWFSHPRRFGKGSRQCRVCSSHTGLIRKYDLNICRQCFRERASDIGFNKYR", 
            `sp|Q6BY86.1|RS29_DEBHA` = "MAHESVWFSHPRNFGKGSRQCRVCSSHSGLIRKYDLNICRQCFRERASDIGFNKFR", 
            XP_028490553.1 = "MSHESVWNSRPRSYGKGSRSCRVCKHSAGLIRKYDLNLCRQCFREKAKDIGFNKFR")

# All pairwise alignments:

# Convert characters to an AA String Set
ProtDF <- AAStringSet(ProtDF)

# Initialize some outputs
AliMat <- matrix(data = list(),
                 ncol = length(ProtDF),
                 nrow = length(ProtDF))

DistMat <- matrix(data = 0,
                  ncol = length(ProtDF),
                  nrow = length(ProtDF))

# loop through the upper triangle of your matrix
for (m1 in seq_len(length(ProtDF) - 1L)) {
  for (m2 in (m1 + 1L):length(ProtDF)) {
    # Align each pair
    AliMat[[m1, m2]] <- AlignSeqs(myXStringSet = ProtDF[c(m1, m2)],
                                  verbose = FALSE)

    # Assign a distance to each alignment, fill both triangles of the matrix
    DistMat[m1, m2] <- DistMat[m2, m1] <- DistanceMatrix(myXStringSet = AliMat[[m1, m2]],
                                                         type = "dist", # return a single value
                                                         includeTerminalGaps = TRUE, # return a global distance
                                                         verbose = FALSE)
  }
}

dimnames(DistMat) <- list(names(ProtDF),
                          names(ProtDF))

Dend01 <- IdClusters(myDistMatrix = DistMat,
                     method = "NJ",
                     type = "dendrogram",
                     showPlot = FALSE,
                     verbose = FALSE)

# A single multiple alignment:

AllAli <- AlignSeqs(myXStringSet = ProtDF,
                    verbose = FALSE)

AllDist <- DistanceMatrix(myXStringSet = AllAli,
                          type = "matrix",
                          verbose = FALSE,
                          includeTerminalGaps = TRUE)

Dend02 <- IdClusters(myDistMatrix = AllDist,
                     method = "NJ",
                     type = "dendrogram",
                     showPlot = FALSE,
                     verbose = FALSE)
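
To make the includeTerminalGaps comment above more concrete, here is a simplified base R sketch of a mismatch-proportion distance between two already-aligned strings. It is only an illustration and not DECIPHER's implementation: the function p_distance, its argument name, and the toy strings are all assumptions, and DistanceMatrix has further options for how gaps are treated.

```r
# Proportion of differing columns between two aligned sequences of equal length
p_distance <- function(a, b, include_terminal_gaps = TRUE) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  stopifnot(length(x) == length(y))

  keep <- rep(TRUE, length(x))
  if (!include_terminal_gaps) {
    # keep only the columns between the first and last position
    # where both sequences have a residue
    both <- which(x != "-" & y != "-")
    keep <- seq_along(x) >= min(both) & seq_along(x) <= max(both)
  }
  mean(x[keep] != y[keep])
}

# Toy alignment: the first sequence lacks the two leading residues
global_d <- p_distance("--MSHE", "KHMSHE", include_terminal_gaps = TRUE)
local_d  <- p_distance("--MSHE", "KHMSHE", include_terminal_gaps = FALSE)
```

Counting the leading gaps penalizes the length difference (a global view), while ignoring them compares only the overlapping region.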

Dend01, from all the pairwise alignments:

[Figure: dendrogram from the pairwise alignments]

Dend02, from a single multiple alignment:

[Figure: dendrogram from the single multiple alignment]

Finally, DECIPHER has a function for loading your alignment in your browser just to look at it. If your alignments are huge this can be a bit of a mistake, but in this case (and in cases of up to a few hundred short sequences) it is just fine:

BrowseSeqs(AllAli)

[Figure: the multiple alignment]

A side note about BrowseSeqs: for some reason it doesn't behave well with Safari, but it plays just fine with Chrome. Sequences are displayed in the order in which they exist in the aligned string set.

EDIT: BrowseSeqs DOES play fine with Safari directly, but it does have a weird issue when incorporated into HTML documents knitted with RMarkdown. Weird, but not applicable here.

Another big aside: all of the functions I've used have a processors argument, which is set to 1 by default. If you'd like to get greedy with your cores, you can set it to NULL and it will just grab everything available. If you're aligning very large string sets this can be pretty useful; if you're doing trivially small things like this example, not so much.
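
If you would rather pass an explicit number than NULL, base R's parallel package can report how many cores R sees. This is a minimal sketch: the variable name n_cores is illustrative, and the commented-out AlignSeqs call assumes DECIPHER is installed and ProtDF is defined as above.

```r
library(parallel)

# number of logical cores R can detect; may be NA on some platforms
n_cores <- detectCores()

# fall back to a single core if detection fails
if (is.na(n_cores)) n_cores <- 1L

# hypothetical usage with DECIPHER (not run here):
# AllAli <- AlignSeqs(ProtDF, processors = n_cores, verbose = FALSE)
```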

combn is a great function and I use it all the time. However, for these really simple setups I like looping through the upper triangle, but that's just a personal preference.
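
As a quick check that the two styles visit the same pairs, here is a small base R sketch with four placeholder sequences. The names pairs_combn and pairs_loop are made up for illustration.

```r
n <- 4  # toy number of sequences

# combn(): one column per unordered pair, transposed to one row per pair
pairs_combn <- t(combn(n, 2))

# nested-loop style: for each row index i, pair it with every j > i
pairs_loop <- do.call(rbind, lapply(seq_len(n - 1L), function(i) {
  cbind(i, (i + 1L):n)
}))

# both enumerate the n * (n - 1) / 2 upper-triangle pairs in the same order
same_pairs <- identical(dim(pairs_combn), dim(pairs_loop)) &&
  all(pairs_combn == pairs_loop)
```

Neither approach produces (j, i) or (i, i) pairs, which is why a score matrix filled from them has empty cells unless both triangles and the diagonal are set explicitly.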
