
How to append bootstrap values of cluster (tree) nodes in Newick format in R

I want to make a tree (cluster) using the web-based Interactive Tree of Life tool (iTOL). As input, either a file or a string, this tool uses the Newick format, which represents graph-theoretical trees with edge lengths using parentheses and commas. Besides that, additional information can be included, such as bootstrap values for the cluster's nodes.

For example, here I created a dataset for a cluster analysis using the clusterGeneration package:

library(clusterGeneration)
set.seed(1)    
tmp1 <- genRandomClust(numClust=3, sepVal=0.3, numNonNoisy=5,
        numNoisy=3, numOutlier=5, numReplicate=2, fileName="chk1")
data <- tmp1$datList[[2]]

Afterwards I performed the cluster analysis and assessed support for the cluster's nodes by bootstrap, using the pvclust package:

library(pvclust)
set.seed(2)
y <- pvclust(data=data, method.hclust="average", method.dist="correlation", nboot=100)
plot(y)

Here are the cluster and its bootstrap values: [figure: cluster dendrogram with bootstrap values at the nodes]

To make a Newick file, I used the ape package:

library(ape)
yy<-as.phylo(y$hclust)
write.tree(yy,digits=2)

The write.tree function prints the tree in Newick format:

((x2:0.45,x6:0.45):0.043,((x7:0.26,(x4:0.14,(x1:0.14,x3:0.14):0.0064):0.12):0.22,(x5:0.28,x8:0.28):0.2):0.011);

Those numbers represent branch lengths (the cluster's edge lengths). Following the instructions on the iTOL help page ("Uploading and working with your own trees" section), I manually added the bootstrap values to my Newick string (the integers that follow the closing parentheses below):

((x2:0.45,x6:0.45)74:0.043,((x7:0.26,(x4:0.14,(x1:0.14,x3:0.14)55:0.0064)68:0.12)100:0.22,(x5:0.28,x8:0.28)100:0.2)63:0.011);
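To make the placement rule explicit, here is a minimal base-R sketch that assembles a single clade of the string above by hand. The variable names are only illustrative; the point is that the support value sits directly after the closing parenthesis of the clade and before the colon that introduces the clade's own branch length.

```r
# Two tips and their branch lengths (taken from the x1/x3 clade above).
tips <- c(x1 = 0.14, x3 = 0.14)

# "(x1:0.14,x3:0.14)" : the clade without any support value.
clade <- paste0("(", paste0(names(tips), ":", tips, collapse = ","), ")")

# Append the support value (55), then ":" and the clade's stem length.
newick_clade <- paste0(clade, 55, ":", 0.0064)
print(newick_clade)  # "(x1:0.14,x3:0.14)55:0.0064"
```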

It works fine when I upload the string into iTOL. However, I have a huge cluster, and doing this by hand is tedious.

QUESTION: What code could do this instead of manual typing?

Bootstrap values can be obtained with:

(round(y$edges,2)*100)[,1:2]
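Selecting columns by position works only if the p-value columns come first, which may depend on the pvclust version. A slightly more defensive sketch selects them by name instead. The data frame below is a made-up stand-in for y$edges (the numbers are hypothetical), and the column names au and bp are assumed to match the ones pvclust reports.

```r
# Hypothetical stand-in for y$edges: one row per internal node.
edges <- data.frame(
  au    = c(0.9934, 0.6812),
  bp    = c(0.996, 0.553),
  se.au = c(0.001, 0.050)
)

# Pick the support columns by name and express them as whole percentages.
support <- round(edges[, c("au", "bp")] * 100)
print(support)
```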

The branch lengths used to form the Newick file can be obtained with:

yy$edge.length

I tried to figure out how the write.tree function works by debugging it. However, I noticed that it internally calls the function .write.tree2, and I couldn't work out how to change the original code efficiently so that the bootstrap values end up in the appropriate positions in the Newick file.

Any suggestions are welcome.

Here is one solution: objects of class phylo have an optional element called node.label that, appropriately, holds the labels of the nodes. You can use it to store your bootstrap values. They will be written to your Newick file at the appropriate place, as you can see in the code of .write.tree2:

> .write.tree2
function (phy, digits = 10, tree.prefix = "") 
{
    brl <- !is.null(phy$edge.length)
    nodelab <- !is.null(phy$node.label)

...

    if (is.null(phy$root.edge)) {
        cp(")")
        if (nodelab) 
            cp(phy$node.label[1])
        cp(";")
    }
    else {
        cp(")")
        if (nodelab) 
            cp(phy$node.label[1])
        cp(":")
        cp(sprintf(f.d, phy$root.edge))
        cp(";")
    }

...
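As a quick illustration of that behaviour, the sketch below attaches labels to a three-tip toy tree and writes it back out. It assumes the ape package is installed (the code is skipped otherwise), and the tree and the label values are made up for the example. Note that node.label is ordered by node number, so the first entry belongs to the root.

```r
# Runs only when ape is available; the toy tree and labels are hypothetical.
if (requireNamespace("ape", quietly = TRUE)) {
  tr <- ape::read.tree(text = "((a:1,b:1):1,c:2);")

  # One label per internal node, root first.
  tr$node.label <- c("100", "90")

  # The labels appear right after the matching closing parentheses.
  nwk <- ape::write.tree(tr)
  print(nwk)
}
```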

The real difficulty is finding the proper order of the nodes. I searched and searched but couldn't find a way to recover the right order a posteriori, so we have to capture that information during the conversion from an object of class hclust to an object of class phylo.

Luckily, if you look into the function as.phylo.hclust, there is a vector that holds the node indices in their correct order with respect to the original hclust object:

> as.phylo.hclust
function (x, ...) 
{
    N <- dim(x$merge)[1]
    edge <- matrix(0L, 2 * N, 2)
    edge.length <- numeric(2 * N)
    node <- integer(N)              #<-This one
...

This means we can write our own as.phylo.hclust with a nodenames parameter, as long as nodenames is in the same order as the nodes in the hclust object (which is the case in your example, since pvclust keeps a coherent order internally, i.e. the order of the nodes in the hclust object is the same as in the table from which you picked the bootstrap values):

# NB: in the following function definition I only modified the commented lines
as.phylo.hclust.with.nodenames <- function (x, nodenames, ...) #We add a nodenames argument
{
    N <- dim(x$merge)[1]
    edge <- matrix(0L, 2 * N, 2)
    edge.length <- numeric(2 * N)
    node <- integer(N)
    node[N] <- N + 2L
    cur.nod <- N + 3L
    j <- 1L
    for (i in N:1) {
        edge[j:(j + 1), 1] <- node[i]
        for (l in 1:2) {
            k <- j + l - 1L
            y <- x$merge[i, l]
            if (y > 0) {
                edge[k, 2] <- node[y] <- cur.nod
                cur.nod <- cur.nod + 1L
                edge.length[k] <- x$height[i] - x$height[y]
            }
            else {
                edge[k, 2] <- -y
                edge.length[k] <- x$height[i]
            }
        }
        j <- j + 2L
    }
    if (is.null(x$labels)) 
        x$labels <- as.character(1:(N + 1))
    node.lab <- nodenames[order(node)] #Here we define our node labels
    obj <- list(edge = edge, edge.length = edge.length/2, tip.label = x$labels, 
        Nnode = N, node.label = node.lab) #And you put them in the final object
    class(obj) <- "phylo"
    reorder(obj)
}
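The key line in the function above is nodenames[order(node)]. The following toy example, with made-up values for a tree with three merges, sketches why the reordering is needed: node[i] stores the phylo node number assigned to the cluster created at merge row i, while node.label has to be sorted by node number.

```r
# Hypothetical values for N = 3 merges (4 tips).
# node[i] = phylo node number of the cluster formed at merge row i;
# the last merge is the root and receives number N + 2 = 5.
node <- c(7L, 6L, 5L)

# Support values in merge-row order, as in the pvclust table.
nodenames <- c(55, 68, 100)

# phylo expects labels sorted by node number (5, 6, 7), hence order(node).
node.lab <- nodenames[order(node)]
print(node.lab)  # 100 68 55
```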

Finally, here is how you would use as.phylo.hclust.with.nodenames in your case study:

bootstraps <- (round(y$edges,2)*100)[,1:2]
yy<-as.phylo.hclust.with.nodenames(y$hclust, nodenames=bootstraps[,2])
write.tree(yy,tree.names=TRUE,digits=2)
[1] "((x5:0.27,x8:0.27)100:0.24,((x7:0.25,(x4:0.14,(x1:0.13,x3:0.13)61:0.014)99:0.11)100:0.23,(x2:0.46,x6:0.46)56:0.022)61:0.027)100;"
#See the bootstraps    ^^^ here for instance
plot(yy,show.node.label=TRUE) #To show that the order is correct
plot(y) #To compare with (here I used the yellow value)
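As a cross-check that does not depend on ape, the Newick string can also be built directly from the hclust object by recursing over its merge matrix. The helper below, hclust_to_newick, is a hypothetical function written for this illustration, not part of ape or pvclust. It assumes that node_labels is in merge-row order, and it halves the height differences to match the branch-length convention of the function above.

```r
# Hypothetical helper: write an hclust object as a Newick string, labelling
# the node created at merge row i with node_labels[i].
hclust_to_newick <- function(hc, node_labels, digits = 2) {
  n_merge <- nrow(hc$merge)
  stopifnot(length(node_labels) == n_merge)
  tips <- if (is.null(hc$labels)) as.character(seq_len(n_merge + 1L)) else hc$labels
  fmt <- function(x) formatC(x, digits = digits, format = "g")

  subtree <- function(i) {
    children <- vapply(1:2, function(l) {
      k <- hc$merge[i, l]
      if (k < 0) {
        # Negative entry: a tip; its branch spans the full (halved) height.
        paste0(tips[-k], ":", fmt(hc$height[i] / 2))
      } else {
        # Positive entry: the cluster formed earlier at merge row k.
        paste0(subtree(k), ":", fmt((hc$height[i] - hc$height[k]) / 2))
      }
    }, character(1))
    paste0("(", children[1], ",", children[2], ")", node_labels[i])
  }

  # The last merge row is the root.
  paste0(subtree(n_merge), ";")
}

# Toy example: a and b are close, c is far from both; labels are made up.
m <- matrix(c(0, 1, 5,
              1, 0, 5,
              5, 5, 0), nrow = 3,
            dimnames = list(c("a", "b", "c"), c("a", "b", "c")))
hc <- hclust(as.dist(m), method = "average")
toy_newick <- hclust_to_newick(hc, node_labels = c(90, 100))
print(toy_newick)
```

In the case study, the equivalent call would be hclust_to_newick(y$hclust, bootstraps[,2]). The recursion depth grows with the height of the tree, so very unbalanced trees with many tips may need an iterative variant.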
