
Counting phylogenetic tree topologies in R

Given a multiPhylo object in R, what's the simplest way to count the number of duplicate topologies? For instance, suppose I randomly sample from all 15 possible resolutions of a 4-tip topology:

library(ape)
library(phytools)
m <- do.call(c, lapply(1:1000, function(x) multi2di(starTree(c('a','b','c','d')))))

I will have 1000 trees drawn from 15 possible topologies. What's the simplest way to tabulate the number of trees with each topology (i.e. so that the counts sum to 1000)?

Small trees

With smallish trees (fewer than about 20 leaves), you can use the 'TreeTools' package to convert each tree topology to a unique integer:

library('TreeTools')
library('phytools')
m <- do.call(c, lapply(1:1000, function(x) multi2di(starTree(c('a','b','c','d')))))

# Tabulate unique topologies
table(vapply64(m, as.TreeNumber, 1))

You can plot any numbered topology using:

topologyToPlot <- 2
plot(as.phylo(topologyToPlot, nTip = 4))

Big trees

For larger trees, you can ensure that trees with an equivalent topology are represented identically within R by:

  • (if necessary) making the trees' internal representation of tips consistent, using m <- RenumberTips(m, m[[1]]);

  • reordering the trees' internal edge and node numbering, using m <- Preorder(m).

Trees can then be compared using their edge matrices, as suggested by user12728748.
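
The steps above can be sketched as follows. This is a minimal sketch rather than part of the original answer: the names key and counts are illustrative, and it assumes, as described above, that RenumberTips() followed by Preorder() gives identical edge matrices to trees with the same topology.

```r
library('TreeTools')
library('phytools')  # also attaches 'ape', which provides multi2di()

set.seed(1)
m <- do.call(c, lapply(1:1000, function(x) multi2di(starTree(c('a','b','c','d')))))

# Make tip numbering consistent, then put edges and nodes in a standard order
m <- RenumberTips(m, m[[1]])
m <- Preorder(m)

# Collapse each edge matrix into a string so identical matrices share a key
key <- vapply(m, function(tr) paste(t(tr$edge), collapse = ','), character(1))

# Tabulate: one entry per distinct edge matrix, counts summing to length(m)
counts <- table(key)
sort(counts, decreasing = TRUE)
```

To recover a representative tree for each entry, match(names(counts), key) gives the index in m of the first tree with each edge matrix.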

I am no expert on these trees, but one option is to sort each tree's edge matrix and then count the unique ones. That works to a certain extent for this example, but the three symmetrical cases, where the tree can be flipped, are still counted separately (e.g. the 8th and 9th rows, where either ab and cd, or cd and ab, are the children of one node). That is why the table below has 18 rows rather than 15. In the results table, N is the count and first is the index in m of the first occurrence of that topology. You could plot the unique trees with for(i in res$first) plot(m[[i]]) and inspect those symmetrical cases.

library(ape)
library(phytools)
#> Loading required package: maps
library(data.table)
set.seed(123)
m <- do.call(c, lapply(1:1000, function(x) multi2di(starTree(c('a','b','c','d')))))
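# Sort each tree's edge matrix by parent (V1) then child (V2), and flatten it into one row per tree
edges <- data.table(do.call(rbind, lapply(m, function(x) unlist(data.table(x$edge, key=c("V1", "V2"))))))
# Count identical rows (N) and record the index of the first tree with each pattern (first)
res <- edges[,.(.N, first=head(.I, 1)), by = names(edges)]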
res
#>     V11 V12 V13 V14 V15 V16 V21 V22 V23 V24 V25 V26  N first
#>  1:   5   5   6   6   7   7   2   6   1   7   3   4 53     1
#>  2:   5   5   6   6   7   7   4   6   2   7   1   3 65     2
#>  3:   5   5   6   6   7   7   4   6   3   7   1   2 60     3
#>  4:   5   5   6   6   7   7   2   6   4   7   1   3 56     4
#>  5:   5   5   6   6   7   7   6   7   1   4   2   3 63     5
#>  6:   5   5   6   6   7   7   6   7   2   3   1   4 55     6
#>  7:   5   5   6   6   7   7   1   6   3   7   2   4 66     7
#>  8:   5   5   6   6   7   7   6   7   3   4   1   2 55     8
#>  9:   5   5   6   6   7   7   6   7   1   2   3   4 40    15
#> 10:   5   5   6   6   7   7   6   7   1   3   2   4 66    16
#> 11:   5   5   6   6   7   7   3   6   2   7   1   4 52    20
#> 12:   5   5   6   6   7   7   1   6   4   7   2   3 46    24
#> 13:   5   5   6   6   7   7   3   6   1   7   2   4 44    27
#> 14:   5   5   6   6   7   7   1   6   2   7   3   4 54    28
#> 15:   5   5   6   6   7   7   2   6   3   7   1   4 60    29
#> 16:   5   5   6   6   7   7   4   6   1   7   2   3 53    32
#> 17:   5   5   6   6   7   7   6   7   2   4   1   3 63    39
#> 18:   5   5   6   6   7   7   3   6   4   7   1   2 49    43

Created on 2020-06-26 by the reprex package (v0.3.0)
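
One possible way to avoid counting flipped trees separately is to let 'ape' compare the trees itself. The sketch below is an alternative, not part of the answer above. It assumes that unique() on a multiPhylo object compares trees with all.equal.phylo(), ignoring edge lengths by default, and records in the old.index attribute the index of the first equivalent tree for every tree in m. The names u and counts are illustrative.

```r
library(ape)
library(phytools)

set.seed(123)
m <- do.call(c, lapply(1:1000, function(x) multi2di(starTree(c('a','b','c','d')))))

# Keep one representative of each topology
u <- unique(m)

# old.index maps every tree in m to the index of its first equivalent tree,
# so tabulating it gives the number of trees per topology
counts <- table(attr(u, "old.index"))
counts
```

The names of counts are indices into m, so plot(m[[as.integer(names(counts)[1])]]) would show the first topology. Because this approach compares trees pairwise, it may be slow for large collections of trees.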
