Performance when calculating the distance between two positions on a tree?

Question

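Here is a tree. The first column is an identifier for the branch, where 0 is the trunk, L is the first branch on the left, and R is the first branch on the right. LL is the leftmost branch after the second bifurcation, and so on. The variable length contains the length of each branch.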

> tree
  branch length
1      0     20
2      L     12
3     LL     19
4      R     19
5     RL     12
6    RLL     10
7    RLR     12
8     RR     17

tree = data.frame(branch = c("0","L", "LL", "R", "RL", "RLL", "RLR", "RR"), length=c(20,12,19,19,12,10,12,17))
tree$branch = as.character(tree$branch)

Here is a drawing of this tree:

[image not available]

And here are two positions on this tree:

posA = tree[4,]; posA$length = 12
posB = tree[6,]; posB$length = 3

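A position is given by a branch ID and the distance (variable length) from the origin of that branch (see the edit below for more information).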
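As a small illustration of this representation, the sketch below checks that a position refers to an existing branch and that its offset does not exceed the branch length. The helper name is_valid_position is hypothetical, and the rule that an offset must lie between 0 and the branch length is my assumption; the post does not state it explicitly.

```r
tree <- data.frame(branch = c("0", "L", "LL", "R", "RL", "RLL", "RLR", "RR"),
                   length = c(20, 12, 19, 19, 12, 10, 12, 17),
                   stringsAsFactors = FALSE)

# Hypothetical helper: TRUE if the branch exists in `tree` and the offset
# lies between 0 and the length of that branch (assumed validity rule).
is_valid_position <- function(tree, pos) {
  i <- match(pos$branch, tree$branch)
  !is.na(i) & pos$length >= 0 & pos$length <= tree$length[i]
}

posA <- data.frame(branch = "R",   length = 12, stringsAsFactors = FALSE)
posB <- data.frame(branch = "RLL", length = 3,  stringsAsFactors = FALSE)
too_far <- data.frame(branch = "RLL", length = 11, stringsAsFactors = FALSE)

is_valid_position(tree, posA)     # offset 12 on branch R (length 19)
is_valid_position(tree, posB)     # offset 3 on branch RLL (length 10)
is_valid_position(tree, too_far)  # offset 11 exceeds the length of RLL
```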

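I wrote the following messy distance function to calculate the shortest distance along the branches between any two points on the tree. The shortest distance along the branches can be understood as the minimal distance an ant would have to walk along the branches to get from one position to the other.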

distance = function(tree, pos1, pos2){
    if (identical(pos1$branch, pos2$branch)){Dist=pos1$length-pos2$length;return(Dist)}
    pos1path = strsplit(pos1$branch, "")[[1]]
    if (pos1path[1]!="0") {pos1path = c("0", pos1path)}
    pos2path = strsplit(pos2$branch, "")[[1]]
    if (pos2path[1]!="0") {pos2path = c("0", pos2path)}
    loop = 1:min(length(pos1path), length(pos2path))
    loop = loop[-which(loop == 1)]

    CommonTrace="included"; for (i in loop) {
        if (pos1path[i] != pos2path[i]) {
            CommonTrace = i-1; break
            }
        }

    if(CommonTrace=="included"){
        CommonTrace = min(length(pos1path), length(pos2path))
        if (length(pos1path) > length(pos2path)) {
            longerpos = pos1; shorterpos = pos2; longerpospath = pos1path
        } else {
            longerpos = pos2; shorterpos = pos1; longerpospath = pos2path
        }
        distToNode = 0
        if ((CommonTrace+1) != length(longerpospath)){
            for (i in (CommonTrace+1):(length(longerpospath)-1)){
                distToNode = distToNode + tree$length[tree$branch == paste0(longerpospath[2:i], collapse='')]
            }   
        }
        Dist = distToNode + longerpos$length + (tree[tree$branch == shorterpos$branch,]$length-shorterpos$length)
        if (identical(shorterpos, pos1)){Dist=-Dist}
        return(Dist)
    } else { # if they are sisterbranch
        Dist=0 
        if((CommonTrace+1) != length(pos1path)){
            for (i in (CommonTrace+1):(length(pos1path)-1)){
                Dist = Dist + tree$length[tree$branch == paste0(pos1path[2:i], collapse='')]
            }   
        }
        if((CommonTrace+1) != length(pos2path)){
            for (i in (CommonTrace+1):(length(pos2path)-1)){
                Dist = Dist + tree$length[tree$branch == paste(pos2path[2:i], collapse='')]
            }
        }
        Dist = Dist + pos1$length + pos2$length
        return(Dist)
    }
}

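I think the algorithm works correctly, but it is not very efficient. Note the sign of the distance, which is important. The sign is only meaningful when the two positions are not on "sister branches", that is, when one of the two positions lies on the path between the root and the other position.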

distance(tree, posA, posB) # -22

I then loop through all positions of interest like this:

allpositions=rbind(tree, tree)
allpositions$length = c(1,5,8,2,2,3,5,6,7,8,2,3,1,2,5,6)
mat = matrix(-1, ncol=nrow(allpositions), nrow=nrow(allpositions))
    for (i in 1:nrow(allpositions)){
       for (j in 1:nrow(allpositions)){
          posA = allpositions[i,]
          posB = allpositions[j,]
          mat[i,j] = distance(tree, posA, posB)
       }
    }

#     1   2   3   4   5   6   7   8  9  10  11  12  13  14  15  16
# 1   0 -24 -39 -21 -40 -53 -55 -44 -6 -27 -33 -22 -39 -52 -55 -44
# 2  24   0 -15   7  26  39  41  30 18  -3  -9   8  25  38  41  30
# 3  39  15   0  22  41  54  56  45 33  12   6  23  40  53  56  45
# 4  21   7  22   0 -19 -32 -34 -23 15  10  16  -1 -18 -31 -34 -23
# 5  40  26  41  19   0 -13 -15   8 34  29  35  18   1 -12 -15   8
# 6  53  39  54  32  13   0   8  21 47  42  48  31  14   1   8  21
# 7  55  41  56  34  15   8   0  23 49  44  50  33  16   7   0  23
# 8  44  30  45  23   8  21  23   0 38  33  39  22   7  20  23   0
# 9   6 -18 -33 -15 -34 -47 -49 -38  0 -21 -27 -16 -33 -46 -49 -38
# 10 27   3 -12  10  29  42  44  33 21   0  -6  11  28  41  44  33
# 11 33   9  -6  16  35  48  50  39 27   6   0  17  34  47  50  39
# 12 22   8  23   1 -18 -31 -33 -22 16  11  17   0 -17 -30 -33 -22
# 13 39  25  40  18  -1 -14 -16   7 33  28  34  17   0 -13 -16   7
# 14 52  38  53  31  12  -1   7  20 46  41  47  30  13   0   7  20
# 15 55  41  56  34  15   8   0  23 49  44  50  33  16   7   0  23
# 16 44  30  45  23   8  21  23   0 38  33  39  22   7  20  23   0

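For example, consider the first and third positions in the object allpositions. The distance between them is 39 (and -39) because an ant needs to walk 19 units on branch 0, then 12 units on branch L, and finally 8 units on branch LL. 19 + 12 + 8 = 39.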

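The issue is that I have about 20 very big trees with about 50000 positions each, and I would like to calculate the distance between every pair of positions. That makes 20 * 50000^2 distances to compute. It takes forever! Can you help me improve my code?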

EDIT

Please let me know if anything is still unclear.

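tree is a description of a tree. The tree has branches of a certain length. The name of a branch (variable: branch) indicates the relationship between the branches. The branch RL is the "parent branch" of the two branches RLL and RLR, where R and L stand for right and left.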
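Because the relationship is encoded in the name, the parent of a branch can be read off by dropping the last character of its ID. The following is a minimal sketch of that rule; the function name parent_branch is hypothetical, and returning NA for the trunk is my own convention.

```r
# Hypothetical helper: derive the parent branch from a branch ID.
# "RLL" -> "RL", "R" -> "0" (the trunk), and the trunk "0" has no parent (NA).
parent_branch <- function(branch) {
  ifelse(branch == "0",
         NA_character_,
         ifelse(nchar(branch) == 1,
                "0",
                substr(branch, 1, nchar(branch) - 1)))
}

branches <- c("0", "L", "LL", "R", "RL", "RLL", "RLR", "RR")
data.frame(branch = branches, parent = parent_branch(branches))
```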

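allpositions is a data.frame in which each row represents an independent position on the tree. You can think of it as the position of a squirrel. A position is defined by two pieces of information: 1) the branch on which the squirrel stands (variable: branch) and 2) the distance between the start of that branch and the squirrel's position (variable: length).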

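Three examples

Consider a first squirrel on branch RL (whose length is 12) at position (variable: length) 8 and a second squirrel on branch RLL or RLR at position (variable: length) 2. The distance between the two squirrels is 12 - 8 + 2 = 6 (or -6).

Consider a first squirrel on branch RL at position (variable: length) 8 and a second squirrel on branch RR at position (variable: length) 2. The distance between the two squirrels is 8 + 2 = 10 (or -10).

Consider a first squirrel on branch R (whose length is 19) at position (variable: length) 8 and a second squirrel on branch RLL at position (variable: length) 2. Knowing that branch RL has length 12, the distance between the two squirrels is 19 - 8 + 12 + 2 = 25 (or -25).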
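The three examples can be checked with a small sketch that measures every position from the base of the trunk. This is not the distance function above: it returns only the unsigned distance, and the helper names path_of, start_depth and abs_distance are hypothetical. The idea is that when one position lies on the way from the root to the other, the distance is the difference of the two root distances; otherwise it is the sum of the root distances minus twice the root distance of the junction where the two paths split.

```r
tree <- data.frame(branch = c("0", "L", "LL", "R", "RL", "RLL", "RLR", "RR"),
                   length = c(20, 12, 19, 19, 12, 10, 12, 17),
                   stringsAsFactors = FALSE)

# Branches from the trunk down to `branch`, e.g. "RLL" -> "0", "R", "RL", "RLL"
path_of <- function(branch) {
  if (branch == "0") return("0")
  c("0", substring(branch, 1, seq_len(nchar(branch))))
}

# Distance from the base of the trunk to the origin of `branch`
start_depth <- function(tree, branch) {
  ancestors <- head(path_of(branch), -1)
  sum(tree$length[match(ancestors, tree$branch)])
}

# Unsigned distance along the branches between two positions
abs_distance <- function(tree, branch1, len1, branch2, len2) {
  p1 <- path_of(branch1)
  p2 <- path_of(branch2)
  d1 <- start_depth(tree, branch1) + len1
  d2 <- start_depth(tree, branch2) + len2
  n <- min(length(p1), length(p2))
  same <- p1[seq_len(n)] == p2[seq_len(n)]
  k <- if (all(same)) n else which(!same)[1] - 1  # number of shared branches
  if (k == length(p1) || k == length(p2)) {
    # Same branch, or one position is on the way from the root to the other
    return(abs(d1 - d2))
  }
  # Otherwise the paths split at the end of the last shared branch
  junction <- start_depth(tree, p1[k]) + tree$length[tree$branch == p1[k]]
  d1 + d2 - 2 * junction
}

abs_distance(tree, "RL", 8, "RLL", 2)  # first example: 6
abs_distance(tree, "RL", 8, "RR", 2)   # second example: 10
abs_distance(tree, "R", 8, "RLL", 2)   # third example: 25
```

Since start_depth depends only on the branch, it could be computed once per branch and reused for all positions, which is one possible way to avoid repeating the path walk for every pair.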

Answer 1

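The code below uses the igraph package to compute the distances between positions on tree, and it seems to be considerably faster than the code you posted in your question. The approach is to create graph vertices at the branch intersections and at the positions along the branches specified in allpositions. The graph edges are the branch segments between these vertices. It uses igraph to build a graph from the tree and allpositions, and then finds the distances between the vertices that correspond to the allpositions data.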

t.graph <- function(tree, positions) {
  library(igraph)
  #  Assign vertex name to tree branch intersections
  n_label <- nchar(tree$branch)
  tree$high_vert <- tree$branch
  tree$low_vert <- tree$branch
  tree$brnch_type <- "tree"
  for( i in 1:nrow(tree) ) {
    tree$low_vert[i] <- if(n_label[i] > 1) substr(tree$branch[i], 1, n_label[i]-1)
    else { if(tree$branch[i] %in% c("R","L")) "0"
           else "root" }
  }
  #  combine position data with tree data    
  positions$brnch_type <- "position"
  temp <- merge(positions, tree, by = "branch")
  positions <- temp[, c("branch","length.x","high_vert","low_vert","brnch_type.x")]
  positions$high_vert <- paste(positions$branch, positions$length.x, sep="_")
  colnames(positions) <- c("branch","length","high_vert","low_vert","brnch_type")
  tree <- rbind(tree, positions)
  #   use positions to segment tree branches    
  tree_brnch <- split(tree, tree$branch)
  tree <- data.frame( branch=NA_character_, length = NA_real_, high_vert = NA_character_, 
                      low_vert = NA_character_, brnch_type =NA_character_, seg_len= NA_real_)
  for( ib in 1: length(tree_brnch)) {
    brnch_seg <- tree_brnch[[ib]][order(tree_brnch[[ib]]$length, decreasing=TRUE), ]
    n_seg <- nrow(brnch_seg)
    brnch_seg$seg_len <- brnch_seg$length
    for( is in seq_len(n_seg-1) ) {  # seq_len avoids iterating over 1:0 when a branch has no positions
      brnch_seg$seg_len[is] <- brnch_seg$length[is] - brnch_seg$length[is+1]
      brnch_seg$low_vert[is] <- brnch_seg$high_vert[is+1]
    }
    tree  <- rbind(tree, brnch_seg)
  }
  tree <- tree[-1,]
  #  Create graph of tree and positions
  tree_graph <- graph_from_data_frame(tree[,c("low_vert","high_vert")])  # current name of graph.data.frame
  E(tree_graph)$label <- tree$high_vert
  E(tree_graph)$brnch_type <- tree$brnch_type
  E(tree_graph)$weight <- tree$seg_len
  #  calculate shortest distances between position vertices
  position_verts <- V(tree_graph)[grep("_", V(tree_graph)$name)] 
  vert_dist <- distances(tree_graph, v=position_verts, to=position_verts, mode="all")  # current name of shortest.paths
  return(vert_dist)
}
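The idea behind t.graph can be seen on a tiny hand-built graph. The sketch below is only an illustration, not part of the function above: it builds three segments by hand (trunk base to the position at 1 on the trunk, from there to the top of the trunk, and from the top of the trunk to the position at 5 on branch L) and asks igraph for the weighted distance. The vertex names follow the naming used in t.graph; the edge list itself is my own construction and assumes the igraph package is installed.

```r
library(igraph)

# Each row is one branch segment; the third column becomes the edge weight.
segments <- data.frame(from   = c("root", "0_1", "0"),
                       to     = c("0_1",  "0",   "L_5"),
                       weight = c(1,      19,    5))

g <- graph_from_data_frame(segments, directed = FALSE)

# Weighted shortest-path distance between the two position vertices
d <- distances(g, v = "0_1", to = "L_5")
d  # 19 units up the trunk plus 5 units along L
```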

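I benchmarked the igraph code (the t.graph function) against the code posted in the question by wrapping your code in a function named Remi, which applies your distance function to all of the allpositions data. Sample trees were created as extensions of your tree and allpositions data, for trees of up to 2048 branches, with allpositions equal to twice those sizes. A comparison of the execution times is shown below. Note that the times are in milliseconds.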

 microbenchmark(matR8 <- Remi(tree, allpositions), matG8 <- t.graph(tree, allpositions),
                matR256 <- Remi(tree256, allpositions256), matG256 <- t.graph(tree256, allpositions256), times=2)
Unit: milliseconds
                                         expr          min           lq         mean       median           uq          max neval
           matR8 <- Remi(tree, allpositions)     58.82173     58.82173     59.92444     59.92444     61.02714     61.02714     2
        matG8 <- t.graph(tree, allpositions)     11.82064     11.82064     13.15275     13.15275     14.48486     14.48486     2
    matR256 <- Remi(tree256, allpositions256) 114795.50865 114795.50865 114838.99490 114838.99490 114882.48114 114882.48114     2
 matG256 <- t.graph(tree256, allpositions256)    379.54559    379.54559    379.76673    379.76673    379.98787    379.98787     2

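Compared with the code you posted, the igraph version is only about 5 times faster for the 8-branch case, but more than 300 times faster for the 256-branch case, so igraph seems to scale better with size. I also benchmarked the igraph code for the 2048-branch case, with the following results. Times are again in milliseconds.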

microbenchmark(matG8 <- t.graph(tree, allpositions), matG64 <- t.graph(tree64, allpositions64),
               matG256 <- t.graph(tree256, allpositions256),  matG2k <- t.graph(tree2k, allpositions2k), times=2)
Unit: milliseconds
                                         expr         min          lq        mean      median          uq         max neval
         matG8 <- t.graph(tree, allpositions)    11.78072    11.78072    12.00599    12.00599    12.23126    12.23126     2
    matG64 <- t.graph(tree64, allpositions64)    73.29006    73.29006    73.49409    73.49409    73.69812    73.69812     2
 matG256 <- t.graph(tree256, allpositions256)   377.21756   377.21756   410.01268   410.01268   442.80780   442.80780     2
    matG2k <- t.graph(tree2k, allpositions2k) 11311.05758 11311.05758 11362.93701 11362.93701 11414.81645 11414.81645     2

So the distance matrix for about 4000 positions is computed in less than 12 seconds. t.graph returns the distance matrix, with the rows and columns labeled by branch name and position on the branch, for example

      0_7 0_1 L_8 L_5 LL_8 LL_2 R_3 R_2 RL_2 RL_1 RLL_3 RLL_2 RLR_5 RR_6
L_5    18  24   3   0   15    9   8   7   26   25    39    38    41   30

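shows the distances from L_5 (the position 5 units along the L branch) to the other positions. I don't know whether this will handle your largest cases, but it may be helpful for some of them. You will also need to meet the storage requirements of your largest cases.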
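To get a feel for that storage requirement, here is a rough back-of-the-envelope estimate. It assumes the full square matrix is kept in memory as double-precision values (8 bytes each) and uses the figure of 50000 positions per tree from the question; the variable names are only for illustration.

```r
n_positions <- 50000       # positions per tree, as stated in the question
bytes_per_value <- 8       # assumption: values stored as R doubles

# Size of one full n x n distance matrix, in GiB
matrix_gib <- n_positions^2 * bytes_per_value / 2^30
round(matrix_gib, 1)       # roughly 18.6 GiB for a single tree
```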

Solution 1
0 Accepted 2015-04-02 04:18:48
