R directed network from sequence

(using: R 3.1.0)

Hi - I feel like this should be simpler than I'm finding it. I have a set of sequences and I'd like to visualise them as a directed network. A pure graph probably isn't right because each sequence can have multiple instances of nodes and the repetition order is important in the sequence. So, for example I might have:

Seq    Count
AB     8000
AC     5500
CB     4900
CBA    4300
ACD    4000
ACACA  3740
CA     2800
...    ...

Where the sequence ends up is interesting, so for each final node I'd like to show the paths to it and their weights. So in my (very small) example above:

  • end point B: A->B has weight 8000 and C->B has weight 4900.

     8000 A-+
            |-->B
     4900 C-+

  • end point A: C->B->A has weight 4300, A->C->A->C->A has weight 3740, and C->A has weight 2800.

     4300           C-->B-+
                          |
     3740 A-->C-->A-->C---+-->A
                          |
     2800               C-+

It's important to note that route CA is not part of ACACA, but a separate route.
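
As a minimal sketch of the grouping described above (not part of the original post), the aggregated table can be split by end point in base R. The data frame `df` and the column name `End` are assumptions made for illustration:

```r
# Aggregated sample from the question (the "..." rows are omitted)
df <- data.frame(
  Seq   = c("AB", "AC", "CB", "CBA", "ACD", "ACACA", "CA"),
  Count = c(8000, 5500, 4900, 4300, 4000, 3740, 2800),
  stringsAsFactors = FALSE
)

# The end point is the last character of each sequence
n <- nchar(df$Seq)
df$End <- substr(df$Seq, n, n)

# One small table of (path, weight) per end point
paths_by_end <- split(df[, c("Seq", "Count")], df$End)
paths_by_end$A
```

Because each row stays a separate path, CA and ACACA remain distinct routes to A rather than being merged.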

The raw data is actually a list of events in time grouped by a sequence number, so it may be easier to start from that point (rather than the aggregated view above). Like this:

seqNo. Node  Time
1      A     0.0
1      B     2.1
2      A     0.0
2      C     3.2
3      C     0.0
3      B     8.1
4      C     0.0
4      B     1.2
4      A     2.3
...    ...   ...
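
One possible way to get from this raw event list to the aggregated Seq/Count view, sketched in base R. The data frame name `events` and the result name `agg` are assumptions, and only the rows shown above are used:

```r
# Raw events from the question (the "..." rows are omitted)
events <- data.frame(
  seqNo = c(1, 1, 2, 2, 3, 3, 4, 4, 4),
  Node  = c("A", "B", "A", "C", "C", "B", "C", "B", "A"),
  Time  = c(0.0, 2.1, 0.0, 3.2, 0.0, 8.1, 0.0, 1.2, 2.3),
  stringsAsFactors = FALSE
)

# Order events by time within each sequence, then concatenate the node labels
events <- events[order(events$seqNo, events$Time), ]
seqs <- tapply(events$Node, events$seqNo, paste, collapse = "")

# Count identical sequences and sort by frequency
tab <- table(seqs)
agg <- data.frame(Seq = names(tab), Count = as.vector(tab),
                  stringsAsFactors = FALSE)
agg <- agg[order(-agg$Count, agg$Seq), ]
agg
```

With only four sequences in the sample every Count is 1; on the full data the same steps would give the weighted table.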

I'd like to know what package (if any) is best to use to work with sequences like this, and how to reduce the data to a directed network view. The igraph package looks like it could help but I think there might be some concepts I'm missing, particularly in this case where an adjacency matrix isn't really valid (due to multiple adjacencies in the graph for each pair of nodes).

UPDATE - this is an idea of the type of output I'm looking for:


Cheers and thanks for any help,

Andy.

You seem to be saying that only start and end nodes are of interest as nodes, so you could use these nodes as vertices and display the intermediate nodes as edge labels, as shown in the following code and plot. Assume df contains your aggregate data.

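library(igraph)

seqs <- as.character(df$Seq)
last_char <- nchar(seqs)
# the first two columns are the edge endpoints: start and end node of each sequence
df_g <- cbind(v1=substr(seqs, 1, 1),
              v2=substr(seqs, last_char, last_char), df)
g <- graph_from_data_frame(df_g)   # graph.data.frame() in igraph versions before 1.0
plot(g, edge.label=paste(E(g)$Seq, "\n", E(g)$Count))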

The visual presentation of the plot could be improved but this shows a way in which the aggregate data can produce a directed network view. One could imagine some alternative ways of representing the interior nodes between start and end nodes but these would seem to lead to more complicated plots.
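
To see why an edge list suits this data better than an adjacency matrix, here is a small base-R sketch on the question's sample. It builds the same start/end columns as the snippet above and counts parallel edges per ordered pair of nodes; the names `parallel` and `n_edges` are assumptions for illustration:

```r
# Aggregated sample from the question (the "..." rows are omitted)
df <- data.frame(
  Seq   = c("AB", "AC", "CB", "CBA", "ACD", "ACACA", "CA"),
  Count = c(8000, 5500, 4900, 4300, 4000, 3740, 2800),
  stringsAsFactors = FALSE
)

# Start and end node of every sequence, one edge per row
n <- nchar(df$Seq)
df_g <- data.frame(v1 = substr(df$Seq, 1, 1),
                   v2 = substr(df$Seq, n, n),
                   df, stringsAsFactors = FALSE)

# Number of parallel edges for each ordered (start, end) pair
parallel <- aggregate(Seq ~ v1 + v2, data = df_g, FUN = length)
names(parallel)[3] <- "n_edges"
parallel[parallel$n_edges > 1, ]
```

In the sample, C to A carries two separate edges (CBA and CA), which a single adjacency-matrix cell could not keep apart.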

UPDATE 2

Your comment made things clearer. Most of the work in getting your diagram is generating the edges and vertices for a graph from your sequence data. Once that is defined, you can format and send to a plotting package to display. The code below constructs a data frame df_g containing the edge connectivity and end locations, uses df_g to generate a data frame df_v containing vertex data, and then passes both to igraph for plotting. You can get an idea of what the code is doing by examining df_g and df_v.

  library(igraph)

  df$Seq <- as.character(df$Seq)
  last_char <- nchar(df$Seq)
  # order by end node, then by sequence, so paths to the same end point are adjacent
  df <- df[order(substr(df$Seq, last_char, last_char), df$Seq),]
  edges <- df$Seq
  # the all-NA first row only fixes the column types; it is dropped after the loop
  df_g <- data.frame(v1=NA_character_, v2=NA_character_, Seq=NA_character_, 
                     Count=NA_character_, label=NA_character_, arrow.mode = NA_character_, end = NA_character_, 
                     x1 = NA_integer_, x2 = NA_integer_, y1=NA_integer_, y2=NA_integer_,  type=NA_character_,
                     stringsAsFactors=FALSE)
  for( i in 1:nrow(df)){
  #  make sequence edges: one vertex per position, so repeated nodes stay distinct
      edge <- edges[i]
      num_vert <- nchar(edge)
      j <- 1:(num_vert-1)
      df_g_j <- data.frame( v1=paste(edge, j,sep="_"), v2=paste(edge, j+1,sep="_"), 
                         Seq=edge, Count=df$Count[i], label=sapply(j, function(x) substr(edge, x, x)), 
                         arrow.mode = ">", end=substr(edge,num_vert,num_vert),
                         x1=j-num_vert, x2=j+1-num_vert,  y1=i, y2=i, type="seq", stringsAsFactors=FALSE) 
      df_g_j[num_vert-1, "arrow.mode"] <- "-"       # last segment runs into the connector, so no arrow head
  #  make connector vertex
      df_g_con <- transform(df_g_j[num_vert-1,], v1=v2, v2=paste(end, "connector", sep="_"), x1=0, label=NA, type="connector")
      df_g <- rbind(df_g, df_g_j, df_g_con)    
  }
  df_g <- df_g[-1,]
  # place each shared connector at the mean height of the sequences that feed it
  df_g[df_g$type=="connector",] <- within(df_g[df_g$type=="connector",], y2 <- tapply(y2, v2, mean)[v2])
  cn_vert <- aggregate(v2 ~ end, data=df_g[df_g$type=="connector", ], length)
  colnames(cn_vert) <- c("end","num")
  for( end in cn_vert$end){
    cn_vert_row <- which(df_g$end == end & df_g$type == "connector")[1]
    if( cn_vert$num[cn_vert$end==end] > 1 ) {
      # several sequences share this end point: add one common edge from the connector to it
      df_g <- rbind(df_g,with(df_g[cn_vert_row,], 
                              data.frame(v1=v2, v2=end, Seq=NA_character_, Count=NA_character_, label=NA,
                                         arrow.mode = ">", end=end, x1=x2, x2= 1, y1 = y2, y2=y2, type = "common_end", 
                                         stringsAsFactors=FALSE)) ) }
    # a single sequence reaches this end point: turn its connector into the final edge
    else df_g[cn_vert_row,] <- transform(df_g[cn_vert_row,], v2=end, label=NA, arrow.mode=">", x2=1,type="common_end")
  }
#  make vertices
  df_v <- with(df_g, data.frame(v=v1, label = label, x=x1, y=y1, color = "black", size = 15, stringsAsFactors=FALSE))
  df_v <- rbind(df_v, with(df_g[df_g$type == "common_end",], 
                           data.frame(v=end, label = v2, x=x2, y=y2, color="black", size=15, stringsAsFactors=FALSE)))
  df_v[is.na(df_v$label),] <- transform(df_v[is.na(df_v$label),], color = NA, size = 0)   # hide unlabelled helper vertices
#  make graph from edges and vertices
  g <- graph_from_data_frame(df_g, vertices=df_v)   # graph.data.frame() in igraph versions before 1.0
  E(g)$label <- NA                       # assign Counts as labels to sequence start vertices
  e_start <- grep("_1$", as_edgelist(g)[,1])        # as_edgelist() was get.edgelist() in older igraph
  E(g)[e_start]$label <- E(g)[e_start]$Count
# adjust and scale edge label positions
  h_jst <- 0            # values between 0 and .2
  edge_label_x  <- 1 - 2*(1.5 + h_jst - E(g)$x1)/diff(range(V(g)$x))
  num_color <-12                           # assign colors to Count labels; num_color is number of colors in palette
  counts <- as.integer(E(g)$Count)
  # NOTE: the tail of this expression was cut off in the source; the index below is an
  # assumed reconstruction that maps the largest Count to the first (red) color
  edge_label_color <- rainbow(num_color, start=0, end=.75)[num_color - 
      floor((num_color-1)*(counts - min(counts, na.rm=TRUE))/diff(range(counts, na.rm=TRUE)))]
  plot(g, vertex.label.color="white", vertex.frame.color=V(g)$color, 
       edge.color="blue", edge.arrow.size=.6, edge.label.x= edge_label_x, 
       edge.label.color=edge_label_color, edge.label.font=2, edge.label.cex=1.1)

For your sample data, this gives the diagram shown below. The Count labels have greater separation from the vertices when the plots are enlarged, but you can further adjust this with the variable h_jst in the code.
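
The core idea in the edge-building loop is that every position in a sequence gets its own vertex id, so a node that repeats (as in ACACA) is drawn once per occurrence. A reduced sketch of just that step, using a hypothetical helper `seq_edges` that is not part of the code above and leaves out the connector and end-point handling:

```r
# Hypothetical helper: expand one sequence into positional vertices and edges
seq_edges <- function(s) {
  chars <- strsplit(s, "")[[1]]
  n <- length(chars)
  ids <- paste(s, seq_len(n), sep = "_")   # e.g. "CBA_1", "CBA_2", "CBA_3"
  list(
    # x runs from 1 - n up to 0, so every sequence ends in the same column
    vertices = data.frame(v = ids, label = chars, x = seq_len(n) - n,
                          stringsAsFactors = FALSE),
    edges = data.frame(v1 = ids[-n], v2 = ids[-1], stringsAsFactors = FALSE)
  )
}

res <- seq_edges("CBA")
res$vertices
res$edges
```

Calling `seq_edges("ACACA")` gives five distinct vertex ids even though only two labels occur, which is what keeps the repetition order visible in the plot.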


I have discovered a package that neatly (although verbosely) solves this problem in a way that was acceptable, although not exactly what I was looking for from a formatting point of view.

Using the DiagrammeR package (which exposes Graphviz through the grViz function) I could design a network that looked like my desired output in the question. The language is verbose, but it would be easy to construct the code to give to grViz algorithmically once you'd discovered the appropriate network paths.
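
A rough sketch of what that algorithmic construction might look like, assuming the aggregated table from the question is in a data frame `df`. The helper `make_dot` and the node id scheme (`s<i>_<j>` for sequence nodes, `end_<X>` for shared end points, `w<i>` for weight boxes) are hypothetical names chosen for illustration, and the styling attributes are left out:

```r
# Aggregated sample from the question (the "..." rows are omitted)
df <- data.frame(
  Seq   = c("AB", "AC", "CB", "CBA", "ACD", "ACACA", "CA"),
  Count = c(8000, 5500, 4900, 4300, 4000, 3740, 2800),
  stringsAsFactors = FALSE
)

# Hypothetical helper: build Graphviz DOT lines from the aggregated table
make_dot <- function(df) {
  lines <- c("digraph {", "  rankdir = LR;", "  node [shape = circle];")
  for (i in seq_len(nrow(df))) {
    chars <- strsplit(df$Seq[i], "")[[1]]
    n <- length(chars)
    # private ids for all but the last node; the last node is shared per end letter
    ids <- c(sprintf("s%d_%d", i, seq_len(n - 1)), paste0("end_", chars[n]))
    w_id <- sprintf("w%d", i)
    lines <- c(
      lines,
      sprintf("  %s [label=\"%s\", shape=box];", w_id, df$Count[i]),
      sprintf("  %s [label=\"%s\"];", ids, chars),
      sprintf("  %s -> %s [dir=none, style=dotted];", w_id, ids[1]),
      sprintf("  %s;", paste(ids, collapse = " -> "))
    )
  }
  c(lines, "}")
}

dot_lines <- make_dot(df)
dot <- paste(dot_lines, collapse = "\n")
cat(dot)
```

The string in `dot` could then be passed to `grViz(dot)` to render it; that call is not run here, and colours and pen widths would still need to be added to match the hand-styled version.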

The code I used is:

library(DiagrammeR)

# NOTE: the R lines wrapping the DOT text were lost in the source; the grViz() and
# export_svg() calls are an assumed reconstruction (export_svg() is from DiagrammeRsvg)
graph <- grViz("
  digraph {
    node [shape = circle, style='filled', fillcolor = black, fontname=Arial, fontcolor=white];

    A1 -> C1 -> D1              [color='cornflowerblue', penwidth=3];
    A2 -> C2                    [color='cornflowerblue', penwidth=3];
    C3 -> B1                    [color='cornflowerblue', penwidth=3];
    A3 -> B1                    [color='cornflowerblue', penwidth=3];
    C4 -> B2 -> A4              [color='cornflowerblue', penwidth=3];
    C5 -> A4                    [color='cornflowerblue', penwidth=3];
    A5 -> C6 -> A6 -> C7 -> A4  [color='cornflowerblue', penwidth=3];

    w1 -> A1 [dir=none, style=dotted];
    w2 -> A2 [dir=none, style=dotted];
    w3 -> C3 [dir=none, style=dotted];
    w4 -> A3 [dir=none, style=dotted];
    w5 -> C4 [dir=none, style=dotted];
    w6 -> C5 [dir=none, style=dotted];
    w7 -> A5 [dir=none, style=dotted];

    w1 [shape=box];
    w2 [shape=box];
    w3 [shape=box];
    w4 [shape=box];
    w5 [shape=box];
    w6 [shape=box];
    w7 [shape=box];

    w1 [label='4000', fillcolor='yellow3'];
    w2 [label='5500', fillcolor='pink'];
    w3 [label='4900', fillcolor='orange'];
    w4 [label='8000', fillcolor='red'];
    w5 [label='4300', fillcolor='orange'];
    w6 [label='2800', fillcolor='yellow'];
    w7 [label='3740', fillcolor='yellow3'];

    A1 [label='A'];
    A2 [label='A'];
    A3 [label='A'];
    A4 [label='A'];
    A5 [label='A'];
    A6 [label='A'];
    B1 [label='B'];
    B2 [label='B'];
    C1 [label='C'];
    C2 [label='C'];
    C3 [label='C'];
    C4 [label='C'];
    C5 [label='C'];
    C6 [label='C'];
    C7 [label='C'];
    D1 [label='D'];
  }
")

graph.svg <- DiagrammeRsvg::export_svg(graph)
write(graph.svg, "C:/graph.svg")

This produces a standard SVG file that looks like this:

