R directed network from sequence

Question

(using: R 3.1.0)

Hi - I feel like this should be simpler than I'm finding it. I have a set of sequences and I'd like to visualise them as a directed network. A pure graph probably isn't right because each sequence can have multiple instances of nodes and the repetition order is important in the sequence. So, for example I might have:

Seq    Count
AB     8000
AC     5500
CB     4900
CBA    4300
ACD    4000
ACACA  3740
CA     2800
...    ...

Where the sequence ends up is interesting, so for each final node I'd like to show the paths to it and their weights. So in my (very small) example above:

end point B: A->B has weight 8000 and C->B has weight 4900.
```
 8000 A-+ |-->B 4900 C-+ 
```
end point A: C->B->A has weight 4300, A->C->A->C->A has weight 3740, C->A has weight 2800
```
  4300 C--->B-+ | 4740 A-->C-->A-->C-+--->A | 2800 C-+ 
```

Its important to note that route CA is not part of ACACA, but a separate route.

The raw data is actually a list of events in time grouped by a sequence number, so it may be easier to start from that point (rather than the aggregated view above). Like this:

seqNo. Node  Time
1      A     0.0
1      B     2.1
2      A     0.0
2      C     3.2
3      C     0.0
3      B     8.1
4      C     0.0
4      B     1.2
4      A     2.3
...    ...   ...

I'd like to know what package (if any) is best to use to work with sequences like this, and how to reduce the data to a directed network view. The iGraph package looks like it could help but I think there might be some concepts I'm missing, particularly in this case where an adjacency matrix isn't really valid (due to multiple adjacencies in the graph for each pair of nodes).

UPDATE - this this is an idea of the type of output I'm looking for:

我要寻找的例子

Cheers and thanks for any help,

Andy.

Answer 1

You seem to be saying that only start and end nodes are of interest as nodes so you could use these nodes as vertices and display the intermediate nodes as edge labels as shown in the following code and plot. Assume df contains your aggregate data.

library(igraph)
last_char <- nchar(as.character(df$Seq))
df_g <- cbind(v1=substr(df$Seq, 1,1),
              v2=substr(df$Seq, last_char, last_char), df)
g <- graph.data.frame(df_g)
plot(g, edge.label=paste(E(g)$Seq, "\n", E(g)$Count))

The visual presentation of the plot could be improved but this shows a way in which the aggregate data can produce a directed network view. One could imagine some alternative ways of representing the interior nodes between start and end nodes but these would seem to lead to more complicated plots.

UPDATE 2

Your comment made things clearer. Most of the work in getting your diagram is generating the edges and vertices for a graph from your sequence data. Once that is defined, you can format and send to a plotting package to display. The code below constructs a data frame df_g containing the edge connectivity and end locations, uses df_g to generate a data frame df_v containing vertex data, and then passes both to igraph for plotting. You can get an idea of what the code is doing by examining df_g and df_v .

  library(igraph)
  last_char <- nchar(df$Seq)
  df <- df[order(substr(df$Seq, last_char, last_char), df$Seq),]
  edges <- as.character(df$Seq)
  df_g <- data.frame(v1=NA_character_, v2=NA_character_, Seq=NA_character_, 
                     Count=NA_character_, label=NA_character_, arrow.mode = NA_character_, end = NA_character_, 
                     x1 = NA_integer_, x2 = NA_integer_, y1=NA_integer_, y2=NA_integer_,  type=NA_character_,
                     stringsAsFactors=FALSE)
  for( i in 1:nrow(df)){
 #  Make sequence edges
      edge <- edges[i]
      num_vert <- nchar(edge)
      j <- 1:(num_vert-1)
      df_g_j <- data.frame( v1=paste(edge, j,sep="_"), v2=paste(edge, j+1,sep="_"), 
                         Seq=edge, Count=df$Count[i], label=sapply(j, function(x) substr(edge, x, x)), 
                         arrow.mode = ">", end=substr(edge,num_vert,num_vert),
                         x1=j-num_vert, x2=j+1-num_vert,  y1=i, y2=i, type="seq", stringsAsFactors=FALSE) 
      df_g_j[num_vert-1, "arrow.mode"] <- "-"       # make connector vertex   
      df_g_con <- transform(df_g_j[num_vert-1,], v1=v2, v2=paste(end, "connector", sep="_"), x1=0, label=NA, type="connector")
      df_g <- rbind(df_g, df_g_j, df_g_con)    
    }
    df_g <- df_g[-1,]
    df_g[df_g$type=="connector",] <- within(df_g[df_g$type=="connector",], y2 <- tapply(y2, v2, mean)[v2])
    cn_vert <- aggregate(v2 ~ end, data=df_g[df_g$type=="connector", ], length)
    colnames(cn_vert) <- c("end","num")
    for( end in cn_vert$end){
      cn_vert_row <- which(df_g$end == end & df_g$type == "connector")[1]
      if( cn_vert$num[cn_vert$end==end] > 1 ) {
        df_g <- rbind(df_g,with(df_g[cn_vert_row,], 
                                data.frame(v1=v2, v2=end, Seq=NA_character_, Count=NA_character_, label=NA,
                                           arrow.mode = ">", end=end, x1=x2, x2= 1, y1 = y2, y2=y2, type = "common_end", 
                                          stringsAsFactors=FALSE)) ) }
      else df_g[cn_vert_row,] <- transform(df_g[cn_vert_row,], v2=end, label=NA, arrow.mode=">", x2=1,type="common_end")
  }
#  make vertices
  df_v <- with(df_g, data.frame(v=v1, label = label, x=x1, y=y1, color = "black", size = 15, stringsAsFactors=FALSE))
  df_v <- rbind(df_v, with(df_g[df_g$type == "common_end",], 
                           data.frame(v=end, label = v2, x=x2, y=y2, color="black", size=15, stringsAsFactors=FALSE)))
  df_v[is.na(df_v$label),] <- transform(df_v[is.na(df_v$label),], color = NA, size = 0)
#
#  make graph from edges and vertices
  g <- graph.data.frame(df_g, vertices=df_v)
  E(g)$label <- NA                       # assign Counts as labels to sequence start vertices
  e_start <- grep("_1",get.edgelist(g)[,1])
  E(g)[e_start]$label <- E(g)[e_start]$Count
# adjust and scale edge label positions
  h_jst <- 0            # values between 0 and .2
  edge_label_x  <- 1 - 2*(1.5 + h_jst - E(g)$x1)/diff(range(V(g)$x))
  num_color <-12                           # assign colors to Count labels; num_color is number of colors in pallette
  counts <- as.integer(E(g)$Count)
  edge_label_color <- rainbow(num_color, start=0, end=.75)[num_color- 
                                         floor((num_color-1)*(counts-min(counts,na.rm=TRUE))/diff(range(counts,na.rm=TRUE)))]
  plot(g, vertex.label.color="white", vertex.frame.color=V(g)$color, 
       edge.color="blue", edge.arrow.size=.6, edge.label.x= edge_label_x, 
       edge.label.color=edge_label_color, edge.label.font=2, edge.label.cex=1.1)

For your sample data, this gives the diagram shown below. The Count labels have greater separation from the vertices when the plots are enlarged but you can further adjust this by with the variable h_jst inside the code.

在此处输入图片说明

Answer 2

I have discovered a package that neatly (although verbosely) solves this problem in a way that was acceptable, although not exactly what I was looking for from a formatting point of view.

Using the DigrammeR package (which implements graphViz through the grViz function) I could design a network that looked like my desired output in the question. The language is verbose, but it would be easy to construct the code to give to grViz algorithmically once you'd discovered the appropriate network paths.

The code is:

library(DiagrammeR)
library(V8)
library(XML)

gph<-grViz("
  digraph {
    outputorder=edgesfirst;
    rankdir='LR';
    node [shape = circle, style='filled', fillcolor = black, fontname=Arial, fontcolor=white];

    A1 -> C1 -> D1              [color='cornflowerblue', penwidth=3];
    A2 -> C2                    [color='cornflowerblue', penwidth=3];
    C3 -> B1                    [color='cornflowerblue', penwidth=3];
    A3 -> B1                    [color='cornflowerblue', penwidth=3];
    C4 -> B2 -> A4              [color='cornflowerblue', penwidth=3];
    C5 -> A4                    [color='cornflowerblue', penwidth=3];
    A5 -> C6 -> A6 -> C7 -> A4  [color='cornflowerblue', penwidth=3];

    w1 -> A1 [dir=none, style=dotted];
    w2 -> A2 [dir=none, style=dotted];
    w3 -> C3 [dir=none, style=dotted];
    w4 -> A3 [dir=none, style=dotted];
    w5 -> C4 [dir=none, style=dotted];
    w6 -> C5 [dir=none, style=dotted];
    w7 -> A5 [dir=none, style=dotted];

    w1 [shape=box];
    w2 [shape=box];
    w3 [shape=box];
    w4 [shape=box];
    w5 [shape=box];
    w6 [shape=box];
    w7 [shape=box];

    w1 [label='4000', fillcolor='yellow3'];
    w2 [label='5500', fillcolor='pink'];
    w3 [label='4900', fillcolor='orange'];
    w4 [label='8000', fillcolor='red'];
    w5 [label='4300', fillcolor='orange'];
    w6 [label='2800', fillcolor='yellow'];
    w7 [label='3740', fillcolor='yellow3'];

    A1 [label='A'];
    A2 [label='A'];
    A3 [label='A'];
    A4 [label='A'];
    A5 [label='A'];
    A6 [label='A'];
    B1 [label='B'];
    B2 [label='B'];
    C1 [label='C'];
    C2 [label='C'];
    C3 [label='C'];
    C4 [label='C'];
    C5 [label='C'];
    C6 [label='C'];
    C7 [label='C'];
    D1 [label='D'];

  }")
graph.svg<-exportSVG(gph)
write(graph.svg, "C:/graph.svg")

This produces a standard SVG file that looks like this:

在此处输入图片说明

R directed network from sequence

Question

2 answers

solution1
1 ACCPTED 2015-04-28 15:14:35

solution2
0 2015-05-26 15:59:40

R directed network from sequence

Question

2 answers

solution1 1 ACCPTED 2015-04-28 15:14:35

solution2 0 2015-05-26 15:59:40

solution1
1 ACCPTED 2015-04-28 15:14:35

solution2
0 2015-05-26 15:59:40