Switching from indices to names (or other attributes) in lists in R for large datasets. (iGraph)

I am working with a graph object (igraph package) in R. I apply a function called get.shortest.paths(), which returns the shortest paths from a given vertex to all the other vertices in the graph. The result is a list in which each element corresponds to one target vertex and contains the indices of all the vertices on the shortest path between the source and that target. For example:

head(get.shortest.paths(graph, from = V(graph)[1], to = V(graph), mode = "out"))
[[1]]
[1] 0 (source and target are the same)
[[2]]
[1]     0 91835 38405 89704     1
[[3]]
[1]     0 91835 12104 39002 22670     2
[[4]]
[1]     0 62386 36754 89246 31045     3

The problem arises when I want to go from vertex indices to vertex names, to get something like this:

[[1]]
[1] "gene 1"
[[2]]
[1]     "gene 1"  "protein 45" "protein 83" "protein 70"     "gene 2"
[[3]]
[1]     "gene 1" "protein 45" "protein 30"  "reaction 2" "protein 404"     "gene 3"
[[4]]
[1]     "gene 1" "protein 4" "reaction 12" "protein 19"  "protein 494"   "gene 4"

I tried to do this using lapply():

path.index.list <- get.shortest.paths(graph, from = V(graph)[1], to = V(graph), mode = "out")
path.name.list <- lapply(path.index.list, FUN = function(path) V(graph)[path]$name)

... but this takes a very long time, and "for" loops take just as long. In fact, the time needed to convert from indices to names for the paths from just one source vertex to all the other 100,000+ vertices was...

system.time(lapply(path.index.list, FUN = function(path) V(graph)[path]$name))
  user  system elapsed
608.62  152.69  761.66

... which comes to about 900 days for the whole graph.
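For reference, the extrapolation can be reproduced with a couple of lines of R. This is only a back-of-the-envelope sketch: it assumes exactly 100,000 source vertices (the post says "100,000+") and that every source takes the same elapsed time as the one measured above.

```r
# Elapsed seconds measured above for a single source vertex.
elapsed_per_source <- 761.66

# Assumption: 100,000 source vertices, each costing the same time.
n_sources <- 100000

# Total time in days (86,400 seconds per day); comes out in the high 800s.
total_days <- elapsed_per_source * n_sources / 86400
print(round(total_days))
```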

Is this one of those "pass-by-reference" vs. "pass-by-value" problems, and if so, can someone help me understand how to solve it? I have heard of using hashes or environments in R to solve things like this; can anyone comment on that? I have also heard that some R packages can help with this.

Basically, how can I solve this without having to code in C?

Query the names of the vertices once, in advance, and index into that vector inside lapply:

names <- V(graph)$name
lapply(path.index.list, FUN = function(path) names[path])

I expect this to be much faster, because lapply no longer has to build V(graph) and the full name vector on every iteration just to select a few elements from it.
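Here is a minimal, self-contained sketch of the same idea on toy data, so it can be run without a graph. The names vertex_names, path_index_list and id_offset are made up for illustration. One thing to watch: the indices printed in the question start at 0, as they did in older igraph releases, while R vectors are 1-based, so the sketch assumes an offset of 1. If your igraph version already returns 1-based ids, set the offset to 0.

```r
# Toy stand-ins (hypothetical data, not from the original post):
# vertex_names plays the role of V(graph)$name, and path_index_list
# plays the role of the list returned by get.shortest.paths().
vertex_names <- c("gene 1", "protein 45", "protein 83", "gene 2")
path_index_list <- list(c(0), c(0, 1, 2, 3), c(0, 2))

# Assumption: ids are 0-based, as in the output shown in the question.
# Use id_offset <- 0 if the ids are already 1-based.
id_offset <- 1

# The name vector is built once, outside the loop; each iteration
# is then a plain character-vector subset.
path_name_list <- lapply(path_index_list,
                         function(path) vertex_names[path + id_offset])
print(path_name_list[[2]])
```

Giving the vector a name other than names also avoids shadowing the base R function names().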

Yes, I originally used the lapply method described by user "Tamás". I am getting about 230 seconds per iteration (about 2 seconds per 1000 items). I tried the "fastmatch" package combined with preallocating memory in matrices, and it actually got slower. I took this to mean that the bottleneck is how fast R looks up items rather than memory. I need to get this down to less than 6 seconds per iteration for it to be practical. I guess I'm going to C...
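Before dropping to C, it may be worth timing a variant that avoids calling an R function once per path: flatten all the paths into one vector, do a single name lookup, and split the result back into a list. This is only a sketch on toy data, assuming 1-based ids. The variable names are hypothetical, and whether it is actually faster on the real graph has to be measured.

```r
# Toy stand-ins (hypothetical data): vertex_names plays the role of
# V(graph)$name, path_index_list the role of the shortest-path list.
vertex_names <- c("gene 1", "protein 45", "protein 83", "gene 2")
path_index_list <- list(c(1), c(1, 2, 3, 4), integer(0), c(1, 3))

# One vectorized lookup over all path elements at once.
flat_ids <- unlist(path_index_list, use.names = FALSE)
flat_names <- vertex_names[flat_ids]

# Record which path each flattened element came from. Using a factor
# with explicit levels keeps empty paths as character(0) in the result.
path_lengths <- lengths(path_index_list)
group <- factor(rep.int(seq_along(path_index_list), path_lengths),
                levels = seq_along(path_index_list))

# Split back into one character vector per path, in the original order.
path_name_list <- unname(split(flat_names, group))
print(path_name_list[[4]])
```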

