
Improve processing performance in R for Social Network Analysis

I am doing social network analysis using the igraph package in R, dealing with close to 2 million vertices and edges. I am also calculating degrees of separation for nearly 8 million vertex pairs. Execution usually takes somewhere between 2 and 3 hours, which is far too long. I need some input and suggestions to improve this performance. Below is the sample code I am using:

g <- graph.data.frame(ids, directed = FALSE) # ids contains approximately 2 million records

# t_ids contains approximately 8 million records for which degrees of separation
# are to be calculated using shortest-path algorithms
distances(graph = g, v = t_ids$ID_from[x], to = t_ids$ID_to[x], weights = NA)

Thanks in advance!

I don't think so, but I'd be very happy to be proven wrong.

You should look into other ways of optimising the code that is running.

If your data is fixed, you could compute the distances once, save the (probably rather big) distance matrix, and query it for degrees of separation.
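A minimal sketch of that precompute-and-save idea, using a tiny ring graph as a stand-in for the real network (note that a full distance matrix for 2 million vertices may well be too large to hold in memory, so this only applies if the vertex set you query is manageable):

```r
library(igraph)

# Tiny stand-in graph; the real one has ~2 million vertices
g <- make_ring(5)

# Compute the full distance matrix once...
D <- distances(g, weights = NA)

# ...persist it so later sessions can skip the computation...
f <- tempfile(fileext = ".rds")
saveRDS(D, f)

# ...and answer degree-of-separation queries by plain matrix indexing
D2 <- readRDS(f)
D2[1, 3]  # 2 hops around the ring
```

Reading a saved matrix and indexing into it is effectively instant compared with re-running shortest-path searches for every query.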

If your analysis does not require distances between all x vertices, you should look into optimising your code by shortening t_ids$ID_from[x]. Get only the distances you need. I suspect that you're already doing this, though.
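Relatedly, if many rows of t_ids share the same source vertex, a single vectorised distances() call over the unique sources and targets can replace millions of per-row calls. A minimal sketch, with tiny made-up ids / t_ids tables standing in for the real data:

```r
library(igraph)

# Hypothetical stand-ins shaped like the real ids / t_ids tables
ids   <- data.frame(from = c("a", "b", "c", "d"),
                    to   = c("b", "c", "d", "a"))
t_ids <- data.frame(ID_from = c("a", "a", "b"),
                    ID_to   = c("c", "d", "d"),
                    stringsAsFactors = FALSE)

g <- graph.data.frame(ids, directed = FALSE)

# One call for all needed sources and targets, instead of one call per row
D <- distances(g, v = unique(t_ids$ID_from), to = unique(t_ids$ID_to),
               weights = NA)

# Pick out each pair's degree of separation by vertex name
t_ids$sep <- D[cbind(t_ids$ID_from, t_ids$ID_to)]
```

Since an unweighted shortest-path search from one source finds distances to all targets anyway, batching by unique source vertex is the natural granularity here.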

distances() actually computes rather quickly. At 10,000 nodes (which amounts to 4.99×10^6 undirected distances), my crappy machine gets a full 700MB distance matrix in a few seconds.

I first thought about the different algorithms you can choose in distances(), but now I doubt that they will help you. I ran a speed test on the different algorithms to see if I could recommend any of them to you, but they all seem to run at more or less the same speed (results are ratios relative to the time taken by the automatic algorithm, which would be used in your code above):

  sample automatic unweighted  dijkstra bellman-ford   johnson
1     10         1  0.9416667 0.9750000    1.0750000 1.0833333
2    100         1  0.9427083 0.9062500    0.8906250 0.8958333
3   1000         1  0.9965636 0.9656357    0.9977090 0.9873998
4   5000         1  0.9674200 0.9947269    0.9691149 1.0007533
5  10000         1  1.0070885 0.9938136    0.9974223 0.9953602

I don't think anything can be concluded from this, but it's running on an Erdős–Rényi model. It's possible that your network structure favours one algorithm over another, but they would still not give you anywhere near the performance boost you're hoping for.

The code is here:

# igraph
library(igraph)

# setup:
samplesizes <- c(10, 100, 1000, 5000, 10000)
reps <- c(100, 100, 15, 3, 1)
algorithms = c("automatic", "unweighted", "dijkstra", "bellman-ford", "johnson")
df <- as.data.frame(matrix(ncol=length(algorithms), nrow=0), stringsAsFactors = FALSE)
names(df) <- algorithms

# any random graph
g <- erdos.renyi.game(10000, 10000, "gnm")

# These are the different algorithms used by distances:
m.auto <- distances(g, v=V(g), to=V(g), weights=NA, algorithm="automatic")
m.unwg <- distances(g, v=V(g), to=V(g), weights=NA, algorithm="unweighted")
m.dijk <- distances(g, v=V(g), to=V(g), weights=NA, algorithm="dijkstra")
m.belm <- distances(g, v=V(g), to=V(g), weights=NA, algorithm="bellman-ford")
m.john <- distances(g, v=V(g), to=V(g), weights=NA, algorithm="johnson")

# They produce the same result:
sum(m.auto == m.unwg & m.auto == m.dijk & m.auto == m.belm & m.auto == m.john) == length(m.auto)


# This function is used to test the speed of distances() run with different algorithms
test_distances <- function(alg){
       m <- distances(g, v=V(g), to=V(g), weights=NA, algorithm=alg)
       invisible(TRUE)
}

# Build testresults
for(i.sample in 1:length(samplesizes)){
       # Create a random network to test
       g <- erdos.renyi.game(samplesizes[i.sample], (samplesizes[i.sample]*1.5), type = "gnm", directed = FALSE, loops = FALSE)

       i.rep <- reps[i.sample]

       for(i.alg in 1:length(algorithms)){
              df[i.sample,i.alg] <- system.time( replicate(i.rep, test_distances(algorithms[i.alg]) ) )[['elapsed']]
       }
}

# Normalize benchmark results
dfn <- df

dfn[, 1:ncol(df)] <- df[, 1:ncol(df)] / df[, 1]
dfn$sample <- samplesizes
dfn <- dfn[,c(6,1:5)]
dfn
