Social graph analysis. 60GB and 100 million nodes
Good evening,
I am trying to analyse the aforementioned data (edgelist or Pajek format). My first thought was R with the igraph package, but my memory limit (6GB) won't do the trick. Will a 128GB PC be able to handle the data? Are there any alternatives that don't require the whole graph in RAM?
Thanks in advance.
PS: I have found several programs, but I would like to hear some pro (yeah, that's you) opinions on the matter.
If you only want degree distributions, you likely don't need a graph package at all. I recommend the bigtabulate package so that the edgelist can live in a file-backed object rather than in RAM and the degree computation can be parallelized with foreach. Check out their website for more details. To give a quick example of this approach, let's first create an example with an edgelist involving 1 million edges among 1 million nodes.
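set.seed(1)
N <- 1e6   # number of nodes
M <- 1e6   # number of edges
# Draw random sender/receiver pairs (self-loops and duplicates are possible)
edgelist <- cbind(sample(1:N, M, replace = TRUE),
                  sample(1:N, M, replace = TRUE))
colnames(edgelist) <- c("sender", "receiver")
# Write a plain CSV with no header and no row names
write.table(edgelist, file = "edgelist-small.csv", sep = ",",
            row.names = FALSE, col.names = FALSE)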
I next concatenate this file 10 times to make the example a bit bigger.
system("
for i in $(seq 1 10)
do
cat edgelist-small.csv >> edgelist.csv
done")
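The shell loop above needs a Unix-like shell, and because >> appends, re-running it makes edgelist.csv grow further each time. A minimal base-R sketch of the same concatenation is given below. The helper name repeat_file is hypothetical and introduced only for illustration; it is not part of the original answer.

```r
# Hypothetical helper (an assumption, not from the original answer):
# build `dest` from `times` copies of `src` using only base R.
repeat_file <- function(src, dest, times = 10) {
  file.create(dest)                 # create dest, or truncate it if it exists
  for (i in seq_len(times)) {
    file.append(dest, src)          # append one full copy of src to dest
  }
  invisible(dest)
}

# Small demonstration on temporary files
src  <- tempfile(fileext = ".csv")
dest <- tempfile(fileext = ".csv")
writeLines(c("1,2", "2,3"), src)
repeat_file(src, dest, times = 10)
n_lines <- length(readLines(dest))
```

For the example in this answer, the equivalent call would be repeat_file("edgelist-small.csv", "edgelist.csv", 10).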
Next we load the bigtabulate package and read in the text file with our edgelist. The command read.big.matrix() creates a file-backed object in R.
library(bigmemory)    # provides read.big.matrix()
library(bigtabulate)  # provides bigtable()
x <- read.big.matrix("edgelist.csv", header = FALSE,
                     type = "integer", sep = ",",
                     backingfile = "edgelist.bin",
                     descriptorfile = "edgelist.desc")
nrow(x)  # 1e7 as expected
We can compute the outdegrees by using bigtable() on the first column.
outdegree <- bigtable(x,1)
head(outdegree)
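Since the stated goal is a degree distribution, the per-node counts can be tabulated once more to get the number of nodes at each degree. The sketch below uses base table() on a tiny made-up vector so that it runs without bigtabulate; it assumes that bigtable(x, 1) likewise yields one count per distinct sender.

```r
# Stand-in data (made up for illustration): sender ids of six edges.
sender <- c(1, 1, 2, 3, 3, 4)
outdegree <- table(sender)            # per-node outdegree, like bigtable(x, 1)

# Tabulate the counts themselves: how many nodes have outdegree 1, 2, ...
degree_dist <- table(as.vector(outdegree))
degree_dist
```

Nodes that never appear as a sender have outdegree zero and are absent from this table, so the zero class has to be added separately if it matters.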
Quick sanity check to make sure bigtable() is working as expected:
# Check table worked as expected for first "node"
j <- as.numeric(names(outdegree[1]))         # get name of first node
all.equal(as.numeric(outdegree[1]),          # outdegree's answer
          sum(x[, 1] == j))                  # manual outdegree count
To get indegree, just do bigtable(x,2).
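If installing extra packages is not an option, the same degree counts can also be obtained in base R by streaming the edgelist in chunks, so that only one chunk of lines and one counter per node are in memory at a time. This is a minimal sketch rather than the approach used above. The function name chunked_outdegree is hypothetical, and it assumes a comma-separated file with no header and integer node ids in 1..n_nodes.

```r
# Hypothetical helper (an assumption, not from the original answer):
# count outdegrees by reading a CSV edgelist chunk by chunk.
chunked_outdegree <- function(path, n_nodes, chunk_rows = 1e6) {
  con <- file(path, open = "r")
  on.exit(close(con))
  counts <- numeric(n_nodes)          # doubles, to avoid integer overflow
  repeat {
    lines <- readLines(con, n = chunk_rows)
    if (length(lines) == 0) break     # end of file reached
    # Keep only the first field (the sender id) of each line
    senders <- as.integer(sub(",.*$", "", lines))
    counts <- counts + tabulate(senders, nbins = n_nodes)
  }
  counts
}

# Small demonstration: edges 1->2, 1->3, 2->3
tmp <- tempfile(fileext = ".csv")
writeLines(c("1,2", "1,3", "2,3"), tmp)
deg <- chunked_outdegree(tmp, n_nodes = 3, chunk_rows = 2)
```

With 100 million nodes the counter vector alone takes about 800MB as doubles, which is the main memory cost of this sketch. Indegrees follow the same pattern with the second field extracted instead of the first.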