
Social graph analysis. 60GB and 100 million nodes

Good evening,

I am trying to analyse the aforementioned data (edgelist or pajek format). My first thought was R with the igraph package, but my memory limit (6GB) won't do the trick. Will a 128GB PC be able to handle the data? Are there any alternatives that don't require the whole graph in RAM?

Thanks in advance.

PS: I have found several programs but I would like to hear some pro (yeah, that's you) opinions on the matter.

If you only want degree distributions, you likely don't need a graph package at all. I recommend the bigtabulate package so that

  1. your R objects are file-backed, so you aren't limited by RAM
  2. you can parallelize the degree computation using foreach

Check out their website for more details. To give a quick example of this approach, let's first create an example edgelist with 1 million edges among 1 million nodes.

set.seed(1)
N <- 1e6
M <- 1e6
edgelist <- cbind(sample(1:N,M,replace=TRUE),
                  sample(1:N,M,replace=TRUE))
colnames(edgelist) <- c("sender","receiver")
write.table(edgelist,file="edgelist-small.csv",sep=",",
            row.names=FALSE,col.names=FALSE)

I next concatenate this file 10 times to make the example a bit bigger.

system("
for i in $(seq 1 10) 
do 
  cat edgelist-small.csv >> edgelist.csv 
done")
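The system() call above relies on a POSIX shell, so it will not run as-is on a stock Windows setup. A minimal base R sketch of the same concatenation follows; the helper name concat_copies is hypothetical and not part of any package, and the demo uses temporary files rather than the files from the example.

```r
# Hypothetical helper: append `times` copies of `src` to a fresh `dest`.
concat_copies <- function(src, dest, times = 10) {
  if (file.exists(dest)) file.remove(dest)  # avoid growing the file on reruns
  file.create(dest)
  for (i in seq_len(times)) {
    ok <- file.append(dest, src)            # appends src to the end of dest
    if (!ok) stop("could not append ", src, " to ", dest)
  }
  invisible(dest)
}

# Small self-contained demo on temporary files.
src  <- tempfile(fileext = ".csv")
dest <- tempfile(fileext = ".csv")
writeLines(c("1,2", "2,3", "3,1"), src)
concat_copies(src, dest, times = 10)
n_lines <- length(readLines(dest))  # 3 lines x 10 copies
```

For the files used in this answer, the call would be concat_copies("edgelist-small.csv", "edgelist.csv", 10).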

Next we load the bigtabulate package and read in the text file with our edgelist. The command read.big.matrix() creates a file-backed object in R.

library(bigmemory)    # provides read.big.matrix()
library(bigtabulate)  # provides bigtable()
x <- read.big.matrix("edgelist.csv", header = FALSE,
                     type = "integer", sep = ",",
                     backingfile = "edgelist.bin",
                     descriptorfile = "edgelist.desc")
nrow(x)  # 1e7 as expected

We can compute the outdegrees by using bigtable() on the first column.

outdegree <- bigtable(x,1)
head(outdegree)
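Since the question is about degree distributions, the per-node counts still need to be tabulated once more. The sketch below shows that step on a toy edge list, with base table() standing in for bigtable(); the variable names edges, outdeg and deg_dist are illustrative only. The same second step, table(as.vector(outdegree)), should apply to the result above, though that is an assumption about the shape of bigtable()'s return value.

```r
# Toy edge list: one row per edge, columns are sender and receiver.
edges <- cbind(sender   = c(1, 1, 1, 2, 2, 3, 4, 4),
               receiver = c(2, 3, 4, 3, 4, 4, 1, 2))

# Per-node outdegree (base table() standing in for bigtable(x, 1)).
outdeg <- table(edges[, "sender"])

# Degree distribution: how many nodes have each outdegree.
# Nodes that never appear as a sender (outdegree 0) are not counted here.
deg_dist <- table(as.vector(outdeg))
```

Note that nodes with degree zero never show up in the first table, so their count has to be derived separately from the total number of nodes.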

Quick sanity check to make sure the table is working as expected:

# Check table worked as expected for first "node"
j <- as.numeric(names(outdegree[1]))  # get name of first node
all.equal(as.numeric(outdegree[1]),   # outdegree's answer
          sum(x[,1]==j))              # manual outdegree count

To get indegree, just do bigtable(x,2).
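If installing bigtabulate is not an option, the same counts can be accumulated in base R by streaming the file in chunks, because degree counts simply add up across chunks. This is a different technique from the file-backed approach above, shown only as a rough sketch: the helper name count_degrees_chunked is hypothetical, and it assumes a two-column comma-separated file whose node IDs are integers in 1..n_nodes.

```r
# Hypothetical helper: count out- and indegree by streaming a two-column
# CSV edge list in chunks. Assumes node IDs are integers in 1..n_nodes.
count_degrees_chunked <- function(path, n_nodes, chunk_rows = 1e6) {
  out <- numeric(n_nodes)   # doubles avoid integer overflow on huge counts
  inn <- numeric(n_nodes)
  con <- file(path, open = "r")
  on.exit(close(con))
  repeat {
    # scan() continues from the current position of an open connection
    vals <- scan(con, what = integer(), sep = ",",
                 nlines = chunk_rows, quiet = TRUE)
    if (length(vals) == 0) break              # end of file reached
    m <- matrix(vals, ncol = 2, byrow = TRUE) # one row per edge
    out <- out + tabulate(m[, 1], nbins = n_nodes)
    inn <- inn + tabulate(m[, 2], nbins = n_nodes)
  }
  list(outdegree = out, indegree = inn)
}

# Small self-contained demo: 5 edges among 3 nodes, read 2 lines at a time.
path <- tempfile(fileext = ".csv")
writeLines(c("1,2", "1,3", "2,3", "3,1", "1,2"), path)
deg <- count_degrees_chunked(path, n_nodes = 3, chunk_rows = 2)
```

Only the two count vectors stay in memory. At 100 million nodes each numeric vector takes about 0.8GB (8 bytes per node), independent of the number of edges.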
