Build a directed network graph in R from a dataset

I'm having trouble creating a directed graph (with the igraph package) from my dataset (a data table of 10 columns) in R. The task is as follows: I need to build a directed (network) graph in which individual X is connected to individual Y if X invited Y to the platform. Ultimately, I need to identify the size of the longest chain in the network and calculate the clustering coefficient.

After filtering my dt, dt.users consists of the following 2 columns: user_id, inviter_id.

user_id: user identification
inviter_id: id of the user that invited this user to the platform

After cleaning the data (removing all NA values), I'm trying to make this work, but I'm not sure I'm doing it the right way, since my clustering coefficient is 0 (which seems very unlikely):

all.users <- dt.users[, list(inviter_id, user_id)]

g.invites.network <- graph.data.frame(all.users, directed = TRUE)

I've tried switching the direction of the connections, but I still get the same results in terms of diameter and clustering coefficient:

all.users <- dt.users[, list(user_id, inviter_id)]

My question is: is my directed graph wrong? If so, what am I doing wrong? I believe my answer is wrong because of the clustering coefficient of 0. To me, it seems very unlikely that no clusters form at all in this network. And should I keep ...list(inviter_id, user_id) instead of ...list(user_id, inviter_id)?

Sample data (40 rows):

dt.users <- data.table::data.table(
  inviter_id = c(4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 23L, 22L, 31L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 63L, 4L, 4L, 4L), 
  user_id = c(17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 32L, 35L, 36L, 37L, 38L, 39L, 40L, 41L, 42L, 43L, 44L, 45L, 46L, 47L, 49L, 50L, 51L, 52L, 53L, 54L, 55L, 56L, 58L, 59L, 60L, 64L, 71L, 75L, 76L, 78L)
)

Any help would be greatly appreciated!

At least for your sample data, 0 is the correct answer, and I suspect that this will always be true for your full data set because of the way it is constructed.

I assume that when you say you are computing the "clustering coefficient", you are computing transitivity(g.invites.network), which does give zero as the answer. According to the documentation:

This is simply the ratio of the triangles and the connected triples in the graph. For directed graph the direction of the edges is ignored.

Of course, I don't know for sure how your data was constructed, but it appears that only one individual gets "credit" for inviting any given user; that is, there are never two arrows coming into a vertex. Assuming that is true, and that no chain of invitations loops back on itself (which cannot happen if every inviter joined before the people they invited), your data will never have any triangles. Therefore, the "ratio of the triangles and the connected triples in the graph" will have a numerator of zero, so the result will always be zero.
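One way to check this reasoning is to confirm that no vertex has more than one incoming edge and that the graph contains no triangles. The following is only a minimal sketch: it assumes a hand-picked five-edge subset of the sample data, and the names invites, max_in_degree, n_triangles, is_forest and clustering are illustrative, not part of the original code.

```r
library(igraph)

# Assumption: a five-edge subset of the sample data (inviter first, invitee second)
invites <- data.frame(
  inviter_id = c(4L, 4L, 4L, 23L, 22L),
  user_id    = c(17L, 22L, 23L, 25L, 26L)
)
g <- graph_from_data_frame(invites, directed = TRUE)

# Largest number of inviters recorded for any single user
max_in_degree <- max(degree(g, mode = "in"))

# count_triangles() reports triangles per vertex, so each triangle appears 3 times
n_triangles <- sum(count_triangles(g)) / 3

# Ignoring direction, a graph with no cycles has (vertices - components) edges
is_forest <- ecount(g) == vcount(g) - components(g, mode = "weak")$no

# Global clustering coefficient (edge direction is ignored)
clustering <- transitivity(g)
```

If max_in_degree is 1 and is_forest is TRUE on the full data as well, a clustering coefficient of 0 is the expected result rather than a sign of a mistake.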

This is obvious in the graph of your sample data.

plot(g.invites.network)

[Figure: network with no triangles]

Addition based on comments
There are two kinds of diameter to compute: directed and undirected. For your example data, the directed diameter is 2 and the undirected diameter is 4.

diameter(g.invites.network)
[1] 2
diameter(g.invites.network, directed=FALSE)
[1] 4
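This also suggests why swapping the two columns did not change anything in the question: reversing every edge reverses every directed path without changing its length, and transitivity ignores direction altogether. As far as I know, graph_from_data_frame treats the first column as the source and the second as the target, so list(inviter_id, user_id) gives edges that point from inviter to invitee. A small sketch, assuming the same hypothetical five-edge subset of the sample data (the object names are illustrative):

```r
library(igraph)

# Assumption: a five-edge subset of the sample data
invites <- data.frame(
  inviter_id = c(4L, 4L, 4L, 23L, 22L),
  user_id    = c(17L, 22L, 23L, 25L, 26L)
)

# First column is the source, second column is the target
g_forward  <- graph_from_data_frame(invites[, c("inviter_id", "user_id")], directed = TRUE)
g_reversed <- graph_from_data_frame(invites[, c("user_id", "inviter_id")], directed = TRUE)

# Directed diameter of each orientation
d_forward  <- diameter(g_forward)
d_reversed <- diameter(g_reversed)

# Global clustering coefficient of each orientation
t_forward  <- transitivity(g_forward)
t_reversed <- transitivity(g_reversed)
```

Both orientations give the same diameter and the same clustering coefficient, so identical results after switching the columns do not by themselves indicate an error.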

You can get the vertices that make up these paths using get_diameter:

get_diameter(g.invites.network)
+ 3/43 vertices, named:
[1] 4  23 25
get_diameter(g.invites.network, directed=FALSE)
+ 5/43 vertices, named:
[1] 25 23 4  22 26
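Note that diameter counts edges, while get_diameter returns vertices. If "size of the longest chain" in the question means the number of users in the chain, it would be the diameter plus one. A brief sketch, again assuming a hypothetical five-edge subset of the sample data and illustrative variable names:

```r
library(igraph)

# Assumption: a five-edge subset of the sample data
invites <- data.frame(
  inviter_id = c(4L, 4L, 4L, 23L, 22L),
  user_id    = c(17L, 22L, 23L, 25L, 26L)
)
g <- graph_from_data_frame(invites, directed = TRUE)

# Number of edges on the longest shortest directed path
diam_edges <- diameter(g, directed = TRUE)

# Vertices along one such path (ties are possible, so only the length is used)
diam_path <- get_diameter(g, directed = TRUE)

# Number of users in the longest invitation chain
chain_size <- length(diam_path)
```

Because several paths can tie for the longest length, get_diameter returns only one of them.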

To subset the graph to get an idea of the diameters, you can use induced_subgraph. For example, to get just those nodes:

DiamPath =  get_diameter(g.invites.network, directed=FALSE)
DiameterGraph = induced_subgraph(g.invites.network, DiamPath)
plot(DiameterGraph)

[Figure: just the diameter vertices]

Or, if you want to look at the diameter in context, you could color the diameter vertices differently.

DiamPath =  get_diameter(g.invites.network, directed=FALSE)
VC = rep("orange", vcount(g.invites.network))
VC[DiamPath] = "red"
plot(g.invites.network, vertex.color=VC)
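The same idea can be extended to the edges along the diameter path. This is only a sketch under assumptions: it uses a hypothetical five-edge subset of the sample data, the names EC and diam_edges are illustrative, and it relies on the path argument of E() with directed = FALSE so that edges are matched regardless of their orientation.

```r
library(igraph)

# Assumption: a five-edge subset of the sample data
invites <- data.frame(
  inviter_id = c(4L, 4L, 4L, 23L, 22L),
  user_id    = c(17L, 22L, 23L, 25L, 26L)
)
g <- graph_from_data_frame(invites, directed = TRUE)

# Vertices on the undirected diameter path
DiamPath <- get_diameter(g, directed = FALSE)

# Vertex colors: red on the diameter path, orange elsewhere
VC <- rep("orange", vcount(g))
names(VC) <- V(g)$name
VC[as.integer(DiamPath)] <- "red"

# Edges along that path, ignoring edge direction when matching
diam_edges <- E(g, path = DiamPath, directed = FALSE)

# Edge colors: red on the diameter path, grey elsewhere
EC <- rep("grey", ecount(g))
EC[as.integer(diam_edges)] <- "red"
```

Passing both vectors to the plot call, as in plot(g, vertex.color = VC, edge.color = EC), should draw the whole diameter path in red.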

[Figure: the diameter within the full graph]
