
Largest strongly connected components of a directed graph

I am working on a NetworkX .MultiDiGraph() object built from a total of 82,927 directed email records. At the current stage, I am trying to get the largest strongly connected component of the .MultiDiGraph() object, along with its corresponding subgraph. The text data can be accessed here. Here's my working code:

import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt

email_df = pd.read_csv('email_network.txt', delimiter = '->')
edge_groups = email_df.groupby(["#Sender", "Recipient"], as_index=False).count().rename(columns={"time":"weight"})
email = nx.from_pandas_dataframe(edge_groups, '#Sender', 'Recipient', edge_attr = 'weight')
G = nx.MultiDiGraph()
G.add_edges_from(email.edges(data=True))
# G is a .MultiDiGraph object

# using .strongly_connected_components() to get the part of G that has the most nodes
# using list comprehension
number_of_nodes = [len(n) for n in sorted(nx.strongly_connected_components(G))]
number_of_nodes
# 'number_of_nodes' return a list of [1, 1, 1,...,1] of length 167 (which is the exact number of nodes in the network)

# using the recommended method in networkx documentation
largest = max(nx.strongly_connected_components(G), key=len)
largest
# 'largest' returns {92}, not sure what this means...

As I noted in the code block above, the list comprehension returns a list of [1, 1, 1,..., 1] of length 167 (which is the total number of nodes in my data), while max(nx.strongly_connected_components(G), key=len) returns {92}. I am not sure what this means.

It looks like there's something wrong with my code, and I might have missed several key steps in processing the data. Could anyone take a look and enlighten me on this?

Thank you.

Note: Revised code (kudos to Eric and Joel)

import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt

email_df = pd.read_csv('email_network.txt', delimiter = ' ')
edge_groups = email_df.groupby(["#Sender", "Recipient"], as_index=False).count().rename(columns={"time":"weight"})

# per @Joel's comment, adding 'create_using = nx.DiGraph()'
email = nx.from_pandas_dataframe(edge_groups, '#Sender', 'Recipient', edge_attr = 'weight', create_using = nx.DiGraph())

# adding this 'directed' edge list to .MultiDiGraph() object
G = nx.MultiDiGraph()
G.add_edges_from(email.edges(data=True))

We now examine the largest strongly connected component (in terms of the number of nodes) in this network.

In [1]: largest = max(nx.strongly_connected_components(G), key=len)
In [2]: len(largest)
Out [2]: 126

The largest strongly connected component consists of 126 nodes.
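The question also asks for the subgraph that corresponds to this component. As far as I know, newer NetworkX releases no longer ship the component-subgraph helpers, so a minimal sketch is to pass the node set to G.subgraph(). The edges below are hypothetical toy data, not the email network.

```python
import networkx as nx

# Hypothetical toy edges: 1 and 2 email each other (1 -> 2 twice), 2 -> 3 once.
G = nx.MultiDiGraph()
G.add_edges_from([(1, 2), (1, 2), (2, 1), (2, 3)])

# Node set of the largest strongly connected component.
largest = max(nx.strongly_connected_components(G), key=len)

# Induced subgraph on those nodes; .copy() makes it independent of G.
G_sc = G.subgraph(largest).copy()

print(sorted(largest), G_sc.number_of_nodes(), G_sc.number_of_edges())
```

Because G is a MultiDiGraph, the induced subgraph keeps both parallel 1 -> 2 edges.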

[Updates] After further trial and error, I found that you need to use create_using = nx.MultiDiGraph() (instead of nx.DiGraph()) when loading the data into networkx. Otherwise, even if you get the correct number of nodes for your MultiDiGraph and its weakly/strongly connected subgraphs, you might still get the number of edges wrong! This will be reflected in your .strongly_connected_component_subgraphs() outputs.

For my case here, I would recommend this one-liner:

import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt

G = nx.read_edgelist(path="email_network.txt", data=[('time', int)], create_using=nx.MultiDiGraph(), nodetype=str)
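To see what this call does on a small input, here is a sketch that writes a hypothetical three-record sample to a temporary file and reads it back. The whitespace-separated layout and the "#Sender Recipient time" header are assumptions about email_network.txt; read_edgelist treats lines starting with "#" as comments by default, so the header is skipped.

```python
import os
import tempfile

import networkx as nx

# Hypothetical sample; the real file's exact layout is an assumption.
sample = (
    "#Sender\tRecipient\ttime\n"
    "1\t2\t1262454010\n"
    "1\t2\t1262454016\n"
    "2\t1\t1262604260\n"
)

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write(sample)
    path = handle.name

try:
    G = nx.read_edgelist(
        path,
        data=[("time", int)],              # third column becomes the 'time' attribute
        create_using=nx.MultiDiGraph(),    # keep direction and repeated emails
        nodetype=str,
    )
finally:
    os.remove(path)

n_nodes = G.number_of_nodes()
n_edges = G.number_of_edges()
print(n_nodes, n_edges)
```

Each line of the file becomes its own edge, so two emails from 1 to 2 are stored as two parallel edges.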

We can then use .strongly_connected_components(G) and .strongly_connected_component_subgraphs(G) to verify.

If you use the networkx graph G from the revised code block above (built with create_using = nx.DiGraph()), the largest strongly connected subgraph has 126 nodes and 52xx-something edges, but if you apply the one-liner I listed above, you will get:

In [1]: largest = max(nx.strongly_connected_components(G), key=len)
In [2]: G_sc = max(nx.strongly_connected_component_subgraphs(G), key=len)
In [3]: nx.number_of_nodes(G_sc)
Out [3]: 126
In [4]: nx.number_of_edges(G_sc)
Out [4]: 82130

Both methods give the same number of nodes but a different number of edges, because the networkx graph classes count edges differently: a DiGraph keeps at most one edge per ordered pair of nodes, while a MultiDiGraph keeps every parallel edge.
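A toy illustration of the edge-count difference, using made-up edges rather than the email data: the same edge list is loaded into a DiGraph and into a MultiDiGraph.

```python
import networkx as nx

# Hypothetical records: two emails from 1 to 2, one reply from 2 to 1.
emails = [(1, 2), (1, 2), (2, 1)]

simple = nx.DiGraph()
simple.add_edges_from(emails)      # repeated (1, 2) collapses into one edge

multi = nx.MultiDiGraph()
multi.add_edges_from(emails)       # every record is kept as its own edge

simple_edges = simple.number_of_edges()
multi_edges = multi.number_of_edges()
print(simple_edges, multi_edges)
```

The node sets and the strongly connected components are identical in both graphs; only the edge counts differ.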

The underlying cause of your error is that nx.from_pandas_dataframe defaults to creating an undirected graph, so email is an undirected graph. When you then create the directed graph from its edges, each edge appears in only one direction.
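A minimal sketch of this effect with hypothetical edges. It builds the undirected graph directly with nx.Graph() instead of going through pandas, which is an assumption made to keep the example short. In the NetworkX versions I am aware of, an undirected graph reports each edge once, oriented from the earlier-added node to the later one, so the rebuilt directed graph ends up with no cycles and every component is a single node.

```python
import networkx as nx

# Hypothetical toy data: 1 and 2 email each other, 2 -> 3, 3 -> 1.
emails = [(1, 2), (2, 1), (2, 3), (3, 1)]

# Stand-in for the undirected graph the question built by default.
undirected = nx.Graph()
undirected.add_edges_from(emails)

# Rebuilding a directed graph from undirected edges loses the real directions.
lost = nx.MultiDiGraph()
lost.add_edges_from(undirected.edges(data=True))

# Loading the same records as directed edges from the start.
directed = nx.MultiDiGraph()
directed.add_edges_from(emails)

sizes_lost = sorted(len(c) for c in nx.strongly_connected_components(lost))
sizes_directed = sorted(len(c) for c in nx.strongly_connected_components(directed))
print(sizes_lost, sizes_directed)
```

With the directions intact, the three nodes form one strongly connected component; after the round trip through the undirected graph, they fall apart into singletons, which matches the list of 1s in the question.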

To fix it, use nx.from_pandas_dataframe with the argument create_using = nx.DiGraph().
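A sketch of the fix on hypothetical data, assuming NetworkX 2.x or later, where this function is named from_pandas_edgelist. The column names follow the question; the rows are made up.

```python
import networkx as nx
import pandas as pd

# Hypothetical rows standing in for email_network.txt.
email_df = pd.DataFrame(
    {
        "#Sender": [1, 2, 2, 3],
        "Recipient": [2, 1, 3, 1],
        "time": [10, 20, 30, 40],
    }
)

# Count emails per (sender, recipient) pair, as in the question.
edge_groups = (
    email_df.groupby(["#Sender", "Recipient"], as_index=False)
    .count()
    .rename(columns={"time": "weight"})
)

# create_using makes the result directed instead of the undirected default.
email = nx.from_pandas_edgelist(
    edge_groups, "#Sender", "Recipient", edge_attr="weight", create_using=nx.DiGraph()
)

largest = max(nx.strongly_connected_components(email), key=len)
print(email.is_directed(), len(largest))
```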


Older comments related to the output you were getting:

All your strongly connected components have a single node.

When you do max(nx.strongly_connected_components(G), key=len), it finds the set of nodes with the greatest length and returns it. In your case, they all have length 1, so it returns one of them (I believe whichever one networkx happened to put into nx.strongly_connected_components(G) first). But it's returning the set, not the length. So {92} is the set of nodes it is returning.

It happens that {92} was chosen by the tiebreaker as the "longest" length-1 component in nx.strongly_connected_components(G).

Example:

max([{1}, {3}, {5}], key = len)
> {1}
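To get sizes rather than the winning set, take len() of each component. A small plain-Python sketch with hypothetical components (no networkx needed):

```python
# Hypothetical singleton components, in the order a generator might yield them.
components = [{92}, {17}, {5}]

# max() returns the first item among those tied for the largest key.
largest = max(components, key=len)
largest_size = len(largest)

# Sizes of all components, largest first, which is easier to inspect.
sizes = sorted((len(c) for c in components), reverse=True)

# Once a bigger component exists, it wins regardless of position.
components.append({1, 2, 3})
new_largest = max(components, key=len)

print(largest, largest_size, sizes, new_largest)
```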

"[1, 1, 1,...,1] of length 167 (which is the exact number of nodes in the network)"

This means that there's basically no strongly connected component in your graph (except for lone vertices, that is).

If you sort those components by length, you get a random component consisting of a single vertex, since the components all have the same length (1). In your example that is {92}, but it could have been any other vertex.

The import looks correct, so if there really is no strongly connected component, it means that nobody ever replied to any email.

To check whether the problem comes from pandas, MultiDiGraph or your import, I wrote:

import networkx as nx

G = nx.DiGraph()

with open('email_network.txt') as f:
    for line in f:
        n1, n2, time = line.split()
        # skip the header line, whose first field is not numeric
        if n1.isdigit():
            G.add_edge(int(n1), int(n2))

It didn't change the result.

Just adding an edge with G.add_edge(2,1) creates a large strongly connected component, though:

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 126, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 115, 117, 118, 119, 120, 121, 122, 123, 124, 128, 129, 134, 149, 151}
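Why one extra edge can have such a large effect: two nodes are in the same strongly connected component only if each can reach the other, so a single back edge can close a cycle through many nodes. The sketch below is a deliberately simple plain-Python illustration based on mutual reachability. It is not networkx's algorithm, and the helper names and edges are hypothetical.

```python
from collections import defaultdict


def reachable(adj, start):
    """Return every node reachable from start by following adj."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def scc_by_reachability(edges):
    """Group nodes that can reach each other (simple, not optimized)."""
    fwd, rev = defaultdict(set), defaultdict(set)
    nodes = set()
    for u, v in edges:
        fwd[u].add(v)
        rev[v].add(u)
        nodes.update((u, v))
    components, assigned = [], set()
    for node in sorted(nodes):
        if node in assigned:
            continue
        # Nodes reachable from `node` that can also reach `node`.
        comp = reachable(fwd, node) & reachable(rev, node)
        components.append(comp)
        assigned |= comp
    return components


one_way = [(1, 2), (2, 3), (3, 4)]          # a chain with no replies
sizes_before = sorted(len(c) for c in scc_by_reachability(one_way))
sizes_after = sorted(len(c) for c in scc_by_reachability(one_way + [(4, 1)]))
print(sizes_before, sizes_after)
```

In the chain, every node is its own component; adding the single edge 4 -> 1 closes the cycle and merges all four nodes into one component.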
