How to efficiently calculate triad census in undirected graph in python

I am calculating the triad census for my undirected network as follows.

import networkx as nx
G = nx.Graph()
G.add_edges_from(
    [('A', 'B'), ('A', 'C'), ('D', 'B'), ('E', 'C'), ('E', 'F'),
     ('B', 'H'), ('B', 'G'), ('B', 'F'), ('C', 'G')])

from itertools import combinations
#print(len(list(combinations(G.nodes, 3))))

triad_class = {}
for nodes in combinations(G.nodes, 3):
    n_edges = G.subgraph(nodes).number_of_edges()
    triad_class.setdefault(n_edges, []).append(nodes)
print(triad_class)

It works fine with small networks. However, now I have a bigger network with approximately 4000-8000 nodes. When I try to run my existing code with a network of 1000 nodes, it takes days to run. Is there a more efficient way of doing this?

My current network is mostly sparse, i.e. there are only a few connections among the nodes. In that case, can I leave out the unconnected nodes, do the computation first, and later add the unconnected nodes to the output?

I am also happy to get approximate answers without calculating every combination.

Example of triad census:

The triad census divides the triads (sets of 3 nodes) into the four categories shown in the figure below.

[Figure: the four triad census classes]

For example, consider the network below.

[Figure: example network]

The triad census of the four classes is:

{3: [('A', 'B', 'C')], 
2: [('A', 'B', 'D'), ('B', 'C', 'D'), ('B', 'D', 'E')], 
1: [('A', 'B', 'E'), ('A', 'B', 'F'), ('A', 'B', 'G'), ('A', 'C', 'D'), ('A', 'C', 'E'), ('A', 'C', 'F'), ('A', 'C', 'G'), ('A', 'D', 'E'), ('A', 'F', 'G'), ('B', 'C', 'E'), ('B', 'C', 'F'), ('B', 'C', 'G'), ('B', 'D', 'F'), ('B', 'D', 'G'), ('B', 'F', 'G'), ('C', 'D', 'E'), ('C', 'F', 'G'), ('D', 'E', 'F'), ('D', 'E', 'G'), ('D', 'F', 'G'), ('E', 'F', 'G')], 
0: [('A', 'D', 'F'), ('A', 'D', 'G'), ('A', 'E', 'F'), ('A', 'E', 'G'), ('B', 'E', 'F'), ('B', 'E', 'G'), ('C', 'D', 'F'), ('C', 'D', 'G'), ('C', 'E', 'F'), ('C', 'E', 'G')]}

I am happy to provide more details if needed.

EDIT:

I was able to resolve the memory error by commenting out the line #print(len(list(combinations(G.nodes, 3)))) as suggested in the answer. However, my program is still slow and takes days to run even with a network of 1000 nodes. I am looking for a more efficient way of doing this in python.

I am not limited to networkx and am happy to accept answers using other libraries and languages as well.

As always, I am happy to provide more details as needed.

Let's check the numbers. Let n be the number of vertices and e the number of edges.

0 triads are in O(n^3)

1 triads are in O(e * n)

2 + 3 triads are in O(e) for sparse graphs with bounded degree (O(e * max(# neighbors)) in general)

To get the 2 + 3 triads:

For every node a:
   For every neighbor b of a:
      For every neighbor c of b, c != a:
        if a and c are connected, [a b c] is a 3 triad
        else [a b c] is a 2 triad
        record the triad as a sorted tuple in a set (to avoid duplicate triads)
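The loop above can be sketched in plain Python as follows. This is only a minimal sketch under a few assumptions: the graph is given as a dictionary mapping each node to the set of its neighbors (a hypothetical input format, not something the answer prescribes), nodes can be sorted, and duplicates are avoided by storing sorted tuples in sets. The function name connected_triads is made up for illustration. The edge list is the one used for the pictured example in the code further down this page.

```python
def connected_triads(adj):
    """Return (two_edge, three_edge) sets of triads of an undirected graph.

    adj maps each node to the set of its neighbors (assumed input format).
    """
    two, three = set(), set()
    for a, neighbors_a in adj.items():
        for b in neighbors_a:
            for c in adj[b]:
                if c == a:
                    continue  # a-b-a is not a triad
                triad = tuple(sorted((a, b, c)))  # canonical form removes duplicates
                if c in neighbors_a:
                    three.add(triad)  # a-b, b-c and a-c all exist
                else:
                    two.add(triad)    # path a-b-c without the edge a-c
    return two, three


# Edge list of the pictured example network
edges = [("E", "D"), ("G", "F"), ("D", "B"), ("B", "A"), ("B", "C"), ("A", "C")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

two, three = connected_triads(adj)
print(sorted(three))
print(sorted(two))
```

The work done is one step per (edge, neighbor of an endpoint) pair, so it never touches triples of unconnected nodes.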

The next step depends on what the goal is. If you just need the number of 1 and 0 triads, then this is sufficient:

#(1 triads) = e * (n - 2) - 2 * #(2 triads) - 3 * #(3 triads)

#(0 triads) = {n \choose 3} - #(3 triads) - #(2 triads) - #(1 triads)

Explanation:

The 1 triads are all pairs of connected nodes + 1 unconnected node, so we get the number by counting every edge together with every other node, e * (n - 2), and subtracting the cases where the other node is connected (2 and 3 triads). Each 2 triad is counted once per edge, i.e. twice, and each 3 triad three times, so they are subtracted with those weights.

The 0 triads are just all combinations of nodes minus the other triads.
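As a quick sanity check, the counting step can be written out in a few lines of Python. This is a sketch, not the answerer's code; the function name triad_counts is hypothetical. Note that pairing every edge with every other node counts a 2 triad twice and a 3 triad three times, so the sketch subtracts them with those weights. The numbers plugged in are those of the pictured example (7 nodes, 6 edges, three 2 triads, one 3 triad).

```python
from math import comb


def triad_counts(n, e, n_two, n_three):
    """Derive the 1-triad and 0-triad counts from n, e and the 2/3-triad counts."""
    # e * (n - 2) = #(1 triads) + 2 * #(2 triads) + 3 * #(3 triads)
    n_one = e * (n - 2) - 2 * n_two - 3 * n_three
    # whatever is left of all C(n, 3) node triples has no edge at all
    n_zero = comb(n, 3) - n_three - n_two - n_one
    return {0: n_zero, 1: n_one, 2: n_two, 3: n_three}


counts = triad_counts(n=7, e=6, n_two=3, n_three=1)
print(counts)  # {0: 10, 1: 21, 2: 3, 3: 1}
```

These counts agree with the lengths of the four lists in the question's example output.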

If you need to actually list the triads, you are pretty much out of luck because no matter what you do, listing the 0 triads is in O(n^3) and will kill you once the graphs get bigger.

The above algorithm for 2 + 3 triads is in O(e * max(# neighbors)); the other parts are in O(e + n) for counting the nodes and edges. That is much better than the O(n^3) you would need to explicitly list the 0 triads. Listing the 1 triads could still be done in O(e * n).
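Listing the 1 triads in O(e * n) could look roughly like the sketch below: visit every edge once and pair it with every node that is adjacent to neither endpoint. The assumptions are the same as for any plain-Python version: a dictionary of neighbor sets as input (isolated nodes must appear as keys with an empty set), nodes that can be compared with <, and a made-up function name one_edge_triads.

```python
def one_edge_triads(adj):
    """List all triads that contain exactly one edge.

    adj maps each node to the set of its neighbors (assumed input format).
    """
    nodes = list(adj)
    result = []
    for u in adj:
        for v in adj[u]:
            if not u < v:
                continue  # visit each undirected edge only once
            for w in nodes:
                if w in (u, v):
                    continue
                if w not in adj[u] and w not in adj[v]:
                    # a 1 triad has exactly one edge, so no duplicates arise
                    result.append(tuple(sorted((u, v, w))))
    return result


edges = [("E", "D"), ("G", "F"), ("D", "B"), ("B", "A"), ("B", "C"), ("A", "C")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

ones = one_edge_triads(adj)
print(len(ones))
```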

The idea is simple: instead of working on the graph directly, I use the adjacency matrix. I thought this would be more efficient, and it seems I was right.

[Figure: example adjacency matrix]

In an adjacency matrix a 1 indicates there is an edge between the two nodes; for example, the first row can be read as "There is a link between A and B as well as between A and C".

From there I looked at your four types and found the following:

  • For type 3 there must be an edge between N1 and N2, between N1 and N3, and between N2 and N3. In the adjacency matrix we can find this by going over each row (each row represents a node and its connections; this is N1) and finding the nodes it is connected to (these are N2). Then, in the row of N2, we check all connected nodes (these are N3) and keep those that have a positive entry in the row of N1. An example of this is "A, B, C": A has a connection to B, B has a connection to C, and A also has a connection to C.

  • Type 2 works almost identically to type 3, except that now we want to find a 0 in the N3 column of the N1 row. An example of this is "A, B, D": A has a connection to B, and B has a 1 in the D column, but A does not.

  • For type 1 we just look at the row of N2 (a node connected to N1) and find all columns for which both the N1 row and the N2 row have a 0.

  • Lastly, for type 0, look at all columns in the N1 row for which the entry is 0, then check the rows of those nodes, and find all the columns that have a 0 there as well.

This code should work for you. For 1000 nodes it took me about 7 minutes (on a machine with an i7-8565U CPU), which is still relatively slow but a far cry from the multiple days your current solution takes. I have included the example from your pictures so you can verify the results. By the way, your code produces a graph that is different from the example you show. The example graph in the code and the adjacency matrix both refer to the picture you have included.

The example with 1000 nodes uses networkx.generators.random_graphs.fast_gnp_random_graph. 1000 is the number of nodes, 0.1 is the probability for edge creation, and the seed is just for consistency. I have set the probability for edge creation to 0.1 because you mentioned your graph is sparse.

networkx.linalg.graphmatrix.adjacency_matrix: "If you want a pure Python adjacency matrix representation try networkx.convert.to_dict_of_dicts which will return a dictionary-of-dictionaries format that can be addressed as a sparse matrix."

The dictionary structure has M dictionaries (= rows) with up to M dictionaries nested in them. Note that the nested dictionaries are empty, so checking for the existence of a key in a row is equivalent to checking for a 1 or 0 as described above.
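To make the dictionary layout concrete, here is a small sketch that builds the pictured example graph and inspects one row. It uses only the calls already named above; the variable names are arbitrary.

```python
import networkx as nx

g = nx.Graph()
g.add_edges_from(
    [("E", "D"), ("G", "F"), ("D", "B"), ("B", "A"), ("B", "C"), ("A", "C")]
)
m = nx.convert.to_dict_of_dicts(g)

# One row per node; the keys of a row are that node's neighbors and the
# values are the (here empty) edge-attribute dictionaries.
print(m["A"])

# Key membership plays the role of the 1/0 entry in the adjacency matrix.
print("B" in m["A"])  # True: A and B are connected
print("D" in m["A"])  # False: A and D are not connected
```

Because each row is a dictionary, such a membership test takes constant time on average, which is what the function below relies on.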

import time

import networkx as nx


def triads(m):
    out = {0: set(), 1: set(), 2: set(), 3: set()}
    nodes = list(m.keys())
    for i, (n1, row) in enumerate(m.items()):
        print(f"--> Row {i + 1} of {len(m.items())} <--")
        # get all the connected nodes = existing keys
        for n2 in row.keys():
            # iterate over row of connected node
            for n3 in m[n2]:
                # n1 exists in this row, all 3 nodes are connected to each other = type 3
                if n3 in row:
                    if len({n1, n2, n3}) == 3:
                        t = tuple(sorted((n1, n2, n3)))
                        out[3].add(t)
                # n2 is connected to n1 and n3 but not n1 to n3 = type 2
                else:
                    if len({n1, n2, n3}) == 3:
                        t = tuple(sorted((n1, n2, n3)))
                        out[2].add(t)
            # n1 and n2 are connected, get all nodes not connected to either = type 1
            for n3 in nodes:
                if n3 not in row and n3 not in m[n2]:
                    if len({n1, n2, n3}) == 3:
                        t = tuple(sorted((n1, n2, n3)))
                        out[1].add(t)
        for j, n2 in enumerate(nodes):
            if n2 not in row:
                # n2 not connected to n1
                for n3 in nodes[j+1:]:
                    if n3 not in row and n3 not in m[n2]:
                        # n3 is not connected to n1 or n2 = type 0
                        if len({n1, n2, n3}) == 3:
                            t = tuple(sorted((n1, n2, n3)))
                            out[0].add(t)
    return out


if __name__ == "__main__":
    g = nx.Graph()
    g.add_edges_from(
        [("E", "D"), ("G", "F"), ("D", "B"), ("B", "A"), ("B", "C"), ("A", "C")]
    )
    _m = nx.convert.to_dict_of_dicts(g)
    _out = triads(_m)
    print(_out)

    start = time.time()
    g = nx.generators.fast_gnp_random_graph(1000, 0.1, seed=42)
    _m = nx.convert.to_dict_of_dicts(g)
    _out = triads(_m)
    end = time.time() - start
    print(end)

  1. Your program most probably crashes when you try to convert all combinations to a list: print(len(list(combinations(G.nodes, 3)))). Never do this: combinations returns an iterator that consumes only a small amount of memory, but list can easily eat gigabytes of memory.

  2. If you have a sparse graph, it is more reasonable to find triads in connected components: nx.connected_components(G)

  3. Networkx has a triads submodule, but it looks like it will not fit your needs. I already modified the networkx.algorithms.triads code to return triads, not their count. You can find it here. Note that it uses DiGraphs. If you want to use it with undirected graphs, you should convert them to directed first.
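If only the counts are needed, the unmodified networkx function may already be enough. The sketch below is an illustration of point 3, not the modified code linked above: it converts the undirected graph with to_directed() and calls nx.triadic_census. In a directed graph where every edge exists in both directions, only four of the sixteen triad types can occur, and the mapping from their names to the number of undirected edges ("003", "102", "201", "300" for 0, 1, 2, 3 edges) is spelled out in the code.

```python
import networkx as nx

g = nx.Graph()
g.add_edges_from(
    [("E", "D"), ("G", "F"), ("D", "B"), ("B", "A"), ("B", "C"), ("A", "C")]
)

# triadic_census expects a directed graph; to_directed() adds both directions
census = nx.triadic_census(g.to_directed())

# Triad names give the number of mutual, asymmetric and null pairs.
# With symmetric edges, the number of mutual pairs equals the number of undirected edges.
by_edges = {
    0: census["003"],
    1: census["102"],
    2: census["201"],
    3: census["300"],
}
print(by_edges)
```

This returns counts only; listing the member nodes of each triad still requires one of the enumeration approaches discussed on this page.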

import networkx as nx
from itertools import combinations


G = nx.Graph()
arr = [str(i) for i in range(1000)]

# connect every pair of nodes (complete graph on 1000 nodes)
G.add_edges_from(combinations(arr, 2))

#print(len(list(combinations(G.nodes, 3))))
# the list index is the number of edges in the triad
triad_class = [[], [], [], []]

for nodes in combinations(G.nodes, 3):
    n_edges = G.subgraph(nodes).number_of_edges()
    triad_class[n_edges].append(nodes)


print(triad_class)

I think inserting into a list would be faster than inserting into a dictionary, as the dictionary grows exponentially and will take more time.
