
Hash value for directed acyclic graph

How do I transform a directed acyclic graph into a hash value such that any two isomorphic graphs hash to the same value? It is acceptable, but undesirable, for two isomorphic graphs to hash to different values, which is what I have done in the code below. We can assume that the number of vertices in the graph is at most 11.

I am particularly interested in Python code.

Here is what I did. If self.lt is a mapping from node to descendants (not children!), then I relabel the nodes according to a modified topological sort (that prefers to order elements with more descendants first if it can). Then I hash the sorted dictionary. Some isomorphic graphs will hash to different values, especially as the number of nodes grows.

I have included all the code to motivate my use case. I am calculating the number of comparisons required to find the median of 7 numbers. The more isomorphic graphs that hash to the same value, the less work has to be redone. I considered putting larger connected components first, but didn't see how to do that quickly.

from tools.decorator import memoized  # A standard memoization decorator


class Graph:
    def __init__(self, n):
        self.lt = {i: set() for i in range(n)}

    def compared(self, i, j):
        return j in self.lt[i] or i in self.lt[j]

    def withedge(self, i, j):
        retval = Graph(len(self.lt))
        implied_lt = self.lt[j] | set([j])
        for (s, lt_s), (k, lt_k) in zip(self.lt.items(),
                                        retval.lt.items()):
            lt_k |= lt_s
            if i in lt_k or k == i:
                lt_k |= implied_lt
        return retval.toposort()

    def toposort(self):
        mapping = {}
        while len(mapping) < len(self.lt):
            for i, lt_i in self.lt.items():
                if i in mapping:
                    continue
                if any(i in lt_j or len(lt_i) < len(lt_j)
                       for j, lt_j in self.lt.items()
                       if j not in mapping):
                    continue
                mapping[i] = len(mapping)
        retval = Graph(0)
        for i, lt_i in self.lt.items():
            retval.lt[mapping[i]] = {mapping[j]
                                     for j in lt_i}
        return retval

    def median_known(self):
        n = len(self.lt)
        for i, lt_i in self.lt.items():
            if len(lt_i) != n // 2:
                continue
            if sum(1
                   for j, lt_j in self.lt.items()
                   if i in lt_j) == n // 2:
                return True
        return False

    def __repr__(self):
        return("[{}]".format(", ".join("{}: {{{}}}".format(
            i,
            ", ".join(str(x) for x in lt_i))
                                       for i, lt_i in self.lt.items())))

    def hashkey(self):
        return tuple(sorted({k: tuple(sorted(v))
                             for k, v in self.lt.items()}.items()))

    def __hash__(self):
        return hash(self.hashkey())

    def __eq__(self, other):
        return self.hashkey() == other.hashkey()


@memoized
def mincomps(g):
    print("Calculating:", g)
    if g.median_known():
        return 0
    nodes = g.lt.keys()
    return 1 + min(max(mincomps(g.withedge(i, j)),
                       mincomps(g.withedge(j, i)))
                   for i in nodes
                   for j in nodes
                   if j > i and not g.compared(i, j))


g = Graph(7)
print(mincomps(g))

To effectively test for graph isomorphism you will want to use nauty. Specifically for Python there is the wrapper pynauty, but I can't attest to its quality (to compile it correctly I had to do some simple patching on its setup.py). If this wrapper is doing everything correctly, then it simplifies nauty a lot for the uses you are interested in, and it is only a matter of hashing pynauty.certificate(somegraph), which will be the same value for isomorphic graphs.

Some quick tests showed that pynauty was giving the same certificate for every graph (with the same number of vertices). But that was only because of a minor issue in the wrapper when converting the graph to nauty's format. After fixing this, it works for me (I also used the graphs at http://funkybee.narod.ru/graphs.htm for comparison). Here is the short patch, which also includes the modifications needed in setup.py:

diff -ur pynauty-0.5-orig/setup.py pynauty-0.5/setup.py
--- pynauty-0.5-orig/setup.py   2011-06-18 20:53:17.000000000 -0300
+++ pynauty-0.5/setup.py        2013-01-28 22:09:07.000000000 -0200
@@ -31,7 +31,9 @@

 ext_pynauty = Extension(
         name = MODULE + '._pynauty',
-        sources = [ pynauty_dir + '/' + 'pynauty.c', ],
+        sources = [ pynauty_dir + '/' + 'pynauty.c',
+            os.path.join(nauty_dir, 'schreier.c'),
+            os.path.join(nauty_dir, 'naurng.c')],
         depends = [ pynauty_dir + '/' + 'pynauty.h', ],
         extra_compile_args = [ '-O4' ],
         extra_objects = [ nauty_dir + '/' + 'nauty.o',
diff -ur pynauty-0.5-orig/src/pynauty.c pynauty-0.5/src/pynauty.c
--- pynauty-0.5-orig/src/pynauty.c      2011-03-03 23:34:15.000000000 -0300
+++ pynauty-0.5/src/pynauty.c   2013-01-29 00:38:36.000000000 -0200
@@ -320,7 +320,7 @@
     PyObject *adjlist;
     PyObject *p;

-    int i,j;
+    Py_ssize_t i, j;
     int adjlist_length;
     int x, y;

Graph isomorphism for directed acyclic graphs is still GI-complete. Therefore there is currently no known (worst-case sub-exponential) solution to guarantee that two isomorphic directed acyclic graphs will yield the same hash. Only if the mapping between different graphs is known - for example, if all vertices have unique labels - could one efficiently guarantee matching hashes.

Okay, let's brute force this for a small number of vertices. We have to find a representation of the graph that is independent of the ordering of the vertices in the input and therefore guarantees that isomorphic graphs yield the same representation. Further, this representation must ensure that no two non-isomorphic graphs yield the same representation.

The simplest solution is to construct the adjacency matrix for all n! permutations of the vertices and just interpret each adjacency matrix as an n²-bit integer. Then we can pick the smallest (or largest) of these numbers as the canonical representation. This number completely encodes the graph and therefore ensures that no two non-isomorphic graphs yield the same number - one could consider this function a perfect hash function. And because we choose the smallest or largest number encoding the graph under all possible permutations of the vertices, we further ensure that isomorphic graphs yield the same representation.
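A minimal brute-force sketch of this idea (the edge-list input format and the function name are my own choices; it is only feasible for very small n):

```python
from itertools import permutations

def canonical_code(edges, n):
    # Encode the adjacency matrix as an n*n-bit integer for every
    # permutation of the vertices and keep the smallest result.
    # Isomorphic graphs always produce the same code, and distinct
    # codes always mean non-isomorphic graphs.
    best = None
    for perm in permutations(range(n)):
        code = 0
        for u, v in edges:
            code |= 1 << (perm[u] * n + perm[v])
        if best is None or code < best:
            best = code
    return best
```

For example, the path [(0, 1), (1, 2)] and its relabelling [(2, 0), (0, 1)] yield the same code, while the out-star [(0, 1), (0, 2)] yields a different one.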

How good or bad is this in the case of 11 vertices? Well, the representation will have 121 bits. We can reduce this by 11 bits because the diagonal, representing loops, will be all zeros in an acyclic graph, and we are left with 110 bits. This number could in theory be decreased further; not all 2^110 remaining graphs are acyclic, and for each graph there may be up to 11! - roughly 2^25 - isomorphic representations, but in practice this might be quite hard to exploit. Does anybody know how to compute the number of distinct directed acyclic graphs with n vertices?

How long will it take to find this representation? Naively, 11! or 39,916,800 iterations. This is not nothing and probably already impractical, but I did not implement and test it. We can probably speed this up a bit, though. If we interpret the adjacency matrix as an integer by concatenating the rows from top to bottom, left to right, we want many ones (zeros) at the left of the first row to obtain a large (small) number. Therefore we pick as first vertex the one (or one of the vertices) with the largest (smallest) degree (indegree or outdegree depending on the representation), and then put vertices connected (not connected) to this vertex in subsequent positions to bring the ones (zeros) to the left.

There are likely more possibilities to prune the search space, but I am not sure if there are enough to make this a practical solution. Maybe there are, or maybe somebody else can at least build something upon this idea.

How good does the hash have to be? I assume that you do not want a full serialization of the graph. A hash rarely guarantees that there is no second (but different) element (graph) that evaluates to the same hash. If it is very important to you that isomorphic graphs (in different representations) have the same hash, then only use values that are invariant under a change of representation. E.g.:

  • the total number of nodes
  • the total number of (directed) connections
  • the total number of nodes with (indegree, outdegree) = (i,j) for any tuple (i,j) up to (max(indegree), max(outdegree)) (or limited to tuples up to some fixed value (m,n))

All of this information can be gathered in O(#nodes) [assuming that the graph is stored properly]. Concatenate it and you have a hash. If you prefer, you can use some well-known hash algorithm like SHA on this concatenated information. Without additional hashing it is a continuous hash (it allows you to find similar graphs); with additional hashing it is uniform and fixed in size if the chosen hash algorithm has these properties.
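A sketch of a hash built only from such invariants (SHA-256, the edge-list input format, and the function name are my own choices):

```python
import hashlib

def invariant_hash(edges, n):
    # Uses only representation-invariant quantities: node count,
    # edge count, and the (indegree, outdegree) histogram.
    # Isomorphic graphs always collide; some non-isomorphic ones may too.
    indeg = [0] * n
    outdeg = [0] * n
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
    histogram = sorted(zip(indeg, outdeg))
    data = repr((n, len(edges), histogram)).encode()
    return hashlib.sha256(data).hexdigest()
```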

As it is, it is already good enough to register any added or removed connection. It might miss connections that were changed, though (a -> c instead of a -> b).


This approach is modular and can be extended as far as you like. Any additional property that is included will reduce the number of collisions but increase the effort necessary to get the hash value. Some more ideas:

  • the same as above but with second-order in- and outdegree, i.e. the number of nodes that can be reached by a node->child->child chain (= second-order outdegree), or respectively the number of nodes that lead to the given node in two steps.
  • or, more generally, n-th-order in- and outdegree (can be computed in O((average-number-of-connections)^(n-1) * #nodes))
  • the number of nodes with eccentricity = x (again for any x)
  • if the nodes store any information (other than their neighbours), use a xor of any kind of hash of all the node contents. Due to the xor, the specific order in which the nodes were added to the hash does not matter.
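As a concrete illustration of the first bullet, here is a small sketch of the second-order outdegree (the dict-of-children input format and the function name are my own assumptions):

```python
def second_order_outdegree(lt):
    # For each node, count the distinct nodes reachable by exactly a
    # node -> child -> child chain (grandchildren).
    # `lt` maps each node to the set of its direct children.
    return {v: len({g for c in lt[v] for g in lt[c]}) for v in lt}
```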

You requested "a unique hash value" and clearly I cannot offer you one. But I see the terms "hash" and "unique to every graph" as mutually exclusive (not entirely true of course) and decided to answer the "hash" part and not the "unique" part. A "unique hash" (perfect hash) basically needs to be a full serialization of the graph (because the amount of information stored in the hash has to reflect the total amount of information in the graph). If that is really what you want, just define some unique order of nodes (e.g. sorted by own outdegree, then indegree, then outdegree of children, and so on until the order is unambiguous) and serialize the graph in any way (using the position in the aforementioned ordering as the index of the nodes).

Of course this is much more complex though.

IMHO, if the graph can be topologically sorted, a very straightforward solution exists.

  1. For each vertex with index i, you could build a unique hash (for example, using a string hashing technique) of its (sorted) direct neighbours (e.g. if vertex 1 has direct neighbours {43, 23, 2, 7, 12, 19, 334}, the hash function should hash the array {2, 7, 12, 19, 23, 43, 334}).
  2. For the whole DAG you could create a hash, as a hash of the string of hashes for each node: Hash(DAG) = Hash(vertex_1) U Hash(vertex_2) U ... U Hash(vertex_N). I think the complexity of this procedure is around O(N*N) in the worst case. If the graph cannot be topologically sorted, the approach proposed is still applicable, but you need to order the vertices in a unique way (and this is the hard part).
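A minimal sketch of these two steps (SHA-256 and the dict-of-successors format are my own choices; vertex indices are assumed to already follow the canonical topological order):

```python
import hashlib

def dag_hash(lt):
    # `lt` maps each vertex index to the set of its direct successors.
    def h(obj):
        return hashlib.sha256(repr(obj).encode()).hexdigest()
    vertex_hashes = [h(sorted(lt[i])) for i in sorted(lt)]  # step 1
    return h(vertex_hashes)                                 # step 2
```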

I will describe an algorithm to hash an arbitrary directed graph, not taking into account that the graph is acyclic. In fact, even counting the acyclic graphs of a given order is a very complicated task, and I believe that here this would only make the hashing significantly more complicated and thus slower.

A unique representation of the graph can be given by the neighbourhood list. For each vertex, create a list with all its neighbours. Write all the lists one after the other, appending the number of neighbours for each list to the front. Also keep the neighbours sorted in ascending order to make the representation unique for each graph. So, for example, assume you have the graph:

1->2, 1->5
2->1, 2->4
3->4
5->3

What I propose is that you transform this to ({2,2,5}, {2,1,4}, {1,4}, {0}, {1,3}); here the curly brackets are only to visualize the representation, not part of Python's syntax. So the list is in fact: (2,2,5, 2,1,4, 1,4, 0, 1,3).
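Building this flattened representation can be sketched as follows (the edge-list input format is my own choice; vertices are 1-based to match the example above):

```python
def neighbour_list_code(edges, n):
    # For each vertex: its out-degree, followed by its sorted successors.
    succ = {i: [] for i in range(1, n + 1)}
    for u, v in edges:
        succ[u].append(v)
    code = []
    for i in range(1, n + 1):
        neighbours = sorted(succ[i])
        code.append(len(neighbours))
        code.extend(neighbours)
    return tuple(code)
```

For the example graph above this returns (2, 2, 5, 2, 1, 4, 1, 4, 0, 1, 3).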

Now to compute the unique hash, you need to order these representations somehow and assign a unique number to them. I suggest you do something like a lexicographical sort to do that. Let's assume you have two sequences (a1, b_1_1, b_1_2, ..., b_1_a1, a2, b_2_1, b_2_2, ..., b_2_a2, ..., an, b_n_1, b_n_2, ..., b_n_an) and (c1, d_1_1, d_1_2, ..., d_1_c1, c2, d_2_1, d_2_2, ..., d_2_c2, ..., cn, d_n_1, d_n_2, ..., d_n_cn). Here, a and c are the numbers of neighbours for each vertex, and b_i_j and d_k_l are the corresponding neighbours. For the ordering, first compare the sequences (a1, a2, ..., an) and (c1, c2, ..., cn), and if they are different, use this to compare the two representations. If these sequences are equal, compare the lists from left to right, first comparing lexicographically (b_1_1, b_1_2, ..., b_1_a1) to (d_1_1, d_1_2, ..., d_1_c1), and so on until the first mismatch.

In fact, what I propose is to use as a hash the lexicographical number of a word of size N over the alphabet formed by all possible selections of subsets of elements of {1,2,3,...,N}. The neighbourhood list for a given vertex is a letter over this alphabet; e.g. {2,2,5} is the subset consisting of two elements of the set, namely 2 and 5.

The alphabet (the set of possible letters) for the set {1,2,3} would be (ordered lexicographically):

{0}, {1,1}, {1,2}, {1,3}, {2, 1, 2}, {2, 1, 3}, {2, 2, 3}, {3, 1, 2, 3}

The first number, as above, is the number of elements in the given subset, and the remaining numbers are the subset itself. So form all the 3-letter words from this alphabet and you will get all the possible directed graphs with 3 vertices.

Now the number of subsets of the set {1,2,3,...,N} is 2^N, and thus the number of letters of this alphabet is 2^N. Now we code each directed graph of N nodes with a word with exactly N letters from this alphabet, and thus the number of possible hash codes is precisely (2^N)^N. This shows that the hash code grows really fast with increasing N. Also, this is the number of possible different directed graphs with N nodes, so what I suggest is optimal hashing in the sense that it is a bijection and no smaller hash can be unique.

There is a linear algorithm to get a given subset's number in the lexicographical ordering of all subsets of a given set, in this case {1,2,...,N}. Here is the code I have written for coding a subset into a number and vice versa. It is written in C++ but is quite easy to understand, I hope. For the hashing you will need only the code function, but as the hash I propose is reversible I also add the decode function - you will be able to reconstruct the graph from the hash, which is quite cool I think:

#include <vector>
#include <algorithm>
using namespace std;

typedef long long ll;

// Returns the number in the lexicographical order of all combinations of n numbers
// of the provided combination. 
ll code(vector<int> a,int n)
{
    sort(a.begin(),a.end());  // not needed if the set you pass is already sorted.
    int cur = 0;
    int m = a.size();

    ll res =0;
    for(int i=0;i<a.size();i++)
    {
        if(a[i] == cur+1)
        {
            res++;
            cur = a[i];
            continue;
        }
        else
        {
            res++;
            int number_of_greater_nums = n - a[i];
            for(int j = a[i]-1,increment=1;j>cur;j--,increment++)
                res += 1LL << (number_of_greater_nums+increment);
            cur = a[i];
        }
    }
    return res;
}
// Takes the lexicographical code of a combination of n numbers and returns the 
// combination
vector<int> decode(ll kod, int n)
{
    vector<int> res;
    int cur = 0;

    int left = n; // Out of how many numbers are we left to choose.
    while(kod)
    {
        ll all = 1LL << left;// how many are the total combinations
        for(int i=n;i>=0;i--)
        {
            if(all - (1LL << (n-i+1)) +1 <= kod)
            {
                res.push_back(i);
                left = n-i;
                kod -= all - (1LL << (n-i+1)) +1;
                break;
            }
        }
    }
    return res;
}

Also, this code stores the result in a long long variable, which is only enough for graphs with fewer than 64 elements. All possible hashes of graphs with 64 nodes number (2^64)^64. This number has about 1230 digits, so it is quite a big number. Still, the algorithm I describe will work really fast, and I believe you should be able to hash and 'unhash' graphs with a lot of vertices.

Also have a look at this question.

I'm not sure that it's 100% working, but here is an idea:

Let's encode a graph into a string and then take its hash.

  1. the hash of an empty graph is ""
  2. the hash of a vertex with no outgoing edges is "."
  3. the hash of a vertex with outgoing edges is the concatenation of every child's hash with some delimiter (e.g. ",")

To produce the same hash for isomorphic graphs, just sort the hashes (e.g. in lexicographical order) before the concatenation in step 3.

For the hash of a graph, just take the hash of its root (or a sorted concatenation, if there are several roots).
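A minimal recursive sketch of rules 1-3 (the dict-of-children input format is my own assumption; the "." prefix per vertex keeps leaves and inner vertices distinguishable):

```python
def encode(children, v):
    # Sort the children's codes before joining them so that isomorphic
    # orderings encode identically; a leaf encodes to just ".".
    # `children` maps a vertex to the list of its direct successors.
    codes = sorted(encode(children, c) for c in children.get(v, []))
    return "." + ",".join(codes)
```

Note that a vertex with several parents is re-encoded once per parent, which is exactly the source of the collision described in the edit below.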

edit While I hoped that the resulting string would describe the graph without collisions, hynekcer found that sometimes non-isomorphic graphs get the same hash. That happens when a vertex has several parents - it is then "duplicated" for every parent. For example, the algorithm does not differentiate a "diamond" {A->B->C, A->D->C} from the case {A->B->C, A->D->E}.

I'm not familiar with Python, and it's hard for me to understand how the graph is stored in the example, but here is some code in C++ which is likely convertible to Python easily:

THash GetHash(const TGraph &graph)
{
    return ComputeHash(GetVertexStringCode(graph,FindRoot(graph)));
}
std::string GetVertexStringCode(const TGraph &graph,TVertexIndex vertex)
{
    std::vector<std::string> childHashes;
    for(auto c:graph.GetChildren(vertex))
        childHashes.push_back(GetVertexStringCode(graph,c));
    std::sort(childHashes.begin(),childHashes.end());
    std::string result=".";
    for(const auto &h:childHashes)
        result+=h+",";
    return result;
}

When I saw the question, I had essentially the same idea as @example. I wrote a function providing a graph tag such that the tag coincides for two isomorphic graphs.

This tag consists of the sequence of out-degrees in ascending order. You can hash this tag with the string hash function of your choice to obtain a hash of the graph.

Edit: I expressed my proposal in the context of @NeilG's original question. The only modification to make to his code is to redefine the hashkey function as:

def hashkey(self): 
    return tuple(sorted(map(len,self.lt.values())))

I am assuming there are no common labels on vertices or edges, for then you could put the graph in a canonical form, which itself would be a perfect hash. This proposal is therefore based on isomorphism only.

For this, combine hashes for as many simple aggregate characteristics of a DAG as you can imagine, picking those that are quick to compute. Here is a starter list:

  1. 2d histogram of nodes' in- and out-degrees.
  2. 4d histogram of edges a->b where a and b are both characterized by in/out degree.

Addition: Let me be more explicit. For 1, we'd compute a set of triples <I,O;N> (where no two triples have the same I, O values), signifying that there are N nodes with in-degree I and out-degree O. You'd hash this set of triples, or better yet use the whole set arranged in some canonical order, e.g. lexicographically sorted. For 2, we compute a set of quintuples <aI,aO,bI,bO;N> signifying that there are N edges from nodes with in-degree aI and out-degree aO to nodes with in-degree bI and out-degree bO respectively. Again, hash these quintuples, or else use them in canonical order as-is for another part of the final hash.
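Computing both histograms in canonical order can be sketched as follows (the edge-list input format and the function name are my own choices):

```python
from collections import Counter

def degree_profile(edges, n):
    indeg = Counter(v for _, v in edges)
    outdeg = Counter(u for u, _ in edges)
    # Triples <I,O;N>: N nodes with in-degree I and out-degree O.
    nodes = sorted(Counter((indeg[i], outdeg[i]) for i in range(n)).items())
    # Quintuples <aI,aO,bI,bO;N>: N edges between nodes of those degrees.
    links = sorted(Counter((indeg[a], outdeg[a], indeg[b], outdeg[b])
                           for a, b in edges).items())
    return (tuple(nodes), tuple(links))
```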

Starting with this and then looking at the collisions that still occur will probably provide insights on how to do better.

Years ago, I created a simple and flexible algorithm for exactly this problem (finding duplicate structures in a database of chemical structures by hashing them).

I named it "Powerhash", and creating the algorithm required two insights. The first is the power iteration graph algorithm, also used in PageRank. The second is the ability to replace power iteration's inner step function with anything that we want. I replaced it with a function that does the following on each step, for each node:

  • Sort the hashes of the node's neighbors
  • Hash the concatenated sorted hashes

On the first step, a node's hash is affected by its direct neighbors. On the second step, a node's hash is affected by the neighborhood 2 hops away from it. On the Nth step, a node's hash will be affected by the neighborhood N hops around it. So you only need to continue running the Powerhash for N = graph_radius steps. In the end, the graph center node's hash will have been affected by the whole graph.

To produce the final hash, sort the final step's node hashes and concatenate them together. After that, you can compare the final hashes to find whether two graphs are isomorphic. If you have labels, then add them to the internal hashes that you calculate for each node (and at each step).
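One possible reading of the iteration above as a sketch (SHA-256 and the adjacency-dict input format are my own choices; `rounds` plays the role of graph_radius):

```python
import hashlib

def powerhash(neigh, rounds):
    # `neigh` maps each node to an iterable of its neighbour nodes.
    def h(s):
        return hashlib.sha256(s.encode()).hexdigest()
    hashes = {v: h("") for v in neigh}      # every node starts equal
    for _ in range(rounds):
        # Each node's new hash is the hash of its neighbours' sorted,
        # concatenated hashes.
        hashes = {v: h("".join(sorted(hashes[u] for u in neigh[v])))
                  for v in neigh}
    return h("".join(sorted(hashes.values())))
```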

For more on this you can look at my post here:

https://plus.google.com/114866592715069940152/posts/fmBFhjhQcZF

The algorithm above was implemented inside the "madIS" functional relational database. You can find the source code of the algorithm here:

https://github.com/madgik/madis/blob/master/src/functions/aggregate/graph.py

With suitable ordering of your descendants (and if you have a single root node - not a given, but achievable with suitable ordering, maybe by including a virtual root node), the method for hashing a tree ought to work with a slight modification.

Example code is in this StackOverflow answer; the modification would be to sort children in some deterministic order (increasing hash?) before hashing the parent.

Even if you have multiple possible roots, you can create a synthetic single root, with all roots as children.
