
Pruning large graphs of stray nodes

I have a graph consisting of about 35,000 nodes represented in plain text:

node1 -> node35000
node29420 -> node35000
node2334 -> node4116
...

I'd like to trim it down by removing nodes that are not part of a chain at least three nodes long. So if I had only

1 -> 2;
2 -> 3;
3 -> 4;
0 -> 4;

I'd like to keep 1, 2, 3, and 4 (since 1 -> 2 -> 3 -> 4 is four nodes long) but discard 0, that is, remove 0 -> 4.

Any idea of a good way to do this? I tried a combination of Perl and shell functions but I think I need a better approach. Unless maybe there are tools to do this already? The data is in graphviz format but I didn't see any tools in that suite relevant to the task at hand.

Oh, and if there's an easy way to do something like this I'm open to suggestions; it doesn't need to be exactly the task I suggested. I'm just looking for a way to remove most of the noise surrounding the big clumps (which are rare and mostly just a few intersecting chains).

The tool gvpr, which is part of the graphviz tools, lets you apply rules to a graph and output the modified graph.

From the description:

It copies input graphs to its output, possibly transforming their structure and attributes, creating new graphs, ...

It looks like you want to remove all nodes that have an indegree of 0 and whose linked nodes (successors) all have an outdegree of 0.

Here's my version of a gvpr script, nostraynodes.gv:

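BEGIN {node_t n; int candidates[]; int keepers[];}
// Visit every edge once and classify its endpoints.
E{
  // Isolated two-node chain: both ends may have to go.
  if (tail.indegree == 0 && head.outdegree == 0)
  {
    candidates[tail] = 1;
    candidates[head] = 1;
  }
  // Start of a longer chain: the tail must stay.
  else if (tail.indegree == 0)
  {
    keepers[tail] = 1;
  }
  // End of a longer chain: the head must stay.
  else if (head.outdegree == 0)
  {
    keepers[head] = 1;
  }
}

// After the whole graph has been read, delete candidates that are not keepers.
END_G {
  for (candidates[n]){
    if (n in keepers == 0)
    {
       delete(NULL, n);
    }
  }
}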

Here's what the script does:

  1. Loop through all edges once and populate two lists:

    • candidates - a list of nodes which may have to be removed, and
    • keepers - a list of nodes which may end up in candidates but should not be removed.

    So what gets added to which list?

    • Any two nodes connected to each other, where the tail node does not have any incoming edges and the head node does not have any outgoing edges, form a chain of only 2 nodes and are therefore candidates to be deleted; that is, unless the same nodes are part of another chain longer than 2 nodes:
    • A tail node without any incoming edges, but connected to a head node which itself has outgoing edges, is a keeper; and
    • A head node without any outgoing edges, but connected to a tail node which itself has incoming edges, is also a keeper.
  2. Delete all candidates not in keepers.
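
For readers who want to cross-check the gvpr script outside Graphviz, the same candidates/keepers logic can be sketched in Python. This is only an illustration, not part of the gvpr solution: the function name prune_two_node_chains and the representation of the graph as a list of (tail, head) pairs are assumptions made for this sketch.

```python
from collections import defaultdict


def prune_two_node_chains(edges):
    """Drop nodes that only occur in isolated two-node chains.

    `edges` is a list of (tail, head) pairs (an assumed input format).
    Returns the edges that survive after the doomed nodes are removed.
    """
    indegree = defaultdict(int)
    outdegree = defaultdict(int)
    for tail, head in edges:
        outdegree[tail] += 1
        indegree[head] += 1

    candidates, keepers = set(), set()
    for tail, head in edges:
        if indegree[tail] == 0 and outdegree[head] == 0:
            # Two-node chain: both ends may have to be removed.
            candidates.update((tail, head))
        elif indegree[tail] == 0:
            # Start of a longer chain.
            keepers.add(tail)
        elif outdegree[head] == 0:
            # End of a longer chain.
            keepers.add(head)

    doomed = candidates - keepers
    # Removing a node also removes every edge that touches it.
    return [(t, h) for t, h in edges if t not in doomed and h not in doomed]


sample = [(1, 2), (2, 3), (3, 4), (0, 4)]
print(prune_two_node_chains(sample))
```

On the sample graph from the question, node 4 is both a candidate (via 0 -> 4) and a keeper (via 3 -> 4), so only node 0 and its edge are dropped.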

This gvpr script is not generic: it only works for the problem stated in the question, that is, keeping only chains at least 3 nodes long. It also won't delete short loops (two nodes connected to each other).

You can run the gvpr script with the following line:

gvpr -c -f .\nostraynodes.gv .\graph.dot

The output using your sample graph is:

digraph g {
    1 -> 2;
    2 -> 3;
    3 -> 4;
}

Please note that this is my first gvpr script. There are probably better ways to write this, and I'm not sure how this handles 35000 nodes, though I'm confident this should not be a big deal.

See also "Graphviz/Dot - how to mark all leaves in a tree with a distinctive color?" for a simpler example of graph transformation.

Gephi is an excellent open-source GUI tool for visualizing and manipulating graphs, and you will probably be able to find some kind of filter in there for this sort of thing. Maybe a degree filter would do: it would remove nodes which only have one edge. You can also filter on in-degree and out-degree, compute PageRank, etc. It's also got some really nice size/label/colour options and is easy to zoom in/out of.
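
To get a feel for what a single degree-filter pass would do before loading the data into a GUI, here is a rough Python sketch. It is not Gephi's implementation; the function name degree_filter, the min_degree parameter, and the edge-list input are all assumptions made for illustration.

```python
from collections import Counter


def degree_filter(edges, min_degree=2):
    """Keep only edges whose two endpoints both have total degree >= min_degree.

    Degree here counts incoming plus outgoing edges; `edges` is a list of
    (tail, head) pairs (an assumed input format).
    """
    degree = Counter()
    for tail, head in edges:
        degree[tail] += 1
        degree[head] += 1
    return [
        (t, h)
        for t, h in edges
        if degree[t] >= min_degree and degree[h] >= min_degree
    ]


sample = [(1, 2), (2, 3), (3, 4), (0, 4)]
print(degree_filter(sample))  # [(2, 3), (3, 4)]
```

Note that on the question's sample graph this pass also drops node 1, the start of the long chain, because it has only one edge. A plain degree filter is therefore coarser than the "chain of at least three nodes" rule.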

Supposing that any given node can have arbitrarily many predecessors or successors, the in-degree and out-degree of nodes are irrelevant to solving the problem.

Following is a simple O(N+E) algorithm for all graphs of N nodes and E edges, under the path-length-3 criterion. This algorithm can be easily implemented in Perl or C.

The method is based on a definition and an assertion: Define a "made node" as any node that has a parent and a child (predecessor and successor). Every node that will be kept is a made node or is a parent or child of a made node.

  1. Initialize a status array S[Nmax] to all zeroes. Nmax is the maximum node number. If Nmax is not known at the outset, read all the data and find it out.

  2. Read in the given list of edges. Each input item specifies a directed edge (p, q) from node p to node q. For each (p, q) item that is read in: Set S[p] to S[p] | 1 to denote that p has a child, and set S[q] to S[q] | 2 to denote that q has a parent. (After this step, every made node n has S[n] == 3.)

  3. Read the list of edges again. For each (p, q) item that is read in: If (S[p] == 3) or (S[q] == 3), output edge (p, q).
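
The three steps above can be sketched in Python as follows. This is a minimal illustration under a few assumptions: the edges are already parsed into a list of (p, q) pairs, a dictionary stands in for the status array S so that Nmax need not be known, and the names HAS_CHILD, HAS_PARENT, and keep_edges_near_made_nodes are invented for this sketch.

```python
HAS_CHILD = 1   # bit set on p for every edge (p, q)
HAS_PARENT = 2  # bit set on q for every edge (p, q)


def keep_edges_near_made_nodes(edges):
    """Return the edges that touch at least one "made node".

    A made node has both a parent and a child, i.e. status == 3.
    `edges` is a list of (p, q) pairs (an assumed input format).
    """
    status = {}  # plays the role of S[]; missing entries count as 0

    # First pass: record which nodes have a child and which have a parent.
    for p, q in edges:
        status[p] = status.get(p, 0) | HAS_CHILD
        status[q] = status.get(q, 0) | HAS_PARENT

    # Second pass: keep an edge if either endpoint is a made node.
    made = HAS_CHILD | HAS_PARENT
    return [(p, q) for p, q in edges if status[p] == made or status[q] == made]


sample = [(1, 2), (2, 3), (3, 4), (0, 4)]
print(keep_edges_near_made_nodes(sample))  # [(1, 2), (2, 3), (3, 4)]
```

On the question's sample, nodes 2 and 3 are made nodes, so 0 -> 4 is the only edge with no made endpoint and is the only one dropped.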

To extend this method to a path length K other than 3, keep the edge list in memory, maintain Sp[] and Sc[] with lengths of parent chains and child chains, and perform K/2 extra passes. It might be possible to do this in O(N+K*E) time. The problem does not specify whether the graph is a DAG (directed acyclic graph), but the example given is a DAG. For K>3, it may make a difference.

Update 1: Here's a more precise statement of a K>3 algorithm, with H[i].p and H[i].q being endpoints of edge #i, and pc[j], cc[j] being lengths of predecessor and successor chains about node j. Also, let E = # of edges; N = # of nodes; and K = desired min chain length for keeping an edge.

  1. Read E edge data entries into H[ ] array. Set all pc[j], cc[j] entries to 0.

  2. For i = 1 to E, set cc[H[i].p]=1 and pc[H[i].q]=1.

  3. For j = 1 to K+1, { for i = 1 to E, { Let p=H[i].p and q=H[i].q. Set cc[p] = max(cc[p], 1+cc[q]) and pc[q] = max(pc[q], 1+pc[p]). } }

  4. For i = 1 to E, { Let p=H[i].p and q=H[i].q. Output edge (p,q) if pc[p]+cc[p]+1 >= K and pc[q]+cc[q]+1 >= K. }
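
As a rough illustration, the four steps could look like this in Python. The sketch follows the steps literally, so it makes no attempt to handle cycles. The function name prune_short_chains, the use of dictionaries instead of fixed-size pc[] and cc[] arrays, and the edge-list input are assumptions made for this sketch.

```python
from collections import defaultdict


def prune_short_chains(edges, k):
    """Keep edges whose endpoints both lie on a chain of at least k nodes.

    `edges` is a list of (p, q) pairs (an assumed input format).
    pc[n] and cc[n] approximate the lengths, in edges, of the longest
    predecessor and successor chains at node n.
    """
    pc = defaultdict(int)  # step 1: all entries start at 0
    cc = defaultdict(int)

    # Step 2: every tail has a successor, every head has a predecessor.
    for p, q in edges:
        cc[p] = 1
        pc[q] = 1

    # Step 3: k + 1 relaxation passes over the whole edge list.
    for _ in range(k + 1):
        for p, q in edges:
            cc[p] = max(cc[p], 1 + cc[q])
            pc[q] = max(pc[q], 1 + pc[p])

    # Step 4: pc + cc + 1 is the node count of the longest chain through a node.
    return [
        (p, q)
        for p, q in edges
        if pc[p] + cc[p] + 1 >= k and pc[q] + cc[q] + 1 >= k
    ]


sample = [(1, 2), (2, 3), (3, 4), (0, 4)]
print(prune_short_chains(sample, 3))  # [(1, 2), (2, 3), (3, 4)]
```

The whole edge list is scanned K+1 times, so the running time is roughly proportional to K*E.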

This method can make mistakes if the graph is not a DAG and contains short looped paths. For example, if graph edges include (1,2) and (2,1) and no other nodes link to nodes 1 or 2, then neither of those edges should be output; but we end up with K+2 for cc[] and pc[] of those nodes, so they get output anyway.
