Python - Iterate over an edge list; for nodes with a specific attribute, find all connected nodes with a different specific attribute?
I have an edge list containing 24,000 different edges between products. An edge is created between A and B if product B is a subcomponent of A.
The edge list is in the following format:
Parent | Child | Root | Child Meta
AA1      BB1     AA1    ...
AA1      BB2     AA1    ...
BB2      CC1     AA1    ...
AA2      BB3     AA2
AA2      BB4     AA2
BB4      CC1     AA2    ...
BB4      DD1     AA2    ...
DD1      EE1     AA2
DD1      EE2     AA2
BB4      FF1     AA2
FF1      GG1     AA2    ...
GG1      EE3     AA2
So, grouping by Root, I want, for all parents of the form DD* and FF*, to find the children of the form EE* they have a direct connection with. In the example above I want the output DataFrame to look like:
Parent | Child | Root | Child Meta
DD1      EE1     AA2    ...
DD1      EE2     AA2    ...
FF1      EE3     AA2    ...
The only way I know to do this is by iterating over a pandas DataFrame and using a recursive function that walks over the children until it hits an EE* child. This takes forever. Is there a smart way to use networkx here, maybe? Or is there any other way to do this with pandas that would be faster?
If I understand the issue correctly, it might be faster to start at the bottom and find nodes going upwards.
Since you know the subset of children (E*) you want to find, if you start with the target children, all parents are by definition part of the result, and you don't have to filter.
In a plain iterative Python approach, something like this finds all parent edges for E* children (please note that I have added an extra line, "BB3 DD1 AA2", to have another duplicate):
data = """AA1 BB1 AA1
AA1 BB2 AA1
BB2 CC1 AA1
AA2 BB3 AA2
AA2 BB4 AA2
BB4 CC1 AA2
BB3 DD1 AA2
BB4 DD1 AA2
DD1 EE1 AA2
DD1 EE2 AA2
BB4 FF1 AA2
FF1 GG1 AA2
GG1 EE3 AA2"""
# tuple (parent, child, root)
tuples = {tuple(l.split()) for l in data.split("\n")}

parentsByChild = {}
for node in tuples:
    p = set(parentsByChild.get(node[1], frozenset()))
    p.add(node)
    parentsByChild[node[1]] = frozenset(p)

# alternatively:
# from itertools import groupby
# parentsByChild = {c: frozenset(nodes) for c, nodes in
#                   groupby(sorted(tuples, key=lambda n: n[1]), lambda n: n[1])}

def expand(nodes):
    todo, found = set(nodes), set()
    while todo:
        node = todo.pop()
        if node not in found:
            found.add(node)
            todo.update(p for p in parentsByChild.get(node[0], set()) if p not in found)
    return found

leaves = {n for n in tuples if n[1].startswith("E")}
for t in expand(leaves):
    print(t)
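Since your edge list already lives in a pandas DataFrame, the same parentsByChild index can be built straight from it instead of from a raw string. A sketch, assuming the column names Parent, Child, and Root from the question's layout and a small made-up subset of the edges:

```python
import pandas as pd

# Hypothetical DataFrame with the question's columns ("Child Meta" omitted).
df = pd.DataFrame(
    [("DD1", "EE1", "AA2"), ("DD1", "EE2", "AA2"), ("GG1", "EE3", "AA2"),
     ("FF1", "GG1", "AA2"), ("BB4", "FF1", "AA2")],
    columns=["Parent", "Child", "Root"],
)

# Group the edge tuples by child in one pass, mirroring the loop above.
parentsByChild = {
    child: frozenset(group[["Parent", "Child", "Root"]]
                     .itertuples(index=False, name=None))
    for child, group in df.groupby("Child")
}
```

The expand function and the leaves set from the snippet above then work unchanged on this dictionary.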
This should be linear in the number of edges: we iterate over them once to collect the tuples and a second time to group the parents. The expand call iterates over all "interesting" children and their parents, expanding parents only for new nodes, so we never do work twice for the same node.
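As for the networkx part of the question: the same bottom-up reachability can be expressed with nx.ancestors, which returns every node that can reach a given node. A sketch, under the assumption that you want all DD*/FF* ancestors of each EE* node (not only the nearest one), using the AA2 subtree of the example:

```python
import networkx as nx

# Build a directed graph from (parent, child) pairs; the meta columns are omitted.
edges = [("AA2", "BB3"), ("AA2", "BB4"), ("BB3", "DD1"), ("BB4", "DD1"),
         ("DD1", "EE1"), ("DD1", "EE2"), ("BB4", "FF1"),
         ("FF1", "GG1"), ("GG1", "EE3")]
G = nx.DiGraph(edges)

# For every EE* node, collect its DD*/FF* ancestors (all upstream nodes).
result = {
    (ancestor, child)
    for child in G.nodes
    if child.startswith("EE")
    for ancestor in nx.ancestors(G, child)
    if ancestor.startswith(("DD", "FF"))
}
# result == {("DD1", "EE1"), ("DD1", "EE2"), ("FF1", "EE3")}
```

If only the nearest DD*/FF* ancestor is wanted, the expand-style upward walk above, stopped at the first match, is the safer choice, since nx.ancestors would also report a DD* grandparent of a DD* parent.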