Python - Iterate over an edge list; for nodes with a specific attribute, find all connected nodes with a different specific attribute?
I have an edge list containing 24,000 different edges between products. An edge is created between A and B if product B is a subcomponent of A.
The edge list is in the following format:
Parent | Child | Root | Child Meta
AA1      BB1     AA1    ...
AA1      BB2     AA1    ...
BB2      CC1     AA1    ...
AA2      BB3     AA2
AA2      BB4     AA2
BB4      CC1     AA2    ...
BB4      DD1     AA2    ...
DD1      EE1     AA2
DD1      EE2     AA2
BB4      FF1     AA2
FF1      GG1     AA2    ...
GG1      EE3     AA2
So, grouping by Root, I want, for all parents of the form DD* and FF*, to find the children of the form EE* they have a direct connection with. In the example above I want the output DataFrame to look like:
Parent | Child | Root | Child Meta
DD1      EE1     AA2    ...
DD1      EE2     AA2    ...
FF1      EE3     AA2    ...
The only way I know to do this is by iterating over a pandas DataFrame and using a recursive function that walks over the children until it hits an EE* child. This takes forever. Is there a smart way to use networkx here, maybe? Or is there any other way to do this with pandas that would be faster?
If I understand the issue correctly, it might be faster to start at the bottom and find nodes going upwards.
Since you know the subset of children (E*) you want to find, if you start with the target children, all parents are by definition part of the result, and you don't have to filter.
In a plain iterative Python approach, something like this finds all parent edges for E* children (please note that I have added an extra line, "BB3 DD1 AA2", to have another duplicate):
data = """AA1 BB1 AA1
AA1 BB2 AA1
BB2 CC1 AA1
AA2 BB3 AA2
AA2 BB4 AA2
BB4 CC1 AA2
BB3 DD1 AA2
BB4 DD1 AA2
DD1 EE1 AA2
DD1 EE2 AA2
BB4 FF1 AA2
FF1 GG1 AA2
GG1 EE3 AA2"""
# tuple (parent, child, root)
tuples = {tuple(l.split()) for l in data.split("\n")}

parentsByChild = {}
for node in tuples:
    p = set(parentsByChild.get(node[1], frozenset()))
    p.add(node)
    parentsByChild[node[1]] = frozenset(p)

# alternatively:
# from itertools import groupby
# parentsByChild = {c: frozenset(nodes) for c, nodes in
#                   groupby(sorted(tuples, key=lambda n: n[1]), lambda n: n[1])}

def expand(nodes):
    todo, found = set(nodes), set()
    while todo:
        node = todo.pop()
        if node not in found:
            found.add(node)
            todo.update(p for p in parentsByChild.get(node[0], set()) if p not in found)
    return found

leaves = {n for n in tuples if n[1].startswith("E")}
for t in expand(leaves):
    print(t)
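Since your edge list already lives in a pandas DataFrame, the same parentsByChild index can be built straight from it instead of from a raw string. A sketch, assuming the column names Parent, Child, and Root from the question's layout and a small made-up subset of the edges:

```python
import pandas as pd

# Hypothetical DataFrame with the question's columns ("Child Meta" omitted).
df = pd.DataFrame(
    [("DD1", "EE1", "AA2"), ("DD1", "EE2", "AA2"), ("GG1", "EE3", "AA2"),
     ("FF1", "GG1", "AA2"), ("BB4", "FF1", "AA2")],
    columns=["Parent", "Child", "Root"],
)

# Group the edge tuples by child in one pass, mirroring the loop above.
parentsByChild = {
    child: frozenset(group[["Parent", "Child", "Root"]]
                     .itertuples(index=False, name=None))
    for child, group in df.groupby("Child")
}
```

The expand function and the leaves set from the snippet above then work unchanged on this dictionary.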
This should be linear in the number of edges: we iterate over them once to collect the tuples and a second time to group the parents. The expand call iterates over all "interesting" children and their parents, expanding parents only for new nodes, so we never do work twice for the same node.
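As for the networkx part of the question: the same bottom-up reachability can be expressed with nx.ancestors, which returns every node that can reach a given node. A sketch, under the assumption that you want all DD*/FF* ancestors of each EE* node (not only the nearest one), using the AA2 subtree of the example:

```python
import networkx as nx

# Build a directed graph from (parent, child) pairs; the meta columns are omitted.
edges = [("AA2", "BB3"), ("AA2", "BB4"), ("BB3", "DD1"), ("BB4", "DD1"),
         ("DD1", "EE1"), ("DD1", "EE2"), ("BB4", "FF1"),
         ("FF1", "GG1"), ("GG1", "EE3")]
G = nx.DiGraph(edges)

# For every EE* node, collect its DD*/FF* ancestors (all upstream nodes).
result = {
    (ancestor, child)
    for child in G.nodes
    if child.startswith("EE")
    for ancestor in nx.ancestors(G, child)
    if ancestor.startswith(("DD", "FF"))
}
# result == {("DD1", "EE1"), ("DD1", "EE2"), ("FF1", "EE3")}
```

If only the nearest DD*/FF* ancestor is wanted, the expand-style upward walk above, stopped at the first match, is the safer choice, since nx.ancestors would also report a DD* grandparent of a DD* parent.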