Identifying root parents and all their children in trees
I have a pandas DataFrame like this:
parent child parent_level child_level
A B 0 1
B C 1 2
B D 1 2
X Y 0 2
X D 0 2
Y Z 2 3
This represents a tree that looks like this:
   A   X
  /   / \
 B   /   \
 /\ /     \
C  D       Y
           |
           Z
I want to produce something that looks like this:
root children
A [B,C,D]
X [D,Y,Z]
or
root child
A B
A C
A D
X D
X Y
X Z
What is the fastest way to do this without looping? I have a really large DataFrame.
I suggest you use networkx, as this is a graph problem. In particular, the descendants function:
import networkx as nx
import pandas as pd
data = [['A', 'B', 0, 1],
        ['B', 'C', 1, 2],
        ['B', 'D', 1, 2],
        ['X', 'Y', 0, 2],
        ['X', 'D', 0, 2],
        ['Y', 'Z', 2, 3]]
df = pd.DataFrame(data=data, columns=['parent', 'child', 'parent_level', 'child_level'])
roots = df.parent[df.parent_level.eq(0)].unique()
dg = nx.from_pandas_edgelist(df, source='parent', target='child', create_using=nx.DiGraph)
result = pd.DataFrame(data=[[root, nx.descendants(dg, root)] for root in roots], columns=['root', 'children'])
print(result)
Output:
root children
0 A {D, B, C}
1 X {Z, Y, D}
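For readers without networkx, the idea behind nx.descendants can be sketched in plain Python: build a parent-to-children map and walk it breadth-first from each root. This is a minimal sketch, not networkx's implementation. The helper names build_children_map, descendants and root_child_pairs are hypothetical, and roots are assumed to be parents that never appear as a child. It also produces the long (root, child) layout asked for in the question.

```python
from collections import defaultdict, deque

edges = [('A', 'B'), ('B', 'C'), ('B', 'D'),
         ('X', 'Y'), ('X', 'D'), ('Y', 'Z')]


def build_children_map(edges):
    """Map each parent to the set of its direct children."""
    children = defaultdict(set)
    for parent, child in edges:
        children[parent].add(child)
    return children


def descendants(children, root):
    """Collect every node reachable from root (acyclic input assumed)."""
    seen = set()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def root_child_pairs(edges):
    """Return sorted (root, child) pairs, one per descendant of each root."""
    children = build_children_map(edges)
    all_children = {child for _, child in edges}
    # Assumption: a root is any parent that is never listed as a child.
    roots = sorted(set(children) - all_children)
    return [(root, child)
            for root in roots
            for child in sorted(descendants(children, root))]


pairs = root_child_pairs(edges)
print(pairs)
```

With the DataFrame above, edges could be built with zip(df.parent, df.child), and the pairs could be passed to pd.DataFrame(pairs, columns=['root', 'child']).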
def find_root(tree, child):
    if child in tree:
        return {p for x in tree[child] for p in find_root(tree, x)}
    else:
        return {child}
tree = {}
for parent, child in zip(df.parent, df.child):
    tree.setdefault(child, set()).add(parent)
descendents = {}
for child in tree:
    for parent in find_root(tree, child):
        descendents.setdefault(parent, set()).add(child)
pd.DataFrame(descendents.items(), columns=['root', 'children'])
root children
0 A {B, D, C}
1 X {Z, D, Y}
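On a large frame, the recursive find_root may revisit the same ancestors many times, because every child walks all the way up again. One possible refinement is to cache the root set per node, sketched here under the assumption that the data contain no cycles. make_root_finder and roots_of are hypothetical names, not part of the answer above, and the lookup is still recursive, so very deep trees would need the iterative variant instead.

```python
from functools import lru_cache

edges = [('A', 'B'), ('B', 'C'), ('B', 'D'),
         ('X', 'Y'), ('X', 'D'), ('Y', 'Z')]

# child -> set of direct parents, the same layout as `tree` in the answer
tree = {}
for parent, child in edges:
    tree.setdefault(child, set()).add(parent)


def make_root_finder(tree):
    """Return a cached function mapping a node to the frozenset of its roots."""
    @lru_cache(maxsize=None)
    def roots_of(node):
        if node not in tree:
            return frozenset([node])  # no parents: the node is itself a root
        return frozenset(r for p in tree[node] for r in roots_of(p))
    return roots_of


roots_of = make_root_finder(tree)
descendents = {}
for child in tree:
    for root in roots_of(child):
        descendents.setdefault(root, set()).add(child)
print(descendents)
```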
You could alternatively set up find_root as a generator:
def find_root(tree, child):
    if child in tree:
        for x in tree[child]:
            yield from find_root(tree, x)
    else:
        yield child
Further, if you want to avoid recursion depth issues, you can use the "stack of iterators" pattern to define find_root:
def find_root(tree, child):
    stack = [iter([child])]
    while stack:
        for node in stack[-1]:
            if node in tree:
                stack.append(iter(tree[node]))
            else:
                yield node
            break
        else:  # yes! that is an `else` clause on a for loop
            stack.pop()
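To see why the stack-of-iterators version helps, it can be run on a chain deeper than Python's recursion limit. The chain below is a hypothetical stress case, not data from the question; find_root is repeated so the snippet runs on its own.

```python
import sys


def find_root(tree, child):
    stack = [iter([child])]
    while stack:
        for node in stack[-1]:
            if node in tree:
                # Descend: queue up this node's parents for the next pass.
                stack.append(iter(tree[node]))
            else:
                # No parents recorded, so this node is a root.
                yield node
            break
        else:
            # Current iterator is exhausted; go back one level.
            stack.pop()


# Hypothetical stress case: a single chain 0 -> 1 -> ... -> depth,
# stored as child -> set of parents, five times deeper than the limit.
depth = sys.getrecursionlimit() * 5
tree = {child: {child - 1} for child in range(1, depth + 1)}

roots = list(find_root(tree, depth))
print(roots)  # [0]
```

The recursive versions would be expected to raise RecursionError on the same input, since each level of the chain adds a call frame.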
My approach is this: start from the bottom-most parent_level and collect the children in a dictionary. As you go up, when you find that a parent in the dict is the child of another parent, add those children to the new parent, then delete the old parent.
I've made a quick %time test: on this small sample this method is much faster (4.77 µs compared to 5.58 ms using networkx). I'm not sure whether that still holds when you scale up, so give it a try.
import pandas as pd
data = [['A', 'B', 0, 1],
        ['B', 'C', 1, 2],
        ['B', 'D', 1, 2],
        ['X', 'Y', 0, 2],
        ['X', 'D', 0, 2],
        ['Y', 'Z', 2, 3]]
df = pd.DataFrame(data=data, columns=['parent', 'child', 'parent_level', 'child_level'])
current_roots = {}
for parent_level in range(df.parent_level.max(), -1, -1):
    children_to_remove_from_root = []
    for (root, rows) in df[df.parent_level == parent_level].groupby('parent'):
        children = rows['child'].values.tolist()
        current_roots[root] = children
        for child in children:
            if child in current_roots:
                current_roots[root] += current_roots[child]
                children_to_remove_from_root.append(child)
    for child in children_to_remove_from_root:
        del current_roots[child]
print(current_roots)
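A single %time run on six rows mostly measures fixed overhead, so the comparison above may not carry over to large inputs. One way to check is to time a candidate on synthetic data of growing size with the standard timeit module, as sketched here. The make_edges generator and the collect_descendants stand-in are hypothetical; swap in whichever approach you want to measure.

```python
import timeit


def make_edges(n_roots, depth):
    """Hypothetical synthetic data: n_roots separate chains of the given depth."""
    edges = []
    for r in range(n_roots):
        for level in range(depth):
            edges.append((f'r{r}_{level}', f'r{r}_{level + 1}'))
    return edges


def collect_descendants(edges):
    """Stand-in for the approach under test: iterative walk from each root."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
    # Assumption: a root is any parent that never appears as a child.
    roots = set(children) - {child for _, child in edges}
    result = {}
    for root in roots:
        seen, stack = set(), [root]
        while stack:
            for child in children.get(stack.pop(), ()):
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        result[root] = seen
    return result


for n_roots in (10, 100, 1000):
    edges = make_edges(n_roots, depth=20)
    # Best of 3 repeats, 5 calls each, to dampen noise.
    best = min(timeit.repeat(lambda: collect_descendants(edges), number=5, repeat=3))
    print(f'{len(edges):>6} edges: {best / 5:.6f} s per call')
```

Timings will vary by machine, so compare approaches by how they grow with input size, not by absolute numbers.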