
Find all descendants for points in Python

I need to get all descendant points of the links represented as side_a - side_b pairs (in one dataframe), continuing until each side_a reaches its end_point (given in another dataframe). So:

df1:
side_a   side_b
  a        b
  b        c
  c        d
  k        l
  l        m
  l        n
  p        q
  q        r
  r        s

df2:
side_a    end_point
  a          c
  b          c
  c          c
  k          m
  k          n
  l          m
  l          n
  p          s
  q          s
  r          s
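
For anyone who wants to reproduce the problem, the two frames above can be built as follows. This is only a transcription of the tables shown, assuming both are plain pandas DataFrames with string columns:

```python
import pandas as pd

# Links: each row is one edge side_a -> side_b
df1 = pd.DataFrame({
    "side_a": ["a", "b", "c", "k", "l", "l", "p", "q", "r"],
    "side_b": ["b", "c", "d", "l", "m", "n", "q", "r", "s"],
})

# End points: a side_a value may appear more than once (e.g. "k")
df2 = pd.DataFrame({
    "side_a":    ["a", "b", "c", "k", "k", "l", "l", "p", "q", "r"],
    "end_point": ["c", "c", "c", "m", "n", "m", "n", "s", "s", "s"],
})
```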

The point is to get, for each side_a value, all points along the way until the end_point that df2 gives for that value is reached. If a value has two end_point values (as "k" does), there should be two lists.

I have some code, but it isn't written with this approach: it drops all rows from df1 where df1['side_a'] == df2['end_point'], and that causes certain problems. If someone wants me to post the code I will, of course.

The desired output would be something like this:

side_a    end_point
  a          [b, c]
  b          [c]
  c          [c]
  k          [l, m]
  k          [l, n]
  l          [m]
  l          [n]
  p          [q, r, s]
  q          [r, s]
  r          [s]

One more thing: if side_a and end_point are the same, that point doesn't need to be listed at all; I can append it later, whichever is easier.

import pandas as pd
import numpy as np
import itertools

def get_child_list(df, parent_id):
    list_of_children = []
    list_of_children.append(df[df['side_a'] == parent_id]['side_b'].values)
    for c_, r_ in df[df['side_a'] == parent_id].iterrows():
        if r_['side_b'] != parent_id:
            list_of_children.append(get_child_list(df, r_['side_b']))

    # to flatten the list 
    list_of_children =  [item for sublist in list_of_children for item in sublist]
    return list_of_children

new_df = pd.DataFrame(columns=['side_a', 'list_of_children'])
for index, row in df1.iterrows():
    temp_df = pd.DataFrame(columns=['side_a', 'list_of_children'])
    temp_df['list_of_children'] = pd.Series(get_child_list(df1, row['side_a']))
    temp_df['side_a'] = row['side_a']

    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    new_df = pd.concat([new_df, temp_df])

So, the problem with this code is that it only works if I drop the rows of df1 whose side_a equals an end_point from df2. I don't know how to implement the condition that, when an end_point from df2 is found in the side_b column, the search should stop and go no further.

Any help or hint is truly welcome. Thanks in advance.

You can use the networkx library and graphs:

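import networkx as nx
# Directed graph built from df1, so paths only follow side_a -> side_b
G = nx.from_pandas_edgelist(df1, source='side_a', target='side_b', create_using=nx.DiGraph)
# Compute each path once; [1:] drops the starting node itself
df2.assign(end_point=[nx.shortest_path(G, a, e)[1:]
                      for a, e in zip(df2.side_a, df2.end_point)])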

Output:

  side_a  end_point
0      a     [b, c]
1      b        [c]
2      c         []
3      k     [l, m]
4      k     [l, n]
5      l        [m]
6      l        [n]
7      p  [q, r, s]
8      q     [r, s]
9      r        [s]
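
One caveat worth noting: nx.shortest_path raises an exception when the two nodes are not connected or when a node is not in the graph. A minimal sketch of guarding against that, assuming a directed graph and using a small made-up edge list; the helper name path_or_none is hypothetical and not part of networkx:

```python
import networkx as nx
import pandas as pd

def path_or_none(graph, source, target):
    """Return the nodes after `source` on a shortest path to `target`, or None."""
    try:
        return nx.shortest_path(graph, source, target)[1:]
    except (nx.NetworkXNoPath, nx.NodeNotFound):
        # No route between the two nodes, or one of them is not in the graph
        return None

# Tiny illustrative edge list (not the question's full data)
edges = pd.DataFrame({"side_a": ["a", "b"], "side_b": ["b", "c"]})
G = nx.from_pandas_edgelist(edges, source="side_a", target="side_b",
                            create_using=nx.DiGraph)

reachable = path_or_none(G, "a", "c")    # follows the edge direction
unreachable = path_or_none(G, "c", "a")  # goes against the edge direction
missing = path_or_none(G, "a", "z")      # "z" is not a node in G
```

Rows of df2 that return None could then be dropped or reported, depending on what the data is supposed to guarantee.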

Your rules are inconsistent and your definitions are unclear, so you may need to add some constraints here and there; it is not clear exactly what you are asking. By organizing the data structure to fit the problem and building a more robust traversal function (shown below), it becomes easier to add or edit constraints as needed and solve the problem completely.

Transform the df to a dict to better represent a tree structure

This problem is a lot simpler if you transform the data structure into one that fits the problem, instead of trying to solve it within the current structure.

import pandas as pd

## Example dataframe
df = pd.DataFrame({'side_a':['a','b','c','k','l','l','p','q','r'],'side_b':['b','c','d','l','m','n','q','r','s']})

## Instantiate blank tree with every item
all_items = set(df['side_a']) | set(df['side_b'])
tree = {ii : set() for ii in all_items}

## Populate the tree with each row
for idx, row in df.iterrows():
    # add() keeps multi-character labels intact; list('ab') would split them
    tree[row['side_a']].add(row['side_b'])

Traverse the Tree

This is much more straightforward now that the data structure is intuitive. Any standard depth-first search (DFS) algorithm with path saving will do the trick. I modified the one in the link to work with this example.

Edit: Reading it again, it looks like you have a search-termination condition in end_point (your question needs to be clearer about what is input and what is output). You can change the signature to dfs_paths(tree, target, root) and adjust the termination condition to return only the correct paths.

## Standard DFS pathfinder
def dfs_paths(tree, root):
    stack = [(root, [root])]
    while stack:
        (node, path) = stack.pop()
        for nextNode in tree[node] - set(path):
            # Termination condition.
            ### I set it to terminate search at the end of each path.
            ### You can edit the termination condition to fit the
            ### constraints of your goal
            if not tree[nextNode]:
                # [nextNode] and {root} keep multi-character labels whole
                yield set(path + [nextNode]) - {root}
            else:
                stack.append((nextNode, path + [nextNode]))
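
The edit note above suggests stopping at a given target instead of at the end of each branch. One way that could look is sketched below; the function name dfs_paths_to_target and the small hand-written tree are illustrative assumptions, and the sketch yields ordered lists rather than sets so the order of the points is preserved:

```python
def dfs_paths_to_target(tree, root, target):
    """Yield each path from root to target as an ordered list, excluding root."""
    stack = [(root, [root])]
    while stack:
        node, path = stack.pop()
        # Skip nodes already on the current path to avoid looping on cycles
        for next_node in tree.get(node, set()) - set(path):
            if next_node == target:
                # Stop here: do not explore beyond the target
                yield path[1:] + [next_node]
            else:
                stack.append((next_node, path + [next_node]))

# Hand-written subset of the tree built earlier (assumed shape: node -> set of children)
tree = {"a": {"b"}, "b": {"c"}, "c": {"d"}, "d": set(),
        "k": {"l"}, "l": {"m", "n"}, "m": set(), "n": set()}

paths_a_c = list(dfs_paths_to_target(tree, "a", "c"))
paths_k_n = list(dfs_paths_to_target(tree, "k", "n"))
paths_c_c = list(dfs_paths_to_target(tree, "c", "c"))  # root equals target
```

With this termination rule, a row whose side_a equals its end_point produces no path, which matches the question's remark that such points need not be listed.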

Build a dataframe from the generators we yielded

If you're not very comfortable with generators, you can structure the DFS traversal so that it returns a list instead of a generator.

set_a = []
end_points = []
for item in all_items:
    # One output row per path produced by the generator for this item
    for path in dfs_paths(tree, item):
        set_a.append(item)
        end_points.append(path)

## To dataframe
df_2 = pd.DataFrame({'set_a':set_a,'end_points':end_points}).sort_values('set_a')

Output

df_2[['set_a','end_points']]


set_a   end_points
a       {b, c, d}
b       {c, d}
c       {d}
k       {n, l}
k       {m, l}
l       {n}
l       {m}
p       {s, r, q}
q       {s, r}
r       {s}
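
For readers who prefer lists to generators, as mentioned above, the traversal can also be written recursively so that it returns a plain list of ordered paths. This is a sketch under the same "stop at the end of each branch" rule; the name all_leaf_paths and the hand-written tree are assumptions for illustration:

```python
def all_leaf_paths(tree, root):
    """Return every path from root to a leaf as an ordered list, excluding root."""
    results = []

    def walk(node, path):
        # Children not already on the current path (guards against cycles)
        children = tree.get(node, set()) - set(path)
        if not children:
            if len(path) > 1:          # a lone root has no descendants to report
                results.append(path[1:])
            return
        for child in sorted(children):  # sorted only to make the order repeatable
            walk(child, path + [child])

    walk(root, [root])
    return results

# Hand-written subset of the tree (assumed shape: node -> set of children)
tree = {"a": {"b"}, "b": {"c"}, "c": {"d"}, "d": set(),
        "k": {"l"}, "l": {"m", "n"}, "m": set(), "n": set()}

# One (start, path) pair per row, ready to pass to pd.DataFrame if needed
rows = [(root, path) for root in sorted(tree) for path in all_leaf_paths(tree, root)]
```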

If you're OK with an extra import, this can be posed as a path problem on a graph and solved in a handful of lines using NetworkX:

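import networkx

# Materialise the edge pairs as a list before building the directed graph
g = networkx.DiGraph(list(zip(df1.side_a, df1.side_b)))

# Union of the nodes on every simple path, minus the starting node
outdf = df2.assign(end_point=[
    set().union(*networkx.all_simple_paths(g, a, e)) - {a}
    for a, e in zip(df2.side_a, df2.end_point)])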

outdf looks like this. Note that it contains sets instead of the lists in your desired output; this allows all the paths to be combined in a simple way.

  side_a  end_point
0      a     {c, b}
1      b        {c}
2      c         {}
3      k     {l, m}
4      k     {l, n}
5      l        {m}
6      l        {n}
7      p  {r, q, s}
8      q     {r, s}
9      r        {s}
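
If ordered lists are preferred over sets, one option is to emit one row per simple path rather than merging the paths. A minimal sketch on a reduced version of the data; handling the case where side_a equals end_point explicitly is an assumption made here so the result does not depend on how a given NetworkX version treats that case:

```python
import networkx as nx
import pandas as pd

# Reduced, illustrative version of df1 and df2
df1 = pd.DataFrame({"side_a": ["a", "b", "k", "l", "l"],
                    "side_b": ["b", "c", "l", "m", "n"]})
df2 = pd.DataFrame({"side_a": ["a", "c", "k", "k"],
                    "end_point": ["c", "c", "m", "n"]})

g = nx.DiGraph(list(zip(df1.side_a, df1.side_b)))

rows = []
for side_a, end_point in zip(df2.side_a, df2.end_point):
    if side_a == end_point:
        # Nothing to traverse: keep the row with an empty path
        rows.append((side_a, []))
        continue
    for path in nx.all_simple_paths(g, side_a, end_point):
        rows.append((side_a, path[1:]))  # drop the starting node

out = pd.DataFrame(rows, columns=["side_a", "end_point"])
```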
