
Path-finding efficiency in Python

I have written some code that finds all the paths upstream of a given reach in a dendritic stream network. As an example, if I represent the following network:

     4 -- 5 -- 8
    / 
   2 --- 6 - 9 -- 10
  /           \ 
 1              -- 11
  \
   3 ----7

as a set of (child, parent) pairs:

{(11, 9), (10, 9), (9, 6), (6, 2), (8, 5), (5, 4), (4, 2), (2, 1), (3, 1), (7, 3)}

it will return all of the paths upstream of a node, for instance:

get_paths(h, 1)  # edited, had 11 instead of 1 before
[[Reach(2), Reach(6), Reach(9), Reach(11)], [Reach(2), Reach(6), Reach(9), Reach(10)], [Reach(2), Reach(4), Reach(5), Reach(8)], [Reach(3), Reach(7)]]

The code is included below.

My question is: I am applying this to every reach in a very large (e.g., New England) region, for which any given reach may have millions of paths. There's probably no way to avoid this being a very long operation, but is there a pythonic way to perform it such that brand new paths aren't generated with each run?

For example, if I run get_paths(h, 2) and all paths upstream of 2 are found, can I later run get_paths(h, 1) without retracing all of the paths above 2? (A rough sketch of the kind of reuse I mean follows the code below.)

import collections

# Object representing a stream reach.  Used to construct a hierarchy for accumulation function
class Reach(object):
    def __init__(self):
        self.name = None
        self.ds = None
        self.us = set()

    def __repr__(self):
        return "Reach({})".format(self.name)


def build_hierarchy(flows):
    hierarchy = collections.defaultdict(lambda: Reach())
    for reach_id, parent in flows:
        if reach_id:
            hierarchy[reach_id].name = reach_id
            hierarchy[parent].name = parent
            hierarchy[reach_id].ds = hierarchy[parent]
            hierarchy[parent].us.add(hierarchy[reach_id])
    return hierarchy

def get_paths(h, start_node):
    def go_up(n):
        # Headwater reach: the current path is complete, so record a copy of it
        if not h[n].us:
            paths.append(current_path[:])
        for us in h[n].us:
            current_path.append(us)
            go_up(us.name)
        # Backtrack: drop this reach before returning to the caller
        if current_path:
            current_path.pop()
    paths = []
    current_path = []
    go_up(start_node)
    return paths

test_tree = {(11, 9), (10, 9), (9, 6), (6, 2), (8, 5), (5, 4), (4, 2), (2, 1), (3, 1), (7, 3)}
h = build_hierarchy(test_tree)
p = get_paths(h, 1)
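
To illustrate the kind of reuse I have in mind, something like the hypothetical sketch below, where a path_cache dict (a name I'm making up here) keeps the per-node results so a later call can splice them in instead of re-walking the subtree. This isn't code I have working; it is just the behaviour I'm after:

path_cache = {}  # hypothetical: reach name -> list of upstream paths

def get_paths_cached(h, start_node):
    # Serve previously computed results straight from the cache
    if start_node in path_cache:
        return path_cache[start_node]
    paths = []
    for us in h[start_node].us:
        sub_paths = get_paths_cached(h, us.name)
        if sub_paths:
            # Prepend this reach to every path found above it
            paths.extend([us] + p for p in sub_paths)
        else:
            # Headwater reach: the path consists of just this reach
            paths.append([us])
    path_cache[start_node] = paths
    return paths

With something like this, get_paths_cached(h, 2) would populate the cache for everything above 2, and a later get_paths_cached(h, 1) would only have to walk the 3 -> 7 branch.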

EDIT: A few weeks ago I asked a similar question about finding "ALL" upstream reaches in a network and received an excellent answer that was very fast:

class Node(object):

    def __init__(self):
        self.name = None
        self.parent = None
        self.children = set()
        self._upstream = set()

    def __repr__(self):
        return "Node({})".format(self.name)

    @property
    def upstream(self):
        if self._upstream:
            return self._upstream
        else:
            for child in self.children:
                self._upstream.add(child)
                self._upstream |= child.upstream
            return self._upstream

import collections

edges = {(11, 9), (10, 9), (9, 6), (6, 2), (8, 5), (5, 4), (4, 2), (2, 1), (3, 1), (7, 3)}
nodes = collections.defaultdict(lambda: Node())

for node, parent in edges:
    nodes[node].name = node
    nodes[parent].name = parent
    nodes[node].parent = nodes[parent]
    nodes[parent].children.add(nodes[node])

I noticed that the def upstream(): part of this code adds upstream nodes in sequential order, but because it is recursive I can't find a good way to append them to a single list. Perhaps there is a way to modify this code so that it preserves the order.
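
Would something along these lines do it? This is only a sketch (OrderedNode and the list-based cache are names I'm making up here), and the order among siblings still depends on how the edges happen to be iterated, but each child is immediately followed by everything upstream of it in a single list:

class OrderedNode(object):
    """Hypothetical variant of Node that caches upstream nodes in DFS order."""
    def __init__(self):
        self.name = None
        self.parent = None
        self.children = []       # a list instead of a set, to keep insertion order
        self._upstream = None    # cached list of upstream nodes, built lazily

    def __repr__(self):
        return "OrderedNode({})".format(self.name)

    @property
    def upstream(self):
        # Each child is appended first, followed by its own upstream nodes
        if self._upstream is None:
            self._upstream = []
            for child in self.children:
                self._upstream.append(child)
                self._upstream.extend(child.upstream)
        return self._upstream

Building the tree would then use nodes[parent].children.append(nodes[node]) in place of the set's add.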

Yes, you can do this. I'm not fully sure what your constraints are; however, this should get you on the right track. The worst-case run time of this is O(|E|+|V|); the only difference is that in p.dfsh we cache previously evaluated results, whereas in p.dfs we do not.

This will add additional space overhead, so be aware of that tradeoff – you'll save many iterations (depending on your data set) at the expense of more memory being taken up no matter what. Unfortunately, the caching doesn't improve the order of growth, only the practical run time:

points = set([
    (11, 9),
    (10, 9), 
    (9, 6), 
    (6, 2), 
    (8, 5), 
    (5, 4), 
    (4, 2), 
    (2, 1), 
    (3, 1),
    (7, 3),
])

class PathFinder(object):

    def __init__(self, points):
        self.graph  = self._make_graph(points)
        self.hierarchy = {}

    def _make_graph(self, points):
        graph = {}
        for p in points:
            # min(p) is the parent and max(p) the child here, because in this
            # data set every child id is larger than its parent's id.
            less, more = min(p), max(p)

            if less not in graph:
                graph[less] = set([more])
            else:
                graph[less].add(more)

        return graph

    def dfs(self, start):
        visited = set()
        stack = [start]

        _count = 0
        while stack:
            _count += 1
            vertex = stack.pop()
            if vertex not in visited:
                visited.add(vertex)
                if vertex in self.graph:
                    stack.extend(v for v in self.graph[vertex])

        print "Start: {s} | Count: {c} |".format(c=_count, s=start),
        return visited

    def dfsh(self, start):
        visited = set()
        stack = [start]

        _count = 0
        while stack:
            _count += 1

            vertex = stack.pop()
            if vertex not in visited:
                if vertex in self.hierarchy:
                    # This vertex was a start node in an earlier call, so its
                    # whole reachable set is already cached; reuse it as-is.
                    visited.update(self.hierarchy[vertex])
                else:
                    visited.add(vertex)
                    if vertex in self.graph:
                        stack.extend([v for v in self.graph[vertex]])
        # Cache the full reachable set for this start node for later calls.
        self.hierarchy[start] = visited

        print "Start: {s} | Count: {c} |".format(c=_count, s=start),
        return visited

p = PathFinder(points)
print p.dfsh(1)
print p.dfsh(2)
print p.dfsh(9)
print p.dfsh(6)
print p.dfsh(2)
print 
print p.dfs(1)
print p.dfs(2)
print p.dfs(9)
print p.dfs(6)
print p.dfs(2)

The output for p.dfsh is the following:

Start: 1 | Count: 11 | set([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
Start: 2 | Count: 8 | set([2, 4, 5, 6, 8, 9, 10, 11])
Start: 9 | Count: 3 | set([9, 10, 11])
Start: 6 | Count: 2 | set([9, 10, 11, 6])
Start: 2 | Count: 1 | set([2, 4, 5, 6, 8, 9, 10, 11])

The output for just the regular p.dfs is:

Start: 1 | Count: 11 | set([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
Start: 2 | Count: 8 | set([2, 4, 5, 6, 8, 9, 10, 11])
Start: 9 | Count: 3 | set([9, 10, 11])
Start: 6 | Count: 4 | set([9, 10, 11, 6])
Start: 2 | Count: 8 | set([2, 4, 5, 6, 8, 9, 10, 11])

As you can see, I do a DFS, but I keep track of previous iterations, within reason. I don't want to keep track of all possible previous paths because if you're using this on a large data set, it would take up ridiculous amounts of memory.

In the output, you can see the iteration count for p.dfsh(2) go from 8 to 1. Likewise, the count for p.dfsh(6) drops to 2 because of the previous computation of p.dfsh(9). This is a modest runtime improvement over the standard DFS, especially on significantly large data sets.

Sure, assuming you have enough memory to store all the paths from each node, you can just use a straightforward modification of the code you've received in that answer:

class Reach(object):
    def __init__(self):
        self.name = None
        self.ds = None
        self.us = set()
        self._paths = []

    def __repr__(self):
        return "Reach({})".format(self.name)

    @property
    def paths(self):
        # Build the upstream paths once, then serve them from the cache.
        if not self._paths:
            for child in self.us:
                if child.paths:
                    self._paths.extend([child] + path for path in child.paths)
                else:
                    self._paths.append([child])
        return self._paths

Mind you, for some 20,000 reaches the required memory for that approach will be on the order of gigabytes. Assuming a generally balanced tree of reaches, the required memory is O(n^2), where n is the total number of reaches; that would be 4-8 GiB for 20,000 reaches, depending on your system. Required time is O(1) for any node, though, after the paths from h[1] have been computed.
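
For reference, a minimal usage sketch, assuming this Reach class replaces the original one and reusing build_hierarchy and test_tree from the question:

h = build_hierarchy(test_tree)
print(h[1].paths)   # builds and caches every upstream path from reach 1
print(h[2].paths)   # served from the cache that the call above filled in

This returns the same nested lists of Reach objects as get_paths(h, 1) does in the question; the second lookup does no re-walking of the network.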
