
Time-efficient way to find connected spheres paths in Python

I have written code to find connected spheres paths using the NetworkX library in Python. To do so, I need to find the distances between the spheres before building the graph. This part of the code (the calculation section, i.e. the numba function that finds distances and connections) led to memory leaks when using arrays in numba's parallel scheme (I had this problem when using np.linalg or scipy.spatial.distance.cdist, too). So, I wrote a non-parallel numba code using lists instead. It is now memory-friendly but takes much more time to calculate these distances (it consumes just ~10-20% of 16 GB of memory and ~30-40% of each core of my 4-core CPU machine). For example, when testing on a ~12,000-sphere data volume, the calculation section and the NetworkX graph creation each took less than one second; for a ~550,000-sphere data volume, the calculation section (the numba part) took around 25 minutes, while graph creation and getting the output list took 7 seconds.

import numpy as np
import numba as nb
import networkx as nx


radii = np.load('rad_dist_12000.npy')
poss = np.load('pos_dist_12000.npy')


@nb.njit("(Tuple([float64[:, ::1], float64[:, ::1]]))(float64[::1], float64[:, ::1])", parallel=True)
def distances_numba_parallel(radii, poss):
    radii_arr = np.zeros((radii.shape[0], radii.shape[0]), dtype=np.float64)
    poss_arr = np.zeros((poss.shape[0], poss.shape[0]), dtype=np.float64)
    for i in nb.prange(radii.shape[0] - 1):
        for j in range(i+1, radii.shape[0]):
            radii_arr[i, j] = radii[i] + radii[j]
            poss_arr[i, j] = ((poss[i, 0] - poss[j, 0]) ** 2 + (poss[i, 1] - poss[j, 1]) ** 2 + (poss[i, 2] - poss[j, 2]) ** 2) ** 0.5
    return radii_arr, poss_arr


@nb.njit("(List(UniTuple(int64, 2)))(float64[::1], float64[:, ::1])")
def distances_numba_non_parallel(radii, poss):
    connections = []
    for i in range(radii.shape[0] - 1):
        connections.append((i, i))
        for j in range(i+1, radii.shape[0]):
            radii_arr_ij = radii[i] + radii[j]
            poss_arr_ij = ((poss[i, 0] - poss[j, 0]) ** 2 + (poss[i, 1] - poss[j, 1]) ** 2 + (poss[i, 2] - poss[j, 2]) ** 2) ** 0.5
            if poss_arr_ij <= radii_arr_ij:
                connections.append((i, j))
    return connections


def connected_spheres_path(radii, poss):
    
    # in parallel mode
    # maximum_distances, distances = distances_numba_parallel(radii, poss)
    # connections = distances <= maximum_distances
    # connections[np.tril_indices_from(connections, -1)] = False
    
    # in non-parallel mode
    connections = distances_numba_non_parallel(radii, poss)

    G = nx.Graph(connections)
    return list(nx.connected_components(G))
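
For reference, a trivial usage sketch of the function defined above, using the arrays loaded at the start of the script:

components = connected_spheres_path(radii, poss)
print(f"Found {len(components)} sets of connected spheres")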

My datasets will contain a maximum of 10 million spheres (the data are positions and radii), mostly up to 1 million. As mentioned above, most of the consumed time is spent in the calculation section. I have little experience with graphs and don't know if (and how) it can be handled much faster using all CPU cores or the available RAM capacity (max 12 GB), or whether it can be calculated internally (I doubt that it is necessary to calculate and find the connected spheres separately before using graphs) by other Python libraries such as graph-tool, igraph, and networkit, which do all the processing in C or C++ in an efficient way.
I would be grateful for any suggested answer that can make my code faster for large data volumes (performance is the first priority; if large memory capacities are needed for large data volumes, mentioning the amounts (with some benchmarks) will be helpful).


Update:

Since just using trees will not be helpful enough to improve the performance, I have written an advanced optimized code to improve the speed of the calculation section by combining tree-based algorithms and numba jitting.
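
A minimal sketch of that kind of combination (an illustration only, not the author's actual optimized code, which is not shown here; the function names are made up for the example): let scipy's cKDTree generate candidate pairs, then filter them with a numba-jitted loop that compares squared distances against summed radii.

import numpy as np
import numba as nb
from scipy.spatial import cKDTree

@nb.njit(cache=True)
def keep_touching(pairs, radii, poss):
    # Mark the candidate pairs whose center distance is within the summed radii.
    keep = np.empty(pairs.shape[0], dtype=np.bool_)
    for k in range(pairs.shape[0]):
        i, j = pairs[k, 0], pairs[k, 1]
        d2 = ((poss[i, 0] - poss[j, 0]) ** 2
              + (poss[i, 1] - poss[j, 1]) ** 2
              + (poss[i, 2] - poss[j, 2]) ** 2)
        r = radii[i] + radii[j]
        keep[k] = d2 <= r * r  # compare squared values to avoid the sqrt
    return keep

def tree_plus_numba(radii, poss):
    tree = cKDTree(poss)
    # Any touching pair must be within twice the largest radius.
    pairs = tree.query_pairs(2 * radii.max(), output_type='ndarray')
    return pairs[keep_touching(pairs, radii, poss)]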
Now, I am curious whether it can be calculated internally (the calculation section is an integral part of, and a basic need for, such graphing) by other Python libraries such as graph-tool, igraph, and networkit, which do all the processing in C or C++ in an efficient way.
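
As a point of comparison for those libraries, the component extraction itself is straightforward once the valid pairs are known. A sketch with python-igraph, assuming filtered_pairs is an (n, 2) array of valid index pairs such as the return value of the sketch above (the method is called connected_components() in recent python-igraph; older releases call it clusters()):

import igraph as ig

# Passing the vertex count explicitly keeps isolated spheres in the graph.
g = ig.Graph(n=len(radii), edges=filtered_pairs.tolist())
components = [list(c) for c in g.connected_components()]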


Data

radii: 12000, 50000, 550000
poss: 12000, 50000, 550000

If you are computing the pairwise distance between all points, that's N^2 calculations, which will take a very long time for sufficiently many data points.

If you can place an upper bound on the distance you need to consider for any two points, then there are some nice data structures for finding pairs of neighbors in a set of points. If you already have scipy installed, then the most convenient structure to reach for is the KDTree (or the optimized version, cKDTree). (Read more here.)

The basic recipe is:

  • Load your point set into the KDTree.
  • Ask the KDTree for all pairs of points which are within some maximum distance of each other.
  • Calculate the actual distances between each of the returned pairs.
  • Compare those distances with the summed radii associated with each point pair. Drop the pairs whose distances are too large.

Finally, you need to determine the clusters of spheres. Your question mentions "paths", but in your example code you're only concerned with connected components. Of course you can use networkx or graph-tool for that, but maybe that's overkill.

If connected components are all you need, then you don't even need a proper graph data structure. You just need a way to find the groups of linked nodes, without maintaining the specific connections that linked them. Again, scipy has a nice tool: DisjointSet. (Read more here.)
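
A tiny illustration of the DisjointSet operations used in the example below, with made-up elements:

from scipy.cluster.hierarchy import DisjointSet

ds = DisjointSet(range(5))  # five singleton sets: {0} ... {4}
ds.merge(0, 1)              # union the sets containing 0 and 1
ds.merge(3, 4)
print(ds.connected(0, 1))   # True
print(list(ds.subsets()))   # e.g. [{0, 1}, {2}, {3, 4}]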

Here is a complete example. The execution time depends not only on the number of points, but on how "dense" they are. I tried some reasonable (I think) test data with 1M points, which took 24 seconds to process on my laptop.

Your example data (the largest of the sets provided above) takes longer: about 45 seconds. The KDTree finds 312M pairs of points to consider, of which fewer than 1M are actually valid connections.

import numpy as np
from scipy.spatial import cKDTree
from scipy.cluster.hierarchy import DisjointSet

## Example data (2D)
# N = 1000
# D = 2
# max_point = 1000
# min_radius = 10
# max_radius = 20
# points = np.random.randint(0, max_point, size=(N, D))
# radii = np.random.randint(min_radius, max_radius+1, size=N)

## Example data (3D)
# N = 1_000_000
# D = 3
# max_point = 3000
# min_radius = 10
# max_radius = 20
# points = np.random.randint(0, max_point, size=(N, D))
# radii = np.random.randint(min_radius, max_radius+1, size=N)


# Question data (3D)
points = np.load('b (556024).npy')
radii = np.load('a (556024).npy')
N = len(points)

# Load into a KD tree and extract all pairs which could possibly be linked
# (using the maximum radius as the upper bound of the search distance.)
kd = cKDTree(points)
pairs = kd.query_pairs(2 * radii.max(), output_type='ndarray')

def filter_pairs(pairs):
    # Calculate the distance between each pair of points
    vectors = points[pairs[:, 1]] - points[pairs[:, 0]]
    distances = np.linalg.norm(vectors, axis=1)

    # Drop the pairs whose summed radii aren't large enough
    # to span the distance between the points.
    thresholds = radii[pairs].sum(axis=1)
    return pairs[distances <= thresholds]

# We could do this in one big step
# ...but that might require lots of RAM.
# It's cheaper to do it in big chunks, in a loop.
fp = []
CHUNK = 1_000_000
for i in range(0, len(pairs), CHUNK):
    fp.append(filter_pairs(pairs[i:i+CHUNK]))
filtered_pairs = np.concatenate(fp)

# Load the pairs into a DisjointSet (a.k.a. UnionFind)
# data structure and extract the groups.
ds = DisjointSet(range(N))
for u, v in filtered_pairs:
    ds.merge(u, v)
connected_sets = list(ds.subsets())

print(f"Found {len(connected_sets)} sets of circles/spheres")

Just for fun, here's a visualization of the 2D test data:

from bokeh.plotting import output_notebook, figure, show
output_notebook()

p = figure()
p.circle(*points.T, radius=radii, fill_alpha=0.25)
p.segment(*points[filtered_pairs[:, 0]].T,
          *points[filtered_pairs[:, 1]].T,
          line_color='red')
show(p)

[Image: visualization of the 2D test data, with connected pairs drawn in red]

"to find connected spheres using NetworkX library in Python. For doing so, I need to find distances between the spheres"

Are you calculating the distance between every pair of spheres?

If all you need is to know the pairs of spheres that touch, or maybe that overlap, then you do NOT need to calculate the distance between every pair of spheres, only between ones that are in reasonable proximity to each other. The standard way of handling this is to use an octree: https://en.wikipedia.org/wiki/Octree

This takes some time to set up, but once you have it you can quickly find all the spheres that are close, and none that are too far away. A reasonable search distance would be twice the radius of the largest sphere. For large datasets the improvement in performance can be spectacular.
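
Python's scientific stack has no octree, but the same pruning idea can be illustrated with a simple uniform grid (a hypothetical sketch, not the linked C++ implementation): hash each sphere into a cell whose side is twice the largest radius, then compare each sphere only against spheres in its own and adjacent cells.

import numpy as np
from collections import defaultdict

def touching_pairs_grid(points, radii):
    # Cell side = twice the largest radius, so touching spheres can only
    # lie in the same cell or in one of the 26 adjacent cells.
    cell = 2 * radii.max()
    grid = defaultdict(list)
    for idx, p in enumerate(points):
        grid[tuple((p // cell).astype(np.int64))].append(idx)

    pairs = []
    offsets = [(dx, dy, dz)
               for dx in (-1, 0, 1)
               for dy in (-1, 0, 1)
               for dz in (-1, 0, 1)]
    for key, members in grid.items():
        for off in offsets:
            neigh = grid.get((key[0] + off[0], key[1] + off[1], key[2] + off[2]), [])
            for i in members:
                for j in neigh:
                    if i < j:  # count each pair once
                        if np.linalg.norm(points[i] - points[j]) <= radii[i] + radii[j]:
                            pairs.append((i, j))
    return pairs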

[Image: performance results from the quadtree test linked below]

(For more details about this test, see https://github.com/JamesBremner/quadtree.)

