Euclidean distance calculation using Python and Dask
I'm attempting to identify elements in the Euclidean distance matrix that fall below a certain threshold. I then take the positions returned by this search and use them to compare elements in a second array (for the sake of demonstration this array is the first eigenvector of a PCA, but the sort is the most relevant part for my question). The approach needs to work for an unknown number of observations, and should run efficiently on several million.
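    import numpy as np
    from scipy.spatial.distance import cdist

    threshold = 10
    # 5000 random observations with 3 features each
    data = np.random.uniform(1, 100, (5000, 3))
    # Row and column positions of every distance below the threshold
    searchValues = np.where(cdist(data, data) < threshold)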
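To make the second step concrete, here is a minimal sketch of how the positions from the search could be used to compare entries of a second array. The array `scores` is a hypothetical stand-in for the first PCA eigenvector, and the absolute difference is only an assumed example of a comparison; neither comes from the real application.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
threshold = 10
data = rng.uniform(1, 100, (500, 3))      # small sample so the full matrix fits in memory
scores = rng.normal(size=data.shape[0])   # hypothetical stand-in for the second array

# Row and column positions of every distance below the threshold
rows, cols = np.where(cdist(data, data) < threshold)

# Keep each unordered pair once and drop the zero-distance diagonal
keep = rows < cols
rows, cols = rows[keep], cols[keep]

# Compare the second array at the matched positions (assumed comparison)
score_gap = np.abs(scores[rows] - scores[cols])
```

Filtering with `rows < cols` is a design choice: the full matrix is symmetric, so every close pair otherwise appears twice, along with each point matched to itself.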
My problem is twofold. First, the Euclidean distance matrix quickly becomes too large to compute with a single call to scipy.spatial.distance.cdist(). To work around this, I apply the cdist function in batches over the dataset and run the search iteratively.
    cdist(data, data)
    Traceback (most recent call last):
      File "C:\Users\tl928yx\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2862, in run_code
        exec(code_obj, self.user_global_ns, self.user_ns)
      File "<ipython-input-10-fb93ae543712>", line 1, in <module>
        cdist(data, data)
      File "C:\Users\tl928yx\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\spatial\distance.py", line 2142, in cdist
        dm = np.zeros((mA, mB), dtype=np.double)
    MemoryError
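For a sense of scale, the following back-of-the-envelope sketch estimates the memory a dense float64 distance matrix needs. The helper name `dense_matrix_gib` is hypothetical, and the estimate ignores any temporary copies made during the computation.

```python
def dense_matrix_gib(n_rows, n_cols, bytes_per_value=8):
    """Memory needed by a dense n_rows x n_cols float64 matrix, in GiB."""
    return n_rows * n_cols * bytes_per_value / 2**30

# Full pairwise matrices for a few dataset sizes
for n in (5_000, 200_000, 2_000_000):
    print(f"{n:>9,} observations -> {dense_matrix_gib(n, n):,.1f} GiB")
```

At 200,000 observations the full matrix already needs roughly 300 GiB, which is why the allocation inside cdist fails long before the several-million range.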
The second problem is a runtime issue that results from constructing the distance matrix iteratively. When I use the iterative approach, the runtime grows rapidly with the number of observations, because every batch has to be compared against every other batch. This isn't unexpected, given the nature of the iterative approach.
    import numpy as np
    import dask.array as da
    from scipy.spatial.distance import cdist
    import itertools
    import timeit

    threshold = 10
    data = np.random.uniform(1, 100, (200000, 40))  # Build random data
    data = da.asarray(data)

    # Split the data into batches of 10,000 observations
    it = round(data.shape[0] / 10000)
    dataArrays = [data[i*10000:(i+1)*10000] for i in range(0, it)]
    # Every unordered pair of distinct batches
    comparisons = itertools.combinations(dataArrays, 2)

    start = timeit.default_timer()
    searchvalues = []
    for comparison in comparisons:
        searchvalues.append(np.where(cdist(comparison[0], comparison[1]) < threshold))
    time = timeit.default_timer() - start
    print(time)
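A minimal sketch of one way the batched search could be arranged so that the reported positions refer to the full dataset rather than to a single block. The function name `close_pairs` and the small sample size are assumptions for illustration. Unlike the loop above, this variant uses `itertools.combinations_with_replacement`, so pairs that fall inside the same batch are covered as well.

```python
import itertools
import numpy as np
from scipy.spatial.distance import cdist

def close_pairs(data, threshold, batch_size):
    """Return global (i, j) index arrays with i < j and distance < threshold."""
    starts = range(0, data.shape[0], batch_size)
    found_i, found_j = [], []
    # combinations_with_replacement also yields (a, a), i.e. pairs inside one batch
    for a, b in itertools.combinations_with_replacement(starts, 2):
        block = cdist(data[a:a + batch_size], data[b:b + batch_size])
        i, j = np.nonzero(block < threshold)
        i, j = i + a, j + b      # shift block-local positions to global ones
        keep = i < j             # only removes entries on diagonal blocks (a == b)
        found_i.append(i[keep])
        found_j.append(j[keep])
    return np.concatenate(found_i), np.concatenate(found_j)

rng = np.random.default_rng(0)
data = rng.uniform(1, 100, (2000, 3))
i, j = close_pairs(data, threshold=10, batch_size=500)
```

Only one `batch_size x batch_size` block is held in memory at a time, so peak memory is set by the batch size rather than by the number of observations.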
Neither of these issues is unexpected, given the nature of the problem. To try to offset both, I've used dask both as a large-data framework in Python and to add parallelization to the batch process. However, this hasn't significantly improved the computation time, and the iterative method in dask comes with a pretty strict memory limitation (it requires taking in batches of 1000 observations at a time).
    from dask.diagnostics import ProgressBar
    import dask.delayed
    import dask.bag

    @dask.delayed
    def eucDist(comparison):
        return da.asarray(cdist(comparison[0], comparison[1]))

    @dask.delayed
    def findValues(euclideanMatrix):
        return np.where(euclideanMatrix < threshold)

    start = timeit.default_timer()
    searchvalues = []
    test = []
    # One delayed distance matrix per pair of batches
    for comparison in comparisons:
        comp = dask.delayed(eucDist)(comparison)
        test.append(comp)

    look = []
    with ProgressBar():
        # Each .compute() call blocks until its own task has finished
        for element in test:
            look.append(dask.delayed(findValues)(element).compute())
I'm hoping that I can parallelize the comparisons to increase speed, but I'm not sure how to implement that in Python. Any help with that, or any recommendations for improving the initial comparison code, would be appreciated.
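One possible arrangement, offered as a sketch rather than a tested solution for data of this size: build every delayed task first and hand them all to the scheduler in a single `dask.compute` call, so the scheduler is free to run independent batch pairs concurrently. The function name `find_close`, the batch size, and the small sample are assumptions for illustration.

```python
import itertools
import dask
import numpy as np
from scipy.spatial.distance import cdist

@dask.delayed
def find_close(block_a, block_b, threshold):
    """Block-local positions where the distance is below the threshold."""
    return np.nonzero(cdist(block_a, block_b) < threshold)

rng = np.random.default_rng(0)
data = rng.uniform(1, 100, (4000, 40))
batch_size = 1000
batches = [data[s:s + batch_size] for s in range(0, data.shape[0], batch_size)]

# Build every task lazily first ...
tasks = [find_close(a, b, 10) for a, b in itertools.combinations(batches, 2)]
# ... then hand the whole set to the scheduler in a single call
results = dask.compute(*tasks)
```

Whether this is actually faster depends on the scheduler in use and on how much of the work inside each task can run concurrently. The `scheduler` keyword of `dask.compute` (for example `scheduler="processes"`) selects how the tasks are executed.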
I believe that the dask-image package has some dask-enabled distance algorithms.
https://github.com/dask/dask-image
You can use dask_distance.euclidean(x,y) to compute the Euclidean distance in Dask.