
Euclidean distance calculation using Python and Dask

I'm trying to identify the elements of a Euclidean distance matrix that fall below a certain threshold. I then take the indices returned by this search and use them to compare elements in a second array (for the sake of demonstration this array is the first eigenvector of a PCA, but the sort is the most relevant part for my question). The approach needs to work for an unknown number of observations, and should run efficiently on several million.

# #
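import numpy as np
from scipy.spatial.distance import cdist

threshold = 10
# 5000 observations with 3 features each. The shape and value range are assumed:
# the original call, np.random.uniform((1, 2, 3), 5000), passed the tuple as `low`.
data = np.random.uniform(1, 100, (5000, 3))
# Row and column indices of every pair closer than the threshold
searchValues = np.where(cdist(data, data) < threshold)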
# #

My problem is twofold.

First, the Euclidean distance matrix quickly becomes too large to compute with a single call to scipy.spatial.distance.cdist(). To work around this, I apply cdist in batches over the dataset and run the search iteratively.

# #
cdist(data, data)
Traceback (most recent call last):
  File "C:\\Users\\tl928yx\\AppData\\Local\\Continuum\\anaconda3\\lib\\site-packages\\IPython\\core\\interactiveshell.py", line 2862, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-10-fb93ae543712>", line 1, in <module>
    cdist(data, data)
  File "C:\\Users\\tl928yx\\AppData\\Local\\Continuum\\anaconda3\\lib\\site-packages\\scipy\\spatial\\distance.py", line 2142, in cdist
    dm = np.zeros((mA, mB), dtype=np.double)
MemoryError
# #
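For a sense of scale, a dense float64 matrix needs 8 bytes per entry, so the size of the full distance matrix can be estimated before calling cdist. A minimal sketch (the helper name `dense_matrix_gib` is made up for illustration):

```python
def dense_matrix_gib(n_rows, n_cols=None, bytes_per_value=8):
    """Approximate size in GiB of a dense n_rows x n_cols float64 matrix."""
    if n_cols is None:
        n_cols = n_rows  # square matrix, as produced by cdist(data, data)
    return n_rows * n_cols * bytes_per_value / 2**30


for n in (5_000, 200_000, 1_000_000):
    print(f"{n:>9} observations: {dense_matrix_gib(n):,.1f} GiB")
```

By the same arithmetic a single 10,000 x 10,000 block takes roughly 0.75 GiB, which is why the batch size matters as much as the total number of observations.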

The second problem is a runtime issue that results from constructing the distance matrix iteratively. With the iterative approach the runtime grows steeply as the dataset grows, because the number of batch pairs to compare grows quadratically with the number of batches. This isn't unexpected given the nature of the iterative approach.

# #
import itertools
import timeit

import dask.array as da
import numpy as np
from scipy.spatial.distance import cdist

threshold = 10
data = np.random.uniform(1, 100, (200000, 40))  # Build random data
data = da.asarray(data)

# Split the observations into batches of 10,000 rows
it = round(data.shape[0] / 10000)
dataArrays = [data[i * 10000:(i + 1) * 10000] for i in range(0, it)]
# Every unordered pair of distinct batches
comparisons = itertools.combinations(dataArrays, 2)

start = timeit.default_timer()
searchvalues = []
for comparison in comparisons:
    # The returned indices are relative to the two batches being compared
    searchvalues.append(np.where(cdist(comparison[0], comparison[1]) < threshold))
time = timeit.default_timer() - start
print(time)
# #
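One detail of the batched search is easy to miss: `np.where` returns positions inside each pair of batches, and `itertools.combinations` never compares a batch with itself. The following is only a sketch of one way to handle both, using plain NumPy and SciPy on a small stand-in dataset; the function name `pairs_within` and the toy sizes are assumptions for illustration, not part of the original code.

```python
import numpy as np
from scipy.spatial.distance import cdist


def pairs_within(data, threshold, batch_size):
    """Return (rows, cols) of all i < j with distance(data[i], data[j]) < threshold."""
    n = data.shape[0]
    rows, cols = [], []
    for a in range(0, n, batch_size):
        block_a = data[a:a + batch_size]
        # Start the inner loop at `a` so each unordered block pair is visited
        # once, including the block compared with itself.
        for b in range(a, n, batch_size):
            block_b = data[b:b + batch_size]
            r, c = np.nonzero(cdist(block_a, block_b) < threshold)
            # Shift block-local positions to positions in the full dataset.
            r, c = r + a, c + b
            # Keep i < j only; this drops self-matches and mirrored
            # duplicates inside the diagonal blocks.
            keep = r < c
            rows.append(r[keep])
            cols.append(c[keep])
    return np.concatenate(rows), np.concatenate(cols)


rng = np.random.default_rng(0)
small = rng.uniform(1, 100, (500, 3))  # toy data, small enough to check by hand
rows, cols = pairs_within(small, threshold=10, batch_size=128)
print(len(rows), "pairs closer than the threshold")
```

Only one block of at most `batch_size` x `batch_size` distances exists at a time, so peak memory is set by the batch size rather than by the number of observations.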

Neither of these issues is unexpected given the nature of the problem. To try to offset both, I've used dask both as a large-data framework in Python and to parallelize the batch process. However, this hasn't significantly improved the run time, and the iterative method is tightly memory-limited in dask (I have to take in batches of 1,000 observations at a time).

# #
import dask
from dask.diagnostics import ProgressBar

@dask.delayed
def eucDist(comparison):
    # A plain NumPy block is enough inside a delayed task
    return cdist(comparison[0], comparison[1])

@dask.delayed
def findValues(euclideanMatrix):
    return np.where(euclideanMatrix < threshold)

start = timeit.default_timer()
# Recreate the iterator; the earlier loop exhausted it
comparisons = itertools.combinations(dataArrays, 2)
test = []
for comparison in comparisons:
    # eucDist is already decorated, so calling it returns a lazy task
    test.append(eucDist(comparison))

look = []
with ProgressBar():
    for element in test:
        # Each .compute() call blocks until that one task has finished
        look.append(findValues(element).compute())
# #

I'm hoping I can parallelize the comparisons to increase speed, but I'm not sure how to implement that in Python. Any help with that, or any recommendations for improving the initial comparison code, would be appreciated.
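As a rough sketch of what parallelizing the comparisons could look like with `dask.delayed`: build all the block-pair tasks first, then submit them with a single `dask.compute` call so the scheduler is free to run several at once. The function name `close_pairs`, the toy data shape, and the batch size are assumptions chosen to keep the example small; like the loop in the question, it only covers pairs of distinct batches.

```python
import itertools

import dask
import numpy as np
from scipy.spatial.distance import cdist


@dask.delayed
def close_pairs(block_a, block_b, offset_a, offset_b, threshold):
    """Full-dataset indices of pairs across two blocks closer than threshold."""
    r, c = np.nonzero(cdist(block_a, block_b) < threshold)
    return r + offset_a, c + offset_b


rng = np.random.default_rng(0)
data = rng.uniform(1, 100, (2_000, 3))  # small stand-in dataset (assumed shape)
threshold = 10
batch_size = 500  # hypothetical batch size; tune to the available memory

starts = range(0, data.shape[0], batch_size)
# Build every task first; nothing is computed yet.
tasks = [
    close_pairs(data[a:a + batch_size], data[b:b + batch_size], a, b, threshold)
    for a, b in itertools.combinations(starts, 2)
]
# One compute call hands the whole task list to the scheduler at once.
results = dask.compute(*tasks)
print(len(results), "block pairs processed")
```

Whether this is actually faster depends on the scheduler and the workload, and it is not measured here. Each running task still holds one `batch_size` x `batch_size` distance block, so more concurrent tasks also means more memory in use at the same time.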

I believe the dask-image package has some dask-enabled distance algorithms.

https://github.com/dask/dask-image

You can compute the Euclidean distance in Dask with dask_distance.euclidean(x, y).
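The exact behaviour of the dask_distance call is not shown in this thread, so it is not reproduced here. As an alternative sketch that relies only on `dask.array`, the squared distances can be written with the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b and evaluated chunk by chunk. The data shape and chunk size below are assumptions for a toy run.

```python
import dask
import dask.array as da
import numpy as np

rng = np.random.default_rng(0)
points = rng.uniform(1, 100, (1_500, 3))  # toy data (assumed shape)
threshold = 10

# Chunk along the observations only, so every chunk holds complete rows.
x = da.from_array(points, chunks=(500, 3))

# ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, built lazily as a dask graph.
sq_norms = (x ** 2).sum(axis=1)
sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * x.dot(x.T)

# Compare squared distances with the squared threshold; no square root needed.
mask = sq_dists < threshold ** 2

# Indices of matching pairs; the diagonal (each point with itself) is included.
rows, cols = dask.compute(*da.nonzero(mask))
print(len(rows), "index pairs, including self-matches")
```

Because this formula rearranges the arithmetic, results can differ from cdist by floating-point round-off for pairs lying almost exactly on the threshold.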
