
Fastest way with parallelization to compute distances between two points with Python code

I have a data frame 'data' with millions of rows. Each row has coordinates ('x', 'y'), and I would like to compute the distances between consecutive pairs of coordinates in the most efficient way Python can provide. Will parallelization help here?

I saw approaches here that suggest using Cython. However, I would like to see only Python solutions.

Here is a snippet of my data:

points = 
[(26406, -6869),
 (27679, -221),
 (27679, -221),
 (26416, -6156),
 (26679, -578),
 (26679, -580),
 (27813, -558),
 (26254, -1097),
 (26679, -580),
 (27813, -558),
 (28258, -893),
 (26253, -1098),
 (26678, -581),
 (27811, -558),
 (28259, -893),
 (26252, -1098),
 (27230, -481),
 (26679, -582),
 (27488, -5849),
 (27811, -558),
 (28259, -893),
 (26250, -1099),
 (27228, -481),
 (26679, -582),
 (27488, -5847),
 (28525, -1465),
 (27811, -558),
 (28259, -892)]

I believe that my first approach, which uses a for loop, can definitely be improved:

    from scipy.spatial import distance

    def comp_dist(points):
        size = len(points)
        d = 0
        for i in range(1, size):
            # progress indicator for very long inputs
            if i % 1000000 == 0:
                print(i)
            dist = distance.euclidean(points[i-1], points[i])
            d = d + dist
        print(d)
        # return the total so the caller actually receives it
        return d

    # named so that it does not shadow the imported `distance` module
    total_distance = comp_dist(points)

Thank you in advance for your answers.

You said Python, but since you're already using scipy for the distance calculation, I assume that a numpy solution is OK.
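For comparison, a standard-library-only baseline could look like the sketch below. The function name `path_length` and the small `sample` list are illustrative assumptions, not part of the question. It still loops in Python, so it should not be expected to match a vectorized version on millions of rows.

```python
import math

def path_length(points):
    """Sum of Euclidean distances between consecutive (x, y) points."""
    total = 0.0
    # zip pairs each point with its successor: (p0, p1), (p1, p2), ...
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        total += math.hypot(x1 - x0, y1 - y0)
    return total

# hypothetical sample data: two 3-4-5 steps and one repeated point
sample = [(0, 0), (3, 4), (3, 4), (6, 8)]
print(path_length(sample))
```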

A vectorized, single-threaded operation on a 28-million-point numpy array takes only about 1 second on my laptop. With a 32-bit integer data type, the array occupies about 200 MB of memory.

import numpy as np
points = [(26406, -6869), ..., (28259, -892)]
# make a test array by repeating the 28-element points list 1M times
np_points = np.array(points*1000000, dtype='int32')
# use two different slices (offset by 1) from the resulting array;
# [:-1] and [1:] cover every consecutive pair, including the last one;
# execution of the next line takes ~1 second
dists = np.sqrt(np.sum((np_points[:-1] - np_points[1:])**2, axis=1))
print(dists.shape)
(27999999,)

print(dists[:28])
[  6.76878372e+03   0.00000000e+00   6.06789865e+03   5.58419672e+03
   2.00000000e+00   1.13421338e+03   1.64954600e+03   6.69263775e+02
   1.13421338e+03   5.57000898e+02   2.01545280e+03   6.69263775e+02
   1.13323343e+03   5.59400572e+02   2.01744244e+03   1.15636197e+03
   5.60180328e+02   5.32876815e+03   5.30084993e+03   5.59400572e+02
   2.01953386e+03   1.15689585e+03   5.58213221e+02   5.32679134e+03
   4.50303153e+03   1.15431581e+03   5.58802291e+02   6.25764636e+03]
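The loop in the question accumulates a single total rather than keeping each distance. A minimal sketch of that sum using `np.diff` and `np.hypot` follows; the function name `total_path_length` and the `sample` array are assumptions made for illustration. Casting to `float64` is a deliberate choice here to avoid integer overflow when large coordinate differences are squared, at the cost of extra memory.

```python
import numpy as np

def total_path_length(xy):
    """Total length of the polyline through the rows of an (N, 2) array."""
    # float64 avoids integer overflow for large coordinate differences
    xy = np.asarray(xy, dtype=np.float64)
    # row i of deltas holds point[i+1] - point[i]
    deltas = np.diff(xy, axis=0)
    return float(np.hypot(deltas[:, 0], deltas[:, 1]).sum())

# hypothetical sample data: two 3-4-5 steps and one repeated point
sample = np.array([(0, 0), (3, 4), (3, 4), (6, 8)], dtype="int32")
print(total_path_length(sample))
```

If the coordinates live in a pandas data frame with columns 'x' and 'y', something like `data[['x', 'y']].to_numpy()` should yield a suitable input array, assuming a reasonably recent pandas version.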

Here is a quick example to help you get started:

from scipy.spatial import distance
from multiprocessing import Pool

processes = 4


def worker(lst):
    # Compute the distance for every (point, next_point) pair in one chunk
    return [distance.euclidean(i[0], i[1]) for i in lst]

if __name__ == "__main__":
    # Group data into pairs in order to compute distance
    # (`points` is the list of (x, y) tuples from the question)
    pairs = [(points[i], points[i+1]) for i in range(len(points)-1)]

    # Split data into chunks of `processes` pairs each
    l = [pairs[i:i+processes] for i in range(0, len(pairs), processes)]

    with Pool(processes) as p:
        result = p.map(worker, l)
    # Flatten list
    print([item for sublist in result for item in sublist])

Testing this with:

import random
points = [(random.randint(0, 1000), random.randint(0, 1000)) for i in range(1000000)]

With 8 processes it takes around 5 seconds; with 1 process it takes over 10 seconds.
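Much of that time likely goes into per-pair Python calls and into pickling tuples between processes. One possible variation, sketched here under stated assumptions, splits the data into a few large contiguous chunks and lets each worker do vectorized NumPy work. It swaps in a thread pool (`concurrent.futures.ThreadPoolExecutor`) for the process pool so that no array copies are sent between processes. Whether this is actually faster than a single vectorized pass was not measured here. The names `chunk_length` and `parallel_path_length` are hypothetical.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def chunk_length(xy, start, stop):
    """Sum of distances for the pairs (i, i+1) with start <= i < stop."""
    # Include row `stop` so the pair crossing the chunk boundary is not lost.
    deltas = np.diff(xy[start:stop + 1].astype(np.float64), axis=0)
    return float(np.hypot(deltas[:, 0], deltas[:, 1]).sum())

def parallel_path_length(xy, workers=4):
    """Total consecutive distance over an (N, 2) array, computed in chunks."""
    xy = np.asarray(xy)
    n_pairs = len(xy) - 1
    if n_pairs < 1:
        return 0.0
    # Split the pair indices 0..n_pairs into `workers` contiguous ranges.
    bounds = np.linspace(0, n_pairs, workers + 1).astype(int)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(chunk_length, [xy] * workers, bounds[:-1], bounds[1:])
        return sum(parts)

# hypothetical sample data: two 3-4-5 steps and one repeated point
sample = np.array([(0, 0), (3, 4), (3, 4), (6, 8)], dtype="int32")
print(parallel_path_length(sample, workers=3))
```

Each chunk reads one extra row past its end, so the chunk sums add up to the same total as a single pass over the whole array.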

