
Efficient euclidean distance calculation in python for millions of rows

I am trying to find the euclidean distance between elements of two data sets, each with millions of elements. After calculating the euclidean distances, I need the closest match. Given the number of elements, it will take days to finish.

Below is the code I am trying. I also tried using distance from scipy.spatial, but even that takes forever.

import pandas as pd
from sklearn.metrics.pairwise import euclidean_distances

# full pairwise distance matrix: rows of df1 vs rows of df2
df = pd.DataFrame(euclidean_distances(df1, df2))
df.index = df1.index
df.columns = df2.index
df['min_distance'] = df.min(axis=1)
df['min_distance_id'] = df.idxmin(axis=1)

Is there any other way to get the output in less time?

Did you look at scipy.spatial.cKDTree?

You can construct this data structure from one of your data sets, then query it to get the distance to the closest point for each point in the second data set.

import scipy.spatial

KDTree = scipy.spatial.cKDTree(df1)
# note: n_jobs was renamed to workers in SciPy 1.6
distances, indexes = KDTree.query(df2, n_jobs=-1)

I set n_jobs=-1 here to use all available processors.
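A sketch (not part of the original answer) of how the query results can be attached back to the DataFrame, reproducing the min_distance / min_distance_id columns from the question. It assumes df1 and df2 hold only numeric coordinate columns, and uses workers=-1 (the SciPy 1.6+ name for n_jobs):

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

rng = np.random.default_rng(42)
df1 = pd.DataFrame(rng.random((200, 3)))   # reference points
df2 = pd.DataFrame(rng.random((150, 3)))   # query points

tree = cKDTree(df1.values)
# for every row of df2, distance to (and index of) its nearest row in df1
distances, indexes = tree.query(df2.values, workers=-1)

df2 = df2.assign(min_distance=distances,
                 min_distance_id=df1.index[indexes])
```

This avoids ever materializing the full pairwise distance matrix, which is what makes the DataFrame approach in the question blow up at millions of rows.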

I wrote this solution for 2D point lists using numpy. It will quickly find the closest pair of points between two arrays of points. I tried it with two lists of 10 million points each and got the answer in about 4 minutes. With 2 million points on each side, it only took 42 seconds. I don't know if that will be good enough for your needs but it is definitely faster than "days". It also gives good performance in higher dimensions if you need that as well.

from math import sqrt
import numpy as np

def closest(A,B):

    def bruteForce(A,B):
        d = None
        swap = A.shape[0] > B.shape[0]
        if swap: A,B = B,A
        for pA in A:
            daB  = np.sum((pA-B)**2,axis=1)
            iMin = np.argmin(daB)
            if d is None or daB[iMin] < d:
                a,b = pA,B[iMin]
                d   = sum((a-b)**2)
        if swap: a,b = b,a
        return a,b,sqrt(d)

    # small sizes are faster using brute force
    if A.shape[0] * B.shape[0] < 1000000 \
    or A.shape[0] < 20 or B.shape[0] < 20:
        return bruteForce(A,B)

    # find center position
    midA  = np.sum(A,axis=0)/A.shape[0]
    midB  = np.sum(B,axis=0)/B.shape[0]
    midAB = (midA+midB)/2

    # closest A to center position
    A2midAB  = np.sum((A-midAB)**2,axis=1)
    iA       = np.argmin(A2midAB)    
    pA       = A[iA]

    # closest B to pA
    B2pA     = np.sum((B-pA)**2,axis=1)
    iB       = np.argmin(B2pA)
    pB       = B[iB]
    dAB      = sqrt(sum((pA-pB)**2))

    # distance of zero is best solution, return immediately
    if dAB == 0: return pA,pB,dAB

    # slope of ptA-ptB segment
    if pA[0] == pB[0]: p,m = 0,1 
    else:              p,m = 1,(pB[1]-pA[1])/(pB[0]-pA[0])

    # perpendicular line intersections with x axis from each point
    xA = m*A[:,1] + p*A[:,0] 
    xB = m*B[:,1] + p*B[:,0]

    # baselines for ptA and ptB
    baseA = xA[iA]
    baseB = xB[iB]
    rightSide = (baseB > baseA) 

    # partitions
    ArightOfA = (xA > baseA) == rightSide
    BrightOfA = (xB > baseA) == rightSide
    AleftOfB  = (xA > baseB) != rightSide
    BleftOfB  = (xB > baseB) != rightSide

    # include pB and exclude pA (we already know its closest point in B)
    ArightOfA[iA] = False
    AleftOfB[iA]  = False
    BleftOfB[iB]  = True
    BrightOfA[iB] = True

    # recurse left side
    if np.any(AleftOfB) and np.any(BleftOfB):
        lA,lB,lD = closest(A[AleftOfB],B[BleftOfB])
        if lD < dAB: pA,pB,dAB = lA,lB,lD

    # recurse right side
    if np.any(ArightOfA) and np.any(BrightOfA):
        rA,rB,rD = closest(A[ArightOfA],B[BrightOfA])
        if rD < dAB: pA,pB,dAB = rA,rB,rD

    return pA,pB,dAB

Tested using two random sets of 2D points with 10 million points each:

dimCount = 2
ACount   = 10000000
ASpread  = ACount
BCount   = ACount-1
BSpread  = BCount
A = np.random.random((ACount,dimCount))*ASpread-ASpread/2
B = np.random.random((BCount,dimCount))*BSpread-BSpread/2

a,b,d = closest(A,B)
print("closest points:",a,b,"distance:",d)

# closest points: [-4422004.2963273   2783038.35968559] [-4422004.76974851  2783038.61468366] distance: 0.5377282447465505
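At small scale, this kind of closest-pair result can be cross-checked against a brute-force pairwise matrix. This verification sketch (not part of the original answer) uses scipy.spatial.distance.cdist:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(1)
A = rng.random((500, 2)) * 1000
B = rng.random((400, 2)) * 1000

D = cdist(A, B)                          # full 500 x 400 distance matrix
i, j = np.unravel_index(np.argmin(D), D.shape)
print("closest points:", A[i], B[j], "distance:", D[i, j])
```

The full matrix is exactly what becomes infeasible at millions of points (a 10M x 10M float64 matrix would need hundreds of terabytes), which is why the divide-and-conquer or KD-tree approaches are needed at scale.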

The way it works is by dividing the A and B points based on a strategically selected pair (pA,pB). The line between pA and pB serves as a partition for the points of the two lists. Each side of this partition is then used recursively to find other (closer) pairs of points.

Graphically, this corresponds to a partition based on lines perpendicular to the pA-pB segment:

[figure: the plane split by the two perpendicular lines through pA and pB]

The strategy for selecting pA and pB is to find the approximate center of the two groups of points and pick a point (pA) from list A that is close to that center, then select the closest point to pA in list B. This ensures that there are no points in between the two perpendicular lines that are closer to pA or pB in the other list.

Points of A and B that are on opposite sides of the perpendicular lines are necessarily farther away from each other than pA-pB, so they can be isolated in two sub-lists and processed separately.

This allows a "divide and conquer" approach that greatly reduces the number of point-to-point distances to compare.

In my tests (with randomly distributed points) the performance seemed to be linear in proportion to the total number of points in A and B. I tried skewing the distribution by creating small clusters of points far apart (so that no point is actually near the approximate center) and the performance was still linear. I'm not sure if there are any "worst case" point distributions that will cause a drop in performance (I haven't found one yet).
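For anyone wanting to compare approaches on their own data, here is a rough timing harness (a hypothetical sketch, not from either answer) pitting the full pairwise matrix against a cKDTree lookup at a small scale; both must of course agree on the minimum distance found:

```python
import time
import numpy as np
from scipy.spatial import cKDTree
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
A = rng.random((5000, 2)) * 100
B = rng.random((5000, 2)) * 100

# brute force: build the whole 5000 x 5000 matrix, then take the minimum
t0 = time.perf_counter()
brute_min = cdist(A, B).min()
t_brute = time.perf_counter() - t0

# KD-tree: nearest neighbor in B for each point of A, then the overall minimum
t0 = time.perf_counter()
tree = cKDTree(B)
d, _ = tree.query(A)
tree_min = d.min()
t_tree = time.perf_counter() - t0

print(f"brute: {t_brute:.3f}s  tree: {t_tree:.3f}s")
```

The gap widens quickly as the inputs grow, since the matrix approach is O(n*m) in both time and memory while the tree query is roughly O((n+m) log m) for well-distributed points.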
