How can I determine which curve is closest to a given set of points?
I have several dataframes, each containing two columns of x and y values, so every row represents a point on a curve. The different dataframes then represent contours on a map. I also have a set of data points (far fewer of them), and I would like to see which contour they are closest to on average.

I wanted to use sqrt(x^2+y^2) - sqrt(x_1^2 + y_1^2) to build up the distance from each data point to every point on a curve, summed over all points of the curve. The trouble is that there are thousands of points on each curve and only a few dozen data points to evaluate, so I cannot simply put them in columns next to each other.

I think I need to loop over the data points and check the squared distance between each of them and every point of the curve. I don't know whether there is a simple function or module that can do this. Thanks in advance!

Edit: Thanks for the comments. @Alexander: I have tried a vectorized function with a sample dataset, as shown below. I am actually working with contours that contain thousands of data points, and there are 100+ datasets to compare, so I would like to automate as much as possible. At the moment I can create a distance measure from the first data point to the contour, but ideally I would also like to loop over j. When I try that, I get an error:
import numpy as np
from numpy import vectorize
import pandas as pd
from pandas import DataFrame
df1 = {'X1':['1', '2', '2', '3'], 'Y1':['2', '5', '7', '9']}
df1 = DataFrame(df1, columns=['X1', 'Y1'])
df2 = {'X2':['3', '5', '6'], 'Y2':['10', '15', '16']}
df2 = DataFrame(df2, columns=['X2', 'Y2'])
df1=df1.astype(float)
df2=df2.astype(float)
Distance=pd.DataFrame()
i = range(0, len(df1))
j = range(0, len(df2))
def myfunc(x1, y1, x2, y2):
    return np.sqrt((x2-x1)**2 + (y2-y1)**2)  # no inner sqrt around the y-term
vfunc = np.vectorize(myfunc)
Distance['Distance of Datapoint j to Contour'] = vfunc(df1.iloc[i]['X1'], df1.iloc[i]['Y1'], df2.iloc[0]['X2'], df2.iloc[0]['Y2'])
Distance['Distance of Datapoint j to Contour'] = vfunc(df1.iloc[i]['X1'], df1.iloc[i]['Y1'], df2.iloc[1]['X2'], df2.iloc[1]['Y2'])
Distance
For the distance you need to change the formula to

from math import sqrt

def getDistance(x, y, x_i, y_i):
    return sqrt((x_i - x)**2 + (y_i - y)**2)

where (x, y) is your data point and (x_i, y_i) is a point on the curve. Note that in Python ^ is the bitwise XOR operator, so the exponent has to be written as **.
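As a quick sanity check, the formula applied to two concrete points (a 3-4-5 right triangle; the function name here is just an illustration):

```python
from math import sqrt

def get_distance(x, y, x_i, y_i):
    """Euclidean distance between data point (x, y) and curve point (x_i, y_i)."""
    return sqrt((x_i - x) ** 2 + (y_i - y) ** 2)

# The legs are 3 and 4, so the distance from (1, 2) to (4, 6) is 5.
print(get_distance(1, 2, 4, 6))  # 5.0
```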
Consider vectorizing with NumPy. Explicitly looping over the data points may be less efficient, depending on your use case, but could be fast enough. (If you need to run this regularly, I think the vectorized version will easily outrun the explicit one.) That could look something like this:
import numpy as np  # Universal abbreviation for the module

datapoints = np.random.rand(3, 2)   # Array of randomized entries of size 3x2 (imagine it as 3 sets of x- and y-values)
contour1 = np.random.rand(1000, 2)  # Other than the size (which is 1000x2), no different from datapoints
contour2 = np.random.rand(1000, 2)
contour3 = np.random.rand(1000, 2)

def squareDistanceUnvectorized(datapoint, contour):
    retVal = 0.
    print("Using datapoint with values x:{}, y:{}".format(datapoint[0], datapoint[1]))
    lengthOfContour = np.size(contour, 0)  # This gets you the number of rows in the array
    for pointID in range(lengthOfContour):
        squaredXDiff = np.square(contour[pointID, 0] - datapoint[0])
        squaredYDiff = np.square(contour[pointID, 1] - datapoint[1])
        retVal += np.sqrt(squaredXDiff + squaredYDiff)
    retVal = retVal / lengthOfContour  # As we want the average, we divide the sum by the element count
    return retVal
if __name__ == "__main__":
    noOfDatapoints = np.size(datapoints, 0)
    contID = 0
    for currentDPID in range(noOfDatapoints):
        dist1 = squareDistanceUnvectorized(datapoints[currentDPID, :], contour1)
        dist2 = squareDistanceUnvectorized(datapoints[currentDPID, :], contour2)
        dist3 = squareDistanceUnvectorized(datapoints[currentDPID, :], contour3)
        if dist1 < dist2 and dist1 < dist3:    # the smallest average distance wins
            contID = 1
        elif dist2 < dist1 and dist2 < dist3:
            contID = 2
        elif dist3 < dist1 and dist3 < dist2:
            contID = 3
        else:
            contID = 0
        if contID == 0:
            print("Datapoint {} is in between two contours".format(currentDPID))
        else:
            print("Datapoint {} is closest to contour {}".format(currentDPID, contID))
Okay, now on to the vectorized domain.

I have taken the liberty of adjusting this part to what I think your dataset looks like. Try it and let me know if it works.
import numpy as np
import pandas as pd
# Generate 1000 points (2-dim Vector) with random values between 0 and 1. Make them strings afterwards.
# This is the first contour
random2Ddata1 = np.random.rand(1000,2)
listOfX1 = [str(x) for x in random2Ddata1[:,0]]
listOfY1 = [str(y) for y in random2Ddata1[:,1]]
# Do the same for a second contour, except that we de-center this 255 units into the first dimension
random2Ddata2 = np.random.rand(1000,2)+[255,0]
listOfX2 = [str(x) for x in random2Ddata2[:,0]]
listOfY2 = [str(y) for y in random2Ddata2[:,1]]
# After this step, our 'contours' are basically two blobs of datapoints whose centers are approx. 255 units apart.
# Generate a set of 4 datapoints and make them a Pandas-DataFrame
datapoints = {'X': ['0.5', '0', '255.5', '0'], 'Y': ['0.5', '0', '0.5', '-254.5']}
datapoints = pd.DataFrame(datapoints, columns=['X', 'Y'])
# Do the same for the two contours
contour1 = {'Xf': listOfX1, 'Yf': listOfY1}
contour1 = pd.DataFrame(contour1, columns=['Xf', 'Yf'])
contour2 = {'Xf': listOfX2, 'Yf': listOfY2}
contour2 = pd.DataFrame(contour2, columns=['Xf', 'Yf'])
# We do now have 4 datapoints.
# - The first datapoint is basically where we expect the mean of the first contour to be.
# Contour 1 consists of 1000 points with x, y- values between 0 and 1
# - The second datapoint is at the origin. Its distances should be similar to the ones of the first datapoint
# - The third datapoint would be the result of shifting the first datapoint 255 units into the positive first dimension
# - The fourth datapoint would be the result of shifting the first datapoint 255 units into the negative second dimension
# Transformation into numpy array
# First the x and y values of the data points
dpArray = ((datapoints.values).T).astype(float)  # np.float was removed in NumPy 1.20+; plain float works
c1Array = ((contour1.values).T).astype(float)
c2Array = ((contour2.values).T).astype(float)
# This did the following:
# - Transform the datapoints and contours into numpy arrays
# - Transpose them afterwards so that if we want all x values, we can write var[0,:] instead of var[:,0].
# A personal preference, maybe
# - Convert all the values into floats.
# Now, we iterate through the contours. If you have a lot of them, putting them into a list beforehand would do the job
for contourid, contour in enumerate([c1Array, c2Array]):
    # Now for the datapoints
    for _index, _value in enumerate(dpArray[0, :]):
        # The next two lines do vectorization magic.
        # First, we square the difference between one dpArray entry and the contour x values.
        # You might notice that contour[0,:] returns a 1x1000 vector while dpArray[0,_index] is a 1x1 float value.
        # This works because dpArray[0,_index] is broadcast to fit the size of contour[0,:].
        dx = np.square(dpArray[0, _index] - contour[0, :])
        # The same happens for dpArray[1,_index] and contour[1,:]
        dy = np.square(dpArray[1, _index] - contour[1, :])
        # Now, we take (for one datapoint and one contour) the mean value and print it.
        # You could write it into an array or do basically anything with it that you can imagine
        distance = np.mean(np.sqrt(dx + dy))
        print("Mean distance between contour {} and datapoint {}: {}".format(contourid + 1, _index + 1, distance))
# But you want to be able to call this... so here we go, generating a function out of it!
def getDistanceFromDatapointsToListOfContoursFindBetterName(datapoints, listOfContourDataFrames):
    """ Takes a DataFrame with points and a list of different contours to return the average distance for each combination"""
    dpArray = ((datapoints.values).T).astype(float)
    listOfContours = []
    for item in listOfContourDataFrames:
        listOfContours.append(((item.values).T).astype(float))
    retVal = np.zeros((np.size(dpArray, 1), len(listOfContours)))
    for contourid, contour in enumerate(listOfContours):
        for _index, _value in enumerate(dpArray[0, :]):
            dx = np.square(dpArray[0, _index] - contour[0, :])
            dy = np.square(dpArray[1, _index] - contour[1, :])
            distance = np.mean(np.sqrt(dx + dy))
            print("Mean distance between contour {} and datapoint {}: {}".format(contourid + 1, _index + 1, distance))
            retVal[_index, contourid] = distance
    return retVal
# And just to see that it is, indeed, returning the same results, run it once
getDistanceFromDatapointsToListOfContoursFindBetterName(datapoints, [contour1, contour2])
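The returned matrix has one row per datapoint and one column per contour, so picking the closest contour for every datapoint is a single argmin along the columns. A minimal sketch with a hypothetical result matrix:

```python
import numpy as np

# Hypothetical distance matrix: rows = datapoints, columns = contours
dist = np.array([[0.5, 255.2],
                 [0.7, 255.5],
                 [255.1, 0.6]])

closest = np.argmin(dist, axis=1)  # index of the nearest contour per datapoint
print(closest)  # [0 0 1]
```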
The "curve" is actually a polygon with lots of points. There are indeed libraries that can calculate the distance between a polygon and a point, but it usually comes down to the following.

Some libraries can already do it:

scipy.spatial.distance: scipy can be used to calculate the distance between any number of points
numpy.linalg.norm(point1-point2): some answers propose different ways to calculate distances with numpy, and some even show performance benchmarks
sklearn.neighbors: also good for distances between curves, and if you want to check which area a point most likely relates to, sklearn.neighbors can be used as well

You can calculate the distance D(x1, y1, x2, y2) = sqrt((x₂-x₁)² + (y₂-y₁)²) and look for the combination of points that gives the minimal distance.
# get distance from points of 1 dataset to all the points of another dataset
from scipy.spatial import distance
d = distance.cdist(df1.to_numpy(), df2.to_numpy(), 'euclidean')
print(d)
# Results will be a matrix of all possible distances:
# [[ D(Point_df1_0, Point_df2_0), D(Point_df1_0, Point_df2_1), D(Point_df1_0, Point_df2_2)]
#  [ D(Point_df1_1, Point_df2_0), D(Point_df1_1, Point_df2_1), D(Point_df1_1, Point_df2_2)]
#  [ D(Point_df1_2, Point_df2_0), D(Point_df1_2, Point_df2_1), D(Point_df1_2, Point_df2_2)]
#  [ D(Point_df1_3, Point_df2_0), D(Point_df1_3, Point_df2_1), D(Point_df1_3, Point_df2_2)]]
[[ 8.24621125 13.60147051 14.86606875]
[ 5.09901951 10.44030651 11.70469991]
[ 3.16227766 8.54400375 9.8488578 ]
[ 1. 6.32455532 7.61577311]]
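One way to reduce such a matrix to a decision, sketched self-contained here (re-creating df1 and df2 from the question as plain arrays): the overall mean gives an average distance to this "contour", and an argmin along each row gives the nearest contour point per datapoint.

```python
import numpy as np
from scipy.spatial import distance

df1 = np.array([[1, 2], [2, 5], [2, 7], [3, 9]], dtype=float)  # datapoints
df2 = np.array([[3, 10], [5, 15], [6, 16]], dtype=float)       # one contour

d = distance.cdist(df1, df2, 'euclidean')

# Average distance from all datapoints to this contour:
print(d.mean())
# Index of the nearest contour point for each datapoint:
print(np.argmin(d, axis=1))  # [0 0 0 0]
```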
What to do next is up to you. For example, as a metric of the "general distance between curves" you could use:

np.median(np.hstack([np.amin(d, axis) for axis in range(len(d.shape))]))

Or you could calculate the mean of values such as:

np.median(d)
np.median(d[d<np.percentile(d, 66, interpolation='higher')])
for min_value in np.sort(d, None):
    chosen_indices = d <= min_value
    if np.all(np.hstack([np.amax(chosen_indices, axis) for axis in range(len(chosen_indices.shape))])):
        break
similarity = np.median(d[chosen_indices])
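Wrapped into a self-contained function (the function name is mine), the threshold-growing snippet above behaves like this on a tiny matrix: the loop stops at the smallest threshold at which every row and every column contains at least one chosen entry.

```python
import numpy as np

def similarity_from_distances(d):
    """Grow the distance threshold until every row and column is covered,
    then return the median of the chosen distances."""
    for min_value in np.sort(d, None):   # all distances, ascending, flattened
        chosen_indices = d <= min_value
        if np.all(np.hstack([np.amax(chosen_indices, axis)
                             for axis in range(len(chosen_indices.shape))])):
            break
    return np.median(d[chosen_indices])

d = np.array([[1.0, 5.0],
              [4.0, 2.0]])
# Threshold 2.0 is the first to cover both rows and both columns,
# so the chosen distances are 1.0 and 2.0 and their median is 1.5.
print(similarity_from_distances(d))  # 1.5
```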
Or maybe you can use a different kind of distance from the start (for example, the "correlation distance" looks helpful for your task).
Maybe use "Procrustes analysis, a similarity test for two data sets" together with the distances.
Maybe you can use the Minkowski distance as a similarity metric.
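scipy's cdist accepts such metrics directly; a small sketch using the Minkowski distance (p=1 makes it the Manhattan distance, p=2 the ordinary Euclidean distance):

```python
import numpy as np
from scipy.spatial import distance

a = np.array([[0.0, 0.0], [1.0, 1.0]])
b = np.array([[1.0, 0.0]])

# Minkowski distance with p=1 (Manhattan): both points of a are 1.0 away from b
d = distance.cdist(a, b, 'minkowski', p=1)
```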
Another approach is to use some "geometry" library to compare the areas of concave hulls:

Build concave hulls for the contours and for the "candidate datapoints" (not easy, but possible: using shapely, using concaveman). But if you are sure that your contours are already ordered and have no overlapping segments, you can build polygons from the points directly, without the concave hull.

Use the "intersection area" minus the "non-common area" as a similarity metric (shapely can be used for that): intersection.area - symmetric_difference.area
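A minimal sketch of that metric with shapely (assuming shapely is installed), using two overlapping squares:

```python
from shapely.geometry import Polygon

a = Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])   # a 2x2 square
b = Polygon([(1, 0), (3, 0), (3, 2), (1, 2)])   # the same square shifted by 1 in x

common = a.intersection(b).area               # 2.0
non_common = a.symmetric_difference(b).area   # 4.0
similarity = common - non_common
print(similarity)  # -2.0 (larger values mean more overlap)
```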
In some situations this method can be better than processing distances.

But it also has its disadvantages (just draw some examples on paper and try to find them).
Other ideas:

Instead of using polygons or concave hulls you can:

use contour.buffer(some_distance). This way you ignore the "inner area" of the contour and compare only the contours themselves (with a tolerance of some_distance). The distance between centroids (or twice that distance) may be used as the some_distance value
use ops.polygonize to build polygons/lines from segments

Instead of using intersection.area - symmetric_difference.area you can:

compare "simpler" versions of the objects before comparing the actual ones, to filter out obvious mismatches:
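One cheap "simpler version" to compare first is the axis-aligned bounding box; a sketch (the helper names are mine):

```python
import numpy as np

def bbox(points):
    """Axis-aligned bounding box (xmin, ymin, xmax, ymax) of an (n, 2) array."""
    return (*points.min(axis=0), *points.max(axis=0))

def bboxes_overlap(p, q):
    """True if the bounding boxes of two point sets overlap at all."""
    px0, py0, px1, py1 = bbox(p)
    qx0, qy0, qx1, qy1 = bbox(q)
    return px0 <= qx1 and qx0 <= px1 and py0 <= qy1 and qy0 <= py1

near = np.random.rand(100, 2)             # blob around the origin
far = np.random.rand(100, 2) + [255, 0]   # blob shifted 255 units in x
print(bboxes_overlap(near, far))   # False -- skip the expensive comparison
print(bboxes_overlap(near, near))  # True
```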