How to optimize my code for computing Euclidean distance

Question

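I am trying to find the Euclidean distance between two points. My DataFrame has about 13,000 rows. I need to compute the Euclidean distance from every row to all 13,000 rows and then obtain a similarity score. Running the code is very time consuming (more than 24 hours).

Below is my code: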

# Empty the existing database
df_similar = pd.DataFrame()
print(df_similar)

# 'i' refers all id's in the dataframe
# Length of df_distance is 13000

for i in tqdm(range(len(df_distance))):
    df_50 = pd.DataFrame(columns=['id', 'id_match', 'similarity_distance'])

    # To avoid duplicates, "index" is set to "i" on every pass so that the
    # comparison starts from index "i" itself.
    if i < len(df_distance):
        index = i

    # This loop is used to iterate one id with all 13000 id's. This is time consuming as we have to iterate each id against all 13000 id's 
    for j in (range(len(df_distance))):

        # "a" is the id we are comparing with
        a = df_distance.iloc[i,2:]        

        # "b" is the id we are selecting to compare with
        b = df_distance.iloc[index,2:]

        value = euclidean_dist(a,b)

        # Create a temp dictionary to load the data into dataframe
        dict = {
            'id': df_distance['id'][i], 
            'id_match': df_distance['id'][index], 
            'similarity_distance':value
        }


        df_50 = df_50.append(dict,ignore_index=True)

        # When "index" reaches the end of the array, reset it to 0 so that
        # the comparison of "b" with "a" wraps around to the earlier rows.
        if index == len(df_distance)-1:
            index = 0
        else:
            index +=1

    # Append the content of "df_50" into "df_similar" once for the iteration of "i"
    df_similar = df_similar.append(df_50,ignore_index=True)

I think the for loops are what take the most time.

This is the Euclidean distance function I use in my code:

from sklearn.metrics.pairwise import euclidean_distances
def euclidean_dist(a, b):
        euclidean_val = euclidean_distances([a, b])
        value = euclidean_val[0][1]
        return value

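Sample df_distance data. Note: in the image, the values from the location column through the last column are scaled, and only these values are used to calculate the distance.

The output format is as follows.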

Answer 1

Try using numpy instead, something like this:

import pandas as pd
import numpy as np 

def numpy_euclidian_distance(point_1, point_2):
    array_1, array_2 = np.array(point_1), np.array(point_2)
    squared_distance = np.sum(np.square(array_1 - array_2))
    distance = np.sqrt(squared_distance)
    return distance 
    
    
# Initialise the data as a dict of lists.
data = {'num1':[1, 2, 3, 4], 'num2':[20, 21, 19, 18]}

# Create the DataFrame
df = pd.DataFrame(data)

# Treat each column as one vector and compute the distance between
# the two columns in a single vectorized numpy call
distance = numpy_euclidian_distance(df.iloc[:,0],df.iloc[:,1])
print(distance)
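
To apply the same numpy idea to the layout in the question (one point per row, features from column 2 onward), here is a minimal sketch that computes every row-to-row distance at once. It is not part of the original answer: the function name pairwise_euclidean and the small df_demo frame are made up for illustration, and the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b is swapped in for the explicit loops.

```python
import numpy as np
import pandas as pd

def pairwise_euclidean(X):
    """Return the (n, n) matrix of Euclidean distances between the rows of X."""
    X = np.asarray(X, dtype=float)
    sq_norms = np.sum(X * X, axis=1)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, evaluated for all pairs at once
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    # Rounding can leave tiny negative values; clip them before the square root
    np.maximum(sq_dists, 0.0, out=sq_dists)
    return np.sqrt(sq_dists)

# Hypothetical stand-in for df_distance: two non-feature columns, then features
df_demo = pd.DataFrame({
    'id': [10, 11, 12],
    'name': ['a', 'b', 'c'],
    'f1': [0.0, 3.0, 0.0],
    'f2': [0.0, 4.0, 1.0],
})
dist = pairwise_euclidean(df_demo.iloc[:, 2:].to_numpy())
print(dist)
```

For 13,000 rows the full matrix holds 169 million float64 values, roughly 1.35 GB, so this only works if that much memory is available; otherwise the rows can be processed in chunks.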

Answer 2

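OK, so based on the comments I think you want the 50 smallest distances, which is faster to get in one step with KDTree. As a caveat, KDTree is only faster than brute force when roughly 2**columns < rows, so if you have more than about 13 columns there may be faster ways to implement this, but it is still probably the simplest: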

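from scipy.spatial import KDTree
# Use only the feature columns (column 2 onward), as in the question's iloc[i, 2:]
X = df_distance.iloc[:, 2:].values
X_tree = KDTree(X)
# Each point's nearest neighbour is itself (distance 0), so it counts toward the 50
k_d, k_i = X_tree.query(X, k = 50)  # shape of each is (13k, 50)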

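Then k_i[i] is the list of indices of the 50 points nearest to the point at index i, where 0 <= i < 13000, and k_d[i] holds the corresponding distances.

Edit: this should get you the dataframe you want, using a multi-index: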
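
As a sanity check on what query returns, here is a small sketch on made-up random data (X_demo, k_d_demo and k_i_demo are hypothetical names, not from the answer). It asks for one extra neighbour and drops the first column, on the assumption that there are no duplicate rows so that each point's closest match is itself, then compares one row against a brute-force computation.

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
X_demo = rng.random((200, 5))   # hypothetical data: 200 points, 5 features

tree = KDTree(X_demo)
# Ask for k + 1 neighbours because each point's nearest neighbour is itself
k = 3
d_all, i_all = tree.query(X_demo, k=k + 1)
# Drop the self-match in column 0 (assumes there are no duplicate rows)
k_d_demo, k_i_demo = d_all[:, 1:], i_all[:, 1:]

# Cross-check point 0 against a brute-force computation
brute = np.sqrt(((X_demo - X_demo[0]) ** 2).sum(axis=1))
expected = np.sort(brute)[1:k + 1]
print(np.allclose(k_d_demo[0], expected))
```

If the data can contain identical rows, the self-match is not guaranteed to land in column 0, so filtering on the index rather than the position would be safer.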

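# Look ids up by position so the result does not depend on the index labels
ids = df_distance['id'].to_numpy()
df_d = {
        idx: {
              ids[k_i[i, j]]: d for j, d in enumerate(k_d[i])
              } for i, idx in enumerate(ids)
        }
# Rows are ids, columns are matched ids, values are distances
out = pd.DataFrame(df_d).T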
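
If a long table with the three columns from the question's loop (id, id_match, similarity_distance) is preferred, the neighbour arrays can be flattened without any Python loop. This is only a sketch: the small arrays below are hypothetical stand-ins for df_distance['id'], k_i and k_d.

```python
import numpy as np
import pandas as pd

# Hypothetical small inputs standing in for the real ones
ids = np.array([101, 102, 103])                 # df_distance['id'].to_numpy()
k_i_demo = np.array([[0, 1], [1, 0], [2, 1]])   # neighbour positions, shape (n, k)
k_d_demo = np.array([[0.0, 1.5], [0.0, 1.5], [0.0, 2.0]])  # matching distances

n, k = k_i_demo.shape
df_long = pd.DataFrame({
    'id': np.repeat(ids, k),              # each id repeated once per neighbour
    'id_match': ids[k_i_demo.ravel()],    # translate positions back to ids
    'similarity_distance': k_d_demo.ravel(),
})
print(df_long)
```

The result has one row per (id, neighbour) pair, so 13,000 points with k = 50 would give 650,000 rows.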

Answer 1
3, accepted, 2022-04-24 14:28:40

Answer 2
1, 2022-04-25 10:58:18
