
How to optimize my code to calculate Euclidean distance

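I am trying to find the Euclidean distance between two points. I have around 13,000 rows in a DataFrame. I have to find the Euclidean distance of each row against all 13,000 rows and then get the similarity scores. Running the code is very time-consuming (more than 24 hours).

Below is my code: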

# Empty the existing database
df_similar = pd.DataFrame()
print(df_similar)

# 'i' refers all id's in the dataframe
# Length of df_distance is 13000

for i in tqdm(range(len(df_distance))):
    df_50 = pd.DataFrame(columns=['id', 'id_match', 'similarity_distance'])

    # To avoid duplicates, assign "i" to "index" each time so that the
    # comparison starts from index "i" itself.
    if i < len(df_distance):
        index = i

    # This loop is used to iterate one id with all 13000 id's. This is time consuming as we have to iterate each id against all 13000 id's 
    for j in (range(len(df_distance))):

        # "a" is the id we are comparing with
        a = df_distance.iloc[i,2:]        

        # "b" is the id we are selecting to compare with
        b = df_distance.iloc[index,2:]

        value = euclidean_dist(a,b)

        # Create a temp dictionary to load the data into dataframe
        dict = {
            'id': df_distance['id'][i], 
            'id_match': df_distance['id'][index], 
            'similarity_distance':value
        }


        df_50 = df_50.append(dict,ignore_index=True)

        # When "index" reaches the end of the array,
        # reset it to 0 so that the comparison of "b" with "a" continues.
        if index == len(df_distance)-1:
            index = 0
        else:
            index +=1

    # Append the content of "df_50" into "df_similar" once for the iteration of "i"
    df_similar = df_similar.append(df_50,ignore_index=True)

I think the for loop is what takes the most time.

This is the Euclidean distance function I use in my code:

from sklearn.metrics.pairwise import euclidean_distances

def euclidean_dist(a, b):
    euclidean_val = euclidean_distances([a, b])
    value = euclidean_val[0][1]
    return value

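Sample df_distance data. Note: in the image, the values are scaled from a certain column position through the last column, and only those values are used to calculate the distance.

[image]

The output format is as follows.

[image]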

Try using numpy instead, doing something like this:

import pandas as pd
import numpy as np 

def numpy_euclidian_distance(point_1, point_2):
    array_1, array_2 = np.array(point_1), np.array(point_2)
    squared_distance = np.sum(np.square(array_1 - array_2))
    distance = np.sqrt(squared_distance)
    return distance 
    
    
# initialise data of lists.
data = {'num1':[1, 2, 3, 4], 'num2':[20, 21, 19, 18]}
 
# Create DataFrame
df = pd.DataFrame(data)

# calculate the distance between the two whole columns at once using numpy
distance = numpy_euclidian_distance(df.iloc[:,0],df.iloc[:,1])
print(distance)
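
To apply the same NumPy idea to every pair of rows at once, rather than to one pair at a time, a minimal sketch could look like the following. It assumes the feature columns are numeric and that the full result fits in memory (a 13000 x 13000 float64 matrix takes roughly 1.35 GB). The function name `pairwise_euclidean` and the toy data are hypothetical and used only for illustration.

```python
import numpy as np
import pandas as pd


def pairwise_euclidean(X):
    """Return the (n, n) matrix of Euclidean distances between the rows of X."""
    X = np.asarray(X, dtype=float)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, computed for all row pairs at once
    sq_norms = np.sum(X * X, axis=1)
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    # Rounding can produce tiny negative values; clip before the square root
    np.maximum(sq_dist, 0.0, out=sq_dist)
    return np.sqrt(sq_dist)


# Toy data standing in for the numeric columns of df_distance (hypothetical)
df = pd.DataFrame({'num1': [1, 2, 3, 4], 'num2': [20, 21, 19, 18]})
dist_matrix = pairwise_euclidean(df.to_numpy())
print(dist_matrix.shape)
```

Here `dist_matrix[i, j]` is the distance between row `i` and row `j`, so the nested Python loops are replaced by a single matrix product.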

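OK, so based on the comments I think you want the top 50 distances, which is faster to do in one step with a KDTree. As a caveat, KDTree only scales better than brute force when columns**2 < rows, so if your data has more than roughly 114 columns (114**2 ≈ 13,000 rows) there may be faster ways to implement this, but this will probably still be the simplest: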

from scipy.spatial import KDTree
# Use only the feature columns, matching the question's iloc[i, 2:]
X = df_distance.iloc[:, 2:].values
X_tree = KDTree(X)
k_d, k_i = X_tree.query(X, k=50)  # shape of each is (13k, 50)

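Then k_i[i] will be a list of the indices of the 50 points nearest to the point at index i, where 0 <= i < 13000, and k_d[i] will be the corresponding distances.

Edit: this should get you the dataframe you want, using a multi-index: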
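
To see what `query` returns, here is a small sketch on random data. The array `X_demo` and the `_demo` variable names are hypothetical stand-ins, not the asker's data. It also shows that, when the tree is queried with its own points, the first neighbour of each point is the point itself at distance 0, so you may want to request one extra neighbour and drop the first column.

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
X_demo = rng.random((200, 5))  # hypothetical stand-in: 200 points, 5 features

tree = KDTree(X_demo)
k_d_demo, k_i_demo = tree.query(X_demo, k=4)  # each has shape (200, 4)

# The nearest neighbour of every point is the point itself, at distance 0
print(k_i_demo[:3])
print(k_d_demo[:3])

# Drop the first column to exclude the self-match
neighbour_idx = k_i_demo[:, 1:]
neighbour_dist = k_d_demo[:, 1:]
```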

df_d = {
        idx: {
              df_distance['id'][k_i[i, j]]: d for j, d in enumerate(k_d[i])
              } for i, idx in enumerate(df_distance['id'])
        }
out = pd.DataFrame(df_d).T
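
If the long format from the question is preferred (one row per `id` / `id_match` pair with a `similarity_distance` column), one possible sketch is to flatten the two arrays returned by `query`. The toy `df_distance` below, with an `id` column, a `name` column, and two feature columns, is a hypothetical stand-in for the real data, and `k = 3` is chosen only to keep the output short.

```python
import numpy as np
import pandas as pd
from scipy.spatial import KDTree

# Hypothetical stand-in for df_distance: two leading non-feature columns, then features
df_distance = pd.DataFrame({
    'id': [101, 102, 103, 104, 105],
    'name': list('abcde'),
    'f1': [0.0, 1.0, 0.0, 5.0, 5.0],
    'f2': [0.0, 0.0, 2.0, 5.0, 6.0],
})

k = 3  # neighbours per point, including the point itself
X = df_distance.iloc[:, 2:].to_numpy(dtype=float)
k_d, k_i = KDTree(X).query(X, k=k)

ids = df_distance['id'].to_numpy()
df_similar = pd.DataFrame({
    'id': np.repeat(ids, k),        # each id repeated k times
    'id_match': ids[k_i.ravel()],   # ids of the neighbours, row by row
    'similarity_distance': k_d.ravel(),
})
print(df_similar)
```

This builds the whole result in one step instead of appending rows inside a loop, which is where much of the original running time is likely spent.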

