How to optimize my code to calculate Euclidean distance
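I am trying to find the Euclidean distance between two points. I have around 13,000 rows in a DataFrame. I have to find the Euclidean distance of each row against all 13,000 rows and then get the similarity scores. Running the code is very time-consuming (more than 24 hours).
Below is my code: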
# Empty the existing database
df_similar = pd.DataFrame()
print(df_similar)

# 'i' refers to all ids in the dataframe
# Length of df_distance is 13000
for i in tqdm(range(len(df_distance))):
    df_50 = pd.DataFrame(columns=['id', 'id_match', 'similarity_distance'])
    # In order to avoid duplicates, each time we assign "index" the value of "i" so that the
    # comparison starts from index "i" itself.
    if i < len(df_distance):
        index = i
    # This loop iterates one id against all 13000 ids. This is time-consuming because every id
    # has to be compared against all 13000 ids.
    for j in (range(len(df_distance))):
        # "a" is the id we are comparing with
        a = df_distance.iloc[i,2:]
        # "b" is the id we are selecting to compare with
        b = df_distance.iloc[index,2:]
        value = euclidean_dist(a,b)
        # Create a temp dictionary to load the data into the dataframe
        dict = {
            'id': df_distance['id'][i],
            'id_match': df_distance['id'][index],
            'similarity_distance':value
        }
        df_50 = df_50.append(dict,ignore_index=True)
        # If "index" has reached the end of the array, reset it to 0
        # so that the comparison of "b" with "a" continues from the start.
        if index == len(df_distance)-1:
            index = 0
        else:
            index +=1
    # Append the content of "df_50" to "df_similar" once per iteration of "i"
    df_similar = df_similar.append(df_50,ignore_index=True)
I think the for loops are what take most of the time.
This is the Euclidean distance function I am using in my code:
from sklearn.metrics.pairwise import euclidean_distances

def euclidean_dist(a, b):
    euclidean_val = euclidean_distances([a, b])
    value = euclidean_val[0][1]
    return value
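Sample df_distance data. Note: in the image, the scaled values run from a given column position to the last column, and only those values are used to calculate the distance.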
Try using numpy instead, with something like this:
import pandas as pd
import numpy as np

def numpy_euclidian_distance(point_1, point_2):
    array_1, array_2 = np.array(point_1), np.array(point_2)
    squared_distance = np.sum(np.square(array_1 - array_2))
    distance = np.sqrt(squared_distance)
    return distance

# Initialise the data as lists.
data = {'num1': [1, 2, 3, 4], 'num2': [20, 21, 19, 18]}

# Create the DataFrame
df = pd.DataFrame(data)

# Calculate the distance over whole columns at once using numpy
distance = numpy_euclidian_distance(df.iloc[:, 0], df.iloc[:, 1])
print(distance)
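The function above computes one distance per call. As a minimal sketch of applying the same idea to every pair of rows at once, the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b lets NumPy build the whole distance matrix without a Python loop. The helper name `pairwise_euclidean` and the small `demo` frame are hypothetical stand-ins, and the layout (two leading non-feature columns) is an assumption taken from the `iloc[..., 2:]` slicing in the question.

```python
import numpy as np
import pandas as pd

def pairwise_euclidean(features):
    """Return the (n, n) matrix of Euclidean distances between all rows."""
    X = np.asarray(features, dtype=float)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, computed for every pair at once
    sq_norms = np.sum(X * X, axis=1)
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    # Rounding can leave tiny negative values; clip before the square root
    np.maximum(sq_dist, 0.0, out=sq_dist)
    return np.sqrt(sq_dist)

# Hypothetical stand-in for df_distance: two leading columns, then features
demo = pd.DataFrame({
    'id': [10, 11, 12],
    'name': ['a', 'b', 'c'],
    'f1': [0.0, 3.0, 0.0],
    'f2': [0.0, 4.0, 1.0],
})
dist_matrix = pairwise_euclidean(demo.iloc[:, 2:])
print(dist_matrix)
```

For 13,000 rows the full matrix holds 169 million values (roughly 1.35 GB as float64), so if only the nearest few matches per row are needed, processing the rows in chunks or using a tree-based search may be the more practical choice.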
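OK, so based on the comments I think you want the 50 smallest distances for each row, and using KDTree gets them faster, in one step. As a caveat, KDTree will only be faster than brute force when 2**columns < rows, so if you have more than about 13 columns there may be faster ways to implement this, but it is probably still the simplest: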
from scipy.spatial import KDTree

# Use only the feature columns (position 2 onward), as in the question's iloc[..., 2:]
X = df_distance.iloc[:, 2:].values
X_tree = KDTree(X)
k_d, k_i = X_tree.query(X, k=50)  # shape of each is (13k, 50)
Then k_i[i] will be the list of indices of the 50 points nearest to the point at index i, where 0 <= i < 13000, and k_d[i] will be the corresponding distances.
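One detail worth checking on a small case: when the tree is queried with the same points it was built from, each point comes back as its own nearest neighbour at distance 0. The sketch below uses made-up random data (`points`, `k = 5` are illustrative choices, not values from the question) and assumes there are no duplicate rows, so the self-match is always the first column.

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
points = rng.random((200, 3))  # hypothetical data: 200 points, 3 features

tree = KDTree(points)

# Ask for k + 1 neighbours because each point is returned as its own
# nearest neighbour at distance 0
k = 5
dists, idxs = tree.query(points, k=k + 1)

# Drop the first column (the self-match) to keep the k other neighbours
neighbour_dists = dists[:, 1:]
neighbour_idxs = idxs[:, 1:]

print(neighbour_dists.shape, neighbour_idxs.shape)
```

If the self-match should be kept, as the loop in the question does when it compares row i with itself, query with `k` directly and skip the slicing.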
Edit: This should get you the DataFrame you want, using a MultiIndex:
df_d = {
    idx: {
        df_distance['id'][k_i[i, j]]: d for j, d in enumerate(k_d[i])
    } for i, idx in enumerate(df_distance['id'])
}
out = pd.DataFrame(df_d).T
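The nested dictionary produces a wide table with one row per id and one column per matched id. If the long layout from the question (`id`, `id_match`, `similarity_distance`) is preferred, one possible sketch is to flatten the two KDTree outputs directly. The `df_demo` frame, its `label` column and `k = 2` are hypothetical, chosen only to keep the example small; the question would use `k = 50` on its own data.

```python
import numpy as np
import pandas as pd
from scipy.spatial import KDTree

# Hypothetical stand-in for df_distance: 'id', one extra column, then features
df_demo = pd.DataFrame({
    'id': [101, 102, 103, 104],
    'label': ['a', 'b', 'c', 'd'],
    'x': [0.0, 1.0, 5.0, 6.0],
    'y': [0.0, 0.0, 0.0, 0.0],
})

ids = df_demo['id'].to_numpy()
features = df_demo.iloc[:, 2:].to_numpy()
k = 2  # the question would use 50

k_d, k_i = KDTree(features).query(features, k=k)

# One row per (id, neighbour) pair: repeat each id k times and flatten
# the neighbour arrays row by row so the three columns line up
df_pairs = pd.DataFrame({
    'id': np.repeat(ids, k),
    'id_match': ids[k_i.ravel()],
    'similarity_distance': k_d.ravel(),
})
print(df_pairs)
```

Building the frame once from arrays also avoids calling `append` inside a loop, which copies the whole frame on every call.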