Is there a way to speed up this function for computing K-nearest neighbors?

Question

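I'm fairly new to programming, and I wrote a function that computes the 1-nearest neighbor (KNN1) to make predictions. The problem is that the code is too slow for me to test it on the training set I actually need. My training set is roughly 1200 x 5600: 1200 data points with 5600 features each. For every row, I need to compute the sum of squared differences across all features against every other row, then pick the most similar row. The code below has been running for HOURS and still hasn't finished. I believe the part that takes forever is the distance computation (the triple loop).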

I've included a small training set from the sklearn IRIS dataset for testing.

If anyone has suggestions for speeding this up so that I can test the rest of my code in a reasonable time frame, it would be much appreciated.

from sklearn.datasets import load_iris
import numpy as np   

def l2_distance(row1, row2):
    distance = 0.0
    for i in range(len(row1)):
        #print('row one: {}'.format(row1[i]))
        #print('row two: {}'.format(row2[i]))
        distance += (row1[i] - row2[i])**2
    return np.sqrt(distance)  # sqrt was never imported; use NumPy's

def KNN1(x, y):
    # Create sum of square distances for each feature in each row
    d_arr = []
    for i in range(0,len(x)):
        d_temp = []
        for j in range(0,len(x)):
            d = l2_distance(x[i], x[j])
            d_temp.append(d)
        d_arr.append(d_temp)
        #del d_temp

    # Find the index for the first NN
    idx_arr = []
    for i in range(0,len(d_arr)):
        temp = list(d_arr[i])
        m = min(j for j in temp if j > 0)
        idx_arr.append(temp.index(m))
        del temp

    del d_arr
    # Make a prediction based off the position in y_train for the test row
    y_hat = []
    for i in range(0,len(idx_arr)):
        y_hat.append(float(y[idx_arr[i]]))
    del idx_arr
    y_hat = np.array(y_hat)
    y_hat = np.reshape(y_hat,(len(y_hat),1))
    a = np.where(y==y_hat, 1, 0)    
    accuracy = float(np.sum(a,axis=0)/float(len(a)))*100.0
    return accuracy

iris = load_iris()
xtrain2 = iris.data[:, :2]
ytrain2 = (iris.target != 0) * 1
ytrain2 = np.reshape(ytrain2, (len(ytrain2),1))

acc = KNN1(xtrain2,ytrain2)
print('Accuracy for KNN (k=1) for the base dataset:\n\t{}\n'.format(acc))

Answer 1

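As mentioned in the comments, you should consider other algorithms to speed up KNN, such as a ball tree (which works well on datasets with a large number of features) or a k-d tree. A better algorithm can reduce the time complexity considerably.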
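
For reference, here is a minimal sketch of the tree-based route using scikit-learn's NearestNeighbors. The function name knn1_tree is made up for illustration. Calling kneighbors() with no argument returns neighbors of the indexed points and does not count a point as its own neighbor. Note that this excludes only the point itself, whereas KNN1 in the question skips every zero-distance row, so results can differ when the data contains duplicates. With thousands of features a tree may not beat brute force, so it is worth timing both.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def knn1_tree(x, y, algorithm="ball_tree"):
    """Leave-one-out 1-NN accuracy in percent, using a tree index."""
    y = np.asarray(y).ravel()  # accept labels of shape (n,) or (n, 1)
    nn = NearestNeighbors(n_neighbors=1, algorithm=algorithm).fit(x)
    # With no query array, each indexed point is excluded from its own result.
    _, idx = nn.kneighbors()
    y_hat = y[idx[:, 0]]
    return float(np.mean(y_hat == y)) * 100.0
```

Passing algorithm="kd_tree" or algorithm="brute" switches the search strategy without changing the rest of the code.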

However, if you want to stick with brute-force search, the following may help:

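Since you are already using numpy, why not use scipy as well to speed up the computation? You can use scipy.spatial.distance.cdist instead of the triple loop to get the distance matrix, and numpy.argsort (scipy.argsort is a deprecated alias of it) to find the index of the first NN.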

I changed your code to this:

import numpy as np
from scipy.spatial.distance import cdist

def KNN2(x, y):
    # Pairwise Euclidean distances between all rows, computed in compiled code
    d_arr = cdist(x,x)
    # Raise the zero self-distances on the diagonal so a row does not pick itself
    d_arr += np.eye(x.shape[0])*np.max(d_arr)

    # Find the index for the first NN
    idx_arr = np.argsort(d_arr, axis=1)[:, : 1]

    # ! I don't touch this part
    # Make a prediction based off the position in y_train for the test row
    y_hat = []
    for i in range(0,len(idx_arr)):
        y_hat.append(float(y[idx_arr[i]]))
    del idx_arr
    y_hat = np.array(y_hat)
    y_hat = np.reshape(y_hat,(len(y_hat),1))
    a = np.where(y==y_hat, 1, 0)    
    accuracy = float(np.sum(a,axis=0)/float(len(a)))*100.0

    return accuracy
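
The part left untouched above still loops in Python. As a sketch only, and not the code that was timed below, the label lookup and accuracy can be vectorized too. The name knn1_vectorized is made up for illustration. It masks the diagonal with infinity instead of the maximum distance, and uses argmin because only the single closest row is needed.

```python
import numpy as np
from scipy.spatial.distance import cdist


def knn1_vectorized(x, y):
    """Leave-one-out 1-NN accuracy in percent, with no Python-level loops."""
    y = np.asarray(y).ravel()      # accept labels of shape (n,) or (n, 1)
    d = cdist(x, x)                # (n, n) matrix of Euclidean distances
    np.fill_diagonal(d, np.inf)    # a row may not pick itself
    idx = np.argmin(d, axis=1)     # closest other row; no full sort needed
    return float(np.mean(y[idx] == y)) * 100.0
```

It takes the same arguments as KNN1 and KNN2, for example knn1_vectorized(xtrain2, ytrain2). Like KNN2, it excludes only the row itself, so duplicate rows count as valid neighbors.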

Timings on my machine:

%timeit KNN1(xtrain2,ytrain2)
# 51.4 ms ± 523 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit KNN2(xtrain2,ytrain2)
# 1.24 ms ± 26.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
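
The timings above are for the small IRIS subset. To get a feel for the 1200 x 5600 case from the question, a rough sketch like the following can time the brute-force search on random data of any shape. The helper names and the random data are assumptions made for illustration, not part of the original code, and no timing result is claimed here.

```python
import timeit

import numpy as np
from scipy.spatial.distance import cdist


def nearest_other_row(x):
    """Index of the closest other row for every row of x (brute force)."""
    d = cdist(x, x)
    np.fill_diagonal(d, np.inf)    # exclude each row from its own search
    return np.argmin(d, axis=1)


def time_brute_force(n_rows, n_features, repeat=3, seed=0):
    """Best wall-clock time in seconds over `repeat` runs on random data."""
    x = np.random.default_rng(seed).standard_normal((n_rows, n_features))
    return min(timeit.repeat(lambda: nearest_other_row(x), number=1, repeat=repeat))
```

Calling time_brute_force(1200, 5600) then gives an estimate at the size of the real training set.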

A tiny-KNN I implemented can be seen here.

Solution 1
0 2020-03-02 04:29:29
