scikit-learn GridSearchCV() fit() performance improvement

Question

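I am using GridSearchCV() and its fit() method to build a model. I have it working, but I want to improve the model's accuracy by giving it more images to train on. Right now, fit() takes over an hour to complete for 500 images. Each time the number of images doubles, the processing time grows exponentially. Eventually I would like to train on thousands of images, and even include categories beyond the two in my proof of concept. I have tried several ways to improve performance but cannot figure it out. The only thing that reduces processing time is significantly lowering train_size / test_size in train_test_split(), but that defeats the purpose of training on a larger dataset. I am a bit stumped on this one. My code is below for reference. Thank you.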

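import os

import numpy as np
import pandas as pd
from skimage.io import imread
from skimage.transform import resize
from sklearn import svm
from sklearn.model_selection import GridSearchCV, train_test_split

categories = ['Cat', 'Dog']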
flat_data_arr = []
target_arr = []
datadir = 'C:\\Users\\Name\\Python\\images'

for i in categories:
    path = os.path.join(datadir, i)
    for image in os.listdir(path):
        image_array = imread(os.path.join(path, image))
        image_resized = resize(image_array, (150, 150, 3))
        flat_data_arr.append(image_resized.flatten())
        target_arr.append(categories.index(i))

flat_data = np.array(flat_data_arr)
target = np.array(target_arr)
df = pd.DataFrame(flat_data)
df['Target'] = target
x = df.iloc[:,:-1]
y = df.iloc[:,-1]
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.75, test_size=0.25, shuffle=True, stratify=y)
param_grid={'C':[0.1,1,10,100],'gamma':[0.0001,0.001,0.1,1],'kernel':['rbf','poly']}
svc=svm.SVC(probability=True)
model=GridSearchCV(svc,param_grid)
model.fit(x_train,y_train) #this takes hours depending on number of images

Answer 1

It is probably best to use TensorFlow, Keras, or PyTorch for computer vision, with GPUs on top; this will run in milliseconds to seconds... even without a GPU you will see a significant speedup.

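However, if you decide to stay with this approach, you can try the following (basically reducing dimensions and adding features):

Supporting libraries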

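import os

from PIL import Image

import numpy as np

# rgb2gray is the current spelling; the rgb2grey alias was removed in scikit-image 0.19
from skimage.color import rgb2gray
from skimage.feature import hog
from skimage.io import imread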

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

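You do not need the train/test split: use all of your data for the grid search, since it does cross-validation anyway... Also, I am not sure why, but I see you jump from np.array to pandas; I think you should be able to go with the NumPy matrix directly.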

Make sure you are using all cores/processors; the parameter n_jobs=-1 in the grid search call should do that...
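To make the two points above concrete, here is a minimal sketch of a grid search run directly on NumPy arrays with n_jobs=-1. The data is a small synthetic stand-in (an assumption, not the question's images), and the grid is deliberately smaller than the one in the question. For scale: the question's grid has 4 x 4 x 2 = 32 combinations, which at the default 5-fold cross-validation means 160 fits plus one final refit. Note also that SVC(probability=True) runs an extra internal cross-validation on every fit, so leaving it off during the search is likely to help.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the flattened image matrix
# (assumption: 60 samples, 20 features, two balanced classes).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))
y = np.repeat([0, 1], 30)
X[y == 1] += 1.0  # shift one class so there is something to learn

# A reduced grid for illustration; the values are not tuned recommendations.
param_grid = {"C": [0.1, 1, 10], "gamma": [0.001, 0.01], "kernel": ["rbf"]}

# n_jobs=-1 asks scikit-learn to run the fits on all available cores.
# cv=5 is already the default; it is spelled out here for clarity.
# probability=True is omitted on purpose: it adds an internal
# cross-validation to every single fit.
search = GridSearchCV(SVC(), param_grid, cv=5, n_jobs=-1)
search.fit(X, y)  # plain NumPy arrays, no DataFrame needed

best_params = search.best_params_
best_score = search.best_score_  # mean cross-validated accuracy
```

If class probabilities are needed in the end, one option is to refit a single SVC(probability=True) with best_params once the search is done.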

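Then you can also reduce the image size further, for example 100 x 100 instead of 150 x 150.

You can also convert the images to greyscale (so each image has 1 channel instead of 3).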

# rgb2gray is the current spelling; the rgb2grey alias was removed in scikit-image 0.19
grey_scaled = rgb2gray(imread(os.path.join(path, image)))
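A rough sketch of how much the resizing and greyscale steps shrink each sample. The random array stands in for the result of imread(...) (an assumption, so the snippet runs without image files), and the variable names are illustrative.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.transform import resize

# Stand-in for imread(...): a random 200 x 300 RGB image with values in [0, 1).
rng = np.random.default_rng(0)
image = rng.random((200, 300, 3))

# Resize to 100 x 100 while keeping the 3 colour channels.
small = resize(image, (100, 100, 3))

# Collapse the colour channels into a single luminance channel.
grey = rgb2gray(small)

features_150_rgb = 150 * 150 * 3   # per-image features in the question's setup
features_100_rgb = small.size      # after resizing only
features_100_grey = grey.size      # after resizing and greyscale
```

With these sizes, each image goes from 67,500 features (150 x 150 x 3) to 10,000 (100 x 100), which reduces the work of every kernel evaluation in the SVM.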

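If you are interested in experimenting, you could try using the HOG features of your grey_scaled image, on top of the preprocessing from the earlier steps.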

hog_features = hog(grey_scaled, block_norm='L2-Hys', pixels_per_cell=(10,10))

You could even try stacking the raw image and the HOG features together.

color_features = imread(os.path.join(path, image)).flatten()
final_features = np.hstack((color_features,hog_features))

Loop over all the images, append the output of this pipeline to a list (say "final_features_list"), and convert it to a matrix: matrix = np.array(final_features_list)
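The loop described above could be sketched roughly as follows. The helper name extract_features and the random stand-in images are assumptions for illustration. One deliberate deviation from the snippets above: the colour features are taken from the resized image rather than from the raw imread output, because images of different sizes would otherwise produce rows of different lengths and np.array could not build a 2-D matrix.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import hog
from skimage.transform import resize


def extract_features(image, size=(100, 100)):
    """Return resized colour pixels stacked with HOG features (hypothetical helper)."""
    small = resize(image, size + (3,))           # same shape for every image
    grey_scaled = rgb2gray(small)                # single channel for HOG
    hog_features = hog(grey_scaled, block_norm="L2-Hys", pixels_per_cell=(10, 10))
    color_features = small.flatten()
    return np.hstack((color_features, hog_features))


# Stand-ins for imread(...) results: four RGB images of differing heights.
rng = np.random.default_rng(0)
images = [rng.random((120 + 10 * i, 160, 3)) for i in range(4)]

final_features_list = [extract_features(img) for img in images]
matrix = np.array(final_features_list)  # shape: (n_images, n_features)
```

In real use, the list comprehension would iterate over imread(os.path.join(path, image)) for each file, with the label appended to a parallel list as in the question.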

With this many features, you can probably reduce the dimensionality. So standard-scale the data and run PCA.


standard_sc = StandardScaler()

matrix_scaled = standard_sc.fit_transform(matrix)

### read up on how to select # of components
### there are methods to help you with that
pca = PCA(n_components=300)
matrix_scaled_pca = pca.fit_transform(matrix_scaled)
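As one possible way to pick the number of components, PCA accepts a float between 0 and 1 for n_components and then keeps just enough components to explain that fraction of the variance. The sketch below uses synthetic low-rank data (an assumption) and a 95% threshold chosen purely for illustration. Keep in mind that an integer n_components cannot exceed min(n_samples, n_features), so 300 only works with at least 300 images.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix: 200 samples, 50 features
# generated from 5 latent factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 50))
matrix = latent @ mixing + 0.01 * rng.normal(size=(200, 50))

matrix_scaled = StandardScaler().fit_transform(matrix)

# A float n_components means "keep enough components to explain this
# fraction of the variance"; this mode uses the full SVD solver.
pca = PCA(n_components=0.95, svd_solver="full")
matrix_scaled_pca = pca.fit_transform(matrix_scaled)

n_kept = pca.n_components_                        # components actually kept
explained = pca.explained_variance_ratio_.sum()   # at least 0.95 by construction
```

Plotting the cumulative sum of explained_variance_ratio_ from a full PCA is another common way to choose the cutoff by eye.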

Then try running your grid search again with the matrix_scaled_pca matrix... it should go faster. You could also try RandomizedSearchCV.
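A minimal sketch of the RandomizedSearchCV suggestion, under several assumptions: synthetic data, illustrative parameter ranges, and a Pipeline that wraps the scaler and PCA so they are refit inside each cross-validation fold (a design choice not made in the answer above, which transforms the full matrix once up front). Unlike a grid, the number of candidates is fixed by n_iter, so the cost no longer multiplies with every extra parameter value.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.decomposition import PCA
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the image features (assumption: 80 samples, 30 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 30))
y = np.repeat([0, 1], 40)
X[y == 1] += 0.8

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=10)),
    ("svc", SVC(kernel="rbf")),
])

# Sample C and gamma on a log scale; the ranges are illustrative only.
param_distributions = {
    "svc__C": loguniform(1e-1, 1e2),
    "svc__gamma": loguniform(1e-4, 1e0),
}

# n_iter fixes the number of sampled candidates (8 here, versus 32 grid points
# in the question), and n_jobs=-1 spreads the fits over all cores.
search = RandomizedSearchCV(
    pipe, param_distributions, n_iter=8, cv=3, n_jobs=-1, random_state=0
)
search.fit(X, y)

n_candidates = len(search.cv_results_["params"])
best_score = search.best_score_
```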
