What is the difference between these two ways of updating the hypothesis during stochastic gradient descent?

Question

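I have a question about updating theta during stochastic GD. I have two ways to update theta:

1) Use the previous theta to compute the hypotheses for all samples, then update theta sample by sample. Like: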

hypothese = np.dot(X, theta)
for i in range(0, m):
    theta = theta + alpha * (y[i] - hypothese[i]) * X[i]

2) Another way: while scanning the samples, compute the hypothesis for sample i using the latest theta. Like:

for i in range(0, m):
    h = np.dot(X[i], theta)
    theta = theta + alpha * (y[i] - h) * X[i]

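I checked some SGD code, and it seems the second way is the correct one. But in my own code, the first one converges faster and gives a better result than the second. Why would the wrong way perform better than the correct way?

I have also attached the complete code below: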
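For reference, the two loops can be restated as update rules. This only rewrites the snippets above in equation form; the symbols $x_i$ (the row X[i]), $y_i$, $\alpha$, and $\theta_{\text{old}}$ (the value of theta at the start of the pass over the samples) are notation assumed here and do not appear in the original code.

```latex
\begin{align*}
\text{Method 1:}\quad & \theta \leftarrow \theta + \alpha\,\bigl(y_i - x_i^{\top}\theta_{\text{old}}\bigr)\,x_i, \qquad i = 0,\dots,m-1,\\
\text{Method 2:}\quad & \theta \leftarrow \theta + \alpha\,\bigl(y_i - x_i^{\top}\theta\bigr)\,x_i, \qquad i = 0,\dots,m-1.
\end{align*}
```

The only difference is which theta is used inside the residual: the one frozen at the start of the pass, or the one just updated by the previous sample.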

import numpy as np

# X (m x n samples) and y (m targets) are assumed to be defined at module level
def SGD_method1():
    maxIter = 100  # max iterations
    alpha = 1e4  # learning rate
    m, n = np.shape(X)  # X[m,n], m: #samples, n: #features
    theta = np.zeros(n)  # initial theta
    for iter in range(0, maxIter):
        hypothese = np.dot(X, theta)  # compute all hypotheses using the same theta
        for i in range(0, m):
            theta = theta + alpha * (y[i] - hypothese[i]) * X[i]
    return theta

def SGD_method2():
    maxIter = 100  # max iterations
    alpha = 1e4  # learning rate
    m, n = np.shape(X)  # X[m,n], m: #samples, n: #features
    theta = np.zeros(n)  # initial theta
    for iter in range(0, maxIter):
        for i in range(0, m):
            h = np.dot(X[i], theta)  # compute one hypothesis using the latest theta
            theta = theta + alpha * (y[i] - h) * X[i]
    return theta

Answer 1

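The first code is not SGD. It is "traditional" (batch) gradient descent: because every hypothese[i] is computed from the same theta, the per-sample updates in the inner loop add up to a single step along the gradient of the full error. The stochasticity in SGD comes from basing each update on a gradient computed for one sample (or a small batch, which is called mini-batch SGD). Obviously this is not the correct gradient of the error function (which is a sum of errors over all training samples), but one can prove that under reasonable conditions such a process converges to a local optimum. Stochastic updates are preferable in many applications because of their simplicity and (in many cases) cheaper computation. Both algorithms are correct (both are guaranteed, under reasonable assumptions, to converge to a local optimum), and the choice of strategy depends on the particular problem (especially its size and other requirements).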
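The claim that the first method is batch gradient descent can be checked numerically. The following is a minimal sketch, not the asker's actual setup: the synthetic data, the step size alpha = 0.01, and the helper names method1_pass, method2_pass, and batch_gd_step are all assumptions made for illustration.

```python
import numpy as np

def method1_pass(X, y, theta, alpha):
    """One pass of method 1: hypotheses are computed once from the incoming theta."""
    hypothese = X @ theta
    for i in range(X.shape[0]):
        theta = theta + alpha * (y[i] - hypothese[i]) * X[i]
    return theta

def method2_pass(X, y, theta, alpha):
    """One pass of method 2 (SGD): each sample uses the latest theta."""
    for i in range(X.shape[0]):
        h = X[i] @ theta
        theta = theta + alpha * (y[i] - h) * X[i]
    return theta

def batch_gd_step(X, y, theta, alpha):
    """A single batch step along the summed per-sample update direction."""
    return theta + alpha * X.T @ (y - X @ theta)

# Synthetic, noise-free linear data (assumed for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5])
theta0 = np.zeros(3)
alpha = 0.01

t1 = method1_pass(X, y, theta0, alpha)
t2 = method2_pass(X, y, theta0, alpha)
tb = batch_gd_step(X, y, theta0, alpha)

same_as_batch = np.allclose(t1, tb)         # method 1 matches one batch step
differs_from_sgd = not np.allclose(t1, t2)  # method 2 follows a different path
print(same_as_batch, differs_from_sgd)
```

Note that method 1 sums the per-sample terms rather than averaging them, so its effective step grows with the number of samples; which method looks better on a given dataset therefore depends heavily on the chosen learning rate.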

Solution 1
0 Accepted 2014-05-29 19:58:11

在隨機梯度下降過程中，這兩種更新的假設方法之間有何區別？

問題描述

1 個解決方案

解決方案1 0 已采納 2014-05-29 19:58:11

解決方案1
0 已采納 2014-05-29 19:58:11