
400% higher error with PyTorch compared with identical Keras model (with Adam optimizer)


TLDR:

A simple (single hidden-layer) feed-forward PyTorch model trained to predict the function y = sin(X1) + sin(X2) + ... + sin(X10) substantially underperforms an identical model built and trained with Keras. Why is this so, and what can be done to mitigate the difference in performance?


In training a regression model, I noticed that PyTorch drastically underperforms an identical model built with Keras.

This phenomenon has been observed and reported previously:

The following explanations and suggestions have been made previously as well:

  1. Using the same decimal precision (32 vs 64): 1, 2

  2. Using a CPU instead of a GPU: 1, 2

  3. Changing retain_graph=True to create_graph=True when computing the 2nd derivative with autograd.grad: 1

  4. Checking whether Keras is using a regularizer, constraint, bias, or loss function in a different way from PyTorch: 1, 2

  5. Ensuring the validation loss is computed in the same way: 1

  6. Using the same initialization routine: 1, 2 (see the sketch after this list)

  7. Training the PyTorch model for more epochs: 1

  8. Trying several random seeds: 1

  9. Ensuring that model.eval() is called in the validation step when training the PyTorch model: 1

  10. The main issue is with the Adam optimizer, not the initialization: 1
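
For item 6, it helps to spell out what "the same initialization routine" means in code. Keras's Dense layer defaults to glorot_uniform for the kernel and zeros for the bias, which correspond to torch.nn.init.xavier_uniform_ and torch.nn.init.zeros_ in PyTorch. A minimal sketch (the helper name init_like_keras is mine):

import torch.nn as nn

def init_like_keras(layer: nn.Linear) -> None:
    # Keras Dense defaults: kernel_initializer='glorot_uniform', bias_initializer='zeros'
    nn.init.xavier_uniform_(layer.weight)
    nn.init.zeros_(layer.bias)

init_like_keras(nn.Linear(10, 100))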

To understand this issue, I trained a simple two-layer neural network (much simpler than my original model) in Keras and PyTorch, using the same hyperparameters and initialization routines, and following all the recommendations listed above. However, the PyTorch model results in a mean squared error (MSE) that is 400% higher than the MSE of the Keras model.

Here is my code:

0. Imports

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

from sklearn.preprocessing import MinMaxScaler
from sklearn import metrics

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.regularizers import L2
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

1. Generate a reproducible dataset


def get_data():

    np.random.seed(0)
    Xtrain = np.random.normal(0, 1, size=(7000,10))
    Xval = np.random.normal(0, 1, size=(700,10))
    ytrain = np.sum(np.sin(Xtrain), axis=-1)
    yval = np.sum(np.sin(Xval), axis=-1)
    scaler = MinMaxScaler()
    ytrain = scaler.fit_transform(ytrain.reshape(-1,1)).reshape(-1)
    yval = scaler.transform(yval.reshape(-1,1)).reshape(-1) 

    return Xtrain, Xval, ytrain, yval



class XYData(Dataset):
    
    def __init__(self, X, y):
        
        super(XYData, self).__init__()
        self.X = torch.tensor(X, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.float32)
        self.len = len(y)
         
    def __getitem__(self, index):
        
        return (self.X[index], self.y[index])


    def __len__(self):

        return self.len
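
As a side note, an equivalent built-in alternative to this custom Dataset (assuming float32 tensors as above) is torch.utils.data.TensorDataset, sketched here with dummy data of the same shapes:

import numpy as np
import torch
from torch.utils.data import TensorDataset

X, y = np.random.normal(size=(7000, 10)), np.random.normal(size=7000)
dataset = TensorDataset(torch.tensor(X, dtype=torch.float32),
                        torch.tensor(y, dtype=torch.float32))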

# Data, dataset, and dataloader
Xtrain, Xval, ytrain, yval = get_data()
traindata = XYData(Xtrain, ytrain)
valdata = XYData(Xval, yval)
trainloader = DataLoader(dataset=traindata, shuffle=True, batch_size=32, drop_last=False)
valloader = DataLoader(dataset=valdata, shuffle=True, batch_size=32, drop_last=False)
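
A quick shape check on one batch (added here as a sanity check) confirms what the training loop will see; note that y comes out 1-D:

xb, yb = next(iter(trainloader))
print(xb.shape, yb.shape)  # torch.Size([32, 10]) torch.Size([32])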

2. Build Keras and PyTorch models with identical hyperparameters and initialization methods

class TorchLinearModel(nn.Module):
    
    def __init__(self, input_dim=10, random_seed=0):
        
        super(TorchLinearModel, self).__init__()
        _ = torch.manual_seed(random_seed)
        self.hidden_layer = nn.Linear(input_dim,100)
        self.initialize_layer(self.hidden_layer)        
        self.output_layer = nn.Linear(100, 1)
        self.initialize_layer(self.output_layer)

    def initialize_layer(self, layer):
        
        _ = torch.nn.init.xavier_normal_(layer.weight)
        #_ = torch.nn.init.xavier_uniform_(layer.weight)
        _ = torch.nn.init.constant_(layer.bias, 0)
        
    def forward(self, x):
        x = self.hidden_layer(x)
        x = self.output_layer(x)
        return x




def mean_squared_error(ytrue, ypred):
    
    return torch.mean(((ytrue - ypred) ** 2))


def build_torch_model():

    torch_model = TorchLinearModel()
    optimizer = optim.Adam(torch_model.parameters(), 
                           betas=(0.9,0.9999),
                           eps=1e-7,
                           lr=1e-3,
                           weight_decay=0)
    return torch_model, optimizer




def build_keras_model():
    
    x = layers.Input(shape=(10,))
    z = layers.Dense(units=100, activation=None, use_bias=True, kernel_regularizer=None, 
                     bias_regularizer=None)(x)
    y = layers.Dense(units=1, activation=None, use_bias=True, kernel_regularizer=None, 
                     bias_regularizer=None)(z)
    keras_model = Model(x, y, name='linear')
    optimizer = Adam(learning_rate=1e-3, beta_1=0.9, beta_2=0.9999, epsilon=1e-7, 
                     amsgrad=False)
    
    keras_model.compile(optimizer=optimizer, loss='mean_squared_error')
    
    return keras_model




# Instantiate models
torch_model, optimizer = build_torch_model()
keras_model = build_keras_model()
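
As an illustrative parity check (not strictly necessary), both models should report the same number of trainable parameters: 10*100 + 100 = 1,100 in the hidden layer plus 100 + 1 = 101 in the output layer, i.e. 1,201 in total:

n_torch = sum(p.numel() for p in torch_model.parameters() if p.requires_grad)
n_keras = keras_model.count_params()
print(n_torch, n_keras)  # 1201 1201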

3. Train PyTorch model for 100 epochs:


torch_trainlosses, torch_vallosses = [], []

for epoch in range(100):

    # Training
    losses = []
    _ = torch_model.train()
    
    for i, (x,y) in enumerate(trainloader):
        optimizer.zero_grad()                          
        ypred = torch_model(x)
        loss = mean_squared_error(y, ypred) 
        _ = loss.backward()
        _ = optimizer.step()
        losses.append(loss.item())
    torch_trainlosses.append(np.mean(losses))
    
    # Validation
    losses = []
    _ = torch_model.eval()

    with torch.no_grad():
        for i, (x, y) in enumerate(valloader):
            ypred = torch_model(x)
            loss = mean_squared_error(y, ypred) 
            losses.append(loss.item())
    torch_vallosses.append(np.mean(losses))
    
    print(f"epoch={epoch+1}, train_loss={torch_trainlosses[-1]:.4f}, val_loss={torch_vallosses[-1]:.4f}")
    

4. Train Keras model for 100 epochs:

history = keras_model.fit(Xtrain, ytrain, sample_weight=None, batch_size=32, epochs=100, 
                    validation_data=(Xval, yval))

5. Loss in training history

plt.plot(torch_trainlosses, color='blue', label='PyTorch Train')    
plt.plot(torch_vallosses, color='blue', linestyle='--', label='PyTorch Val')  
plt.plot(history.history['loss'], color='brown', label='Keras Train')
plt.plot(history.history['val_loss'], color='brown', linestyle='--', label='Keras Val')
plt.legend()

[Image: training/validation loss curves for PyTorch and Keras]

Keras records a much lower error during training. Since this may be due to a difference in how Keras computes the loss, I calculated the prediction error on the validation set with sklearn.metrics.mean_squared_error.

6. Validation error after training

ypred_keras = keras_model.predict(Xval).reshape(-1)
ypred_torch = torch_model(torch.tensor(Xval, dtype=torch.float32))
ypred_torch = ypred_torch.detach().numpy().reshape(-1)


mse_keras = metrics.mean_squared_error(yval, ypred_keras)
mse_torch = metrics.mean_squared_error(yval, ypred_torch)
print('Percent error difference:', (mse_torch / mse_keras - 1) * 100) 

r_keras = pearsonr(yval, ypred_keras)[0] 
r_pytorch = pearsonr(yval, ypred_torch)[0]  
print("r_keras:", r_keras)
print("r_pytorch:", r_pytorch)

plt.scatter(ypred_keras, yval); plt.title('Keras'); plt.show(); plt.close()
plt.scatter(ypred_torch, yval); plt.title('Pytorch'); plt.show(); plt.close()
Percent error difference: 479.1312469426776
r_keras: 0.9115184443702814
r_pytorch: 0.21728812737220082

[Images: scatter plots of predicted vs. true values for Keras and for PyTorch]

The correlation of predicted values with ground truth is 0.912 for Keras but only 0.217 for PyTorch, and the error for PyTorch is 479% higher!

7. Other things I tried:

  • Lowering the learning rate for PyTorch (lr=1e-4): R increases from 0.217 to 0.576, but it's still much worse than Keras (0.912).
  • Increasing the learning rate for PyTorch (lr=1e-2): R is worse, at 0.095.
  • Training numerous times with different random seeds: the performance is roughly the same regardless.
  • Training for longer than 100 epochs: no improvement was observed.
  • Using torch.nn.init.xavier_uniform_ instead of torch.nn.init.xavier_normal_ to initialize the weights: R improves from 0.217 to 0.639, but it's still worse than Keras (0.912).

What can be done to ensure that the PyTorch model converges to a reasonable error comparable with the Keras model?


The problem here is unintentional broadcasting in the PyTorch training loop.
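
Concretely, here is a minimal sketch of the pitfall (my illustration, using the batch size of 32 from the question):

import torch

ytrue = torch.zeros(32)      # target batch, shape [32]
ypred = torch.zeros(32, 1)   # nn.Linear output, shape [32, 1]
diff = ytrue - ypred         # broadcasts (1, 32) against (32, 1) -> (32, 32)
print(diff.shape)            # torch.Size([32, 32])
# torch.mean then averages 1024 mostly-meaningless pairwise differences,
# silently corrupting the loss and its gradients.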

The result of a nn.Linear operation always has shape [B,D], where B is the batch size and D is the output dimension. Therefore, in your mean_squared_error function, ypred has shape [32,1] and ytrue has shape [32]. By the broadcasting rules used by NumPy and PyTorch, this means that ytrue - ypred has shape [32,32]. What you almost certainly meant is for ypred to have shape [32]. This can be accomplished in many ways; probably the most readable is to use Tensor.flatten:

class TorchLinearModel(nn.Module):
    ...
    def forward(self, x):
        x = self.hidden_layer(x)
        x = self.output_layer(x)
        return x.flatten()

which produces the following train/val curves:

[Image: train/val loss curves after the fix]
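
Equivalent fixes (my additions, not from the original answer) reshape one operand at loss time instead; anything that makes the two shapes agree works:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)
x = torch.randn(32, 10)
y = torch.randn(32)

# Option A: squeeze the prediction down to shape [32] to match the target
loss_a = torch.mean((y - model(x).squeeze(-1)) ** 2)

# Option B: give the target an explicit trailing dimension, shape [32, 1]
loss_b = torch.mean((y.unsqueeze(-1) - model(x)) ** 2)

print(loss_a.item(), loss_b.item())  # identical values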
