400% higher error with PyTorch compared with identical Keras model (with Adam optimizer)
TL;DR:
A simple (single hidden-layer) feed-forward PyTorch model trained to predict the function y = sin(X1) + sin(X2) + ... + sin(X10) substantially underperforms an identical model built/trained with Keras. Why is this happening, and what can be done to mitigate the performance difference?
While training a regression model, I noticed that PyTorch drastically underperforms an identical model built with Keras.
This phenomenon has been observed and reported before, and the following explanations and suggestions have also been offered:

- Set `retain_graph=True` and `create_graph=True` when computing second derivatives with `autograd.grad`
- Make sure you compute the validation loss in the same way
- Train the PyTorch model for more epochs
- Try several random seeds
- Make sure `model.eval()` is called during the validation step when training the PyTorch model
- The main problem lies with the Adam optimizer, not with the initialization
To get to the bottom of this, I trained a simple two-layer neural network (much simpler than my original model) in both Keras and PyTorch, using the same hyperparameters and initialization routines, and following all the recommendations listed above. However, the PyTorch model results in a mean squared error (MSE) that is 400% higher than the MSE of the Keras model.
Here is my code:
0. Imports
import numpy as np
from scipy.stats import pearsonr
from sklearn.preprocessing import MinMaxScaler
from sklearn import metrics
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.regularizers import L2
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
1. Generate a reproducible dataset
def get_data():
    np.random.seed(0)
    Xtrain = np.random.normal(0, 1, size=(7000, 10))
    Xval = np.random.normal(0, 1, size=(700, 10))
    ytrain = np.sum(np.sin(Xtrain), axis=-1)
    yval = np.sum(np.sin(Xval), axis=-1)
    # Scale targets to [0, 1] using statistics from the training set only
    scaler = MinMaxScaler()
    ytrain = scaler.fit_transform(ytrain.reshape(-1, 1)).reshape(-1)
    yval = scaler.transform(yval.reshape(-1, 1)).reshape(-1)
    return Xtrain, Xval, ytrain, yval
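As a quick sanity check on the scaling step above, the training targets should land exactly in [0, 1] after fitting. A minimal NumPy-only sketch that mimics what `MinMaxScaler.fit_transform` does on the 1-D target:

```python
import numpy as np

np.random.seed(0)
ytrain = np.sum(np.sin(np.random.normal(0, 1, size=(7000, 10))), axis=-1)

# Min-max scaling with statistics taken from the training targets themselves
lo, hi = ytrain.min(), ytrain.max()
ytrain_scaled = (ytrain - lo) / (hi - lo)
print(ytrain_scaled.min(), ytrain_scaled.max())  # 0.0 1.0
```

Note that `yval`, which is transformed with the training-set statistics, may fall slightly outside [0, 1]; that is expected behavior for min-max scaling.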
class XYData(Dataset):
    def __init__(self, X, y):
        super(XYData, self).__init__()
        self.X = torch.tensor(X, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.float32)
        self.len = len(y)

    def __getitem__(self, index):
        return (self.X[index], self.y[index])

    def __len__(self):
        return self.len
# Data, dataset, and dataloader
Xtrain, Xval, ytrain, yval = get_data()
traindata = XYData(Xtrain, ytrain)
valdata = XYData(Xval, yval)
trainloader = DataLoader(dataset=traindata, shuffle=True, batch_size=32, drop_last=False)
valloader = DataLoader(dataset=valdata, shuffle=True, batch_size=32, drop_last=False)
2. Build identical Keras and PyTorch models with the same hyperparameters and initialization methods
class TorchLinearModel(nn.Module):
    def __init__(self, input_dim=10, random_seed=0):
        super(TorchLinearModel, self).__init__()
        _ = torch.manual_seed(random_seed)
        self.hidden_layer = nn.Linear(input_dim, 100)
        self.initialize_layer(self.hidden_layer)
        self.output_layer = nn.Linear(100, 1)
        self.initialize_layer(self.output_layer)

    def initialize_layer(self, layer):
        _ = torch.nn.init.xavier_normal_(layer.weight)
        #_ = torch.nn.init.xavier_uniform_(layer.weight)
        _ = torch.nn.init.constant_(layer.bias, 0)

    def forward(self, x):
        x = self.hidden_layer(x)
        x = self.output_layer(x)
        return x
def mean_squared_error(ytrue, ypred):
    return torch.mean((ytrue - ypred) ** 2)
def build_torch_model():
    torch_model = TorchLinearModel()
    optimizer = optim.Adam(torch_model.parameters(),
                           betas=(0.9, 0.9999),
                           eps=1e-7,
                           lr=1e-3,
                           weight_decay=0)
    return torch_model, optimizer
def build_keras_model():
    x = layers.Input(shape=10)
    z = layers.Dense(units=100, activation=None, use_bias=True,
                     kernel_regularizer=None, bias_regularizer=None)(x)
    y = layers.Dense(units=1, activation=None, use_bias=True,
                     kernel_regularizer=None, bias_regularizer=None)(z)
    keras_model = Model(x, y, name='linear')
    optimizer = Adam(learning_rate=1e-3, beta_1=0.9, beta_2=0.9999,
                     epsilon=1e-7, amsgrad=False)
    keras_model.compile(optimizer=optimizer, loss='mean_squared_error')
    return keras_model
# Instantiate models
torch_model, optimizer = build_torch_model()
keras_model = build_keras_model()
3. Train the PyTorch model for 100 epochs:
torch_trainlosses, torch_vallosses = [], []
for epoch in range(100):
    # Training
    losses = []
    _ = torch_model.train()
    for i, (x, y) in enumerate(trainloader):
        optimizer.zero_grad()
        ypred = torch_model(x)
        loss = mean_squared_error(y, ypred)
        _ = loss.backward()
        _ = optimizer.step()
        losses.append(loss.item())
    torch_trainlosses.append(np.mean(losses))

    # Validation
    losses = []
    _ = torch_model.eval()
    with torch.no_grad():
        for i, (x, y) in enumerate(valloader):
            ypred = torch_model(x)
            loss = mean_squared_error(y, ypred)
            losses.append(loss.item())
    torch_vallosses.append(np.mean(losses))

    print(f"epoch={epoch+1}, train_loss={torch_trainlosses[-1]:.4f}, val_loss={torch_vallosses[-1]:.4f}")
4. Train the Keras model for 100 epochs:
history = keras_model.fit(Xtrain, ytrain, sample_weight=None, batch_size=32,
                          epochs=100, validation_data=(Xval, yval))
5. Plot the training losses
plt.plot(torch_trainlosses, color='blue', label='PyTorch Train')
plt.plot(torch_vallosses, color='blue', linestyle='--', label='PyTorch Val')
plt.plot(history.history['loss'], color='brown', label='Keras Train')
plt.plot(history.history['val_loss'], color='brown', linestyle='--', label='Keras Val')
plt.legend()
The error Keras records during training is much lower. Since this could be due to a difference in how Keras computes the loss, I computed the prediction error on the validation set with sklearn.metrics.mean_squared_error.
6. Validation error after training
ypred_keras = keras_model.predict(Xval).reshape(-1)
ypred_torch = torch_model(torch.tensor(Xval, dtype=torch.float32))
ypred_torch = ypred_torch.detach().numpy().reshape(-1)
mse_keras = metrics.mean_squared_error(yval, ypred_keras)
mse_torch = metrics.mean_squared_error(yval, ypred_torch)
print('Percent error difference:', (mse_torch / mse_keras - 1) * 100)
r_keras = pearsonr(yval, ypred_keras)[0]
r_pytorch = pearsonr(yval, ypred_torch)[0]
print("r_keras:", r_keras)
print("r_pytorch:", r_pytorch)
plt.scatter(ypred_keras, yval); plt.title('Keras'); plt.show(); plt.close()
plt.scatter(ypred_torch, yval); plt.title('Pytorch'); plt.show(); plt.close()
Percent error difference: 479.1312469426776
r_keras: 0.9115184443702814
r_pytorch: 0.21728812737220082
Keras's predictions correlate with the ground truth at r = 0.912, but PyTorch's only at r = 0.217, and PyTorch's error is 479% higher!
7. Other trials
I also tried using `torch.nn.init.xavier_uniform_` instead of `torch.nn.init.xavier_normal_`. The correlation improved from 0.217 to 0.639, but that is still worse than Keras (0.912). What can be done to make the PyTorch model converge to a reasonable error, comparable to the Keras model?
The problem here is unintentional broadcasting in the PyTorch training loop.

The result of an `nn.Linear` operation always has shape `[B, D]`, where `B` is the batch size and `D` is the output dimension. Therefore, in your `mean_squared_error` function, `ypred` has shape `[32, 1]` while `ytrue` has shape `[32]`. By the broadcasting rules used by NumPy and PyTorch, this means that `ytrue - ypred` has shape `[32, 32]`. What you almost certainly meant is for `ypred` to have shape `[32]`. This can be accomplished in a number of ways; probably the most readable is to use `Tensor.flatten`:
class TorchLinearModel(nn.Module):
    ...
    def forward(self, x):
        x = self.hidden_layer(x)
        x = self.output_layer(x)
        return x.flatten()
This produces the following train/val curves:
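The shape mismatch described above can be reproduced in a few lines. A minimal sketch using NumPy, whose broadcasting rules match PyTorch's here (the shapes stand in for a batch of targets and the output of the final `nn.Linear`):

```python
import numpy as np

ytrue = np.zeros(32)         # targets from the DataLoader: shape [32]
ypred = np.zeros((32, 1))    # output of the final linear layer: shape [32, 1]

# Broadcasting aligns [32] against [32, 1] to produce a [32, 32] matrix,
# so the "MSE" averages 1024 pairwise differences instead of 32.
print((ytrue - ypred).shape)            # (32, 32)

# Flattening the prediction restores the intended elementwise difference
print((ytrue - ypred.flatten()).shape)  # (32,)
```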