
Why does a GPflow model not seem to learn anything with TensorFlow optimizers such as tf.optimizers.Adam?

My inducing points are set to trainable but do not change when I call opt.minimize(). Why is that, and what does it mean? Does it mean the model is not learning? What is the difference between tf.optimizers.Adam(lr) and gpflow.optimizers.Scipy?

The following is a simple classification example adapted from the documentation. When I run this code with gpflow's Scipy optimizer, I get trained results and the values of the inducing variables change during optimization. But when I use the Adam optimizer, I get only a straight-line prediction and the values of the inducing points remain the same, which suggests the model is not learning with the Adam optimizer.

[plot of data before training]

[plot of data after training with Adam]

[plot of data after training with gpflow's Scipy optimizer]

The link for the example is https://gpflow.readthedocs.io/en/develop/notebooks/advanced/multiclass_classification.html

import numpy as np
import tensorflow as tf


import warnings
warnings.filterwarnings('ignore')  # ignore DeprecationWarnings from tensorflow

import matplotlib.pyplot as plt

import gpflow

from gpflow.utilities import print_summary, set_trainable
from gpflow.ci_utils import ci_niter

from tensorflow2_work.multiclass_classification import plot_posterior_predictions, colors  # local helper module, not part of gpflow

np.random.seed(0)  # reproducibility

# Number of functions and number of data points
C = 3
N = 100

# RBF kernel lengthscale
lengthscale = 0.1

# Jitter
jitter_eye = np.eye(N) * 1e-6

# Input
X = np.random.rand(N, 1)

kernel_se = gpflow.kernels.SquaredExponential(lengthscales=lengthscale)
K = kernel_se(X) + jitter_eye

# Latents prior sample
f = np.random.multivariate_normal(mean=np.zeros(N), cov=K, size=(C)).T

# Hard max observation
Y = np.argmax(f, 1).reshape(-1,).astype(int)
print(Y.shape)

# One-hot encoding
Y_hot = np.zeros((N, C), dtype=bool)
Y_hot[np.arange(N), Y] = 1

data = (X, Y)

plt.figure(figsize=(12, 6))
order = np.argsort(X.reshape(-1,))
print(order.shape)

for c in range(C):
    plt.plot(X[order], f[order, c], '.', color=colors[c], label=str(c))
    plt.plot(X[order], Y_hot[order, c], '-', color=colors[c])


plt.legend()
plt.xlabel('$X$')
plt.ylabel('Latent (dots) and one-hot labels (lines)')
plt.title(r'Sample from the joint $p(Y, \mathbf{f})$')
plt.grid()
plt.show()


# sum kernel: Matern32 + White
kernel = gpflow.kernels.Matern32() + gpflow.kernels.White(variance=0.01)

# Robustmax Multiclass Likelihood
invlink = gpflow.likelihoods.RobustMax(C)  # Robustmax inverse link function
likelihood = gpflow.likelihoods.MultiClass(C, invlink=invlink)  # Multiclass likelihood
Z = X[::5].copy()  # inducing inputs
#print(Z)

m = gpflow.models.SVGP(kernel=kernel, likelihood=likelihood,
    inducing_variable=Z, num_latent_gps=C, whiten=True, q_diag=True)

# Explicitly mark the White-kernel variance and the inducing inputs as trainable
# (they are trainable by default anyway)
set_trainable(m.kernel.kernels[1].variance, True)
set_trainable(m.inducing_variable, True)
print(m.inducing_variable.Z)
print_summary(m)


opt = tf.optimizers.Adam(learning_rate=0.001)  # the Adam case; the working run used opt = gpflow.optimizers.Scipy()
training_loss = m.training_loss_closure(data)

print(m.inducing_variable.Z)  # inducing inputs before optimization
opt.minimize(training_loss, m.trainable_variables)
print(m.inducing_variable.Z)  # inducing inputs after optimization
print_summary(m)

# %%
plot_posterior_predictions(m, X, Y)

The example given in the question isn't copy-and-pasteable, but it seems like you simply exchange opt = gpflow.optimizers.Scipy() with opt = tf.optimizers.Adam(). The minimize() method of gpflow's Scipy optimizer runs one call of scipy.optimize.minimize, which by default runs to convergence (you can also specify a maximum number of iterations by passing, e.g., options=dict(maxiter=100) to the minimize() call).
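For example, to cap the Scipy run at 100 iterations (a minimal sketch that reuses the training_loss closure and model m from the question; the maxiter value is arbitrary):

opt = gpflow.optimizers.Scipy()
opt.minimize(training_loss, m.trainable_variables, options=dict(maxiter=100))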

In contrast, the minimize() method of TensorFlow optimizers runs only a single optimization step. To run more steps, say num_iter = 100, you need to write a loop manually:

num_iter = 100
for _ in range(num_iter):
    opt.minimize(training_loss, m.trainable_variables)

For this to actually run fast, you also need to wrap the optimization step in tf.function :

@tf.function
def optimization_step():
    # training_loss is the closure created above with m.training_loss_closure(data)
    opt.minimize(training_loss, m.trainable_variables)

for _ in range(num_iter):
    optimization_step()

This runs exactly num_iter steps; in TensorFlow you have to handle convergence checks yourself, and your model may or may not have converged after that many steps.
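If you do want a crude stopping rule, one option is to check the training loss periodically and stop once it has stopped changing much. The sketch below assumes the training_loss closure and optimization_step defined above; the check interval and tolerance are arbitrary choices, not GPflow defaults:

prev_loss = training_loss().numpy()
for step in range(1, 10001):
    optimization_step()
    if step % 100 == 0:  # check every 100 steps (arbitrary)
        loss = training_loss().numpy()
        if abs(prev_loss - loss) < 1e-3:  # tolerance is arbitrary
            break
        prev_loss = loss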

So in your usage you ran only one step; this did change the parameters, but presumably by too little to notice the difference. (You could see a larger effect from a single step by making the learning rate much higher, though that would not be a good idea for actually optimizing the model over many steps.)
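Purely as an illustration (the learning rate here is deliberately extreme, and running this will of course change the model's parameters), you can measure how far a single Adam step moves the inducing inputs:

z_before = m.inducing_variable.Z.numpy().copy()
tf.optimizers.Adam(learning_rate=1.0).minimize(training_loss, m.trainable_variables)
print(np.abs(m.inducing_variable.Z.numpy() - z_before).max())  # largest change in Z after one step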

Usage of the Adam optimizer with GPflow models is demonstrated in the notebook on stochastic variational inference, though it also works for non-stochastic optimization.
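A minimal sketch of that minibatch pattern, adapted to the model above (the minibatch size, learning rate, and number of steps are arbitrary; data, m, N, and the imports are as in the question):

minibatch_size = 20
train_dataset = tf.data.Dataset.from_tensor_slices(data).repeat().shuffle(N)
train_iter = iter(train_dataset.batch(minibatch_size))
minibatch_loss = m.training_loss_closure(train_iter)  # each call draws the next minibatch

svi_opt = tf.optimizers.Adam(learning_rate=0.01)

@tf.function
def svi_step():
    svi_opt.minimize(minibatch_loss, m.trainable_variables)

for _ in range(1000):
    svi_step()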

Note that, in any case, all parameters such as inducing point locations are set trainable by default, so your call to set_trainable(..., True) doesn't affect what's going on here.
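If you actually wanted to freeze the inducing inputs, the call would be set_trainable(..., False); a minimal sketch using the utilities already imported above:

set_trainable(m.inducing_variable, False)  # removes Z from m.trainable_variables
print_summary(m)  # the 'trainable' column now shows False for the inducing inputs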
