
Use Scipy Optimizer with Tensorflow 2.0 for Neural Network training

After the introduction of TensorFlow 2.0, the SciPy interface (tf.contrib.opt.ScipyOptimizerInterface) has been removed. However, I would still like to use the SciPy optimizer scipy.optimize.minimize(method='L-BFGS-B') to train a neural network (a Keras Sequential model). For the optimizer to work, it requires as input a function fun(x0), with x0 being an array of shape (n,). Therefore, the first step is to "flatten" the weight matrices to obtain a vector of the required shape. To this end, I modified the code provided at https://pychao.com/2019/11/02/optimize-tensorflow-keras-models-with-l-bfgs-from-tensorflow-probability/, which provides a function factory meant to create such a function fun(x0). However, the code does not seem to work and the loss does not decrease. I would be really grateful if someone could help me work this out.

Here is the piece of code I am using:

import numpy as np
import scipy.optimize
import tensorflow as tf

# model, loss_function, x_u_train and u_train are defined below / elsewhere
func = function_factory(model, loss_function, x_u_train, u_train)

# convert initial model parameters to a 1D tf.Tensor
init_params = tf.dynamic_stitch(func.idx, model.trainable_variables)
init_params = tf.cast(init_params, dtype=tf.float32)

# train the model with L-BFGS solver
results = scipy.optimize.minimize(fun=func, x0=init_params, method='L-BFGS-B')


def loss_function(x_u_train, u_train, network):
    u_pred = tf.cast(network(x_u_train), dtype=tf.float32)
    loss_value = tf.reduce_mean(tf.square(u_train - u_pred))
    return tf.cast(loss_value, dtype=tf.float32)


def function_factory(model, loss_f, x_u_train, u_train):
    """A factory to create a function required by tfp.optimizer.lbfgs_minimize.

    Args:
        model [in]: an instance of `tf.keras.Model` or its subclasses.
        loss_f [in]: a function with signature loss_value = loss_f(x_u_train, u_train, model).
        x_u_train [in]: the input part of the training data.
        u_train [in]: the output part of the training data.

    Returns:
        A function that has a signature of:
            loss_value = f(model_parameters).
    """

    # obtain the shapes of all trainable parameters in the model
    shapes = tf.shape_n(model.trainable_variables)
    n_tensors = len(shapes)

    # we'll use tf.dynamic_stitch and tf.dynamic_partition later, so we need to
    # prepare required information first
    count = 0
    idx = [] # stitch indices
    part = [] # partition indices

    for i, shape in enumerate(shapes):
        n = int(np.prod(shape))
        idx.append(tf.reshape(tf.range(count, count+n, dtype=tf.int32), shape))
        part.extend([i]*n)
        count += n

    part = tf.constant(part)


    def assign_new_model_parameters(params_1d):
        """A function updating the model's parameters with a 1D tf.Tensor.

        Args:
            params_1d [in]: a 1D tf.Tensor representing the model's trainable parameters.
        """

        params = tf.dynamic_partition(params_1d, part, n_tensors)
        for i, (shape, param) in enumerate(zip(shapes, params)):

            model.trainable_variables[i].assign(tf.cast(tf.reshape(param, shape), dtype=tf.float32))

    # now create a function that will be returned by this factory

    def f(params_1d):
        """
        This function is created by function_factory.
        Args:
            params_1d [in]: a 1D tf.Tensor.

        Returns:
            A scalar loss.
        """

        # update the parameters in the model
        assign_new_model_parameters(params_1d)
        # calculate the loss
        loss_value = loss_f(x_u_train, u_train, model)

        # print out iteration & loss
        f.iter.assign_add(1)
        tf.print("Iter:", f.iter, "loss:", loss_value)

        return loss_value

    # store this information as members so we can use it outside the scope
    f.iter = tf.Variable(0)
    f.idx = idx
    f.part = part
    f.shapes = shapes
    f.assign_new_model_parameters = assign_new_model_parameters

    return f

Here, model is a tf.keras.Sequential object.

Thank you in advance for any help!

When changing from TF1 to TF2 I was faced with the same question, and after a little bit of experimenting I found the solution below, which shows how to establish the interface between a function decorated with tf.function and a SciPy optimizer. The important changes compared to the question are:

  1. As mentioned by Ives, SciPy's L-BFGS needs to get the function value and the gradient, so you need to provide a function that delivers both and then set jac=True
  2. SciPy's L-BFGS is a Fortran routine that expects the interface to provide np.float64 arrays, while the TensorFlow tf.function here uses tf.float32, so one has to cast the input and the output.

I provide an example of how this can be done for a toy problem below.

import tensorflow as tf
import numpy as np
import scipy.optimize as sopt

def model(x):
    return tf.reduce_sum(tf.square(x-tf.constant(2, dtype=tf.float32)))

@tf.function
def val_and_grad(x):
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = model(x)
    grad = tape.gradient(loss, x)
    return loss, grad

def func(x):
    return [vv.numpy().astype(np.float64)  for vv in val_and_grad(tf.constant(x, dtype=tf.float32))]

resdd = sopt.minimize(fun=func, x0=np.ones(5),
                      jac=True, method='L-BFGS-B')

print("info:\n",resdd)

displays

info:
       fun: 7.105427357601002e-14
 hess_inv: <5x5 LbfgsInvHessProduct with dtype=float64>
      jac: array([-2.38418579e-07, -2.38418579e-07, -2.38418579e-07, -2.38418579e-07,
       -2.38418579e-07])
  message: b'CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL'
     nfev: 3
      nit: 2
   status: 0
  success: True
        x: array([1.99999988, 1.99999988, 1.99999988, 1.99999988, 1.99999988])

Benchmark

To compare speed, I use the L-BFGS optimizer for a style-transfer problem (see here for the network). Note that for this problem the network parameters are fixed and the input signal is adapted. As the optimized parameters (the input signal) are 1D, the function factory is not needed.

I compare four implementations:

  1. TF1.12: TF1 with ScipyOptimizerInterface
  2. TF2.0 (E): the approach above without using tf.function decorators
  3. TF2.0 (G): the approach above using tf.function decorators
  4. TF2.0/TFP: using the lbfgs minimizer from tensorflow_probability

For this comparison the optimization is stopped after 300 iterations (generally, convergence on this problem requires 3000 iterations).

Results

Method       runtime(300it)      final loss         
TF1.12          240s                0.045     (baseline)
TF2.0 (E)       299s                0.045
TF2.0 (G)       233s                0.045
TF2.0/TFP       226s                0.053

The TF2.0 eager mode (TF2.0 (E)) works correctly but is about 20% slower than the TF1.12 baseline version. TF2.0 (G) with tf.function works fine and is marginally faster than TF1.12, which is good to know.

The optimizer from tensorflow_probability (TF2.0/TFP) is slightly faster than TF2.0 (G) using SciPy's L-BFGS, but does not achieve the same error reduction. In fact, the decrease of the loss over time is not monotonic, which seems a bad sign. Comparing the two implementations of L-BFGS (SciPy and TFP), it is clear that the Fortran code in SciPy is significantly more complex. So either the simplification of the algorithm in TFP is hurting here, or the fact that TFP performs all calculations in float32 may be a problem.
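For reference, the TF2.0/TFP variant can be set up on the same toy problem as above roughly as follows. This is only a minimal sketch assuming tfp.optimizer.lbfgs_minimize from tensorflow_probability; it is not the style-transfer code used for the benchmark.

import tensorflow as tf
import tensorflow_probability as tfp

def model(x):
    return tf.reduce_sum(tf.square(x - tf.constant(2, dtype=tf.float32)))

@tf.function
def val_and_grad(x):
    # TFP expects a single function returning (loss, gradient) as tensors
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = model(x)
    return loss, tape.gradient(loss, x)

results = tfp.optimizer.lbfgs_minimize(
    value_and_gradients_function=val_and_grad,
    initial_position=tf.ones(5),
    max_iterations=300)

print("converged:", results.converged.numpy())
print("position:", results.position.numpy())  # should approach [2. 2. 2. 2. 2.]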

Here is a simple solution using a library (autograd_minimize) that I wrote, building on Roebel's answer:

import numpy as np
import tensorflow as tf
from autograd_minimize import minimize

def rosen_tf(x):
    return tf.reduce_sum(100.0*(x[1:] - x[:-1]**2.0)**2.0 + (1 - x[:-1])**2.0)

res = minimize(rosen_tf, np.array([0.,0.]))
print(res.x)
>>> array([0.99999912, 0.99999824])

It also works with Keras models, as shown in this naive example of linear regression:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers
from autograd_minimize.tf_wrapper import tf_function_factory
from autograd_minimize import minimize 
import tensorflow as tf

#### Prepares data
X = np.random.random((200, 2))
y = X[:,:1]*2+X[:,1:]*0.4-1

#### Creates model
model = keras.Sequential([keras.Input(shape=(2,)),
                          layers.Dense(1)])

# Transforms model into a function of its parameter
func, params = tf_function_factory(model, tf.keras.losses.MSE, X, y)

# Minimization
res = minimize(func, params, method='L-BFGS-B')

print(res.x)
>>> [array([[2.0000016 ],
 [0.40000062]]), array([-1.00000164])]

I guess SciPy does not know how to calculate gradients of TensorFlow objects. Try to use the original function factory (i.e., the one that also returns the gradients together with the loss), and set jac=True in scipy.optimize.minimize.

I tested the Python code from the original Gist and replaced tfp.optimizer.lbfgs_minimize with the SciPy optimizer. It worked with the BFGS method:

results = scipy.optimize.minimize(fun=func, x0=init_params, jac=True, method='BFGS')

jac=True means SciPy knows that func also returns gradients.

For L-BFGS-B, however, it's tricky. After some effort, I finally made it work. I had to comment out the @tf.function lines and let func return grads.numpy() instead of the raw TF tensor. I guess that's because the underlying implementation of L-BFGS-B is a Fortran function, so there might be some issue converting data from tf.Tensor -> numpy array -> Fortran array. Forcing the function func to return the ndarray version of the gradients resolves the problem. But then it's not possible to use @tf.function.
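To make this concrete, here is a minimal sketch of how the question's function factory could be changed so that f returns both the loss and the flattened gradients as float64 NumPy arrays, which is what scipy.optimize.minimize with jac=True and method='L-BFGS-B' expects. The helper name function_factory_with_grads is made up for illustration and this is not the code from the original Gist; model, loss_function, x_u_train and u_train are assumed to be defined as in the question.

import numpy as np
import scipy.optimize
import tensorflow as tf

def function_factory_with_grads(model, loss_f, x_u_train, u_train):
    """Builds f(params_1d) -> (loss, grad) as float64 NumPy arrays for SciPy."""
    shapes = tf.shape_n(model.trainable_variables)
    n_tensors = len(shapes)

    # stitch/partition indices mapping between the flat vector and the variable list
    count, idx, part = 0, [], []
    for i, shape in enumerate(shapes):
        n = int(np.prod(shape))
        idx.append(tf.reshape(tf.range(count, count + n, dtype=tf.int32), shape))
        part.extend([i] * n)
        count += n
    part = tf.constant(part)

    def assign_new_model_parameters(params_1d):
        params = tf.dynamic_partition(params_1d, part, n_tensors)
        for i, (shape, param) in enumerate(zip(shapes, params)):
            model.trainable_variables[i].assign(tf.reshape(param, shape))

    # no @tf.function here: the .numpy() calls below require eager execution
    def f(params_1d):
        # SciPy passes a float64 array; cast it to the model's float32
        params_1d = tf.constant(params_1d, dtype=tf.float32)
        assign_new_model_parameters(params_1d)
        with tf.GradientTape() as tape:
            loss_value = loss_f(x_u_train, u_train, model)
        grads = tape.gradient(loss_value, model.trainable_variables)
        grads_1d = tf.dynamic_stitch(idx, grads)
        # return plain float64 values for the Fortran L-BFGS-B routine
        return (loss_value.numpy().astype(np.float64),
                grads_1d.numpy().astype(np.float64))

    f.idx = idx
    return f

# usage (model, loss_function, x_u_train, u_train as in the question):
# func = function_factory_with_grads(model, loss_function, x_u_train, u_train)
# init_params = tf.dynamic_stitch(func.idx, model.trainable_variables).numpy().astype(np.float64)
# results = scipy.optimize.minimize(fun=func, x0=init_params, jac=True, method='L-BFGS-B')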

(Similar question: Is there a tf.keras.optimizers implementation for L-BFGS?)

While this is not from anywhere near as legit a source as tf.contrib, it's an implementation of L-BFGS (and any other scipy.optimize.minimize solver) for your consideration, in case it fits your use case:

The package has models that extend keras.Model and keras.Sequential, which can be compiled with .compile(..., optimizer="L-BFGS") to use L-BFGS in TF2, or compiled with any of the other standard optimizers (because flipping between stochastic and deterministic optimization should be easy).
