Use Scipy Optimizer with Tensorflow 2.0 for Neural Network training

After the introduction of Tensorflow 2.0, the scipy interface (tf.contrib.opt.ScipyOptimizerInterface) has been removed. However, I would still like to use the scipy optimizer scipy.optimize.minimize(method='L-BFGS-B') to train a neural network (keras Sequential model). In order for the optimizer to work, it requires as input a function fun(x0), with x0 being an array of shape (n,). Therefore, the first step is to "flatten" the weight matrices to obtain a vector with the required shape. To this end, I modified the code provided by https://pychao.com/2019/11/02/optimize-tensorflow-keras-models-with-l-bfgs-from-tensorflow-probability/ . It provides a function factory meant to create such a function fun(x0). However, the code does not seem to work and the loss function does not decrease. I would be really grateful if someone could help me work this out.

Here is the piece of code I am using:

import numpy as np
import scipy.optimize
import tensorflow as tf

func = function_factory(model, loss_function, x_u_train, u_train)

# convert initial model parameters to a 1D tf.Tensor
init_params = tf.dynamic_stitch(func.idx, model.trainable_variables)
init_params = tf.cast(init_params, dtype=tf.float32)

# train the model with L-BFGS solver
results = scipy.optimize.minimize(fun=func, x0=init_params, method='L-BFGS-B')


def loss_function(x_u_train, u_train, network):
    u_pred = tf.cast(network(x_u_train), dtype=tf.float32)
    loss_value = tf.reduce_mean(tf.square(u_train - u_pred))
    return tf.cast(loss_value, dtype=tf.float32)


def function_factory(model, loss_f, x_u_train, u_train):
    """A factory to create a function required by tfp.optimizer.lbfgs_minimize.

    Args:
        model [in]: an instance of `tf.keras.Model` or its subclasses.
        loss [in]: a function with signature loss_value = loss(pred_y, true_y).
        train_x [in]: the input part of training data.
        train_y [in]: the output part of training data.

    Returns:
        A function that has a signature of:
            loss_value, gradients = f(model_parameters).
    """

    # obtain the shapes of all trainable parameters in the model
    shapes = tf.shape_n(model.trainable_variables)
    n_tensors = len(shapes)

    # we'll use tf.dynamic_stitch and tf.dynamic_partition later, so we need to
    # prepare required information first
    count = 0
    idx = [] # stitch indices
    part = [] # partition indices

    for i, shape in enumerate(shapes):
        n = np.product(shape)
        idx.append(tf.reshape(tf.range(count, count+n, dtype=tf.int32), shape))
        part.extend([i]*n)
        count += n

    part = tf.constant(part)


    def assign_new_model_parameters(params_1d):
        """A function updating the model's parameters with a 1D tf.Tensor.

        Args:
            params_1d [in]: a 1D tf.Tensor representing the model's trainable parameters.
        """

        params = tf.dynamic_partition(params_1d, part, n_tensors)
        for i, (shape, param) in enumerate(zip(shapes, params)):

            model.trainable_variables[i].assign(tf.cast(tf.reshape(param, shape), dtype=tf.float32))

    # now create a function that will be returned by this factory

    def f(params_1d):
        """
        This function is created by function_factory.
        Args:
            params_1d [in]: a 1D tf.Tensor.

        Returns:
            A scalar loss.
        """

        # update the parameters in the model
        assign_new_model_parameters(params_1d)
        # calculate the loss
        loss_value = loss_f(x_u_train, u_train, model)

        # print out iteration & loss
        f.iter.assign_add(1)
        tf.print("Iter:", f.iter, "loss:", loss_value)

        return loss_value

    # store this information as members so we can use it outside the scope
    f.iter = tf.Variable(0)
    f.idx = idx
    f.part = part
    f.shapes = shapes
    f.assign_new_model_parameters = assign_new_model_parameters

    return f

Here model is a tf.keras.Sequential object.
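For reference, a minimal model of this kind can be built as follows (illustrative only; the exact architecture is not important for the problem):

import tensorflow as tf

# hypothetical example: a small fully connected network mapping 2 inputs to 1 output
model = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(20, activation='tanh'),
    tf.keras.layers.Dense(1)
])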

Thank you in advance for any help!

Changing from tf1 to tf2, I was faced with the same question, and after a little bit of experimenting I found the solution below, which shows how to establish the interface between a function decorated with tf.function and a scipy optimizer. The important changes compared to the question are:

  1. As mentioned by Ives, scipy's lbfgs needs to get the function value and gradient, so you need to provide a function that delivers both and then set jac=True.
  2. scipy's lbfgs is a Fortran function that expects the interface to provide np.float64 arrays, while tensorflow tf.function uses tf.float32, so one has to cast input and output.

I provide an example of how this can be done for a toy problem below.

import tensorflow as tf
import numpy as np
import scipy.optimize as sopt

def model(x):
    return tf.reduce_sum(tf.square(x-tf.constant(2, dtype=tf.float32)))

@tf.function
def val_and_grad(x):
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = model(x)
    grad = tape.gradient(loss, x)
    return loss, grad

def func(x):
    return [vv.numpy().astype(np.float64)  for vv in val_and_grad(tf.constant(x, dtype=tf.float32))]

resdd = sopt.minimize(fun=func, x0=np.ones(5),
                      jac=True, method='L-BFGS-B')

print("info:\n",resdd)

displays

info:
       fun: 7.105427357601002e-14
 hess_inv: <5x5 LbfgsInvHessProduct with dtype=float64>
      jac: array([-2.38418579e-07, -2.38418579e-07, -2.38418579e-07, -2.38418579e-07,
       -2.38418579e-07])
  message: b'CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL'
     nfev: 3
      nit: 2
   status: 0
  success: True
        x: array([1.99999988, 1.99999988, 1.99999988, 1.99999988, 1.99999988])

Benchmark

For comparing speed I use the lbfgs optimizer for a style transfer problem (see here for the network). Note that for this problem the network parameters are fixed and the input signal is adapted. As the optimized parameters (the input signal) are 1D, the function factory is not needed.

I compare four implementations:

  1. TF1.12: TF1 with ScipyOptimizerInterface
  2. TF2.0 (E): the approach above without using tf.function decorators
  3. TF2.0 (G): the approach above using tf.function decorators
  4. TF2.0/TFP: using the lbfgs minimizer from tensorflow_probability

For this comparison the optimization is stopped after 300 iterations (generally, for convergence, the problem requires 3000 iterations).

Results

Method       runtime(300it)      final loss         
TF1.12          240s                0.045     (baseline)
TF2.0 (E)       299s                0.045
TF2.0 (G)       233s                0.045
TF2.0/TFP       226s                0.053

The TF2.0 eager mode (TF2.0 (E)) works correctly but is about 20% slower than the TF1.12 baseline version. TF2.0 (G) with tf.function works fine and is marginally faster than TF1.12, which is a good thing to know.

The optimizer from tensorflow_probability (TF2.0/TFP) is slightly faster than TF2.0 (G) using scipy's lbfgs, but does not achieve the same error reduction. In fact, the decrease of the loss over time is not monotonic, which seems like a bad sign. Comparing the two implementations of lbfgs (scipy and tensorflow_probability = TFP), it is clear that the Fortran code in scipy is significantly more complex. So either the simplification of the algorithm in TFP is harmful here, or the fact that TFP performs all calculations in float32 may also be a problem.
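For reference, here is a minimal sketch of how the TFP variant can be called on the same toy objective as above (this is not the benchmark code itself; it assumes tensorflow_probability is installed):

import tensorflow as tf
import tensorflow_probability as tfp

@tf.function
def val_and_grad(x):
    # same toy objective as above: sum((x - 2)^2)
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = tf.reduce_sum(tf.square(x - 2.0))
    grad = tape.gradient(loss, x)
    return loss, grad

results = tfp.optimizer.lbfgs_minimize(
    value_and_gradients_function=val_and_grad,
    initial_position=tf.ones(5),
    max_iterations=100)

print(results.converged.numpy())
print(results.position.numpy())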

Here is a simple solution using a library (autograd_minimize) that I wrote, building on the answer of Roebel:

import numpy as np
import tensorflow as tf
from autograd_minimize import minimize

def rosen_tf(x):
    return tf.reduce_sum(100.0*(x[1:] - x[:-1]**2.0)**2.0 + (1 - x[:-1])**2.0)

res = minimize(rosen_tf, np.array([0.,0.]))
print(res.x)
>>> array([0.99999912, 0.99999824])

It also works with keras models, as shown with this naive example of linear regression:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers
from autograd_minimize.tf_wrapper import tf_function_factory
from autograd_minimize import minimize 
import tensorflow as tf

#### Prepares data
X = np.random.random((200, 2))
y = X[:,:1]*2+X[:,1:]*0.4-1

#### Creates model
model = keras.Sequential([keras.Input(shape=2),
                          layers.Dense(1)])

# Transforms model into a function of its parameter
func, params = tf_function_factory(model, tf.keras.losses.MSE, X, y)

# Minimization
res = minimize(func, params, method='L-BFGS-B')

print(res.x)
>>> [array([[2.0000016 ],
 [0.40000062]]), array([-1.00000164])]

I guess SciPy does not know how to calculate gradients of TensorFlow objects. Try to use the original function factory (i.e., the one that also returns the gradients along with the loss), and set jac=True in scipy.optimize.minimize.

I tested the python code from the original Gist and replaced tfp.optimizer.lbfgs_minimize with the SciPy optimizer. It worked with the BFGS method:

results = scipy.optimize.minimize(fun=func, x0=init_params, jac=True, method='BFGS')

jac=True means SciPy knows that func also returns gradients.

For L-BFGS-B, however, it's tricky. After some effort, I finally made it work. I had to comment out the @tf.function lines and let func return grads.numpy() instead of the raw TF Tensor. I guess that's because the underlying implementation of L-BFGS-B is a Fortran function, so there might be some issue converting data from tf.Tensor -> numpy array -> Fortran array. Forcing the function func to return the ndarray version of the gradients resolves the problem. But then it's not possible to use @tf.function.
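To make this concrete, here is a minimal self-contained sketch (my own rewrite, not the question's factory) of a function that writes the flat parameter vector back into the model and returns both the loss and the flattened gradient as float64 numpy arrays, so that it works with jac=True and method='L-BFGS-B'. The model, data and the make_val_and_grad helper are illustrative only:

import numpy as np
import scipy.optimize
import tensorflow as tf

def make_val_and_grad(model, loss_fn, x, y):
    # bookkeeping for flattening/unflattening the trainable variables
    shapes = [v.shape.as_list() for v in model.trainable_variables]
    sizes = [int(np.prod(s)) for s in shapes]

    def assign(params_1d):
        # split the flat vector and write it back into the model's variables
        params_1d = tf.cast(params_1d, tf.float32)
        splits = tf.split(params_1d, sizes)
        for var, flat, shape in zip(model.trainable_variables, splits, shapes):
            var.assign(tf.reshape(flat, shape))

    def val_and_grad(params_1d):
        assign(params_1d)
        with tf.GradientTape() as tape:
            loss = loss_fn(y, model(x))
        grads = tape.gradient(loss, model.trainable_variables)
        grad_1d = tf.concat([tf.reshape(g, [-1]) for g in grads], axis=0)
        # L-BFGS-B (Fortran) expects float64 numpy arrays, so convert here
        return loss.numpy().astype(np.float64), grad_1d.numpy().astype(np.float64)

    init = tf.concat([tf.reshape(v, [-1]) for v in model.trainable_variables], axis=0)
    return val_and_grad, init.numpy().astype(np.float64)

# toy usage on random regression data
x = np.random.rand(100, 3).astype(np.float32)
y = x.sum(axis=1, keepdims=True)
model = tf.keras.Sequential([tf.keras.Input(shape=(3,)), tf.keras.layers.Dense(1)])
loss_fn = lambda y_true, y_pred: tf.reduce_mean(tf.square(y_true - y_pred))

val_and_grad, x0 = make_val_and_grad(model, loss_fn, x, y)
res = scipy.optimize.minimize(fun=val_and_grad, x0=x0, jac=True, method='L-BFGS-B')
print(res.fun)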

(Similar question: Is there a tf.keras.optimizers implementation for L-BFGS?)

While this is not from anywhere as legit as tf.contrib, it's an implementation of L-BFGS (and any other scipy.optimize.minimize solver) for your consideration, in case it fits your use case:

The package has models that extend keras.Model and keras.Sequential, and can be compiled with .compile(..., optimizer="L-BFGS") to use L-BFGS in TF2, or compiled with any of the other standard optimizers (because flipping between stochastic & deterministic should be easy):
