
Tensorflow gradient with respect to matrix

Just for context, I'm trying to implement a gradient descent algorithm with Tensorflow.

I have a matrix X

[ x1 x2 x3 x4 ]
[ x5 x6 x7 x8 ]

which I multiply by some feature vector Y to get Z

      [ y1 ]
Z = X [ y2 ]  = [ z1 ]
      [ y3 ]    [ z2 ]
      [ y4 ]

I then put Z through a softmax function, and take the log. I'll refer to the output matrix as W.

All this is implemented as follows (a little bit of boilerplate added so it's runnable):

import tensorflow as tf

sess = tf.Session()
num_features = 4
num_actions = 2

# X: the (2, 4) policy matrix; the feature vector Y is fed in through the placeholder
policy_matrix = tf.get_variable("params", (num_actions, num_features))
state_ph = tf.placeholder("float", (num_features, 1))
action_linear = tf.matmul(policy_matrix, state_ph)   # Z = X Y, shape (2, 1)
action_probs = tf.nn.softmax(action_linear, axis=0)
action_problogs = tf.log(action_probs)               # W, shape (2, 1)
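
For reference, a minimal sketch of how this graph might be evaluated (assuming TensorFlow 1.x, the variable's default initializer, and an arbitrary feature vector):

import numpy as np

sess.run(tf.global_variables_initializer())
y = np.random.rand(num_features, 1).astype("float32")      # some feature vector Y
print(sess.run(action_problogs, feed_dict={state_ph: y}))   # W, shape (2, 1)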

W (corresponding to action_problogs) looks like

[ w1 ]
[ w2 ]

I'd like to find the gradient of w1 with respect to the matrix X - that is, I'd like to calculate

          [ d/dx1 w1 ]
d/dX w1 =      .
               .
          [ d/dx8 w1 ]

(preferably still looking like a matrix so I can add it to X, but I'm really not concerned about that)

I was hoping that tf.gradients would do the trick. I calculated the "gradient" like so:

problog_gradient = tf.gradients(action_problogs, policy_matrix)

However, when I inspect problog_gradient, here's what I get:

[<tf.Tensor 'foo_4/gradients/foo_4/MatMul_grad/MatMul:0' shape=(2, 4) dtype=float32>]

Note that this has exactly the same shape as X, but that it really shouldn't. I was hoping to get a list of two gradients, each with respect to 8 elements. I suspect that I'm instead getting two gradients, but each with respect to four elements.

I'm very new to tensorflow, so I'd appreciate an explanation of what's going on and how I might achieve the behavior I desire.

The gradient expects a scalar function, so by default, it sums up the entries. That is the default behavior simply because all of the gradient descent algorithms need that type of functionality, and stochastic gradient descent (or variations thereof) are the preferred methods inside Tensorflow. You won't find any of the more advanced algorithms (like BFGS or something) because they simply haven't been implemented yet (and they would require a true Jacobian, which also hasn't been implemented). For what it's worth, here is a functioning Jacobian implementation that I wrote:

import tensorflow as tf

def map(f, x, dtype=None, parallel_iterations=10):
    '''
    Apply f to each of the elements in x using the specified number of parallel iterations.

    Important points:
    1. By "elements in x", we mean that we will be applying f to x[0],...x[tf.shape(x)[0]-1].
    2. The output size of f(x[i]) can be arbitrary. However, if the dtype of that output
       is different than the dtype of x, then you need to specify that as an additional argument.
    '''
    if dtype is None:
        dtype = x.dtype

    n = tf.shape(x)[0]
    loop_vars = [
        tf.constant(0, n.dtype),
        tf.TensorArray(dtype, size=n),
    ]
    _, fx = tf.while_loop(
        lambda j, _: j < n,
        lambda j, result: (j + 1, result.write(j, f(x[j]))),
        loop_vars,
        parallel_iterations=parallel_iterations
    )
    return fx.stack()

def jacobian(fx, x, parallel_iterations=10):
    '''
    Given a tensor fx, which is a function of x, vectorize fx (via tf.reshape(fx, [-1])),
    and then compute the jacobian of each entry of fx with respect to x.
    Specifically, if x has shape (m,n,...,p), and fx has L entries (tf.size(fx)=L), then
    the output will be (L,m,n,...,p), where output[i] will be (m,n,...,p), with each entry denoting the
    gradient of output[i] wrt the corresponding element of x.
    '''
    return map(lambda fxi: tf.gradients(fxi, x)[0],
               tf.reshape(fx, [-1]),
               dtype=x.dtype,
               parallel_iterations=parallel_iterations)

While this implementation works, it does not work when you try to nest it. For instance, if you try to compute the Hessian by using jacobian( jacobian( ... )), then you get some strange errors. This is being tracked as Issue 675. I am still awaiting a response on why this throws an error. I believe that there is a deep-seated bug in either the while loop implementation or the gradient implementation, but I really have no idea.

Anyway, if you just need a jacobian, try the code above.
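
For example, a minimal sketch of how it might be applied to the graph from the question (reusing the names defined there, with numpy imported as np):

# One (2, 4) gradient per entry of action_problogs, stacked: shape (2, 2, 4)
problog_jacobian = jacobian(action_problogs, policy_matrix)

sess.run(tf.global_variables_initializer())
y = np.random.rand(num_features, 1).astype("float32")
print(sess.run(problog_jacobian, feed_dict={state_ph: y}).shape)   # (2, 2, 4)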

tf.gradients actually sums over the ys and computes the gradient of that sum, which is why this problem appears.
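
To make that concrete, here is a small sketch (again reusing the names from the question's snippet): the single (2, 4) tensor returned earlier is the gradient of w1 + w2, and the gradient of w1 alone can be recovered by slicing out that entry before differentiating:

# Gradient of the sum w1 + w2 -- numerically identical to
# tf.gradients(action_problogs, policy_matrix)[0]
sum_grad = tf.gradients(tf.reduce_sum(action_problogs), policy_matrix)[0]   # shape (2, 4)

# Gradient of w1 alone: slice out a scalar first, then differentiate
w1 = action_problogs[0, 0]
w1_grad = tf.gradients(w1, policy_matrix)[0]                                # shape (2, 4)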
