
What does the MNIST TensorFlow tutorial mean by the matmul flipping trick?

The tutorial MNIST for ML Beginners, in the section Implementing the Regression, shows how to implement the regression in a single line, followed by an explanation that mentions a trick (emphasis mine):

y = tf.nn.softmax(tf.matmul(x, W) + b)

First, we multiply x by W with the expression tf.matmul(x, W) . This is flipped from when we multiplied them in our equation, where we had Wx, as a small trick to deal with x being a 2D tensor with multiple inputs.

What is the trick here, and why are we using it?

Well, there's no trick here. That line simply refers to the multiplication order of the earlier equation:

# For a single example, the equation puts W before x:
y = Wx + b
# For a batch of examples, swap the multiplication order instead of adding another transpose op:
y = xW + b
# hence
y = tf.matmul(x, W)
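
To make the shape bookkeeping concrete, here is a minimal runnable sketch (assuming TensorFlow 2.x and the tutorial's MNIST shapes of 784 inputs and 10 classes; the batch size of 32 is arbitrary):

import tensorflow as tf

batch_size, num_inputs, num_classes = 32, 784, 10

x = tf.random.normal([batch_size, num_inputs])        # a batch of flattened images
W = tf.Variable(tf.zeros([num_inputs, num_classes]))  # weights, shape (784, 10)
b = tf.Variable(tf.zeros([num_classes]))              # bias, shape (10,)

# (32, 784) @ (784, 10) -> (32, 10); b broadcasts across the batch dimension
y = tf.nn.softmax(tf.matmul(x, W) + b)
print(y.shape)                                        # (32, 10)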

OK, I think the main point is that if you train in batches (i.e. train on several instances of the training set at once), TensorFlow always assumes that the zeroth dimension of x indicates the number of instances in the batch.

Suppose you want to map a training instance of dimension M to a target instance of dimension N. You would typically do this by multiplying x (a column vector) by an NxM matrix W (and, optionally, adding a bias b of dimension N, also a column vector), i.e.

y = W*x + b, where y is also a column vector.
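
As a small NumPy illustration of this column-vector convention (the dimensions M=4, N=2 are made up):

import numpy as np

M, N = 4, 2                  # input dimension M, output dimension N
x = np.ones((M, 1))          # one training instance as a column vector
W = np.ones((N, M))          # weight matrix, shape (N, M)
b = np.ones((N, 1))          # bias as a column vector

y = W @ x + b                # shape (N, 1): also a column vector
print(y.shape)               # (2, 1)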

This is perfectly alright from the perspective of linear algebra. But now comes the point of training in batches, i.e. training with several instances at once. To understand this, it might help not to view x (and y) as vectors of dimension M (and N), but as matrices with dimensions Mx1 (and Nx1 for y). Since TensorFlow assumes that the different training instances constituting a batch are aligned along the zeroth dimension, we get into trouble here: the zeroth dimension is occupied by the different elements of one single instance. The trick is then to transpose the above equation (remember that transposing a product also reverses the order of the two factors):

y^T = x^T * W^T + b^T

This is pretty much what is described briefly in the tutorial. Note that y^T is now a matrix of dimension 1xN (practically a row vector), while x^T is a matrix of dimension 1xM (also a row vector). W^T is a matrix of dimension MxN. In the tutorial, they did not write x^T or y^T, but simply defined the placeholders according to this transposed equation. The only point that might not be obvious is why b is not defined the "transposed way": in the tutorial b is simply a 1-D vector of dimension N, and the + operator broadcasts it to the shape needed to make the dimensions work out, so no explicit transpose is required.
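
A quick way to convince yourself of that broadcasting behaviour (NumPy and TensorFlow share the same broadcasting rules; the dimensions here are made up):

import numpy as np

M, N = 4, 2                  # made-up input/output dimensions
xT = np.ones((1, M))         # x^T: a single instance as a row vector
WT = np.ones((M, N))         # W^T, shape (M, N)
b = np.arange(N)             # 1-D bias, shape (N,) -- no transpose defined

yT = xT @ WT + b             # b is broadcast to (1, N) and added row-wise
print(yT.shape)              # (1, 2)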

The rest is now pretty easy: if you have batches larger than one instance, you just "stack" several of the x (1xM) matrices, say into a matrix of dimension (AxM) (where A is the batch size). b is then automatically broadcast across this number of instances (that is, to a matrix of dimension (AxN)). If you then use

y^T = x^T * W^T + b^T,

you get an (AxN) matrix with the target for each element of the batch.
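
Extending the sketch above to a batch (made-up batch size A=5), the very same line of code works unchanged:

import numpy as np

A, M, N = 5, 4, 2            # batch size A, input dim M, output dim N
xT = np.ones((A, M))         # A row vectors x^T stacked into an (A, M) matrix
WT = np.ones((M, N))         # W^T, shape (M, N)
b = np.arange(N)             # still 1-D; broadcast to (A, N)

yT = xT @ WT + b             # one row of targets per batch element
print(yT.shape)              # (5, 2)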
