
Is the tf.contrib.layers.fully_connected() behavior change between TensorFlow 1.3 and 1.4 an issue?

I was recently completing a CNN implementation in TensorFlow for an online course, which I would prefer not to name to avoid breaking the platform's rules. I ran into surprising results: my local implementation diverged significantly from the one on the platform's server. After further investigation, I narrowed the problem down to a change in tf.contrib.layers.fully_connected() behaviour between versions 1.3 and 1.4 of TensorFlow.

I prepared a small subset of the source code to reproduce the issue:

import numpy as np
import tensorflow as tf

np.random.seed(1)

def create_placeholders(n_H0, n_W0, n_C0, n_y):
    X = tf.placeholder(tf.float32, [None, n_H0, n_W0, n_C0])
    Y = tf.placeholder(tf.float32, [None, n_y])
    return X, Y

def initialize_parameters():
    tf.set_random_seed(1)
    W1 = tf.get_variable("W1", [4, 4, 3, 8], initializer=tf.contrib.layers.xavier_initializer(seed=0))
    W2 = tf.get_variable("W2", [2, 2, 8, 16], initializer=tf.contrib.layers.xavier_initializer(seed=0))
    parameters = {"W1": W1, "W2": W2}
    return parameters

def forward_propagation(X, parameters):
    W1 = parameters['W1']
    W2 = parameters['W2']
    Z1 = tf.nn.conv2d(X, W1, strides=[1, 1, 1, 1], padding='SAME')
    A1 = tf.nn.relu(Z1)
    P1 = tf.nn.max_pool(A1, ksize=[1, 8, 8, 1], strides=[1, 8, 8, 1], padding='SAME')
    Z2 = tf.nn.conv2d(P1, W2, strides=[1, 1, 1, 1], padding='SAME')
    A2 = tf.nn.relu(Z2)
    P2 = tf.nn.max_pool(A2, ksize=[1, 4, 4, 1], strides=[1, 4, 4, 1], padding='SAME')
    F2 = tf.contrib.layers.flatten(P2)
    Z3 = tf.contrib.layers.fully_connected(F2, 6, activation_fn=None)
    return Z3

tf.reset_default_graph()
with tf.Session() as sess:
    np.random.seed(1)
    X, Y = create_placeholders(64, 64, 3, 6)
    parameters = initialize_parameters()
    Z3 = forward_propagation(X, parameters)
    init = tf.global_variables_initializer()
    sess.run(init)
    a = sess.run(Z3, {X: np.random.randn(2,64,64,3), Y: np.random.randn(2,6)})
    print("Z3 = " + str(a))

When running TensorFlow 1.3 and earlier (tested 1.2.1 as well), the output for Z3 is:

Z3 = [[-0.44670227 -1.57208765 -1.53049231 -2.31013036 -1.29104376  0.46852064]
 [-0.17601591 -1.57972014 -1.4737016  -2.61672091 -1.00810647  0.5747785 ]]

When running TensorFlow 1.4 and later (tested up to 1.7), the output for Z3 is:

Z3 = [[ 1.44169843 -0.24909666  5.45049906 -0.26189619 -0.20669907  1.36546707]
 [ 1.40708458 -0.02573211  5.08928013 -0.48669922 -0.40940708  1.26248586]]

A detailed review of all the tensors in forward_propagation() (i.e. W1, A1, P1, etc.) points to tf.contrib.layers.fully_connected(), since Z3 is the only diverging tensor.

The function signature did not change, so I have no idea what happened under the hood.

I get warnings with 1.3 and earlier which disappear with 1.4 and later:

2018-04-09 23:13:39.954455: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2018-04-09 23:13:39.954495: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2018-04-09 23:13:39.954508: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2018-04-09 23:13:39.954521: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.

I was wondering whether something changed in the default initialization of the parameters. Anyway, this is where I am right now: I can go ahead with the course, but I am a bit frustrated that I can't get a final call on this issue. Is this known behaviour, or was a bug introduced somewhere?

Besides, when completing the assignment, the final model is expected to deliver a test accuracy of 0.78 on an image recognition task after 100 epochs. This is precisely what happens with 1.3 and earlier, but the accuracy drops to 0.58 with 1.4 and later, everything else being equal. This is a huge difference. Longer training might erase it, but it is not a slight difference, so it seems worth mentioning.

Any comment / suggestion welcome.

Thanks,

Laurent

So here's the breakdown. The problem, somewhat surprisingly, is caused by tf.contrib.layers.flatten(), because it affects the random seed differently in the two versions. There are two ways to seed the random number generator in TensorFlow: either you seed it for the whole graph with tf.set_random_seed(), or you specify a seed argument where it makes sense. Per the docs on tf.set_random_seed(), note point 2:

Operations that rely on a random seed actually derive it from two seeds: the graph-level and operation-level seeds. This sets the graph-level seed.

Its interactions with operation-level seeds is as follows:

  1. If neither the graph-level nor the operation seed is set: A random seed is used for this op.
  2. If the graph-level seed is set, but the operation seed is not: The system deterministically picks an operation seed in conjunction with the graph-level seed so that it gets a unique random sequence.
  3. If the graph-level seed is not set, but the operation seed is set: A default graph-level seed and the specified operation seed are used to determine the random sequence.
  4. If both the graph-level and the operation seed are set: Both seeds are used in conjunction to determine the random sequence.

In our case the seed is set at the graph level, and TensorFlow performs a deterministic calculation to derive the actual seed used in each operation. This calculation apparently depends on the number of operations in the graph as well.

In addition, the implementation of tf.contrib.layers.flatten() changed exactly between versions 1.3 and 1.4. You can look it up in the repository, but basically the code was simplified and moved from tensorflow/contrib/layers/python/layers/layers.py into tensorflow/tensorflow/python/layers/core.py. The important part for us is that it changed the number of operations performed, thereby changing the operation-level seed applied by the Xavier initializer on your fully connected layer.

A possible workaround would be to specify the seed for each weight tensor separately, but that would require either manually generating the fully connected layer or touching the TensorFlow code. If you only wanted to know this to be sure there's no issue with your code, then rest assured.
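
For illustration, here is a minimal sketch of what "manually generating the fully connected layer" could look like; the helper name dense_with_seed and the seed value are my own choices, not part of the original code:

import tensorflow as tf

def dense_with_seed(F, n_out, seed):
    """Plain dense layer whose weights get an explicit operation-level seed."""
    n_in = int(F.get_shape()[-1])
    W = tf.get_variable("W_fc", [n_in, n_out],
                        initializer=tf.contrib.layers.xavier_initializer(seed=seed))
    b = tf.get_variable("b_fc", [n_out], initializer=tf.zeros_initializer())
    return tf.matmul(F, W) + b

# Possible drop-in replacement for the last line of forward_propagation():
# Z3 = dense_with_seed(F2, 6, seed=2)

Because the weight tensor now carries its own operation-level seed, its initial values no longer depend on how many ops were created before it in the graph.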

Minimal example to reproduce the behavior; note the commented-out line defining Xf:

import numpy as np
import tensorflow as tf

tf.reset_default_graph()
tf.set_random_seed(1)  # graph-level seed only; no operation-level seeds
with tf.Session() as sess:
    X = tf.constant([[1, 2, 3, 4, 5, 6]], tf.float32)
    #Xf = tf.contrib.layers.flatten(X)  # merely creating this op shifts the derived seed
    R = tf.random_uniform(shape=())
    R_V = sess.run(R)
print(R_V)

If you run this code as above, you get a printout of:

0.38538742

for both versions. If you uncomment the Xf line, you get

0.013653636

and

0.6033112

for versions 1.3 and 1.4 respectively. It is interesting to note that Xf is never even executed; simply creating it is enough to cause the issue.
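
Conversely (just as a sketch, with an arbitrary seed value of 7), giving the op its own seed pins the result regardless of whether Xf is created:

import tensorflow as tf

tf.reset_default_graph()
tf.set_random_seed(1)
with tf.Session() as sess:
    X = tf.constant([[1, 2, 3, 4, 5, 6]], tf.float32)
    Xf = tf.contrib.layers.flatten(X)        # extra op no longer matters
    R = tf.random_uniform(shape=(), seed=7)  # explicit operation-level seed
    print(sess.run(R))                       # same value with or without Xf

This is just point 4 of the docs in action: once both the graph-level and the operation-level seeds are fixed, the value no longer depends on how many other ops were built before it.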

Two final notes: the four warnings you get with 1.3 are not related to this; they only indicate that some CPU compilation options that could speed up calculations were not enabled.

The other thing is that this should not affect the training behavior of your code; this issue only changes the random seed. So there must be some other difference causing the slower learning you observe.

I think I'm taking the same CNN course as you. The train and test accuracies in the online Python notebook are 94% and 78% respectively, whereas when I run it locally I get an accuracy around 50%.

As you noticed, the initialization is different in the later TensorFlow version. Peter's answer already describes nicely why that is. But as mentioned in the comments, this should not be the reason for the lower accuracy; it's just a matter of the random seed being used differently in the later TensorFlow version.

I ran the code with a range of different learning rates, and indeed I found another learning rate for which I get 84% and 77% train and test accuracy. So there must have been changes, maybe to AdamOptimizer, that prevent a learning rate tuned for an older TF version from also being optimal in a newer one.
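
A sweep of that kind could look roughly like the following sketch; model(), X_train, etc. are hypothetical stand-ins for the course notebook's training function and data, so the exact signature is illustrative only:

# Hypothetical sweep over learning rates; model() stands in for the notebook's
# training function and is assumed to return (train_accuracy, test_accuracy).
for lr in [0.009, 0.005, 0.002, 0.001, 0.0005]:
    train_acc, test_acc = model(X_train, Y_train, X_test, Y_test,
                                learning_rate=lr, num_epochs=100)
    print("lr=%g  train=%.2f  test=%.2f" % (lr, train_acc, test_acc))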
