
Distributed Tensorflow: Internal Error - Blas GEMM launch failed

I am experimenting with distributed TensorFlow and started with two processes on localhost (Windows 10, Python 3.6.6, TensorFlow 1.8.0). Each process runs a replica of a simple neural network (one hidden layer), modeled on a subset of the UrbanSounds dataset (5,268 samples with 193 features each).

Following this well-written post: https://learningtensorflow.com/lesson11/ I was able to reproduce their basic example, which calculates a mean from the results of two distinct processes. For my dataset, I modified the code as follows, dividing the total samples into two halves and letting two distinct processes compute the cost function separately. But after the RPC server starts successfully, both processes end up with the following error:

InternalError (see above for traceback): Blas GEMM launch failed : a.shape=(263, 193), b.shape=(193, 200), m=263, n=200, k=193

[[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:local/replica:0/task:0/device:GPU:0"](_recv_Placeholder_0_G7, w1/read)]]
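
For reference, the shapes in the error do line up with the intended half split of the data; below is a quick sanity check of that arithmetic (a sketch only, using the sizes and slicing from the trace and code sample further below):

# Sanity check of the shapes reported in the error (sketch only; sizes and
# slicing taken from the code sample and trace further below).
n_train, n_tasks = 528, 2          # train_data has shape (528, 193)
n_features, n_hidden_1 = 193, 200

batch_size = int(n_train / n_tasks)      # 264
rows_task0 = len(range(batch_size - 1))  # train_data[:batch_size-1] -> 263 rows
print((rows_task0, n_features))          # a.shape = (263, 193) fed into the first MatMul
print((n_features, n_hidden_1))          # b.shape = (193, 200), i.e. w1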

It looks to me like some basic mistake in the neural network configuration or in preparing the datasets for feed_dict, but I am unable to spot it, so I need another pair of eyes. Another observation from this experiment is that GPU memory usage mostly shot to the maximum and the code aborted. Please point out any mistake in the code, or in my strategy for distributing TensorFlow.

Thank you.

### ERROR TRACE (removed duplicate rows ...) ###

train_data, train_labels (528, 193) (528, 10)
test_data, test_labels (22, 193) (22, 10)
2018-08-27 14:35:29.096572: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2018-08-27 14:35:29.330127: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1356] Found device 0 with properties:
name: GeForce GTX 1070 major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:01:00.0
totalMemory: 8.00GiB freeMemory: 6.63GiB
...
2018-08-27 14:35:33.982347: E T:\src\github\tensorflow\tensorflow\stream_executor\cuda\cuda_blas.cc:462] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
Traceback (most recent call last):
  File "C:\Users\shakeel\Anaconda3\envs\tensorflow-gpu\lib\site-packages\tensorflow\python\client\session.py", line 1322, in _do_call
2018-08-27 14:35:33.989312: W T:\src\github\tensorflow\tensorflow\stream_executor\stream.cc:2001] attempting to perform BLAS operation using StreamExecutor without BLAS support
    return fn(*args)
  ...
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(263, 193), b.shape=(193, 200), m=263, n=200, k=193
  [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:local/replica:0/task:0/device:GPU:0"](_recv_Placeholder_0_G7, w1/read)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  ...
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(263, 193), b.shape=(193, 200), m=263, n=200, k=193
  [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:local/replica:0/task:0/device:GPU:0"](_recv_Placeholder_0_G7, w1/read)]]

Caused by op 'MatMul', defined at:
  File "tf_dis_audio_test.py", line 78, in <module>
    z = tf.nn.tanh(tf.matmul(X, w1) + b1)
  File "C:\Users\shakeel\Anaconda3\envs\tensorflow-gpu\lib\site-packages\tensorflow\python\ops\math_ops.py", line 2122, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
  ...
InternalError (see above for traceback): Blas GEMM launch failed : a.shape=(263, 193), b.shape=(193, 200), m=263, n=200, k=193
  [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:local/replica:0/task:0/device:GPU:0"](_recv_Placeholder_0_G7, w1/read)]]
### CODE SAMPLE ###

# selected UrbanSounds dataset
print("train_data, train_labels", train_data.shape, train_labels.shape)
print("test_data, test_labels", test_data.shape, test_labels.shape)

# neural network configurations
cost = 0.0
n_tasks = 2
n_epochs = 10
n_classes = 10
n_features = 193
n_hidden_1 = 200
learning_rate = 0.1
sd = 1 / np.sqrt(n_features)
cost_history = np.empty(shape=[1], dtype=float)

# task#0 is set as rpc host process
rpc_server = "grpc://localhost:2001"

# run two separate python shells, each with its task number (0, 1), as:
#   >python this_script.py 0
#   >python this_script.py 1
task_number = int(sys.argv[1])

# cluster specs with two localhosts on different ports (2001, 2002)
cluster = tf.train.ClusterSpec({job_name: ["localhost:2001", "localhost:2002"]})
server = tf.train.Server(cluster, job_name="local", task_index=task_number)
server.start()

graph = tf.Graph()
with graph.as_default():
    X = tf.placeholder(tf.float32, [None, n_features])
    Y = tf.placeholder(tf.float32, [None, n_classes])

    w1 = tf.Variable(tf.random_normal([n_features, n_hidden_1], mean=0, stddev=sd), name="w1")
    b1 = tf.Variable(tf.random_normal([n_hidden_1], mean=0, stddev=sd), name="b1")
    w2 = tf.Variable(tf.random_normal([n_hidden_1, n_classes], mean=0, stddev=sd), name="w2")
    b2 = tf.Variable(tf.random_normal([n_classes], mean=0, stddev=sd), name="b2")

    z = tf.nn.tanh(tf.matmul(X, w1) + b1)
    _y = tf.nn.softmax(tf.matmul(z, w2) + b2)

    cost_function = tf.reduce_mean(tf.square(Y - _y))
    train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost_function)

    prediction = tf.equal(tf.argmax(Y, 1), tf.argmax(_y, 1))
    accuracy = tf.reduce_mean(tf.cast(prediction, tf.float32)) * 100.0

print("#2: {}".format(datetime.utcnow().strftime(datetime_format)[:-3]))

# hack to fix the GPU out of memory issue
# but it does not make any good, GPU still shoots :(
gpuops = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
config = tf.ConfigProto(gpu_options=gpuops)

with tf.Session(rpc_server, graph=graph, config=config) as ss:
    # setting up the session with RPC host
    ss = tf.Session(rpc_server)
    ss.run(tf.global_variables_initializer())

    for epoch in range(n_epochs):
        batch_size = int(len(train_labels) / n_tasks)

        # run session for task#0
        if (task_number == 0):
            _, cost = ss.run([train_step, cost_function],
                             feed_dict={X: train_data[:batch_size-1], Y: train_labels[:batch_size-1]})
        # run session for task#1
        elif (task_number == 1):
            _, cost = ss.run([train_step, cost_function],
                             feed_dict={X: train_data[batch_size:-1], Y: train_labels[batch_size:-1]})

        # recording the running cost of both processes
        cost_history = np.append(cost_history, cost)
        print(" epoch {}: task {}: history {:.3f}".format(epoch, task_number, cost_history))

        print("Accuracy SGD ({}): {:.3f}".format(
            epoch, round(ss.run(accuracy, feed_dict={X: test_data, Y: test_labels}), 3)))
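
One more note on the GPU-memory hack above: the full trace shows failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED while both local tasks share the same GTX 1070, and the setting usually suggested for that symptom in TF 1.x is allow_growth rather than a fixed memory fraction. Here is a minimal sketch of how it would slot into the session above (I have not confirmed that it cures this particular setup):

# Sketch only: allow_growth lets each process grab GPU memory on demand instead of
# pre-allocating, which is the usual suggestion when two processes sharing one GPU
# hit CUBLAS_STATUS_ALLOC_FAILED. Assumes rpc_server and graph as defined above.
gpuops = tf.GPUOptions(allow_growth=True)
config = tf.ConfigProto(gpu_options=gpuops, allow_soft_placement=True)

with tf.Session(rpc_server, graph=graph, config=config) as ss:
    ss.run(tf.global_variables_initializer())
    # ... same training loop as above ...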

Simply moving the code above to Ubuntu 16.04.4 LTS solved the problem for me.

I am not sure, but this seems to be something related to gRPC and the firewall on Windows 10.

If anybody comes across this BLAS error on Windows and manages to solve it there, please post the solution for the rest of us.

Cheers.

