SVD in TensorFlow is slower than in numpy

I am observing that on my machine SVD in TensorFlow runs significantly slower than in numpy. I have a GTX 1080 GPU, and I expected the SVD to be at least as fast as when running the code on the CPU (numpy).

Environment Info

Operating System

lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 16.10
Release:    16.10
Codename:   yakkety

Installed versions of CUDA and cuDNN:

ls -l /usr/local/cuda-8.0/lib64/libcud*
-rw-r--r-- 1 root      root    556000 Feb 22  2017 /usr/local/cuda-8.0/lib64/libcudadevrt.a
lrwxrwxrwx 1 root      root        16 Feb 22  2017 /usr/local/cuda-8.0/lib64/libcudart.so -> libcudart.so.8.0
lrwxrwxrwx 1 root      root        19 Feb 22  2017 /usr/local/cuda-8.0/lib64/libcudart.so.8.0 -> libcudart.so.8.0.61
-rwxr-xr-x 1 root      root    415432 Feb 22  2017 /usr/local/cuda-8.0/lib64/libcudart.so.8.0.61
-rw-r--r-- 1 root      root    775162 Feb 22  2017 /usr/local/cuda-8.0/lib64/libcudart_static.a
lrwxrwxrwx 1 voldemaro users       13 Nov  6  2016 /usr/local/cuda-8.0/lib64/libcudnn.so -> libcudnn.so.5
lrwxrwxrwx 1 voldemaro users       18 Nov  6  2016 /usr/local/cuda-8.0/lib64/libcudnn.so.5 -> libcudnn.so.5.1.10
-rwxr-xr-x 1 voldemaro users 84163560 Nov  6  2016 /usr/local/cuda-8.0/lib64/libcudnn.so.5.1.10
-rw-r--r-- 1 voldemaro users 70364814 Nov  6  2016 /usr/local/cuda-8.0/lib64/libcudnn_static.a

TensorFlow Setup

python -c "import tensorflow; print(tensorflow.__version__)"
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
1.0.0

Code:

'''
Created on Sep 21, 2017

@author: voldemaro
'''
from __future__ import print_function

import time

import numpy as np
import numpy.linalg as NLA
import tensorflow as tf

N = 1534

svd_array = np.random.random_sample((N, N))
svd_array = svd_array.astype(complex)

specVar = tf.Variable(svd_array, dtype=tf.complex64)

# tf.svd returns the singular values first, then the left and right singular vectors
D2, E1, E2 = tf.svd(specVar)

init_OP = tf.global_variables_initializer()

with tf.Session() as sess:
    # Initialize all tensorflow variables
    start = time.time()
    sess.run(init_OP)
    print('initializing variables: {} s'.format(time.time() - start))

    start_time = time.time()
    d, e1, e2 = sess.run([D2, E1, E2])
    print('Tensorflow SVD ---: {} s'.format(time.time() - start_time))


# Equivalent numpy
start = time.time()

u, s, v = NLA.svd(svd_array)
print('numpy SVD  ---: {} s'.format(time.time() - start))

Code Trace:

W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:910] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties: 
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 7.11GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0)
initializing variables: 0.230546951294 s
Tensorflow SVD ---: 6.56117296219 s
numpy SVD  ---: 4.41714000702 s

GPU execution typically outperforms CPU execution only when the parallelization is effective.

However, the parallelization of SVD algorithms is still an area of active research; no parallel version has yet been found to be vastly superior to the serial implementation.

The NumPy version is likely based on an extremely well-optimized FORTRAN (LAPACK) implementation, while TensorFlow, I believe, has its own C++ implementation, and apparently that is not as well optimized as the code NumPy calls.
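(One quick way to see which BLAS/LAPACK build your NumPy is actually linked against is the following diagnostic; the exact output depends on your installation.)

import numpy as np
np.__config__.show()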

EDIT: You may not be the first to observe the poorer performance of TensorFlow's SVD compared to the FORTRAN implementations.

It looks like the TensorFlow op implements gesvd, whereas if you use MKL-enabled numpy/scipy (i.e., if you use conda), it defaults to the faster (but less numerically robust) gesdd.

You can try comparing against gesvd in scipy:

from scipy import linalg
u0, s0, vt0 = linalg.svd(target0, lapack_driver="gesvd")
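For example, a rough timing comparison of the two LAPACK drivers on a matrix of the same size as in the question (a sketch; svd_array is a stand-in for your own matrix, and the timings will vary with your machine and scipy/MKL build):

import time
import numpy as np
from scipy import linalg

svd_array = np.random.random_sample((1534, 1534)).astype(complex)

for driver in ("gesdd", "gesvd"):
    start = time.time()
    u, s, vt = linalg.svd(svd_array, lapack_driver=driver)
    print("{}: {:.2f} s".format(driver, time.time() - start))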

I've also experienced better results with the MKL version, so I've been using this helper class to transparently switch between the TensorFlow and numpy versions of SVD, using tf.Variable to store the results.

You use it like this:

result = SvdWrapper(tensor)
result.update()
sess.run([result.u, result.s, result.v])
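The helper class itself isn't included here; a minimal sketch of what such a wrapper might look like (my own assumption, not the actual implementation: it runs numpy's SVD on the tensor's current value and pushes the factors into tf.Variables via Variable.load, using the TF 1.x session API):

import numpy as np
import tensorflow as tf

class SvdWrapper(object):
    """Hypothetical sketch: computes the SVD of a TF tensor with numpy
    and caches the factors in tf.Variables (not the original helper)."""

    def __init__(self, tensor):
        self.tensor = tensor
        m, n = tensor.get_shape().as_list()
        k = min(m, n)
        dtype = tensor.dtype
        # Variables that hold the factors after update() has been called.
        self.u = tf.Variable(tf.zeros([m, k], dtype=dtype), trainable=False)
        self.s = tf.Variable(tf.zeros([k], dtype=dtype), trainable=False)
        self.v = tf.Variable(tf.zeros([n, k], dtype=dtype), trainable=False)

    def update(self, sess=None):
        sess = sess or tf.get_default_session()
        # Pull the tensor's current value out of the graph and factor it in numpy.
        a = sess.run(self.tensor)
        u, s, vt = np.linalg.svd(a, full_matrices=False)
        # load() writes the numpy results into the variables without adding graph ops.
        self.u.load(u.astype(a.dtype), sess)
        self.s.load(s.astype(a.dtype), sess)
        self.v.load(vt.conj().T.astype(a.dtype), sess)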

Issue with more details on the slowness: https://github.com/tensorflow/tensorflow/issues/13222
