Theano简单线性回归在CPU而不是GPU上运行

Question

I created a simple python script (using Theano) performing linear regression which should be run on GPU. 我创建了一个简单的python脚本（使用Theano）执行线性回归，应该在GPU上运行。 When code starts it says "using gpu device", but (according to the profiler) all operations are CPU-specific (ElemWise, instead of GpuElemWise, no GpuFromHost etc.). 代码启动时会显示“使用gpu device”，但是（根据分析器）所有操作都是CPU特定的（ElemWise，而不是GpuElemWise，没有GpuFromHost等）。

I checked the variables, THEANO_FLAGS, everything seems right and I cannot see the catch (especially when Theano tutorials with the same settings are correctly run on GPU :)). 我检查了变量，THEANO_FLAGS，一切似乎都正确，我看不到捕获（特别是当使用相同设置的Theano教程在GPU上正确运行:)）。

Here is the code: 这是代码：

# linear regression

import numpy
import theano
import theano.tensor as T

input_data = numpy.matrix([[28, 1], [35, 2], [18, 1], [56, 2], [80, 3]])
output_data = numpy.matrix([1600, 2100, 1400, 2500, 3200])

TS = theano.shared(input_data, "training-set")
E = theano.shared(output_data, "expected")
W1 = theano.shared(numpy.zeros((1, 2)))

O = T.dot(TS, W1.T)
cost = T.mean(T.sqr(E - O.T))
gradient = T.grad(cost=cost, wrt=W1)
update = [[W1, W1 - gradient * 0.0001]]
train = theano.function([], cost, updates=update, allow_input_downcast=True)

for i in range(1000):
    train()

THEANO_FLAGS=cuda.root=/usr/local/cuda THEANO_FLAGS = cuda.root =在/ usr /本地/ CUDA

device=gpu 设备= GPU

floatX=float32 floatX = FLOAT32

lib.cnmem=.5 lib.cnmem = 0.5

profile=True 配置=真

CUDA_LAUNCH_BLOCKING=1 CUDA_LAUNCH_BLOCKING = 1

Output: 输出：

Using gpu device 0: GeForce GT 650M (CNMeM is enabled)
Function profiling
==================
  Message: /home/mw/Documents/LiClipse Workspace/theano1/test2.py:18
  Time in 1000 calls to Function.__call__: 3.348637e-02s
  Time in Function.fn.__call__: 2.419019e-02s (72.239%)
  Time in thunks: 1.839781e-02s (54.941%)
  Total compile time: 1.350801e-01s
    Number of Apply nodes: 18
    Theano Optimizer time: 1.101730e-01s
       Theano validate time: 2.029657e-03s
    Theano Linker time (includes C, CUDA code generation/compiling): 1.491690e-02s
       Import time 2.320528e-03s

Time in all call to theano.grad() 8.740902e-03s
Time since theano import 0.881s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  71.7%    71.7%       0.013s       6.59e-06s     Py    2000       2   theano.tensor.basic.Dot
  12.3%    83.9%       0.002s       3.22e-07s     C     7000       7   theano.tensor.elemwise.Elemwise
   5.7%    89.6%       0.001s       3.50e-07s     C     3000       3   theano.tensor.elemwise.DimShuffle
   4.0%    93.6%       0.001s       3.65e-07s     C     2000       2   theano.tensor.subtensor.Subtensor
   3.6%    97.2%       0.001s       3.31e-07s     C     2000       2   theano.compile.ops.Shape_i
   1.7%    98.9%       0.000s       3.06e-07s     C     1000       1   theano.tensor.opt.MakeVector
   1.1%   100.0%       0.000s       2.10e-07s     C     1000       1   theano.tensor.elemwise.Sum
   ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  71.7%    71.7%       0.013s       6.59e-06s     Py    2000        2   dot
   4.0%    75.6%       0.001s       3.65e-07s     C     2000        2   Subtensor{int64}
   3.5%    79.1%       0.001s       6.35e-07s     C     1000        1   InplaceDimShuffle{1,0}
   3.3%    82.4%       0.001s       6.06e-07s     C     1000        1   Elemwise{mul,no_inplace}
   2.4%    84.8%       0.000s       4.38e-07s     C     1000        1   Shape_i{0}
   2.3%    87.1%       0.000s       4.29e-07s     C     1000        1   Elemwise{Composite{((i0 * i1) / i2)}}
   2.3%    89.3%       0.000s       2.08e-07s     C     2000        2   InplaceDimShuffle{x,x}
   1.8%    91.1%       0.000s       3.25e-07s     C     1000        1   Elemwise{Cast{float64}}
   1.7%    92.8%       0.000s       3.06e-07s     C     1000        1   MakeVector{dtype='int64'}
   1.5%    94.3%       0.000s       2.78e-07s     C     1000        1   Elemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)]
   1.4%    95.7%       0.000s       2.53e-07s     C     1000        1   Elemwise{Sub}[(0, 1)]
   1.2%    96.9%       0.000s       2.24e-07s     C     1000        1   Shape_i{1}
   1.1%    98.0%       0.000s       2.10e-07s     C     1000        1   Sum{acc_dtype=float64}
   1.1%    99.1%       0.000s       1.98e-07s     C     1000        1   Elemwise{Sqr}[(0, 0)]
   0.9%   100.0%       0.000s       1.66e-07s     C     1000        1   Elemwise{Composite{((i0 / i1) / i2)}}[(0, 0)]
   ... (remaining 0 Ops account for   0.00%(0.00s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  37.8%    37.8%       0.007s       6.95e-06s   1000     3   dot(<TensorType(float64, matrix)>, training-set.T)
  33.9%    71.7%       0.006s       6.24e-06s   1000    14   dot(Elemwise{Composite{((i0 * i1) / i2)}}.0, training-set)
   3.5%    75.1%       0.001s       6.35e-07s   1000     0   InplaceDimShuffle{1,0}(training-set)
   3.3%    78.4%       0.001s       6.06e-07s   1000    11   Elemwise{mul,no_inplace}(InplaceDimShuffle{x,x}.0, InplaceDimShuffle{x,x}.0)
   3.0%    81.4%       0.001s       5.58e-07s   1000     8   Subtensor{int64}(Elemwise{Cast{float64}}.0, Constant{1})
   2.4%    83.8%       0.000s       4.38e-07s   1000     2   Shape_i{0}(expected)
   2.3%    86.2%       0.000s       4.29e-07s   1000    12   Elemwise{Composite{((i0 * i1) / i2)}}(TensorConstant{(1, 1) of -2.0}, Elemwise{Sub}[(0, 1)].0, Elemwise{mul,no_inplace}.0)
   1.8%    87.9%       0.000s       3.25e-07s   1000     6   Elemwise{Cast{float64}}(MakeVector{dtype='int64'}.0)
   1.7%    89.6%       0.000s       3.06e-07s   1000     4   MakeVector{dtype='int64'}(Shape_i{0}.0, Shape_i{1}.0)
   1.6%    91.2%       0.000s       3.03e-07s   1000    10   InplaceDimShuffle{x,x}(Subtensor{int64}.0)
   1.5%    92.7%       0.000s       2.78e-07s   1000    16   Elemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)](<TensorType(float64, matrix)>, TensorConstant{(1, 1) of ..974738e-05}, dot.0)
   1.4%    94.1%       0.000s       2.53e-07s   1000     5   Elemwise{Sub}[(0, 1)](expected, dot.0)
   1.2%    95.3%       0.000s       2.24e-07s   1000     1   Shape_i{1}(expected)
   1.1%    96.5%       0.000s       2.10e-07s   1000    15   Sum{acc_dtype=float64}(Elemwise{Sqr}[(0, 0)].0)
   1.1%    97.6%       0.000s       1.98e-07s   1000    13   Elemwise{Sqr}[(0, 0)](Elemwise{Sub}[(0, 1)].0)
   0.9%    98.5%       0.000s       1.72e-07s   1000     7   Subtensor{int64}(Elemwise{Cast{float64}}.0, Constant{0})
   0.9%    99.4%       0.000s       1.66e-07s   1000    17   Elemwise{Composite{((i0 / i1) / i2)}}[(0, 0)](Sum{acc_dtype=float64}.0, Subtensor{int64}.0, Subtensor{int64}.0)
   0.6%   100.0%       0.000s       1.13e-07s   1000     9   InplaceDimShuffle{x,x}(Subtensor{int64}.0)
   ... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)

Answer 1

As mentioned in the comments although you have set the allow_input_downcast parameter to True , but you need to make sure all the data to be assigned to shared variables are in float32 . 正如注释中所述，虽然您已将allow_input_downcast参数设置为True ，但您需要确保要分配给共享变量的所有数据都在float32 。 As of Jan. 06, 2016 Theano still cannot work with any other data type rather than float32 to do the computations on GPU, as mentioned here in a more details. 截至1月06年，2016年 Theano仍无法与任何其它数据类型的工作，而不是float32上做GPU的计算，提到这里的更多细节。 So you have to have to cast your data into 'float32' format. 因此，您必须将数据转换为“float32”格式。

Therefore, here should be the code you need to use: 因此，这里应该是您需要使用的代码：

import numpy
import theano
import theano.tensor as T


input_data = numpy.matrix([[28, 1], [35, 2], [18, 1], [56, 2], [80, 3]])
output_data = numpy.matrix([1600, 2100, 1400, 2500, 3200])

TS = theano.shared(input_data.astype('float32'), "training-set")
E = theano.shared(output_data.astype('float32'), "expected")
W1 = theano.shared(numpy.zeros((1, 2), dtype = 'float32'))

O = T.dot(TS, W1.T)
cost = T.mean(T.sqr(E - O.T))
gradient = T.grad(cost=cost, wrt=W1)
update = [[W1, W1 - gradient * 0.0001]]
train = theano.function([], cost, updates=update, allow_input_downcast=True, profile = True)

for i in range(1000):
    train()

train.profile.print_summary()

And here will be the profiling result: 这将是分析结果：

Message: LearnTheano.py:18
  Time in 1000 calls to Function.__call__: 2.642968e-01s
  Time in Function.fn.__call__: 2.460811e-01s (93.108%)
  Time in thunks: 1.877530e-01s (71.039%)
  Total compile time: 2.483290e+01s
    Number of Apply nodes: 17
    Theano Optimizer time: 2.818849e-01s
       Theano validate time: 3.435850e-03s
    Theano Linker time (includes C, CUDA code generation/compiling): 2.453926e+01s
       Import time 1.241469e-02s

Time in all call to theano.grad() 1.206994e-02s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  34.8%    34.8%       0.065s       3.27e-05s     C     2000       2   theano.sandbox.cuda.blas.GpuGemm
  28.8%    63.5%       0.054s       1.80e-05s     C     3000       3   theano.sandbox.cuda.basic_ops.GpuElemwise
  12.9%    76.4%       0.024s       2.42e-05s     C     1000       1   theano.sandbox.cuda.basic_ops.GpuCAReduce
  10.3%    86.7%       0.019s       1.93e-05s     C     1000       1   theano.sandbox.cuda.basic_ops.GpuFromHost
   7.2%    93.9%       0.014s       1.36e-05s     C     1000       1   theano.sandbox.cuda.basic_ops.HostFromGpu
   1.8%    95.7%       0.003s       1.13e-06s     C     3000       3   theano.sandbox.cuda.basic_ops.GpuDimShuffle
   1.5%    97.2%       0.003s       2.81e-06s     C     1000       1   theano.tensor.elemwise.Elemwise
   1.1%    98.4%       0.002s       1.08e-06s     C     2000       2   theano.compile.ops.Shape_i
   1.1%    99.5%       0.002s       1.02e-06s     C     2000       2   theano.sandbox.cuda.basic_ops.GpuSubtensor
   0.5%   100.0%       0.001s       9.96e-07s     C     1000       1   theano.tensor.opt.MakeVector
   ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  25.3%    25.3%       0.047s       4.74e-05s     C     1000        1   GpuGemm{no_inplace}
  12.9%    38.1%       0.024s       2.42e-05s     C     1000        1   GpuCAReduce{pre=sqr,red=add}{1,1}
  12.8%    51.0%       0.024s       2.41e-05s     C     1000        1   GpuElemwise{mul,no_inplace}
  10.3%    61.3%       0.019s       1.93e-05s     C     1000        1   GpuFromHost
   9.5%    70.8%       0.018s       1.79e-05s     C     1000        1   GpuGemm{inplace}
   8.2%    79.0%       0.015s       1.55e-05s     C     1000        1   GpuElemwise{Composite{((i0 / i1) / i2)}}[(0, 0)]
   7.7%    86.7%       0.014s       1.44e-05s     C     1000        1   GpuElemwise{Composite{((i0 * i1) / i2)}}[(0, 1)]
   7.2%    93.9%       0.014s       1.36e-05s     C     1000        1   HostFromGpu
   1.5%    95.4%       0.003s       2.81e-06s     C     1000        1   Elemwise{Cast{float32}}
   1.1%    96.5%       0.002s       1.02e-06s     C     2000        2   GpuSubtensor{int64}
   1.0%    97.5%       0.002s       9.00e-07s     C     2000        2   GpuDimShuffle{x,x}
   0.8%    98.3%       0.002s       1.59e-06s     C     1000        1   GpuDimShuffle{1,0}
   0.7%    99.1%       0.001s       1.38e-06s     C     1000        1   Shape_i{0}
   0.5%    99.6%       0.001s       9.96e-07s     C     1000        1   MakeVector
   0.4%   100.0%       0.001s       7.76e-07s     C     1000        1   Shape_i{1}
   ... (remaining 0 Ops account for   0.00%(0.00s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  25.3%    25.3%       0.047s       4.74e-05s   1000     3   GpuGemm{no_inplace}(expected, TensorConstant{-1.0}, <CudaNdarrayType(float32, matrix)>, GpuDimShuffle{1,0}.0, TensorConstant{1.0})
  12.9%    38.1%       0.024s       2.42e-05s   1000     5   GpuCAReduce{pre=sqr,red=add}{1,1}(GpuGemm{no_inplace}.0)
  12.8%    51.0%       0.024s       2.41e-05s   1000    13   GpuElemwise{mul,no_inplace}(GpuDimShuffle{x,x}.0, GpuDimShuffle{x,x}.0)
  10.3%    61.3%       0.019s       1.93e-05s   1000     7   GpuFromHost(Elemwise{Cast{float32}}.0)
   9.5%    70.8%       0.018s       1.79e-05s   1000    16   GpuGemm{inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{-9.99999974738e-05}, GpuElemwise{Composite{((i0 * i1) / i2)}}[(0, 1)].0, training-set, TensorConstant{1.0})
   8.2%    79.0%       0.015s       1.55e-05s   1000    12   GpuElemwise{Composite{((i0 / i1) / i2)}}[(0, 0)](GpuCAReduce{pre=sqr,red=add}{1,1}.0, GpuSubtensor{int64}.0, GpuSubtensor{int64}.0)
   7.7%    86.7%       0.014s       1.44e-05s   1000    15   GpuElemwise{Composite{((i0 * i1) / i2)}}[(0, 1)](CudaNdarrayConstant{[[-2.]]}, GpuGemm{no_inplace}.0, GpuElemwise{mul,no_inplace}.0)
   7.2%    93.9%       0.014s       1.36e-05s   1000    14   HostFromGpu(GpuElemwise{Composite{((i0 / i1) / i2)}}[(0, 0)].0)
   1.5%    95.4%       0.003s       2.81e-06s   1000     6   Elemwise{Cast{float32}}(MakeVector.0)
   0.8%    96.3%       0.002s       1.59e-06s   1000     0   GpuDimShuffle{1,0}(training-set)
   0.7%    97.0%       0.001s       1.38e-06s   1000     2   Shape_i{0}(expected)
   0.7%    97.7%       0.001s       1.30e-06s   1000     8   GpuSubtensor{int64}(GpuFromHost.0, Constant{0})
   0.6%    98.3%       0.001s       1.08e-06s   1000    11   GpuDimShuffle{x,x}(GpuSubtensor{int64}.0)
   0.5%    98.8%       0.001s       9.96e-07s   1000     4   MakeVector(Shape_i{0}.0, Shape_i{1}.0)
   0.4%    99.2%       0.001s       7.76e-07s   1000     1   Shape_i{1}(expected)
   0.4%    99.6%       0.001s       7.40e-07s   1000     9   GpuSubtensor{int64}(GpuFromHost.0, Constant{1})
   0.4%   100.0%       0.001s       7.25e-07s   1000    10   GpuDimShuffle{x,x}(GpuSubtensor{int64}.0)
   ... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)

Theano简单线性回归在CPU而不是GPU上运行

问题描述

1 个解决方案

解决方案1
2 已采纳 2016-01-06 00:29:42

Theano简单线性回归在CPU而不是GPU上运行

问题描述

1 个解决方案

解决方案1 2 已采纳 2016-01-06 00:29:42

解决方案1
2 已采纳 2016-01-06 00:29:42