
CUDA runtime gpu initialization with theano

I am trying to parallelize my NN across two GPUs following https://github.com/uoguelph-mlrg/theano_multi_gpu. I have all the dependencies, but the CUDA runtime initialization fails with the following message:

ERROR (theano.sandbox.cuda): ERROR: Not using GPU. Initialisation of device 0 failed:
cublasCreate() returned this error 'the CUDA Runtime initialization failed'
Error when trying to find the memory information on the GPU: invalid device ordinal
Error allocating 24 bytes of device memory (invalid device ordinal). Driver report 0 bytes free and 0 bytes total
ERROR (theano.sandbox.cuda): ERROR: Not using GPU. Initialisation of device gpu failed:
CudaNdarray_ZEROS: allocation failed.
Process Process-1:
Traceback (most recent call last):
  File "/opt/share/Python-2.7.9/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/opt/share/Python-2.7.9/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/u/bsankara/nt/Git-nt/nt/train_attention.py", line 171, in launch_train
    clip_c=1.)
  File "/u/bsankara/nt/Git-nt/nt/nt.py", line 1616, in train
    import theano.sandbox.cuda
  File "/opt/share/Python-2.7.9/lib/python2.7/site-packages/theano/__init__.py", line 98, in <module>
    theano.sandbox.cuda.tests.test_driver.test_nvidia_driver1()
  File "/opt/share/Python-2.7.9/lib/python2.7/site-packages/theano/sandbox/cuda/tests/test_driver.py", line 30, in test_nvidia_driver1
    A = cuda.shared_constructor(a)
  File "/opt/share/Python-2.7.9/lib/python2.7/site-packages/theano/sandbox/cuda/var.py", line 181, in float32_shared_constructor
    enable_cuda=False)
  File "/opt/share/Python-2.7.9/lib/python2.7/site-packages/theano/sandbox/cuda/__init__.py", line 389, in use
    cuda_ndarray.cuda_ndarray.CudaNdarray.zeros((2, 3))
RuntimeError: ('CudaNdarray_ZEROS: allocation failed.', 'You asked to force this device and it failed. No fallback to the cpu or other gpu device.')

The relevant part of the code snippet is here:

import os
from multiprocessing import Process, Queue
import zmq
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray

def train(private_args, process_env, <some other args>):
    if process_env is not None:
        os.environ = process_env

    ####
    # pycuda and zmq environment

    drv.init()
    dev = drv.Device(private_args['ind_gpu'])
    ctx = dev.make_context()
    sock = zmq.Context().socket(zmq.PAIR)

    if private_args['flag_client']:
        sock.connect('tcp://localhost:5000')
    else:
        sock.bind('tcp://*:5000')

    ####
    # import theano stuffs
    import theano.sandbox.cuda
    theano.sandbox.cuda.use(private_args['gpu'])

    import theano
    import theano.tensor as tensor
    from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams
    import theano.misc.pycuda_init
    import theano.misc.pycuda_utils
...

The error is triggered when it imports theano.sandbox.cuda. And this is where I launch the training function as two processes:

def launch_train(curr_args, process_env, curr_queue, oth_queue):
    trainerr, validerr, testerr = train(private_args=curr_args,
                                        process_env=process_env,
                                        ...)

process1_env = os.environ.copy()
process1_env['THEANO_FLAGS'] = "cuda.root=/opt/share/cuda-7.0,device=gpu0,floatX=float32,on_unused_input=ignore,optimizer=fast_run,exception_verbosity=high,compiledir=/u/bsankara/.theano/NT_multi_GPU1"
process2_env = os.environ.copy()
process2_env['THEANO_FLAGS'] = "cuda.root=/opt/share/cuda-7.0,device=gpu1,floatX=float32,on_unused_input=ignore,optimizer=fast_run,exception_verbosity=high,compiledir=/u/bsankara/.theano/NT_multi_GPU2"

p = Process(target=launch_train,
            args=(p_args, process1_env, queue_p, queue_q))
q = Process(target=launch_train,
            args=(q_args, process2_env, queue_q, queue_p))

p.start()
q.start()
p.join()
q.join()
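For context, here is a minimal sketch of how the queues and per-process argument dicts referenced above might be set up. Only the keys 'gpu', 'ind_gpu' and 'flag_client' are taken from the train() snippet; the concrete values and variable names are assumptions:

from multiprocessing import Process, Queue

# Hypothetical setup: the key names come from train() above,
# the values are illustrative only.
queue_p = Queue()
queue_q = Queue()

p_args = {'gpu': 'gpu0', 'ind_gpu': 0, 'flag_client': True}
q_args = {'gpu': 'gpu1', 'ind_gpu': 1, 'flag_client': False}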

The import statement, however, seems to work if I try to initialize the GPU interactively in Python. I executed the first 20 lines of train() and it worked fine there, and it also correctly assigned me to gpu0 as I requested.
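By "interactively" the poster means re-running the same first steps of train() by hand in a Python shell, roughly like this (a sketch only; the THEANO_FLAGS value is copied from process1_env above, and using device index 0 for gpu0 is an assumption):

import os
os.environ['THEANO_FLAGS'] = "cuda.root=/opt/share/cuda-7.0,device=gpu0,floatX=float32"

import pycuda.driver as drv
drv.init()
ctx = drv.Device(0).make_context()   # stands in for drv.Device(private_args['ind_gpu'])

import theano.sandbox.cuda           # succeeds in the shell, unlike inside the subprocess
theano.sandbox.cuda.use('gpu0')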

After digging around and running pdb, the original poster found the issue.

Basically, theano and pycuda were both competing to initialize the GPU, which caused the problem. The solution is to import theano first, which grabs a GPU, and then attach to that existing context from pycuda. So the import section within the train function looks like this:

def train(private_args, process_env, <some other args>):
    if process_env is not None:
        os.environ = process_env

    ####
    # import theano related
    # We need global imports and so we make them as such
    theano = __import__('theano')
    _t_tensor = __import__('theano', globals(), locals(), ['tensor'], -1)
    tensor = _t_tensor.tensor

    import theano.sandbox.cuda
    import theano.misc.pycuda_utils

    ####
    # pycuda and zmq environment
    import zmq
    import pycuda.driver as drv
    import pycuda.gpuarray as gpuarray

    drv.init()
    # Attach the existing context (already initialized by theano import statement)
    ctx = drv.Context.attach()
    sock = zmq.Context().socket(zmq.PAIR)

    if private_args['flag_client']:
        sock.connect('tcp://localhost:5000')
    else:
        sock.bind('tcp://*:5000')
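The snippet above stops before the PAIR socket is actually used. In the uoguelph-mlrg recipe that socket is what carries parameter values between the two GPU processes; below is a minimal sketch of such an exchange. This is not the original code: the helper name is made up, and for simplicity it round-trips through host memory with get_value/set_value instead of using pycuda GPU buffers.

def exchange_params(sock, params, is_client):
    # params: list of theano shared variables living on this process's GPU.
    # Pull the values to the host, swap them with the peer over zmq, and average.
    local = [p.get_value() for p in params]
    if is_client:
        sock.send_pyobj(local)
        remote = sock.recv_pyobj()
    else:
        remote = sock.recv_pyobj()
        sock.send_pyobj(local)
    for p, l, r in zip(params, local, remote):
        p.set_value(((l + r) / 2.0).astype(l.dtype))

With flag_client distinguishing the two processes, both sides could call exchange_params(sock, params, private_args['flag_client']) after each update step.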

[This answer was added as a community wiki entry from an edit made by the OP in an attempt to get this question off the unanswered list.]
