
Theano performance issue when copying data to GPU

I ran into some performance issues while trying to train a deep convolutional neural net using Theano and Lasagne, and I ran a few experiments to investigate where they come from. One thing I found is that loading batches of images from main memory to the GPU takes a very long time. Here is a minimal example that illustrates the problem. It simply times how long it takes to evaluate a Theano identity function on batches of images with batch sizes 1, 2, 4, 8, 16, ... I am working with RGB images of size 448x448.

import numpy as np
import theano
import theano.tensor as T
import time

var = T.ftensor4('inputs')
f = theano.function([var], var)

for batchsize in [2**i for i in range(6)]:
    X = np.zeros((batchsize,3,448,448), dtype=np.float32)
    print "Batchsize", batchsize
    times = []
    start = time.time()
    for i in range(1000):
        f(X)
        times.append(time.time()-start)
        start = time.time()
    print "-> Function evaluation takes:", np.mean(times), "+/-", np.std(times), "sec"

My results are the following:

Batchsize 1
-> Function evaluation takes: 0.000177580833435 +/- 2.78762612138e-05 sec
Batchsize 2
-> Function evaluation takes: 0.000321553707123 +/- 2.4221262933e-05 sec
Batchsize 4
-> Function evaluation takes: 0.000669012069702 +/- 0.000896798280943 sec
Batchsize 8
-> Function evaluation takes: 0.00137474012375 +/- 0.0032982626882 sec
Batchsize 16
-> Function evaluation takes: 0.176659427643 +/- 0.0330068803715 sec
Batchsize 32
-> Function evaluation takes: 0.356572513342 +/- 0.074931685704 sec

Note the increase by a factor of roughly 100 when going from batch size 8 to 16. Is this normal, or do I have some kind of technical problem? If so, do you have any idea where it might come from? Any help is appreciated. It would also help if you could run the code snippet and report what you see.

EDIT: Daniel Renshaw pointed out that it probably has nothing to do with host-GPU copying. Any other ideas where the problem might come from? Some more information:

The Theano debug print of the function reads:

DeepCopyOp [@A] 'inputs'   0
 |inputs [@B]

Output of the Theano profiler:

Function profiling                                                      
================== 
Message: theano_test.py:14
Time in 6000 calls to Function.__call__: 3.711728e+03s
Time in Function.fn.__call__: 3.711528e+03s (99.995%)                       
Time in thunks: 3.711491e+03s (99.994%)
Total compile time: 6.542931e-01s
    Number of Apply nodes: 1
    Theano Optimizer time: 7.912159e-03s
        Theano validate time: 0.000000e+00s
    Theano Linker time (includes C, CUDA code generation/compiling): 8.321500e-02s
        Import time 2.951717e-02s

Time in all call to theano.grad() 0.000000e+00s
Class 
---

<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
100.0%   100.0%     3711.491s       6.19e-01s     C     6000       1   theano.compile.ops.DeepCopyOp
... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
100.0%   100.0%     3711.491s       6.19e-01s     C     6000        1   DeepCopyOp
... (remaining 0 Ops account for   0.00%(0.00s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
100.0%   100.0%     3711.491s       6.19e-01s   6000     0 DeepCopyOp(inputs)
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)

INFO (theano.gof.compilelock): Waiting for existing lock by process '3642' (I am process '22124')
INFO (theano.gof.compilelock): To manually release the lock, delete /home/bal8rng/.theano/compiledir_Linux-3.16--generic-x86_64-with-debian-jessie-sid-x86_64-2.7.10-64/lock_dir

THEANO_FLAGS: floatX=float32,device=gpu,optimizer_including=conv_meta,mode=FAST_RUN,blas.ldflags="-L/usr/lib/openblas-base -lopenblas",device=gpu3,assert_no_cpu_op=raise

Your computation is almost certainly not running on the GPU! As long as you're using the standard configuration flags, Theano's optimizer is clever enough to see that no operations are actually performed, so it doesn't add any "move data to GPU" and "move data back from GPU" operations to the compiled computation. You can see this by adding the following line just after the f = theano.function([var], var) line.

theano.printing.debugprint(f)

If you want to understand the overhead of moving data to and from the GPU, you're probably better served by Theano's built-in profiling tools. Turn profiling on and then, in the output, look at how much time is spent in the GpuFromHost and HostFromGpu operations. This, of course, must be done with a more meaningful computation, one where data actually needs to be moved around.
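
A minimal sketch of what that could look like (assuming your setup with device=gpu and the old CUDA backend): giving the function a trivial real computation such as var * 2 should be enough to force GpuFromHost/HostFromGpu transfer nodes into the graph, and profile=True makes the compiled function record per-op timings.

import numpy as np
import theano
import theano.tensor as T

var = T.ftensor4('inputs')
# var * 2 is an actual GPU computation, so the optimizer has to insert
# GpuFromHost (input transfer) and HostFromGpu (output transfer) nodes
f = theano.function([var], var * 2, profile=True)

X = np.zeros((16, 3, 448, 448), dtype=np.float32)
for i in range(100):
    f(X)

theano.printing.debugprint(f)  # the graph should now list GpuFromHost / HostFromGpu
f.profile.summary()            # per-op timing breakdown, including the transfer ops

The time attributed to GpuFromHost in that summary should then give a direct estimate of the host-to-GPU copy cost for a given batch size.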

However, it is curious that you get the results you do. If the computation is indeed running on the CPU, I would still not expect to see such a step change as the batch size increases. This is probably of little interest to you, though, if the behaviour goes away once the computation is actually running on the GPU.

By the way, running your code (which on my server actually ran on the CPU despite device=gpu in the config, as explained above), I didn't get the same huge step change; my time multipliers were 2.6, 1.9, 4.0, 3.9, 2.0 (i.e. the time increased by a factor of 2.6 from batch size 1 to batch size 2, and so on).
