Why is the GPU much slower than the CPU in Google Colab?

I'm training an RNN on Google Colab and this is my first time using a GPU to train a neural network. From my point of view, the GPU should be much faster than the CPU, and changing the device from CPU to GPU only requires adding `.to('cuda')` in the definition of the model/loss/variables and setting the Colab runtime to run on GPU.
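For reference, a device-agnostic version of that setup looks like this (a minimal sketch with illustrative layer sizes, not the exact model from the post):

import torch
import torch.nn as nn

# Pick the GPU when the Colab runtime provides one, otherwise fall back to CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = nn.RNN(input_size=8, hidden_size=256).to(device)  # illustrative sizes
x = torch.randn(100, 1, 8, device=device)  # create the data directly on the device
output, h = model(x)
print(device, output.shape)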

When I train it on the CPU, the average speed is 650 iterations/s.

(Screenshot: training on CPU in Google Colab)

But when I train it on the GPU, the average speed is only 340 iterations/s, only half the CPU speed.

(Screenshot: training on GPU in Google Colab)

And this happens in every epoch.

Here is my code:

import os
import torch
import torch.nn as nn
from tqdm import tqdm

# get_data() and MyRNN are defined elsewhere in the notebook.

def train(num_epoch=30, len_vocab=1, num_hidden=256, embedding_dim=8, batch_size=100):
    data = get_data()

    model = MyRNN(len_vocab, num_hidden, embedding_dim).to('cuda')  # here: move model to GPU
    if os.path.exists('QingBinLi'):
        model.load_state_dict(torch.load('QingBinLi'))

    criterion = nn.MSELoss().to('cuda')  # here
    optimizer = torch.optim.Adam(model.parameters(), lr=0.1, weight_decay=1e-5)
    loss_for_draw = []
    model.train()
    data = data.detach().to('cuda')  # here: move the data to GPU once, up front

    for epoch in range(num_epoch + 1):

        h = torch.randn(1, batch_size, num_hidden).to('cuda')  # here: initial hidden state on GPU
        loss_average = 0
        for i in tqdm(range(data.shape[-2] - batch_size)):
            optimizer.zero_grad()
            pre, h = model(data[:, :, i:i + batch_size, :].squeeze(0), h)
            h = h.detach()  # truncate backpropagation through time at each step
            pre = pre.unsqueeze(0).unsqueeze(0)
            loss = criterion(pre, data[:, :, i + 1:i + 1 + batch_size, :].squeeze(0))
            loss_average += loss.item()
            loss.backward()
            nn.utils.clip_grad_norm_(model.parameters(), max_norm=10)
            optimizer.step()

        loss_for_draw.append(loss_average / (data.shape[-2] - batch_size))
        torch.save(model.state_dict(), 'QingBinLi')
        print(f'now epoch:{epoch}, loss = {loss_for_draw[-1]}')

    return loss_for_draw

My brother says that when the tensors are very big, say a million elements, the GPU can be faster than the CPU; otherwise we don't even need parallel computing, because the time is spent not mainly on the tensor multiplies but on copying tensors and other overhead like that.

My RNN has about 256x256 + 256x8 = 67,584 parameters and batch_size is 100, so the tensors involved are far smaller than a million elements. That's why the GPU is much slower here.

And when I change my batch_size to 10000, the GPU runs at 145 iterations/s while the CPU manages only 15 iterations/s. This time the GPU is much faster.
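You can see this crossover directly by timing a matrix multiply of different sizes on both devices (a minimal sketch; the shapes are illustrative, not my exact RNN workload):

import time
import torch

def bench(n, device, reps=100):
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    if device == 'cuda':
        torch.cuda.synchronize()  # finish setup before timing
    t0 = time.time()
    for _ in range(reps):
        a @ b
    if device == 'cuda':
        torch.cuda.synchronize()  # wait for all queued GPU kernels to finish
    return (time.time() - t0) / reps

for n in (256, 4096):  # small vs. large multiply
    print(f'n={n}: cpu {bench(n, "cpu")*1e3:.3f} ms, gpu {bench(n, "cuda")*1e3:.3f} ms')

For small n, per-call launch and transfer overhead dominates and the CPU wins; for large n, the GPU's parallelism takes over.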

For a CNN with stride one, on a GPU we can compute filter_size * image_size * batch_size multiplications simultaneously, about 2,415,919,104 of them. So in this kind of computation, the GPU is much faster.
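That count works out exactly for one combination of sizes, such as a 3x3 filter on a 512x512 image with a batch of 1024 (an assumed combination for illustration; the actual sizes aren't stated):

filter_size = 3 * 3      # 3x3 convolution kernel (assumed)
image_size = 512 * 512   # 512x512 feature map (assumed)
batch_size = 1024        # assumed batch size
print(filter_size * image_size * batch_size)  # 2415919104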

"
