Shifting from a single GPU to multiple GPUs throws TypeError: '<' not supported between instances of 'list' and 'int'
I switched from using a single GPU to multiple GPUs, and the code now throws an error:
epoch main/loss validation/main/loss elapsed_time
Exception in main training loop: '<' not supported between instances of 'list' and 'int'
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/chainer_p36/lib/python3.6/site-packages/chainer/training/trainer.py", line 318, in run
    entry.extension(self)
  File "/home/ubuntu/anaconda3/envs/chainer_p36/lib/python3.6/site-packages/chainer/training/extensions/evaluator.py", line 157, in __call__
    result = self.evaluate()
  File "/home/ubuntu/anaconda3/envs/chainer_p36/lib/python3.6/site-packages/chainer/training/extensions/evaluator.py", line 206, in evaluate
    in_arrays = self.converter(batch, self.device)
  File "/home/ubuntu/anaconda3/envs/chainer_p36/lib/python3.6/site-packages/chainer/dataset/convert.py", line 150, in concat_examples
    return to_device(device, _concat_arrays(batch, padding))
  File "/home/ubuntu/anaconda3/envs/chainer_p36/lib/python3.6/site-packages/chainer/dataset/convert.py", line 35, in to_device
    elif device < 0:
Will finalize trainer extensions and updater before reraising the exception.
I tried running without a GPU and it worked fine. With a single GPU I got an out-of-memory error, so I moved to a p2.8xlarge instance, and now it throws the error above. Where is the problem, and how can I solve it?
num_gpus = 8
chainer.cuda.get_device_from_id(0).use()

# 3. Updater
if num_gpus > 0:
    # One entry per GPU; GPU 0 is the 'main' device
    updater = training.updater.ParallelUpdater(
        train_iter,
        optimizer,
        devices={('main' if device == 0 else str(device)): device
                 for device in range(num_gpus)},
    )
else:
    updater = training.updater.StandardUpdater(train_iter, optimizer,
                                               device=args.gpus)

# 4. ...and so on

# 5. Training
trainer.run()
Output:
epoch main/loss validation/main/loss elapsed_time
Exception in main training loop: '<' not supported between instances of 'list' and 'int'
I expected the output to be:
epoch main/loss validation/main/loss elapsed_time
1.
2.
3. and so on until it converges.
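
The last frame of the traceback fails on the comparison device < 0. The following is a minimal sketch of that comparison, not Chainer's actual code: to_device_sketch is a hypothetical stand-in whose return values are placeholders. It suggests that the message appears whenever a list of device ids reaches that line instead of a single integer.

```python
def to_device_sketch(device, x):
    """Hypothetical stand-in for the branch shown in the traceback.

    The real function moves arrays between CPU and GPU; this one only
    returns a tuple describing where the data would go.
    """
    if device is None:
        return x
    elif device < 0:  # raises TypeError when device is a list
        return ('cpu', x)
    else:
        return ('gpu', device, x)


# A single integer device id passes through the comparison.
single = to_device_sketch(0, 'batch')

# A list of device ids fails at "device < 0".
try:
    to_device_sketch([0, 1, 2], 'batch')
    message = None
except TypeError as exc:
    message = str(exc)

print(single)
print(message)
```

So the value to check is whatever the script passes as the device when it sets up evaluation.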
It seems to be an error raised by the Evaluator extension when it transfers data to the specified device. How are you specifying the device in Evaluator.__init__? Note that it must be a single device, not a list of devices. This example may be a useful reference: https://github.com/chainer/chainer/blob/master/examples/mnist/train_mnist_data_parallel.py
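
As a rough sketch of that advice, the device mapping for the updater and the single device for the Evaluator can be derived from the same GPU count. The helpers build_devices and evaluator_device are hypothetical names introduced here, and the commented-out wiring assumes that test_iter, model, and trainer already exist in the training script, since the question does not show them.

```python
def build_devices(num_gpus):
    """Map updater device names to GPU ids, with GPU 0 as 'main'.

    Mirrors the dict comprehension used for ParallelUpdater in the question.
    """
    return {('main' if d == 0 else str(d)): d for d in range(num_gpus)}


def evaluator_device(devices):
    """Return the one device id the Evaluator should receive."""
    return devices['main']


num_gpus = 8
devices = build_devices(num_gpus)       # goes to ParallelUpdater(devices=...)
eval_device = evaluator_device(devices)  # a single int, here GPU 0

print(devices)
print(eval_device)

# Assumed wiring in the training script (names are hypothetical):
#   from chainer.training import extensions
#   trainer.extend(extensions.Evaluator(test_iter, model, device=eval_device))
```

Evaluation then runs on the main GPU only, while training is still split across all the devices given to the updater.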