Switching from a single GPU to multiple GPUs throws TypeError: '<' not supported between instances of 'list' and 'int'

Question

I switched from using a single GPU to multiple GPUs, and the code now throws an error:

    epoch       main/loss   validation/main/loss  elapsed_time
    Exception in main training loop: '<' not supported between instances of 'list' and 'int'
    Traceback (most recent call last):
      File "/home/ubuntu/anaconda3/envs/chainer_p36/lib/python3.6/site-packages/chainer/training/trainer.py", line 318, in run
        entry.extension(self)
      File "/home/ubuntu/anaconda3/envs/chainer_p36/lib/python3.6/site-packages/chainer/training/extensions/evaluator.py", line 157, in __call__
        result = self.evaluate()
      File "/home/ubuntu/anaconda3/envs/chainer_p36/lib/python3.6/site-packages/chainer/training/extensions/evaluator.py", line 206, in evaluate
        in_arrays = self.converter(batch, self.device)
      File "/home/ubuntu/anaconda3/envs/chainer_p36/lib/python3.6/site-packages/chainer/dataset/convert.py", line 150, in concat_examples
        return to_device(device, _concat_arrays(batch, padding))
      File "/home/ubuntu/anaconda3/envs/chainer_p36/lib/python3.6/site-packages/chainer/dataset/convert.py", line 35, in to_device
        elif device < 0:

    Will finalize trainer extensions and updater before reraising the exception.

Without a GPU, the code worked fine. With a single GPU, I got an out-of-memory error, so I moved to a p2.8xlarge instance, and now it throws the error above. Where is the problem, and how can I solve it?

Changes made to use 8 GPUs:

     num_gpus = 8
     chainer.cuda.get_device_from_id(0).use()

3. Updater:

    if num_gpus > 0:
        updater = training.updater.ParallelUpdater(
            train_iter,
            optimizer,
            devices={('main' if device == 0 else str(device)): device
                     for device in range(num_gpus)},
        )
    else:
        updater = training.updater.StandardUpdater(
            train_iter, optimizer, device=args.gpus)

4. And so on.

5. Training:

       trainer.run()

Output: the header line "epoch main/loss validation/main/loss elapsed_time", followed by "Exception in main training loop: '<' not supported between instances of 'list' and 'int'".

I expected the output to be:

          epoch       main/loss   validation/main/loss  elapsed_time
           1.         
           2. 
           3. and so on until it converges.

Answer 1

The error seems to come from the Evaluator extension when it transfers data to the specified device. How are you specifying the device in Evaluator.__init__? Note that it should be a single device. This example may serve as a reference: https://github.com/chainer/chainer/blob/master/examples/mnist/train_mnist_data_parallel.py
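To illustrate the point about passing a single device, here is a minimal sketch in plain Python that does not import Chainer. `to_device_check` is a simplified stand-in for the `device < 0` comparison shown in the traceback, and `pick_eval_device` is a hypothetical helper, not part of Chainer. The list of GPU ids is only an assumption about what was passed as `device`, since the question does not show the Evaluator setup.

```python
def to_device_check(device):
    """Simplified stand-in for the comparison that fails in the traceback."""
    if device is None:
        return "unchanged"
    elif device < 0:  # raises TypeError when `device` is a list
        return "cpu"
    return "gpu %d" % device


def pick_eval_device(gpus):
    """Hypothetical helper: reduce a list of GPU ids to a single id."""
    if isinstance(gpus, (list, tuple)):
        return gpus[0] if gpus else -1  # -1 means CPU in this sketch
    return gpus


# Assumption: a list of GPU ids was passed where one id is expected.
gpus = [0, 1, 2, 3, 4, 5, 6, 7]

error_message = None
try:
    to_device_check(gpus)
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# Passing one id avoids the failing comparison.
eval_device = pick_eval_device(gpus)
print(to_device_check(eval_device))

# In Chainer, the single id would then be passed along the lines of
# (test_iter and model are assumed names, not taken from the question):
# trainer.extend(extensions.Evaluator(test_iter, model, device=eval_device))
```

If this assumption matches your setup, the ParallelUpdater can keep its dict of devices while the Evaluator receives only one GPU id, for example the one mapped to 'main'.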

Answer 1 posted 2019-06-17 03:24:56.
