
Shifting from a single GPU to multiple GPUs throws TypeError: '<' not supported between instances of 'list' and 'int'

I shifted from using a single GPU to multiple GPUs, and the code now throws an error:

    epoch       main/loss   validation/main/loss  elapsed_time
    Exception in main training loop: '<' not supported between instances of 'list' and 'int'
    Traceback (most recent call last):
      File "/home/ubuntu/anaconda3/envs/chainer_p36/lib/python3.6/site-packages/chainer/training/trainer.py", line 318, in run
        entry.extension(self)
      File "/home/ubuntu/anaconda3/envs/chainer_p36/lib/python3.6/site-packages/chainer/training/extensions/evaluator.py", line 157, in __call__
        result = self.evaluate()
      File "/home/ubuntu/anaconda3/envs/chainer_p36/lib/python3.6/site-packages/chainer/training/extensions/evaluator.py", line 206, in evaluate
        in_arrays = self.converter(batch, self.device)
      File "/home/ubuntu/anaconda3/envs/chainer_p36/lib/python3.6/site-packages/chainer/dataset/convert.py", line 150, in concat_examples
        return to_device(device, _concat_arrays(batch, padding))
      File "/home/ubuntu/anaconda3/envs/chainer_p36/lib/python3.6/site-packages/chainer/dataset/convert.py", line 35, in to_device
        elif device < 0:

Will finalize trainer extensions and updater before reraising the exception.

Without a GPU the training works fine. With a single GPU I got an out-of-memory error, so I moved to a p2.8xlarge instance, and now it throws the error above. Where is the problem and how do I solve it?
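The error itself can be reproduced without Chainer. In Python 3, an ordering comparison between a list and an int raises exactly this TypeError, which is what happens inside chainer/dataset/convert.py at `elif device < 0:` when a list of GPU ids reaches code that expects a single int device:

```python
# Minimal reproduction of the error (pure Python, no Chainer needed):
# passing a list of device ids where a single int is expected triggers
# the same TypeError as chainer's to_device ("elif device < 0:").
device = [0, 1, 2, 3]  # a list of GPU ids -- the wrong type here

try:
    device < 0         # the same comparison to_device performs
    msg = None
except TypeError as e:
    msg = str(e)

print(msg)  # '<' not supported between instances of 'list' and 'int'
```

So somewhere a list of devices is being handed to a component that expects one device id.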

Changes made to use 8 GPUs:

     num_gpus = 8
     chainer.cuda.get_device_from_id(0).use()

3. Updater:

     if num_gpus > 0:
         updater = training.updater.ParallelUpdater(
             train_iter,
             optimizer,
             devices={('main' if device == 0 else str(device)): device
                      for device in range(num_gpus)},
         )
     else:
         updater = training.updater.StandardUpdater(train_iter, optimizer,
                                                    device=args.gpus)

4. ...and so on.

5. Training:

       trainer.run()

Output: the header line (epoch main/loss validation/main/loss elapsed_time) followed immediately by Exception in main training loop: '<' not supported between instances of 'list' and 'int'

I expected output like:

          epoch       main/loss   validation/main/loss  elapsed_time
          1           ...
          2           ...
          3           ...and so on until it converges.

It seems like an error caused by the Evaluator extension when it transfers data to the specified device. How are you specifying the device in Evaluator.__init__? Note that it should be a single device, not a list. This example may serve as a reference: https://github.com/chainer/chainer/blob/master/examples/mnist/train_mnist_data_parallel.py
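A sketch of the fix, using the same devices dict as in the question (the `trainer`, `test_iter`, and `model` names are assumptions following the linked MNIST example): the ParallelUpdater takes the dict of devices, but the Evaluator should receive only the single main device id.

```python
# The ParallelUpdater takes a dict mapping model names to device ids...
num_gpus = 8
devices = {('main' if d == 0 else str(d)): d for d in range(num_gpus)}

# ...but the Evaluator transfers each validation batch to ONE device,
# so it must receive a single int, not the dict (or a list of ids):
main_device = devices['main']  # 0

# Hypothetical usage, assuming test_iter/model as in the linked example:
# trainer.extend(extensions.Evaluator(test_iter, model, device=main_device))
print(main_device)
```

Passing the whole dict or a list here is what leads `to_device` to evaluate `device < 0` against a non-int.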
