RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED

Question

I am training a model and have placed the dataset inside the data folder. The directory structure looks like this:

--data
-----mars
---------bbox_train
---------bbox_test
---------info

Many developers say this error is caused by a label problem, but I am not sure, because my labels are in the right place. Training fails in the first epoch, at the first convolution of the forward pass.

Full output:
Args:Namespace(arch='resnet50graphpoolparthyper', concat=False, dataset='mars', dropout=0.1, eval_step=100, evaluate=False, gamma=0.1, gpu_devices='0', height=256, htri_only=False, lr=0.0003, margin=0.3, max_epoch=800, nheads=8, nhid=512, num_instances=4, part1=4, part2=8, part3=2, pool='avg', pretrained_model='/home/jiyang/Workspace/Works/video-person-reid/3dconv-person-reid/pretrained_models/resnet-50-kinetics.pth', print_freq=80, save_dir='log_hypergraphsagepart', seed=1, seq_len=8, start_epoch=0, stepsize=200, test_batch=1, train_batch=32, use_cpu=False, warmup=True, weight_decay=0.0005, width=128, workers=4, xent_only=False)
==========
Currently using GPU 0
Initializing dataset mars
=> MARS loaded
Dataset statistics:
  ------------------------------
  subset   | # ids | # tracklets
  ------------------------------
  train    |   625 |     8298
  query    |   626 |     1980
  gallery  |   622 |     9330
  ------------------------------
  total    |  1251 |    19608
  number of images per tracklet: 2 ~ 920, average 59.5
  ------------------------------
Initializing model: resnet50graphpoolparthyper
Model size: 44.17957M
==> Epoch 1/800  lr:1.785e-05
Traceback (most recent call last):
  File "main_video_person_reid_hypergraphsage_part.py", line 357, in <module>
    main()
  File "main_video_person_reid_hypergraphsage_part.py", line 220, in main
    train(model, criterion_xent, criterion_htri, optimizer, trainloader, use_gpu)
  File "main_video_person_reid_hypergraphsage_part.py", line 257, in train
    outputs, features = model(imgs)
  File "/home/khawar/anaconda3/envs/hypergraph_reid/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/khawar/anaconda3/envs/hypergraph_reid/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 165, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/khawar/anaconda3/envs/hypergraph_reid/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/khawar/HDD_Khawar1/hypergraph_reid/models/ResNet_hypergraphsage_part.py", line 621, in forward
    x = self.base(x)
  File "/home/khawar/anaconda3/envs/hypergraph_reid/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/khawar/HDD_Khawar1/hypergraph_reid/models/resnet.py", line 213, in forward
    x = self.conv1(x)
  File "/home/khawar/anaconda3/envs/hypergraph_reid/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/khawar/anaconda3/envs/hypergraph_reid/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 399, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/khawar/anaconda3/envs/hypergraph_reid/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 396, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED

Answer 1

Reinstalling torch 1.8 as the CUDA 11.1 build fixed the initial issue. Use the following command:

pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html
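
After reinstalling, it may help to confirm that the new build is the one actually being used and that cuDNN initializes on its own. The following is a minimal sketch, not part of the original answer: the function name `cudnn_smoke_test` is made up for illustration, and the small `Conv2d` layer only stands in for the model's first convolution. If this standalone convolution on random data fails with the same error, the dataset labels cannot be the cause. If it passes while training still fails, memory pressure from the training batch is one thing worth checking, though that is an assumption the question does not confirm.

```python
def cudnn_smoke_test():
    """Collect PyTorch/CUDA/cuDNN details and try one small convolution on the GPU."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return {"torch": None}

    try:
        cudnn_version = torch.backends.cudnn.version()  # None if cuDNN is unavailable
    except RuntimeError as exc:
        cudnn_version = f"query failed: {exc}"

    report = {
        "torch": torch.__version__,            # e.g. ends in +cu111 for the CUDA 11.1 wheel
        "built_for_cuda": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
        "cudnn_version": cudnn_version,
        "conv_ok": None,                       # stays None when no GPU is visible
    }

    if report["cuda_available"]:
        report["gpu"] = torch.cuda.get_device_name(0)
        try:
            # Random input, no dataset involved: isolates the cuDNN stack.
            conv = torch.nn.Conv2d(3, 8, kernel_size=3).cuda()
            x = torch.randn(1, 3, 32, 32, device="cuda")
            conv(x)
            torch.cuda.synchronize()  # surface asynchronous CUDA errors here
            report["conv_ok"] = True
        except RuntimeError as exc:
            report["conv_ok"] = False
            report["error"] = str(exc)

    return report


if __name__ == "__main__":
    for key, value in cudnn_smoke_test().items():
        print(f"{key}: {value}")
```

With the command above installed correctly, `built_for_cuda` should read 11.1. A `conv_ok` value of False together with the same `CUDNN_STATUS_NOT_INITIALIZED` message would point at the installation or driver rather than at the data.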

solution1
1 ACCEPTED 2021-03-20 06:05:08
