How can I use GPUs on Azure ML with an NVIDIA CUDA custom docker base image?
In the Dockerfile that builds my custom docker base image, I specify the following base image:
FROM nvidia/cuda:10.1-cudnn7-devel-ubuntu16.04
The Dockerfile corresponding to the nvidia-cuda base image is found here: https://gitlab.com/nvidia/container-images/cuda/blob/master/dist/ubuntu16.04/10.1/devel/cudnn7/Dockerfile
Now when I print the AzureML log:
run = Run.get_context()
# setting device on GPU if available, else CPU
run.log("Using device: ", torch.device('cuda' if torch.cuda.is_available() else 'cpu'))
I get
device(type='cpu')
but I would like to have a GPU and not a CPU. What am I doing wrong?
EDIT: I do not know exactly what you need, but I can give you the following information: azureml.core VERSION is 1.0.57. The compute_target is defined via:
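from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException

def compute_target(ws, cluster_name):
    try:
        # Reuse the cluster if it already exists in the workspace.
        cluster = ComputeTarget(workspace=ws, name=cluster_name)
    except ComputeTargetException:
        compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6', min_nodes=0, max_nodes=4)
        cluster = ComputeTarget.create(ws, cluster_name, compute_config)
    return cluster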
The experiment is run via:
ws = workspace(os.path.join("azure_cloud", 'config.json'))
exp = experiment(ws, name=<name>)
c_target = compute_target(ws, <name>)
est = Estimator(source_directory='.',
                script_params=script_params,
                compute_target=c_target,
                entry_script='azure_cloud/azure_training_wrapper.py',
                custom_docker_image=image_name,
                image_registry_details=img_reg_details,
                user_managed=True,
                environment_variables={"SYSTEM": "azure_cloud"})
# run the experiment / train the model
run = exp.submit(config=est)
The yaml file contains:
dependencies:
- conda-package-handling=1.3.10
- python=3.6.2
- cython=0.29.10
- scikit-learn==0.21.2
- anaconda::cloudpickle==1.2.1
- anaconda::cffi==1.12.3
- anaconda::mxnet=1.5.0
- anaconda::psutil==5.6.3
- anaconda::pycosat==0.6.3
- anaconda::pip==19.1.1
- anaconda::six==1.12.0
- anaconda::mkl==2019.4
- anaconda::cudatoolkit==10.1.168
- conda-forge::pycparser==2.19
- conda-forge::openmpi=3.1.2
- pytorch::pytorch==1.2.0
- tensorboard==1.13.1
- tensorflow==1.13.1
- tensorflow-estimator==1.13.0
- pip:
  - pytorch-transformers==1.2.0
  - azure-cli==2.0.72
  - azure-storage-nspkg==3.1.0
  - azureml-sdk==1.0.57
  - pandas==0.24.2
  - tqdm==4.32.1
  - numpy==1.16.4
  - matplotlib==3.1.0
  - requests==2.22.0
  - setuptools==41.0.1
  - ipython==7.8.0
  - boto3==1.9.220
  - botocore==1.12.220
  - cntk==2.7
  - ftfy==5.6
  - gensim==3.8.0
  - horovod==0.16.4
  - keras==2.2.5
  - langdetect==1.0.7
  - langid==1.1.6
  - nltk==3.4.5
  - ptvsd==4.3.2
  - pytest==5.1.2
  - regex==2019.08.19
  - scipy==1.3.1
  - scikit_learn==0.21.3
  - spacy==2.1.8
  - tensorpack==0.9.8
EDIT 2: I tried use_gpu = True as well as upgrading to azureml-sdk=1.0.65, but to no avail. Some people suggest additionally installing cuda-drivers via apt-get install cuda-drivers, but this does not work: I cannot build a docker image with that. The output of nvcc --version on the docker image yields:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
So I think that should be OK. The docker image itself of course has no GPU, so the command nvidia-smi is not found, and running
python -i
and then
import torch
print(torch.cuda.is_available())
will print False.
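The checks described above (whether nvidia-smi is on the PATH, and what torch.cuda.is_available() reports) can be gathered in one place and printed from inside the training script on the compute node. This is only a minimal sketch: gpu_diagnostics is a hypothetical helper name, not part of azureml or torch, and it simply skips the PyTorch checks when torch cannot be imported.

```python
import shutil
import subprocess


def gpu_diagnostics():
    """Collect basic facts about GPU visibility in the current environment.

    gpu_diagnostics is a hypothetical helper name, not an azureml or torch API.
    """
    info = {"nvidia_smi_path": shutil.which("nvidia-smi")}

    # nvidia-smi is only present where the NVIDIA driver is exposed to the container.
    if info["nvidia_smi_path"] is not None:
        try:
            result = subprocess.run(
                [info["nvidia_smi_path"], "-L"],  # -L lists the visible GPUs
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                universal_newlines=True,
                timeout=30,
            )
            info["nvidia_smi_output"] = result.stdout.strip()
        except (OSError, subprocess.SubprocessError) as exc:
            info["nvidia_smi_output"] = "failed: {}".format(exc)

    try:
        import torch
    except ImportError:
        info["torch_installed"] = False
        return info

    info["torch_installed"] = True
    info["torch_version"] = torch.__version__
    # torch.version.cuda is None for CPU-only builds of PyTorch.
    info["torch_built_with_cuda"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()
    info["cuda_device_count"] = torch.cuda.device_count()
    return info


if __name__ == "__main__":
    for key, value in gpu_diagnostics().items():
        print("{}: {}".format(key, value))
```

The torch_built_with_cuda entry helps separate two different situations: a CPU-only PyTorch build reports None there even on a GPU node, whereas a CUDA build without a visible driver reports a CUDA version but cuda_available is False.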
In your Estimator definition, please try adding use_gpu=True:
est = Estimator(source_directory='.',
                script_params=script_params,
                compute_target=c_target,
                entry_script='azure_cloud/azure_training_wrapper.py',
                custom_docker_image=image_name,
                image_registry_details=img_reg_details,
                user_managed=True,
                environment_variables={"SYSTEM": "azure_cloud"},
                use_gpu=True)
I believe that with azureml-sdk>=1.0.60 this is inferred from the VM size used, but since you are using 1.0.57, I think it is still required.
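The version rule of thumb above can be written as a small check. This is only a sketch: the 1.0.60 threshold is the belief stated in this answer and is not verified against the SDK release notes, and parse_version and needs_explicit_use_gpu are hypothetical helper names.

```python
def parse_version(version):
    """Turn a dotted version string such as '1.0.57' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


# Assumption taken from the answer, not verified against release notes:
# from 1.0.60 on, the SDK infers GPU usage from the VM size.
GPU_INFERENCE_MIN_VERSION = (1, 0, 60)


def needs_explicit_use_gpu(sdk_version):
    """Return True if use_gpu=True should be passed explicitly to Estimator."""
    return parse_version(sdk_version) < GPU_INFERENCE_MIN_VERSION


if __name__ == "__main__":
    print(needs_explicit_use_gpu("1.0.57"))
    print(needs_explicit_use_gpu("1.0.65"))
```

The string passed in would typically be the azureml.core VERSION value mentioned in the question. The parser only handles purely numeric dotted versions; a pre-release suffix would need extra handling.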