
Deploying from AzureML into AKS - Set Taints & Tolerations

We are attempting to deploy a model from AzureML into an AKS cluster that has been configured to use taints and tolerations.

When we try to deploy, we receive the following error message:

"details": [
  {
    "code": "Unschedulable",
    "message": "0/15 nodes are available: 12 node(s) had taint {Workload: MachineLearning}, that the pod didn't tolerate, 3 node(s) didn't match Pod's node affinity/selector."
  },
  {
    "code": "DeploymentFailed",
    "message": "Couldn't schedule because the kubernetes cluster didn't have available resources after trying for 00:05:00. You can address this error by either adding more nodes, changing the SKU of your nodes or changing the resource requirements of your service. Please refer to https://aka.ms/debugimage#container-cannot-be-scheduled for more information."
  },
  {
    "code": "DeploymentFailed",
    "message": "Your container endpoint is not available. Please follow the steps to debug:
1. From the AML SDK, you can run print(service.get_logs()) if you have service object to fetch the logs. Please refer to https://aka.ms/debugimage#dockerlog for more information.
2. You can also interactively debug your scoring file locally. Please refer to https://learn.microsoft.com/azure/machine-learning/how-to-debug-visual-studio-code#debug-and-troubleshoot-deployments for more information.
3. For AKS deployment with custom certificate, you need to update your DNS record to point to the IP address of scoring endpoint. Please refer to https://learn.microsoft.com/azure/machine-learning/how-to-secure-web-service#update-your-dns for more information.
4. View the diagnostic events to check status of container, it may help you to debug the issue.
{"InvolvedObject":"am-prod-app-c88d8d49c-vbxsv","InvolvedKind":"Pod","Type":"Warning","Reason":"FailedScheduling","Message":"0/15 nodes are available: 15 pod has unbound immediate PersistentVolumeClaims.","LastTimestamp":null}
{"InvolvedObject":"am-prod-app-c88d8d49c-vbxsv","InvolvedKind":"Pod","Type":"Warning","Reason":"FailedScheduling","Message":"0/15 nodes are available: 15 pod has unbound immediate PersistentVolumeClaims.","LastTimestamp":null}
{"InvolvedObject":"am-prod-app-c88d8d49c-vbxsv","InvolvedKind":"Pod","Type":"Warning","Reason":"FailedScheduling","Message":"0/15 nodes are available: 12 node(s) had taint {Workload: MachineLearning}, that the pod didn't tolerate, 3 node(s) didn't match Pod's node affinity/selector.","LastTimestamp":null}
{"InvolvedObject":"am-prod-app-c88d8d49c-vbxsv","InvolvedKind":"Pod","Type":"Normal","Reason":"NotTriggerScaleUp","Message":"pod didn't trigger scale-up: 5 pod has unbound immediate PersistentVolumeClaims","LastTimestamp":"2022-04-05T14:33:02Z"}
"
  }
]
}

Is there any way to specify the taints and tolerations from the deployment?

Thanks in advance!

...specify the taints and tolerations from the deployment?

Try adding tolerations to the pod template in your deployment spec:

apiVersion: apps/v1
kind: Deployment
metadata:
  ...
spec:
  ...
  template:
    ...
    spec:
      ...
      containers:
      - name: ...
        ...
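      tolerations:
      # Matches the taint {Workload: MachineLearning} reported in the scheduling error.
      - key: Workload
        operator: Equal   # the default when operator is omitted
        value: MachineLearning
        # No effect is set, so this toleration applies to every taint effect for this key/value.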
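
To see why this toleration lets the pod onto the 12 tainted nodes, here is a minimal Python sketch of the matching rule the scheduler applies between a toleration and a taint. It is a simplified illustration, not the scheduler's actual code. The function name `tolerates` and the dict layout are hypothetical, and the `NoSchedule` effect on the taint is an assumption, because the error message does not show the effect.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if the toleration matches the taint (simplified sketch)."""
    # An empty or missing effect on the toleration matches every taint effect.
    effect = toleration.get("effect")
    if effect and effect != taint.get("effect"):
        return False

    key = toleration.get("key")
    operator = toleration.get("operator", "Equal")  # Equal is the default

    if operator == "Exists":
        # Exists ignores the value; an empty key matches every taint key.
        return not key or key == taint["key"]
    if operator == "Equal":
        # Equal requires both key and value to be identical.
        return key == taint["key"] and toleration.get("value", "") == taint.get("value", "")
    return False


# Assumption: the taint's effect is NoSchedule; the error only shows key and value.
taint = {"key": "Workload", "value": "MachineLearning", "effect": "NoSchedule"}

# The toleration from the deployment spec above (no operator, no effect).
print(tolerates({"key": "Workload", "value": "MachineLearning"}, taint))
```

Note that a toleration only allows scheduling onto the tainted nodes. The "3 node(s) didn't match Pod's node affinity/selector" part of the message comes from a separate check on node labels.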
