Amazon SageMaker ResourceLimitExceeded error for XGBoost (free tier)

Question

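I am trying to create an XGBoost model in AWS SageMaker on the free tier. I get the following error:

"ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling the CreateEndpoint operation: The account-level service limit 'ml.m5.xlarge for endpoint usage' is 0 Instances, with current utilization of 0 Instances and a request delta of 1 Instances."

What is the correct train_instance_type to use?

Here is my code: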

# import libraries
import boto3, re, sys, math, json, os, sagemaker, urllib.request
from sagemaker import get_execution_role
import numpy as np                                
import pandas as pd                               
import matplotlib.pyplot as plt                   
from IPython.display import Image                 
from IPython.display import display               
from time import gmtime, strftime                 
from sagemaker.predictor import csv_serializer   

# Define IAM role
role = get_execution_role()
prefix = 'sagemaker/DEMO-xgboost-dm'
containers = {'us-west-2': '433757028032.dkr.ecr.us-west-2.amazonaws.com/xgboost:latest',
              'us-east-1': '811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest',
              'us-east-2': '825641698319.dkr.ecr.us-east-2.amazonaws.com/xgboost:latest',
              'eu-west-1': '685385470294.dkr.ecr.eu-west-1.amazonaws.com/xgboost:latest'} # each region has its XGBoost container
my_region = boto3.session.Session().region_name # set the region of the instance

# Create an instance of the XGBoost model (an estimator), and define the model’s hyperparameters.
# Note: train_instance_type='ml.m5.large' has 0 free credits! Use one of https://aws.amazon.com/sagemaker/pricing/ 
sess = sagemaker.Session()
xgb = sagemaker.estimator.Estimator(
    containers[my_region],
    role,
    train_instance_count=1,
    train_instance_type='ml.m5.xlarge',
    output_path='s3://{}/{}/output'.format('my_s3_bucket', prefix),
    sagemaker_session=sess,
)
xgb.set_hyperparameters(
    max_depth=1, eta=0.2, gamma=4, min_child_weight=6, subsample=0.8,
    silent=0, objective='binary:logistic', num_round=100,
)
# Train the model using gradient optimization on a ml.m4.xlarge instance
# After a few minutes, you should start to see the training logs being generated.
xgb.fit({'train': s3_input_train})

At this step, this is what I see:

2019-10-22 06:32:51 Starting - Starting the training job...
2019-10-22 06:33:00 Starting - Launching requested ML instances......
2019-10-22 06:33:54 Starting - Preparing the instances for training...
2019-10-22 06:34:41 Downloading - Downloading input data...
2019-10-22 06:35:22 Training - Training image download completed. Training in progress..Arguments: train
[2019-10-22:06:35:22:INFO] Running standalone xgboost training.
[2019-10-22:06:35:22:INFO] Path /opt/ml/input/data/validation does not exist!
[2019-10-22:06:35:22:INFO] File size need to be processed in the node: 3.38mb. Available memory size in the node: 8089.9mb
[2019-10-22:06:35:22:INFO] Determined delimiter of CSV input is ','
[06:35:22] S3DistributionType set as FullyReplicated
[06:35:22] 28831x59 matrix with 1701029 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[06:35:22] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 2 extra nodes, 0 pruned nodes, max_depth=1
[0]#011train-error:0.102182
[06:35:22] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 2 extra nodes, 0 pruned nodes, max_depth=1
[1]#011train-error:0.102182
[06:35:22] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 2 extra nodes, 0 pruned nodes, max_depth=1
[2]#011train-error:0.102182
[06:35:22] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 2 extra nodes, 0 pruned nodes, max_depth=1
[3]#011train-error:0.102182
[06:35:22] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 2 extra nodes, 0 pruned nodes, max_depth=1
[4]#011train-error:0.102182
[06:35:22] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 2 extra nodes, 0 pruned nodes, max_depth=1
[5]#011train-error:0.102182
[06:35:22] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 2 extra nodes, 0 pruned nodes, max_depth=1
[6]#011train-error:0.102182
[06:35:22] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 2 extra nodes, 0 pruned nodes, max_depth=1
[7]#011train-error:0.10839
[06:35:22] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 2 extra nodes, 0 pruned nodes, max_depth=1
[8]#011train-error:0.102737
[06:35:22] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 2 extra nodes, 0 pruned nodes, max_depth=1
[9]#011train-error:0.107697

Then, when I deploy it:

# Deploy the model on a server and create an endpoint that you can access
xgb_predictor = xgb.deploy(initial_instance_count=1,instance_type='ml.m5.xlarge')
---------------------------------------------------------------------------
ResourceLimitExceeded                     Traceback (most recent call last)
<ipython-input-38-6d149f3edc98> in <module>()
      1 # Deploy the model on a server and create an endpoint that you can access
----> 2 xgb_predictor = xgb.deploy(initial_instance_count=1,instance_type='ml.m5.xlarge')

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in deploy(self, initial_instance_count, instance_type, accelerator_type, endpoint_name, use_compiled_model, update_endpoint, wait, model_name, kms_key, **kwargs)
    559             tags=self.tags,
    560             wait=wait,
--> 561             kms_key=kms_key,
    562         )
    563 

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/model.py in deploy(self, initial_instance_count, instance_type, accelerator_type, endpoint_name, update_endpoint, tags, kms_key, wait)
    464         else:
    465             self.sagemaker_session.endpoint_from_production_variants(
--> 466                 self.endpoint_name, [production_variant], tags, kms_key, wait
    467             )
    468 

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in endpoint_from_production_variants(self, name, production_variants, tags, kms_key, wait)
   1361 
   1362             self.sagemaker_client.create_endpoint_config(**config_options)
-> 1363         return self.create_endpoint(endpoint_name=name, config_name=name, tags=tags, wait=wait)
   1364 
   1365     def expand_role(self, role):

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in create_endpoint(self, endpoint_name, config_name, tags, wait)
    975 
    976         self.sagemaker_client.create_endpoint(
--> 977             EndpointName=endpoint_name, EndpointConfigName=config_name, Tags=tags
    978         )
    979         if wait:

~/anaconda3/envs/python3/lib/python3.6/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
    355                     "%s() only accepts keyword arguments." % py_operation_name)
    356             # The "self" in this scope is referring to the BaseClient.
--> 357             return self._make_api_call(operation_name, kwargs)
    358 
    359         _api_call.__name__ = str(py_operation_name)

~/anaconda3/envs/python3/lib/python3.6/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
    659             error_code = parsed_response.get("Error", {}).get("Code")
    660             error_class = self.exceptions.from_code(error_code)
--> 661             raise error_class(parsed_response, operation_name)
    662         else:
    663             return parsed_response

ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling the CreateEndpoint operation: The account-level service limit 'ml.m5.xlarge for endpoint usage' is 0 Instances, with current utilization of 0 Instances and a request delta of 1 Instances. Please contact AWS support to request an increase for this limit.

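Edit: tried an ml.m4.xlarge instance:

When I use ml.m4.xlarge, I get the same message: "ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling the CreateEndpoint operation: The account-level service limit 'ml.m4.xlarge for endpoint usage' is 0 Instances, with current utilization of 0 Instances and a request delta of 1 Instances. Please contact AWS support to request an increase for this limit."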

Answer 1

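According to this AWS page, during your first two months you get 50 hours of m4.xlarge per month for training and 125 hours of m4.xlarge per month for hosting. So if you are within your first two months, ml.m4.xlarge should be covered by the free tier.

As for the service limit itself, according to this post, newly created accounts have every SageMaker instance type (except t2.medium) limited to 0 instead of the default limit.

So you do need to contact AWS Support after all and ask for your limit to be raised. Also, if you are not the account administrator yourself, this may be restricted by your account administrator, in which case the administrator should be your first port of call.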
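To make the numbers in the error message easier to read, here is a minimal sketch that pulls them out of the message text quoted in the question. The helper name `parse_quota_error` and the `over_limit` rule (utilization plus requested delta exceeding the limit) are my own reading of the message, not something taken from the AWS documentation.

```python
import re

# Pattern for the quota message quoted in the question's traceback.
QUOTA_MESSAGE = re.compile(
    r"service limit '(?P<quota>[^']+)' is (?P<limit>\d+) Instances, "
    r"with current utilization of (?P<used>\d+) Instances "
    r"and a request delta of (?P<delta>\d+) Instances"
)


def parse_quota_error(message):
    """Return the quota name and counts from a ResourceLimitExceeded message, or None."""
    match = QUOTA_MESSAGE.search(message)
    if match is None:
        return None
    info = {"quota": match.group("quota")}
    for key in ("limit", "used", "delta"):
        info[key] = int(match.group(key))
    # Assumption: the request is refused when it would push usage past the limit.
    info["over_limit"] = info["used"] + info["delta"] > info["limit"]
    return info


message = (
    "An error occurred (ResourceLimitExceeded) when calling the CreateEndpoint "
    "operation: The account-level service limit 'ml.m5.xlarge for endpoint usage' "
    "is 0 Instances, with current utilization of 0 Instances and a request delta "
    "of 1 Instances. Please contact AWS support to request an increase for this limit."
)
print(parse_quota_error(message))
```

With a limit of 0, any request for even one endpoint instance goes over the limit, which is why switching between m4.xlarge and m5.xlarge does not help until the limit itself is raised.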

Answer 2

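Steps to request an increase of the ml.m5.xlarge limit

Go to the AWS console at https://console.aws.amazon.com/
Click Support in the top-right corner
Click Create case (the orange button)
Select the Service limit increase radio button
For Limit type, search for and select SageMaker Notebook Instances
Select the same region that is shown in the top-right corner of the AWS console
Write a short use case description
For Limit, select ml.[x].[x] (in your case, ml.m5.xlarge)
Set the new limit value to 1

This manual support ticket can take up to 48 hours to be processed. (In my case, the support team replied after one day and the instance limit was changed to 1.)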

Answer 3

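Based on the output you provided, the model trained successfully. What failed is the deploy step, which hosts the model as a queryable endpoint. Limits for inference are separate from limits for training. According to this page, SageMaker currently offers

125 hours of m4.xlarge or m5.xlarge instances per month for inference

if you are within the first 2 months after activation.

You can check your current SageMaker limits using the Service Quotas console or API. If you currently have no instances of the above type allocated for Endpoint resources, you can also request a limit increase through Service Quotas. More details are here.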
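As a rough sketch of the API route, the snippet below lists SageMaker quotas through the boto3 `service-quotas` client and filters them by name. The helper names `find_quotas` and `list_sagemaker_quotas` are hypothetical, the sample records and their quota codes are made up for illustration, and the live call assumes configured AWS credentials with permission to read Service Quotas.

```python
def find_quotas(quotas, text):
    """Return (name, value) pairs for quotas whose name contains `text`."""
    return [(q["QuotaName"], q["Value"]) for q in quotas if text in q["QuotaName"]]


def list_sagemaker_quotas(region_name=None):
    """Fetch all SageMaker quotas via the Service Quotas API (needs AWS credentials)."""
    import boto3  # imported lazily so the filter above works without boto3 installed

    client = boto3.client("service-quotas", region_name=region_name)
    paginator = client.get_paginator("list_service_quotas")
    quotas = []
    for page in paginator.paginate(ServiceCode="sagemaker"):
        quotas.extend(page["Quotas"])
    return quotas


# Illustrative records shaped like the API response; names mirror the error
# message in the question, while the codes and values are invented.
sample = [
    {"QuotaName": "ml.m5.xlarge for endpoint usage", "QuotaCode": "L-EXAMPLE1", "Value": 0.0},
    {"QuotaName": "ml.m5.xlarge for training job usage", "QuotaCode": "L-EXAMPLE2", "Value": 1.0},
]
print(find_quotas(sample, "for endpoint usage"))
```

Against a real account, `find_quotas(list_sagemaker_quotas(), "ml.m5.xlarge for endpoint usage")` should show the endpoint limit that the error refers to. If it is 0, the same client's `request_service_quota_increase(ServiceCode=..., QuotaCode=..., DesiredValue=...)` call can be used to file the increase, using the `QuotaCode` returned for that quota.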
