
Amazon Sagemaker ResourceLimitExceeded Error for XGBoost (Free Tier)

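I am trying to create an XGBoost model in AWS SageMaker on the free tier. I get the following error:

"ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling the CreateEndpoint operation: The account-level service limit 'ml.m5.xlarge for endpoint usage' is 0 Instances, with current utilization of 0 Instances and a request delta of 1 Instances."

What is the correct train_instance_type to use?

Here is my code: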

# import libraries
import boto3, re, sys, math, json, os, sagemaker, urllib.request
from sagemaker import get_execution_role
import numpy as np                                
import pandas as pd                               
import matplotlib.pyplot as plt                   
from IPython.display import Image                 
from IPython.display import display               
from time import gmtime, strftime                 
from sagemaker.predictor import csv_serializer   

# Define IAM role
role = get_execution_role()
prefix = 'sagemaker/DEMO-xgboost-dm'
containers = {'us-west-2': '433757028032.dkr.ecr.us-west-2.amazonaws.com/xgboost:latest',
              'us-east-1': '811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest',
              'us-east-2': '825641698319.dkr.ecr.us-east-2.amazonaws.com/xgboost:latest',
              'eu-west-1': '685385470294.dkr.ecr.eu-west-1.amazonaws.com/xgboost:latest'} # each region has its XGBoost container
my_region = boto3.session.Session().region_name # set the region of the instance

# Create an instance of the XGBoost model (an estimator), and define the model's hyperparameters.
# Note: train_instance_type='ml.m5.large' has 0 free credits! Use one of https://aws.amazon.com/sagemaker/pricing/
sess = sagemaker.Session()
xgb = sagemaker.estimator.Estimator(
    containers[my_region],
    role,
    train_instance_count=1,
    train_instance_type='ml.m5.xlarge',
    output_path='s3://{}/{}/output'.format('my_s3_bucket', prefix),
    sagemaker_session=sess,
)
xgb.set_hyperparameters(max_depth=1, eta=0.2, gamma=4, min_child_weight=6,
                        subsample=0.8, silent=0, objective='binary:logistic',
                        num_round=100)
# Train the model using gradient optimization on the ml.m5.xlarge instance configured above.
# After a few minutes, you should start to see the training logs being generated.
# Note: s3_input_train is not defined in this snippet; it is the S3 training input created earlier.
xgb.fit({'train': s3_input_train})

At this step, this is what I see:

2019-10-22 06:32:51 Starting - Starting the training job...
2019-10-22 06:33:00 Starting - Launching requested ML instances......
2019-10-22 06:33:54 Starting - Preparing the instances for training...
2019-10-22 06:34:41 Downloading - Downloading input data...
2019-10-22 06:35:22 Training - Training image download completed. Training in progress..Arguments: train
[2019-10-22:06:35:22:INFO] Running standalone xgboost training.
[2019-10-22:06:35:22:INFO] Path /opt/ml/input/data/validation does not exist!
[2019-10-22:06:35:22:INFO] File size need to be processed in the node: 3.38mb. Available memory size in the node: 8089.9mb
[2019-10-22:06:35:22:INFO] Determined delimiter of CSV input is ','
[06:35:22] S3DistributionType set as FullyReplicated
[06:35:22] 28831x59 matrix with 1701029 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[06:35:22] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 2 extra nodes, 0 pruned nodes, max_depth=1
[0]#011train-error:0.102182
[06:35:22] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 2 extra nodes, 0 pruned nodes, max_depth=1
[1]#011train-error:0.102182
[06:35:22] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 2 extra nodes, 0 pruned nodes, max_depth=1
[2]#011train-error:0.102182
[06:35:22] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 2 extra nodes, 0 pruned nodes, max_depth=1
[3]#011train-error:0.102182
[06:35:22] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 2 extra nodes, 0 pruned nodes, max_depth=1
[4]#011train-error:0.102182
[06:35:22] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 2 extra nodes, 0 pruned nodes, max_depth=1
[5]#011train-error:0.102182
[06:35:22] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 2 extra nodes, 0 pruned nodes, max_depth=1
[6]#011train-error:0.102182
[06:35:22] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 2 extra nodes, 0 pruned nodes, max_depth=1
[7]#011train-error:0.10839
[06:35:22] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 2 extra nodes, 0 pruned nodes, max_depth=1
[8]#011train-error:0.102737
[06:35:22] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 2 extra nodes, 0 pruned nodes, max_depth=1
[9]#011train-error:0.107697

Then, when I deploy it:

# Deploy the model on a server and create an endpoint that you can access
xgb_predictor = xgb.deploy(initial_instance_count=1,instance_type='ml.m5.xlarge')
---------------------------------------------------------------------------
ResourceLimitExceeded                     Traceback (most recent call last)
<ipython-input-38-6d149f3edc98> in <module>()
      1 # Deploy the model on a server and create an endpoint that you can access
----> 2 xgb_predictor = xgb.deploy(initial_instance_count=1,instance_type='ml.m5.xlarge')

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in deploy(self, initial_instance_count, instance_type, accelerator_type, endpoint_name, use_compiled_model, update_endpoint, wait, model_name, kms_key, **kwargs)
    559             tags=self.tags,
    560             wait=wait,
--> 561             kms_key=kms_key,
    562         )
    563 

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/model.py in deploy(self, initial_instance_count, instance_type, accelerator_type, endpoint_name, update_endpoint, tags, kms_key, wait)
    464         else:
    465             self.sagemaker_session.endpoint_from_production_variants(
--> 466                 self.endpoint_name, [production_variant], tags, kms_key, wait
    467             )
    468 

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in endpoint_from_production_variants(self, name, production_variants, tags, kms_key, wait)
   1361 
   1362             self.sagemaker_client.create_endpoint_config(**config_options)
-> 1363         return self.create_endpoint(endpoint_name=name, config_name=name, tags=tags, wait=wait)
   1364 
   1365     def expand_role(self, role):

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in create_endpoint(self, endpoint_name, config_name, tags, wait)
    975 
    976         self.sagemaker_client.create_endpoint(
--> 977             EndpointName=endpoint_name, EndpointConfigName=config_name, Tags=tags
    978         )
    979         if wait:

~/anaconda3/envs/python3/lib/python3.6/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
    355                     "%s() only accepts keyword arguments." % py_operation_name)
    356             # The "self" in this scope is referring to the BaseClient.
--> 357             return self._make_api_call(operation_name, kwargs)
    358 
    359         _api_call.__name__ = str(py_operation_name)

~/anaconda3/envs/python3/lib/python3.6/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
    659             error_code = parsed_response.get("Error", {}).get("Code")
    660             error_class = self.exceptions.from_code(error_code)
--> 661             raise error_class(parsed_response, operation_name)
    662         else:
    663             return parsed_response

ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling the CreateEndpoint operation: The account-level service limit 'ml.m5.xlarge for endpoint usage' is 0 Instances, with current utilization of 0 Instances and a request delta of 1 Instances. Please contact AWS support to request an increase for this limit.

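Edit: tried an ml.m4.xlarge instance:

When I use ml.m4.xlarge, I get the same message: "ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling the CreateEndpoint operation: The account-level service limit 'ml.m4.xlarge for endpoint usage' is 0 Instances, with current utilization of 0 Instances and a request delta of 1 Instances. Please contact AWS support to request an increase for this limit."

According to this AWS page, you get 50 hours of m4.xlarge per month for training and 125 hours of m4.xlarge per month for hosting, both for the first two months. So if you are within your first two months, ml.m4.xlarge should do the trick.

As for the service limit itself, according to this post, newly created accounts have every SageMaker instance type (except t2.medium) limited to 0 rather than the default limit.

So you will need to contact AWS Support after all and ask them to raise your limit. Also, if you are not the administrator yourself, the limit may have been set by your account administrator, in which case the administrator should be your first port of call.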
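While a limit increase is pending, one possible workaround is to try the deploy on several instance types in turn and keep the first one the account actually has endpoint quota for. The sketch below is only an illustration under assumptions: `deploy_with_fallback` is a hypothetical helper (not part of the SageMaker SDK), and it assumes the raised exception carries a botocore-style `response` dict with the `ResourceLimitExceeded` error code, as the traceback in the question suggests. Whether ml.t2.medium has a non-zero endpoint quota on a given account is also an assumption.

```python
def deploy_with_fallback(deploy_fn, instance_types):
    """Call deploy_fn(instance_type) for each candidate until one succeeds.

    deploy_fn is any callable that deploys on the given instance type.
    Returns (deploy result, instance type that worked).
    """
    for instance_type in instance_types:
        try:
            return deploy_fn(instance_type), instance_type
        except Exception as err:
            # botocore's ClientError exposes the parsed error as err.response
            response = getattr(err, "response", None)
            code = response.get("Error", {}).get("Code") if isinstance(response, dict) else None
            if code != "ResourceLimitExceeded":
                raise  # unrelated failure: do not hide it
            print("{}: no endpoint quota, trying the next type".format(instance_type))
    raise RuntimeError("No candidate instance type has endpoint quota; "
                       "request a limit increase")


# Hypothetical usage with the estimator from the question (not executed here):
# predictor, used_type = deploy_with_fallback(
#     lambda t: xgb.deploy(initial_instance_count=1, instance_type=t,
#                          endpoint_name='xgb-demo-' + t.replace('.', '-')),
#     ['ml.m5.xlarge', 'ml.m4.xlarge', 'ml.t2.medium'],
# )
```

The usage comment passes a distinct endpoint_name per attempt because, judging by the traceback, the SDK calls create_endpoint_config before create_endpoint, so a failed attempt may leave an endpoint configuration with that name behind.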

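Steps to request an increase of the ml.m5.xlarge limit

  1. Go to the AWS console: https://console.aws.amazon.com/
  2. Click Support in the top-right corner
  3. Click Create case (orange button)
  4. Select the Service limit increase radio button
  5. For Limit type, search for and select SageMaker Notebook Instances
  6. Select the same Region that is shown in the top-right corner of the AWS console
  7. Write a short use case description
  8. For Limit, select ml.[x].[x] (in your case, ml.m5.xlarge)
  9. New limit value: 1

This manual support ticket may take up to 48 hours to turn around. (In my case, I got a reply from the support team after one day, and the instance limit was changed to 1.)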

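According to the output you provided, the model trained successfully. What failed is the deploy step, which hosts the model as a queryable endpoint. Limits for inference are separate from limits for training. According to this page, SageMaker currently offers

125 hours of m4.xlarge or m5.xlarge instances per month for inference

if you are within the first 2 months of activation.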
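The error text itself shows which quota was hit: the name 'ml.m5.xlarge for endpoint usage' refers to hosting, not training. As a minimal sketch, the message can be picked apart with a regular expression. The pattern is written against the message quoted in the question and is an assumption about its format, not a documented contract; `parse_quota_error` is a hypothetical helper name.

```python
import re

# Pattern modeled on the error message quoted in the question (assumed format)
QUOTA_ERROR = re.compile(
    r"service limit '(?P<instance_type>ml\.[\w.]+) for (?P<usage>[\w ]+?) usage' "
    r"is (?P<limit>\d+) Instances, with current utilization of (?P<used>\d+) Instances "
    r"and a request delta of (?P<delta>\d+) Instances"
)


def parse_quota_error(message):
    """Return the quota details found in a ResourceLimitExceeded message, or None."""
    match = QUOTA_ERROR.search(message)
    if match is None:
        return None
    info = match.groupdict()
    for key in ("limit", "used", "delta"):
        info[key] = int(info[key])
    return info


message = (
    "An error occurred (ResourceLimitExceeded) when calling the CreateEndpoint "
    "operation: The account-level service limit 'ml.m5.xlarge for endpoint usage' "
    "is 0 Instances, with current utilization of 0 Instances and a request delta "
    "of 1 Instances. Please contact AWS support to request an increase for this limit."
)
info = parse_quota_error(message)
print(info["instance_type"], info["usage"], info["limit"])
```

Here the usage field comes back as "endpoint" with a limit of 0, which is why changing train_instance_type alone does not help.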

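You can check your current SageMaker limits using the Service Quotas console or API. If you currently have none of the above instance types allocated for Endpoint resources, you can also request a limit increase through Service Quotas. More details here.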
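A rough sketch of the API route with boto3 follows. It assumes the quota names returned by Service Quotas use the same wording as the error message ("ml.m5.xlarge for endpoint usage"), that the caller has permission to use the Service Quotas API, and that the helper names (`endpoint_quotas`, `fetch_sagemaker_quotas`, `request_increase`) are hypothetical. The two functions that call AWS are defined but not invoked here.

```python
def endpoint_quotas(quotas, instance_type=None):
    """Keep only endpoint-usage quotas, optionally for one instance type.

    quotas is a list of dicts as returned under 'Quotas' by list_service_quotas.
    Returns {quota name: (quota code, current value)}.
    """
    selected = {}
    for quota in quotas:
        name = quota["QuotaName"]
        if not name.endswith("for endpoint usage"):
            continue
        if instance_type and not name.startswith(instance_type + " "):
            continue
        selected[name] = (quota["QuotaCode"], quota["Value"])
    return selected


def fetch_sagemaker_quotas(region_name=None):
    """Download all SageMaker quotas for the Region (requires AWS credentials)."""
    import boto3  # imported lazily so the filter above works without boto3

    client = boto3.client("service-quotas", region_name=region_name)
    paginator = client.get_paginator("list_service_quotas")
    quotas = []
    for page in paginator.paginate(ServiceCode="sagemaker"):
        quotas.extend(page["Quotas"])
    return quotas


def request_increase(quota_code, desired_value, region_name=None):
    """Ask Service Quotas to raise one SageMaker quota (requires AWS credentials)."""
    import boto3

    client = boto3.client("service-quotas", region_name=region_name)
    return client.request_service_quota_increase(
        ServiceCode="sagemaker",
        QuotaCode=quota_code,
        DesiredValue=float(desired_value),
    )


# Hypothetical usage (not executed here):
# found = endpoint_quotas(fetch_sagemaker_quotas("us-east-1"), "ml.m5.xlarge")
# for name, (code, value) in found.items():
#     print(name, value)
#     if value < 1:
#         request_increase(code, 1, "us-east-1")
```

Quotas are per Region, so the Region passed in should match the one the notebook runs in.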
