
How do I deploy a pre-trained sklearn model on AWS SageMaker? (Endpoint stuck on "Creating")

To start with, I understand that this question has been asked multiple times, but I haven't found a solution to my problem.

So, to begin, I used joblib.dump to save a locally trained sklearn RandomForest. I then uploaded this to S3, made a folder called code, and put an inference script there called inference.py.

import joblib
import json
import numpy
import scipy
import sklearn
import os

"""
Deserialize fitted model
"""
def model_fn(model_dir):
    model_path = os.path.join(model_dir, 'test_custom_model')
    model = joblib.load(model_path)
    return model

"""
input_fn
    request_body: The body of the request sent to the model.
    request_content_type: (string) specifies the format/variable type of the request
"""
def input_fn(request_body, request_content_type):
    if request_content_type == 'application/json':
        request_body = json.loads(request_body)
        inpVar = request_body['Input']
        return inpVar
    else:
        raise ValueError("This model only supports application/json input")

"""
predict_fn
    input_data: returned array from input_fn above
    model (sklearn model) returned model loaded from model_fn above
"""
def predict_fn(input_data, model):
    return model.predict(input_data)

"""
output_fn
    prediction: the returned value from predict_fn above
    content_type: the content type the endpoint expects to be returned. Ex: JSON, string
"""

def output_fn(prediction, content_type):
    res = int(prediction[0])
    respJSON = {'Output': res}
    return respJSON
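Before packaging, the handler chain above can be smoke-tested locally. The sketch below trains a throwaway RandomForest on toy data (a stand-in for the real model, not the one from the question) and walks the same model_fn → input_fn → predict_fn → output_fn path inline:

```python
import json
import os

import joblib
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the locally trained RandomForest (hypothetical data)
X, y = [[0], [0], [0], [1], [1], [1]], [0, 0, 0, 1, 1, 1]
joblib.dump(RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y),
            "test_custom_model")

# Walk the same chain the serving container will use:
model = joblib.load(os.path.join(".", "test_custom_model"))   # model_fn
request_body = json.dumps({"Input": [[1]]})
inp = json.loads(request_body)["Input"]                       # input_fn
prediction = model.predict(inp)                               # predict_fn
out = {"Output": int(prediction[0])}                          # output_fn
print(out)
```

If this chain fails locally, it will also fail inside the container, so it is a cheap first check before any tarring and uploading.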

Very simple so far.

I also put this into the local Jupyter SageMaker session:

all_files (folder)
    code (folder)
        inference.py (Python file)
    test_custom_model (joblib dump of the model)

The script turns this folder all_files into a tar.gz file.
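As an aside, the same packaging step can be done with Python's standard-library tarfile module instead of shelling out to tar. This sketch recreates the layout described above with empty placeholder files, just to show the archive structure that results:

```python
import os
import tarfile

# Recreate the all_files/ layout with empty placeholder files (sketch only)
os.makedirs("all_files/code", exist_ok=True)
open("all_files/test_custom_model", "wb").close()
open("all_files/code/inference.py", "w").close()

# Equivalent of: tar -cvpzf test_custom_model.tar.gz all_files
with tarfile.open("test_custom_model.tar.gz", "w:gz") as tar:
    tar.add("all_files")

# Inspect what actually went into the archive
with tarfile.open("test_custom_model.tar.gz") as tar:
    names = sorted(tar.getnames())
print(names)
```

Listing the archive contents this way is also a quick check of exactly what SageMaker will see when it unpacks the model artifact.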

Then comes the main script that I ran on SageMaker:

import boto3
import json
import os
import joblib
import pickle
import tarfile
import sagemaker
import time
from time import gmtime, strftime
import subprocess
from sagemaker import get_execution_role

#Setup
client = boto3.client(service_name="sagemaker")
runtime = boto3.client(service_name="sagemaker-runtime")
boto_session = boto3.session.Session()
s3 = boto_session.resource('s3')
region = boto_session.region_name
print(region)
sagemaker_session = sagemaker.Session()
role = get_execution_role()

#Bucket for model artifacts
default_bucket = 'pretrained-model-deploy'
model_artifacts = f"s3://{default_bucket}/test_custom_model.tar.gz"

#Build tar file with model data + inference code
bashCommand = "tar -cvpzf test_custom_model.tar.gz all_files"
process = subprocess.Popen(bashCommand.split(), stdout=subprocess.PIPE)
output, error = process.communicate()

#Upload tar.gz to bucket
response = s3.meta.client.upload_file('test_custom_model.tar.gz', default_bucket, 'test_custom_model.tar.gz')

# retrieve sklearn image
image_uri = sagemaker.image_uris.retrieve(
    framework="sklearn",
    region=region,
    version="0.23-1",
    py_version="py3",
    instance_type="ml.m5.xlarge",
)

#Step 1: Model Creation
model_name = "sklearn-test" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print("Model name: " + model_name)
create_model_response = client.create_model(
    ModelName=model_name,
    Containers=[
        {
            "Image": image_uri,
            "ModelDataUrl": model_artifacts,
        }
    ],
    ExecutionRoleArn=role,
)
print("Model Arn: " + create_model_response["ModelArn"])

#Step 2: EPC Creation - Serverless
sklearn_epc_name = "sklearn-epc" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
response = client.create_endpoint_config(
   EndpointConfigName=sklearn_epc_name,
   ProductionVariants=[
        {
            "ModelName": model_name,
            "VariantName": "sklearnvariant",
            "ServerlessConfig": {
                "MemorySizeInMB": 2048,
                "MaxConcurrency": 20
            }
        } 
    ]
)

# #Step 2: EPC Creation - Synchronous
# sklearn_epc_name = "sklearn-epc" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
# endpoint_config_response = client.create_endpoint_config(
#     EndpointConfigName=sklearn_epc_name,
#     ProductionVariants=[
#         {
#             "VariantName": "sklearnvariant",
#             "ModelName": model_name,
#             "InstanceType": "ml.m5.xlarge",
#             "InitialInstanceCount": 1
#         },
#     ],
# )
# print("Endpoint Configuration Arn: " + endpoint_config_response["EndpointConfigArn"])

#Step 3: EP Creation
endpoint_name = "sklearn-local-ep" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
create_endpoint_response = client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=sklearn_epc_name,
)
print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])


#Monitor creation
describe_endpoint_response = client.describe_endpoint(EndpointName=endpoint_name)
while describe_endpoint_response["EndpointStatus"] == "Creating":
    describe_endpoint_response = client.describe_endpoint(EndpointName=endpoint_name)
    print(describe_endpoint_response)
    time.sleep(15)
print(describe_endpoint_response)

Now, I mainly just want the serverless deployment, but that fails after a while with this error message:

{'EndpointName': 'sklearn-local-ep2022-04-29-12-16-10', 'EndpointArn': 'arn:aws:sagemaker:us-east-1:963400650255:endpoint/sklearn-local-ep2022-04-29-12-16-10', 'EndpointConfigName': 'sklearn-epc2022-04-29-12-16-03', 'EndpointStatus': 'Creating', 'CreationTime': datetime.datetime(2022, 4, 29, 12, 16, 10, 290000, tzinfo=tzlocal()), 'LastModifiedTime': datetime.datetime(2022, 4, 29, 12, 16, 11, 52000, tzinfo=tzlocal()), 'ResponseMetadata': {'RequestId': '1d25120e-ddb1-474d-9c5f-025c6be24383', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '1d25120e-ddb1-474d-9c5f-025c6be24383', 'content-type': 'application/x-amz-json-1.1', 'content-length': '305', 'date': 'Fri, 29 Apr 2022 12:21:59 GMT'}, 'RetryAttempts': 0}}
{'EndpointName': 'sklearn-local-ep2022-04-29-12-16-10', 'EndpointArn': 'arn:aws:sagemaker:us-east-1:963400650255:endpoint/sklearn-local-ep2022-04-29-12-16-10', 'EndpointConfigName': 'sklearn-epc2022-04-29-12-16-03', 'EndpointStatus': 'Failed', 'FailureReason': 'Unable to successfully stand up your model within the allotted 180 second timeout. Please ensure that downloading your model artifacts, starting your model container and passing the ping health checks can be completed within 180 seconds.', 'CreationTime': datetime.datetime(2022, 4, 29, 12, 16, 10, 290000, tzinfo=tzlocal()), 'LastModifiedTime': datetime.datetime(2022, 4, 29, 12, 22, 2, 68000, tzinfo=tzlocal()), 'ResponseMetadata': {'RequestId': '59fb8ddd-9d45-41f5-9383-236a2baffb73', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '59fb8ddd-9d45-41f5-9383-236a2baffb73', 'content-type': 'application/x-amz-json-1.1', 'content-length': '559', 'date': 'Fri, 29 Apr 2022 12:22:15 GMT'}, 'RetryAttempts': 0}}

The real-time deployment is just permanently stuck on Creating.

CloudWatch has the following error: Error handling request /ping

AttributeError: 'NoneType' object has no attribute 'startswith'

with traceback:

Traceback (most recent call last):
  File "/miniconda3/lib/python3.7/site-packages/gunicorn/workers/base_async.py", line 55, in handle
    self.handle_request(listener_name, req, client, addr)

Copy-paste has stopped working, so I have attached an image of it instead.

[image: traceback screenshot]

This is the error message I get:

Endpoint Arn: arn:aws:sagemaker:us-east-1:963400650255:endpoint/sklearn-local-ep2022-04-29-13-18-09 {'EndpointName': 'sklearn-local-ep2022-04-29-13-18-09', 'EndpointArn': 'arn:aws:sagemaker:us-east-1:963400650255:endpoint/sklearn-local-ep2022-04-29-13-18-09', 'EndpointConfigName': 'sklearn-epc2022-04-29-13-18-07', 'EndpointStatus': 'Creating', 'CreationTime': datetime.datetime(2022, 4, 29, 13, 18, 9, 548000, tzinfo=tzlocal()), 'LastModifiedTime': datetime.datetime(2022, 4, 29, 13, 18, 13, 119000, tzinfo=tzlocal()), 'ResponseMetadata': {'RequestId': 'ef0e49ee-618e-45de-9c49-d796206404a4', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'ef0e49ee-618e-45de-9c49-d796206404a4', 'content-type': 'application/x-amz-json-1.1', 'content-length': '306', 'date': 'Fri, 29 Apr 2022 13:18:24 GMT'}, 'RetryAttempts': 0}}

These are the permissions I have associated with that role:

AmazonSageMaker-ExecutionPolicy
SecretsManagerReadWrite
AmazonS3FullAccess
AmazonSageMakerFullAccess
EC2InstanceProfileForImageBuilderECRContainerBuilds
AWSAppRunnerServicePolicyForECRAccess

What am I doing wrong? I've tried different folder structures for the zip file and different accounts, all to no avail. I don't really want to use the model.deploy() method, as I don't know how to use serverless with it, and it's also inconsistent between different model types (I'm trying to make a flexible deployment pipeline where different (xgb / sklearn) models can be deployed with minimal changes).

Please send help; I'm very close to tearing out my hair and smashing my laptop. I've been struggling with this for four whole days now.

Please follow this guide: https://github.com/RamVegiraju/Pre-Trained-Sklearn-SageMaker . During model creation, I think your inference script is not being specified in the environment variables.

I think the issue is with the zipped files. From the question, I understand that you are trying to zip up all the files, including the model dump and the script.

I would suggest removing the inference script from the model artifacts.

The model.tar.gz file should contain only the model.

And add the environment variable for the inference script, as suggested by @ram-vegiraju.

The script should be locally available.
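A sketch of what that create_model request could look like with the script wired in via environment variables: the SageMaker SKLearn serving containers read SAGEMAKER_PROGRAM (the script name) and SAGEMAKER_SUBMIT_DIRECTORY (an S3 URI of the tarred code directory). The bucket and key names below are assumptions, not taken from the question, and the boto3 call itself is left commented out since it needs live AWS credentials:

```python
# model.tar.gz now holds only the joblib dump; the inference script ships
# separately as a tarred source directory referenced via environment variables.
model_request = {
    "ModelName": "sklearn-test-env-vars",  # hypothetical name
    "Containers": [
        {
            "Image": "<sklearn image uri from sagemaker.image_uris.retrieve>",
            "ModelDataUrl": "s3://pretrained-model-deploy/test_custom_model.tar.gz",
            "Environment": {
                # Which script the serving container should run...
                "SAGEMAKER_PROGRAM": "inference.py",
                # ...and where the tarred code directory lives (assumed path)
                "SAGEMAKER_SUBMIT_DIRECTORY": "s3://pretrained-model-deploy/code/sourcedir.tar.gz",
            },
        }
    ],
    "ExecutionRoleArn": "<execution role arn>",
}
# client = boto3.client("sagemaker")
# client.create_model(**model_request)
env = model_request["Containers"][0]["Environment"]
print(env)
```

The rest of the endpoint-config and endpoint creation steps stay exactly as in the question.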

I've solved this problem: I used sagemaker.model.Model to load the model data I already had, and I called the deploy method on that Model object to deploy it. Further, I kept the inference script and the model file in the same place as the notebook and referenced them directly, as separating them had given me an error earlier as well.
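For anyone following along, that fix could look roughly like the sketch below. The SDK calls are commented out because they need live AWS credentials and an execution role; the argument values mirror (or are assumed to match) the question's setup:

```python
# Wrap the existing artifact in a sagemaker Model and deploy it serverlessly,
# with inference.py sitting next to the notebook. Values below are assumptions.
deploy_args = {
    "model_data": "s3://pretrained-model-deploy/test_custom_model.tar.gz",
    "entry_point": "inference.py",
    "memory_size_in_mb": 2048,
    "max_concurrency": 20,
}
# from sagemaker.model import Model
# from sagemaker.serverless import ServerlessInferenceConfig
#
# sk_model = Model(
#     image_uri=image_uri,                    # sklearn image from image_uris.retrieve
#     model_data=deploy_args["model_data"],
#     role=role,
#     entry_point=deploy_args["entry_point"], # SDK packages and uploads the script
# )
# predictor = sk_model.deploy(
#     serverless_inference_config=ServerlessInferenceConfig(
#         memory_size_in_mb=deploy_args["memory_size_in_mb"],
#         max_concurrency=deploy_args["max_concurrency"],
#     )
# )
print(deploy_args["entry_point"])
```

Letting the SDK handle the repackaging avoids the hand-rolled tar.gz layout that caused the ping failures in the first place.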

