
How to update an existing model in AWS sagemaker >= 2.0

I have an XGBoost model currently in production using AWS sagemaker, making real-time inferences. After a while, I would like to update the model with a newer one trained on more data and keep everything else as is (e.g. same endpoint, same inference procedure, so really no changes aside from the model itself).

The current deployment procedure is the following:

from sagemaker.xgboost.model import XGBoostModel
from sagemaker.xgboost.model import XGBoostPredictor

xgboost_model = XGBoostModel(
    model_data=<S3 url>,
    role=<sagemaker role>,
    entry_point='inference.py',
    source_dir='src',
    code_location=<S3 url of other dependencies>,
    framework_version='1.5-1',
    name=model_name)

xgboost_model.deploy(
    instance_type='ml.c5.large',
    initial_instance_count=1,
    endpoint_name=model_name)

Now that I have updated the model a few weeks later, I would like to re-deploy it. I am aware that the .deploy() method creates both an endpoint and an endpoint configuration, so it does it all. I cannot simply re-run my script, since I would encounter an error.

In previous versions of sagemaker I could have updated the model by passing an extra argument to the .deploy() method, update_endpoint=True. In sagemaker >= 2.0 this is a no-op. Instead, in sagemaker >= 2.0, I need to use the predictor object as stated in the documentation. So I try the following:

predictor = XGBoostPredictor(model_name)
predictor.update_endpoint(model_name=model_name)

This actually updates the endpoint according to a new endpoint configuration. However, I do not know what it is updating... nowhere in the above 2 lines of code do I specify that the new xgboost_model trained on more data should be considered... so where do I tell the update to take a more recent model?

Thank you!

Update

I believe that I need to be looking at production variants, as stated in their documentation here. However, their whole tutorial is based on the amazon sdk for python (boto3), which has artifacts that are hard to manage when I have different entry points for each model variant (e.g. different inference.py scripts).
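
For context, here is a minimal sketch of what that boto3 production-variant setup looks like (the config and model names are hypothetical, for illustration only): an endpoint configuration can register several model variants and split traffic between them.

import boto3

sm = boto3.client('sagemaker')

# hypothetical names, for illustration only
sm.create_endpoint_config(
    EndpointConfigName='my-config',
    ProductionVariants=[
        # keep most traffic on the current model...
        {'VariantName': 'current', 'ModelName': 'model-v1',
         'InstanceType': 'ml.c5.large', 'InitialInstanceCount': 1,
         'InitialVariantWeight': 0.9},
        # ...and send a small share to the retrained one
        {'VariantName': 'retrained', 'ModelName': 'model-v2',
         'InstanceType': 'ml.c5.large', 'InitialInstanceCount': 1,
         'InitialVariantWeight': 0.1},
    ]
)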

In your model_name you specify the name of a SageMaker Model object, in which you can specify the image_uri, model_data, etc.
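
A minimal sketch of that idea with the sagemaker SDK (untested; it assumes Model.create() is available in the sagemaker 2.x release in use to register a model without deploying it, and the '-v2' suffix is a placeholder):

from sagemaker.xgboost.model import XGBoostModel, XGBoostPredictor

# register the retrained model under a NEW model name (no deployment yet);
# Model.create() is assumed to exist in recent sagemaker 2.x releases
new_model = XGBoostModel(
    model_data=<S3 url of retrained model>,
    role=<sagemaker role>,
    entry_point='inference.py',
    source_dir='src',
    framework_version='1.5-1',
    name=model_name + '-v2')  # placeholder name for the new model
new_model.create(instance_type='ml.c5.large')

# the endpoint was created with endpoint_name = model_name;
# when model_name is passed, instance_type and initial_instance_count are required too
predictor = XGBoostPredictor(model_name)
predictor.update_endpoint(
    model_name=model_name + '-v2',
    instance_type='ml.c5.large',
    initial_instance_count=1)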

Since I found an answer to my own question, I will post it here for those who encounter the same problem.

I ended up re-coding my whole deployment script using the boto3 SDK rather than the sagemaker SDK (or a mix of both, as some documentation suggests).

Here's the whole script, which shows how to create a sagemaker model object, an endpoint configuration, and an endpoint to deploy the model on for the first time. In addition, it shows how to update the endpoint with a newer model (which was my main question).

Here's the code to do all 3, in case you want to bring your own model and update it safely in production using sagemaker:

import boto3
import sagemaker  # needed for sagemaker.Session() below
import time
from datetime import datetime
from sagemaker import image_uris
from fileManager import *  # this is a local script for helper functions

# name of zipped model and zipped inference code
CODE_TAR = 'your_inference_code_and_other_artifacts.tar.gz'
MODEL_TAR = 'your_saved_xgboost_model.tar.gz'

# sagemaker params
smClient = boto3.client('sagemaker')
smRole = <your_sagemaker_role>
bucket = sagemaker.Session().default_bucket()
region = boto3.Session().region_name  # used to retrieve the inference image
INSTANCE_TYPE = 'ml.c5.large'         # instance type for the endpoint's production variant

# deploy algorithm
class Deployer:

    def __init__(self, modelName, deployRetrained=False):
        self.modelName = modelName
        self.deployRetrained = deployRetrained
        self.prefix = <S3_model_path_prefix>
    
    def deploy(self):
        '''
        Main method to create a sagemaker model, create an endpoint configuration and deploy the model. If deployRetrained
        param is set to True, this method will update an already existing endpoint.
        '''
        # define model name and endpoint name to be used for model deployment/update
        model_name = self.modelName + <any_suffix>
        endpoint_config_name = self.modelName + '-%s' %datetime.now().strftime('%Y-%m-%d-%HH%M')
        endpoint_name = self.modelName
        
        # deploy model for the first time
        if not self.deployRetrained:
            print('Deploying for the first time')

            # here you should copy and zip the model dependencies that you may have (such as preprocessors, inference code, config code...)
            # mine were zipped into the file called CODE_TAR

            # upload model and model artifacts needed for inference to S3
            uploadFile(list_files=[MODEL_TAR, CODE_TAR], prefix=self.prefix)

            # create sagemaker model and endpoint configuration
            self.createSagemakerModel(model_name)
            self.createEndpointConfig(endpoint_config_name, model_name)

            # deploy model and wait while endpoint is being created
            self.createEndpoint(endpoint_name, endpoint_config_name)
            self.waitWhileCreating(endpoint_name)
        
        # update model
        else:
            print('Updating existing model')

            # upload model and model artifacts needed for inference (here the old ones are replaced)
            # make sure to make a backup in S3 if you would like to keep the older models
            # we replace the old ones and keep the same names to avoid having to recreate a sagemaker model with a different name for the update!
            uploadFile(list_files=[MODEL_TAR, CODE_TAR], prefix=self.prefix)

            # create a new endpoint config that takes the new model
            self.createEndpointConfig(endpoint_config_name, model_name)

            # update endpoint
            self.updateEndpoint(endpoint_name, endpoint_config_name)

            # wait while endpoint updates then delete outdated endpoint config once it is InService
            self.waitWhileCreating(endpoint_name)
            self.deleteOutdatedEndpointConfig(model_name, endpoint_config_name)

    def createSagemakerModel(self, model_name):
        ''' 
        Create a new sagemaker Model object with an xgboost container and an entry point for inference using boto3 API
        '''
        # Retrieve that inference image (container)
        docker_container = image_uris.retrieve(region=region, framework='xgboost', version='1.5-1')

        # Relative S3 path to the pre-trained model
        model_s3_key = f'{self.prefix}/' + MODEL_TAR

        # Combine bucket name and relative S3 path to create the S3 model URI
        model_url = f's3://{bucket}/{model_s3_key}'

        # S3 path to the necessary inference code
        code_url = f's3://{bucket}/{self.prefix}/{CODE_TAR}'
        
        # Create a sagemaker Model object with all its artifacts
        smClient.create_model(
            ModelName = model_name,
            ExecutionRoleArn = smRole,
            PrimaryContainer = {
                'Image': docker_container,
                'ModelDataUrl': model_url,
                'Environment': {
                    'SAGEMAKER_PROGRAM': 'inference.py', #inference.py is at the root of my zipped CODE_TAR
                    'SAGEMAKER_SUBMIT_DIRECTORY': code_url,
                }
            }
        )
    
    def createEndpointConfig(self, endpoint_config_name, model_name):
        '''
        Create an endpoint configuration (an explicit step when using the boto3 sdk) and set the production variant parameters.
        Each retraining procedure will produce a new variant name based on the endpoint configuration name.
        '''
        smClient.create_endpoint_config(
            EndpointConfigName=endpoint_config_name,
            ProductionVariants=[
                {
                    'VariantName': endpoint_config_name,
                    'ModelName': model_name,
                    'InstanceType': INSTANCE_TYPE,
                    'InitialInstanceCount': 1
                }
            ]
        )

    def createEndpoint(self, endpoint_name, endpoint_config_name):
        '''
        Deploy the model to an endpoint
        '''
        smClient.create_endpoint(
            EndpointName=endpoint_name,
            EndpointConfigName=endpoint_config_name)
    
    def deleteOutdatedEndpointConfig(self, name_check, current_endpoint_config):
        '''
        Automatically detect and delete endpoint configurations that contain a string 'name_check'. This method can be used
        after a retrain procedure to delete all previous endpoint configurations but keep the current one named 'current_endpoint_config'.
        '''
        # get a list of all available endpoint configurations
        all_configs = smClient.list_endpoint_configs()['EndpointConfigs']

        # loop over the names of endpoint configs
        names_list = []
        for config_dict in all_configs:
            endpoint_config_name = config_dict['EndpointConfigName']

            # get only endpoint configs that contain name_check in them and save names to a list
            if name_check in endpoint_config_name:
                names_list.append(endpoint_config_name)
        
        # remove the current endpoint configuration from the list (we do not want to delete this one since it is live)
        names_list.remove(current_endpoint_config)

        for name in names_list:
            try:
                smClient.delete_endpoint_config(EndpointConfigName=name)
                print('Deleted endpoint configuration for %s' %name)
            except smClient.exceptions.ClientError:
                print('INFO : could not delete endpoint configuration for %s' %name)

    def updateEndpoint(self, endpoint_name, endpoint_config_name):
        ''' 
        Update existing endpoint with a new retrained model
        '''
        smClient.update_endpoint(
            EndpointName=endpoint_name,
            EndpointConfigName=endpoint_config_name,
            RetainAllVariantProperties=True)
    
    def waitWhileCreating(self, endpoint_name):
        ''' 
        While the endpoint is being created or updated sleep for 60 seconds.
        '''
        # wait while creating or updating endpoint
        status = smClient.describe_endpoint(EndpointName=endpoint_name)['EndpointStatus']
        print('Status: %s' %status)
        while status != 'InService' and status != 'Failed':
            time.sleep(60)
            status = smClient.describe_endpoint(EndpointName=endpoint_name)['EndpointStatus']
            print('Status: %s' %status)
        
        # in case of a deployment failure raise an error
        if status == 'Failed':
            raise ValueError('Endpoint failed to deploy')

if __name__ == "__main__":
    deployer = Deployer('churnmodel', deployRetrained=True)
    deployer.deploy()

Final comments:

  • The sagemaker documentation mentions all this but fails to state that you can provide the equivalent of an 'entry_point' to the create_model method, as well as a 'source_dir' for inference dependencies (e.g. normalization artifacts): both can be passed through the SAGEMAKER_PROGRAM and SAGEMAKER_SUBMIT_DIRECTORY environment variables in the PrimaryContainer argument, as shown above.

  • my fileManager.py script just contains basic functions to make tar files and to upload to and download from my S3 paths. To keep the class simple, I have not included them here; a minimal sketch of the upload helper appears after this list.

  • The method deleteOutdatedEndpointConfig may seem like overkill, with an unnecessary loop and checks. I do so because I have multiple endpoint configurations to handle and wanted to remove the ones that are not live AND contain the string name_check (I do not know the exact name of the configuration, since there is a datetime suffix). Feel free to simplify it or remove it altogether if you feel like it.
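
For completeness, a minimal sketch of what the upload helper could look like, assuming it simply pushes the listed local files to the default bucket under the given prefix (the real fileManager.py also builds the tar archives and handles downloads):

import boto3
import sagemaker

s3Client = boto3.client('s3')
bucket = sagemaker.Session().default_bucket()

def uploadFile(list_files, prefix):
    # upload each local file to s3://<bucket>/<prefix>/<file name>
    for file_name in list_files:
        s3Client.upload_file(file_name, bucket, f'{prefix}/{file_name}')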

Hope it helps.
