AWS SageMaker Multi-Model Endpoint with Scikit-learn: UnexpectedStatusException whilst using a training script

I am trying to create a multi-model endpoint in AWS SageMaker using Scikit-learn and a custom training script. When I attempt to train my model using the following code:

import sagemaker
from sagemaker.sklearn.estimator import SKLearn

estimator = SKLearn(
    entry_point=TRAINING_FILE,  # script to use for the training job
    role=role,
    source_dir=SOURCE_DIR,  # location of the scripts
    train_instance_count=1,
    train_instance_type=TRAIN_INSTANCE_TYPE,
    framework_version='0.23-1',
    output_path=s3_output_path,  # where to store model artifacts
    base_job_name=_job,
    code_location=code_location,  # where the .tar.gz of source_dir will be stored
    hyperparameters={'max_samples': 100,  # key must match the script's --max_samples flag
                     'model_name': key})

DISTRIBUTION_MODE = 'FullyReplicated'

train_input = sagemaker.s3_input(s3_data=inputs + '/train',
                                 distribution=DISTRIBUTION_MODE, content_type='csv')

estimator.fit({'train': train_input}, wait=True)

where 'TRAINING_FILE' contains:


import argparse
import os

import pandas as pd
import joblib

from sklearn.ensemble import IsolationForest

if __name__ == '__main__':
    parser = argparse.ArgumentParser()

    parser.add_argument('--max_samples', type=int, default=100)
    
    parser.add_argument('--model_dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--model_name', type=str)

    args, _ = parser.parse_known_args()

    print('reading data. . .')
    print('model_name: '+args.model_name)    
    
    train_file = os.path.join(args.train, args.model_name + '_train.csv')    
    train_df = pd.read_csv(train_file) # read in the training data
    train_tgt = train_df.iloc[:, [1]]  # target column is the second column; kept 2-D, shape (n_samples, 1)
    
    clf = IsolationForest(max_samples=args.max_samples)
    clf = clf.fit(train_tgt)  # fit expects an array of shape (n_samples, n_features)
    
    path = os.path.join(args.model_dir, 'model.joblib')
    joblib.dump(clf, path)
    print('model persisted at ' + path)
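
A side note on the script above: in script mode, SageMaker hands each hyperparameter to the entry point as a `--key value` command-line argument, so the dictionary key has to match the argparse flag character for character. The sketch below reuses the two flags from the script in a hypothetical helper, `parse_hyperparameters`, and the argument list is made up for illustration. It shows that a hyphenated key is not matched by an underscore flag and that `parse_known_args` then silently keeps the default.

```python
import argparse


def parse_hyperparameters(argv):
    """Parse arguments the same way the training script does."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--max_samples', type=int, default=100)
    parser.add_argument('--model_name', type=str)
    # parse_known_args returns unrecognised arguments instead of raising
    return parser.parse_known_args(argv)


# Illustrative argument list: the key uses a hyphen, the flag an underscore.
args, unknown = parse_hyperparameters(
    ['--max-samples', '50', '--model_name', 'demo'])

print(args.max_samples)  # the default, because '--max-samples' was not matched
print(unknown)           # the arguments argparse could not place
```

Logging the second return value of `parse_known_args` at the start of a training script is a cheap way to spot this kind of mismatch.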

The training script succeeds, but SageMaker throws an UnexpectedStatusException.

Has anybody experienced anything like this before? I've checked all the CloudWatch logs and found nothing of use, and I'm completely stumped on what to try next.
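
When the CloudWatch logs show nothing, the job description itself may carry more detail. The SageMaker `DescribeTrainingJob` API returns a `TrainingJobStatus` and, for failed jobs, a `FailureReason` string. The following is a minimal sketch, assuming a boto3 SageMaker client is passed in. The helper name `training_failure_reason` and the job name in the usage comment are hypothetical.

```python
def training_failure_reason(sm_client, job_name):
    """Return (status, failure_reason) for a SageMaker training job.

    sm_client is expected to behave like boto3.client('sagemaker').
    failure_reason is None when the response has no FailureReason field.
    """
    desc = sm_client.describe_training_job(TrainingJobName=job_name)
    return desc['TrainingJobStatus'], desc.get('FailureReason')


# Usage (requires AWS credentials; the job name is a placeholder):
#   import boto3
#   status, reason = training_failure_reason(
#       boto3.client('sagemaker'), 'my-training-job-name')
#   print(status, reason)
```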

To anyone who comes across this issue in future: the problem has been solved.

The issue had nothing to do with the training itself. It was caused by invalid characters in the directory names being sent to S3. The script produced the artifacts correctly, but SageMaker threw an exception when it tried to save them to S3.
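
The answer does not say which characters were at fault, so the following is only a sketch of a pre-flight check that could be run before calling `fit`. It assumes the documented SageMaker training job name pattern (alphanumerics and hyphens, at most 63 characters) and the set of characters AWS lists as safe for S3 object keys. The helper names `sanitize_job_name` and `unsafe_key_chars`, and the example strings, are hypothetical.

```python
import re

# Pattern documented for SageMaker training job names.
JOB_NAME_RE = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")

# Characters AWS lists as safe in S3 object keys, plus "/" as the delimiter.
SAFE_KEY_CHARS = r"[0-9A-Za-z!\-_.*'()/]"


def sanitize_job_name(name):
    """Replace runs of disallowed characters with '-' and cap the length."""
    cleaned = re.sub(r"[^a-zA-Z0-9-]+", "-", name).strip("-")
    return cleaned[:63].rstrip("-")


def unsafe_key_chars(prefix):
    """Return the sorted characters in an S3 key prefix outside the safe set."""
    return sorted(set(re.sub(SAFE_KEY_CHARS, "", prefix)))


print(sanitize_job_name("my_model v1"))        # my-model-v1
print(unsafe_key_chars("models/my model:v1"))  # [' ', ':']
```

Note that `base_job_name` gets a timestamp suffix appended by the SDK, so a base name should stay well under the 63-character limit.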

