AWS SageMaker Multi-Model Endpoint with Scikit-learn: UnexpectedStatusException whilst using a training script
I am trying to create a multi-model endpoint in AWS SageMaker using Scikit-learn and a custom training script. When I attempt to train my model using the following code:
import sagemaker
from sagemaker.sklearn.estimator import SKLearn

estimator = SKLearn(
    entry_point=TRAINING_FILE,  # script to use for the training job
    role=role,
    source_dir=SOURCE_DIR,  # location of the scripts
    train_instance_count=1,
    train_instance_type=TRAIN_INSTANCE_TYPE,
    framework_version='0.23-1',
    output_path=s3_output_path,  # where to store model artifacts
    base_job_name=_job,
    code_location=code_location,  # where the .tar.gz of source_dir will be stored
    hyperparameters={'max_samples': 100,  # key must match the script's --max_samples flag
                     'model_name': key})

DISTRIBUTION_MODE = 'FullyReplicated'
train_input = sagemaker.s3_input(s3_data=inputs + '/train',
                                 distribution=DISTRIBUTION_MODE, content_type='csv')
estimator.fit({'train': train_input}, wait=True)
where 'TRAINING_FILE' contains:
import argparse
import os
import numpy as np
import pandas as pd
import joblib
import sys
from sklearn.ensemble import IsolationForest

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--max_samples', type=int, default=100)
    parser.add_argument('--model_dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--model_name', type=str)
    args, _ = parser.parse_known_args()

    print('reading data. . .')
    print('model_name: ' + args.model_name)
    train_file = os.path.join(args.train, args.model_name + '_train.csv')
    train_df = pd.read_csv(train_file)  # read in the training data
    train_tgt = train_df.iloc[:, [1]]  # second column, kept 2-D: (n_samples, 1)

    clf = IsolationForest(max_samples=args.max_samples)
    clf = clf.fit(train_tgt)  # fit expects an (n_samples, n_features) array

    path = os.path.join(args.model_dir, 'model.joblib')
    joblib.dump(clf, path)
    print('model persisted at ' + path)
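
A side note on the hyperparameters, offered as a minimal sketch rather than as the cause of the exception: in script mode SageMaker hands each hyperparameter to the script as a `--key value` command-line pair, and argparse only matches option strings it was told about. Because the script uses `parse_known_args`, a key whose spelling differs from the declared flag (for example a hyphen where the script expects an underscore) is ignored without an error and the default is used. The `build_parser` helper and the `demo` model name below are illustrative only.

```python
import argparse


def build_parser():
    """Recreate the two relevant flags from the training script."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--max_samples', type=int, default=100)
    parser.add_argument('--model_name', type=str)
    return parser


# Hyphenated key: argparse does not recognise it, so the default is kept
# and the unrecognised tokens are returned separately.
args_hyphen, unknown_hyphen = build_parser().parse_known_args(
    ['--max-samples', '50', '--model_name', 'demo'])
print(args_hyphen.max_samples, unknown_hyphen)

# Underscore key: matches the declared flag, so the value is picked up.
args_under, unknown_under = build_parser().parse_known_args(
    ['--max_samples', '50', '--model_name', 'demo'])
print(args_under.max_samples, unknown_under)
```

Printing the second value returned by `parse_known_args` inside the training script is a cheap way to spot hyperparameters that never reached the parser.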
The training script succeeds, but SageMaker then throws an UnexpectedStatusException.
Has anybody experienced anything like this before? I've checked all the CloudWatch logs and found nothing of use, and I'm completely stumped on what to try next.
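
When the logs show nothing, one further place to look is the training job description itself. The following is a sketch, not something taken from the original post: it assumes a client that behaves like `boto3.client("sagemaker")`, whose `describe_training_job` response carries `TrainingJobStatus` and, for failed jobs, a `FailureReason` string. The helper name `training_failure_reason` is hypothetical.

```python
def training_failure_reason(sm_client, job_name):
    """Return (status, failure_reason) for a SageMaker training job.

    `sm_client` is assumed to behave like boto3.client("sagemaker").
    `failure_reason` is None when the response has no FailureReason field.
    """
    desc = sm_client.describe_training_job(TrainingJobName=job_name)
    return desc["TrainingJobStatus"], desc.get("FailureReason")


# Example usage (needs AWS credentials, so it is left commented out;
# the job name is a placeholder):
# import boto3
# status, reason = training_failure_reason(
#     boto3.client("sagemaker"), "my-training-job-name")
# print(status, reason)
```

Taking the client as an argument keeps the helper easy to exercise with a stub, without touching AWS.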
To anyone who comes across this issue in future: the problem has been solved.
The issue had nothing to do with the training itself; it was caused by invalid characters in the directory names being sent to S3. The script produced the artifacts correctly, but SageMaker threw an exception when it tried to save them to S3.
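
The post does not say which characters were at fault, so the following is only a rough sketch of one way to guard against this class of problem: clean each directory name before it goes into `output_path`, `code_location`, or a job name. The allow-list (letters, digits, `.`, `_`, `-`) is an assumption, chosen as a conservative subset of the characters AWS documents as safe for S3 object keys, and both helper names are hypothetical.

```python
import re

# Assumed allow-list: letters, digits, '.', '_' and '-'.
_UNSAFE = re.compile(r"[^A-Za-z0-9._-]+")


def sanitize_s3_component(name, replacement="-"):
    """Replace each run of disallowed characters with `replacement`."""
    cleaned = _UNSAFE.sub(replacement, name).strip(replacement)
    if not cleaned:
        raise ValueError(f"nothing left after sanitizing {name!r}")
    return cleaned


def sanitize_s3_prefix(prefix):
    """Sanitize every '/'-separated component of a key prefix."""
    parts = [p for p in prefix.split("/") if p]
    return "/".join(sanitize_s3_component(p) for p in parts)


print(sanitize_s3_component("model #1 (v2)"))
print(sanitize_s3_prefix("output/my model/"))
```

Applying the helper only to the key portion (not the `s3://bucket` part) keeps the URI scheme and bucket name untouched.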