AWS sagemaker-container: How to create or pass the resourceconfig.json to framework for training?
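I am trying to create a custom model/image/container for Amazon SageMaker. I have read all the basic tutorials on how to build an image that fits your requirements. In fact, I already have a properly set up image that runs TensorFlow and trains, deploys, and serves the model locally.
The problem comes when I try to run the container with the SageMaker Python SDK. More precisely, it comes when I try to use the Framework module and class to create my own custom estimator that runs the custom image/container.
Here is the minimal code needed to explain my case:
File structure: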
.
├── Dockerfile
├── variables.env
├── requirements.txt
├── test_sagemaker.ipynb
├── src
|   ├── train
|   ├── serve
|   ├── predict.py
|   └── custom_code/my_model_functions
|
└── local_test
    ├── train_local.sh
    ├── serve_local.sh
    ├── predict.sh
    └── test_dir
        ├── model/model.pkl
        ├── output/output.txt
        └── input
            ├── data/data.pkl
            └── config
                ├── hyperparameters.json
                ├── inputdataconfig.json
                └── resourceconfig.json
Dockerfile:
FROM ubuntu:16.04
MAINTAINER Amazon AI <sage-learner@amazon.com>
# Install python and other runtime dependencies
RUN apt-get update && \
    apt-get -y install build-essential libatlas-dev git wget curl nginx jq && \
    apt-get -y install python3-dev python3-setuptools
# Install pip
RUN cd /tmp && \
    curl -O https://bootstrap.pypa.io/get-pip.py && \
    python3 get-pip.py && \
    rm get-pip.py
# Installing Requirements
COPY requirements.txt /requirements.txt
RUN pip3 install -r /requirements.txt
# Set SageMaker training environment variables
ENV SM_ENV_VARIABLES env_variables
COPY local_test/test_dir /opt/ml
# Set up the program in the image
COPY src /opt/program
WORKDIR /opt/program
train:
from __future__ import absolute_import
import json, sys, logging, os, subprocess, time, traceback
from pprint import pprint

# Custom Code Functions
from custom_code.custom_estimator import CustomEstimator
from custom_code.custom_dataset import create_dataset

# Important SageMaker Modules
import sagemaker_containers.beta.framework as framework
from sagemaker_containers import _env

logger = logging.getLogger(__name__)


def run_algorithm_mode():
    """Run training in algorithm mode, which does not require a user entry point."""
    train_config = os.environ.get('training_env_variables')
    model_path = os.environ.get("model_path")
    print("Downloading Dataset")
    train_dataset, test_dataset = create_dataset(None)
    print("Creating Model")
    clf = CustomEstimator.create_model(train_config)
    print("Starting Training")
    clf = clf.train_model(train_dataset, test_dataset)
    print("Saving Model")
    module_name = 'classifier.pkl'
    CustomEstimator.save_model(clf, model_path)


def train(training_environment):
    """Run Custom Model training in either 'algorithm mode' or using a user supplied module in local SageMaker environment.

    The user supplied module and its dependencies are downloaded from S3.
    Training is invoked by calling a "train" function in the user supplied module.

    Args:
        training_environment: training environment object containing environment variables,
            training arguments and hyperparameters
    """
    if training_environment.user_entry_point is not None:
        print("Entry Point Receive")
        framework.modules.run_module(training_environment.module_dir,
                                     training_environment.to_cmd_args(),
                                     training_environment.to_env_vars(),
                                     training_environment.module_name,
                                     capture_error=False)
        # Note: print_directories() is not defined in this excerpt.
        print_directories()
    else:
        logger.info("Running Custom Model Sagemaker in 'algorithm mode'")
        try:
            _env.write_env_vars(training_environment.to_env_vars())
        except Exception as error:
            print(error)
        run_algorithm_mode()


def main():
    train(framework.training_env())
    sys.exit(0)


if __name__ == '__main__':
    main()
test_sagemaker.ipynb:
I created this custom SageMaker estimator using the Framework class from the sagemaker estimator module.
import boto3
from sagemaker.estimator import Framework
from sagemaker.tensorflow import TensorFlow  # needed for TensorFlow.create_model below


class ScriptModeTensorFlow(Framework):
    """This class is temporary until the final version of Script Mode is released.
    """

    __framework_name__ = "tensorflow-scriptmode"

    create_model = TensorFlow.create_model

    def __init__(
        self,
        entry_point,
        source_dir=None,
        hyperparameters=None,
        py_version="py3",
        image_name=None,
        **kwargs
    ):
        super(ScriptModeTensorFlow, self).__init__(
            entry_point, source_dir, hyperparameters, image_name=image_name, **kwargs
        )
        self.py_version = py_version
        self.image_name = None
        self.framework_version = '2.0.0'
        self.user_entry_point = entry_point
        print(self.user_entry_point)
Then create the estimator, passing the entry_point and the image (plus all the other parameters the class needs to run):
estimator = ScriptModeTensorFlow(entry_point='training_script_path/train_model.py',
                                 image_name='sagemaker-custom-image:latest',
                                 source_dir='source_dir_path/input/config',
                                 train_instance_type='local',  # Run in local mode
                                 train_instance_count=1,
                                 hyperparameters=hyperparameters,
                                 py_version='py3',
                                 role=role)
Finally, start training:
estimator.fit({"train": "s3://s3-bucket-path/training_data"})
But I get the following error:
Creating tmpm3ft7ijm_algo-1-mjqkd_1 ... done
Attaching to tmpm3ft7ijm_algo-1-mjqkd_1
algo-1-mjqkd_1 | Reporting training FAILURE
algo-1-mjqkd_1 | framework error:
algo-1-mjqkd_1 | Traceback (most recent call last):
algo-1-mjqkd_1 | File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/_trainer.py", line 65, in train
algo-1-mjqkd_1 | env = sagemaker_containers.training_env()
algo-1-mjqkd_1 | File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/__init__.py", line 27, in training_env
algo-1-mjqkd_1 | resource_config=_env.read_resource_config(),
algo-1-mjqkd_1 | File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/_env.py", line 240, in read_resource_config
algo-1-mjqkd_1 | return _read_json(resource_config_file_dir)
algo-1-mjqkd_1 | File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/_env.py", line 192, in _read_json
algo-1-mjqkd_1 | with open(path, "r") as f:
algo-1-mjqkd_1 | FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/input/config/resourceconfig.json'
algo-1-mjqkd_1 |
algo-1-mjqkd_1 | [Errno 2] No such file or directory: '/opt/ml/input/config/resourceconfig.json'
algo-1-mjqkd_1 | Traceback (most recent call last):
algo-1-mjqkd_1 | File "/usr/local/bin/dockerd-entrypoint.py", line 24, in <module>
algo-1-mjqkd_1 | subprocess.check_call(shlex.split(' '.join(sys.argv[1:])))
algo-1-mjqkd_1 | File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
algo-1-mjqkd_1 | raise CalledProcessError(retcode, cmd)
algo-1-mjqkd_1 | subprocess.CalledProcessError: Command '['train']' returned non-zero exit status 2.
tmpm3ft7ijm_algo-1-mjqkd_1 exited with code 1
Aborting on container exit...
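At first glance the error seems obvious: the file "/opt/ml/input/config/resourceconfig.json" is missing. The problem is that I have no way of creating this file so that the SageMaker framework can obtain the hosts for multi-processing (which I don't need for now). When I built the image "sagemaker-custom-image:latest" following the folder structure shown below, I had already placed "resourceconfig.json" in the "/opt/ml/input/config/" folder inside the image.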
/opt/ml
├── input
│   ├── config
│   │   ├── hyperparameters.json
│   │   ├── inputdataconfig.json
│   │   └── resourceConfig.json
│   └── data
│       └── <channel_name>
│           └── <input data>
├── model
│   └── <model files>
└── output
    └── failure
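One way to see what the container actually finds at start-up is to list which of the expected config files are present. The following is a minimal sketch and not part of the original post: the helper name `missing_config_files` is hypothetical, and the file names are taken from the directory trees and the traceback in the question (lowercase `resourceconfig.json`, as in the error message).

```python
import os

# File names as they appear in the question; the directory is a parameter
# so the check can also be tried outside a container.
EXPECTED_CONFIG_FILES = (
    "hyperparameters.json",
    "inputdataconfig.json",
    "resourceconfig.json",
)


def missing_config_files(config_dir="/opt/ml/input/config"):
    """Return the expected config file names that are absent from config_dir."""
    try:
        present = set(os.listdir(config_dir))
    except FileNotFoundError:
        # The whole directory is missing, so every expected file is missing too.
        return list(EXPECTED_CONFIG_FILES)
    return [name for name in EXPECTED_CONFIG_FILES if name not in present]
```

Calling `missing_config_files()` at the top of the train script, before `framework.training_env()`, would show whether the files baked into the image are still visible once the SDK has started the container.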
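According to the AWS documentation, when you run your image with the SageMaker SDK, any files placed under the "/opt/ml" folder in the container may no longer be visible during training.
/opt/ml and all subdirectories are reserved by Amazon SageMaker training. When building your algorithm's Docker image, please ensure you don't place any data required by your algorithm under them, as the data may no longer be visible during training. (How Amazon SageMaker Runs Your Training Image)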
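That basically sums up my problem.
Yes, I know I can use SageMaker's prebuilt estimators and images.
Yes, I know I can bypass the framework library and run the image's training directly from Docker.
But I need to implement a fully customized SageMaker SDK/image/container/model to use with an entry point. I know it is a bit ambitious.
So, to rephrase my question: how do I get the SageMaker framework or SDK to create the required resourceconfig.json file inside the image?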
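Apparently, running the image remotely solves the problem. I am using a remote AWS machine, "ml.m5.large". Somewhere in the SageMaker SDK code, the files the image needs are created and passed in, but only when running on a remote machine, not when running locally.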
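For local experiments where the file is not provided, one possible workaround is to write a single-host resourceconfig.json before the framework reads it. This is only a sketch under assumptions: the helper name `ensure_resource_config` is hypothetical, the host name "algo-1" is a placeholder, and the `current_host` / `hosts` keys follow the layout SageMaker documents for this file. Whether a write from inside the container survives the way local mode mounts /opt/ml has not been verified here.

```python
import json
import os


def ensure_resource_config(config_dir="/opt/ml/input/config", host="algo-1"):
    """Write a single-host resourceconfig.json if none exists; return its path."""
    path = os.path.join(config_dir, "resourceconfig.json")
    if not os.path.exists(path):
        os.makedirs(config_dir, exist_ok=True)
        # Assumed single-host layout: one host, which is also the current host.
        config = {"current_host": host, "hosts": [host]}
        with open(path, "w") as f:
            json.dump(config, f)
    return path
```

An existing file is left untouched, so the helper should be harmless on a remote instance where SageMaker supplies the real configuration.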
The file appears to have been renamed from "resourceConfig.json" to "resourceconfig.json".
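Because the two spellings differ only in letter case, a case-insensitive lookup can show which one is actually on disk. The sketch below is illustrative only; `find_file_ignoring_case` is a hypothetical helper, not part of sagemaker_containers.

```python
import os


def find_file_ignoring_case(directory, file_name):
    """Return the path of the first entry matching file_name ignoring case, or None."""
    wanted = file_name.lower()
    for entry in sorted(os.listdir(directory)):
        if entry.lower() == wanted:
            return os.path.join(directory, entry)
    return None
```

For example, `find_file_ignoring_case("/opt/ml/input/config", "resourceconfig.json")` would return the path with whichever capitalization the directory really contains, or None if no variant of the file exists.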