AWS sagemaker-container: how do I create resourceconfig.json or pass it to the training framework?

Question

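I am trying to create a custom model/image/container for Amazon SageMaker. I have read all the basic tutorials on how to build an image to your own requirements. In fact, I already have a properly set-up image that runs TensorFlow and trains, deploys, and serves the model locally.

The problem comes when I try to run the container with the SageMaker Python SDK. More precisely, when I try to use the Framework module and class to create my own custom estimator that runs the custom image/container.

Here is the minimal code that explains my situation:

File structure: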

.
├── Dockerfile
├── variables.env
├── requirements.txt
├── test_sagemaker.ipynb
├── src
|   ├── train
|   ├── serve
|   ├── predict.py
|   └── custom_code/my_model_functions
|
└── local_test
    ├── train_local.sh
    ├── serve_local.sh
    ├── predict.sh
    └── test_dir
        ├── model/model.pkl
        ├── output/output.txt
        └── input
            ├── data/data.pkl
            └── config
                ├── hyperparameters.json
                ├── inputdataconfig.json
                └── resourceconfig.json

Dockerfile:

FROM ubuntu:16.04

MAINTAINER Amazon AI <sage-learner@amazon.com>

# Install python and other runtime dependencies
RUN apt-get update && \
    apt-get -y install build-essential libatlas-dev git wget curl nginx jq && \
    apt-get -y install python3-dev python3-setuptools

# Install pip
RUN cd /tmp && \
    curl -O https://bootstrap.pypa.io/get-pip.py && \
    python3 get-pip.py && \
    rm get-pip.py

# Installing Requirements
COPY requirements.txt /requirements.txt
RUN pip3 install -r /requirements.txt

# Set SageMaker training environment variables
ENV SM_ENV_VARIABLES env_variables

COPY local_test/test_dir /opt/ml

# Set up the program in the image
COPY src /opt/program
WORKDIR /opt/program

train:


from __future__ import absolute_import

import json, sys, logging, os, subprocess, time, traceback
from pprint import pprint

# Custom Code Functions
from custom_code.custom_estimator import CustomEstimator
from custom_code.custom_dataset import create_dataset

# Important SageMaker Modules
import sagemaker_containers.beta.framework as framework
from sagemaker_containers import _env

logger = logging.getLogger(__name__)

def run_algorithm_mode():
    """Run training in algorithm mode, which does not require a user entry point. """

    train_config = os.environ.get('training_env_variables')
    model_path = os.environ.get("model_path")

    print("Downloading Dataset")
    train_dataset,  test_dataset = create_dataset(None)
    print("Creating Model")
    clf = CustomEstimator.create_model(train_config)
    print("Starting Training")
    clf = clf.train_model(train_dataset, test_dataset)
    print("Saving Model")
    module_name = 'classifier.pkl'
    CustomEstimator.save_model(clf, model_path)


def train(training_environment):
    """Run Custom Model training in either 'algorithm mode' or using a user supplied module in local SageMaker environment.
    The user supplied module and its dependencies are downloaded from S3.
    Training is invoked by calling a "train" function in the user supplied module.
    Args:
        training_environment: training environment object containing environment variables,
                               training arguments and hyperparameters
    """

    if training_environment.user_entry_point is not None:
        print("Entry Point Receive")
        framework.modules.run_module(training_environment.module_dir,
                                     training_environment.to_cmd_args(),
                                     training_environment.to_env_vars(),
                                     training_environment.module_name,
                                     capture_error=False)
        print_directories()
    else:
        logger.info("Running Custom Model Sagemaker in 'algorithm mode'")
        try:
            _env.write_env_vars(training_environment.to_env_vars())
        except Exception as error:
            print(error)
        run_algorithm_mode()

def main():
    train(framework.training_env())
    sys.exit(0)

if __name__ == '__main__':
    main()
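
A small pre-flight check can make this kind of failure easier to read. The sketch below is not part of sagemaker_containers: `missing_config_files` and `EXPECTED_CONFIG_FILES` are hypothetical names, and the file list simply mirrors the local_test/test_dir/input/config layout shown earlier. It could be called at the top of main(), before framework.training_env(), to report which files the container can actually see.

```python
import os

# Assumption: these three names mirror the local_test config directory above.
EXPECTED_CONFIG_FILES = (
    "hyperparameters.json",
    "inputdataconfig.json",
    "resourceconfig.json",
)


def missing_config_files(config_dir="/opt/ml/input/config"):
    """Return the expected config file names that are absent from config_dir.

    The comparison is case-sensitive, so 'resourceConfig.json' does not
    count as 'resourceconfig.json'.
    """
    try:
        present = set(os.listdir(config_dir))
    except FileNotFoundError:
        # The directory itself is missing, so every file is missing too.
        present = set()
    return [name for name in EXPECTED_CONFIG_FILES if name not in present]
```

For example, `print(missing_config_files())` inside the container prints an empty list when all three files are visible, and the names of the absent ones otherwise.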

test_sagemaker.ipynb file:

I created this custom SageMaker estimator using the Framework class from the SageMaker estimator module.

import boto3
from sagemaker.estimator import Framework
from sagemaker.tensorflow import TensorFlow  # needed for TensorFlow.create_model below

class ScriptModeTensorFlow(Framework):
    """This class is temporary until the final version of Script Mode is released.
    """

    __framework_name__ = "tensorflow-scriptmode"

    create_model = TensorFlow.create_model

    def __init__(
        self,
        entry_point,
        source_dir=None,
        hyperparameters=None,
        py_version="py3",
        image_name=None,
        **kwargs
    ):
        super(ScriptModeTensorFlow, self).__init__(
            entry_point, source_dir , hyperparameters, image_name=image_name, **kwargs
        )
        self.py_version = py_version
        self.image_name = None
        self.framework_version = '2.0.0'
        self.user_entry_point = entry_point
        print(self.user_entry_point)

Then create the estimator, passing entry_point and the image (plus all the other parameters the class needs to run).

estimator = ScriptModeTensorFlow(entry_point='training_script_path/train_model.py',
                       image_name='sagemaker-custom-image:latest',
                       source_dir='source_dir_path/input/config',
                       train_instance_type='local',      # Run in local mode
                       train_instance_count=1,
                       hyperparameters=hyperparameters,
                       py_version='py3',
                       role=role)

Finally, start training...

estimator.fit({"train": "s3://s3-bucket-path/training_data"})

But I get the following error:

Creating tmpm3ft7ijm_algo-1-mjqkd_1 ... 
Attaching to tmpm3ft7ijm_algo-1-mjqkd_1
algo-1-mjqkd_1  | Reporting training FAILURE
algo-1-mjqkd_1  | framework error: 
algo-1-mjqkd_1  | Traceback (most recent call last):
algo-1-mjqkd_1  |   File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/_trainer.py", line 65, in train
algo-1-mjqkd_1  |     env = sagemaker_containers.training_env()
algo-1-mjqkd_1  |   File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/__init__.py", line 27, in training_env
algo-1-mjqkd_1  |     resource_config=_env.read_resource_config(),
algo-1-mjqkd_1  |   File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/_env.py", line 240, in read_resource_config
algo-1-mjqkd_1  |     return _read_json(resource_config_file_dir)
algo-1-mjqkd_1  |   File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/_env.py", line 192, in _read_json
algo-1-mjqkd_1  |     with open(path, "r") as f:
algo-1-mjqkd_1  | FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/input/config/resourceconfig.json'
algo-1-mjqkd_1  | 
algo-1-mjqkd_1  | [Errno 2] No such file or directory: '/opt/ml/input/config/resourceconfig.json'
algo-1-mjqkd_1  | Traceback (most recent call last):
algo-1-mjqkd_1  |   File "/usr/local/bin/dockerd-entrypoint.py", line 24, in <module>
algo-1-mjqkd_1  |     subprocess.check_call(shlex.split(' '.join(sys.argv[1:])))
algo-1-mjqkd_1  |   File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
algo-1-mjqkd_1  |     raise CalledProcessError(retcode, cmd)
algo-1-mjqkd_1  | subprocess.CalledProcessError: Command '['train']' returned non-zero exit status 2.
tmpm3ft7ijm_algo-1-mjqkd_1 exited with code 1
Aborting on container exit...

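At first glance the error seems obvious: the file "/opt/ml/input/config/resourceconfig.json" is missing. The problem is that I have no way to create this file so that the SageMaker framework can obtain the hosts for multiprocessing (which I don't need right now). When I built the image "sagemaker-custom-image:latest" following the folder structure shown below, I had already placed "resourceconfig.json" in the "/opt/ml/input/config/" folder inside the image.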

/opt/ml
├── input
│   ├── config
│   │   ├── hyperparameters.json
│   │   ├── inputdataconfig.json
│   │   └── resourceConfig.json
│   └── data
│       └── <channel_name>
│           └── <input data>
├── model
│   └── <model files>
└── output
    └── failure
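
For local testing outside the SDK, a single-host resource config can be generated rather than written by hand. This is a minimal sketch under stated assumptions: the keys `current_host` and `hosts` and the host name `algo-1` are what I understand SageMaker training to use, and `write_single_host_resource_config` is a hypothetical helper name, not an SDK function.

```python
import json
import os


def write_single_host_resource_config(config_dir, host="algo-1"):
    """Write a one-host resourceconfig.json into config_dir and return its path."""
    os.makedirs(config_dir, exist_ok=True)
    resource_config = {
        "current_host": host,  # assumed key: the container running this code
        "hosts": [host],       # assumed key: every host in the training cluster
    }
    path = os.path.join(config_dir, "resourceconfig.json")
    with open(path, "w") as f:
        json.dump(resource_config, f, indent=2)
    return path
```

Calling `write_single_host_resource_config("local_test/test_dir/input/config")` would refresh the file used by the local_test scripts. Note the all-lowercase file name, which matches the path in the traceback rather than the capitalization in the tree above.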

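According to the AWS documentation, when you run your image through the SageMaker SDK, files placed under "/opt/ml" in the container may no longer be visible during training.

"/opt/ml and all subdirectories are reserved by Amazon SageMaker training. When building your algorithm's Docker image, please ensure you don't place any data required by your algorithm under them, as the data may no longer be visible during training." (How Amazon SageMaker Runs Your Training Image)

That basically sums up my problem.

Yes, I know I can use SageMaker's prebuilt estimators and images.

Yes, I know I can bypass the framework library and run the image directly with Docker.

But I need to implement a fully custom SageMaker SDK/image/container/model that works with an entry point. I know it is a bit ambitious.

So, to rephrase my question: how do I get the SageMaker framework or SDK to create the required resourceconfig.json file inside the image?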

Answer 1

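Apparently, running the image remotely solved the problem. I am using a remote AWS machine, "ml.m5.large". Somewhere in the SageMaker SDK code, the files the image needs are created and supplied, but only when running on a remote machine, not when running locally.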

Answer 2

It seems the file was renamed from "resourceConfig.json" to "resourceconfig.json".
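
If capitalization is indeed the only mismatch (the tree in the question lists resourceConfig.json, while the traceback looks for resourceconfig.json), the test fixtures can be normalized before building the image. The following is a rough sketch, not an SDK feature; `lowercase_config_names` is a hypothetical helper name.

```python
import os


def lowercase_config_names(config_dir):
    """Rename files in config_dir whose names contain uppercase letters.

    Returns the list of (old_name, new_name) pairs that were renamed.
    """
    renamed = []
    for name in sorted(os.listdir(config_dir)):
        lower = name.lower()
        src = os.path.join(config_dir, name)
        if name == lower or not os.path.isfile(src):
            continue
        dst = os.path.join(config_dir, lower)
        # Skip when a different, correctly named file already exists.
        if os.path.exists(dst) and not os.path.samefile(src, dst):
            continue
        os.rename(src, dst)
        renamed.append((name, lower))
    return renamed
```

Running it on local_test/test_dir/input/config would turn resourceConfig.json into resourceconfig.json and leave names that are already lowercase untouched.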

Answer 1: 1, 2020-02-05 18:14:23

Answer 2: 0, 2021-10-02 16:50:52
