
AWS sagemaker-container: How to create or pass the resourceconfig.json to the framework for training?

I am trying to create a custom model/image/container for Amazon SageMaker. I have read all the basic tutorials on how to create an image with your own requirements. I actually have a properly set up image which runs TensorFlow and trains, deploys, and serves the model locally.

The problems come when I try to run the container using the SageMaker Python SDK, more precisely when trying to use the framework module and class to create my own custom estimator to run the custom image/container.

Here is the minimal code to explain my case:

File structure:

.
├── Dockerfile
├── variables.env
├── requirements.txt
├── test_sagemaker.ipynb
├── src
|   ├── train
|   ├── serve
|   ├── predict.py
|   └── custom_code/my_model_functions
|
└── local_test
    ├── train_local.sh
    ├── serve_local.sh
    ├── predict.sh
    └── test_dir
        ├── model/model.pkl
        ├── output/output.txt
        └── input
            ├── data/data.pkl
            └── config
                ├── hyperparameters.json
                ├── inputdataconfig.json
                └── resourceconfig.json

Dockerfile

FROM ubuntu:16.04

MAINTAINER Amazon AI <sage-learner@amazon.com>

# Install python and other runtime dependencies
RUN apt-get update && \
    apt-get -y install build-essential libatlas-dev git wget curl nginx jq && \
    apt-get -y install python3-dev python3-setuptools

# Install pip
RUN cd /tmp && \
    curl -O https://bootstrap.pypa.io/get-pip.py && \
    python3 get-pip.py && \
    rm get-pip.py

# Installing Requirements
COPY requirements.txt /requirements.txt
RUN pip3 install -r /requirements.txt

# Set SageMaker training environment variables
ENV SM_ENV_VARIABLES env_variables

COPY local_test/test_dir /opt/ml

# Set up the program in the image
COPY src /opt/program
WORKDIR /opt/program

train


from __future__ import absolute_import

import json, sys, logging, os, subprocess, time, traceback
from pprint import pprint

# Custom Code Functions
from custom_code.custom_estimator import CustomEstimator
from custom_code.custom_dataset import create_dataset

# Important SageMaker modules
import sagemaker_containers.beta.framework as framework
from sagemaker_containers import _env

logger = logging.getLogger(__name__)

def run_algorithm_mode():
    """Run training in algorithm mode, which does not require a user entry point. """

    train_config = os.environ.get('training_env_variables')
    model_path = os.environ.get("model_path")

    print("Downloading Dataset")
    train_dataset,  test_dataset = create_dataset(None)
    print("Creating Model")
    clf = CustomEstimator.create_model(train_config)
    print("Starting Training")
    clf = clf.train_model(train_dataset, test_dataset)
    print("Saving Model")
    module_name = 'classifier.pkl'
    CustomEstimator.save_model(clf, model_path)


def train(training_environment):
    """Run Custom Model training in either 'algorithm mode' or using a user supplied module in local SageMaker environment.
    The user supplied module and its dependencies are downloaded from S3.
    Training is invoked by calling a "train" function in the user supplied module.
    Args:
        training_environment: training environment object containing environment variables,
                               training arguments and hyperparameters
    """

    if training_environment.user_entry_point is not None:
        print("Entry Point Receive")
        framework.modules.run_module(training_environment.module_dir,
                                     training_environment.to_cmd_args(),
                                     training_environment.to_env_vars(),
                                     training_environment.module_name,
                                     capture_error=False)
        print_directories()
    else:
        logger.info("Running Custom Model Sagemaker in 'algorithm mode'")
        try:
            _env.write_env_vars(training_environment.to_env_vars())
        except Exception as error:
            print(error)
        run_algorithm_mode()

def main():
    train(framework.training_env())
    sys.exit(0)

if __name__ == '__main__':
    main()

test_sagemaker.ipynb

I created this custom SageMaker estimator using the Framework class of the SageMaker estimator module.

import boto3
from sagemaker.estimator import Framework
from sagemaker.tensorflow import TensorFlow  # needed for TensorFlow.create_model below

class ScriptModeTensorFlow(Framework):
    """This class is temporary until the final version of Script Mode is released.
    """

    __framework_name__ = "tensorflow-scriptmode"

    create_model = TensorFlow.create_model

    def __init__(
        self,
        entry_point,
        source_dir=None,
        hyperparameters=None,
        py_version="py3",
        image_name=None,
        **kwargs
    ):
        super(ScriptModeTensorFlow, self).__init__(
            entry_point, source_dir , hyperparameters, image_name=image_name, **kwargs
        )
        self.py_version = py_version
        self.image_name = None
        self.framework_version = '2.0.0'
        self.user_entry_point = entry_point
        print(self.user_entry_point)

Then I create the estimator, passing the entry_point and the image (and all the other parameters the class needs to run):

estimator = ScriptModeTensorFlow(entry_point='training_script_path/train_model.py',
                       image_name='sagemaker-custom-image:latest',
                       source_dir='source_dir_path/input/config',
                       train_instance_type='local',      # Run in local mode
                       train_instance_count=1,
                       hyperparameters=hyperparameters,
                       py_version='py3',
                       role=role)

Finally, kicking off training...

estimator.fit({"train": "s3://s3-bucket-path/training_data"})

but I get the following error:

Creating tmpm3ft7ijm_algo-1-mjqkd_1 ... 
Attaching to tmpm3ft7ijm_algo-1-mjqkd_1
algo-1-mjqkd_1  | Reporting training FAILURE
algo-1-mjqkd_1  | framework error: 
algo-1-mjqkd_1  | Traceback (most recent call last):
algo-1-mjqkd_1  |   File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/_trainer.py", line 65, in train
algo-1-mjqkd_1  |     env = sagemaker_containers.training_env()
algo-1-mjqkd_1  |   File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/__init__.py", line 27, in training_env
algo-1-mjqkd_1  |     resource_config=_env.read_resource_config(),
algo-1-mjqkd_1  |   File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/_env.py", line 240, in read_resource_config
algo-1-mjqkd_1  |     return _read_json(resource_config_file_dir)
algo-1-mjqkd_1  |   File "/usr/local/lib/python3.6/dist-packages/sagemaker_containers/_env.py", line 192, in _read_json
algo-1-mjqkd_1  |     with open(path, "r") as f:
algo-1-mjqkd_1  | FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/input/config/resourceconfig.json'
algo-1-mjqkd_1  | 
algo-1-mjqkd_1  | [Errno 2] No such file or directory: '/opt/ml/input/config/resourceconfig.json'
algo-1-mjqkd_1  | Traceback (most recent call last):
algo-1-mjqkd_1  |   File "/usr/local/bin/dockerd-entrypoint.py", line 24, in <module>
algo-1-mjqkd_1  |     subprocess.check_call(shlex.split(' '.join(sys.argv[1:])))
algo-1-mjqkd_1  |   File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
algo-1-mjqkd_1  |     raise CalledProcessError(retcode, cmd)
algo-1-mjqkd_1  | subprocess.CalledProcessError: Command '['train']' returned non-zero exit status 2.
tmpm3ft7ijm_algo-1-mjqkd_1 exited with code 1
Aborting on container exit...

At first glance the error seems obvious: the file '/opt/ml/input/config/resourceconfig.json' is missing. The thing is, I have no way of creating this file so that the SageMaker framework can get the hosts for multiprocessing (which I don't even need yet). When I build the image 'sagemaker-custom-image:latest' following the folder structure shown below, I already place 'resourceconfig.json' in the '/opt/ml/input/config/' folder inside the image.

/opt/ml
├── input
│   ├── config
│   │   ├── hyperparameters.json
│   │   ├── inputdataconfig.json
│   │   └── resourceConfig.json
│   └── data
│       └── <channel_name>
│           └── <input data>
├── model
│   └── <model files>
└── output
    └── failure
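
For reference, the file the training toolkit fails to find is a small JSON document describing the training cluster. A minimal single-host version looks roughly like the sketch below; the host name is illustrative and the fields reflect my understanding of what the toolkit reads, not an official template.

{
    "current_host": "algo-1",
    "hosts": ["algo-1"],
    "network_interface_name": "eth0"
}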

Reading the AWS documentation about how SageMaker runs your image with the SageMaker SDK, it says that files placed in the container under '/opt/ml' may no longer be visible during training.

/opt/ml and all sub-directories are reserved by Amazon SageMaker training. When building your algorithm's docker image, please ensure you don't place any data required by your algorithm under them as the data may no longer be visible during training. — How Amazon SageMaker Runs Your Training Image

This basically sums up my problem.

Yes, I know I can make use of the prebuilt estimators and images from SageMaker.

Yes, I know I can bypass the framework library and run the image's train script directly with docker run.

But I need to implement a fully custom SageMaker SDK/image/container/model that works with an entry point. I know it is a bit ambitious.

So, to reformulate my question: how do I get the SageMaker framework or SDK to create the required resourceconfig.json file inside the container?

Apparently, running the image remotely solved the problem. I am using a remote AWS instance, 'ml.m5.large'. Somewhere in the SageMaker SDK code, the files needed by the image are created and supplied, but only when running on a remote machine, not locally.

It seems that this file has been renamed from "resourceConfig.json" to "resourceconfig.json".
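
As a workaround for local mode only (this is not an official SageMaker mechanism), the train script could write a single-host resourceconfig.json itself before framework.training_env() is called, so that the framework's read_resource_config() finds the file. A minimal sketch, assuming single-node training and that /opt/ml/input/config is writable inside the container; the host name and helper name are illustrative:

import json
import os

RESOURCE_CONFIG_PATH = "/opt/ml/input/config/resourceconfig.json"

def ensure_resource_config():
    """Write a minimal single-host resource config if none was provided."""
    if os.path.exists(RESOURCE_CONFIG_PATH):
        return
    os.makedirs(os.path.dirname(RESOURCE_CONFIG_PATH), exist_ok=True)
    config = {
        "current_host": "algo-1",          # illustrative host name
        "hosts": ["algo-1"],               # single-node training
        "network_interface_name": "eth0",  # assumed default interface
    }
    with open(RESOURCE_CONFIG_PATH, "w") as f:
        json.dump(config, f)

# in main(), before building the training environment:
#     ensure_resource_config()
#     train(framework.training_env())

When running on a real SageMaker training instance this should not be needed, since the platform mounts the config files under /opt/ml/input/config itself.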
