我在 AWS SageMaker 中训练 model 时遇到问题，在需要保存 model 之前一切都很好

Question

I've had trouble training a model in AWS SageMaker, everything is fine until the model needs to be saved.我在 AWS SageMaker 中训练 model 时遇到了麻烦，在需要保存 model 之前一切都很好。 I have tried with a 500MB dataset and everything works correctly, but when the.csv file occupies 10GB the training job fails.我尝试使用 500MB 数据集，一切正常，但是当 .csv 文件占用 10GB 时，训练作业失败。 Next I leave my training python file and the error output, the machine used to train was ml.m5.2xlarge with a train_volume_size = 100.接下来我离开我的训练 python 文件和错误 output，用于训练的机器是 ml.m5.2xlarge，train_volume_size = 100。

File .py to train the model in SageMaker with an output of 10GB使用.py文件在 SageMaker 中训练 model，output 为 10GB

import argparse
import pandas as pd 
import  os
import sys
from os.path import join
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import  GaussianNB
from sklearn import metrics
import numpy as np

import logging

import boto3

import time

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler(sys.stdout))

if 'SAGEMAKER_METRICS_DIRECTORY' in os.environ:
    log_file_handler = logging.FileHandler(join(os.environ['SAGEMAKER_METRICS_DIRECTORY'], "metrics.json"))
    log_file_handler.setFormatter(
    "{'time':'%(asctime)s', 'name': '%(name)s', \
    'level': '%(levelname)s', 'message': '%(message)s'}"
    )
    logger.addHandler(log_file_handler)

os.system('pip install joblib')
import joblib


if __name__ == '__main__':

    parser = argparse.ArgumentParser()

    # Adicion de hyperparametros

    # Solamente se añade el parametro lambda de regularizacion

    parser.add_argument('--regularization_lambda',type=float, default=0.0)

    # Argumentos propios de sagemaker 

    parser.add_argument('--output-data-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])
    parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])

    args = parser.parse_args()

    input_files = [ os.path.join(args.train, file) for file in os.listdir(args.train) ]
    if len(input_files) == 0:
        raise ValueError(('There are no files in {}.\n' +
                          'This usually indicates that the channel ({}) was incorrectly specified,\n' +
                          'the data specification in S3 was incorrectly specified or the role specified\n' +
                          'does not have permission to access the data.').format(args.train, "train"))

    raw_data = [pd.read_csv(file,header=None,engine="python") for file in input_files]
    train_data = pd.concat(raw_data)


    # Definicion del modelo
    model = GaussianNB()
    
    matrix = train_data.values
    
    for submatrix in np.split(matrix,np.arange(100,12100,100),axis=0):
    
        # Generacion de los datos de entrenemiento asumiendo que
        # las etiquetas estan en la primera columna
 
        train_y = submatrix[:,0]
        train_x = submatrix[:,1:]

        model = model.partial_fit(train_x,train_y,classes=np.unique(train_y))
        print('Accuracy: ', model.score(train_x, train_y))

    logger.info('Train accuracy: {:.6f};'.format(model.score(train_x, train_y)))

    # Mustra de los coeficientes y guradarlos

    joblib.dump(model, os.path.join(args.model_dir, "model.joblib"))

    def model_fn(model_dir):
    
        # Se retorna el modelo entrenado
        model = joblib.load(os.path.join(model_dir, "model.joblib"))
        return model

When the finished the output was the next error当完成 output 是下一个错误

2020-07-20 09:49:52 Starting - Starting the training job...
2020-07-20 09:49:54 Starting - Launching requested ML instances......
2020-07-20 09:50:58 Starting - Preparing the instances for training...
2020-07-20 09:51:39 Downloading - Downloading input data...............
2020-07-20 09:54:22 Training - Training image download completed. Training in progress..2020-07-20 09:54:24,234 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2020-07-20 09:54:24,236 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2020-07-20 09:54:24,246 sagemaker_sklearn_container.training INFO     Invoking user training script.
2020-07-20 09:54:24,803 sagemaker-containers INFO     Module eeg-NB-model does not provide a setup.py. 
Generating setup.py
2020-07-20 09:54:24,803 sagemaker-containers INFO     Generating setup.cfg
2020-07-20 09:54:24,803 sagemaker-containers INFO     Generating MANIFEST.in
2020-07-20 09:54:24,803 sagemaker-containers INFO     Installing module with the following command:
/miniconda3/bin/python -m pip install . 
Processing /opt/ml/code
Building wheels for collected packages: eeg-NB-model
  Building wheel for eeg-NB-model (setup.py): started
  Building wheel for eeg-NB-model (setup.py): finished with status 'done'
  Created wheel for eeg-NB-model: filename=eeg_NB_model-1.0.0-py2.py3-none-any.whl size=7074 sha256=2d6213105e4f7f707f68278b1291d2940b8de2c319f7084b322b2d4197402c33
  Stored in directory: /tmp/pip-ephem-wheel-cache-8kr3fxjv/wheels/35/24/16/37574d11bf9bde50616c67372a334f94fa8356bc7164af8ca3
Successfully built eeg-NB-model
Installing collected packages: eeg-NB-model
Successfully installed eeg-NB-model-1.0.0
2020-07-20 09:54:26,753 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2020-07-20 09:54:26,763 sagemaker-containers INFO     Invoking user script

Training Env:

{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "train": "/opt/ml/input/data/train"
    },
    "current_host": "algo-1",
    "framework_module": "sagemaker_sklearn_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {
        "regularization_lambda": 0.0
    },
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {
        "train": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        }
    },
    "input_dir": "/opt/ml/input",
    "is_master": true,
    "job_name": "sagemaker-scikit-learn-2020-07-20-09-49-52-390",
    "log_level": 20,
    "master_hostname": "algo-1",
    "model_dir": "/opt/ml/model",
    "module_dir": "s3://sagemaker-eu-west-1-798663412819/sagemaker-scikit-learn-2020-07-20-09-49-52-390/source/sourcedir.tar.gz",
    "module_name": "eeg-NB-model",
    "network_interface_name": "eth0",
    "num_cpus": 8,
    "num_gpus": 0,
    "output_data_dir": "/opt/ml/output/data",
    "output_dir": "/opt/ml/output",
    "output_intermediate_dir": "/opt/ml/output/intermediate",
    "resource_config": {
        "current_host": "algo-1",
        "hosts": [
            "algo-1"
        ],
        "network_interface_name": "eth0"
    },
    "user_entry_point": "eeg-NB-model.py"
}

Environment variables:

SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"regularization_lambda":0.0}
SM_USER_ENTRY_POINT=eeg-NB-model.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={"train":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=["train"]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=eeg-NB-model
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_sklearn_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=8
SM_NUM_GPUS=0
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-eu-west-1-798663412819/sagemaker-scikit-learn-2020-07-20-09-49-52-390/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"train":"/opt/ml/input/data/train"},"current_host":"algo-1","framework_module":"sagemaker_sklearn_container.training:main","hosts":["algo-1"],"hyperparameters":{"regularization_lambda":0.0},"input_config_dir":"/opt/ml/input/config","input_data_config":{"train":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","is_master":true,"job_name":"sagemaker-scikit-learn-2020-07-20-09-49-52-390","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-eu-west-1-798663412819/sagemaker-scikit-learn-2020-07-20-09-49-52-390/source/sourcedir.tar.gz","module_name":"eeg-NB-model","network_interface_name":"eth0","num_cpus":8,"num_gpus":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"},"user_entry_point":"eeg-NB-model.py"}
SM_USER_ARGS=["--regularization_lambda","0.0"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_CHANNEL_TRAIN=/opt/ml/input/data/train
SM_HP_REGULARIZATION_LAMBDA=0.0
PYTHONPATH=/miniconda3/bin:/miniconda3/lib/python37.zip:/miniconda3/lib/python3.7:/miniconda3/lib/python3.7/lib-dynload:/miniconda3/lib/python3.7/site-packages

Invoking script with the following command:

/miniconda3/bin/python -m eeg-NB-model --regularization_lambda 0.0


/miniconda3/lib/python3.7/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
Collecting joblib
  Downloading https://files.pythonhosted.org/packages/51/dd/0e015051b4a27ec5a58b02ab774059f3289a94b0906f880a3f9507e74f38/joblib-0.16.0-py3-none-any.whl (300kB)
Installing collected packages: joblib
Successfully installed joblib-0.16.0

2020-07-20 09:58:21 Uploading - Uploading generated training model
2020-07-20 09:58:21 Failed - Training job failed
2020-07-20 09:58:12,544 sagemaker-containers ERROR    ExecuteUserScriptError:
Command "/miniconda3/bin/python -m eeg-NB-model --regularization_lambda 0.0"
---------------------------------------------------------------------------
UnexpectedStatusException                 Traceback (most recent call last)
<ipython-input-7-267e445b3bf0> in <module>
     28 NB_training_job_name = "Naive-Bayes-training-job-{}".format(int(time.time()))
     29 
---> 30 estimator.fit({'train': train_input},wait=True)

/opt/conda/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name, experiment_config)
    463         self.jobs.append(self.latest_training_job)
    464         if wait:
--> 465             self.latest_training_job.wait(logs=logs)
    466 
    467     def _compilation_job_name(self):

/opt/conda/lib/python3.6/site-packages/sagemaker/estimator.py in wait(self, logs)
   1056         # If logs are requested, call logs_for_jobs.
   1057         if logs != "None":
-> 1058             self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
   1059         else:
   1060             self.sagemaker_session.wait_for_job(self.job_name)

/opt/conda/lib/python3.6/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll, log_type)
   3019 
   3020         if wait:
-> 3021             self._check_job_status(job_name, description, "TrainingJobStatus")
   3022             if dot:
   3023                 print()

/opt/conda/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
   2613                 ),
   2614                 allowed_statuses=["Completed", "Stopped"],
-> 2615                 actual_status=status,
   2616             )
   2617 

UnexpectedStatusException: Error for Training job sagemaker-scikit-learn-2020-07-20-09-49-52-390: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "/miniconda3/bin/python -m eeg-NB-model --regularization_lambda 0.0"

Answer 1

However, it looks like you are loading all of your data into memory in these lines:但是，看起来您正在将所有数据加载到 memory 中，如下所示：

    raw_data = [pd.read_csv(file,header=None,engine="python") for file in input_files]
    train_data = pd.concat(raw_data)

The model type you are using ml.m5.2xlarge has 32 GiB of memory .您使用的 model 类型ml.m5.2xlarge具有32 GiB 的 memory 。 It could be that loading all of your data into memory this way is leading to an out-of-memory exception or timeout.以这种方式将所有数据加载到 memory 可能会导致内存不足异常或超时。 Poke around the SageMaker / Cloudwatch logs to try to get a failure reason.查看 SageMaker / Cloudwatch 日志以尝试获取失败原因。 Unfortunately, the SageMaker logs are only showing ExecuteUserScriptError which doesn't tell you much, but in other cases this error code without details was due to resource errors.不幸的是，SageMaker 日志仅显示ExecuteUserScriptError并不能告诉您太多信息，但在其他情况下，此错误代码没有详细信息是由于资源错误造成的。

One way to test this is to increase the size of your sagemaker instance to one with bigger memory.对此进行测试的一种方法是将 sagemaker 实例的大小增加到具有更大 memory 的实例。

Or, you could refrain from loading all of your training data into memory at once.或者，您可以避免一次将所有训练数据加载到 memory 中。 It looks like your input CSV data is already split up into files.看起来您的输入 CSV 数据已经拆分为文件。 Have you considered programming a loop over all of these files to train from them one-by-one?您是否考虑过在所有这些文件上编写一个循环来逐个训练它们？ That way you don't have to store all of the features in memory at once.这样您就不必一次将所有功能存储在 memory 中。

for file in input_files:
    raw_data_block = pd.read_csv(file,header=None,engine="python")
    # training code for raw_data_block here.

我在 AWS SageMaker 中训练 model 时遇到问题，在需要保存 model 之前一切都很好

问题描述

1 个解决方案

解决方案1
0 2020-07-21 15:32:57

我在 AWS SageMaker 中训练 model 时遇到问题，在需要保存 model 之前一切都很好

问题描述

1 个解决方案

解决方案1 0 2020-07-21 15:32:57

解决方案1
0 2020-07-21 15:32:57