I've had trouble training a model in AWS SageMaker, everything is fine until the model needs to be saved

Question

I've had trouble training a model in AWS SageMaker, everything is fine until the model needs to be saved. I have tried with a 500MB dataset and everything works correctly, but when the.csv file occupies 10GB the training job fails. Next I leave my training python file and the error output, the machine used to train was ml.m5.2xlarge with a train_volume_size = 100.

File .py to train the model in SageMaker with an output of 10GB

import argparse
import pandas as pd 
import  os
import sys
from os.path import join
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import  GaussianNB
from sklearn import metrics
import numpy as np

import logging

import boto3

import time

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler(sys.stdout))

if 'SAGEMAKER_METRICS_DIRECTORY' in os.environ:
    log_file_handler = logging.FileHandler(join(os.environ['SAGEMAKER_METRICS_DIRECTORY'], "metrics.json"))
    log_file_handler.setFormatter(
    "{'time':'%(asctime)s', 'name': '%(name)s', \
    'level': '%(levelname)s', 'message': '%(message)s'}"
    )
    logger.addHandler(log_file_handler)

os.system('pip install joblib')
import joblib


if __name__ == '__main__':

    parser = argparse.ArgumentParser()

    # Adicion de hyperparametros

    # Solamente se añade el parametro lambda de regularizacion

    parser.add_argument('--regularization_lambda',type=float, default=0.0)

    # Argumentos propios de sagemaker 

    parser.add_argument('--output-data-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])
    parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])

    args = parser.parse_args()

    input_files = [ os.path.join(args.train, file) for file in os.listdir(args.train) ]
    if len(input_files) == 0:
        raise ValueError(('There are no files in {}.\n' +
                          'This usually indicates that the channel ({}) was incorrectly specified,\n' +
                          'the data specification in S3 was incorrectly specified or the role specified\n' +
                          'does not have permission to access the data.').format(args.train, "train"))

    raw_data = [pd.read_csv(file,header=None,engine="python") for file in input_files]
    train_data = pd.concat(raw_data)


    # Definicion del modelo
    model = GaussianNB()
    
    matrix = train_data.values
    
    for submatrix in np.split(matrix,np.arange(100,12100,100),axis=0):
    
        # Generacion de los datos de entrenemiento asumiendo que
        # las etiquetas estan en la primera columna
 
        train_y = submatrix[:,0]
        train_x = submatrix[:,1:]

        model = model.partial_fit(train_x,train_y,classes=np.unique(train_y))
        print('Accuracy: ', model.score(train_x, train_y))

    logger.info('Train accuracy: {:.6f};'.format(model.score(train_x, train_y)))

    # Mustra de los coeficientes y guradarlos

    joblib.dump(model, os.path.join(args.model_dir, "model.joblib"))

    def model_fn(model_dir):
    
        # Se retorna el modelo entrenado
        model = joblib.load(os.path.join(model_dir, "model.joblib"))
        return model

When the finished the output was the next error

2020-07-20 09:49:52 Starting - Starting the training job...
2020-07-20 09:49:54 Starting - Launching requested ML instances......
2020-07-20 09:50:58 Starting - Preparing the instances for training...
2020-07-20 09:51:39 Downloading - Downloading input data...............
2020-07-20 09:54:22 Training - Training image download completed. Training in progress..2020-07-20 09:54:24,234 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2020-07-20 09:54:24,236 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2020-07-20 09:54:24,246 sagemaker_sklearn_container.training INFO     Invoking user training script.
2020-07-20 09:54:24,803 sagemaker-containers INFO     Module eeg-NB-model does not provide a setup.py. 
Generating setup.py
2020-07-20 09:54:24,803 sagemaker-containers INFO     Generating setup.cfg
2020-07-20 09:54:24,803 sagemaker-containers INFO     Generating MANIFEST.in
2020-07-20 09:54:24,803 sagemaker-containers INFO     Installing module with the following command:
/miniconda3/bin/python -m pip install . 
Processing /opt/ml/code
Building wheels for collected packages: eeg-NB-model
  Building wheel for eeg-NB-model (setup.py): started
  Building wheel for eeg-NB-model (setup.py): finished with status 'done'
  Created wheel for eeg-NB-model: filename=eeg_NB_model-1.0.0-py2.py3-none-any.whl size=7074 sha256=2d6213105e4f7f707f68278b1291d2940b8de2c319f7084b322b2d4197402c33
  Stored in directory: /tmp/pip-ephem-wheel-cache-8kr3fxjv/wheels/35/24/16/37574d11bf9bde50616c67372a334f94fa8356bc7164af8ca3
Successfully built eeg-NB-model
Installing collected packages: eeg-NB-model
Successfully installed eeg-NB-model-1.0.0
2020-07-20 09:54:26,753 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2020-07-20 09:54:26,763 sagemaker-containers INFO     Invoking user script

Training Env:

{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "train": "/opt/ml/input/data/train"
    },
    "current_host": "algo-1",
    "framework_module": "sagemaker_sklearn_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {
        "regularization_lambda": 0.0
    },
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {
        "train": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        }
    },
    "input_dir": "/opt/ml/input",
    "is_master": true,
    "job_name": "sagemaker-scikit-learn-2020-07-20-09-49-52-390",
    "log_level": 20,
    "master_hostname": "algo-1",
    "model_dir": "/opt/ml/model",
    "module_dir": "s3://sagemaker-eu-west-1-798663412819/sagemaker-scikit-learn-2020-07-20-09-49-52-390/source/sourcedir.tar.gz",
    "module_name": "eeg-NB-model",
    "network_interface_name": "eth0",
    "num_cpus": 8,
    "num_gpus": 0,
    "output_data_dir": "/opt/ml/output/data",
    "output_dir": "/opt/ml/output",
    "output_intermediate_dir": "/opt/ml/output/intermediate",
    "resource_config": {
        "current_host": "algo-1",
        "hosts": [
            "algo-1"
        ],
        "network_interface_name": "eth0"
    },
    "user_entry_point": "eeg-NB-model.py"
}

Environment variables:

SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"regularization_lambda":0.0}
SM_USER_ENTRY_POINT=eeg-NB-model.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={"train":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=["train"]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=eeg-NB-model
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_sklearn_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=8
SM_NUM_GPUS=0
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-eu-west-1-798663412819/sagemaker-scikit-learn-2020-07-20-09-49-52-390/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"train":"/opt/ml/input/data/train"},"current_host":"algo-1","framework_module":"sagemaker_sklearn_container.training:main","hosts":["algo-1"],"hyperparameters":{"regularization_lambda":0.0},"input_config_dir":"/opt/ml/input/config","input_data_config":{"train":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","is_master":true,"job_name":"sagemaker-scikit-learn-2020-07-20-09-49-52-390","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-eu-west-1-798663412819/sagemaker-scikit-learn-2020-07-20-09-49-52-390/source/sourcedir.tar.gz","module_name":"eeg-NB-model","network_interface_name":"eth0","num_cpus":8,"num_gpus":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"},"user_entry_point":"eeg-NB-model.py"}
SM_USER_ARGS=["--regularization_lambda","0.0"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_CHANNEL_TRAIN=/opt/ml/input/data/train
SM_HP_REGULARIZATION_LAMBDA=0.0
PYTHONPATH=/miniconda3/bin:/miniconda3/lib/python37.zip:/miniconda3/lib/python3.7:/miniconda3/lib/python3.7/lib-dynload:/miniconda3/lib/python3.7/site-packages

Invoking script with the following command:

/miniconda3/bin/python -m eeg-NB-model --regularization_lambda 0.0


/miniconda3/lib/python3.7/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
Collecting joblib
  Downloading https://files.pythonhosted.org/packages/51/dd/0e015051b4a27ec5a58b02ab774059f3289a94b0906f880a3f9507e74f38/joblib-0.16.0-py3-none-any.whl (300kB)
Installing collected packages: joblib
Successfully installed joblib-0.16.0

2020-07-20 09:58:21 Uploading - Uploading generated training model
2020-07-20 09:58:21 Failed - Training job failed
2020-07-20 09:58:12,544 sagemaker-containers ERROR    ExecuteUserScriptError:
Command "/miniconda3/bin/python -m eeg-NB-model --regularization_lambda 0.0"
---------------------------------------------------------------------------
UnexpectedStatusException                 Traceback (most recent call last)
<ipython-input-7-267e445b3bf0> in <module>
     28 NB_training_job_name = "Naive-Bayes-training-job-{}".format(int(time.time()))
     29 
---> 30 estimator.fit({'train': train_input},wait=True)

/opt/conda/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name, experiment_config)
    463         self.jobs.append(self.latest_training_job)
    464         if wait:
--> 465             self.latest_training_job.wait(logs=logs)
    466 
    467     def _compilation_job_name(self):

/opt/conda/lib/python3.6/site-packages/sagemaker/estimator.py in wait(self, logs)
   1056         # If logs are requested, call logs_for_jobs.
   1057         if logs != "None":
-> 1058             self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
   1059         else:
   1060             self.sagemaker_session.wait_for_job(self.job_name)

/opt/conda/lib/python3.6/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll, log_type)
   3019 
   3020         if wait:
-> 3021             self._check_job_status(job_name, description, "TrainingJobStatus")
   3022             if dot:
   3023                 print()

/opt/conda/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
   2613                 ),
   2614                 allowed_statuses=["Completed", "Stopped"],
-> 2615                 actual_status=status,
   2616             )
   2617 

UnexpectedStatusException: Error for Training job sagemaker-scikit-learn-2020-07-20-09-49-52-390: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "/miniconda3/bin/python -m eeg-NB-model --regularization_lambda 0.0"

Answer 1

However, it looks like you are loading all of your data into memory in these lines:

    raw_data = [pd.read_csv(file,header=None,engine="python") for file in input_files]
    train_data = pd.concat(raw_data)

The model type you are using ml.m5.2xlarge has 32 GiB of memory . It could be that loading all of your data into memory this way is leading to an out-of-memory exception or timeout. Poke around the SageMaker / Cloudwatch logs to try to get a failure reason. Unfortunately, the SageMaker logs are only showing ExecuteUserScriptError which doesn't tell you much, but in other cases this error code without details was due to resource errors.

One way to test this is to increase the size of your sagemaker instance to one with bigger memory.

Or, you could refrain from loading all of your training data into memory at once. It looks like your input CSV data is already split up into files. Have you considered programming a loop over all of these files to train from them one-by-one? That way you don't have to store all of the features in memory at once.

for file in input_files:
    raw_data_block = pd.read_csv(file,header=None,engine="python")
    # training code for raw_data_block here.

I've had trouble training a model in AWS SageMaker, everything is fine until the model needs to be saved

Question

1 answers

solution1
0 2020-07-21 15:32:57

I've had trouble training a model in AWS SageMaker, everything is fine until the model needs to be saved

Question

1 answers

solution1 0 2020-07-21 15:32:57

solution1
0 2020-07-21 15:32:57