Couldn't create model.tar.gz file while training scikit-learn model in AWS SageMaker

I want to create an endpoint for a scikit-learn logistic regression model in AWS SageMaker. I have a train.py file that contains the training code for the SageMaker scikit-learn estimator:

import subprocess as sb
import pandas as pd
import numpy as np
import pickle,json
import sys

def install(package):
    sb.call([sys.executable, "-m", "pip", "install", package])


import argparse
import os

if __name__ =='__main__':
    parser = argparse.ArgumentParser()

    # hyperparameters sent by the client are passed as command-line arguments to the script.
    parser.add_argument('--solver', type=str, default='liblinear')

    # Data, model, and output directories
    parser.add_argument('--output_data_dir', type=str, default=os.environ.get('SM_OUTPUT_DIR'))
    parser.add_argument('--model_dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))

    args, _ = parser.parse_known_args()

    # ... load from args.train and args.test, train a model, write model to args.model_dir.

    input_files = [ os.path.join(args.train, file) for file in os.listdir(args.train) ]
    if len(input_files) == 0:
        raise ValueError(('There are no files in {}.\n' +
                          'This usually indicates that the channel ({}) was incorrectly specified,\n' +
                          'the data specification in S3 was incorrectly specified or the role specified\n' +
                          'does not have permission to access the data.').format(args.train, "train"))

    raw_data = [ pd.read_csv(file, header=None, engine="python") for file in input_files ]

    df = pd.concat(raw_data)

    y = df.iloc[:,0]
    X = df.iloc[:,1:]

    solver = args.solver
    from sklearn.linear_model import LogisticRegression
    lr = LogisticRegression(solver=solver).fit(X, y)
from sklearn.externals import joblib
def model_fn(model_dir):
    lr = joblib.dump(lr, "model.joblib")
    return lr

In my sagemaker notebook I ran the following code

import os
import boto3
import re
import copy
import time
from time import gmtime, strftime
from sagemaker import get_execution_role
import sagemaker

role = get_execution_role()

region = boto3.Session().region_name

bucket=<bucket> # Replace with your s3 bucket name
prefix = <prefix>

output_path = 's3://{}/{}/{}'.format(bucket, prefix,'output_data_dir')
train_data = 's3://{}/{}/{}'.format(bucket, prefix, 'train')
train_channel = sagemaker.session.s3_input(train_data, content_type='text/csv')

from sagemaker.sklearn.estimator import SKLearn
sklearn = SKLearn(
    role=role,output_path = output_path,

I'm fitting my model here

sklearn.fit({'train': train_channel})

Now, for creating endpoint,

from sagemaker.predictor import csv_serializer
predictor = sklearn.deploy(1, 'ml.m4.xlarge')

While trying to create endpoint, it is throwing

ClientError: An error occurred (ValidationException) when calling the CreateModel operation: Could not find model data at s3://<bucket>/<prefix>/output_data_dir/sagemaker-scikit-learn-x-y-z-000/output/model.tar.gz.

I checked my S3 bucket. Inside my output_data_dir there is a sagemaker-scikit-learn-xyz-000 directory that contains only debug-output/training_job_end.ts. An additional directory named sagemaker-scikit-learn-xyz-000 was created outside my <prefix> folder, and it contains source/sourcedir.tar.gz. Whenever I trained models with SageMaker built-in algorithms, a file like output_data_dir/sagemaker-scikit-learn-xyz-000/output/model.tar.gz was created. Can someone please tell me where my scikit-learn model was stored, how to get source/sourcedir.tar.gz placed inside my prefix without doing it manually, and how to see the contents of sourcedir.tar.gz?

Edit: I elaborated the question regarding the prefix. Whenever I run sklearn.fit(), two directories with the same name sagemaker-scikit-learn-xyz-000 are created in my S3 bucket. One is created inside my prefix, at <bucket>/<prefix>/output_data_dir/sagemaker-scikit-learn-xyz-000/debug-output/training_job_end.ts, and the other is created at <bucket>/sagemaker-scikit-learn-xyz-000/source/sourcedir.tar.gz. Why is the second one not created inside my <prefix> like the first one? What does sourcedir.tar.gz contain?

I am not sure your model is really being stored if you can't find it in S3. You define a function that calls joblib.dump in your entry point script, whereas I have the call at the end of the main block. For example:

# persist model
path = os.path.join(args.model_dir, "model.joblib")
joblib.dump(myestimator, path)
print('model persisted at ' + path)

Then the file can be found in ../output/model.tar.gz, just as in your other cases. To double-check that it is created, you may want to add a print statement like the one above, which will show up in the training log.
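
Once an archive exists, one way to inspect it (this applies equally to the sourcedir.tar.gz asked about in the question) is to download it and list its members. The following is a minimal sketch using only the standard library; the function name list_archive_members and the local file name are illustrative assumptions, not part of the SageMaker SDK:

```python
import tarfile


def list_archive_members(archive_path):
    """Return the names of all entries in a gzip-compressed tar archive."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        return archive.getnames()
```

After copying the archive from S3 to the notebook instance, calling list_archive_members("model.tar.gz") should show whether the model file actually made it into the archive.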

You must dump the model as the last step of your training code. Currently you are doing it in the wrong place: the purpose of model_fn is to load the model for inference, not to save it during training.

  1. Add the dump after training:

     lr = LogisticRegression(solver=solver).fit(X, y)
     joblib.dump(lr, os.path.join(args.model_dir, "model.joblib"))
  2. Change model_fn() to load the model instead of dumping it.
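
Put together, the two steps might look like the sketch below. It assumes the standalone joblib package is available (the question imports it from sklearn.externals, which is how older scikit-learn versions exposed it); the helper save_model and the file name model.joblib are illustrative choices, and the only requirement is that both functions agree on the file name:

```python
import os

import joblib  # assumption: standalone joblib is installed

MODEL_FILE = "model.joblib"  # hypothetical name; must match in both functions


def save_model(model, model_dir):
    """Call at the end of training: write the fitted model into model_dir."""
    path = os.path.join(model_dir, MODEL_FILE)
    joblib.dump(model, path)
    return path


def model_fn(model_dir):
    """Load the model that was saved during training, for use at inference."""
    return joblib.load(os.path.join(model_dir, MODEL_FILE))
```

In train.py, save_model(lr, args.model_dir) would go right after the fit call inside the main block, while model_fn stays at module level.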


This post here explains it well: https://towardsdatascience.com/deploying-a-pre-trained-sklearn-model-on-amazon-sagemaker-826a2b5ac0b6

In short, model.tar.gz is created by tar-gzipping the model.joblib binary, which is itself first created by joblib.dump. To quote the article:

#Build tar file with model data + inference code
bashCommand = "tar -cvpzf model.tar.gz model.joblib inference.py"
process = subprocess.Popen(bashCommand.split(), stdout=subprocess.PIPE)
output, error = process.communicate()

The inference.py is probably optional.
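
The same archive can also be built without shelling out, using Python's tarfile module. This is only a sketch under the assumption that model.joblib already exists locally; build_model_archive is a hypothetical helper name, not something from the article or the SageMaker SDK:

```python
import os
import tarfile


def build_model_archive(file_paths, archive_path="model.tar.gz"):
    """Pack the given files at the top level of a gzip-compressed tar archive."""
    with tarfile.open(archive_path, mode="w:gz") as archive:
        for path in file_paths:
            # arcname drops any leading directories so files sit at the archive root
            archive.add(path, arcname=os.path.basename(path))
    return archive_path
```

For example, build_model_archive(["model.joblib"]) would produce a model.tar.gz containing just the serialized model; inference.py can be added to the list if it is needed.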

