
How do I fix this error when deploying a model in AWS SageMaker?

I have to deploy a custom Keras model on AWS SageMaker. I have created a notebook instance and I have the following files:

AmazonSagemaker-Codeset16
   -ann
      -nginx.conf
      -predictor.py
      -serve
      -train.py
      -wsgi.py
   -Dockerfile

I now open the AWS terminal, build the Docker image, and push the image to an ECR repository. Then I open a new Jupyter Python notebook and try to fit and deploy the model. Training completes correctly, but while deploying I get the following error:

"Error hosting endpoint sagemaker-example-2019-10-25-06-11-22-366: Failed. >Reason: The primary container for production variant AllTraffic did not pass >the ping health check. Please check CloudWatch logs for this endpoint..."

When I check the logs, I find the following:

2019/11/11 11:53:32 [crit] 19#19: *3 connect() to unix:/tmp/gunicorn.sock failed (2: No such file or directory) while connecting to upstream, client: 10.32.0.4, server: , request: "GET /ping HTTP/1.1", upstream: "http://unix:/tmp/gunicorn.sock:/ping", host: "model.aws.local:8080"

and

Traceback (most recent call last):
  File "/usr/local/bin/serve", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python2.7/dist-packages/sagemaker_containers/cli/serve.py", line 19, in main
    server.start(env.ServingEnv().framework_module)
  File "/usr/local/lib/python2.7/dist-packages/sagemaker_containers/_server.py", line 107, in start
    module_app,
  File "/usr/lib/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/usr/lib/python2.7/subprocess.py", line 1343, in _execute_child
    raise child_exception

I tried deploying the same model to AWS SageMaker using these same files from my local computer and it deployed successfully, but from inside AWS I am facing this problem.

Here is my serve file code:

from __future__ import print_function
import multiprocessing
import os
import signal
import subprocess
import sys

cpu_count = multiprocessing.cpu_count()

# Serving parameters, overridable via environment variables.
model_server_timeout = os.environ.get('MODEL_SERVER_TIMEOUT', 60)
model_server_workers = int(os.environ.get('MODEL_SERVER_WORKERS', cpu_count))


# Forward shutdown signals to both child processes, then exit.
def sigterm_handler(nginx_pid, gunicorn_pid):
    try:
        os.kill(nginx_pid, signal.SIGQUIT)
    except OSError:
        pass
    try:
        os.kill(gunicorn_pid, signal.SIGTERM)
    except OSError:
        pass

    sys.exit(0)


def start_server():
    print('Starting the inference server with {} workers.'.format(model_server_workers))

    # link the log streams to stdout/err so they will be logged to the container logs
    subprocess.check_call(['ln', '-sf', '/dev/stdout', '/var/log/nginx/access.log'])
    subprocess.check_call(['ln', '-sf', '/dev/stderr', '/var/log/nginx/error.log'])

    # Start nginx (the reverse proxy on port 8080) and gunicorn (the WSGI app server).
    nginx = subprocess.Popen(['nginx', '-c', '/opt/ml/code/nginx.conf'])
    gunicorn = subprocess.Popen(['gunicorn',
                                 '--timeout', str(model_server_timeout),
                                 '-b', 'unix:/tmp/gunicorn.sock',
                                 '-w', str(model_server_workers),
                                 'wsgi:app'])

    signal.signal(signal.SIGTERM, lambda a, b: sigterm_handler(nginx.pid, gunicorn.pid))

    # If either subprocess exits, so do we.
    pids = set([nginx.pid, gunicorn.pid])
    while True:
        pid, _ = os.wait()
        if pid in pids:
            break

    # Print before shutting down; sigterm_handler() calls sys.exit() and never returns.
    print('Inference server exiting')
    sigterm_handler(nginx.pid, gunicorn.pid)


# The main routine just invokes the start function.
if __name__ == '__main__':
    start_server()

I deploy the model using the following:

predictor = classifier.deploy(1, 'ml.t2.medium', serializer=csv_serializer)
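
For reference, the endpoint would then be invoked like the sketch below; sample is a placeholder input row, not my actual data.

# Sketch: invoking the deployed endpoint. 'sample' is a placeholder row; replace
# it with feature values in the order the model expects.
sample = [0.5, 1.2, 3.4, 0.7]

result = predictor.predict(sample)  # serialized as text/csv, POSTed to /invocations
print(result)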

Kindly let me know what mistake I am making while deploying.

Using SageMaker Script Mode can be much simpler than dealing with the low-level container and nginx plumbing you're working with here. Have you considered it?
You only need to provide the Keras script:

With Script Mode, you can use training scripts similar to those you would use outside SageMaker with SageMaker's prebuilt containers for various deep learning frameworks such as TensorFlow, PyTorch, and Apache MXNet.

https://github.com/aws-samples/amazon-sagemaker-script-mode/blob/master/tf-sentiment-script-mode/sentiment-analysis.ipynb
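
A minimal sketch of what that can look like with the SDK's TensorFlow estimator (this assumes the SageMaker Python SDK v1; the instance types, framework version, and S3 path are placeholders to adjust):

import sagemaker
from sagemaker.tensorflow import TensorFlow

role = sagemaker.get_execution_role()

# train.py is an ordinary Keras/TensorFlow script; no Dockerfile, nginx.conf,
# or serve script is needed because the prebuilt container handles serving.
estimator = TensorFlow(entry_point='train.py',
                       role=role,
                       train_instance_count=1,
                       train_instance_type='ml.m5.large',
                       framework_version='1.14',
                       py_version='py3',
                       script_mode=True)

estimator.fit('s3://my-bucket/training-data')  # placeholder S3 path

predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.t2.medium')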

You should ensure that your container can respond to GET /ping requests: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html#your-algorithms-inference-algo-ping-requests
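
For a Flask-based predictor.py like the one in the AWS bring-your-own-container samples, a minimal health-check handler looks like this sketch (your actual app may differ):

import flask

app = flask.Flask(__name__)

@app.route('/ping', methods=['GET'])
def ping():
    # SageMaker marks the container healthy only if GET /ping returns HTTP 200.
    # A real handler should also verify that the model has loaded successfully.
    return flask.Response(response='\n', status=200, mimetype='application/json')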

From the traceback, it looks like the server fails to start when the container is launched within SageMaker. I would look further into the stack trace to see why the server is failing to start.

You can also try running your container locally to debug any issues. SageMaker starts your container using the command docker run <image> serve, so you could run the same command and debug your container. https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html#your-algorithms-inference-code-run-image
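
Once the container is running locally with port 8080 mapped, you can reproduce SageMaker's health check yourself, e.g. with a quick Python snippet (assumes the requests package is installed):

import requests

# SageMaker's health check: GET /ping must return HTTP 200 within the timeout.
resp = requests.get('http://localhost:8080/ping', timeout=5)
print(resp.status_code, resp.text)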

You don't have gunicorn installed; that's why you get the error /tmp/gunicorn.sock failed (2: No such file or directory). You need to add pip install gunicorn and apt-get install nginx to your Dockerfile.
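
A defensive check you could add near the top of start_server() to fail fast with a clearer log message (a sketch, not part of the original script):

from distutils.spawn import find_executable
import sys

# If gunicorn or nginx is missing from the image, subprocess.Popen fails with an
# opaque OSError; checking up front yields a clearer CloudWatch log message.
for exe in ('nginx', 'gunicorn'):
    if find_executable(exe) is None:
        sys.exit('{} is not installed in the container'.format(exe))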
