Sagemaker deploy model with inference code and requirements

I trained a TensorFlow model and now I would like to deploy it. The data needs to be processed, so I have to specify an inference.py script and a requirements.txt file. When I deploy the model, it fails with the following error:

Failed Reason: The primary container for production variant All Traffic did not pass the ping health check. Please check CloudWatch logs for this endpoint.

I am not using any VPC, and when I try to download a Python package from the notebook instance it works without any error. There seems to be a problem with the connection, and apparently the dependencies cannot be installed. What can I do?
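For reference, my inference.py follows the input_handler/output_handler interface of the SageMaker TensorFlow Serving container. A minimal sketch of its structure (the preprocessing body below is only illustrative; the real script imports librosa, as the log shows):

import json
import numpy as np
import librosa  # this import is what fails when requirements.txt is not installed

def input_handler(data, context):
    """Pre-process the request body into the JSON payload TensorFlow Serving expects."""
    if context.request_content_type == 'application/json':
        payload = json.loads(data.read().decode('utf-8'))
        audio = np.asarray(payload['audio'], dtype=np.float32)
        features = librosa.util.normalize(audio)  # placeholder for the real feature extraction
        return json.dumps({'instances': [features.tolist()]})
    raise ValueError('Unsupported content type: {}'.format(context.request_content_type))

def output_handler(response, context):
    """Pass the TensorFlow Serving response back to the client."""
    if response.status_code != 200:
        raise ValueError(response.content.decode('utf-8'))
    return response.content, 'application/json'

The CloudWatch log from the failed endpoint is the following: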

INFO:__main__:PYTHON SERVICE: True
INFO:__main__:starting services
INFO:__main__:using default model name: model
INFO:__main__:tensorflow serving model config: 
model_config_list: {
  config: {
    name: 'model'
    base_path: '/opt/ml/model'
    model_platform: 'tensorflow'
    model_version_policy: {
      specific: {
        versions: 1
      }
    }
  }
}

INFO:__main__:tensorflow version info:
2021-07-15 14:48:01.085492: W external/org_tensorflow/tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
2021-07-15 14:48:01.087774: W external/org_tensorflow/tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. The timeline writer thread will not be started, future recorded events will be dropped.
TensorFlow ModelServer: 2.4.0-rc4+dev.sha.no_git
TensorFlow Library: 2.4.1
INFO:__main__:tensorflow serving command: tensorflow_model_server --port=15000 --rest_api_port=15001 --model_config_file=/sagemaker/model-config.cfg --max_num_load_retries=0    
INFO:__main__:started tensorflow serving (pid: 17)
INFO:tfs_utils:Trying to connect with model server: http://localhost:15001/v1/models/model
WARNING:urllib3.connectionpool:Retrying (Retry(total=8, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f0400b14d90>: Failed to establish a new connection: [Errno 111] Connection refused')': /v1/models/model
2021-07-15 14:48:01.589503: W external/org_tensorflow/tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.
2021-07-15 14:48:01.589629: W external/org_tensorflow/tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. The timeline writer thread will not be started, future recorded events will be dropped.
2021-07-15 14:48:01.596877: I tensorflow_serving/model_servers/server_core.cc:464] Adding/updating models.
2021-07-15 14:48:01.596910: I tensorflow_serving/model_servers/server_core.cc:587]  (Re-)adding model: model
2021-07-15 14:48:01.698159: I tensorflow_serving/util/retrier.cc:46] Retrying of Reserving resources for servable: {name: model version: 1} exhausted max_num_retries: 0
2021-07-15 14:48:01.698222: I tensorflow_serving/core/basic_manager.cc:740] Successfully reserved resources to load servable {name: model version: 1}
2021-07-15 14:48:01.698242: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: model version: 1}
2021-07-15 14:48:01.698259: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: model version: 1}
2021-07-15 14:48:01.698325: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:32] Reading SavedModel from: /opt/ml/model/000000001
2021-07-15 14:48:01.716135: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:55] Reading meta graph with tags { serve }
2021-07-15 14:48:01.716197: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:93] Reading SavedModel debug info (if present) from: /opt/ml/model/000000001
2021-07-15 14:48:01.720153: I external/org_tensorflow/tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 4. Tune using inter_op_parallelism_threads for best performance.
WARNING:urllib3.connectionpool:Retrying (Retry(total=7, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f0400b29450>: Failed to establish a new connection: [Errno 111] Connection refused')': /v1/models/model
2021-07-15 14:48:01.825477: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:206] Restoring SavedModel bundle.
2021-07-15 14:48:01.833263: I external/org_tensorflow/tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2300060000 Hz
2021-07-15 14:48:01.971809: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:190] Running initialization op on SavedModel bundle at path: /opt/ml/model/000000001
2021-07-15 14:48:01.989851: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:277] SavedModel load for tags { serve }; Status: success: OK. Took 291516 microseconds.
2021-07-15 14:48:01.992763: I tensorflow_serving/servables/tensorflow/saved_model_warmup_util.cc:59] No warmup data file found at /opt/ml/model/000000001/assets.extra/tf_serving_warmup_requests
2021-07-15 14:48:01.994208: I tensorflow_serving/util/retrier.cc:46] Retrying of Loading servable: {name: model version: 1} exhausted max_num_retries: 0
2021-07-15 14:48:01.994232: I tensorflow_serving/core/loader_harness.cc:87] Successfully loaded servable version {name: model version: 1}
2021-07-15 14:48:02.003695: I tensorflow_serving/model_servers/server.cc:371] Running gRPC ModelServer at 0.0.0.0:15000 ...
[warn] getaddrinfo: address family for nodename not supported
2021-07-15 14:48:02.006269: I tensorflow_serving/model_servers/server.cc:391] Exporting HTTP/REST API at:localhost:15001 ...
[evhttp_server.cc : 238] NET_LOG: Entering the event loop ...
WARNING:urllib3.connectionpool:Retrying (Retry(total=6, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f0400b29a10>: Failed to establish a new connection: [Errno 111] Connection refused')': /v1/models/model
INFO:tfs_utils:<Response [200]>
INFO:tfs_utils:model: http://localhost:15001/v1/models/model is available now
INFO:__main__:nginx config: 
load_module modules/ngx_http_js_module.so;
worker_processes auto;
daemon off;
pid /tmp/nginx.pid;
error_log  /dev/stderr error;
worker_rlimit_nofile 4096;
events {
  worker_connections 2048;
}
http {
  include /etc/nginx/mime.types;
  default_type application/json;
  access_log /dev/stdout combined;
  js_include tensorflow-serving.js;

  upstream tfs_upstream {
    server localhost:15001;
  }

  upstream gunicorn_upstream {
    server unix:/tmp/gunicorn.sock fail_timeout=1;
  }

  server {
    listen 8080 deferred;
    client_max_body_size 0;
    client_body_buffer_size 100m;
    subrequest_output_buffer_size 100m;

    set $tfs_version 2.4;
    set $default_tfs_model model;

    location /tfs {
        rewrite ^/tfs/(.*) /$1  break;
        proxy_redirect off;
        proxy_pass_request_headers off;
        proxy_set_header Content-Type 'application/json';
        proxy_set_header Accept 'application/json';
        proxy_pass http://tfs_upstream;
    }

    location /ping {
        proxy_pass http://gunicorn_upstream/ping;
    }

    location /invocations {
        proxy_pass http://gunicorn_upstream/invocations;
    }

    location /models {
        proxy_pass http://gunicorn_upstream/models;
    }

    location / {
        return 404 '{"error": "Not Found"}';
    }

    keepalive_timeout 3;
  }
}

INFO:__main__:gunicorn command: gunicorn -b unix:/tmp/gunicorn.sock -k gevent --chdir /sagemaker --workers 1 --threads 1 --pythonpath /opt/ml/model/code,/opt/ml/model/code/lib -e TFS_GRPC_PORT_RANGE=15000-15002 -e TFS_REST_PORT_RANGE=15001-15003 -e SAGEMAKER_MULTI_MODEL=False -e SAGEMAKER_SAFE_PORT_RANGE=15000-15999 -e SAGEMAKER_TFS_WAIT_TIME_SECONDS=300 python_service:app
INFO:__main__:gunicorn version info:
gunicorn (version 20.0.4)
INFO:__main__:started gunicorn (pid: 72)
[2021-07-15 14:48:02 +0000] [72] [INFO] Starting gunicorn 20.0.4
[2021-07-15 14:48:02 +0000] [72] [INFO] Listening at: unix:/tmp/gunicorn.sock (72)
INFO:__main__:gunicorn server is ready!
[2021-07-15 14:48:02 +0000] [72] [INFO] Using worker: gevent
[2021-07-15 14:48:02 +0000] [76] [INFO] Booting worker with pid: 76
INFO:__main__:nginx version info:
nginx version: nginx/1.20.0
built by gcc 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04) 
built with OpenSSL 1.1.1  11 Sep 2018
TLS SNI support enabled
configure arguments: --prefix=/etc/nginx --sbin-path=/usr/sbin/nginx --modules-path=/usr/lib/nginx/modules --conf-path=/etc/nginx/nginx.conf --error-log-path=/var/log/nginx/error.log --http-log-path=/var/log/nginx/access.log --pid-path=/var/run/nginx.pid --lock-path=/var/run/nginx.lock --http-client-body-temp-path=/var/cache/nginx/client_temp --http-proxy-temp-path=/var/cache/nginx/proxy_temp --http-fastcgi-temp-path=/var/cache/nginx/fastcgi_temp --http-uwsgi-temp-path=/var/cache/nginx/uwsgi_temp --http-scgi-temp-path=/var/cache/nginx/scgi_temp --user=nginx --group=nginx --with-compat --with-file-aio --with-threads --with-http_addition_module --with-http_auth_request_module --with-http_dav_module --with-http_flv_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_mp4_module --with-http_random_index_module --with-http_realip_module --with-http_secure_link_module --with-http_slice_module --with-http_ssl_module --with-http_stub_status_module --with-http_sub_module --with-http_v2_module --with-mail --with-mail_ssl_module --with-stream --with-stream_realip_module --with-stream_ssl_module --with-stream_ssl_preread_module --with-cc-opt='-g -O2 -fdebug-prefix-map=/data/builder/debuild/nginx-1.20.0/debian/debuild-base/nginx-1.20.0=. -fstack-protector-strong -Wformat -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fPIC' --with-ld-opt='-Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-z,now -Wl,--as-needed -pie'
INFO:__main__:started nginx (pid: 77)
INFO:python_service:Creating grpc channel for port: 15000
[2021-07-15 14:48:03 +0000] [76] [ERROR] Exception in worker process
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/gunicorn/arbiter.py", line 583, in spawn_worker
    worker.init_process()
  File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/ggevent.py", line 162, in init_process
    super().init_process()
  File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/base.py", line 119, in init_process
    self.load_wsgi()
  File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/base.py", line 144, in load_wsgi
    self.wsgi = self.app.wsgi()
  File "/usr/local/lib/python3.7/site-packages/gunicorn/app/base.py", line 67, in wsgi
    self.callable = self.load()
  File "/usr/local/lib/python3.7/site-packages/gunicorn/app/wsgiapp.py", line 49, in load
    return self.load_wsgiapp()
  File "/usr/local/lib/python3.7/site-packages/gunicorn/app/wsgiapp.py", line 39, in load_wsgiapp
    return util.import_app(self.app_uri)
  File "/usr/local/lib/python3.7/site-packages/gunicorn/util.py", line 358, in import_app
    mod = importlib.import_module(module)
  File "/usr/local/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/sagemaker/python_service.py", line 414, in <module>
    resources = ServiceResources()
  File "/sagemaker/python_service.py", line 400, in __init__
    self._python_service_resource = PythonServiceResource()
  File "/sagemaker/python_service.py", line 83, in __init__
    self._handler, self._input_handler, self._output_handler = self._import_handlers()
  File "/sagemaker/python_service.py", line 278, in _import_handlers
    spec.loader.exec_module(inference)
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/opt/ml/model/code/inference.py", line 1, in <module>
    import librosa
ModuleNotFoundError: No module named 'librosa'
[2021-07-15 14:48:03 +0000] [76] [INFO] Worker exiting (pid: 76)
[2021-07-15 14:48:03 +0000] [72] [INFO] Shutting down: Master
[2021-07-15 14:48:03 +0000] [72] [INFO] Reason: Worker failed to boot.

The code I used is this:

from sagemaker.tensorflow.serving import Model

model = Model(entry_point='inference.py',
              dependencies=['requirements.txt'],
              model_data=bucket,
              role=role,
              sagemaker_session=sagemaker_session,
              framework_version='2.4.1')
predictor = model.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")

Since you have already trained your model outside of SageMaker, you want to focus on just deployment/inference. Thus, you want to store your model artifacts in S3 in tar.gz format. The correct API call that you want to be working with is the following code block.

from sagemaker.tensorflow import TensorFlowModel

model = TensorFlowModel(model_data='s3://mybucket/model.tar.gz', role='MySageMakerRole')

predictor = model.deploy(initial_instance_count=1, instance_type='ml.c5.xlarge')
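If you have trained and saved the model locally, a minimal sketch for packaging and uploading the artifacts could look like this (the local paths and bucket name are assumptions; the version directory name mirrors the 000000001 seen in the logs above, and the code/ directory is where the container in the question looked for inference.py, so placing requirements.txt next to it is how extra dependencies such as librosa are normally shipped):

import tarfile
import sagemaker

# Expected archive layout (directory names are assumptions based on the logs above):
#   model.tar.gz
#   ├── 000000001/            <- the exported SavedModel (saved_model.pb, variables/)
#   └── code/
#       ├── inference.py
#       └── requirements.txt  <- e.g. contains "librosa"
with tarfile.open('model.tar.gz', 'w:gz') as tar:
    tar.add('export/000000001', arcname='000000001')
    tar.add('code', arcname='code')

session = sagemaker.Session()
model_data = session.upload_data('model.tar.gz', bucket='mybucket', key_prefix='model')
print(model_data)  # -> s3://mybucket/model/model.tar.gz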

Check out more information at the following link: https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/using_tf.html#deploy-tensorflow-serving-models

For preprocessing there are two approaches you can take.

  1. Have your data preprocessing done in the SageMaker notebook and then invoke the endpoint that you have created with your formatted data (a sketch of this approach follows after the link below).
  2. Take the TensorFlow container and adjust it for your own use case. This follows an example known as "Bring Your Own Model". Everything in SageMaker is Dockerized, and to bring your own inference code/logic you want to follow the container structure SageMaker has. Here is an end-to-end example of bringing a TensorFlow model to SageMaker for training and/or deployment.

https://github.com/aws/amazon-sagemaker-examples/tree/master/advanced_functionality/tensorflow_bring_your_own
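A minimal sketch of approach 1, assuming an audio use case like yours (the feature-extraction call, sample file, and endpoint name are placeholders for illustration):

import json
import boto3
import librosa

# Preprocess in the notebook (placeholder feature extraction)
audio, sr = librosa.load('sample.wav', sr=16000)
features = librosa.feature.mfcc(y=audio, sr=sr).tolist()

# Invoke the already-deployed endpoint with the formatted data
runtime = boto3.client('sagemaker-runtime')
response = runtime.invoke_endpoint(
    EndpointName='my-tf-endpoint',            # placeholder endpoint name
    ContentType='application/json',
    Body=json.dumps({'instances': [features]}),
)
print(json.loads(response['Body'].read()))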

I work for AWS & my opinions are my own.
