AWS Sagemaker inference endpoint not utilizing all vCPUs

Question

I have deployed a custom model on a SageMaker inference endpoint (single instance). While load testing, I observed that the CPU utilization metric maxes out at 100%, but according to this post it should max out at #vCPU * 100%. I have confirmed in the CloudWatch logs that the inference endpoint is not using all cores.

So if one prediction call takes one second to process, the deployed model can handle only one API call per second, whereas it could handle 8 calls per second if all vCPUs were used.

Are there any settings in AWS Sagemaker deployment to use all vCPUs to increase concurrency?

Or could we use the Python multiprocessing package inside the inference.py file, so that each call arrives on the default core and the prediction work is then handed to whichever other core is free at that moment?

Answer 1

UPDATE

Set three environment variables:
1. ENABLE_MULTI_MODEL to "true" (make sure it is a string and not a bool).
2. SAGEMAKER_HANDLER to the Python module path of your custom model handler if you use a custom service; otherwise do not define it. Also make sure the model archive is named model.mar before compressing it as a tarball and storing it in S3.
3. TS_DEFAULT_WORKERS_PER_MODEL to the number of vCPUs.
The first variable makes sure the TorchServe env_vars are enabled; TS_DEFAULT_WORKERS_PER_MODEL relies on that setting and loads the requested number of workers. These settings can be applied by passing the env dictionary argument to the PyTorch function. Below is an explanation of why it works.
From the looks of it, a SageMaker deployment of a PyTorch model, as described in the SageMaker SDK guide, uses this Dockerfile. In this Docker image, the entrypoint is torchserve-entrypoint.py, as set on line 124 of the Dockerfile.
This torchserve-entrypoint.py calls serving.main() from serving.py, which ends up calling torchserve.start_torchserve(handler_service=HANDLER_SERVICE) from torchserve.py.
At line 34, torchserve.py defines "/etc/default-ts.properties" as DEFAULT_TS_CONFIG_FILE. This file is located here. In this file, enable_envvars_config=true is set. The container uses this file if and only if the environment variable ENABLE_MULTI_MODEL is set to "false", as referred to here. If it is set to "true", it uses /etc/mme-ts.properties instead.
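
The environment variables listed above can be assembled as a plain dictionary of strings. This is only a minimal sketch under stated assumptions: the helper name build_torchserve_env and the example handler path are hypothetical, and only the three variable names come from this answer.

```python
from typing import Dict, Optional


def build_torchserve_env(workers: int, handler: Optional[str] = None) -> Dict[str, str]:
    """Hypothetical helper: build the env dictionary described in this answer."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    env = {
        # Must be the string "true", not the Python bool True.
        "ENABLE_MULTI_MODEL": "true",
        # Intended to be the number of vCPUs on the endpoint instance.
        "TS_DEFAULT_WORKERS_PER_MODEL": str(workers),
    }
    if handler:
        # Only defined when a custom service/handler module is used.
        env["SAGEMAKER_HANDLER"] = handler
    return env


# Example for an 8-vCPU instance without a custom handler.
env = build_torchserve_env(workers=8)
print(env)
```

The resulting dictionary is what would be passed as the env argument when the model object is created with the SageMaker Python SDK.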

As for the question "Are there any settings in AWS Sagemaker deployment to use all vCPUs to increase concurrency?": there are various settings you can use. For models, you can set default_workers_per_model in config.properties, or TS_DEFAULT_WORKERS_PER_MODEL=$(nproc --all) in the environment variables. Environment variables take top priority.
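
To make the priority rule concrete, here is a toy model of the lookup order in Python. It is not TorchServe's actual code: resolve_workers is a hypothetical function that only mirrors the rule stated here (the environment variable wins over config.properties when enable_envvars_config is true).

```python
import os
from typing import Mapping, Optional


def resolve_workers(config_props: Mapping[str, str],
                    environ: Mapping[str, str],
                    envvars_enabled: bool = True) -> Optional[int]:
    """Toy lookup: environment variable first, then config.properties."""
    if envvars_enabled and "TS_DEFAULT_WORKERS_PER_MODEL" in environ:
        return int(environ["TS_DEFAULT_WORKERS_PER_MODEL"])
    if "default_workers_per_model" in config_props:
        return int(config_props["default_workers_per_model"])
    return None  # nothing configured; the server falls back to its own default


props = {"default_workers_per_model": "1"}
env_vars = {"TS_DEFAULT_WORKERS_PER_MODEL": "8"}

from_env = resolve_workers(props, env_vars)           # environment wins
from_file = resolve_workers(props, {})                # falls back to the file
env_ignored = resolve_workers(props, env_vars, envvars_enabled=False)
print(from_env, from_file, env_ignored)

# Rough Python counterpart of `nproc --all` on the machine running this code.
print(os.cpu_count())
```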

Other than that, you can set the number of workers for each model through the management API, but sadly it is not possible to curl the management API in SageMaker. So TS_DEFAULT_WORKERS_PER_MODEL is the best bet. Setting it should make sure all cores are used.

But if you are using your own Dockerfile, the entrypoint can run a script that waits for the model to load and then curls the management API to set the number of workers:

# load the model (the URL is quoted so the shell does not treat "&" as a background operator)
curl -X POST "http://localhost:8081/models?url=model_1.mar&batch_size=8&max_batch_delay=50"
# after loading the model it is possible to set min_worker, etc.
curl -v -X PUT "http://localhost:8081/models/model_1?min_worker=1"
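
The same two management API calls can be prepared from Python with the standard library, which avoids shell quoting problems with "&". This is a sketch only: the helper names are hypothetical, the address and parameters are the ones used in the curl commands, and nothing is sent unless send() is called against a running server.

```python
import urllib.parse
import urllib.request

MANAGEMENT_URL = "http://localhost:8081"  # management address used in this answer


def register_request(mar_file: str, batch_size: int, max_batch_delay: int) -> urllib.request.Request:
    """Build the POST request that loads a model archive."""
    query = urllib.parse.urlencode({
        "url": mar_file,
        "batch_size": batch_size,
        "max_batch_delay": max_batch_delay,
    })
    return urllib.request.Request(f"{MANAGEMENT_URL}/models?{query}", method="POST")


def scale_request(model_name: str, min_worker: int) -> urllib.request.Request:
    """Build the PUT request that sets min_worker for a loaded model."""
    query = urllib.parse.urlencode({"min_worker": min_worker})
    name = urllib.parse.quote(model_name, safe="")
    return urllib.request.Request(f"{MANAGEMENT_URL}/models/{name}?{query}", method="PUT")


def send(request: urllib.request.Request, timeout: float = 30.0) -> str:
    """Send a prepared request; requires a server listening on MANAGEMENT_URL."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


load = register_request("model_1.mar", batch_size=8, max_batch_delay=50)
scale = scale_request("model_1", min_worker=1)
print(load.get_method(), load.full_url)
print(scale.get_method(), scale.full_url)
```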

As for the other issue, that the logs show not all cores being used: I face the same issue and believe it is a problem in the logging system. Please look at this issue: https://github.com/pytorch/serve/issues/782. The community itself agrees that if the number of threads is not set, the log prints 0 by default, even though 2*num_cores is used by default.

For an exhaustive list of all possible configs:

# Reference: https://github.com/pytorch/serve/blob/master/docs/configuration.md
# Variables that can be configured through config.properties and Environment Variables
# NOTE: Variables which can be configured through environment variables **SHOULD** have a
# "TS_" prefix
# debug
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
model_store=/opt/ml/model
load_models=model_1.mar
# blacklist_env_vars
# default_workers_per_model
# default_response_timeout
# unregister_model_timeout
# number_of_netty_threads
# netty_client_threads
# job_queue_size
# number_of_gpu
# async_logging
# cors_allowed_origin
# cors_allowed_methods
# cors_allowed_headers
# decode_input_request
# keystore
# keystore_pass
# keystore_type
# certificate_file
# private_key_file
# max_request_size
# max_response_size
# default_service_handler
# service_envelope
# model_server_home
# snapshot_store
# prefer_direct_buffer
# allowed_urls
# install_py_dep_per_model
# metrics_format
# enable_metrics_api
# initial_worker_port

# Configuration options which are not documented or enabled through environment variables

# use_native_io=
# io_ratio=
# metric_time_interval=
# When the variable below is set to true, variables set in the environment have higher precedence.
# For example, the value of an environment variable overrides both command line arguments and a property in the configuration file. The value of a command line argument overrides a value in the configuration file.
# When set to false, environment variables are not used at all.
enable_envvars_config=true
# model_snapshot=
# version=
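
The "TS_" prefix rule from the listing can be illustrated with a small Python sketch. It assumes the environment variable name is the upper-cased property key with "TS_" in front, as with default_workers_per_model and TS_DEFAULT_WORKERS_PER_MODEL in this answer. The parser is deliberately simplified (key=value lines and "#" comments only), and both function names are hypothetical.

```python
from typing import Dict

SAMPLE = """\
# debug
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
model_store=/opt/ml/model
load_models=model_1.mar
# default_workers_per_model
"""


def parse_properties(text: str) -> Dict[str, str]:
    """Collect key=value pairs, skipping blank lines and '#' comments."""
    props: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def env_name(key: str) -> str:
    """Assumed mapping: property key to its TS_-prefixed environment variable."""
    return "TS_" + key.upper()


props = parse_properties(SAMPLE)
for key, value in props.items():
    print(f"{env_name(key)}={value}")
```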

1 answer

solution1
4 ACCEPTED 2021-06-23 02:40:05
