
AWS Sagemaker inference endpoint not utilizing all vCPUs

I have deployed a custom model on a SageMaker inference endpoint (single instance), and while load testing I observed that the CPUUtilization metric maxes out at 100%, although according to this post it should max out at #vCPUs * 100%. I have confirmed in the CloudWatch logs that the inference endpoint is not using all cores.

So if one prediction call takes one second to process, the deployed model can handle only one API call per second, whereas it could handle eight calls per second if all vCPUs were used.

Are there any settings in AWS Sagemaker deployment to use all vCPUs to increase concurrency?

Or could we use the Python multiprocessing package inside the inference.py file when deploying, so that each call arrives on the default core and the calculation/prediction is then dispatched to whichever other core is idle at that instant? Something like the sketch below.
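
To make that idea concrete, here is a minimal, hypothetical sketch of such an inference.py. It assumes the SageMaker inference-handler convention (a predict_fn entry point) and a picklable model object; _heavy_predict is an assumed CPU-bound helper, not part of any real API, and whether this actually raises throughput depends on the serving stack accepting concurrent requests.

import os
from concurrent.futures import ProcessPoolExecutor

# One worker process per vCPU, created once at module import.
_pool = ProcessPoolExecutor(max_workers=os.cpu_count())

def _heavy_predict(model, data):
    # Placeholder for the actual CPU-bound prediction work.
    return model.predict(data)

def predict_fn(input_data, model):
    # Hand the CPU-bound work to the pool so it runs on whichever
    # core is free; the serving worker only waits for the result.
    # Note: model and input_data must be picklable to cross the
    # process boundary, which adds per-call serialization overhead.
    return _pool.submit(_heavy_predict, model, input_data).result()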

UPDATE


As for the question "Are there any settings in AWS Sagemaker deployment to use all vCPUs to increase concurrency?": there are various settings you can use. For models, you can set default_workers_per_model in config.properties, or set TS_DEFAULT_WORKERS_PER_MODEL=$(nproc --all) as an environment variable. Environment variables take top priority.
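
For example, with the SageMaker Python SDK you can pass the variable through the model's env dict at deployment time. This is a minimal sketch with placeholder S3 path, IAM role, and framework versions; since $(nproc --all) cannot be evaluated at deploy time from the SDK, the worker count is hardcoded to match the instance type (ml.m5.2xlarge has 8 vCPUs):

from sagemaker.pytorch import PyTorchModel

model = PyTorchModel(
    model_data="s3://my-bucket/model.tar.gz",      # placeholder artifact path
    role="arn:aws:iam::123456789012:role/MyRole",  # placeholder IAM role
    entry_point="inference.py",
    framework_version="1.12",
    py_version="py38",
    env={"TS_DEFAULT_WORKERS_PER_MODEL": "8"},     # one worker per vCPU on ml.m5.2xlarge
)
model.deploy(initial_instance_count=1, instance_type="ml.m5.2xlarge")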

Other than that, for each model you can set the number of workers using the management API, but sadly it is not possible to curl the management API in SageMaker, so TS_DEFAULT_WORKERS_PER_MODEL is the best bet. Setting this should ensure all cores are used.

But if you are using a Dockerfile, then in the entrypoint you can set up a script that waits for the model to load and then curls the management API to set the number of workers:

# load the model (quote the URL, otherwise the shell treats & as a background operator)
curl -X POST "localhost:8081/models?url=model_1.mar&batch_size=8&max_batch_delay=50"
# after loading the model it is possible to set min_worker, etc.
curl -v -X PUT "http://localhost:8081/models/model_1?min_worker=1"
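
The "wait for the model to load" part can also be done in Python instead of shell. Below is a rough sketch using the requests package against TorchServe's documented management API; the model name model_1 matches the example above, and synchronous=true makes the scale-up call block until the workers are actually ready:

import time
import requests

MGMT = "http://localhost:8081"

# Poll the management API until TorchServe reports the model as registered.
while True:
    try:
        if requests.get(f"{MGMT}/models/model_1").status_code == 200:
            break
    except requests.ConnectionError:
        pass  # server not up yet
    time.sleep(1)

# Scale workers once the model is available.
requests.put(f"{MGMT}/models/model_1",
             params={"min_worker": "8", "synchronous": "true"})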

About the other issue, that the logs confirm not all cores are used: I faced the same issue and believe it is a problem in the logging system. Please look at this issue: https://github.com/pytorch/serve/issues/782 . The community itself agrees that if the number of threads is not set, it prints 0 by default, even though by default it uses 2 * num_cores.

For an exhaustive set of all possible configs:

# Reference: https://github.com/pytorch/serve/blob/master/docs/configuration.md
# Variables that can be configured through config.properties and Environment Variables
# NOTE: Variables which can be configured through environment variables **SHOULD** have a
# "TS_" prefix
# debug
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
model_store=/opt/ml/model
load_models=model_1.mar
# blacklist_env_vars
# default_workers_per_model
# default_response_timeout
# unregister_model_timeout
# number_of_netty_threads
# netty_client_threads
# job_queue_size
# number_of_gpu
# async_logging
# cors_allowed_origin
# cors_allowed_methods
# cors_allowed_headers
# decode_input_request
# keystore
# keystore_pass
# keystore_type
# certificate_file
# private_key_file
# max_request_size
# max_response_size
# default_service_handler
# service_envelope
# model_server_home
# snapshot_store
# prefer_direct_buffer
# allowed_urls
# install_py_dep_per_model
# metrics_format
# enable_metrics_api
# initial_worker_port

# Configuration which are not documented or enabled through environment variables

# When below variable is set true, then the variables set in environment have higher precedence.
# For example, the value of an environment variable overrides both command line arguments and a property in the configuration file. The value of a command line argument overrides a value in the configuration file.
# When set to false, environment variables are not used at all
# use_native_io=
# io_ratio=
# metric_time_interval=
enable_envvars_config=true
# model_snapshot=
# version=
