
AWS Sagemaker inference endpoint not utilizing all vCPUs

I have deployed a custom model on a SageMaker inference endpoint (single instance), and while load testing I observed that the CPUUtilization metric maxes out at 100%, although according to this post it should max out at #vCPUs * 100%. I have confirmed in the CloudWatch logs that the inference endpoint is not using all cores.

So if one prediction call takes one second to process, the deployed model can handle only one API call per second, whereas it could handle eight calls per second if all vCPUs were used.

Are there any settings in AWS Sagemaker deployment to use all vCPUs to increase concurrency?

Or could we use the Python multiprocessing package inside the inference.py file when deploying, so that each call arrives on the default core and the calculation/prediction is then dispatched to whichever other core is idle at that instant? Something like the sketch below.
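
To make that idea concrete, here is a minimal, hypothetical sketch of such an inference.py. It assumes the SageMaker inference-handler convention (a predict_fn entry point) and a picklable model object; _heavy_predict is an assumed CPU-bound helper, not part of any real API, and whether this actually raises throughput depends on the serving stack accepting concurrent requests.

import os
from concurrent.futures import ProcessPoolExecutor

# One worker process per vCPU, created once at module import.
_pool = ProcessPoolExecutor(max_workers=os.cpu_count())

def _heavy_predict(model, data):
    # Placeholder for the actual CPU-bound prediction work.
    return model.predict(data)

def predict_fn(input_data, model):
    # Hand the CPU-bound work to the pool so it runs on whichever
    # core is free; the serving worker only waits for the result.
    # Note: model and input_data must be picklable to cross the
    # process boundary, which adds per-call serialization overhead.
    return _pool.submit(_heavy_predict, model, input_data).result()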

UPDATE


As for the question "Are there any settings in AWS Sagemaker deployment to use all vCPUs to increase concurrency?": there are various settings you can use. For models, you can set default_workers_per_model in config.properties, or set TS_DEFAULT_WORKERS_PER_MODEL=$(nproc --all) as an environment variable. Environment variables take top priority.
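
For example, with the SageMaker Python SDK you can pass the variable through the model's env dict at deployment time. This is a minimal sketch with placeholder S3 path, IAM role, and framework versions; since $(nproc --all) cannot be evaluated at deploy time from the SDK, the worker count is hardcoded to match the instance type (ml.m5.2xlarge has 8 vCPUs):

from sagemaker.pytorch import PyTorchModel

model = PyTorchModel(
    model_data="s3://my-bucket/model.tar.gz",      # placeholder artifact path
    role="arn:aws:iam::123456789012:role/MyRole",  # placeholder IAM role
    entry_point="inference.py",
    framework_version="1.12",
    py_version="py38",
    env={"TS_DEFAULT_WORKERS_PER_MODEL": "8"},     # one worker per vCPU on ml.m5.2xlarge
)
model.deploy(initial_instance_count=1, instance_type="ml.m5.2xlarge")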

Other than that, for each model you can set the number of workers using the management API, but sadly it is not possible to curl the management API in SageMaker, so TS_DEFAULT_WORKERS_PER_MODEL is the best bet. Setting this should ensure all cores are used.

But if you are using a Dockerfile, then in the entrypoint you can set up a script that waits for the model to load and then curls the management API to set the number of workers:

# load the model (quote the URL, otherwise the shell treats & as a background operator)
curl -X POST "localhost:8081/models?url=model_1.mar&batch_size=8&max_batch_delay=50"
# after loading the model it is possible to set min_worker, etc.
curl -v -X PUT "http://localhost:8081/models/model_1?min_worker=1"
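
The "wait for the model to load" part can also be done in Python instead of shell. Below is a rough sketch using the requests package against TorchServe's documented management API; the model name model_1 matches the example above, and synchronous=true makes the scale-up call block until the workers are actually ready:

import time
import requests

MGMT = "http://localhost:8081"

# Poll the management API until TorchServe reports the model as registered.
while True:
    try:
        if requests.get(f"{MGMT}/models/model_1").status_code == 200:
            break
    except requests.ConnectionError:
        pass  # server not up yet
    time.sleep(1)

# Scale workers once the model is available.
requests.put(f"{MGMT}/models/model_1",
             params={"min_worker": "8", "synchronous": "true"})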

About the other issue, that the logs confirm not all cores are used: I faced the same issue and believe it is a problem in the logging system. Please look at this issue: https://github.com/pytorch/serve/issues/782 . The community itself agrees that if the number of threads is not set, it prints 0 by default, even though by default it uses 2 * num_cores.

For an exhaustive set of all possible configs:

# Reference: https://github.com/pytorch/serve/blob/master/docs/configuration.md
# Variables that can be configured through config.properties and Environment Variables
# NOTE: Variables which can be configured through environment variables **SHOULD** have a
# "TS_" prefix
# debug
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
model_store=/opt/ml/model
load_models=model_1.mar
# blacklist_env_vars
# default_workers_per_model
# default_response_timeout
# unregister_model_timeout
# number_of_netty_threads
# netty_client_threads
# job_queue_size
# number_of_gpu
# async_logging
# cors_allowed_origin
# cors_allowed_methods
# cors_allowed_headers
# decode_input_request
# keystore
# keystore_pass
# keystore_type
# certificate_file
# private_key_file
# max_request_size
# max_response_size
# default_service_handler
# service_envelope
# model_server_home
# snapshot_store
# prefer_direct_buffer
# allowed_urls
# install_py_dep_per_model
# metrics_format
# enable_metrics_api
# initial_worker_port

# Configuration which are not documented or enabled through environment variables

# When below variable is set true, then the variables set in environment have higher precedence.
# For example, the value of an environment variable overrides both command line arguments and a property in the configuration file. The value of a command line argument overrides a value in the configuration file.
# When set to false, environment variables are not used at all
# use_native_io=
# io_ratio=
# metric_time_interval=
enable_envvars_config=true
# model_snapshot=
# version=
