How to debug an invocation timeout error in SageMaker batch transform?
I am experimenting with SageMaker, using a container from the list here (https://github.com/aws/deep-learning-containers/blob/master/available_images.md) to run my model, and overriding the model_fn and predict_fn functions in an inference.py file for model loading and prediction, as shown in this example (https://github.com/PacktPublishing/Learn-Amazon-SageMaker-second-edition/blob/main/Chapter%2007/huggingface/src/torchserve-predictor.py). I keep getting an invocation timeout error: "Model server did not respond to /invocations request within 3600 seconds". Am I missing anything in my inference.py code, such as something to respond to the ping/health check?
file : inference.py
import json
import torch
from transformers import AutoConfig, AutoTokenizer, DistilBertForSequenceClassification

JSON_CONTENT_TYPE = 'application/json'

def model_fn(model_dir):
    config_path = '{}/config.json'.format(model_dir)
    model_path = '{}/pytorch_model.bin'.format(model_dir)
    config = AutoConfig.from_pretrained(config_path)
    ...

def predict_fn(input_data, model):
    # return predictions
    ...
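As a side note, the JSON_CONTENT_TYPE constant above is only useful if inference.py also defines input_fn/output_fn handlers, which the serving stack calls around predict_fn when present. A minimal sketch of such a pair (the function bodies here are an assumption, not the original code):

```python
import json

JSON_CONTENT_TYPE = "application/json"

# Hypothetical handlers: deserialize the request body before predict_fn
# and serialize its result afterwards. SageMaker's PyTorch/Hugging Face
# serving stack uses these instead of its defaults when they are defined.
def input_fn(serialized_input, content_type=JSON_CONTENT_TYPE):
    if content_type != JSON_CONTENT_TYPE:
        raise ValueError("Unsupported content type: {}".format(content_type))
    return json.loads(serialized_input)

def output_fn(prediction, accept=JSON_CONTENT_TYPE):
    # Return the serialized body together with the accept/content type.
    return json.dumps(prediction), accept
```

Neither handler affects the /ping health check, which the container answers on its own once model_fn has returned.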
The issue is not with the health checks. It is with the container not responding to the /invocations request, which can happen when the model takes longer than expected to produce predictions from the input data.
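A common mitigation is to shrink each /invocations payload and make the per-request client settings explicit. A sketch of the relevant batch transform settings (the bucket, instance type, and values are placeholders, not taken from the question):

```python
# Per-request client settings: CreateTransformJob accepts a ModelClientConfig
# with a request timeout (3600 s is the hard maximum) and a retry count.
model_client_config = {
    "InvocationsTimeoutInSeconds": 3600,
    "InvocationsMaxRetries": 1,
}

# Keep each /invocations request small so it can finish within the timeout:
# send one record per request and cap the payload size (in MB).
transform_args = {
    "strategy": "SingleRecord",
    "max_payload": 6,
    "instance_count": 1,
    "instance_type": "ml.g4dn.xlarge",       # placeholder instance type
    "output_path": "s3://my-bucket/output",  # placeholder S3 path
}
```

With the SageMaker Python SDK, transform_args map onto Model.transformer(...) keyword arguments and model_client_config onto Transformer.transform(...); with boto3, both correspond to fields of create_transform_job. If single records still time out, the bottleneck is predict_fn itself, for example the model running on CPU instead of GPU.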