
Deploy pre-trained TensorFlow model on AWS SageMaker - ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation

This is the first time I am using Amazon Web Services to deploy a pre-trained machine learning model. I want to deploy my pre-trained TensorFlow model to AWS SageMaker. I am able to deploy the endpoint successfully, but whenever I call the predictor.predict(some_data) method to make a prediction by invoking the endpoint, it throws an error.
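Roughly, the deployment and invocation code looked like this (a sketch; the S3 path, instance type, and input shape are placeholder assumptions):

 import numpy as np
 import sagemaker
 from sagemaker.tensorflow.model import TensorFlowModel

 role = sagemaker.get_execution_role()          # IAM role attached to the notebook instance
 model_data = 's3://my-bucket/model.tar.gz'     # placeholder: S3 path to the pre-trained model archive

 # Wrap the pre-trained artifact and create an endpoint
 sagemaker_model = TensorFlowModel(model_data=model_data, role=role,
                                   framework_version='1.12', entry_point='train.py')
 predictor = sagemaker_model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

 # Invoking the endpoint; this is the call that raises the ModelError below
 some_data = np.random.rand(1, 224, 224, 3).tolist()
 predictor.predict(some_data)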

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (500) from model with message "". See https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#logEventViewer:group=/aws/sagemaker/Endpoints/sagemaker-tensorflow-2020-04-07-04-25-27-055 in account 453101909370 for more information.

After going through the CloudWatch logs, I found this error.

#011details = "NodeDef mentions attr 'explicit_paddings' not in Op<name=Conv2D; signature=input:T, filter:T -> output:T; attr=T:type,allowed=[DT_HALF, DT_BFLOAT16, DT_FLOAT, DT_DOUBLE]; attr=strides:list(int); attr=use_cudnn_on_gpu:bool,default=true; attr=padding:string,allowed=["SAME", "VALID"]; attr=data_format:string,default="NHWC",allowed=["NHWC", "NCHW"]; attr=dilations:list(int),default=[1, 1, 1, 1]>; NodeDef: {{node conv1_conv/convolution}} = Conv2D[T=DT_FLOAT, _output_shapes=[[?,112,112,64]], data_format="NHWC", dilations=[1, 1, 1, 1], explicit_paddings=[], padding="VALID", strides=[1, 2, 2, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](conv1_pad/Pad, conv1_conv/kernel/read). (Check whether your GraphDef-interpreting binary is up to date with your GraphDef-generating binary.).

I don't know where I am going wrong. I have already spent two days trying to solve this error and couldn't find any information about it. I have shared the detailed logs here.

The TensorFlow version of my notebook instance is 1.15.
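This can be confirmed inside the notebook with a quick check:

 import tensorflow as tf
 print(tf.__version__)   # prints 1.15.x on this notebook instance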

After a lot of searching and trial and error, I was able to solve this problem. In many cases, the problem arises because of the TensorFlow and Python versions.

Cause of the problem: To deploy the endpoint, I was using TensorFlowModel with TF 1.12 and Python 3, which is exactly what caused the problem.

 from sagemaker.tensorflow.model import TensorFlowModel  # legacy class; its serving containers are Python 2 only
 sagemaker_model = TensorFlowModel(model_data=model_data, role=role, framework_version='1.12', entry_point='train.py')

Apparently, TensorFlowModel only allows Python 2 on TF versions 1.11, 1.12, 2.1.0.

How I fixed it: There are two TensorFlow solutions that handle serving in the SageMaker Python SDK. They have different class representations and documentation, as shown below:

  1. TensorFlowModel - https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/model.py#L47
  2. Model - https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/serving.py#L96

Python 3 isn't supported when using the TensorFlowModel object, because the container uses the TensorFlow Serving API library in conjunction with the gRPC client to handle making inferences; however, the TensorFlow Serving API isn't officially supported in Python 3, so there are only Python 2 versions of the containers when using the TensorFlowModel object. If you need Python 3, then you will need to use the Model object defined in #2 above.

Finally, I used the Model class with TensorFlow version 1.15.2.

 from sagemaker.tensorflow.serving import Model  # note: serving.Model, not the TensorFlowModel class above
 sagemaker_model = Model(model_data=model_data, role=role, framework_version='1.15.2', entry_point='train.py')
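For completeness, deploying this model and invoking the endpoint again looks roughly like this (the instance type and input shape are placeholder assumptions):

 import numpy as np

 # Deploy the corrected Model and invoke the endpoint
 predictor = sagemaker_model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')
 some_data = np.zeros((1, 224, 224, 3)).tolist()   # placeholder payload; use your model's expected input shape
 result = predictor.predict(some_data)             # succeeds instead of raising a ModelError
 print(result)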

Also, here are the successful results. [screenshot of successful results]
