
Deploy pre-trained tensorflow model on the aws sagemaker - ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation

This is the first time I am using Amazon Web Services to deploy a machine learning model. I want to deploy my pre-trained TensorFlow model to AWS SageMaker. I am able to deploy the endpoint successfully, but whenever I call the predictor.predict(some_data) method to invoke the endpoint, it throws an error.

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (500) from model with message "". See https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#logEventViewer:group=/aws/sagemaker/Endpoints/sagemaker-tensorflow-2020-04-07-04-25-27-055 in account 453101909370 for more information.
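The exception text itself tells you where to look: the CloudWatch log group is embedded in the console URL. As a quick sketch (the helper name and message format are my own assumptions, not part of the SageMaker SDK), it can be pulled out with a regular expression:

```python
import re

def extract_log_group(model_error_message: str) -> str:
    """Pull the CloudWatch log group out of a SageMaker ModelError message.

    The message embeds a console URL containing
    '#logEventViewer:group=<log-group>'; we capture everything after
    'group=' up to the next whitespace character.
    """
    match = re.search(r"group=(\S+)", model_error_message)
    if not match:
        raise ValueError("no CloudWatch log group found in message")
    return match.group(1)

# Example message, shaped like the ModelError above (hypothetical input).
message = (
    'Received server error (500) from model with message "". See '
    "https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2"
    "#logEventViewer:group=/aws/sagemaker/Endpoints/"
    "sagemaker-tensorflow-2020-04-07-04-25-27-055 in account 453101909370"
)
print(extract_log_group(message))
# -> /aws/sagemaker/Endpoints/sagemaker-tensorflow-2020-04-07-04-25-27-055
```

Opening that log group in the CloudWatch console shows the serving container's stderr, which is where the real error lives.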

After going through the CloudWatch logs, I found this error:

#011details = "NodeDef mentions attr 'explicit_paddings' not in Op<name=Conv2D; signature=input:T, filter:T -> output:T; attr=T:type,allowed=[DT_HALF, DT_BFLOAT16, DT_FLOAT, DT_DOUBLE]; attr=strides:list(int); attr=use_cudnn_on_gpu:bool,default=true; attr=padding:string,allowed=["SAME", "VALID"]; attr=data_format:string,default="NHWC",allowed=["NHWC", "NCHW"]; attr=dilations:list(int),default=[1, 1, 1, 1]>; NodeDef: {{node conv1_conv/convolution}} = Conv2D[T=DT_FLOAT, _output_shapes=[[?,112,112,64]], data_format="NHWC", dilations=[1, 1, 1, 1], explicit_paddings=[], padding="VALID", strides=[1, 2, 2, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](conv1_pad/Pad, conv1_conv/kernel/read). (Check whether your GraphDef-interpreting binary is up to date with your GraphDef-generating binary.).

I don't know where I went wrong, and I have already wasted two days trying to solve this error without finding any information about it. I have shared the detailed logs here.

The TensorFlow version of my notebook instance is 1.15.

After a lot of searching and trial and error, I was able to solve this problem. In many cases, the problem arises from a mismatch between the TensorFlow and Python versions.

Cause of the problem: to deploy the endpoint, I was using TensorFlowModel with TF 1.12 and Python 3, which is exactly what caused the problem.

 sagemaker_model = TensorFlowModel(model_data=model_data, role=role,
                                   framework_version='1.12',
                                   entry_point='train.py')

Apparently, TensorFlowModel only allows Python 2 on TF versions 1.11, 1.12, and 2.1.0.

How I fixed it: there are two TensorFlow solutions that handle serving in the SageMaker Python SDK. They have different class representations and documentation, as shown here:

  1. TensorFlowModel - https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/model.py#L47
  2. Model - https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/serving.py#L96

Python 3 isn't supported with the TensorFlowModel object, because that container uses the TensorFlow Serving API library in conjunction with the gRPC client to handle inference. The TensorFlow Serving API isn't officially supported in Python 3, so only Python 2 versions of the containers exist for the TensorFlowModel object. If you need Python 3, you will need to use the Model object defined in #2 above.
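The rule above can be summed up in a tiny helper (entirely my own sketch for illustration; neither the function nor its behavior is part of the SDK): given the Python major version your inference code needs, it names the class to import.

```python
def serving_class_for(python_major: int) -> str:
    """Return the SageMaker SDK class to use for TensorFlow serving.

    TensorFlowModel (sagemaker.tensorflow.model) ships only Python 2
    serving containers, because it relies on the TensorFlow Serving API
    plus the gRPC client, which were never officially released for
    Python 3. Model (sagemaker.tensorflow.serving) does not have that
    constraint, so it is the only choice for Python 3.
    """
    if python_major >= 3:
        return "sagemaker.tensorflow.serving.Model"
    return "sagemaker.tensorflow.model.TensorFlowModel"

print(serving_class_for(3))  # -> sagemaker.tensorflow.serving.Model
print(serving_class_for(2))  # -> sagemaker.tensorflow.model.TensorFlowModel
```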

Finally, I used Model with TensorFlow version 1.15.2.

 sagemaker_model = Model(model_data=model_data, role=role,
                         framework_version='1.15.2',
                         entry_point='train.py')

Also, here are the successful results.
