
How to serialize Tensorflow Serving request to reduce inference/predict latency?

I have exported a TF SavedModel from an Estimator using TensorServingInputReceiver as follows:

def serving_input_fn():
    input_ph = tf.placeholder(tf.float32, shape=[None, 3, 224, 224], name = 'image_batches')
    input_tensors = input_ph
    return tf.estimator.export.TensorServingInputReceiver(input_tensors, input_ph)

and export the SavedModel as follows:

warm_start = tf.estimator.WarmStartSettings(CKPT_DIR)
classifier = tf.estimator.Estimator(model_fn = model_fn, warm_start_from = warm_start)
classifier.export_savedmodel(export_dir_base = SAVED_MODEL_DIR, serving_input_receiver_fn = serving_input_fn)

However, when I use this SavedModel to perform predictions in Tensorflow Serving:

json_dict = {'signature_name': 'serving_default', 'instances': data}

where data is a numpy array, I get only about 1/5 to 1/6 of the speed of direct local inference with the SavedModel.
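For reference, the request is built and sent roughly like this (the server URL and model name below are placeholders, not the actual deployment):

import json
import numpy as np
import requests

# Placeholder endpoint; host, port and model name depend on the deployment.
SERVER_URL = 'http://localhost:8501/v1/models/my_model:predict'

# Dummy batch matching the [None, 3, 224, 224] placeholder in serving_input_fn.
data = np.random.rand(1, 3, 224, 224).astype(np.float32)

# The numpy array has to be converted to nested Python lists before JSON encoding.
json_dict = {'signature_name': 'serving_default', 'instances': data.tolist()}
response = requests.post(SERVER_URL, data=json.dumps(json_dict))
predictions = response.json()['predictions']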

Currently I think the problem may be the serialization of the request into JSON, as suggested here. So does anyone know how to serialize the request before sending it, or have any suggestions as to why inference with TF Serving is so much slower than direct inference?
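For context, the usual alternative to the JSON REST API is TensorFlow Serving's gRPC endpoint, which accepts the batch as a binary TensorProto. A minimal sketch follows; the port, model name, and the input key 'input' are all assumptions (saved_model_cli show --dir <export_dir> --all prints the actual signature names), and it needs the tensorflow-serving-api package:

import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

# Assumed host/port and model name; adjust to the actual deployment.
channel = grpc.insecure_channel('localhost:8500')
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = 'my_model'
request.model_spec.signature_name = 'serving_default'

batch = np.random.rand(1, 3, 224, 224).astype(np.float32)
# make_tensor_proto packs the array into a compact binary TensorProto,
# avoiding the float-to-text conversion that json.dumps performs.
request.inputs['input'].CopyFrom(tf.make_tensor_proto(batch, shape=batch.shape))

result = stub.Predict(request, 10.0)  # 10-second timeout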

I have been facing the same situation: json.dumps causes about 90% of the delay. If the image size is reduced to, say, (1, 100, 100, 3), the inference speed increases roughly 3-fold, so basically the json.dumps payload is too large and thus takes more time to write to and read from memory. I tried ultrajson (ujson), but there wasn't any observable improvement.

trs = img1.tolist()
data = json.dumps({"signature_name": "serving_default", "instances": trs})

As my model architecture requires INT, I don't see any other option for serializing the data.

Modifying the input architecture can change our input requirement from UINT to STRING, which can be done as follows:

dl_request = requests.get(IMAGE, stream=True)
jpeg_bytes = base64.b64encode(dl_request.content).decode('utf-8')
predict_request = '{"instances" : [{"b64": "%s"}]}' % jpeg_bytes
response = requests.post(SERVER_URL, data=predict_request)

This gives good results, but how do I convert the model input type from INT to STRING?
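One possible direction is to export a serving signature whose receiver tensor is a batch of encoded image bytes (tf.string) and to do the JPEG decoding inside the graph, so the client only ships the compressed bytes. A rough sketch; the placeholder name, the 224x224 resize, and the final int32 cast are assumptions and have to match whatever model_fn actually expects:

def serving_input_fn():
    # Batch of JPEG-encoded images; TF Serving fills this placeholder from the
    # {"b64": ...} values in the REST request after base64-decoding them.
    serialized_images = tf.placeholder(tf.string, shape=[None], name='encoded_image_bytes')

    def decode_and_preprocess(jpeg_bytes):
        image = tf.image.decode_jpeg(jpeg_bytes, channels=3)   # uint8, shape [H, W, 3]
        image = tf.image.resize_images(image, [224, 224])      # assumed target size; returns float32
        return tf.cast(image, tf.int32)                        # cast back for a model that expects INT input

    images = tf.map_fn(decode_and_preprocess, serialized_images, dtype=tf.int32)
    return tf.estimator.export.TensorServingInputReceiver(images, serialized_images)

With such a signature, the base64 request shown above should work unchanged, without json.dumps-ing the raw pixel array.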

Any suggestions?


