
Performance issue with TensorFlow Model Server on CPU compared with raw TensorFlow model inference

I observe a performance issue with TensorFlow Model Server on CPU: inference takes roughly twice as long as with raw TensorFlow model inference. Both are built with MKL for CPU only.

Code to reproduce: https://github.com/BogdanRuzh/tf_model_service_benchmark

Tensorflow MKL build: bazel build --config=mkl -c opt --copt=-msse4.1 --copt=-msse4.2 --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-O3 //tensorflow/tools/pip_package:build_pip_package

Tensorflow server MKL build: bazel build --config=mkl --config=opt --copt=-msse4.1 --copt=-msse4.2 --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-O3 tensorflow_serving/model_servers:tensorflow_model_server

The target model is a simple CNN for segmentation.

The raw TensorFlow model processes an image in 0.17 s; TensorFlow Model Server processes the same image in 0.32 s.

How can I improve this performance? It is critical for my application.

If you need better performance, I'd recommend trying OpenVINO. It optimizes the inference time, e.g. by graph pruning and by fusing some operations. OpenVINO is optimized for Intel hardware, but it should work with any CPU. However, I assume you have an Intel CPU since you compiled TensorFlow with MKL support.

Here are some performance benchmarks for various models and CPUs.

It's rather straightforward to convert the TensorFlow model to OpenVINO unless you have fancy custom layers. The full tutorial on how to do it can be found here. Some snippets are below.

Install OpenVINO

The easiest way is to use pip. Alternatively, you can use this tool to find the best installation method for your case.

pip install openvino-dev[tensorflow2]

Use Model Optimizer to convert the SavedModel

The Model Optimizer is a command-line tool that comes with the OpenVINO Development Package. It converts the TensorFlow model to IR, the default format for OpenVINO. You can also try FP16 precision, which should give you better performance without a significant accuracy drop (just change data_type). Run in the command line:

mo --saved_model_dir "model" --input_shape "[1, 3, 224, 224]" --data_type FP32 --output_dir "model_ir"

Run the inference

The converted model can be loaded by the runtime and compiled for a specific device, e.g. CPU or GPU (the graphics integrated into your CPU, like Intel HD Graphics). If you don't know what the best choice is, just use AUTO.

# Load the network (Core comes from the OpenVINO runtime)
from openvino.runtime import Core

ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="CPU")

# Get output layer
output_layer_ir = compiled_model_ir.output(0)

# Run inference on the input image (a preprocessed NumPy array matching --input_shape)
result = compiled_model_ir([input_image])[output_layer_ir]
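If you want to compare against your 0.17 s / 0.32 s numbers, here is a minimal timing sketch. The dummy input and the iteration count are assumptions; it reuses compiled_model_ir and output_layer_ir from the snippet above.

import time
import numpy as np

# Dummy input matching the shape passed to Model Optimizer (assumed NCHW, FP32)
input_image = np.random.rand(1, 3, 224, 224).astype(np.float32)

# Warm-up run, then average the latency over repeated inferences
compiled_model_ir([input_image])
start = time.perf_counter()
for _ in range(100):
    result = compiled_model_ir([input_image])[output_layer_ir]
print("Average latency: %.4f s" % ((time.perf_counter() - start) / 100))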

There is even OpenVINO Model Server, which is very similar to TensorFlow Serving.
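For example, once the IR model is served by OpenVINO Model Server, it can be queried over its TensorFlow Serving compatible REST API. This is only a sketch: the port 8000 and the model name segmentation_model are placeholders that depend on how you start the server.

import json
import numpy as np
import requests

# Placeholder endpoint; adjust the port and model name to your server configuration
url = "http://localhost:8000/v1/models/segmentation_model:predict"

# Dummy input with the same shape used during conversion
input_image = np.random.rand(1, 3, 224, 224).astype(np.float32)
payload = {"instances": input_image.tolist()}

response = requests.post(url, data=json.dumps(payload))
predictions = response.json()["predictions"]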

Disclaimer: I work on OpenVINO.

I suppose this explanation will help you: it's said that, with a bad configuration, TensorFlow with Intel optimizations may perform worse than a plain build: https://github.com/tensorflow/serving/issues/1272#issuecomment-477878180

You may try to configure the batching parameters (with a config file and the --enable_batching parameter), as in the sketch below: https://github.com/tensorflow/serving/tree/master/tensorflow_serving/batching
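A minimal sketch of what that could look like; the model name, paths, and values are placeholders to tune for your workload:

tensorflow_model_server --port=8500 --model_name=my_model --model_base_path=/models/my_model --enable_batching=true --batching_parameters_file=/models/batching.conf

where /models/batching.conf is a text-proto file such as:

# placeholder values
max_batch_size { value: 32 }
batch_timeout_micros { value: 1000 }
num_batch_threads { value: 4 }
max_enqueued_batches { value: 100 }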

You can also set (inter/intra)_op_parallelism_threads.
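Depending on your TensorFlow Serving version, these can be passed as server flags; the values below are placeholders (a common starting point is intra = number of physical cores), and you should check tensorflow_model_server --help for the exact flag names available in your build:

tensorflow_model_server --port=8500 --model_name=my_model --model_base_path=/models/my_model --tensorflow_intra_op_parallelism=4 --tensorflow_inter_op_parallelism=2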

Additionally, MKL has its own flags for improving performance: https://www.tensorflow.org/guide/performance/overview#tuning_mkl_for_the_best_performance
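As a sketch, the guide linked above tunes MKL mainly through environment variables set before starting the server; the values here are common starting points from that guide, not universal settings:

# Placeholders: OMP_NUM_THREADS is usually set to the number of physical cores
export OMP_NUM_THREADS=4
export KMP_BLOCKTIME=1
export KMP_AFFINITY=granularity=fine,verbose,compact,1,0
export KMP_SETTINGS=1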
