
How to speed up TensorFlow RNN inference time

We've trained a tf-seq2seq model for question answering. The main framework is from google/seq2seq. We use a bidirectional RNN (GRU encoder/decoder, 128 units) with a soft attention mechanism.

We limit the maximum output length to 100 words, but it mostly generates only 10~20 words.

For model inference, we try two cases:

  1. normal (greedy) decoding: inference time is about 40ms~100ms.
  2. beam search with beam width 5: inference time is about 400ms~1000ms.

So we could try beam width 3, which should reduce inference time, but it may also hurt the quality of the output.
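For context, here is a rough sketch of the two decoding loops (a stand-in step function takes the place of our actual decoder, and the vocabulary size and scores are made up); it shows why beam search does roughly beam_width times the per-step work of greedy decoding:

import numpy as np

VOCAB, MAX_LEN, EOS = 1000, 100, 0

def step(last_tokens, states):
    # Stand-in for one decoder step: fake log-probs for each hypothesis.
    rng = np.random.default_rng()
    return np.log(rng.dirichlet(np.ones(VOCAB), size=len(last_tokens))), states

def greedy_decode(start_id=1):
    tokens, states = [start_id], None
    for _ in range(MAX_LEN):                      # one hypothesis per step
        log_probs, states = step([tokens[-1]], states)
        tokens.append(int(np.argmax(log_probs[0])))
        if tokens[-1] == EOS:
            break
    return tokens

def beam_decode(beam_width=5, start_id=1):
    beams = [([start_id], 0.0)]
    for _ in range(MAX_LEN):                      # beam_width hypotheses per step
        log_probs, _ = step([seq[-1] for seq, _ in beams], None)
        candidates = []
        for (seq, score), lp in zip(beams, log_probs):
            top = np.argsort(lp)[-beam_width:]    # best continuations of this beam
            candidates += [(seq + [int(t)], score + float(lp[t])) for t in top]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == EOS for seq, _ in beams):
            break
    return beams[0][0]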

So, are there any suggestions for decreasing inference time in our case? Thanks.

  1. You can do network compression.
  2. Cut the sentences into subword pieces with byte-pair encoding, a unigram language model, etc., and then try TreeLSTM.
  3. You can try a faster softmax such as adaptive softmax ( https://arxiv.org/pdf/1609.04309.pdf ).
  4. Try cuDNN LSTM/GRU (see the sketch after this list).
  5. Try dilated RNN.
  6. Switch to a CNN (e.g. dilated CNN) or BERT for parallelization and more efficient GPU support.
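Regarding point 4: the question uses GRUs, so below is a minimal sketch, assuming TensorFlow 2.x / Keras (vocab_size and embed_dim are placeholders), of a bidirectional 128-unit GRU encoder that keeps the settings required for the fused cuDNN kernel; the same idea applies to LSTM:

import tensorflow as tf

# tf.keras.layers.GRU dispatches to the fused cuDNN kernel on GPU as long as the
# defaults are kept: tanh/sigmoid activations, recurrent_dropout=0, unroll=False,
# use_bias=True, reset_after=True, and inputs are not masked (or right-padded).
def build_encoder(vocab_size=30000, embed_dim=128, units=128):
    tokens = tf.keras.Input(shape=(None,), dtype="int32")
    x = tf.keras.layers.Embedding(vocab_size, embed_dim)(tokens)
    # Bidirectional GRU encoder, 128 units per direction, as in the question.
    outputs, fwd_state, bwd_state = tf.keras.layers.Bidirectional(
        tf.keras.layers.GRU(units, return_sequences=True, return_state=True)
    )(x)
    state = tf.keras.layers.Concatenate()([fwd_state, bwd_state])
    return tf.keras.Model(tokens, [outputs, state])

encoder = build_encoder()
encoder.summary()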

If you need better performance, I'd suggest using OpenVINO. It reduces inference time by pruning the graph and fusing some operations. Although OpenVINO is optimized for Intel hardware, it should work with any CPU.

Here are some performance benchmarks for an NLP model (BERT) on various CPUs.

It's rather straightforward to convert a TensorFlow model to OpenVINO unless you have fancy custom layers. The full tutorial on how to do it can be found here. Some snippets are below.

Install OpenVINO

The easiest way is to use pip. Alternatively, you can use this tool to find the best installation method for your case.

pip install openvino-dev[tensorflow2]

Use Model Optimizer to convert SavedModel model

The Model Optimizer is a command-line tool that comes with the OpenVINO Development Package. It converts the TensorFlow model to IR, the default format for OpenVINO. You can also try FP16 precision, which should give you better performance without a significant accuracy drop (just change data_type). Run in the command line:

mo --saved_model_dir "model" --input_shape "[1, 3, 224, 224]" --data_type FP32 --output_dir "model_ir"
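For example, the FP16 variant differs only in data_type (the input shape shown here is the tutorial's placeholder and should match your model's actual input):

mo --saved_model_dir "model" --input_shape "[1, 3, 224, 224]" --data_type FP16 --output_dir "model_ir"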

Run the inference

The converted model can be loaded by the runtime and compiled for a specific device, e.g. CPU or GPU (the graphics integrated into your CPU, like Intel HD Graphics). If you don't know which is the best choice for you, just use AUTO.

from openvino.runtime import Core

# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="CPU")

# Get output layer
output_layer_ir = compiled_model_ir.output(0)

# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
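If you want to compare devices, you can compile the same IR for AUTO (mentioned above) and time a few runs. A minimal sketch, where input_image stands for whatever input your model expects:

import time

# Compile the same IR for AUTO so OpenVINO picks the best available device.
compiled_auto = ie.compile_model(model=model_ir, device_name="AUTO")
output_auto = compiled_auto.output(0)

# Warm up once, then average latency over a few runs.
compiled_auto([input_image])
runs = 20
start = time.perf_counter()
for _ in range(runs):
    result = compiled_auto([input_image])[output_auto]
print("avg latency: %.1f ms" % ((time.perf_counter() - start) / runs * 1000))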

Disclaimer: I work on OpenVINO.
