
Jetson NX optimize tensorflow model using TensorRT

I am trying to speed up a segmentation model (unet-mobilenet-512x512). I converted my tensorflow model to TensorRT with FP16 precision mode, and the speed is lower than I expected. Before the optimization I had 7 FPS on inference with a .pb frozen graph. After the TensorRT optimization I have 14 FPS.

Here are the benchmark results for the Jetson NX from their site.
You can see that the unet 256x256 segmentation speed is 146 FPS. Since 512x512 has four times as many pixels as 256x256, I thought my unet 512x512 should be at worst about 4 times slower, i.e. roughly 36 FPS.

[image: Jetson NX benchmark results]

Here is my code for optimizing the tensorflow saved model using TensorRT:

import numpy as np
from tensorflow.python.compiler.tensorrt import trt_convert as trt
import tensorflow as tf

params = trt.DEFAULT_TRT_CONVERSION_PARAMS
params = params._replace(
    max_workspace_size_bytes=(1<<32))
params = params._replace(precision_mode="FP16")
converter = tf.experimental.tensorrt.Converter(input_saved_model_dir='./model1', conversion_params=params)
converter.convert()

def my_input_fn():
  inp1 = np.random.normal(size=(1, 512, 512, 3)).astype(np.float32)
  yield [inp1]

converter.build(input_fn=my_input_fn)  # Generate corresponding TRT engines
output_saved_model_dir = "trt_graph2"
converter.save(output_saved_model_dir)  # Generated engines will be saved.


print("------------------------freezing the graph---------------------")


from tensorflow.python.framework.convert_to_constants import convert_variables_to_constants_v2

saved_model_loaded = tf.saved_model.load(
    output_saved_model_dir, tags=[tf.compat.v1.saved_model.SERVING])
graph_func = saved_model_loaded.signatures[
    tf.compat.v1.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY]
frozen_func = convert_variables_to_constants_v2(
    graph_func)
frozen_func.graph.as_graph_def()

tf.io.write_graph(graph_or_graph_def=frozen_func.graph,
                logdir="./",
                name="unet_frozen_graphTensorRt.pb",
                as_text=False)

I downloaded the repository that was used for the Jetson NX benchmarking (https://github.com/NVIDIA-AI-IOT/jetson_benchmarks), and the speed of unet256x256 really is ~146 FPS. But there is no pipeline there to optimize the model. How can I get similar results? I am looking for a way to get the speed of my model (unet-mobilenet-512x512) close to 30 FPS.
Maybe I should run inference in some other way (without tensorflow) or change some conversion parameters?
Any suggestions, thanks.

As far as I can see, the repository you linked to uses command-line tools that use TensorRT (TRT) under the hood. Note that TensorRT is not the same as "TensorRT in TensorFlow", aka TensorFlow-TensorRT (TF-TRT), which is what you are using in your code. Both TF-TRT and TRT models run faster than regular TF models on a Jetson device, but TF-TRT models still tend to be slower than TRT ones (source 1, source 2).
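
For reference, that command-line route is typically driven by trtexec, which ships with TensorRT/JetPack and builds a serialized engine directly from an ONNX file. A minimal sketch (file names are placeholders, and the binary path assumes a default JetPack install):

/usr/src/tensorrt/bin/trtexec --onnx=unet_mobilenet_512.onnx --fp16 --saveEngine=unet_mobilenet_512.plan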

The downside of TRT is that the conversion needs to be done on the target device, and that it can be quite difficult to implement successfully, as there are various TensorFlow operations that TRT does not support (in which case you need to write a custom plugin or pray to God that someone on the internet has already done so... or use TensorRT only for part of your model and do pre-/postprocessing in TensorFlow).

There are basically two ways to convert models from TensorFlow to TensorRT "engines", aka "plan files", both of which use intermediate formats:

  • TF -> UFF -> TRT
  • TF -> ONNX -> TRT

In both cases, the graphsurgeon / onnx-graphsurgeon libraries can be used to modify the TF/ONNX graph to achieve compatibility of graph operations. Unsupported operations can be added by means of TensorRT plugins, as mentioned above. (This is really the main challenge here: different graph file formats and different target GPUs support different graph operations.)
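
For illustration, here is a minimal onnx-graphsurgeon round trip (load, drop dangling nodes, re-sort topologically, save); the actual surgery needed (removing or replacing unsupported nodes) depends on your model, and the file names are placeholders:

import onnx
import onnx_graphsurgeon as gs

# Load the ONNX graph, clean it up and write it back out.
graph = gs.import_onnx(onnx.load("unet_mobilenet_512.onnx"))
graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "unet_mobilenet_512_cleaned.onnx")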

There's also a third way where you do TF -> Caffe -> TRT, and apparently a fourth one where you use Nvidia's Transfer Learning Toolkit (TLT) (based upon TF/Keras) and a tool called tlt-converter, but I'm not familiar with it. The latter link does mention converting a UNet model, though.

Note that the paths involving UFF and Caffe are now deprecated and support will be removed in TensorRT 9.0, so if you want something future-proof, you should probably go for ONNX. That being said, most sample code I've come across online still uses UFF, and TensorRT 9.0 is still some time away.

Anyway, I haven't tried converting a UNet to TensorRT yet, but the following repositories provide sample code which might give you an idea of how it works in principle:

Note that even if you don't manage to pull off the conversion from ONNX to TRT for your model, using the ONNX Runtime for inference could potentially still give you a performance gain, especially when you're using the CUDA or the TensorRT execution provider, which will be enabled automatically provided you're on a Jetson device and running the correct ONNXRuntime build. (I'm not sure how it compares to TF-TRT or TRT, though, but it might still be worth a shot.)
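
A minimal sketch of what that looks like with onnxruntime (the model file name and NHWC input layout are assumptions based on your 1x512x512x3 input; the TensorRT/CUDA providers are only available in a GPU-enabled ONNXRuntime build):

import numpy as np
import onnxruntime as ort

# ONNXRuntime falls back through the provider list in order.
sess = ort.InferenceSession(
    "unet_mobilenet_512.onnx",
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
)
input_name = sess.get_inputs()[0].name
dummy = np.random.rand(1, 512, 512, 3).astype(np.float32)  # assumed input shape
outputs = sess.run(None, {input_name: dummy})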

Finally, for completeness's sake let me also mention that at least my team has been dabbling with the idea of switching from TF to PyTorch, partly because the Nvidia support has been getting a lot better lately and Nvidia employees seem to gravitate towards PyTorch, too. In particular, there are now two separate ways to convert models to TRT:
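
(The two ways being, presumably, torch2trt and Torch-TensorRT, formerly TRTorch.) As a rough illustration, a torch2trt conversion looks something like the sketch below; MyUNet is a hypothetical PyTorch model and the input shape is just an example:

import torch
from torch2trt import torch2trt  # https://github.com/NVIDIA-AI-IOT/torch2trt

model = MyUNet().cuda().eval()          # hypothetical PyTorch UNet
x = torch.randn(1, 3, 512, 512).cuda()  # example input used for tracing
model_trt = torch2trt(model, [x], fp16_mode=True)

with torch.no_grad():
    y = model_trt(x)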

Hi, can you share the errors you are getting? It should work with the following steps:

  1. Convert the TensorFlow/Keras model to a .pb file.
  2. Convert the .pb file to ONNX format.
  3. Create a TensorRT engine.
  4. Run inference from the TensorRT engine.

I am not sure about UNet (I will check), but you may have some operations not supported by ONNX (please share your errors).

Here is an example with ResNet-50.

Conversion to .pb:

import tensorflow as tf
import keras
import keras.backend as K

# Note: this uses the TF1-style session/graph API (standalone Keras);
# on TF2 you would need the tf.compat.v1 equivalents.
K.set_learning_phase(0)

def keras_to_pb(model, output_filename, output_node_names):

   """
   This is the function to convert the Keras model to pb.

   Args:
      model: The Keras model.
      output_filename: The output .pb file name.
      output_node_names: The output nodes of the network. If None, then
      the function gets the last layer name as the output node.
   """

   # Get the names of the input and output nodes.
   in_name = model.layers[0].get_output_at(0).name.split(':')[0]

   if output_node_names is None:
       output_node_names = [model.layers[-1].get_output_at(0).name.split(':')[0]]

   sess = keras.backend.get_session()

   # convert_variables_to_constants expects a list of output node names.

   frozen_graph_def = tf.graph_util.convert_variables_to_constants(
       sess,
       sess.graph_def,
       output_node_names)

   sess.close()
   wkdir = ''
   tf.train.write_graph(frozen_graph_def, wkdir, output_filename, as_text=False)

   return in_name, output_node_names

# load the ResNet-50 model pretrained on imagenet
model = keras.applications.resnet.ResNet50(include_top=True, weights='imagenet', input_tensor=None, input_shape=None, pooling=None, classes=1000)

# Convert the Keras ResNet-50 model to a .pb file
in_tensor_name, out_tensor_names = keras_to_pb(model, "models/resnet50.pb", None) 

Then you need to convert the .pb model to the ONNX format. To do this, you will need to install tf2onnx. Example:

python -m tf2onnx.convert  --input /Path/to/resnet50.pb --inputs input_1:0 --outputs probs/Softmax:0 --output resnet50.onnx 
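
Since you already have a TensorFlow SavedModel (./model1 in your script above), tf2onnx can also consume that directly instead of a frozen .pb; something like the following (the opset number is an assumption):

python -m tf2onnx.convert --saved-model ./model1 --output model1.onnx --opset 13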

Last step: create the TensorRT engine from the ONNX file:

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
trt_runtime = trt.Runtime(TRT_LOGGER)
def build_engine(onnx_path, shape = [1,224,224,3]):

   """
   This is the function to create the TensorRT engine
   Args:
      onnx_path : Path to onnx_file. 
      shape : Shape of the input of the ONNX file. 
  """
   with trt.Builder(TRT_LOGGER) as builder, builder.create_network(1) as network, trt.OnnxParser(network, TRT_LOGGER) as parser:
       builder.max_workspace_size = (256 << 20)
       with open(onnx_path, 'rb') as model:
           parser.parse(model.read())
       network.get_input(0).shape = shape
       engine = builder.build_cuda_engine(network)
       return engine

def save_engine(engine, file_name):
   buf = engine.serialize()
   with open(file_name, 'wb') as f:
       f.write(buf)

def load_engine(trt_runtime, plan_path):
   with open(plan_path, 'rb') as f:
       engine_data = f.read()
   engine = trt_runtime.deserialize_cuda_engine(engine_data)
   return engine
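
For step 4, a minimal inference sketch against the engine built above could look like this (TensorRT 7.x-style API with pycuda, single input/output, batch size 1; shapes and dtypes are assumptions you would adjust for your model):

import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit  # creates a CUDA context

def infer(engine, input_array):
    with engine.create_execution_context() as context:
        # Host-side buffers (assuming float32 input and output).
        input_array = np.ascontiguousarray(input_array, dtype=np.float32)
        output_array = np.empty(tuple(engine.get_binding_shape(1)), dtype=np.float32)

        # Device-side buffers for the two bindings (input, output).
        d_input = cuda.mem_alloc(input_array.nbytes)
        d_output = cuda.mem_alloc(output_array.nbytes)

        # Host -> device, execute, device -> host.
        cuda.memcpy_htod(d_input, input_array)
        context.execute_v2([int(d_input), int(d_output)])
        cuda.memcpy_dtoh(output_array, d_output)
        return output_array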

I suggest you check this PyTorch TRT UNet implementation.
