
NVIDIA Triton vs TorchServe for SageMaker Inference

NVIDIA Triton vs TorchServe for SageMaker inference? When should each be recommended?

Both are modern, production-grade inference servers. TorchServe is the default inference server for PyTorch models in the SageMaker Deep Learning Containers (DLCs). Triton is also supported for PyTorch inference on SageMaker.

Does anyone have a good comparison matrix for the two?

Important notes on where the two serving stacks differ:

- TorchServe does not provide the Instance Groups feature that Triton does (that is, stacking many copies of the same model, or even different models, onto the same GPU). This is a major advantage for both real-time and batch use cases, as the performance increase is almost proportional to the model replication count (i.e., 2 copies of the model get you almost twice the throughput and half the latency; check out a BERT benchmark of this here). It is hard to match a feature that is almost like having 2+ GPUs for the price of one (see the config sketch after this list).

- If you are deploying PyTorch DL models, odds are you often want to accelerate them with GPUs. TensorRT (TRT) is a compiler developed by NVIDIA that automatically quantizes and optimizes your model graph, which represents another huge speedup, depending on GPU architecture and model. It is understandably probably the best way of automatically optimizing your model to run efficiently on GPUs and make good use of Tensor Cores. Triton has native integration to run TensorRT engines, as they're called (it can even automatically convert your model to a TRT engine via the config file), while TorchServe does not (even though you can use TRT engines with it).

- There is more parity between the two when it comes to other important serving features: both support dynamic batching, you can define inference DAGs with both (though I'm not sure the latter works with TorchServe on SageMaker without a big hassle), and both support custom code/handlers instead of only being able to serve a model's forward function.
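To make the Instance Groups and dynamic batching points concrete, here is a minimal sketch of a Triton model config (config.pbtxt). The model name, tensor names, dtypes, and shapes are hypothetical placeholders, not from the original post:

    # config.pbtxt -- minimal sketch; names, dtypes, and shapes are hypothetical
    name: "bert_example"
    platform: "pytorch_libtorch"   # use "tensorrt_plan" to serve a compiled TRT engine
    max_batch_size: 16

    input [
      {
        name: "INPUT__0"           # the TorchScript backend names inputs INPUT__<n>
        data_type: TYPE_INT64
        dims: [ 128 ]
      }
    ]
    output [
      {
        name: "OUTPUT__0"
        data_type: TYPE_FP32
        dims: [ 2 ]
      }
    ]

    # Instance Groups: pack 2 copies of the model onto GPU 0; throughput scales
    # almost linearly with the copy count, as described above.
    instance_group [
      {
        count: 2
        kind: KIND_GPU
        gpus: [ 0 ]
      }
    ]

    # Dynamic batching: group individual requests into server-side batches,
    # waiting at most 100 microseconds to reach a preferred batch size.
    dynamic_batching {
      preferred_batch_size: [ 4, 8 ]
      max_queue_delay_microseconds: 100
    }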
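On the TensorRT point: one common way to produce a TRT engine offline, assuming the model has already been exported to ONNX, is NVIDIA's trtexec tool; the resulting .plan file can then be dropped into a Triton model repository and served with platform: "tensorrt_plan":

    # Build an FP16 TensorRT engine from an ONNX export (file names hypothetical)
    trtexec --onnx=model.onnx --saveEngine=model.plan --fp16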

Finally, MME on GPU (coming shortly) will be based on Triton, which is a valid argument for customers to get familiar with it now, so that they can quickly leverage this new feature for cost optimization.

Bottom line: I think Triton is just as easy (if not easier) to use, is a lot more optimized/integrated for taking full advantage of the underlying hardware (and will keep being updated that way as newer GPU architectures are released, enabling an easy move to them), and in general blows TorchServe out of the water performance-wise when its optimization features are used in combination.

Because I don't have enough reputation to reply in comments, I'm writing this as an answer. MME stands for Multi-Model Endpoints. MME enables sharing GPU instances behind an endpoint across multiple models, and it dynamically loads and unloads models based on the incoming traffic. You can read further in this link.
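As a rough sketch of what that looks like from the client side (endpoint and model names here are hypothetical): each model is packaged as its own .tar.gz under the endpoint's S3 model prefix, and the TargetModel field of invoke_endpoint picks which one a request is routed to; SageMaker loads it on demand and evicts idle models:

    import boto3

    runtime = boto3.client("sagemaker-runtime")

    # Hypothetical payload; the wire format depends on the serving container
    # (for the Triton-based GPU MME, typically Triton's KServe-v2 request format).
    payload = b"..."

    response = runtime.invoke_endpoint(
        EndpointName="my-gpu-mme-endpoint",   # hypothetical endpoint name
        TargetModel="bert-base.tar.gz",       # any archive under the endpoint's S3 prefix
        ContentType="application/octet-stream",
        Body=payload,
    )
    print(response["Body"].read())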
