Modernising Grab’s model serving platform with NVIDIA Triton Inference Server

Catwalk is Grab’s machine learning (ML) model serving platform, designed to enable data scientists and engineers to deploy production-ready inference APIs. Catwalk currently powers hundreds of ML models and online deployments. To accommodate this growth, the platform has had to keep pace with the rapidly evolving ML technology landscape, progressively adding support for multiple frameworks such as ONNX, PyTorch, TensorFlow, and vLLM. This approach worked while the number of frameworks was small, but it soon became unsustainable: maintaining several inference engines, ensuring backward compatibility, and managing deprecated legacy components (such as the ONNX server) introduced significant technical debt. Over time, platform performance degraded, with increased latency, reduced throughput, and escalating costs. These issues began to affect users, as larger models could no longer be served efficiently or cost-effectively by legacy components. Recognising the need for change, the team revisited the platform’s design to address these challenges.

Evaluation and implementation

After evaluating other industry-leading model serving platforms and studying best practices, we decided to conduct an in-depth analysis of NVIDIA Triton. Triton offers significant advantages as an inference engine, including:

Multi-framework support: Compatibility with major ML frameworks, including ONNX, PyTorch, and TensorFlow, ensuring versatility and broad applicability.
Unified inference interface: Provides a sin...