How a Mock LLM Service Cut $500K in AI Benchmarking Costs and Boosted Developer Productivity
By Sandeep Bansal and Seetharaman Gudetee.
In our Engineering Energizers Q&A series, we spotlight the engineering minds driving innovation across Salesforce. Today’s edition features Sandeep Bansal, a senior software engineer from the AI Cloud Platform Engineering team, whose internal LLM mock service validates performance, reliability, and cost efficiency at scale — supporting production-readiness benchmarks beyond 24,000 requests per minute while significantly reducing LLM model costs during benchmarking.
Explore how the team saved more than $500K annually in token-based costs by replacing live LLM dependencies with a controllable simulation layer, enforced deterministic latency to accelerate performance validation, and enabled rapid scale and failover benchmarking by simulating high-volume traffic and controlled outages without relying on external provider infrastructure.
Our mission is to help engineering teams move faster while reducing the cost and uncertainty of building AI-powered systems and validating their performance and scale. As Salesforce’s AI platform expanded, teams increasingly needed to benchmark performance, scalability, and reliability under production-like conditions. However, running those benchmarks directly against live LLM providers introduced cost pressure, variability, and external dependencies that slowed iteration.
Users can add their desired mock response or status code for a given unique key. The service also lets them configure static or dynamic latency to simulate variable OpenAI ...
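The article does not show the service’s actual API, but the pattern it describes (a keyed registry of mock responses, status codes, and static or dynamic latency, served behind an OpenAI-style endpoint) can be sketched in a few dozen lines. The sketch below is a hypothetical illustration, not Salesforce’s implementation: the route paths, the `X-Mock-Key` header, and the `latency_ms` field are all assumed names.

```python
# Minimal sketch of a mock LLM service, assuming a Flask HTTP server.
# All routes, headers, and field names here are hypothetical.
import random
import time

from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical in-memory registry: unique key -> mock configuration.
MOCKS = {}


@app.post("/mocks/<key>")
def register_mock(key):
    """Register a mock response, status code, and latency for a key.

    Example body:
      {"response": {...}, "status": 200, "latency_ms": 250}   # static latency
      {"response": {...}, "latency_ms": [100, 900]}           # [min, max] dynamic latency
    """
    MOCKS[key] = request.get_json()
    return jsonify({"registered": key}), 201


@app.post("/v1/chat/completions")
def mock_completion():
    """Serve the configured mock instead of calling a live LLM provider."""
    key = request.headers.get("X-Mock-Key", "default")
    cfg = MOCKS.get(key)
    if cfg is None:
        return jsonify({"error": f"no mock registered for key {key!r}"}), 404

    # A static latency_ms value gives deterministic timing for benchmarks;
    # a [min, max] pair simulates variable provider latency.
    latency = cfg.get("latency_ms", 0)
    if isinstance(latency, (list, tuple)):
        latency = random.uniform(latency[0], latency[1])
    time.sleep(latency / 1000.0)

    return jsonify(cfg.get("response", {})), cfg.get("status", 200)


if __name__ == "__main__":
    app.run(port=8080)
```

In this setup, a benchmark client would register a mock once and then point its OpenAI-compatible base URL at the service, so every request returns the configured payload at the configured latency with zero token spend. A fixed `latency_ms` yields the deterministic timing the team relies on for performance validation, while a `[min, max]` range approximates real provider variability; returning error status codes for selected keys is one way such a layer could simulate the controlled outages used in failover benchmarking.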