A Practical Blueprint for Evaluating Conversational AI at Scale
LLM applications present a deceptively simple interface: a single text box. But behind that minimalism runs a chain of probabilistic stages, including intent classification, document retrieval, ranking, prompt construction, model inference, and safety filtering. A tweak to any link in this chain can ripple unpredictably through the pipeline, turning yesterday’s perfect answer into today’s hallucination. Building Dropbox Dash taught us that in the foundation-model era, AI evaluation—the set of structured tests that ensure accuracy and reliability—matters just as much as model training.
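To make the chain-of-stages idea concrete, here is a minimal sketch of such a pipeline. The stage names and stub implementations are hypothetical (the real Dash pipeline is not described in detail here); the point is that each stage transforms a shared context, so a change to any one stage alters what every downstream stage sees.

```python
from dataclasses import dataclass, field

@dataclass
class PipelineContext:
    """Shared state passed through every stage of the pipeline."""
    query: str
    intent: str = ""
    documents: list = field(default_factory=list)
    prompt: str = ""
    answer: str = ""

# Each function below is a stand-in for a probabilistic component.

def classify_intent(ctx: PipelineContext) -> PipelineContext:
    ctx.intent = "search" if "?" in ctx.query else "chat"
    return ctx

def retrieve(ctx: PipelineContext) -> PipelineContext:
    corpus = {"search": ["doc_b", "doc_a"], "chat": ["doc_c"]}
    ctx.documents = corpus[ctx.intent]
    return ctx

def rank(ctx: PipelineContext) -> PipelineContext:
    ctx.documents = sorted(ctx.documents)  # placeholder for a learned ranker
    return ctx

def build_prompt(ctx: PipelineContext) -> PipelineContext:
    ctx.prompt = f"Context: {', '.join(ctx.documents)}\nQuestion: {ctx.query}"
    return ctx

def infer(ctx: PipelineContext) -> PipelineContext:
    ctx.answer = f"[model output for: {ctx.prompt}]"  # stand-in for an LLM call
    return ctx

def safety_filter(ctx: PipelineContext) -> PipelineContext:
    if "forbidden" in ctx.answer:
        ctx.answer = "[redacted]"
    return ctx

STAGES = [classify_intent, retrieve, rank, build_prompt, infer, safety_filter]

def run_pipeline(query: str) -> str:
    ctx = PipelineContext(query=query)
    for stage in STAGES:
        # In a real system, each stage is a point where behavior can drift.
        ctx = stage(ctx)
    return ctx.answer

print(run_pipeline("What changed in the Q3 report?"))
```

Because the stages are coupled through shared state, even a "local" fix, such as reordering the ranker's output, changes the prompt and therefore the model's answer, which is what makes end-to-end evaluation necessary.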
In the beginning, our evaluations were somewhat unstructured—more ad-hoc testing than a systematic approach. Over time, as we kept experimenting, we noticed that the real progress came from how we shaped the processes: refining how models retrieved info, tweaking prompts, and striking the right balance between consistency and variety in answers. So we decided to make our approach more rigorous. We designed and built a standardized evaluation process that treated every experiment like production code. Our rule was simple: Handle every change with the same care as shipping new code. Every update had to pass testing before it could be merged. In other words, evaluation wasn’t something we simply tacked on at the end. It was baked into every step of our process.
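The "every change must pass testing before merge" rule can be sketched as a small evaluation gate that runs in CI. Everything here is illustrative: `GOLDEN_SET`, `candidate_system`, and the containment check are hypothetical stand-ins for a curated dataset, the system under test, and a real metric.

```python
# Hypothetical golden dataset: query plus a substring the answer must contain.
GOLDEN_SET = [
    {"query": "reset password", "must_contain": "password"},
    {"query": "share a folder", "must_contain": "folder"},
]

def candidate_system(query: str) -> str:
    # Stand-in for the full pipeline (retrieval + prompt + model).
    return f"Steps to {query}: ..."

def evaluate(system, dataset, threshold: float = 1.0):
    """Score the system on the dataset; pass only if score meets the threshold."""
    passed = sum(
        1 for case in dataset if case["must_contain"] in system(case["query"])
    )
    score = passed / len(dataset)
    return score, score >= threshold

score, ok = evaluate(candidate_system, GOLDEN_SET)
# In CI, a failing gate blocks the merge, just as a failing unit test would.
if not ok:
    raise SystemExit(f"eval gate failed: {score:.0%} below threshold")
print(f"eval gate passed: {score:.0%}")
```

Treating the threshold check like a unit test is the key design choice: an experiment that regresses quality cannot land, so evaluation becomes part of the merge workflow rather than a step tacked on at the end.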
We captured these lessons in a playbook that covers the full arc of datasets, metrics, tooling, and workflows. And because people don’t just work in text, evaluation must ultimately extend to images, video, and audio to reflect how work really happens. We’re sharing those findings here so that anyone working with LLMs t...