Agent Quality
Authors: Meltem Subasioglu, Turan Bulmus,
and Wafae Bakkali
Acknowledgements
Content contributors
Hussain Chinoy
Ale Fin
Peter Grabowski
Michelle Liu
Anant Nawalgaria
Kanchana Patlolla
Steven Pecht
Julia Wiesinger
Curators and editors
Anant Nawalgaria
Kanchana Patlolla
Designer
Michael Lanning
November 2025
Table of contents

Introduction
  How to Read This Whitepaper
Agent Quality in a Non-Deterministic World
  Why Agent Quality Demands a New Approach
  The Paradigm Shift: From Predictable Code to Unpredictable Agents
  The Pillars of Agent Quality: A Framework for Evaluation
  Summary & What's Next
The Art of Agent Evaluation: Judging the Process
  A Strategic Framework: The "Outside-In" Evaluation Hierarchy
    The "Outside-In" View: End-to-End Evaluation (The Black Box)
    The "Inside-Out" View: Trajectory Evaluation (The Glass Box)
  The Evaluators: The Who and What of Agent Judgment
    Automated Metrics
    The LLM-as-a-Judge Paradigm
    Agent-as-a-Judge
    Human-in-the-Loop (HITL) Evaluation
    User Feedback and Reviewer UI
  Beyond Performance: Responsible AI (RAI) & Safety Evaluation
  Summary & What's Next
Observability: Seeing Inside the Agent's Mind
  From Monitoring to True Observability
    The Kitchen Analogy: Line Cook vs. Gourmet Chef
  The Three Pillars of Observability
    Pillar 1: Logging – The Agent's Diary
    Pillar 2: Tracing – Following the Agent's Footsteps
      Why Tracing is Indispensable
      Key Elements of an Agent Trace
    Pillar 3: Metrics – The Agent's Health Report
      System Metrics: The Vital Signs
      Quality Metrics: Judging the Decision-Making
  Putting It All Together: From Raw Data to Actionable Insights
  Summary & What's Next
Conclusion: Building Trust in an Autonomous World
  Introduction: From Autonomous Capability to Enterprise Trust
  The Agent Quality Flywheel: A Synthesis of the Framework
  Three Core Principles for Building Trustworthy Agents
  The Future is Agentic - and Reliable
References
The future of AI is agentic. Its
success is determined by quality.
Introduction
We are at the dawn of the agentic era. The transition from predictable, instruction-based
tools to autonomous, goal-oriented AI agents presents one of the most profound shifts in
software engineering in decades. While these agents unlock incredible capabilities, their
inherent non-determinism makes them unpredictable and shatters our traditional models of
quality assurance.
This whitepaper serves as a practical guide to this new reality, founded on a simple but
radical principle:
Agent quality is an architectural pillar, not a final testing phase.
This guide is built on three core messages:
• The Trajectory is the Truth: We must evolve beyond evaluating just the final output. The
true measure of an agent's quality and safety lies in its entire decision-making process.
• Observability is the Foundation: You cannot judge a process you cannot see. We detail
the "three pillars" of observability - Logging , Tracing , and Metrics - as the essential
technical foundation for capturing the agent's "thought process."
• Evaluation is a Continuous Loop: We synthesize these concepts into the "Agent Quality
Flywheel", an operational playbook for turning this data into actionable insights. This
system uses a hybrid of scalable AI-driven evaluators and indispensable Human-in-the-
Loop (HITL) judgment to drive relentless improvement.
This whitepaper is for the architects, engineers, and product leaders building this future.
It provides the framework to move from building capable agents to building reliable and
trustworthy ones.
How to Read This Whitepaper
This guide is structured to build from the "why" to the "what" and finally to the "how." Use this
section to navigate to the chapters most relevant to your role.
• For All Readers: Start with Chapter 1: Agent Quality in a Non-Deterministic World.
This chapter establishes the core problem. It explains why traditional QA fails for AI agents
and introduces the Four Pillars of Agent Quality (Effectiveness, Efficiency, Robustness,
and Safety) that define our goals.
• For Product Managers, Data Scientists, and QA Leaders: If you're responsible for what
to measure and how to judge quality, focus on Chapter 2: The Art of Agent Evaluation.
This chapter is your strategic guide. It details the "Outside-In" hierarchy for evaluation,
explains the scalable "LLM-as-a-Judge" paradigm, and clarifies the critical role of
Human-in-the-Loop (HITL) evaluation.
• For Engineers, Architects, and SREs: If you build the systems, your technical blueprint is
Chapter 3: Observability. This chapter moves from theory to implementation. It provides
the "kitchen analogy" (Line Cook vs. Gourmet Chef) to explain monitoring vs. observability
and details the Three Pillars of Observability: Logs, Traces, and Metrics - the tools
you need to build an "evaluatable" agent.
• For Team Leads and Strategists: To understand how these pieces create a self-
improving system, read Chapter 4: Conclusion. This chapter unites the concepts
into an operational playbook. It introduces the "Agent Quality Flywheel" as a model
for continuous improvement and summarizes the three core principles for building
trustworthy AI.
Agent Quality in a
Non-Deterministic World
The world of artificial intelligence is transforming at full speed. We are moving from building
predictable tools that execute instructions to designing autonomous agents that interpret
intent, formulate plans, and execute complex, multi-step actions. For data scientists and
engineers who build, compete, and deploy at the cutting edge, this transition presents
a profound challenge. The very mechanisms that make AI agents powerful also make
them unpredictable.
To understand this shift, compare traditional software to a delivery truck and an AI agent
to a Formula 1 race car. The truck requires only basic checks (“Did the engine start? Did it
follow the fixed route?”). The race car, like an AI agent, is a complex, autonomous system
whose success depends on dynamic judgment. Its evaluation cannot be a simple checklist; it
requires continuous telemetry to judge the quality of every decision—from fuel consumption
to braking strategy.
This evolution is fundamentally changing how we must approach software quality. Traditional
quality assurance (QA) practices, while robust for deterministic systems, are insufficient for
the nuanced and emergent behaviors of modern AI. An agent can pass 100 unit tests and
still fail catastrophically in production because its failure isn't a bug in the code; it's a flaw in
its judgment.
Traditional software verification asks: “Did we build the product right?” It verifies logic
against a fixed specification. Modern AI evaluation must ask a far more complex question:
“Did we build the right product?” This is a process of validation, assessing quality,
robustness, and trustworthiness in a dynamic and uncertain world.
This chapter inspects this new paradigm. We will explore why agent quality demands a new
approach, analyze the technical shift that makes our old methods obsolete, and establish the
strategic "Outside-In" framework for evaluating systems that "think”.
Why Agent Quality Demands a New Approach
For an engineer, risk is something to be identified and mitigated. In traditional software,
failure is explicit: a system crashes, throws a NullPointerException, or returns an
explicitly incorrect calculation. These failures are obvious, deterministic, and traceable to a
specific error in logic.
AI agents fail differently. Their failures are often not system crashes but subtle degradations
of quality, emerging from the complex interplay of model weights, training data, and
environmental interactions. These failures are insidious: the system continues to run, API
calls return 200 OK, and the output looks plausible. But it is profoundly wrong, operationally
dangerous, and silently eroding trust.
Organizations that fail to grasp this shift face significant failures, operational inefficiencies,
and reputational damage. While failure modes like algorithmic bias and concept drift existed
in passive models, the autonomy and complexity of agents compound these risks, making
them harder to trace and mitigate. Consider these real-world failure modes highlighted in
Table 1:
| Failure Mode | Description | Examples |
|---|---|---|
| Algorithmic Bias | An agent operationalizes and potentially amplifies systemic biases present in its training data, leading to unfair or discriminatory outcomes. | A financial agent tasked with risk summarization over-penalizes loan applications based on zip codes found in biased training data. |
| Factual Hallucination | The agent produces plausible-sounding but factually incorrect or invented information with high confidence, often when it cannot find a valid source. | A research tool generating a highly specific but utterly false historical date or geographical location in a scholarly report, undermining academic integrity. |
| Performance & Concept Drift | The agent's performance degrades over time as the real-world data it interacts with (the "concept") changes, making its original training obsolete. | A fraud detection agent failing to spot new attack patterns. |
| Emergent Unintended Behaviors | The agent develops novel or unanticipated strategies to achieve its goal, which can be inefficient, unhelpful, or exploitative. | Finding and exploiting loopholes in a system's rules; engaging in "proxy wars" with other bots (e.g., repeatedly overwriting edits). |

Table 1: Agent Failure Modes
These failures render traditional debugging and testing paradigms ineffective. You cannot
use a breakpoint to debug a hallucination. You cannot write a unit test to prevent emergent
bias. Root cause analysis requires deep data analysis, model retraining, and systemic
evaluation - a new discipline entirely.
The Paradigm Shift: From Predictable Code to
Unpredictable Agents
The core technical challenge stems from the evolution from model-centric AI to system-
centric AI. Evaluating an AI agent is fundamentally different from evaluating an algorithm
because the agent is a system. This evolution has occurred in compounding stages, each
adding a new layer of evaluative complexity.
Figure 1: From Traditional ML to Multi-Agent Systems
1. Traditional Machine Learning: Evaluating regression or classification models, while non-
trivial, is a well-defined problem. We rely on statistical metrics like Precision, Recall, F1-
Score, and RMSE against a held-out test set. The problem is complex, but the definition of
"correct" is clear.
2. The Passive LLM: With the rise of generative models, we lost our simple metrics. How
do we measure the "accuracy" of a generated paragraph? The output is probabilistic.
Even with identical inputs, the output can vary. Evaluation became more complex, relying
on human raters and model-vs-model benchmarking. Still, these systems were largely
passive, text-in, text-out tools.
3. LLM+RAG (Retrieval-Augmented Generation): The next leap introduced a multi-
component pipeline, as pioneered by Lewis et al. (2020) 1 in their work "Retrieval-
Augmented Generation for Knowledge-Intensive NLP Tasks." Now, failure could occur
in the LLM or in the retrieval system. Did the agent give a bad answer because the LLM
reasoned poorly, or because the vector database retrieved irrelevant snippets? Our
evaluation surface expanded from just the model to include the performance of chunking
strategies, embeddings, and retrievers.
4. The Active AI Agent: Today, we face a profound architectural shift. The LLM is no longer
just a text generator; it is the reasoning "brain" within a complex system, integrated into a
loop capable of autonomous action. This agentic system introduces three core technical
capabilities that break our evaluation models:
• Planning and Multi-Step Reasoning: Agents decompose complex goals ("plan
my trip") into multiple sub-tasks. This creates a trajectory (Thought → Action →
Observation → Thought...). The non-determinism of the LLM now compounds at every
step. A small, stochastic word choice in Step 1 can send the agent down a completely
different and unrecoverable reasoning path by Step 4.
• Tool Use and Function Calling: Agents interact with the real world through APIs
and external tools (code interpreters, search engines, booking APIs). This introduces
dynamic environmental interaction. The agent's next action depends entirely on the
state of an external, uncontrollable world.
• Memory: Agents maintain state. Short-term "scratchpad" memory tracks the current
task, while long-term memory allows the agent to learn from past interactions. This
means the agent's behavior evolves, and an input that worked yesterday might produce
a different result today based on what the agent has "learned."
5. Multi-Agent Systems: The ultimate architectural complexity arises when multiple
active agents are integrated into a shared environment. This is no longer the evaluation
of a single trajectory but of a system-level emergent phenomenon, introducing new,
fundamental challenges:
• Emergent System Failures: The system's success depends on the unscripted
interactions between agents, such as resource contention, communication bottlenecks,
and systemic deadlocks, which cannot be attributed to a single agent's failure.
• Cooperative vs. Competitive Evaluation: The objective function itself may become
ambiguous. In cooperative MAS (e.g., supply chain optimization), success is a global
metric, while in competitive MAS (e.g., game theory scenarios or auction systems), the
evaluation often requires tracking individual agent performance and the stability of the
overall market/environment.
This combination of capabilities means the primary unit of evaluation is no longer the model,
but the entire system trajectory. The agent's emergent behavior arises from the intricate
interplay between its planning module, its tools, its memory, and the dynamic environment.
The Pillars of Agent Quality: A Framework for Evaluation
If we can no longer rely on simple accuracy metrics, and we must evaluate the entire system,
where do we begin? The answer is a strategic shift known as the "Outside-In" approach.
This approach anchors AI evaluation in user-centric metrics and overarching business goals,
moving beyond a sole reliance on internal, component-level technical scores. We must
stop asking only "What is the model's F1-score?" and start asking, "Does this agent deliver
measurable value and align with our user's intent?"
This strategy requires a holistic framework that connects high-level business goals to
technical performance. We define agent quality across four interconnected pillars:
Figure 2: The four pillars of Agent Quality
Effectiveness (Goal Achievement): This is the ultimate "black-box" question: Did the agent
successfully and accurately achieve the user's actual intent? This pillar connects directly
to user-centered metrics and business KPIs. For a retail agent, this isn't just "did it find a
product?" but "did it drive a conversion?" For a data analysis agent, it's not "did it write
code?" but "did the code produce the correct insight?" Effectiveness is the final measure of
task success.
Efficiency (Operational Cost): Did the agent solve the problem well? An agent that takes
25 steps, five failed tool calls, and three self-correction loops to book a simple flight is a
low-quality agent - even if it eventually succeeds. Efficiency is measured in
resources consumed: total tokens (cost), wall-clock time (latency), and trajectory complexity
(total number of steps).
Robustness (Reliability): How does the agent handle adversity and the messiness of the
real world? When an API times out, a website's layout changes, data is missing, or a user
provides an ambiguous prompt, does the agent fail gracefully? A robust agent retries failed
calls, asks the user for clarification when needed, and reports what it couldn't do and why
rather than crashing or hallucinating.
Safety & Alignment (Trustworthiness): This is the non-negotiable gate. Does the agent
operate within its defined ethical boundaries and constraints? This pillar encompasses
everything from Responsible AI metrics for fairness and bias to security against prompt
injection and data leakage. It ensures the agent stays on task, refuses harmful instructions,
and operates as a trustworthy proxy for your organization.
This framework makes one thing clear: you cannot measure any of these pillars if you only
see the final answer. You cannot measure Efficiency if you don't count the steps. You cannot
diagnose a Robustness failure if you don't know which API call failed. You cannot verify
Safety if you cannot inspect the agent's internal reasoning.
A holistic framework for agent quality demands a holistic architecture for agent visibility.
Summary & What's Next
The intrinsic non-deterministic nature of agents has broken traditional quality assurance.
Risks now include subtle issues like bias, hallucination, and drift, driven by a shift from passive
models to active, system-centric agents that plan and use tools. We must change our focus
from verification (checking specs) to validation (judging value).
This requires an "Outside-In" framework measuring agent quality across four pillars:
Effectiveness, Efficiency, Robustness, and Safety. Measuring these pillars demands deep
visibility—seeing inside the agent's decision-making trajectory.
Before building the how (observability architecture), we must define the what: What does
good evaluation look like?
Chapter 2 will define the strategies and judges for assessing complex agent behavior.
Chapter 3 will then build the technical foundation (logging, tracing, and metrics) needed
to capture the data.
The Art of Agent Evaluation: Judging
the Process
In Chapter 1, we established the fundamental shift from traditional software testing to
modern AI evaluation. Traditional testing is a deterministic process of verification - it asks,
“Did we build the product right?” against a fixed specification. This approach fails when a
system’s core logic is probabilistic, because non-deterministic outputs introduce subtle
degradations of quality that do not result in explicit crashes and may not be repeatable.
Agent evaluation, by contrast, is a holistic process of validation. It asks a far more complex
and essential strategic question: “Did we build the right product?” This question is the
strategic anchor for the "Outside-In" evaluation framework, representing the necessary
shift from internal compliance to judging the system's external value and alignment with user
intent. This requires us to assess the overall quality, robustness, and user value of an agent
operating in a dynamic world.
The rise of AI agents, which can plan, use tools, and interact with complex environments,
significantly complicates this evaluation landscape. We must move beyond "testing" an
output and learn the art of "evaluating" a process. This chapter provides the strategic
framework for doing just that: judging the agent’s entire decision-making trajectory, from
initial intent to final outcome.
A Strategic Framework: The "Outside-In"
Evaluation Hierarchy
To avoid getting lost in a sea of component-level metrics, evaluation must be a top-down,
strategic process. We call this the "Outside-In" Hierarchy. This approach prioritizes the only
metric that ultimately matters - real-world success - before diving into the technical details
of why that success did or did not occur. This model is a two-stage process: start with the
black box, then open it up.
The "Outside-In" View: End-to-End Evaluation (The Black Box)
Figure 3: A Framework for Holistic Agent Evaluation
The first and most important question is: "Did the agent achieve the user's
goal effectively?"
This is the "Outside-In" view. Before analyzing a single internal thought or tool call, we must
evaluate the agent's final performance against its defined objective.
Metrics at this stage focus on overall task completion. We measure:
• Task Success Rate: A binary (or graded) score of whether the final output was correct,
complete, and solved the user's actual problem, e.g. PR acceptance rate for a coding
agent, successful database transaction rate for a financial agent, or session completion
rate for a customer service bot.
• User Satisfaction: For interactive agents, this can be a direct user feedback score (e.g.,
thumbs up/down) or a Customer Satisfaction Score (CSAT).
• Overall Quality: If the agent's goal was quantitative (e.g., "summarize these 10 articles"),
the metric might be accuracy or completeness (e.g., "Did it summarize all 10?").
If the agent scores 100% at this stage, our work may be done. But in a complex system,
it rarely will. When the agent produces a flawed final output, abandons a task, or fails to
converge on a solution, the "Outside-In" view tells us what went wrong. Now we must open
the box to see why.
Applied Tip:
To build an output regression test with the Agent Development Kit (ADK), start
the ADK web UI (adk web) and interact with your agent. When you receive an ideal
response that you want to set as the benchmark, navigate to the Eval tab and click
"Add current session." This saves the entire interaction as an Eval Case (in a .test.
json file) and locks in the agent's current text as the ground truth final_response.
You can then run this Eval Set via the CLI (adk eval) or pytest to automatically
check future agent versions against this saved answer, catching any regressions in
output quality.
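If you prefer to drive this from code, the sketch below wires a saved Eval Set into pytest. It is a minimal sketch only: it assumes ADK's AgentEvaluator helper and illustrative file paths, and the import path and method signature may differ across ADK versions (recent releases expose an async variant that requires pytest-asyncio).

Python

# A minimal pytest sketch for running a saved ADK Eval Set as a regression test.
# The import path, signature, and file paths are assumptions that may differ by
# ADK version; see the ADK evaluation docs for the current API.
from google.adk.evaluation.agent_evaluator import AgentEvaluator


def test_no_regression_on_golden_session():
    AgentEvaluator.evaluate(
        agent_module="my_agent",  # hypothetical package containing your agent definition
        eval_dataset_file_path_or_dir="my_agent/evals/golden_session.test.json",
    )

Running pytest in CI then fails the build whenever the agent's saved benchmark (final response and, depending on configuration, its tool trajectory) is no longer reproduced.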
The "Inside-Out" View: Trajectory Evaluation (The Glass Box)
Once a failure is identified, we move to the "Inside-Out" view. We analyze the agent's
approach by systematically assessing every component of its execution trajectory:
1. LLM Planning (The "Thought"): We first check the core reasoning. Is the LLM itself the
problem? Failures here include hallucinations, nonsensical or off-topic responses, context
pollution, or repetitive output loops.
2. Tool Usage (Selection & Parameterization): An agent is only as good as its tools.
We must analyze if the agent is calling the wrong tool, failing to call a necessary tool,
hallucinating tool names or parameter names/types, or calling one unnecessarily. Even if it
selects the right tool, it can fail by providing missing parameters, incorrect data types, or
malformed JSON for the API call.
3. Tool Response Interpretation (The "Observation"): After a tool executes correctly, the
agent must understand the result. Agents frequently fail here by misinterpreting numerical
data, failing to extract key entities from the response, or, critically, not recognizing an
error state returned by the tool (e.g., an API's 404 error) and proceeding as if the call
was successful.
4. RAG Performance: If the agent uses Retrieval-Augmented Generation (RAG), the
trajectory depends on the quality of its retrieved information. Failures include irrelevant
document retrieval, fetching outdated or incorrect information, or the LLM ignoring the
retrieved context entirely and hallucinating an answer anyway.
5. Trajectory Efficiency and Robustness: Beyond correctness, we must evaluate the
process itself: exposing inefficient resource allocation, such as an excessive number of
API calls, high latency, or redundant efforts. It also reveals robustness failures, such as
unhandled exceptions.
6. Multi-Agent Dynamics: In advanced systems, trajectories involve multiple agents.
Evaluation must then also include inter-agent communication logs to check for
misunderstandings or communication loops and ensure agents are adhering to their
defined roles without conflicting with others.
By analyzing the trace, we can move from "the final answer is wrong" (Black Box) to "the final
answer is wrong because …." (Glass Box). This level of diagnostic power is the entire goal of
agent evaluation.
Applied Tip:
When you save an Eval Case (as described in the previous tip) in the ADK, it
also saves the entire sequence of tool calls as the ground truth trajectory. Your
automated pytest or adk eval run will then check this trajectory for a perfect match
(by default).
To manually implement process evaluation (i.e., debug a failure), use the Trace
tab in the adk web UI. This provides an interactive graph of the agent's execution,
allowing you to visually inspect the agent’s plan, see every tool it called with its exact
arguments, and compare its actual path against the expected path to pinpoint the
exact step where its logic failed.
The Evaluators: The Who and What of Agent Judgment
Knowing what to evaluate (the trajectory) is half the battle. The other half is how to judge
it. For nuanced aspects like quality, safety, and interpretability, this judgment requires a
sophisticated, hybrid approach. Automated systems provide scale, but human judgment
remains the crucial arbiter of quality.
Automated Metrics
Automated metrics provide speed and reproducibility. They are useful for regression testing
and benchmarking outputs. Examples include:
• String-based similarity (ROUGE, BLEU), comparing generated text to references.
• Embedding-based similarity (BERTScore, cosine similarity), measuring
semantic closeness.
• Task-specific benchmarks, e.g., TruthfulQA 2
Metrics are efficient but shallow: they capture surface similarity, not deeper reasoning or
user value.
Applied Tip:
Implement automated metrics as the first quality gate in your CI/CD pipeline. The key
is to treat them as trend indicators, not as absolute measures of quality. A specific
BERTScore of 0.8, for example, doesn't definitively mean the answer is "good."
Their real value is in tracking changes: if your main branch consistently averages a
0.8 BERTScore on your "golden set," and a new code commit drops that average to
0.6, you have automatically detected a significant regression. This makes metrics the
perfect, low-cost "first filter" to catch obvious failures at scale before escalating to
more expensive LLM-as-a-Judge or human evaluation.
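As a concrete illustration of this "trend, not absolute score" principle, here is a minimal CI-gate sketch built on the rouge-score package: it averages ROUGE-L over a small golden set and fails if the average drops noticeably below the recorded baseline. The golden set, baseline value, and regression threshold are illustrative assumptions, and any string- or embedding-based metric could be substituted.

Python

# A minimal CI gate sketch: fail the build if the average ROUGE-L score on the
# golden set drops noticeably below the recorded baseline. Golden set, baseline,
# and threshold are illustrative.
from rouge_score import rouge_scorer

GOLDEN_SET = [  # (prompt, human-curated reference answer) pairs
    ("Summarize the refund policy.", "Refunds are issued within 14 days of purchase."),
    ("What is the support email?", "Contact support@example.com for help."),
]

BASELINE = 0.80        # average score of the current production agent
MAX_REGRESSION = 0.05  # tolerated drop before the gate fails


def average_rouge_l(generate_answer) -> float:
    """generate_answer: any callable that maps a prompt to the agent's answer."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    scores = [
        scorer.score(reference, generate_answer(prompt))["rougeL"].fmeasure
        for prompt, reference in GOLDEN_SET
    ]
    return sum(scores) / len(scores)


def check_regression(generate_answer) -> None:
    avg = average_rouge_l(generate_answer)
    assert avg >= BASELINE - MAX_REGRESSION, (
        f"Golden-set ROUGE-L regressed: {avg:.2f} vs baseline {BASELINE:.2f}"
    )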
The LLM-as-a-Judge Paradigm
How can we automate the evaluation of qualitative outputs like "is this summary good?" or
"was this plan logical?" The answer is to use the same technology we are trying to evaluate.
The LLM-as-a-Judge 3 paradigm involves using a powerful, state-of-the-art model (like
Google's Gemini Advanced) to evaluate the outputs of another agent.
We provide the "judge" LLM with the agent's output, the original prompt, the "golden" answer
or reference (if one exists), and a detailed evaluation rubric (e.g., "Rate the helpfulness,
correctness, and safety of this response on a scale of 1-5, explaining your reasoning.").
This approach provides scalable, fast, and surprisingly nuanced feedback, especially for
intermediate steps like the quality of an agent's "Thought" or its interpretation of a tool
response. While it doesn't replace human judgment, it allows data science teams to rapidly
evaluate performance across thousands of scenarios, making an iterative evaluation
process feasible.
Applied Tip:
To implement this, prioritize pairwise comparison over single-scoring, which mitigates
common judge biases such as noisy, inconsistent absolute scores. First, run your
evaluation set of prompts against two different
agent versions (e.g., your old production agent vs. your new experimental one) to
generate an "Answer A" and "Answer B" for each prompt.
Then, create the LLM judge by giving a powerful LLM (like Gemini Pro) a clear rubric
and a prompt that forces a choice: "Given this User Query, which response is more
helpful: A or B? Explain your reasoning." By automating this process, you can scalably
calculate a win/loss/tie rate for your new agent. A high "win rate" is a far more reliable
signal of improvement than a small change in an absolute (and often noisy) 1-5 score.
A prompt for an LLM-as-a-Judge, especially for the robust pairwise comparison,
might look like this:
You are an expert evaluator for a customer support chatbot. Your goal is to
assess which of two responses is more helpful, polite, and correct.
[User Query]
"Hi, my order #12345 hasn't arrived yet."
[Answer A]
"I can see that order #12345 is currently out for delivery and should
arrive by 5 PM today."
[Answer B]
"Order #12345 is on the truck. It will be there by 5."
Please evaluate which answer is better. Compare them on correctness,
helpfulness, and tone. Provide your reasoning and then output your final
decision in a JSON object with a "winner" key (either "A", "B", or "tie")
and a "rationale" key.
Agent-as-a-Judge
While LLMs can score final responses, agents require deeper evaluation of their reasoning
and actions. The emerging Agent-as-a-Judge 4 paradigm uses one agent to evaluate the full
execution trace of another. Instead of scoring only outputs, it assesses the process itself. Key
evaluation dimensions include:
• Plan quality: Was the plan logically structured and feasible?
• Tool use: Were the right tools chosen and applied correctly?
• Context handling: Did the agent use prior information effectively?
This approach is particularly valuable for process evaluation, where failures often arise from
flawed intermediate steps rather than the final output.
Applied Tip:
To implement an Agent-as-a-Judge, consider feeding relevant parts of the execution
trace object to your judge. First, configure your agent framework to log and
export the trace, including the internal plan, the list of tools chosen, and the exact
arguments passed.
Then, create a specialized "Critic Agent" with a prompt (rubric) that asks it to evaluate
this trace object directly. Your prompt should ask specific process questions: "1. Based
on the trace, was the initial plan logical? 2. Was the {tool_A} tool the correct first
choice, or should another tool have been used? 3. Were the arguments correct and
properly formatted?" This allows you to automatically detect process failures (like an
inefficient plan), even when the agent produced a final answer that looked correct.
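The sketch below shows one way to turn an exported trace into a critic prompt and parse the verdict. The trace schema, the rubric, and the call_judge_model callable are hypothetical placeholders for whatever your framework actually logs and however you invoke your judge model.

Python

# A minimal Agent-as-a-Judge sketch: serialize the execution trace and ask a
# critic model process-level questions. The trace schema and `call_judge_model`
# are hypothetical placeholders for your own logging and judge client.
import json

CRITIC_RUBRIC = """You are a critic agent. Evaluate the following execution trace.
1. Was the initial plan logical and feasible?
2. Was each tool the correct choice, and were its arguments well formed?
3. Did the agent use prior context effectively?
Return JSON: {"plan_quality": 1-5, "tool_use": 1-5, "context_handling": 1-5, "issues": []}"""


def build_critic_prompt(trace: dict) -> str:
    # `trace` is assumed to contain the plan, tool calls, and observations, e.g.
    # {"plan": [...], "steps": [{"tool": ..., "args": ..., "observation": ...}]}
    return f"{CRITIC_RUBRIC}\n\n[Execution Trace]\n{json.dumps(trace, indent=2)}"


def evaluate_trace(trace: dict, call_judge_model) -> dict:
    """call_judge_model: any callable that sends a prompt to your judge LLM and returns text."""
    return json.loads(call_judge_model(build_critic_prompt(trace)))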
Human-in-the-Loop (HITL) Evaluation
While automation provides scale, it struggles with deep subjectivity and complex domain
knowledge. Human-in-the-Loop (HITL) evaluation is the essential process for capturing the
critical qualitative signals and nuanced judgments that automated systems miss.
We must, however, move away from the idea that human rating provides a perfect "objective
ground truth." For highly subjective tasks (like assessing creative quality or nuanced tone),
perfect inter-annotator agreement is rare. Instead, HITL is the indispensable methodology
for establishing a human-calibrated benchmark, ensuring the agent's behavior aligns with
complex human values, contextual needs, and domain-specific accuracy.
The HITL process involves several key functions:
• Domain Expertise: For specialized agents (e.g., medical, legal, or financial), you must
leverage domain experts to evaluate factual correctness and adherence to specific
industry standards.
• Interpreting Nuance: Humans are essential for judging the subtle qualities that
define a high-quality interaction, such as tone, creativity, user intent, and complex
ethical alignment.
• Creating the "Golden Set": Before automation can be effective, humans must establish
the "gold standard" benchmark. This involves curating a comprehensive evaluation set,
defining the objectives for success, and crafting a robust suite of test cases that cover
typical, edge, and adversarial scenarios.
Applied Tip:
For runtime safety, implement an interruption workflow. In a framework like ADK, you
can configure the agent to pause its execution before committing to a high-stakes
tool call (like execute_payment or delete_database_entry). The agent’s state
and planned action are then surfaced in a Reviewer UI, where a human operator must
manually approve or reject the step before the agent is allowed to resume.
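A framework-agnostic sketch of such an interruption gate is shown below. The with_human_gate wrapper, the review_queue client, and the tool names are hypothetical; in ADK, equivalent logic would typically hang off a before-tool callback or a long-running tool that waits for approval.

Python

# A minimal, framework-agnostic sketch of a human approval gate for high-stakes
# tool calls. `review_queue` and the tool names are hypothetical.
HIGH_STAKES_TOOLS = {"execute_payment", "delete_database_entry"}


class ApprovalRejected(Exception):
    pass


def with_human_gate(tool_name, tool_fn, review_queue):
    """Wrap a tool so high-stakes calls pause for human approval."""
    def gated(**kwargs):
        if tool_name in HIGH_STAKES_TOOLS:
            # Surfaces the planned action in the Reviewer UI and blocks until decided.
            ticket = review_queue.submit(tool=tool_name, args=kwargs)
            if not ticket.wait_for_decision():
                raise ApprovalRejected(f"Human reviewer rejected {tool_name}({kwargs})")
        return tool_fn(**kwargs)
    return gated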
User Feedback and Reviewer UI
Evaluation must also capture real-world user feedback. Every interaction is a signal of
usefulness, clarity, and trust. This feedback includes both qualitative signals (like thumbs up/
down) and quantitative in-product success metrics, such as pull request (PR) acceptance
rate for a coding agent, or successful booking completion rate for a travel agent. Best
practices include:
• Low-friction feedback: thumbs up/down, quick sliders, or short comments.
• Context-rich review: feedback should be paired with the full conversation and agent’s
reasoning trace.
• Reviewer User Interface (UI): a two-panel interface: conversation on the left, reasoning
steps on the right, with inline tagging for issues like “bad plan” or “tool misuse.”
• Governance dashboards: aggregate feedback to highlight recurring issues and risks.
Without usable interfaces, evaluation frameworks fail in practice. A strong UI makes user and
reviewer feedback visible, fast, and actionable.
Applied Tip:
Implement your user feedback system as an event-driven pipeline, not just a static
log. When a user clicks "thumbs down," that signal must automatically capture the full,
context-rich conversation trace and add it to a dedicated review queue within your
developer's Reviewer UI.
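A minimal sketch of that event-driven capture is shown below; trace_store and review_queue are hypothetical clients standing in for your observability backend and your Reviewer UI's queue.

Python

# A minimal sketch of an event-driven feedback pipeline: a thumbs-down event
# pulls the full trace and enqueues it for human review. `trace_store` and
# `review_queue` are hypothetical clients for your own backend.
from dataclasses import dataclass


@dataclass
class FeedbackEvent:
    session_id: str
    trace_id: str
    rating: str        # "thumbs_up" | "thumbs_down"
    comment: str = ""


def handle_feedback(event: FeedbackEvent, trace_store, review_queue) -> None:
    if event.rating != "thumbs_down":
        return  # positive feedback is only aggregated into metrics
    trace = trace_store.get_trace(event.trace_id)  # full conversation + reasoning steps
    review_queue.enqueue(
        {
            "session_id": event.session_id,
            "trace": trace,
            "user_comment": event.comment,
            "suggested_tags": ["needs_triage"],
        }
    )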
Beyond Performance: Responsible AI (RAI) &
Safety Evaluation
A final dimension of evaluation operates not as a component, but as a mandatory, non-
negotiable gate for any production agent: Responsible AI and Safety. An agent that is 100%
effective but causes harm is a total failure.
Evaluation for safety is a specialized discipline that must be woven into the entire
development lifecycle. This involves:
• Systematic Red Teaming: Actively trying to break the agent using adversarial scenarios.
This includes attempts to generate hate speech, reveal private information, propagate
harmful stereotypes, or induce the agent to engage in malicious actions.
• Automated Filters & Human Review: Implementing technical filters to catch policy
violations and coupling them with human review, as automation alone may not catch
nuanced forms of bias or toxicity.
• Adherence to Guidelines: Explicitly evaluating the agent’s outputs against
predefined ethical guidelines and principles to ensure alignment and prevent
unintended consequences.
Ultimately, performance metrics tell us if the agent can do the job, but safety evaluation tells
us if it should.
Applied Tip:
Implement your guardrails as a structured Plugin, rather than as isolated functions.
In this pattern, the callback is the mechanism (the hook provided by ADK), while the
Plugin is the reusable module you build.
For example, you can build a single SafetyPlugin class. This plugin would then
register its internal methods with the framework's available callbacks:
1. Your plugin's check_input_safety() method would register with the
before_model_callback. This method's job is to run your prompt
injection classifier.
2. Your plugin's check_output_pii() method would register with the
after_model_callback. This method's job is to run your PII scanner.
This plugin architecture makes your guardrails reusable, independently testable, and
cleanly layered on top of the foundation model's built-in safety settings (like those
in Gemini).
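A simplified sketch of such a plugin follows. The two method names mirror the callbacks described above, but the base class, the add_callback registration call, the injection classifier, and the PII patterns are assumptions; consult your framework's plugin interface for the exact API.

Python

# A simplified SafetyPlugin sketch. The callback names mirror the ones described
# above, but the registration mechanism, classifier, and PII patterns are
# assumptions -- adapt to your framework's plugin API.
import re


class SafetyPlugin:
    PII_PATTERNS = [re.compile(r"\b\d{3}-\d{2}-\d{4}\b")]  # e.g. US SSN-like strings

    def __init__(self, prompt_injection_classifier):
        # Any callable that returns a risk score between 0 and 1.
        self._classifier = prompt_injection_classifier

    def register(self, framework):
        # Hypothetical registration: hook methods onto the framework's callbacks.
        framework.add_callback("before_model_callback", self.check_input_safety)
        framework.add_callback("after_model_callback", self.check_output_pii)

    def check_input_safety(self, llm_request):
        if self._classifier(llm_request.prompt_text) > 0.8:
            raise ValueError("Blocked: likely prompt injection attempt.")

    def check_output_pii(self, llm_response):
        for pattern in self.PII_PATTERNS:
            llm_response.text = pattern.sub("[REDACTED]", llm_response.text)
        return llm_response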
Summary & What's Next
Effective agent evaluation requires moving beyond simple testing to a strategic, hierarchical
framework. This "Outside-In" approach first validates end-to-end task completion (the Black
Box) before analyzing the full trajectory within the "Glass Box"—assessing reasoning quality,
tool use, robustness, and efficiency.
Judging this process demands a hybrid approach: scalable automation like LLM-as-a-Judge,
paired with the indispensable, nuanced judgment of Human-in-the-Loop (HITL) evaluators.
This framework is secured by a non-negotiable layer of Responsible AI and safety evaluation
to build trustworthy systems.
We understand the need to judge the entire trajectory, but this framework is purely
theoretical without the data. To enable this "Glass Box" evaluation, the system must first be
observable. Chapter 3 will provide the architectural blueprint, moving from the theory of
evaluation to the practice of observability by mastering the three pillars: logging, tracing,
and metrics.
Observability: Seeing Inside the
Agent's Mind
From Monitoring to True Observability
In the last chapter, we established that AI Agents are a new breed of software. They don't
just follow instructions; they make decisions. This fundamental difference demands a new
approach to quality assurance, moving us beyond traditional software monitoring into the
deeper realm of observability.
To grasp the difference, let's leave the server room and step into a kitchen.
The Kitchen Analogy: Line Cook vs. Gourmet Chef
Traditional Software is a Line Cook: Imagine a fast-food kitchen. The line cook has a
laminated recipe card for making a burger. The steps are rigid and deterministic: toast bun
for 30 seconds, grill patty for 90 seconds, add one slice of cheese, two pickles, one squirt
of ketchup.
• Monitoring in this world is a checklist. Is the grill at the right temperature? Did the
cook follow every step? Was the order completed on time? We are verifying a known,
predictable process.
An AI Agent is a Gourmet Chef in a "Mystery Box" Challenge: The chef is given a
goal ("Create an amazing dessert") and a basket of ingredients (the user's prompt, data,
and available tools). There is no single correct recipe. They might create a chocolate lava
cake, a deconstructed tiramisu, or a saffron-infused panna cotta. All could be valid, even
brilliant, solutions.
• Observability is how a food critic would judge the chef. The critic doesn't just taste the
final dish. They want to understand the process and the reasoning. Why did the chef
choose to pair raspberries with basil? What technique did they use to crystallize the
ginger? How did they adapt when they realized they were out of sugar? We need to see
inside their "thought process" to truly evaluate the quality of their work.
This represents a fundamental shift for AI agents, moving beyond simple monitoring to
true observability. The focus is no longer on merely verifying if an agent is active, but
on understanding the quality of its cognitive processes. Instead of asking "Is the agent
running?", the critical question becomes "Is the agent thinking effectively?".
The Three Pillars of Observability
So, how do we get access to the agent's "thought process"? We can't read its mind
directly, but we can analyze the evidence it leaves behind. This is achieved by building
our observability practice on three foundational pillars: Logs, Traces, and Metrics. They
are the tools that allow us to move from tasting the final dish to critiquing the entire
culinary performance.
Figure 4: Three foundational pillars for Agent Observability
Let’s dissect each pillar and see how they work together to give us a critic's-eye view of our
agent's performance.
Pillar 1: Logging – The Agent's Diary
What are Logs? Logs are the atomic unit of observability. Think of them as timestamped
entries in your agent's diary. Each entry is a raw, immutable fact about a discrete event: "At
10:01:32, I was asked a question. At 10:01:33, I decided to use the get_weather tool." They tell
us what happened.
Beyond print(): What Makes a Log Effective?
A fully managed service like Google Cloud Logging allows you to store, search, and analyze
log data at scale. It can automatically collect logs from Google Cloud services, and its Log
Analytics capabilities allow you to run SQL queries to uncover trends in your agent's behavior.
A best-in-class framework makes this easy. For example, the Agent Development Kit (ADK) is
built on Python's standard logging module. This allows a developer to configure the desired
level of detail - from high-level INFO messages in production to granular DEBUG messages
during development - without changing the agent's code.
The Anatomy of a Critical Log Entry
To reconstruct an agent's "thought process," a log must be rich with context. A structured
JSON format is the gold standard.
• Core Information: A good log captures the full context: prompt/response pairs,
intermediate reasoning steps (the agent's "chain of thought", a concept explored by Wei
et al. (2022)), structured tool calls (inputs, outputs, errors), and any changes to the agent's
internal state.
• The Tradeoff: Verbosity vs. Performance: A highly detailed DEBUG log is a developer's
best friend for troubleshooting but can be too "noisy" and create performance overhead
in a production environment. This is why structured logging is so powerful; it allows you to
collect detailed data but filter it efficiently.
Here’s a practical example showing the power of a structured log, adapted from an ADK
DEBUG output:
Log output
// A structured log entry capturing a single LLM request
...
2025-07-10 15:26:13,778 - DEBUG - google_adk.google.adk.models.google_llm - Sending out
request, model: gemini-2.0-flash, backend: GoogleLLMVariant.GEMINI_API, stream: False
2025-07-10 15:26:13,778 - DEBUG - google_adk.google.adk.models.google_llm -
LLM Request:
-----------------------------------------------------------
System Instruction:
You roll dice and answer questions about the outcome of the dice rolls.....
The description about you is "hello world agent that can roll a dice of 8 sides and check
prime numbers."
-----------------------------------------------------------
Contents:
{"parts":[{"text":"Roll a 6 sided dice"}],"role":"user"}
{"parts":[{"function_call":{"args":{"sides":6},"name":"roll_die"}}],"role":"model"}
{"parts":[{"function_response":{"name":"roll_die","response":{"result":2}}}],"role":"user"}
-----------------------------------------------------------
Functions:
roll_die: {'sides': {'type': <Type.INTEGER: 'INTEGER'>}}
check_prime: {'nums': {'items': {'type': <Type.INTEGER: 'INTEGER'>}, 'type': <Type.ARRAY:
'ARRAY'>}}
-----------------------------------------------------------
2025-07-10 15:26:13,779 - INFO - google_genai.models - AFC is enabled with max remote
calls: 10.
2025-07-10 15:26:14,309 - INFO - google_adk.google.adk.models.google_llm -
LLM Response:
-----------------------------------------------------------
Text:
I have rolled a 6 sided die, and the result is 2.
...
Snippet 1: A structured log entry capturing a single LLM request
Applied Tip:
A powerful logging pattern is to record the agent's intent before an action and the
outcome after. This immediately clarifies the difference between a failed attempt and a
deliberate decision not to act.
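Here is a minimal sketch of that intent/outcome pattern using Python's standard logging module with JSON-formatted messages; the event and field names are illustrative.

Python

# A minimal intent/outcome logging sketch using the standard logging module.
# Field names are illustrative; emitting them as JSON keeps them filterable
# and aggregatable later.
import json
import logging

logger = logging.getLogger("agent")


def log_event(event_type: str, **fields) -> None:
    logger.info(json.dumps({"event": event_type, **fields}, default=str))


def call_tool(tool_name: str, tool_fn, **args):
    # Record the intent before acting...
    log_event("tool_intent", tool=tool_name, args=args, reason="agent planned this step")
    try:
        result = tool_fn(**args)
        # ...and the outcome after.
        log_event("tool_outcome", tool=tool_name, status="success", result=str(result)[:200])
        return result
    except Exception as exc:
        log_event("tool_outcome", tool=tool_name, status="error", error=str(exc))
        raise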
Pillar 2: Tracing – Following the Agent's Footsteps
What is Tracing? If logs are diary entries, traces are the narrative thread that connects
them into a coherent story. Tracing follows a single task - from the initial user query to the
final answer - stitching together individual logs (called spans) into a complete, end-to-end
view. Traces reveal the crucial "why" by showing the causal relationship between events.
Imagine a detective's corkboard. Logs are the individual clues - a photo, a ticket stub. A trace
is the red yarn connecting them, revealing the full sequence of events.
Why Tracing is Indispensable
Consider a complex agent failure where a user asks a question and gets a
nonsensical answer.
• Isolated Logs might show: ERROR: RAG search failed and ERROR: LLM response
failed validation. You see the errors, but the root cause is unclear.
• A Trace reveals the full causal chain: User Query → RAG Search (failed) →
Faulty Tool Call (received null input) → LLM Error (confused by bad
tool output) → Incorrect Final Answer
The trace makes the root cause instantly obvious, making it indispensable for debugging
complex, multi-step agent behaviors.
Key Elements of an Agent Trace
Modern tracing is built on open standards like OpenTelemetry. The core components are:
• Spans: The individual, named operations within a trace (e.g., an llm_call span, a
tool_execution span).
• Attributes: The rich metadata attached to each span - prompt_id, latency_ms,
token_count, user_id, etc.
• Context Propagation: The "magic" that links spans together via a unique trace_id,
allowing backends like Google Cloud Trace to assemble the full picture. Cloud Trace is a
distributed tracing system that helps you understand how long it takes for your application
to handle requests. When an agent is deployed on a managed runtime like Vertex AI Agent
Engine, this integration is streamlined. The Agent Engine handles the infrastructure for
scaling agents in production and automatically integrates with Cloud Trace to provide end-
to-end observability, linking the agent invocation with all subsequent model and tool calls.
Figure 5: OpenTelemetry view lets you inspect attributes, logs, events, and other details
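To make these elements concrete, the sketch below uses the OpenTelemetry Python API to create a parent span for the agent invocation and child spans for a tool call and an LLM call. Exporter and backend configuration (for example, shipping spans to Cloud Trace) is omitted, and the attribute names and values are illustrative.

Python

# A minimal OpenTelemetry sketch: a parent span for the agent invocation and
# child spans for one tool call and one LLM call. Exporter/backend setup is
# omitted; attribute names and values are illustrative.
from opentelemetry import trace

tracer = trace.get_tracer("agent.tracing.demo")


def answer_query(user_query: str) -> str:
    with tracer.start_as_current_span("agent_invocation") as root:
        root.set_attribute("user_query", user_query)

        with tracer.start_as_current_span("tool_execution") as span:
            span.set_attribute("tool.name", "get_weather")
            span.set_attribute("tool.args", '{"city": "Zurich"}')
            tool_result = "sunny, 22 C"  # stand-in for the real tool call
            span.set_attribute("tool.result", tool_result)

        with tracer.start_as_current_span("llm_call") as span:
            span.set_attribute("token_count", 512)  # normally read from the API response
            final_answer = f"The weather is {tool_result}."

        root.set_attribute("final_answer", final_answer)
        return final_answer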
Pillar 3: Metrics – The Agent's Health Report
What are Metrics? If logs are the chef's prep notes and traces are the critic watching the
recipe unfold step-by-step, then metrics are the final scorecard the critic publishes. They
are the quantitative, aggregated health scores that give you an immediate, at-a-glance
understanding of your agent's overall performance.
Crucially, a food critic doesn't just invent these scores based on a single taste of the final
dish. Their judgment is informed by everything they observe. Metrics are the same: they are
not a new source of data. They are derived by aggregating the data from your logs and
traces over time. They answer the question, "How well did the performance go, on average?"
For AI Agents, it's useful to divide metrics into two distinct categories: the directly
measurable System Metrics and the more complex, evaluative Quality Metrics.
System Metrics: The Vital Signs
System Metrics are the foundational, quantitative measures of operational health. They
are directly calculated from the attributes on your logs and traces through aggregation
functions (like average, sum, or percentile). Think of these as the agent's vital signs: its pulse,
temperature, and blood pressure.
Key System Metrics to track include:
• Performance:
• Latency (P50/P99): Calculated by aggregating the duration_ms attribute from traces
to find the median and 99th percentile response times. This tells you about the typical
and worst-case user experience.
• Error Rate: The percentage of traces that contain a span with an error=true attribute.
• Cost:
• Tokens per Task: The average of the token_count attribute across all traces, which is
vital for managing LLM costs.
• API Cost per Run: By combining token counts with model pricing, you can track the
average financial cost per task.
• Effectiveness:
• Task Completion Rate: The percentage of traces that successfully reach a designated
"success" span.
• Tool Usage Frequency: A count of how often each tool (e.g., get_weather) appears as
a span name, revealing which tools are most valuable.
These metrics are essential for operations, setting alerts, and managing the cost and
performance of your agent fleet.
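The sketch below shows how these aggregates can be derived from exported trace records with nothing more than the standard library; the record field names (duration_ms, error, token_count, reached_success) are illustrative stand-ins for your actual trace attributes.

Python

# A small sketch of deriving System Metrics from exported trace records.
# The record fields are illustrative stand-ins for real trace attributes.
import statistics


def system_metrics(trace_records: list[dict]) -> dict:
    durations = sorted(r["duration_ms"] for r in trace_records)
    n = len(durations)
    return {
        "latency_p50_ms": statistics.median(durations),
        "latency_p99_ms": durations[min(n - 1, int(0.99 * n))],  # simple percentile approximation
        "error_rate": sum(r.get("error", False) for r in trace_records) / n,
        "avg_tokens_per_task": statistics.mean(r["token_count"] for r in trace_records),
        "task_completion_rate": sum(r.get("reached_success", False) for r in trace_records) / n,
    }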
Quality Metrics: Judging the Decision-Making
Quality Metrics are second-order metrics derived by applying the judgment frameworks
detailed in Chapter 2 on top of the raw observability data. They move beyond efficiency to
assess the agent's reasoning and final output quality itself.
Unlike System Metrics, these are not simple counters or averages; they require a judgment
layer applied on top of the raw observability data.
Examples of critical Quality Metrics include:
• Correctness & Accuracy: Did the agent provide a factually correct answer? If it
summarized a document, was the summary faithful to the source?
• Trajectory Adherence: Did the agent follow the intended path or "ideal recipe" for a
given task? Did it call the right tools in the right order?
• Safety & Responsibility: Did the agent's response avoid harmful, biased, or
inappropriate content?
• Helpfulness & Relevance: Was the agent's final response actually helpful to the user and
relevant to their query?
Generating these metrics requires more than a simple database query. It often involves
comparing the agent's output against a "golden" dataset or using a sophisticated LLM-as-a-
Judge to score the response against a rubric.
The observability data from our logs and traces is the essential evidence needed to calculate
these scores, but the process of judgment itself is a separate, critical discipline.
Putting It All Together: From Raw Data to Actionable Insights
Having logs, traces, and metrics is like having a talented chef, a well-stocked pantry, and a
judging rubric. But these are just the components. To run a successful restaurant, you need
to assemble them into a working system for a busy dinner service. This section is about that
practical assembly - turning your observability data into real-time actions and insights during
live operations.
This involves three key operational practices:
1. Dashboards & Alerting: Separating System Health from Model Quality
A single dashboard is not enough. To effectively manage an AI agent, you need distinct
views for your System Metrics and your Quality Metrics, as they serve different purposes
and different teams.
• Operational Dashboards (for System Metrics): This dashboard category focuses
on real-time operational health. It tracks the agent's core vital signs and is primarily
intended for Site Reliability Engineers (SREs), DevOps, and operations teams
responsible for system uptime and performance.
• What it tracks: P99 Latency, Error Rates, API Costs, Token Consumption.
• Purpose: To immediately spot system failures, performance degradation, or
budget overruns.
• Example Alert: ALERT: P99 latency > 3s for 5 minutes. This indicates a
system bottleneck that requires immediate engineering attention.
• Quality Dashboards (for Quality Metrics): This category tracks the more nuanced,
slower-moving indicators of agent effectiveness and correctness. It is essential for
product owners, data scientists, and AgentOps teams who are responsible for the
quality of the agent's decisions and outputs.
• What it tracks: Factual Correctness Score, Trajectory Adherence, Helpfulness
Ratings, Hallucination Rate.
• Purpose: To detect subtle drifts in agent quality, especially after a new model or
prompt is deployed.
• Example Alert: ALERT: 'Helpfulness Score' has dropped by 10% over
the last 24 hours. This signals that while the system may be running fine
(System Metrics are OK), the quality of the agent's output is degrading, requiring an
investigation into its logic or data.
2. Security & PII: Protecting Your Data
This is a non-negotiable aspect of production operations. User inputs captured in logs
and traces often contain Personally Identifiable Information (PII). A robust PII scrubbing
mechanism must be an integrated part of your logging pipeline before data is stored long-
term to ensure compliance with privacy regulations and protect your users.
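One lightweight way to enforce this inside a Python logging pipeline is a filter that redacts obvious PII patterns before records are persisted, as sketched below. The regexes are illustrative only; production systems typically delegate detection to a dedicated inspection service (such as Cloud DLP) with far broader coverage.

Python

# A lightweight sketch of PII scrubbing as a logging filter. The regexes are
# illustrative; real pipelines usually rely on a dedicated inspection service.
import logging
import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


class PiiScrubbingFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()
        for label, pattern in PII_PATTERNS.items():
            message = pattern.sub(f"[REDACTED_{label.upper()}]", message)
        record.msg, record.args = message, ()  # replace the formatted message
        return True                            # keep the (scrubbed) record


logging.getLogger("agent").addFilter(PiiScrubbingFilter())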
3. The Core Trade-off: Granularity vs. Overhead
Capturing highly detailed logs and traces for every single request in production
can be prohibitively expensive and add latency to your system. The key is to find a
strategic balance.
• Best Practice - Dynamic Sampling: Use high-granularity logging (DEBUG level) in
development environments. In production, set a lower default log level (INFO) but
implement dynamic sampling. For example, you might decide to trace only 10% of
successful requests but 100% of all errors. This gives you broad performance data for
your metrics without overwhelming your system, while still capturing the rich diagnostic
detail you need to debug every failure.
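The sampling decision itself can be a few lines, as in the sketch below: every error trace is exported in full, while successful traces are sampled at a configurable rate (the 10% figure is purely illustrative).

Python

# A minimal dynamic-sampling sketch: always keep error traces, keep only a
# fraction of successful ones.
import random

SUCCESS_SAMPLE_RATE = 0.10  # illustrative


def should_export_trace(had_error: bool) -> bool:
    if had_error:
        return True  # 100% of failures keep full diagnostic detail
    return random.random() < SUCCESS_SAMPLE_RATE  # ~10% of successes for baseline metrics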
Summary & What's Next
To trust an autonomous agent, you must first be able to understand its process. You wouldn't
judge a gourmet chef's final dish without having some insight into their recipe, technique,
and decision-making along the way. This chapter has established that Observability is the
framework that gives us this crucial insight into our agents. It provides the "eyes and ears"
inside the kitchen.
We've learned that a robust observability practice is built upon three foundational pillars,
which work together to transform raw data into a complete picture:
• Logs: The structured diary, providing the granular, factual record of what happened at
every step.
• Traces: The narrative story that connects individual logs, showing the causal path to
reveal why it happened.
• Metrics: The aggregated report card, summarizing performance at scale to tell us how
well it happened. We further divided these into vital System Metrics (like latency and
cost) and crucial Quality Metrics (like correctness and helpfulness).
By assembling these pillars into a coherent operational system, we move from flying blind to
having a clear, data-driven view of our agent's behavior, efficiency, and effectiveness.
We now have all the pieces: the why (the problem of non-determinism in Chapter 1), the
what (the evaluation framework in Chapter 2), and the how (the observability architecture in
Chapter 3).
In Chapter 4, we will bring this all together into a single, operational playbook, showing how
these components form the "Agent Quality Flywheel" - a continuous improvement loop to
build agents that are not just capable, but truly trustworthy.
Conclusion: Building Trust in an
Autonomous World
Introduction: From Autonomous Capability to
Enterprise Trust
In the opening of this whitepaper, we posed a fundamental challenge: AI agents, with their
non-deterministic and autonomous nature, shatter our traditional models of software quality.
We likened the task of assessing an agent to evaluating a new employee - you don't just ask
if the task was done, you ask how it was done. Was it efficient? Was it safe? Did it create a
good experience? Flying blind is not an option when the consequence is business risk.
The journey since that opening has been about building the blueprint for trust in this new
paradigm. We established the need for a new discipline by defining the Four Pillars of Agent
Quality: Effectiveness, Efficiency, Robustness, and Safety & Alignment. We then showed how to
gain "eyes and ears" inside the agent's mind through Observability (Chapter 3) and how
to judge its performance with a holistic Evaluation framework (Chapter 2). This paper has
laid the foundation for what to measure and how to see it. The critical next step, covered
in the subsequent whitepaper, "Day 5: Prototype to Production" is to operationalize
these principles. This involves taking an evaluated agent and successfully running it in
a production environment through robust CI/CD pipelines, safe rollout strategies, and
scalable infrastructure.
Now, we bring it all together. This isn't just a summary; it's the operational playbook for
turning abstract principles into a reliable, self-improving system, bridging the gap between
evaluation and production.
The Agent Quality Flywheel: A Synthesis of the Framework
A great agent doesn't just perform; it improves. This discipline of continuous evaluation is
what separates a clever demo from an enterprise-grade system. This practice creates a
powerful, self-reinforcing system we call the Agent Quality Flywheel.
Think of it like starting a massive, heavy flywheel. The first push is the hardest. But the
structured practice of evaluation provides subsequent, consistent pushes. Each push adds
to the momentum until the wheel is spinning with unstoppable force, creating a virtuous cycle
of quality and trust. This flywheel is the operational embodiment of the entire framework
we've discussed.
Figure 6: The Agent Quality Flywheel
Here’s how the components from each chapter work together to build that momentum:
• Step 1: Define Quality (The Target): A flywheel needs a direction. As we defined in
Chapter 1, it all starts with the Four Pillars of Quality: Effectiveness, Cost-Efficiency, Safety,
and User Trust. These pillars are not abstract ideals; they are the concrete targets that
give our evaluation efforts meaning and align the flywheel with true business value.
• Step 2: Instrument for Visibility (The Foundation): You cannot manage what you
cannot see. As detailed in our chapter on Observability, we must instrument our agents to
produce structured Logs (the agent’s diary) and end-to-end Traces (the narrative thread).
This observability is the foundational practice that generates the rich evidence needed to
measure our Four Pillars, providing the essential fuel for the flywheel.
• Step 3: Evaluate the Process (The Engine): With visibility established, we can now judge
performance. As explored in our Evaluation chapter, this involves a strategic "outside-in"
assessment, judging both the final Output and the entire reasoning Process. This is the
powerful push that spins the wheel - a hybrid engine using scalable LLM-as-a-Judge
systems for speed and the Human-in-the-Loop (HITL) "gold standard" for ground truth.
• Step 4: Architect the Feedback Loop (The Momentum): This is where the "evaluatable-
by-design" architecture from Chapter 1 comes to life. By building the critical feedback
loop, we ensure that every production failure, when captured and annotated, is
programmatically converted into a permanent regression test in our "Golden" Evaluation
Set. Every failure makes the system smarter, spinning the flywheel faster and driving
relentless, continuous improvement.
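Building directly on Step 4, the sketch below shows one minimal way, under simplifying assumptions, that an annotated production failure could be converted into a permanent regression case in a "Golden" Evaluation Set. The JSONL file, field names, and the trivial keyword-matching stand-in for an LLM-as-a-Judge are hypothetical illustrations, not a prescribed format.

```python
# Illustrative sketch of the feedback loop: annotated failure -> golden-set regression case.
import json
from pathlib import Path

GOLDEN_SET = Path("golden_eval_set.jsonl")  # hypothetical storage for regression cases


def capture_failure(trace: dict, annotation: dict) -> dict:
    """Turn a reviewed failure (trace + human annotation) into a regression case."""
    return {
        "case_id": trace["trace_id"],
        "input": trace["user_query"],
        "failure_mode": annotation["failure_mode"],   # e.g. "ignored order id"
        "expected_behavior": annotation["expected"],  # the human-written rubric
    }


def add_to_golden_set(case: dict) -> None:
    """Append the case so every future evaluation run replays this failure."""
    with GOLDEN_SET.open("a") as f:
        f.write(json.dumps(case) + "\n")


def judge(case: dict, agent_answer: str) -> bool:
    """Stand-in for an LLM-as-a-Judge call: here, a trivial keyword check."""
    return all(word in agent_answer.lower() for word in case["expected_behavior"].lower().split())


if __name__ == "__main__":
    failing_trace = {"trace_id": "t-042", "user_query": "Cancel my order #981"}
    reviewer_note = {"failure_mode": "ignored order id", "expected": "cancel order 981"}
    case = capture_failure(failing_trace, reviewer_note)
    add_to_golden_set(case)
    print("Passes regression?", judge(case, "I have cancelled order 981 for you."))
```

The important design choice is that the human annotation supplies the expected behavior (the rubric), while the automated judge merely replays it at scale on every future evaluation run.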
Three Core Principles for Building Trustworthy Agents
If you take nothing else away from this whitepaper, let it be these three principles. They
represent the foundational mindset for any leader aiming to build truly reliable autonomous
systems in this new agentic era.
• Principle 1: Treat Evaluation as an Architectural Pillar, Not a Final Step: Remember
the race car analogy from Chapter 1? You don't build a Formula 1 car and then bolt on
sensors. You design it from the ground up with telemetry ports. Agentic workloads
demand the same DevOps paradigm. Reliable agents are "evaluatable-by-design,"
instrumented from the first line of code to emit the logs and traces essential for judgment.
Quality is an architectural choice, not a final QA phase.
• Principle 2: The Trajectory is the Truth: For agents, the final answer is merely the last
sentence of a long story. As we established in our Evaluation chapter, the true measure
of an agent's logic, safety, and efficiency lies in its end-to-end "thought process" - the
trajectory. This is Process Evaluation. To truly understand why an agent succeeded or
failed, you must analyze this path. This is only possible through the deep Observability
practices we detailed in Chapter 3 (a small trajectory-scoring sketch follows this list).
• Principle 3: The Human is the Arbiter: Automation is our tool for scale; humanity is
our source of truth. Automation, from LLM-as-a-Judge systems to safety classifiers,
is essential. However, as established in our deep dive on Human-in-the-Loop (HITL)
evaluation, the fundamental definition of "good," the validation of nuanced outputs, and
the final judgment on safety and fairness must be anchored to human values. An AI can
help grade the test, but a human writes the rubric and decides what an 'A+' really means.
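As a small illustration of Principle 2, the sketch below scores how closely an agent's executed tool-call sequence follows a reference trajectory. The step names and the use of simple in-order sequence matching are illustrative assumptions; a production evaluation would typically combine such structural checks with rubric-based or LLM-as-a-Judge assessment of each step.

```python
# Illustrative trajectory scoring: compare executed steps against a reference path.
from difflib import SequenceMatcher


def trajectory_score(actual_steps: list[str], reference_steps: list[str]) -> float:
    """Return a 0..1 similarity between the executed and the reference trajectory."""
    return SequenceMatcher(None, actual_steps, reference_steps).ratio()


if __name__ == "__main__":
    reference = ["lookup_customer", "check_refund_policy", "issue_refund", "send_confirmation"]
    actual = ["lookup_customer", "issue_refund", "send_confirmation"]  # skipped the policy check
    print(f"Trajectory match: {trajectory_score(actual, reference):.2f}")  # < 1.0 flags the deviation
```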
The Future is Agentic - and Reliable
We are at the dawn of the agentic era. The ability to create AI that can reason, plan, and act
will be one of the most transformative technological shifts of our time. But with great power
comes the profound responsibility to build systems that are worthy of our trust.
Mastering the concepts in this whitepaper - what one can call "Evaluation Engineering"
- is the key competitive differentiator for the next wave of AI. Organizations that continue
to treat agent quality as an afterthought will be stuck in a cycle of promising demos and
failed deployments. In contrast, those who invest in this rigorous, architecturally-integrated
approach to evaluation will be the ones who move beyond the hype to deploy truly
transformative, enterprise-grade AI systems.
The ultimate goal is not just to build agents that work, but to build agents that are trusted.
And that trust, as we have shown, is not a matter of hope or chance. It is forged in the
crucible of continuous, comprehensive, and architecturally-sound evaluation.
References
Academic Papers, Books, & Formal Reports
1. Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., ... & Rocktäschel, T. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems, 33, 9459-9474.
2. Lin, S., Hilton, J., & Evans, O. (2022). TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 3214–3252).
3. Li, D., Jiang, B., Huang, L., Beigi, A., Zhao, C., Tan, Z., ... & Liu, H. (2024). From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge. arXiv preprint arXiv:2411.16594.
4. Zhuge, M., Wang, M., Shen, X., Zhang, Y., Wang, Y., Zhang, C., ... & Liu, N. (2024). Agent-as-a-Judge: Evaluate Agents with Agents. arXiv preprint arXiv:2410.10934.
Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., & Mané, D. (2016). Concrete Problems in AI Safety. arXiv preprint arXiv:1606.06565.
Baysan, M. S., Uysal, S., İşlek, İ., Çığ Karaman, Ç., & Güngör, T. (2025). LLM-as-a-Judge: automated evaluation of search query parsing using large language models. Frontiers in Big Data, 8. Available at: https://doi.org/10.3389/fdata.2025.1611389.
Felderer, M., & Ramler, R. (2021). Quality Assurance for AI-Based Systems: Overview and Challenges. In Software Quality: The Complexity and Challenges of Software Engineering and Software Quality in the Cloud (pp. 38-51). Springer International Publishing.
Hendrycks, D., Carlini, N., Schulman, J., & Steinhardt, J. (2023). Unsolved Problems in ML Safety. arXiv preprint arXiv:2306.04944.
Ji, Z., Lee, N., Fries, R., Yu, T., Su, D., Xu, Y., ... & Fung, P. (2023). AI-generated text: A survey of tasks, evaluation criteria, and methods. arXiv preprint arXiv:2303.07233.
Lin, C. Y. (2004). ROUGE: A package for automatic evaluation of summaries. In Proceedings of the ACL-04 Workshop on Text Summarization Branches Out (pp. 74-81).
National Institute of Standards and Technology. (2023). AI Risk Management Framework (AI RMF 1.0). U.S. Department of Commerce.
Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 311-318).
Retzlaff, C., Das, S., Wayllace, C., Mousavi, P., Afshari, M., Yang, T., ... & Holzinger, A. (2024). Human-in-the-Loop Reinforcement Learning: A Survey and Position on Requirements, Challenges, and Opportunities. Journal of Artificial Intelligence Research, 79, 359-415.
Slattery, F., Costello, E., & Holland, J. (2024). A taxonomy of risks posed by language models. arXiv preprint arXiv:2401.12903.
Taylor, M. E. (2023). Reinforcement Learning Requires Human-in-the-Loop Framing and Approaches. Paper presented at the Adaptive and Learning Agents (ALA) Workshop 2023.
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., ... & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35, 24824-24837.
Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., & Artzi, Y. (2020). BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.
Web Articles, Blog Posts, & General Web Pages
Bunnyshell. (n.d.). LLM-as-a-Judge: How AI Can Evaluate AI Faster and Smarter. Retrieved September 16, 2025, from https://www.bunnyshell.com/blog/when-ai-becomes-the-judge-understanding-llm-as-a-j/.
Coralogix. (n.d.). OpenTelemetry for AI: Tracing Prompts, Tools, and Inferences. Retrieved September 16, 2025, from https://coralogix.com/ai-blog/opentelemetry-for-ai-tracing-prompts-tools-and-inferences/.
Drapkin, A. (2025, September 2). AI Gone Wrong: The Errors, Mistakes, and Hallucinations of AI (2023 – 2025). Tech.co. Retrieved September 16, 2025, from https://tech.co/news/list-ai-failures-mistakes-errors.
Dynatrace. (n.d.). What is OpenTelemetry? An open-source standard for logs, metrics, and traces. Retrieved September 16, 2025, from https://www.dynatrace.com/news/blog/what-is-opentelemetry/.
Galileo. (n.d.). Comprehensive Guide to LLM-as-a-Judge Evaluation. Retrieved September 16, 2025, from https://galileo.ai/blog/llm-as-a-judge-guide-evaluation.
Gofast.ai. (n.d.). Agent Hallucinations in the Real World: When AI Tools Go Wrong. Retrieved September 16, 2025, from https://www.gofast.ai/blog/ai-bias-fairness-agent-hallucinations-validation-drift-2025.
IBM. (2025, February 25). What is LLM Observability? Retrieved September 16, 2025, from https://www.ibm.com/think/topics/llm-observability.
MIT Sloan Teaching & Learning Technologies. (n.d.). When AI Gets It Wrong: Addressing AI Hallucinations and Bias. Retrieved September 16, 2025, from https://mitsloanedtech.mit.edu/ai/basics/addressing-ai-hallucinations-and-bias/.
ResearchGate. (n.d.). (PDF) A Survey on LLM-as-a-Judge. Retrieved September 16, 2025, from https://www.researchgate.net/publication/386112851_A_Survey_on_LLM-as-a-Judge.
TrustArc. (n.d.). The National Institute of Standards and Technology (NIST) Artificial Intelligence Risk Management. Retrieved September 16, 2025, from https://trustarc.com/regulations/nist-ai-rmf/.