Agents Companion

Authors: Antonio Gulli, Lavi Nigam, Julia Wiesinger, Vladimir Vuskovic, Irina Sigler, Ivan Nardini, Nicolas Stroppa, Sokratis Kartakis, Narek Saribekyan, Anant Nawalgaria, and Alan Bount
Acknowledgements

Editors & curators: Anant Nawalgaria
Content contributors: Steven Johnson, Hussain Chinoy
Designer: Michael Lanning

February 2025
Table of contents

Introduction
AgentOps
  Agent Success Metrics
Agent Evaluation
  Assessing Agent Capabilities
  Evaluating Trajectory and Tool Use
  Evaluating the Final Response
  Human-in-the-Loop Evaluation
  More about Agent Evaluation
Multiple Agents & Their Evaluation
  Understanding Multi-Agent Architectures
  Multi-Agent Design Patterns and Their Business Impact
  Important components of Agents
  Challenges in Multi-Agent systems
  Multi-Agent Evaluation
Agentic RAG: A Critical Evolution in Retrieval-Augmented Generation
  Agentic RAG and its Importance
  Better Search, Better RAG
Agents in the enterprise
  Manager of agents
  Google Agentspace
  NotebookLM Enterprise
  Google AgentSpace Enterprise
From agents to contractors
  Contracts
  Contract Lifecycle
  Contract execution
  Contract Negotiation
  Contract Feedback
  Subcontracts
Automotive AI: Real World Use of Multi-Agent Architecture
  Specialized Agents
    Conversational Navigation Agent
    Conversational Media Search Agent
    Message Composition Agent
    Car Manual Agent
    General Knowledge Agent
  Patterns in Use
    Hierarchical Pattern
    Diamond Pattern
    Peer-to-Peer
    Collaborative Pattern
    Response Mixer Agent
    Adaptive Loop Pattern
  Advantages of Multi-Agent Architecture for Automotive AI
Agent Builder
Summary
Endnotes
The future of AI is agentic.

Introduction

Generative AI agents mark a leap forward from traditional, standalone language models, offering a dynamic approach to problem-solving and interaction. As defined in the original Agents paper, an agent is an application engineered to achieve specific objectives by perceiving its environment and strategically acting upon it using the tools at its disposal. The fundamental principle of an agent lies in its synthesis of reasoning, logic, and access to external information, enabling it to perform tasks and make decisions beyond the inherent capabilities of the underlying model. These agents possess the capacity for autonomous operation, independently pursuing their goals and proactively determining subsequent actions, often without explicit instructions.
The architecture of an agent is composed of three essential elements that drive its behavior and decision-making:

• Model: Within the agent's framework, the term "model" pertains to the language model (LM) that functions as the central decision-making unit, employing instruction-based reasoning and logical frameworks. The model can vary from general-purpose to multimodal or fine-tuned, depending on the agent's specific requirements.

• Tools: Tools are critical for bridging the divide between the agent's internal capabilities and the external world, facilitating interaction with external data and services. These tools empower agents to access and process real-world information. Tools can include extensions, functions, and data stores. Extensions bridge the gap between an API and an agent, enabling agents to seamlessly execute APIs. Functions are self-contained modules of code that accomplish specific tasks. Data stores provide access to dynamic and up-to-date information, ensuring a model's responses remain grounded in factuality and relevance.

• Orchestration layer: The orchestration layer is a cyclical process that dictates how the agent assimilates information, engages in internal reasoning, and leverages that reasoning to inform its subsequent action or decision. This layer is responsible for maintaining memory, state, reasoning, and planning. It employs prompt engineering frameworks to steer reasoning and planning, facilitating more effective interaction with the environment and task completion. Reasoning techniques such as ReAct, Chain-of-Thought (CoT), and Tree-of-Thoughts (ToT) can be applied within this layer.

Building on these foundational concepts, this companion paper is designed for developers and serves as a "102" guide to more advanced topics. It offers in-depth explorations of agent evaluation methodologies and practical applications of Google agent products for enhancing agent capabilities in solving complex, real-world problems.
While exploring these theoretical concepts, we'll examine how they manifest in real-world implementations, with a particular focus on automotive AI as a compelling case study. The automotive domain exemplifies the challenges and opportunities of multi-agent architectures in production environments. Modern vehicles demand conversational interfaces that function with or without connectivity, balance on-device and cloud processing for both safety and user experience, and seamlessly coordinate specialized capabilities across navigation, media control, messaging, and vehicle systems. Through this automotive lens, we'll see how different coordination patterns (hierarchical, collaborative, and peer-to-peer) come together to create robust, responsive user experiences in environments with significant constraints. This case study illustrates the practical application of multi-agent systems that businesses across industries can adapt to their specific domains.

Anyone who has built with gen AI quickly realizes it's easy to get from an idea to a proof-of-concept, but it can be quite difficult to ensure high-quality results and get to production; gen AI agents are no exception. Quality and reliability are the most cited concerns for deploying to production, and the "AgentOps" process is a solution to optimize agent building.

AgentOps

Over the past two years, the field of Generative AI (GenAI) has undergone significant changes, with enterprise customers focusing on how to operationalize related solutions. This has resulted in various terms describing the operationalization of GenAI, such as MLOps for GenAI, LLMOps, FMOps, and GenAIOps. Agent and Operations (AgentOps) is a subcategory of GenAIOps that focuses on the efficient operationalization of agents. Its main additional components include internal and external tool management, agent brain prompt (goal, profile, instructions) and orchestration, memory, and task decomposition.
Figure 1: Relationship between DevOps, MLOps, and AgentOps. 13

Each of these "Ops" requires capabilities like version control, automated deployments through CI/CD, testing, logging, security, and (critically) metrics. Each system typically implements some form of metrics-driven optimization: measuring what your system is and isn't doing, measuring outcomes and business metrics, automating the collection of more holistic metrics, and improving incrementally, step by step. This practice might be called "A/B experimentation," "MLOps," or "metrics-driven development," but all of these derive from the same general approach, and we will rely on those principles for AgentOps as well.

Remember that new practices don't replace the old. DevOps and MLOps best practices are still necessary for AgentOps, as they are dependencies. For example, agent tool use, where APIs are invoked based on agent orchestration, often uses the same APIs you would invoke with non-agentic software. Authentication and secret management, security, privacy, exception handling, throttling, quotas, and scalability are still critical and require careful API design in addition to agent design.

Let's define these "Ops" terms to help distinguish between them:

• Development and Operations (DevOps) is the practice of efficiently productionizing deterministic software applications by integrating the elements of people, processes, and technology. DevOps serves as the foundation for all the following terms.

• Machine Learning Operations (MLOps) builds upon the capabilities of DevOps and concentrates on the efficient productionization of ML models. The primary distinction is that the output of an ML model is non-deterministic and relies on the input data (garbage in, garbage out).

• Foundation Model Operations (FMOps) expands upon the capabilities of MLOps and focuses on the efficient productionization of pre-trained (trained from scratch) or customized (fine-tuned) FMs.

• Prompt and Operations (PromptOps) is a subcategory of GenAIOps that focuses on operationalizing prompts effectively. Its main additional capabilities include prompt storage, lineage, metadata management (including evaluation scores), a centralized prompt template registry, and a prompt optimizer.

• RAG and Operations (RAGOps) is a subcategory of GenAIOps that centers on efficiently operationalizing RAG solutions. Its primary additional capabilities include the retrieval process through offline data preparation (encompassing cleaning, chunking, vectorization, similarity search, and re-ranking) and the generation process through prompt augmentation and grounding.
• Agent and Operations (AgentOps) is a subcategory of GenAIOps that focuses on the efficient operationalization of agents. Its main additional components include internal and external tool management, agent brain prompt (goal, profile, instructions) and orchestration, memory, and task decomposition.

Figure 2: Each of these "Ops" is about technology, processes, and people. 14

All of these "Ops" are, in essence, the harmonious blend of people, processes, and technologies working together to efficiently deploy machine learning solutions into a live production environment. It's crucial to recognize that Ops extends beyond mere technology; it's not simply about constructing and delivering an ML pipeline. Successful Ops implementations delve deeper, considering the customer's operational model, their existing business units, and their overall organizational structure. This holistic approach ensures that the technology is tailored to their specific needs, seamlessly integrating into the business and maximizing value.
The next section will cover agent evaluation in detail, which is a significant part of the story for AgentOps and for automating the capture of useful metrics. Before we go there, let's start with a thought experiment: imagine setting up an A/B experiment in production for your new agent. The treatment arm gets your new agent and the control arm does not. In that scenario, what metrics are you measuring to determine if the treatment arm is doing better? What metrics are you measuring to determine ROI for the project? Is it a goal being accomplished, or sales totals, or a set of critical steps in a user journey? Those metrics must be understood, instrumented, and easily analyzed in addition to more detailed agent evaluation metrics.

Agent Success Metrics

Metrics are critical to building, monitoring, and comparing revisions of agents. Business metrics, like revenue or user engagement, are probably outside the scope of the agent itself, but they should be the north-star metrics for your agents. Most agents are designed around accomplishing goals, so goal completion rate is a key metric to track. Similarly, a goal might be broken down into a few critical tasks or critical user interactions. Each of these critical tasks and interactions should be independently instrumented and measured.

So before we get into the details of the agent itself, we already have several metrics identified which you should be able to easily track on a dashboard. Each business metric, goal, or critical interaction will be aggregated in a familiar fashion: attempts, successes, rates, and so on. Additionally, the metrics you should be able to get from any application telemetry system, such as latency and errors, are very important to track for agents as well.
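To make this concrete, here is a minimal sketch of how goal-level KPI counters might be instrumented; the goal names and the in-memory store are illustrative assumptions, not a specific product API.

```python
# Minimal sketch: counting attempts and successes per goal so that a
# dashboard can display goal completion rates. Purely illustrative.
from collections import Counter

class AgentKPIs:
    def __init__(self) -> None:
        self.counts: Counter[str] = Counter()

    def record(self, goal: str, success: bool) -> None:
        self.counts[f"{goal}.attempts"] += 1
        if success:
            self.counts[f"{goal}.successes"] += 1

    def completion_rate(self, goal: str) -> float:
        attempts = self.counts[f"{goal}.attempts"]
        return self.counts[f"{goal}.successes"] / attempts if attempts else 0.0

kpis = AgentKPIs()
kpis.record("book_meeting", success=True)    # hypothetical goal name
kpis.record("book_meeting", success=False)
print(kpis.completion_rate("book_meeting"))  # 0.5
```

In production you would export these counters to your telemetry system rather than keep them in memory, but the aggregation (attempts, successes, rates) is the same.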
None of these metrics are specific to agents; you could track them for any software, but they are even more important for agent builders. Deterministic code does only what you tell it to do, whereas an agent can do a lot more, relying on LLMs which are trained on huge amounts of data. Instrumentation of these high-level metrics is an important part of observability. You can think of them as Key Performance Indicators (KPIs) for the agent, and they allow for observability in the aggregate, a higher-level perspective on your agents.

Human feedback is one of the more critical metrics to track as well. A simple 👍👎 or user feedback form, within the context of an agent or task, can go a long way toward understanding where your agent does well and where it needs improvement. This feedback can come from end users of a consumer system, but also from employees, QA testers, and process or domain experts reviewing the agent.

More detailed observability is also very important for agent building: being able to see and understand what the agent is doing and why it's doing it. An agent can be instrumented with "trace" to log all of the inner workings of the agent, not only the critically important tasks and user interactions. You could conceptually measure every internal step as a metric, but that is rarely done. Instead, these detailed traces are used to debug an agent: when metrics or manual testing show a problem, you can dig into the details and see what went wrong.
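One way to emit such traces is with OpenTelemetry spans; the sketch below is a minimal illustration, and the tool and model helpers are hypothetical stubs rather than a real agent framework.

```python
# Sketch: wrapping each agent step in an OpenTelemetry span so tool calls
# and LLM calls appear as child spans of the overall agent invocation.
# Requires the opentelemetry-api/sdk packages; exporter setup is omitted.
from opentelemetry import trace

tracer = trace.get_tracer("my-agent")  # hypothetical instrumentation name

def search_knowledge_base(query: str) -> list[str]:
    return [f"doc snippet about {query}"]          # stub tool

def call_llm(query: str, docs: list[str]) -> str:
    return f"answer based on {len(docs)} docs"     # stub model call

def run_agent(user_input: str) -> str:
    with tracer.start_as_current_span("agent.invoke") as span:
        span.set_attribute("agent.input", user_input)
        with tracer.start_as_current_span("tool.search_kb"):
            docs = search_knowledge_base(user_input)
        with tracer.start_as_current_span("llm.generate") as llm_span:
            answer = call_llm(user_input, docs)
            llm_span.set_attribute("llm.output_chars", len(answer))
        return answer

print(run_agent("reset my password"))
```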
Figure 3: An example of Cloud Observability showing traces for an agent with tools and LLM OpenTelemetry spans. 15

So far we've been talking about business metrics, goals, tasks, human feedback, and traces; those are all ways of understanding the actions and impact of your agents in production. While developing an agent, in addition to manual testing, automated testing will be much more efficient in the long run and provide greater insights into the behavior of agents.

Agent Evaluation

To bridge the gap between a proof-of-concept and a production-ready AI agent, a robust and automated evaluation framework is essential. Unlike evaluating generative models, where the focus is primarily on the final output, agent evaluation requires a deeper understanding of the decision-making process. Agent evaluation can be broken down into three components that we discuss in this chapter:

1. Assessing Agent Capabilities: Evaluating an agent's core abilities, such as its capacity to understand instructions and reason logically.
2. Evaluating Trajectory and Tool Use: Analyzing the steps an agent takes to reach a solution, including its choice of tools, strategies, and the efficiency of its approach.

3. Evaluating the Final Response: Assessing the quality, relevance, and correctness of the agent's final output.

Assessing Agent Capabilities

Before evaluating your specific agentic use cases, publicly available benchmarks and technical reports can provide insight into core capabilities and limitations to consider when building out your agentic use cases. Public benchmarks exist for most fundamental agentic capabilities, like model performance, hallucinations, tool calling, and planning. For example, tool calling, the ability to select and use appropriate tools, is demonstrated by benchmarks like the Berkeley Function-Calling Leaderboard (BFCL) 16 and τ-bench 17, which also outlines common mistakes. As another example, PlanBench 18 aims to assess planning and reasoning across several domains and specific capabilities.

But tool calling and planning are not the only capabilities you should consider. Agents inherit behaviors from their LLMs and each of their other components. Likewise, agent-user interactions have a history in traditional conversational design systems and workflow systems, and can therefore inherit the set of metrics and measurements that are used to determine the efficacy of those systems.
Figure 4: A high-level diagram of AgentBench. 19

Holistic agent benchmarks like AgentBench 19 try to capture end-to-end performance across several scenarios. This is more realistic when the scenario reflects your agent use case, but not if it's testing capabilities your agent doesn't implement. It is difficult to simulate the environment, tools, instructions, and use-case requirements in ways that are both specific and general at the same time. Companies and organizations are setting up public benchmarks for specialized use cases, like Adyen's Data Analyst leaderboard DBAStep 20, which may give you a more targeted evaluation, provided you understand both the evaluation approach and the agents that are on the leaderboard.

Public benchmarks are a valuable starting point to get a feel for what is possible and to identify pitfalls to look out for. Most benchmark assessments include discussions of common failure modes that can guide you in setting up your own use-case-specific evaluation framework.
Beyond public evaluations, you will want to evaluate the behavior of your agent across a variety of scenarios to ensure it does what you want it to do. You are simulating interactions with your agent and evaluating how it responds. This includes evaluating the final response and also the set of steps it takes along the way (the trajectory). Those are the two most common and practical approaches we recommend you start with. There are many other evaluation techniques you can use beyond these, either for finer detail on sub-components or for broader approaches.

Software engineers will compare this to automated testing of code. Investing in automated tests saves you time and gives you confidence in the software you are building. With agents, this automation pays off faster, in both time and confidence. Curating the evaluation data set will be extremely important for accurately representing the use cases your agent will encounter, even more so than in software testing.

Evaluating Trajectory and Tool Use

An agent usually performs several actions before it responds back to the user. It might compare the user input with session history to disambiguate a term, look up a policy document, search a knowledge base, or invoke an API to save a ticket. Each of those actions is a step on a path, also called a "trajectory" of actions. Every time your agent does something, there's a trajectory of actions under the hood. Comparing the trajectory that you expect the agent to take with the trajectory it actually took is particularly useful for developers who want to debug their application, identify errors or inefficiencies, and ultimately improve performance.
Figure 5: LangSmith diagram of final evaluation and trajectory. 21

The following six ground-truth-based automated trajectory evaluations provide different lenses to assess the performance of an agent (a minimal implementation sketch follows the list):

1. Exact match: Requires the AI agent to produce a sequence of actions (a "trajectory") that perfectly mirrors the ideal solution. This is the most rigid metric, allowing no deviation from the expected path.

2. In-order match: This metric assesses an agent's ability to complete the expected trajectory while accommodating extra, unpenalized actions. Success is defined by completing the core steps in order, with flexibility for additional actions.

3. Any-order match: Compared to in-order match, this metric disregards the order. It asks whether the agent included all necessary actions, but does not look at the order of the actions taken, and also allows for extra steps.
4. Precision: How many of the tool calls in the predicted trajectory are actually relevant or correct according to the reference trajectory?

5. Recall: How many of the essential tool calls from the reference trajectory are actually captured in the predicted trajectory?

6. Single-tool use: Checks whether a specific action is within the agent's trajectory. This metric is useful for understanding whether the agent has learned to utilize a particular tool yet.

Figure 6: A radar chart plotting a single trajectory evaluation with a few metrics. 24
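As a rough illustration, the six metrics can be computed over lists of tool-call names as below; this is a minimal sketch, and real evaluations typically match on tool arguments as well as names.

```python
# Sketch of the six ground-truth-based trajectory metrics, computed over
# a predicted vs. a reference sequence of tool-call names. Illustrative
# only; duplicate calls and tool arguments are not handled.

def exact_match(pred: list[str], ref: list[str]) -> bool:
    return pred == ref                      # identical sequence, no deviation

def in_order_match(pred: list[str], ref: list[str]) -> bool:
    it = iter(pred)                         # reference steps must appear in order;
    return all(step in it for step in ref)  # extra steps in between are allowed

def any_order_match(pred: list[str], ref: list[str]) -> bool:
    return set(ref).issubset(pred)          # all steps present, order ignored

def precision(pred: list[str], ref: list[str]) -> float:
    return sum(p in ref for p in pred) / len(pred) if pred else 0.0

def recall(pred: list[str], ref: list[str]) -> float:
    return sum(r in pred for r in ref) / len(ref) if ref else 0.0

def single_tool_use(pred: list[str], tool: str) -> bool:
    return tool in pred                     # has the agent learned this tool?

ref = ["lookup_policy", "search_kb", "create_ticket"]
pred = ["lookup_policy", "search_kb", "search_kb", "create_ticket"]
print(exact_match(pred, ref), in_order_match(pred, ref))   # False True
print(round(precision(pred, ref), 2), recall(pred, ref))   # 1.0 1.0
```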
Consider these metrics as different lenses for analyzing and debugging your agent's trajectory. Each metric offers a unique perspective, but not all will be relevant to every situation. For instance, some use cases demand strict adherence to the ideal trajectory, while others allow for more creative deviations. A clear limitation of this evaluation approach is that you need a reference trajectory in place for it to work. While the ground-truth-based automated trajectory evaluations discussed here are prevalent in popular libraries, research is advancing the use of agent autoraters for more efficient evaluation, for example Agent as a Judge, 2024. 22

Evaluating the Final Response

The final response evaluation boils down to a simple question: does your agent achieve its goals? You can define custom success criteria, tailored to your specific needs, to measure this. For example, you could assess whether a retail chatbot accurately answers product questions, or whether a research agent effectively summarizes findings with the appropriate tone and style.

To automate this process, you can use an autorater. An autorater is an LLM that acts as a judge. Given the input prompts and the generated response, it mirrors human evaluation by assessing the response against a set of user-provided criteria. For this evaluation to work, it is crucial to recognize that, in the absence of ground truth, you need to be very precise in defining your evaluation criteria, as these are the core of what your evaluation is looking at. You will find a number of predefined criteria in various libraries; treat them as a starting point and tweak them to encode your definition of good.
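A minimal sketch of an autorater is below; the rubric, criteria, and `call_llm` stub are illustrative assumptions, and in practice you would call your model provider's client instead of the stub.

```python
# Sketch of an autorater: an LLM judges a final response against
# user-defined criteria and returns structured scores. Illustrative only.
import json

RUBRIC = """You are an impartial evaluator. Given a user request and an
agent's final response, score the response 1-5 on each criterion and
return JSON: {"accuracy": n, "tone": n, "justification": "..."}.
Criteria: factual accuracy of product claims; helpful, on-brand tone."""

def call_llm(prompt: str) -> str:
    # Stub standing in for a real model client.
    return '{"accuracy": 5, "tone": 4, "justification": "grounded, terse"}'

def autorate(request: str, response: str) -> dict:
    prompt = f"{RUBRIC}\n\nRequest: {request}\n\nResponse: {response}"
    return json.loads(call_llm(prompt))

print(autorate("Does the X100 support USB-C?", "Yes, it has two USB-C ports."))
```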
Human-in-the-Loop Evaluation

The fields of agent development and agent evaluation are rapidly evolving. Evaluating AI agents presents significant challenges, including defining clear objectives, designing realistic environments, managing stochastic behavior, and ensuring fairness and bias mitigation, particularly in socially impactful applications. Therefore, it's crucial to incorporate a human-in-the-loop approach alongside the automated evaluations discussed previously (which involve predefined metrics and autoraters). Human-in-the-loop evaluation is valuable for tasks requiring subjective judgment or creative problem-solving; it can also serve to calibrate and double-check whether your automated evaluation approaches actually work and align with your preferences. Key benefits include:

• Subjectivity: Humans can evaluate qualities that are difficult to quantify, such as creativity, common sense, and nuance.

• Contextual Understanding: Human evaluators can consider the broader context of the agent's actions and their implications.

• Iterative Improvement: Human feedback provides valuable insights for refining the agent's behavior and learning process.

• Evaluating the evaluator: Human feedback can provide a signal to calibrate and refine your autoraters.

To implement human-in-the-loop evaluation, consider these methods:

• Direct Assessment: Human experts directly rate or score the agent's performance on specific tasks.

• Comparative Evaluation: Experts compare the agent's performance to that of other agents or your previous iterations.
• User Studies: Participants interact with the agent and provide feedback on its behavior, usability, and overall effectiveness.

More about Agent Evaluation

In this section we have covered agent evaluation from a practical perspective, but this is just the tip of the iceberg. Agent evaluation presents many challenges. Evaluation data for your agents may be difficult to find. While synthetic data or LLMs as judges can be used, evaluations may still be incomplete. Also, LLM-as-a-Judge metrics, for example, may prioritize final outcomes over the agent's reasoning and intermediate actions, potentially missing key insights. Additionally, as evaluations for agent systems have a history in conversational and workflow systems, there is much to explore in how to inherit methods and metrics for evaluating an agent's capabilities, such as the ability to improve task performance over multiple interactions. Evaluations for multimodal generations pose additional complexities; image, audio, and video evaluations require their own evaluation methods and metrics. Finally, real-world environments pose further challenges, as they are dynamic and unpredictable, making it difficult to evaluate agents in controlled settings.

Looking ahead, to solve these open challenges, the field of agent evaluation is evolving rapidly. Key trends include a shift towards process-based evaluation, prioritizing the understanding of agent reasoning; an increase in AI-assisted evaluation methods for improved scalability; and a stronger focus on real-world application contexts. The development of new standardized benchmarks is also gaining traction, facilitating objective comparisons between agents, while increased emphasis on explainability and interpretability aims to provide deeper insights into agent behavior.
| Evaluation Method | Strengths | Weaknesses |
| Human Evaluation | Captures nuanced behavior, considers human factors | Subjective, time-consuming, expensive, difficult to scale |
| LLM-as-a-Judge | Scalable, efficient, consistent | May overlook intermediate steps, limited by LLM capabilities |
| Automated Metrics | Objective, scalable, efficient | May not capture full capabilities, susceptible to gaming |

Table 1: A table comparing strengths and weaknesses of evaluation approaches for agents.

At this point it should be clear that only by continually refining evaluation methods will we ensure that AI agents are developed and deployed responsibly, effectively, and ethically.

Multiple Agents & Their Evaluation

Agent evaluation, which assesses the effectiveness, reliability, and adaptability of autonomous AI agents as discussed in the previous section, has emerged as a key focus area. We have also seen a significant evolution in AI systems: a transition towards multi-agent architectures, where multiple specialized agents collaborate to achieve complex objectives.

A multi-agent system is like a team of experts, each specializing in a particular area, working together to solve a complex problem. Each agent is an independent entity, potentially using a different LLM, with its own unique role and context. Agents communicate and collaborate to achieve a common goal. This approach differs from traditional single-agent systems, where one LLM handles all aspects of a task.
Multi-agent systems offer several advantages over single-agent systems:

• Enhanced Accuracy: Agents can cross-check each other's work, leading to more accurate results.

• Improved Efficiency: Agents can work in parallel, speeding up task completion.

• Better Handling of Complex Tasks: Large tasks can be broken down into smaller, more manageable subtasks, with each agent focusing on a specific aspect.

• Increased Scalability: The system can be easily scaled by adding more agents with specialized capabilities.

• Improved Fault Tolerance: If one agent fails, others can take over its responsibilities.

• Reduced Hallucinations and Bias: By combining the perspectives of multiple agents, the system can reduce the effects of hallucinations and bias, leading to more reliable and trustworthy outputs.

Understanding Multi-Agent Architectures

Unlike traditional monolithic AI systems, multi-agent architectures break down a problem into distinct tasks handled by specialized agents. Each agent operates with defined roles, interacting dynamically with others to optimize decision-making, knowledge retrieval, and execution. These architectures enable more structured reasoning, decentralized problem-solving, and scalable task automation, offering a paradigm shift from single-agent workflows.

At their core, multi-agent systems leverage principles of modularity, collaboration, and hierarchy to create a robust AI ecosystem. Agents within these systems can be categorized based on their function, for example:
• Planner Agents: Responsible for breaking down high-level objectives into structured sub-tasks.

• Retriever Agents: Optimize knowledge acquisition by dynamically fetching relevant data from external sources.

• Execution Agents: Perform computations, generate responses, or interact with APIs.

• Evaluator Agents: Monitor and validate responses, ensuring coherence and alignment with objectives.

Through these components, multi-agent architectures extend beyond simple prompt-based interactions, enabling AI-driven workflows that are adaptive, explainable, and efficient.

Multi-Agent Design Patterns and Their Business Impact

To design effective multi-agent architectures, specific design patterns have emerged. These patterns define interaction protocols, delegation mechanisms, and role distributions, allowing businesses to implement AI-driven automation in structured ways. Some common design patterns include:
| Type of Multi-Agent System | Description | Example |
| Sequential | Agents work in a sequential manner, with each agent completing its task before passing the output to the next agent. | An assembly line, where each worker performs a specific operation before passing the product to the next worker. |
| Hierarchical | Agents are organized in a hierarchical structure, with a "manager" agent coordinating the workflow and delegating tasks to "worker" agents. | A system with a leader agent responsible for making strategic decisions and follower agents executing tasks based on the leader's instructions. |
| Collaborative | Agents work together collaboratively, sharing information and resources to achieve a common goal. | A team of researchers working on a project, where each member contributes their expertise and insights. |
| Competitive | Agents may compete with each other to achieve the best outcome. | LLMs act as individual players in a game like Overcooked-AI, where they must coordinate their actions to achieve a shared goal while competing for resources and optimizing individual performance. |

Table 2: A table comparing types of multi-agent systems.
Figure 7: An image depicting different multi-agent topologies, from LangGraph documentation. 23

The choice of design pattern depends on the specific application and the desired level of interaction between agents. Businesses benefit from these patterns by reducing operational bottlenecks, improving knowledge retrieval, and enhancing automation reliability. Multi-agent systems enable companies to scale AI deployments while ensuring agility in decision-making and workflow execution.
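To make the simplest of these patterns concrete, here is a minimal sketch of a sequential pipeline in which each agent is a callable that transforms shared state; the agent roles and state keys are hypothetical.

```python
# Sketch of the sequential pattern: each "agent" completes its task and
# passes the shared state to the next, like an assembly line. Illustrative.
from typing import Callable

Agent = Callable[[dict], dict]

def planner(state: dict) -> dict:
    state["plan"] = f"steps for: {state['task']}"       # stub planning
    return state

def executor(state: dict) -> dict:
    state["result"] = f"executed {state['plan']}"       # stub execution
    return state

def reviewer(state: dict) -> dict:
    state["approved"] = "executed" in state["result"]   # stub validation
    return state

def run_pipeline(task: str, agents: list[Agent]) -> dict:
    state = {"task": task}
    for agent in agents:        # the strict hand-off order defines the pattern
        state = agent(state)
    return state

print(run_pipeline("draft a product brief", [planner, executor, reviewer]))
```

A hierarchical variant would replace the fixed list with a manager agent that chooses which worker to invoke next.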
Important components of Agents

The architecture of LLM-based AI agents consists of several interrelated components essential for autonomous operation and intelligent interaction:

• Interaction Wrapper: This component serves as the interface between the agent and its environment, managing communication and adapting to various input and output modalities.

• Memory Management: This includes short-term working memory for immediate context, cache, and sessions. It can also include long-term storage for learned patterns and experiences, such as episodes, examples, skills, or reference data. It also includes "reflection" to decide which short-term items (e.g., a user preference) should be copied into long-term memory (e.g., a user profile), and whether that can be shared across agents, tasks, or sessions.

• Cognitive Functionality: This is often underpinned by Chain-of-Thought (CoT), ReAct, reasoning, thinking, or a planner subsystem; it allows agents to decompose complex tasks into logical steps and engage in self-correction. In some cases this also includes user intent refinement, asking a clarifying question if uncertain.

• Tool Integration: This subsystem enables agents to utilize external tools, expanding their capabilities beyond natural language processing. Dynamic tool registries allow discovery, registration, and "Tool RAG".

• Flow / Routing: This governs connections with other agents, facilitating dynamic neighbor discovery and efficient communication within the multi-agent system. This might be implemented as delegation of a task to a background agent, a handoff of the user interaction to another agent, or the use of an agent as a tool.
• Feedback Loops / Reinforcement Learning: These enable continuous learning and adaptation by processing interaction outcomes and refining decision-making strategies. For gen AI agents this rarely takes the form of traditional RL training, but past performance metrics can be incorporated into future decision-making.

• Agent Communication: Effective communication between agents is crucial for the success of multi-agent systems. An agent-to-agent communication protocol facilitates structured and efficient communication among agents, enabling them to achieve consensus and address complex problems collaboratively.

• Remote Agent Communication: Agent-to-agent communication within an organization is critical because it allows agents to share messages, tasks, and knowledge. Once your multi-agent system includes a remote agent, the communication protocol becomes even more important. Asynchronous tasks and sessions need to be durable and updated with notifications while end users are not in session. Negotiations between agents must allow for bringing a user into session and for supported UX capabilities.

• Agent & Tool Registry (mesh): As you go beyond a handful of tools or a handful of agents, you need a robust system to discover, register, administer, select, and utilize from a "mesh" of tools or agents. Critically important are the ontology and description of the tools and agents, their capabilities and requirements, and their performance metrics. Your agents will make a plan and choose which tool or agent to use from such a system, and those choices are informed by the data in the system (a minimal registry sketch follows).
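Below is a minimal sketch of such a registry; the entry fields and capability tags are illustrative assumptions, and a production mesh would add ontologies, access control, and performance metrics.

```python
# Sketch of a minimal agent/tool registry ("mesh"): each entry carries a
# description and capability tags so a planner can select by capability.
from dataclasses import dataclass, field

@dataclass
class RegistryEntry:
    name: str
    description: str
    capabilities: set[str] = field(default_factory=set)

class Registry:
    def __init__(self) -> None:
        self._entries: dict[str, RegistryEntry] = {}

    def register(self, entry: RegistryEntry) -> None:
        self._entries[entry.name] = entry

    def find(self, capability: str) -> list[RegistryEntry]:
        return [e for e in self._entries.values() if capability in e.capabilities]

mesh = Registry()
mesh.register(RegistryEntry("crm_lookup", "Fetch customer records", {"retrieval", "crm"}))
mesh.register(RegistryEntry("ticket_agent", "Files support tickets", {"action", "ticketing"}))
print([e.name for e in mesh.find("retrieval")])  # ['crm_lookup']
```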
Figure 8: An image demonstrating the process of a user interacting with multiple, self-coordinating agents. 24

These architectural elements provide a robust foundation for the autonomous operation and collaborative capabilities of LLM agents within multi-agent systems.
Challenges in Multi-Agent systems

While multi-agent systems offer numerous advantages, they also face several challenges:

• Task Communication: Today, most agent frameworks communicate in messages, not structured asynchronous tasks.

• Task Allocation: Efficiently dividing complex tasks among different agents can be challenging, and feedback loops are often left up to the developer to implement.

• Coordinating Reasoning: Getting agents to debate and reason together effectively requires sophisticated coordination mechanisms.

• Managing Context: Keeping track of all the information, tasks, and conversations between agents can be overwhelming.

• Time and Cost: Multi-agent interactions can be computationally expensive and time-consuming, resulting in higher runtime cost and more user latency.

• Complexity: In the same way that a microservice architecture gives each individual microservice more flexibility and simplicity, the system as a whole usually becomes more complex.

Addressing these challenges is crucial for developing robust and efficient multi-agent systems.
Multi-Agent Evaluation

Luckily, the evaluation of multi-agent systems is a natural progression from evaluating single-agent systems. Agent success metrics are unchanged: business metrics as your north star, goal and critical-task success metrics, and application telemetry metrics like latency and errors. Instrumenting the multi-agent system with trace will help you debug and understand what is happening during complex interactions.

In the Agent Evaluation section we discussed evaluating trajectories and evaluating the final response as the two best approaches to automated evaluation of an agent, and this remains the case for multi-agent systems. For a multi-agent system, a trajectory of actions might include several or even all of your agents. Even though several agents may collaborate on a task, a single final answer is returned to the user at the end and can be evaluated in isolation. Because a multi-agent system probably has more steps, you can drill down and evaluate at every step. You can evaluate each of your agents in isolation and the system as a whole. Trajectory evaluations are a scalable approach to do exactly this.

There are some questions you need to ask which are unique to multi-agent systems, including:

• Cooperation and Coordination: How well do agents work together and coordinate their actions to achieve common goals?

• Planning and Task Assignment: Did we come up with the right plan, and did we stick to it? Did child agents deviate from the main plan or get lost in a cul-de-sac?

• Agent Utilization: How effectively does the system select the right agent, and choose whether to use the agent as a tool, delegate a background task, or transfer the user?
• Scalability: Does the system's quality improve as more agents are added? Does the latency go down? Are we becoming more efficient or less?

These types of questions can guide developers to identify areas for improvement in the multi-agent system. You will answer these questions using the same tools you use for single-agent systems, but the complexity of the analysis goes up.

Agentic RAG: A Critical Evolution in Retrieval-Augmented Generation

A significant advancement in multi-agent architectures is Agentic Retrieval-Augmented Generation (Agentic RAG). Traditional RAG pipelines rely on a static approach: retrieving knowledge from vector databases and feeding it into an LLM for synthesis. However, this approach often fails when dealing with ambiguous, multi-step, or multi-perspective queries. Agentic RAG introduces autonomous retrieval agents that actively refine their search based on iterative reasoning. These agents enhance retrieval in the following ways (a sketch of the retrieval loop follows the list):

• Context-Aware Query Expansion: Instead of relying on a single search pass, agents generate multiple query refinements to retrieve more relevant and comprehensive results.

• Multi-Step Reasoning: Agents decompose complex queries into smaller logical steps, retrieving information sequentially to build structured responses.

• Adaptive Source Selection: Instead of fetching data from a single vector database, retrieval agents dynamically select the best knowledge sources based on context.

• Validation and Correction: Evaluator agents cross-check retrieved knowledge for hallucinations and contradictions before integrating it into the final response.
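A minimal sketch of such an iterative retrieval loop is below; `search` and `llm` are stubs standing in for a real retriever and model, and the verdict protocol is a hypothetical convention.

```python
# Sketch of an agentic retrieval loop: the agent critiques its own
# evidence and refines the query until it looks sufficient. Illustrative.

def search(query: str) -> list[str]:
    return [f"chunk about {query}"]        # stub retriever

def llm(prompt: str) -> str:
    # Stub judge: a real model would assess the gathered evidence.
    return "SUFFICIENT" if "chunk" in prompt else "REFINE: add product name"

def agentic_retrieve(question: str, max_rounds: int = 3) -> list[str]:
    query, evidence = question, []
    for _ in range(max_rounds):
        evidence += search(query)
        verdict = llm(f"Question: {question}\nEvidence: {evidence}\n"
                      "Reply SUFFICIENT or REFINE: <better query>")
        if verdict.startswith("SUFFICIENT"):
            break                          # validation passed
        query = verdict.removeprefix("REFINE:").strip()  # query expansion
    return evidence

print(agentic_retrieve("What is the battery life of the X100?"))
```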
This approach significantly improves response accuracy, explainability, and adaptability, making it a crucial innovation for enterprises dealing with complex knowledge retrieval tasks in areas like legal research, scientific discovery, and business intelligence.

Agentic RAG and its Importance

Agentic RAG (Retrieval-Augmented Generation) is an advanced approach that combines the strengths of RAG with the autonomy of AI agents. Traditional RAG systems retrieve relevant information from external knowledge sources to enhance LLM responses. Agentic RAG takes this a step further by employing intelligent agents to orchestrate the retrieval process, evaluate the retrieved information, and make decisions about how best to utilize it. Agentic RAG offers several advantages over traditional RAG:

• Improved Accuracy: Agents can evaluate the quality of retrieved information and make decisions about which sources to trust, leading to more accurate and reliable responses.

• Enhanced Contextual Understanding: Agents can consider the context of the user's query and the retrieved information to generate more relevant and meaningful responses.

• Increased Adaptability: Agents can adapt to changing information needs and dynamically adjust their retrieval strategies to provide the most up-to-date and relevant information.
Figure 9: A diagram of Agentic RAG from Vectorize.io. 25

Agentic RAG is particularly valuable in complex domains where information is constantly evolving, such as healthcare, finance, and legal research. For example, in healthcare, agentic RAG can help navigate complicated medical databases, research papers, and patient records, providing doctors with comprehensive and accurate information.
Better Search, Better RAG

Almost all RAG approaches require a search engine to index and retrieve relevant data. The introduction of agents allows for refinement of the query, the filtering, the ranking, and the final answer, and Agentic RAG agents execute several searches to retrieve information. For developers who are trying to optimize existing RAG implementations, it is usually most valuable to improve search results (measured by recall) before introducing agents. Some of the main techniques to improve search performance are (a retrieve-then-rerank sketch follows the list):

• Parse source documents and chunk them: Vertex AI Layout Parser can handle complex document layouts, embedded tables, and embedded images like charts, and uses a semantic chunker to keep chunks on topic with a hierarchy of headings.

• Add metadata to your chunks: synonyms, keywords, authors, dates, tags, and categories allow your searches to boost, bury, and filter; these give your users or your agents more control over search results.

• Fine-tune the embedding model or add a search adaptor that changes the embedding space: these allow the searchable index of vectors to represent your domain better than a general-purpose embedding model.

• Use a faster vector database: to search embeddings, you must trade off speed against accuracy; upgrading to an ultra-fast system like Vertex AI Vector Search can improve both latency and quality.

• Use a ranker: vector searches are fast but approximate; they should return dozens or hundreds of results, which then need to be re-ranked by a more sophisticated system to ensure the top few results are the most relevant.

• Implement check grounding: as a safeguard on grounded generation, you can ensure each phrase is actually citable by retrieved chunks.
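The ranker bullet is worth making concrete: a fast, approximate vector search over-fetches candidates, and a slower reranker orders the top few. The sketch below is illustrative, with `embed`, `vector_search`, and `rerank_score` as stand-in stubs for real components.

```python
# Sketch of retrieve-then-rerank: fast approximate vector search returns
# many candidates; a more expensive scorer reorders the top few.

def embed(text: str) -> list[float]:
    return [float(len(text))]                          # stub embedding

def vector_search(query_vec: list[float], k: int = 100) -> list[str]:
    return [f"candidate chunk {i}" for i in range(k)]  # stub ANN search

def rerank_score(query: str, chunk: str) -> float:
    return -abs(len(chunk) - len(query))               # stub cross-encoder

def search_with_rerank(query: str, top_n: int = 5) -> list[str]:
    candidates = vector_search(embed(query), k=100)    # fast but approximate
    ranked = sorted(candidates,
                    key=lambda c: rerank_score(query, c), reverse=True)
    return ranked[:top_n]                              # precise top few

print(search_with_rerank("return policy for defective items"))
```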
Figure 10: A diagram of common RAG and search components, showing Vertex AI Search 26, search builder APIs 27, and RAG Engine. 28

Vertex AI Search 26 is a powerful search engine providing Google-quality search over your data, and it can be used with any RAG or Agentic RAG implementation. Each of the above components is automatically available within Vertex AI Search, without any development time at all. For developers who want to build their own search engine, each of the above components is exposed as a standalone API 27, and RAG Engine 28 can orchestrate the whole pipeline easily, with a LlamaIndex-like Python-based interface.
Agents in the enterprise

Manager of agents

2025 is the year of agents, and this will be particularly true for enterprises, which will develop and use agents that assist employees with specific tasks or run autonomously in the background performing automation. Business analysts can effortlessly uncover industry trends and create compelling, data-driven presentations fueled by AI-generated insights. HR teams can revolutionize the employee experience with streamlined onboarding, even for complex tasks like 401k selection. Software engineers can proactively identify and resolve bugs, enabling them to build and iterate with greater efficiency and accelerate deployment cycles. Marketers can unlock deeper performance analysis, optimize content recommendations, and fine-tune campaigns effortlessly to achieve better results.

We see two types of agents emerging:

1. "Assistants": Agents that interact with the user, take a task, execute it, and come back to the user. Conversational agents popularized by frameworks like Gems or GPTs usually belong to this category. Assistants can be general, able to help with a variety of tasks, or specialized to a particular domain or set of tasks. Examples include agents that help schedule meetings, analyze data, write code, write marketing press releases, help sellers with sales opportunities, or even perform deep research on a particular topic as requested by the user. These agents can be synchronous, returning the requested information or performing the requested task quickly, or they can run for a longer period of time before returning (like deep research agents).
2. "Automation agents": Agents that run in the background, listen to events, monitor changes in systems or data, and then make smart decisions and act. Actions might include operating on backend systems, performing tests to validate an observation, fixing problems, notifying the right employees, and so on. This can be seen as the backbone of the automation of the future. While in the past we had to write special code to specify the logic of automations, we can now start relying on the smart, general decision-making abilities of AI agents.

Rather than simply invoking agents to perform a task and waiting for the output, knowledge workers will increasingly become managers of agents. They will assign tasks to multiple agents and manage them, check whether any of the agents need help or require approval to proceed, use the output of specific agents to start new tasks, and monitor the execution of long-running tasks to review and steer them in the right direction. Novel user interfaces will emerge to allow this type of virtual team management: orchestrating, monitoring, and managing a multi-agent system with agents performing tasks and calling (or even creating) other agents. Google Agentspace aims to provide this type of experience, allowing users to:

• Create new agents using a no/low-code interface or a full-code framework to implement both specialized assistants and automation agents

• Configure and manage agent access, for company administrators

• Invoke the right agents when appropriate

• Monitor, manage, and orchestrate multiple agents in a UI suited for team management
Google Agentspace

Google Agentspace 29 offers a suite of AI-driven tools designed to elevate enterprise productivity by facilitating access to pertinent information and automating intricate, agentic workflows. It harnesses the advanced reasoning capabilities of Gemini, the power of Google's search infrastructure, and secure access to enterprise data, irrespective of its physical location. Agentspace directly addresses the limitations inherent in traditional knowledge management systems, which frequently fall short in areas such as personalization, automated answer generation, contextual comprehension, and comprehensive information retrieval. It empowers employees with efficient information access by consolidating disparate content sources, generating grounded and personalized responses, and streamlining operational workflows. Key functionalities include the capacity to ingest a wide variety of data formats, synchronize data from Software-as-a-Service (SaaS) platforms, deliver access-controlled search results and AI-generated answers, and integrate AI assistance and intelligent agents into cohesive workflows.

The architecture of Agentspace Enterprise is founded upon several core principles. Paramount among these is built-in trust, which prioritizes security, explainability, and governance through features such as single sign-on (SSO) authentication, an integrated permissions model, and user-level access controls. Google's advanced intelligence is leveraged to discern user behavior and content patterns through machine learning, resulting in the delivery of highly relevant results via semantic understanding, knowledge graphs, and LLMs. Universal connectivity is achieved by connecting to a diverse array of enterprise systems with on-demand and automated data refreshes, thereby eliminating information silos. Enterprise-level customization and user-level personalization are facilitated through granular controls for search functionality, recommendations, LLMs, and knowledge graphs, providing tailored experiences based on individual user roles, permissions, and search history. Real-time feedback and adaptation mechanisms enable the continuous refinement of results through machine learning and user input. Blended Retrieval-Augmented Generation (RAG) allows for customizable data blending, powering generative AI applications grounded in enterprise data. Finally, scalability is a critical design consideration, with the system engineered to accommodate growth across geographical regions, languages, and peak usage demands.

Security is a foundational principle of Agentspace. Built on Google Cloud's secure-by-design infrastructure, it provides a robust environment for the deployment of AI agents. Granular IT controls, encompassing role-based access control (RBAC), Virtual Private Cloud (VPC) Service Controls, and Identity and Access Management (IAM) integration, ensure data protection and regulatory compliance. These measures safeguard sensitive information and give users the peace of mind to confidently deploy AI agents across their organization.

NotebookLM Enterprise

NotebookLM 30 is a research and learning tool designed to streamline the process of understanding and synthesizing complex information. It empowers users to upload various source materials, including documents, notes, and other relevant files, and leverages the power of artificial intelligence to facilitate deeper comprehension. Imagine researching a multifaceted topic; NotebookLM allows you to consolidate all your scattered resources into
a single, organized workspace. In essence, NotebookLM functions as a dedicated research assistant, accelerating the research process and enabling users to move beyond mere information collection to genuine understanding.

Figure 11: NotebookLM Enterprise 30 configured with a few mock earnings reports.

NotebookLM Plus builds upon this foundation, offering a premium tier with enhanced features and capabilities. While the core functionality of uploading sources, asking questions, and generating summaries remains, NotebookLM Plus typically adds features like increased storage for source materials, enabling work with larger and more complex projects. It
may also include more sophisticated AI-powered analysis tools, such as more nuanced summarization options, enhanced question-answering capabilities, or the ability to identify connections and relationships between different sources more effectively. Building upon the foundation of NotebookLM Plus, NotebookLM Enterprise 30 brings these powerful capabilities to the enterprise environment, streamlining how employees interact with and derive insights from their data. This enterprise-grade version enables users not only to upload and synthesize information, but also to uncover hidden patterns and engage with data in innovative ways. A prime example is the AI-generated audio summary feature, which enhances comprehension and facilitates efficient knowledge absorption by allowing users to "listen" to their research.

Technically, NotebookLM, in both its consumer and enterprise forms, employs LLMs to process uploaded documents, extract key concepts, and generate summaries. The audio summary feature uses text-to-speech (TTS) technology incorporating advanced prosody control to ensure clarity and naturalness. Critically, NotebookLM Enterprise incorporates enterprise-grade security and privacy features, ensuring that sensitive company data is handled with the utmost care and protected in accordance with organizational policies.

Google AgentSpace Enterprise

Google Agentspace furnishes employees with a unified, company-branded, multimodal search agent that serves as the definitive source of enterprise information. Drawing upon Google's extensive search capabilities, Agentspace offers unparalleled conversational assistance. Employees get answers to complex queries, proactive recommendations, and unified access to information from any source. This functionality extends to both unstructured data, such as documents and emails, and structured data, like tables. Integrated translation capabilities ensure comprehensive understanding, regardless of
the original language of the information. Pre-built connectors for widely used third-party applications, including Confluence, Google Drive, Jira, Microsoft SharePoint, ServiceNow, and others, empower employees to seamlessly access and query relevant data sources, facilitating more informed decision-making.

Figure 12: Agentspace 29 customized for Cymbal bank, connected to several data stores.

Much more than just an information retrieval tool, agents can take actions in the real world, manage asynchronous tasks and workflows, and help employees get work done. A gallery of agents can be configured based on an organization's unique needs and will help with deep research, idea generation and merit-based refinement, creative asset generation, data analytics, and more.
Agentspace Enterprise Plus facilitates the creation of custom AI agents tailored to specific business functions. This platform enables the development and deployment of contextually aware generative AI agents, empowering employees across departments (marketing, finance, legal, engineering, and more) to conduct more effective research, rapidly generate content, and automate repetitive tasks, including multi-step workflows. A centralized interface streamlines the discovery and access of these specialized agents, promoting scalable AI adoption. Custom agents will connect with internal and external systems and data, be tailored to company domain and policies, and potentially even utilize machine learning models trained on proprietary business data. The platform provides builders with tools for agent development, deployment, and lifecycle management.
From agents to contractors

The common interface used to define AI agents across various tools and platforms today is very simple, usually including the goal, some textual instructions, the tools that the agent can use, and a set of examples. While this might be sufficient for prototyping demos, it leads to potentially underspecified definitions, and might be one of the leading reasons that AI agents can struggle to get from prototype to production. We propose to evolve the agent interface into "contract-adhering agents," aimed at solving complex tasks with AI agents, specifically in contexts where the stakes are high.

Contracts

The key idea of contractors is to specify and standardize the contracts between the requester and the agents, making it possible to:

1. Define the outcomes as precisely as possible, similarly to what we would do in a real-world scenario when agreeing on a service with a company we are contracting to do something for us. This allows the agent (contractor) to validate against the desired outcomes and iterate until the desired objective is achieved.

2. Negotiate the task, as well as clarify and refine it, in order to avoid any ambiguity in the definition and fill any gaps in a common understanding of the goals.

3. Define the rules by which contractors generate the new subcontracts needed to solve a bigger contract in a standard fashion (cf. the section below on subcontracts).
Contract, initial definition

• Task/Project description (required): A detailed description of what we expect the contractor to achieve. It should be as specific and unambiguous as possible.

• Deliverables & Specifications (required): A precise description of the expected outcomes and deliverables of the contractor's task, including a list of specifications clarifying what makes a deliverable acceptable as an outcome, and details on how to verify that the deliverable fulfills the expectation.

• Scope (optional): Clarifies the scope of the tasks that the contractor is responsible for completing, going into separate detail on every aspect of the task. Also used to clarify what is out of scope.

• Expected Cost (required): Sets the expectation for the cost of completing the task. This is usually a function of the complexity of the task combined with the tools that will be used.

• Expected Duration (optional): Sets the expectation for how long completing the task should take.

• Input Sources (required): Specifies which input sources can be used and are considered useful for completing the task.

• Reporting and Feedback (required): Specifies what the feedback loop should look like: how often we expect updates on progress, and what mechanism/surface is used to provide feedback (emails, APIs, etc.).

Table 3: An example data model for the contracts.
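As a concrete reading of Table 3, here is one possible encoding of the contract data model; field names and types are hypothetical, chosen only for illustration:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Contract:
    """One possible encoding of the contract data model in Table 3."""
    task_description: str                    # required: what the contractor must achieve
    deliverables: list[str]                  # required: outcomes, acceptance specs, verification
    expected_cost: str                       # required: cost expectation (complexity x tools)
    input_sources: list[str]                 # required: inputs the contractor may use
    reporting: str                           # required: update cadence and feedback channel
    scope: Optional[str] = None              # optional: what is in and out of scope
    expected_duration: Optional[str] = None  # optional: duration expectation
```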
Contract Iteration: Feedback & Negotiation

• Underspecification (optional): Highlights aspects that are underspecified or need clarification from the task initiator.

• Cost negotiation (optional): Signals that the cost is considered too high to complete the task.

• Risk (optional): Highlights potential risks in fulfilling the contract.

• Additional input needed (optional): Expresses the kinds of additional data or information that would be useful to fulfill the contract.

Table 4: An example data model for the iteration of messaging between contractors.
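And a matching hypothetical encoding of the negotiation message from Table 4, where every field is optional:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ContractFeedback:
    """One possible encoding of the iteration message in Table 4."""
    underspecification: Optional[str] = None       # aspects needing clarification
    cost_negotiation: Optional[str] = None         # cost considered too high
    risk: Optional[str] = None                     # risks in fulfilling the contract
    additional_input_needed: Optional[str] = None  # extra data that would help
```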
Contract Lifecycle

The following illustration shows the lifecycle of defining, negotiating, and executing the contract:

Figure 13: A diagram of the agent-as-contractor lifecycle from Agentspace.

Contract execution

This requires the contractor runtime to be able to fulfill contracts and solve the tasks behind them according to their defined specifications. Prioritizing quality and completeness over latency makes it possible to fully leverage the capabilities of LLMs, for example by generating different solutions and then reviewing, scoring, and evolving them; an example of this methodology is shown below in the Co-Scientist study. The engine can iterate and self-validate the results and deliverables against the provided expectations, improving and self-correcting until the validators are satisfied. The ability to concretely validate a solution against a set of objective criteria has proven to work extremely well in AI and has been at the core of successful systems such as AlphaCode.
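The generate-review-score-evolve loop can be sketched as follows. Here `generate`, `validate`, and `evolve` are hypothetical LLM-backed helpers, and `validate` is assumed to return a score of 1.0 when every deliverable specification is satisfied:

```python
def execute_contract(contract, generate, validate, evolve,
                     n_candidates=4, max_rounds=5):
    """Iterate until the contract's validators pass, or return the best attempt."""
    candidates = [generate(contract) for _ in range(n_candidates)]
    best = max(candidates, key=lambda c: validate(c, contract))
    for _ in range(max_rounds):
        if validate(best, contract) >= 1.0:  # all deliverable specs satisfied
            break
        # Self-correct: evolve the strongest candidates using validator feedback.
        top = sorted(candidates, key=lambda c: validate(c, contract), reverse=True)
        candidates = [evolve(c, contract) for c in top[: max(1, n_candidates // 2)]]
        best = max(candidates + [best], key=lambda c: validate(c, contract))
    return best
```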
Contract Negotiation

One core hypothesis behind contracts, specifically in the context of automation agents, is that many tasks in the enterprise world can draw significant benefit from leveraging the power of LLMs in a less constrained manner (in terms of latency and cost). Tackling ever more complex tasks, and making it possible for customers to rely on and trust the results of contractors, is what will ensure real value for companies.

That said, we need a notion of relative priority to make sure tasks are appropriately prioritized and resources are fairly allocated. We thus introduce a notion of cost (typically relative per customer or contract initiator) which can be discussed and negotiated between the contract initiator and the contractor, so that the contract receives adequate resources relative to the other contracts started by the same initiator. Contractors can also negotiate other aspects of the contracts, for example the specifications and deliverables (cf. also the section on feedback below).
Contract Feedback

Contracts provide a vehicle for feedback and, in particular, for resolving ambiguities. As tasks become more and more complex, it is critical to be able to raise ambiguities or other issues with the task specifications as early as possible. Contractors can give feedback on a contract immediately after receiving it (the initial contract assessment), and then at a frequency predefined in the contract. This feedback will contain clarification requests or other observations about the underspecification or misspecification of tasks (inconsistencies, conflicting specs, and so on).

Subcontracts

Although not directly part of the contract definition and specification, the ability to decompose a task into subtasks by generating subcontracts is a core concept that powers the contractors' engine. When a task is considered too complex to be tackled directly, contractors can decide to decompose it into smaller, easier tasks, which are added to the execution queue for solving. This is made possible only by the contract formalization described above, which lets contractors generate, process, and manipulate other contracts in a uniform and standardized way.
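A sketch of that decomposition loop, assuming hypothetical `too_complex` and `decompose` helpers (e.g., LLM-backed judgments) and an `execute` engine like the one sketched earlier:

```python
from collections import deque

def solve(root_contract, too_complex, decompose, execute):
    """Work through a queue of contracts, splitting any that are too complex."""
    queue = deque([root_contract])
    results = []
    while queue:
        contract = queue.popleft()
        if too_complex(contract):
            # Subcontracts use the same standardized contract format, so the
            # same engine can generate, process, and manipulate them uniformly.
            queue.extend(decompose(contract))
        else:
            results.append(execute(contract))
    return results
```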
Google's Co-Scientist: A Case Study in Multi-Agent Intelligence

Google's AI co-scientist is a prime example of a multi-agent LLM system applied to scientific research. This system utilizes a team of specialized agents, each with its own role and expertise, to accelerate the pace of scientific discovery. These agents collaborate to generate, evaluate, and refine hypotheses, mirroring the iterative process of scientific inquiry.

The co-scientist system employs a "generate, debate, and evolve" approach, drawing inspiration from the scientific method. This approach involves generating diverse hypotheses, critically evaluating their potential, and refining them through ongoing feedback and analysis. The system leverages the strengths of different LLMs, each specializing in a particular aspect of the research process, to achieve a more comprehensive and robust outcome. For instance, in a study on liver fibrosis treatments, the co-scientist not only identified existing drugs but also proposed new mechanisms and promising drug candidates, demonstrating its potential to generate novel insights.

Some of its major components are:

• Data Processing Agents: aggregate and structure large volumes of experimental data.

• Hypothesis Generators: propose potential explanations based on existing research and new findings.

• Validation Agents: run simulations and verify results before presenting them to researchers.

• Collaboration Agents: communicate findings across different research teams, enhancing interdisciplinary cooperation.
Figure 14: A process diagram of the AI co-scientist system 31

This approach exemplifies how multi-agent architectures move beyond static AI interactions, instead fostering dynamic, evolving intelligence capable of handling sophisticated tasks in research, enterprise automation, and knowledge management.
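As a toy illustration only (the real system is far richer), the "generate, debate, and evolve" loop might be sketched like this, with hypothetical callables standing in for the specialized agents:

```python
def generate_debate_evolve(topic, generators, critics, evolve, n_rounds=3):
    """Toy version of the co-scientist loop: propose, critique, refine."""
    hypotheses = [g(topic) for g in generators]    # generate diverse hypotheses
    for _ in range(n_rounds):
        refined = []
        for h in hypotheses:
            reviews = [c(h) for c in critics]      # debate: critical evaluation
            refined.append(evolve(h, reviews))     # evolve using the feedback
        hypotheses = refined
    return hypotheses
```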
Automotive AI: Real World Use of Multi-Agent Architecture

To illustrate how multi-agent concepts manifest in practice, let's examine a comprehensive multi-agent system designed for automotive conversational AI, in which several specialized agents collaborate to provide an intuitive and seamless in-car experience. We'll look at some examples of specialized agents first, followed by some common patterns for combining them into a multi-agent system.

Specialized Agents

Conversational Navigation Agent

The Conversational Navigation Agent is specialized in helping users find locations, get suggestions for places, and navigate using APIs like Google Places and Maps. This agent understands navigation intents like "Find me a sushi restaurant nearby," queries the Places API for relevant results, re-ranks those results based on user preferences and historical data, and sends navigation commands to the car's built-in system.
Example Interaction:

User: "I need to take a break for lunch. What are good restaurants on my route to Munich?"

Navigation Agent: "I found three well-rated restaurants along your route on the A9. In 25 kilometers, there's 'Gasthaus Alte Mühle' at the Ingolstadt exit with 4.5 stars - reviewers praise their traditional Bavarian dishes and beer garden. In 45 kilometers, 'La Strada' at the service area has 4.1 stars with good reviews for quick Italian meals. Or if you can wait 70 kilometers, 'Zeitlos' near Allershausen has 4.7 stars - it's a bit fancier with farm-to-table cuisine. Which one interests you most?"

User: "The Bavarian place sounds good. Is there convenient parking there, and how far is it from the motorway?"

Navigation Agent: "Gasthaus Alte Mühle has a dedicated parking area with about 40 spaces. It's only 1.2 kilometers from the A9 exit at Ingolstadt-Nord - about a 2-minute drive. They're known for their Schweinebraten and homemade dumplings. It gets busy around noon, but it's currently showing moderate occupancy. Would you like me to set this as your destination and guide you there? You'll arrive around 12:15 if we head there now."

Conversational Media Search Agent

The Conversational Media Search Agent is specialized in helping users find and play music, audiobooks, and podcasts. This agent retrieves media from local databases or streaming services, uses web search for additional content when necessary, suggests media based on contextual factors like mood, weather, and time of day, and handles voice commands such as "Play a relaxing playlist for my drive." It also identifies similar artists or content when users ask questions like "Who sounds like John Coltrane?"
Example Interaction:

User: "Play something upbeat to help me stay alert on this long motorway drive."

Conversational Media Search Agent: "I'll play an upbeat driving playlist to help keep you energized. Starting with 'Don't Stop Me Now' by Queen, followed by some Daft Punk and Stromae tracks. I've added some popular European radio hits that match your preferences. I've adjusted the volume for optimal road awareness. Just say 'next' if you want to skip a song."

Message Composition Agent

The Message Composition Agent is specialized in helping users draft, summarize, and send messages or emails while driving. This agent parses voice commands like "Tell Alex I'm running late," uses natural language models to compose contextually appropriate messages, offers draft previews with user modification options, and integrates with various messaging apps including SMS, WhatsApp, and email.

Example Interaction:

User: "There's heavy traffic on the E40. Let my team know I'll be 20 minutes late to the morning standup."

Message Composition Agent: "I've prepared a message to your work team: 'Stuck in traffic on the E40. Will be about 20 minutes late to our morning standup. Please start without me.' Should I send this to your 'Work Team' chat group, or would you like to edit it first?"
Car Manual Agent

The Car Manual Agent is specialized in providing answers to car-related questions by leveraging a Retrieval-Augmented Generation (RAG) system. This agent retrieves relevant sections from the car manual based on user queries, summarizes and contextualizes this information using an LLM, and links users to detailed documentation or instructional videos when appropriate.

Example Interaction:

User: "How do I turn off the lane-keeping assist feature in this Volkswagen?"

Car Manual Agent: "To turn off lane-keeping assist in your Volkswagen, press the button with the car-and-lane icon located on the left side of your steering wheel. You'll see a confirmation in the driver display when it's deactivated. You can also disable it permanently in the infotainment system under Settings > Driver Assistance > Lane Keeping. Would you like me to walk you through the settings menu instead?"

General Knowledge Agent

The General Knowledge Agent is specialized in answering factual questions about the world, history, science, culture, and other general topics. This agent accesses a broad knowledge base to answer factual questions, provides biographical information about people, offers contextual explanations of concepts and phenomena, maintains grounding in factual information while avoiding hallucinations, and responds to follow-up questions by maintaining contextual awareness.

Example Interaction:

User: "Tell me an interesting fact about Salzburg."
General Knowledge Agent: "Salzburg, Austria is the birthplace of Mozart, one of history's most famous composers. Beyond that well-known fact, Salzburg's old town is a UNESCO World Heritage site, and parts of the beloved musical 'The Sound of Music' were filmed there in 1964. The Salzburg Festival, one of the world's most prestigious music and drama festivals, has been held there annually since 1920."

Patterns in Use

Let's look at some common ways to combine these agents.

Hierarchical Pattern

A central Orchestrator Agent classifies queries and routes them to specialized agents. For example, when a user asks "Find me a sushi restaurant nearby," the Orchestrator detects this as a navigation request and routes it to the Navigation Agent, which handles location searches and interaction with maps APIs.

Figure 15. Hierarchical pattern.
The Orchestrator Agent is specialized in routing queries to the most appropriate specialized agents and managing the overall conversation flow. This agent analyzes user queries to determine domain and intent, directs requests to the most suitable specialized agent, maintains conversation context across multiple turns, and manages fallback strategies when specialized agents cannot provide satisfactory responses.

Example Interaction:

First Turn:

User: "Is there a service area ahead with clean toilets?"

Orchestrator Agent: [internally routes to Navigation Agent]

Navigation Agent: "There's a service area 20 kilometers ahead on the E19. It's Aire de Waremme, which was renovated last year and has good ratings for facilities and cleanliness. It also has a Paul bakery and a quick-service restaurant."

Second Turn:

User: "Will it rain at my destination this evening?"

Orchestrator Agent: [recognizes weather query and routes to Weather Information Agent]

Weather Information Agent: "At your destination in Brussels, the forecast shows a 60% chance of light rain in the evening, starting around 19:00. Temperatures will be around 14°C. I'd recommend bringing an umbrella if you plan to be outdoors."
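A minimal sketch of this routing step, assuming a hypothetical classify_intent function (e.g., an LLM classification call) and specialized agents exposed as plain callables:

```python
from typing import Callable

def orchestrate(query: str,
                agents: dict[str, Callable[[str], str]],
                classify_intent: Callable[[str], str],
                fallback: str = "general") -> str:
    """Route a query to the agent whose domain the classifier picks."""
    domain = classify_intent(query)              # e.g., returns "navigation"
    agent = agents.get(domain, agents[fallback])  # fallback strategy
    return agent(query)

# Usage with toy stand-ins for the specialized agents:
agents = {
    "navigation": lambda q: "Routing to Places/Maps APIs...",
    "general":    lambda q: "Answering from general knowledge...",
}
print(orchestrate("Find me a sushi restaurant nearby",
                  agents, classify_intent=lambda q: "navigation"))
```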
Diamond Pattern

In this variation of the hierarchical pattern, responses from specialized agents pass through a central moderation agent before reaching the user. For instance, the Navigation Agent generates a factual response about a nearby restaurant, which then goes through a Rephraser Agent that adjusts the tone and style based on user preferences before text-to-speech synthesis.

The Rephraser Agent is specialized in adapting the tone, style, and presentation of responses to match user preferences and context. This agent transforms technical information into conversational language, adjusts response length based on driving conditions and urgency, personalizes communication style to match user preferences, and formats complex information for easier comprehension while driving.

Example Interaction:

[Raw response from Car Manual Agent before rephrasing]: "In the event of a tire pressure monitoring system alert, the vehicle has detected that one or more tires have fallen below the recommended inflation pressure of 2.2 bar. The affected tire's pressure is approximately 1.8 bar, which is 18.75% below recommended levels. This does not constitute an immediate safety hazard, but prolonged driving with underinflated tires can lead to decreased fuel efficiency, reduced tire longevity, and potentially compromised handling characteristics. It is advisable to proceed to a service station at your earliest convenience to reinflate the tire to proper specifications."

Rephraser Agent: [Adapts for driving context, adjusts information density and urgency]

Final Response to User: "Your front right tire is a bit low - down to 1.8 bar. It's safe to keep driving, but you'll want to add some air at the next petrol station. Low tire pressure can affect your fuel economy and make the tire wear out faster. I can help you find a nearby place to fill up if you'd like."
Figure 16. Diamond pattern.

Example transformation:

Initial Response (factual & neutral): "I found a highly-rated sushi restaurant nearby. It's called Sakura Sushi, located at 123 Main Street. It has a 4.7-star rating and is open until 10 PM. Would you like me to start navigation?"

After rephrasing (playful style): "Sushi craving? Say no more! Head over to Sakura Sushi at 123 Main Street, where the fish is fresh and the soy sauce flows freely! Rated 4.7 stars, open till 10 PM. Ready for an umami adventure?"
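The diamond pattern reduces to a two-stage pipeline. A minimal sketch, where `specialist` is any of the agents above and `llm` is a hypothetical text-completion call:

```python
def diamond_respond(query, specialist, llm,
                    style="concise, conversational, driver-friendly"):
    """Specialist answers; a rephraser adapts the answer before TTS."""
    raw = specialist(query)  # factual but possibly dense response
    prompt = (f"Rewrite the following for a driver. Style: {style}. "
              f"Keep all safety-relevant facts.\n\n{raw}")
    return llm(prompt)       # rephrased text, then sent on to text-to-speech
```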
Peer-to-Peer

Agents can hand off queries to one another when they detect that the orchestrator made a routing mistake. This creates a more resilient system that can recover from initial misclassifications.

Figure 17. Peer-to-peer.

Example scenario:

1. User asks: "Find a place to eat sushi nearby"

2. The Orchestrator correctly routes this to the Conversational Navigation Agent, which provides information about nearby sushi restaurants.

3. User follows up: "How big is New York's Central Park?"

4. The Orchestrator might initially route this to the Conversational Navigation Agent again (based on the previous navigation-related conversation).

5. However, the Conversational Navigation Agent recognizes this as a general knowledge question rather than a navigation request, and hands it off to the General Knowledge Agent, which can provide factual information about Central Park's size.

Advantages of peer-to-peer hand-off compared to centralized orchestration (a sketch of the hand-off logic follows the list):

1. Resilience to misclassification: Even if the central orchestrator makes a routing error, specialized agents can recognize when a query falls outside their domain and redirect it appropriately.

2. Domain expertise in routing: Specialized agents often have a better understanding of the boundaries of their own domains. The Media Search Agent knows exactly what kinds of music-related queries it can handle better than a general orchestrator would.

3. Reduced orchestration complexity: The central orchestrator doesn't need perfect accuracy in initial routing, reducing the complexity of its decision-making logic.
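One way to sketch the hand-off: each agent first checks whether the query is inside its own domain and, if not, redirects to a better-suited peer instead of answering badly. The `in_domain`, `answer`, and `pick_peer` methods are a hypothetical per-agent interface (e.g., LLM-backed checks each specialist owns):

```python
def handle(query, agent, peers, max_hops=2):
    """Answer in-domain queries; otherwise hand off to a peer, with a cap on hops."""
    for _ in range(max_hops):
        if agent.in_domain(query):             # the specialist knows its own boundaries
            return agent.answer(query)
        agent = agent.pick_peer(query, peers)  # recover from a routing mistake
    return peers["general"].answer(query)      # safe fallback after too many hops
```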
Collaborative Pattern

The Collaborative Pattern involves multiple agents working on complementary aspects of the same task, with a Response Mixer Agent that combines elements from different agent responses to create a comprehensive answer. This approach recognizes that different agents contribute valuable pieces to a complete solution based on their specialized expertise.

The pattern is particularly valuable when:

1. Different aspects of a query require different types of expertise (e.g., technical specifications, practical advice, and conceptual explanations)

2. No single agent has complete information to fully address the user's needs

3. The user would benefit from multiple perspectives on the same question
4. Different specialized agents have access to distinct knowledge bases or reasoning capabilities

Unlike the competitive approach, where responses compete, the collaborative pattern assumes that responses from different agents are complementary rather than redundant. The Response Mixer Agent identifies the most valuable information from each source and synthesizes it into a cohesive answer that leverages the unique strengths of each specialist.

For example, when asked about handling hydroplaning, the Car Manual Agent contributes vehicle-specific safety system information, the Driving Tips Agent provides practical driving techniques, and the General Knowledge Agent explains the physics behind the phenomenon. Together, they create a more complete and useful response than any single agent could provide alone.
Figure 18. Collaborative pattern.
Response Mixer Agent

The Response Mixer is specialized in picking the best responses from multiple agents and merging them when needed. This agent evaluates answers based on their accuracy and relevance, removes incorrect information, and combines useful parts from different responses to create the most helpful answer for the user.

Example Interaction:

User: "How do I handle aquaplaning?"

[Behind the scenes, multiple agents generate responses:]

Car Manual Agent Response: "Your Audi is equipped with Electronic Stability Control (ESC) and Anti-lock Braking System (ABS) which help prevent aquaplaning by maintaining traction control." [Confidence score: 71%]

General Knowledge Agent Response: "Aquaplaning occurs when your tires lose contact with the road surface due to water buildup between the tires and the road." [Confidence score: 65%]

Safety Tips Agent Response: "If you experience aquaplaning, ease off the accelerator, avoid sudden braking or steering, and gently steer in the direction you want to go until you regain traction." [Confidence score: 94%]

Response Mixer Agent: [Picks the most relevant parts and merges them into one complete answer]

Final Response to User: "If you experience aquaplaning, which happens when your tires lose contact with the road due to water buildup, ease off the accelerator, avoid sudden braking or steering, and gently steer in the direction you want to go until you regain traction. Your Audi's Electronic Stability Control and ABS systems will help maintain control, but proper driving technique is still essential."

This response-mixing approach ensures that critical safety information isn't missed due to misrouting, and that the user receives the most helpful possible response even when their query has multiple valid interpretations.
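A sketch of the mixing step: gather candidate answers with self-reported confidence scores, drop low-confidence ones, and merge the rest. The `merge` parameter is a hypothetical helper, for instance one more LLM call that synthesizes the kept snippets into a single answer:

```python
def mix_responses(query, agents, merge, threshold=0.6):
    """Keep high-confidence candidate answers and synthesize them into one."""
    candidates = []
    for agent in agents:
        answer, confidence = agent(query)  # each agent returns (text, confidence)
        if confidence >= threshold:        # remove weak or likely-wrong answers
            candidates.append((confidence, answer))
    candidates.sort(reverse=True)          # highest-confidence contribution first
    return merge(query, [answer for _, answer in candidates])
```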
Adaptive Loop Pattern

The Adaptive Loop Pattern involves iterative refinement through repeated attempts, progressively improving results until they meet the desired criteria. The Conversational Navigation Agent can implement a loop pattern to iteratively improve search results when initial queries don't yield satisfactory outcomes.

Example Interaction:

1. User asks: "Find a nice Italian restaurant that serves vegan options"

2. The Navigation Agent queries the Places API with these parameters

3. The agent evaluates the results and finds no restaurants meeting all criteria

4. Instead of returning "no results," the agent automatically reformulates the query:

• First loop: Searches for "Italian restaurants with vegetarian options"

• Second loop: Broadens to "Italian restaurants" and then filters for those mentioning plant-based options

• Third loop: If still unsuccessful, searches for "vegan restaurants" and filters for Italian-influenced cuisine

5. The agent presents the best matches from these progressive searches, explaining how they relate to the original request

This looping behavior enables more robust search capabilities that adapt to availability and context, providing useful results even when exact matches aren't available. A sketch of the loop appears below.
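A minimal sketch of the adaptive loop, where `search` stands in for a Places-API-style call and `reformulate` for an LLM step that relaxes or rewrites the query:

```python
def adaptive_search(query, search, reformulate, max_loops=3):
    """Progressively reformulate a query instead of returning 'no results'."""
    attempts = [query]
    for _ in range(max_loops):
        results = search(attempts[-1])
        if results:
            # Return the attempt history too, so the agent can explain how
            # the results relate to the original request.
            return results, attempts
        attempts.append(reformulate(attempts[-1]))  # e.g., broaden "vegan" to "vegetarian"
    return [], attempts
```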
Advantages of Multi-Agent Architecture for Automotive AI

Multi-agent systems bring powerful advantages to automotive AI by breaking down complex tasks into specialized roles. Think of it as assembling a team of experts rather than relying on one generalist.

In this approach, each agent specializes in what it does best. The Navigation Agent focuses solely on finding locations and planning routes. The Media Search Agent becomes an expert in music and podcasts. The Car Manual Agent specializes in vehicle features and troubleshooting. By focusing on specific domains, each agent develops deeper capabilities in its area.

This specialization makes the entire system more efficient. When an agent handles a narrower set of tasks, it becomes simpler to optimize its performance. The result is higher quality responses delivered more quickly and at lower computational cost. Performance improves because the system can match the right resources to each task. Simple requests use minimal processing power, while complex questions tap into more powerful resources only when needed.

Speed matters in a vehicle, and multi-agent systems deliver where it counts. Critical functions like adjusting climate controls or opening windows run on fast, on-device agents for immediate response. Meanwhile, less urgent tasks like finding restaurant recommendations can use cloud-based agents with more extensive knowledge. This separation ensures that essential vehicle controls remain responsive regardless of what else the system is doing.
This design also creates natural resilience. If internet connectivity drops, the essential functions running on on-device agents continue working. You might temporarily lose restaurant recommendations, but climate control and basic media playback still function perfectly.

Agent Builder

Vertex AI Agent Builder is a collection of products and services for developers: a comprehensive platform for building and connecting agents that combines the engineering excellence and security of Google Cloud, the AI research of Google DeepMind, and the best practices of AgentOps. Google Cloud is building its own agents on top of this platform, and now you can too. Expect many more exciting announcements in 2025 targeting developers of agents.

Vertex AI Agent Engine streamlines development, relying on Google-engineered, managed integrations with popular open-source agent libraries. It provides a managed, autoscaling runtime and many services agents will need (e.g., sessions, examples, tracing, evals). It is a very low-effort, high-value way to safely deploy agents you have written in any framework.

Vertex AI Eval Service provides all of the evaluation tools discussed in this whitepaper, and more. LLM, RAG, and agent evals are stable and scalable via the Vertex AI Eval Service, with convenient integrations into monitoring and experimentation offerings.

There is a large portfolio of agent tools, with more to watch out for:

• Retrieval via Vertex AI Search 26 or RAG Engine 28
• Non-search retrieval from databases via Gen AI Toolbox for Databases 32

• Application integrations 33 with hundreds of APIs supporting full ACLs

• Turning any API into a managed, enterprise-ready tool with Apigee API hub 34

And, of course, the best LLMs for agents, with access to Vertex AI Model Garden 35 and the Gemini family of models 36, which will power the agentic era.

Summary

This whitepaper (a companion to our earlier whitepaper on Agents) has explored the rapidly evolving landscape of generative AI agents, from their fundamental architecture to advanced evaluation techniques and the transformative potential of multi-agent systems.

Key Takeaways for Developers:

1. AgentOps is Essential: Building successful agents goes far beyond the initial proof-of-concept. Embrace AgentOps principles, integrating best practices from DevOps and MLOps, but also focusing on agent-specific elements like tool management, orchestration, memory, and task decomposition.

2. Metrics Drive Improvement: Start with business-level KPIs (like goal completion, user engagement, or revenue) as your "north star." Then, instrument your agents to track granular metrics related to critical tasks, user interactions, and agent actions (traces). Human feedback (👍👎, user surveys) is invaluable.

3. Automated Evaluation is Key: Don't rely solely on manual testing. Implement automated evaluation frameworks that assess agent capabilities, trajectory (the steps taken), and the final response. Leverage techniques like exact match, in-order match, and precision/recall for trajectory evaluation, and autoraters (LLMs as judges) for final response quality.
4. Human-in-the-Loop is Crucial: Automated metrics are powerful, but human evaluation provides essential context, especially for subjective aspects like creativity, common sense, and nuance. Use human feedback to calibrate and validate your automated evaluation methods. Don't outsource the domain knowledge.

5. Multi-Agent Systems Offer Advantages: Consider multi-agent architectures for complex tasks. They can improve accuracy, efficiency, scalability, and fault tolerance. Understand the different design patterns (sequential, hierarchical, collaborative, competitive) and choose the right one for your application.

6. Agentic RAG Improves Relevance: Move beyond traditional RAG by incorporating agents that actively refine search queries, evaluate retrieved information, and adapt to evolving knowledge. This leads to more accurate and contextually relevant responses.

7. Search Optimization is Foundational to RAG: Before diving into complex agentic RAG, optimize your underlying search engine. Techniques like semantic chunking, metadata enrichment, fine-tuning embedding models, and using rankers can significantly improve retrieval quality.

8. Agent and Tool Registries are Important: As the number of agents and tools you use grows, a registry to manage their capabilities, ontology, and performance becomes essential.

9. Security is Paramount: When deploying agents, especially within an enterprise, prioritize security. Leverage platforms like Google Agentspace that offer built-in security features like RBAC, VPC Service Controls, and IAM integration.

10. Use Developer Cycles Efficiently: The classic build-vs-buy design choices remain front of mind as the industry of gen AI agents rapidly evolves. Consider platforms and products as alternatives to building everything from scratch. This will buffer some of the churn of a fast-changing industry and allow you to focus on your data, domain, and users.
11. Agents in the Enterprise: Agents are transforming the way we work, making us much more productive and changing how automation is accomplished. Knowledge workers will increasingly manage fleets of agents, and novel UX will emerge. Google Agentspace is a powerful tool for putting enterprise search, AI, and AI agents on top of a company's data and workflows.

Future Directions for Agent Research and Development:

The field of AI agents is undergoing rapid evolution. Key areas of ongoing research and development include:

• Advanced Evaluation Methods: Developing more robust and scalable evaluation techniques, including process-based evaluation (focusing on reasoning), AI-assisted evaluation, and standardized benchmarks.

• Multi-Agent Coordination: Improving the coordination and communication mechanisms within multi-agent systems to enable more effective collaboration, task handling, and reasoning.

• Real-World Adaptation: Creating agents that can adapt and learn in dynamic, unpredictable real-world environments. Production systems like automotive AI illustrate how agents must balance on-device performance for critical functions against cloud-based capabilities for complex tasks, often adapting to changing connectivity conditions.

• Explainability and Interpretability: Making agent behavior more transparent and understandable, allowing developers and users to gain deeper insights into their decision-making processes.

• Long-Term Memory and Learning: Developing more sophisticated memory mechanisms that allow agents to retain and utilize information over extended periods, enabling continuous learning and adaptation.
• Agent Communication Protocols: Better defining how agents share tasks, knowledge, and messages, especially across remote systems that are opaque to one another.

• From Agents to Contractors: For agents to reach the next level of reliability and utility, we will need to step up the definition of tasks, turning them into contracts with clear deliverables, validation mechanisms, and the ability to negotiate ambiguities, similar to how we contract work from other companies.

Call to Action:

The future of AI is agentic. We encourage developers to embrace these concepts and begin building the next generation of intelligent applications. Start experimenting with the tools and techniques discussed in this whitepaper. Explore the resources available, such as Google Agentspace, NotebookLM Enterprise, Vertex AI Eval Service, Cloud Observability, and Vertex AI Search, to accelerate your development process. Dive into the provided code examples, tutorials, and documentation to gain hands-on experience. Build, evaluate, iterate, and contribute to the growing community of agent developers. The possibilities are limitless, and the time to build is now! Specifically, get started with the code and Colab notebooks in the references.
Endnotes

1. Shafran, I., Cao, Y. et al., 2022, 'ReAct: Synergizing Reasoning and Acting in Language Models'. Available at: https://arxiv.org/abs/2210.03629.

2. Wei, J., Wang, X. et al., 2023, 'Chain-of-Thought Prompting Elicits Reasoning in Large Language Models'. Available at: https://arxiv.org/pdf/2201.11903.pdf.

3. Wang, X. et al., 2022, 'Self-Consistency Improves Chain of Thought Reasoning in Language Models'. Available at: https://arxiv.org/abs/2203.11171.

4. Diao, S. et al., 2023, 'Active Prompting with Chain-of-Thought for Large Language Models'. Available at: https://arxiv.org/pdf/2302.12246.pdf.

5. Zhang, H. et al., 2023, 'Multimodal Chain-of-Thought Reasoning in Language Models'. Available at: https://arxiv.org/abs/2302.00923.

6. Yao, S. et al., 2023, 'Tree of Thoughts: Deliberate Problem Solving with Large Language Models'. Available at: https://arxiv.org/abs/2305.10601.

7. Long, X., 2023, 'Large Language Model Guided Tree-of-Thought'. Available at: https://arxiv.org/abs/2305.08291.

8. Google. 'Google Gemini Application'. Available at: http://gemini.google.com.

9. Swagger. 'OpenAPI Specification'. Available at: https://swagger.io/specification/.

10. Xie, M., 2022, 'How does in-context learning work? A framework for understanding the differences from traditional supervised learning'. Available at: https://ai.stanford.edu/blog/understanding-incontext/.

11. Google Research. 'ScaNN (Scalable Nearest Neighbors)'. Available at: https://github.com/google-research/google-research/tree/master/scann.

12. LangChain. 'LangChain'. Available at: https://python.langchain.com/v0.2/docs/introduction/.

13. Sokratis Kartakis, 2024, 'GenAI in Production: MLOps or GenAIOps?'. Available at: https://medium.com/google-cloud/genai-in-production-mlops-or-genaiops-25691c9becd0.

14. Sokratis Kartakis, 2024, 'Gen AI Ops, Operationalize Generative AI, A Practical Guide'. Available at: https://medium.com/google-cloud/genaiops-operationalize-generative-ai-a-practical-guide-d5bedaa59d78.
15. Cloud Trace overview. Available at: https://cloud.google.com/trace/docs/overview.

16. Berkeley Function-Calling Leaderboard (BFCL). Available at: https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html.

17. Karthik Narasimhan, et al., 2024, 'τ-bench'. Available at: https://arxiv.org/abs/2406.12045.

18. Karthik Valmeekam, et al., 2023, 'PlanBench'. Available at: https://arxiv.org/abs/2206.10498.

19. Xiao Liu, et al., 2023, 'AgentBench'. Available at: https://arxiv.org/abs/2308.03688.

20. Martin Iglesias, et al., 2025, 'DABStep'. Available at: https://huggingface.co/spaces/adyen/DABstep.

21. LangSmith platform for agent observability. Available at: https://docs.smith.langchain.com/evaluation/concepts#agents.

22. Mingchen Zhuge, et al., 2024, 'Agent-as-a-Judge: Evaluate Agents with Agents'. Available at: https://arxiv.org/abs/2410.10934.

23. Multi-agent documentation from LangGraph. Available at: https://langchain-ai.github.io/langgraph/concepts/multi_agent/.

24. LangChain blog, 2024, 'Multi-agent workflows'. Available at: https://blog.langchain.dev/langgraph-multi-agent-workflows/.

25. Vectorize blog, 2024, 'How I finally got agentic RAG to work right'. Available at: https://vectorize.io/how-i-finally-got-agentic-rag-to-work-right/.

26. Vertex AI Search, product documentation. Available at: https://cloud.google.com/enterprise-search.

27. Vertex AI Search Builder APIs, product documentation. Available at: https://cloud.google.com/generative-ai-app-builder/docs/builder-apis.

28. Vertex AI RAG Engine, product documentation. Available at: https://cloud.google.com/vertex-ai/generative-ai/docs/rag-overview.

29. Agentspace product documentation. Available at: https://cloud.google.com/agentspace/agentspace-enterprise/docs/overview.

30. NotebookLM Enterprise product documentation. Available at: https://cloud.google.com/agentspace/notebooklm-enterprise/docs/overview.
31. Juraj Gottweis, et al., 2025, 'Accelerating scientific breakthroughs with an AI co-scientist'. Available at: https://research.google/blog/accelerating-scientific-breakthroughs-with-an-ai-co-scientist/.

32. Hamsa Buvaraghan, et al., 2025, 'Announcing public beta of Gen AI Toolbox for Databases'. Available at: https://cloud.google.com/blog/products/ai-machine-learning/announcing-gen-ai-toolbox-for-databases-get-started-today?e=48754805.

33. Google Cloud Integration Connectors, product documentation. Available at: https://cloud.google.com/integration-connectors/docs.

34. Apigee API hub, product documentation. Available at: https://cloud.google.com/apigee/docs/apihub/what-is-api-hub.

35. Vertex AI Model Garden, product documentation. Available at: https://cloud.google.com/model-garden.

36. Gemini family of LLMs, product documentation. Available at: https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models#gemini-models.

37. Get Started Evaluating Agents with the Vertex Eval Service. Available at: https://cloud.google.com/vertex-ai/generative-ai/docs/models/evaluation-agents.

38. Irina Sigler, Ivan Nardini, 2025, 'Introducing Agent Evaluation in Vertex AI'. Available at: https://cloud.google.com/blog/products/ai-machine-learning/introducing-agent-evaluation-in-vertex-ai-gen-ai-evaluation-service?e=48754805.

39. Sample agent evaluation notebooks for LangGraph, CrewAI, and LangChain. Available at: https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/evaluation/.

40. Sample agents, primarily beginner and intermediate level. Available at: https://github.com/GoogleCloudPlatform/generative-ai/.

41. Sample agents, intermediate and advanced levels. Available at: https://github.com/GoogleCloudPlatform/applied-ai-engineering-samples.
