Prototype to Production
Authors: Sokratis Kartakis, Gabriela Hernandez Larios,
Ran Li, Elia Secchi, and Huang Xia
Acknowledgements
Content contributors
Derek Egan
Chase Lyall
Anant Nawalgaria
Lavi Nigam
Kanchana Patlolla
Michael Vakoc
Curators and editors
Anant Nawalgaria
Kanchana Patlolla
Designer
Michael Lanning
November 2025
Table of contents
Abstract
Introduction: From Prototype to Production
People and Process
The Journey to Production
Evaluation as a Quality Gate
The Automated CI/CD Pipeline
Safe Rollout Strategies
Building Security from the Start
Operations in Production
Observe: Your Agent's Sensory System
Act: The Levers of Operational Control
Managing System Health: Performance, Cost, and Scale
Managing Risk: The Security Response Playbook
Evolve: Learning from Production
The Engine of Evolution: An Automated Path to Production
The Evolution Workflow: From Insight to Deployed Improvement
Evolving Security: The Production Feedback Loop
Beyond Single-Agent Operations
A2A - Reusability and Standardization
A2A Protocol: From Concept to Implementation
How A2A and MCP Work Together
Registry Architectures: When and How to Build Them
Putting It All Together: The AgentOps Lifecycle
Conclusion: Bridging the Last Mile with AgentOps
Endnotes
Building an agent is easy. Trusting
it is hard.
Abstract
This whitepaper provides a comprehensive technical guide to the operational life cycle of AI
agents, focusing on deployment, scaling, and productionizing. Building on Day 4's coverage
of evaluation and observability, this guide emphasizes how to build the necessary trust to
move agents into production through robust CI/CD pipelines and scalable infrastructure. It
explores the challenges of transitioning agent-based systems from prototypes to enterprise-
grade solutions, with special attention to Agent2Agent (A2A) interoperability. This guide
offers practical insights for AI/ML engineers, DevOps professionals, and system architects.
Introduction: From Prototype
to Production
You can spin up an AI agent prototype in minutes, maybe even seconds. But turning that
clever demo into a trusted, production-grade system that your business can depend on?
That's where the real work begins. Welcome to the "last mile" production gap: working with
customers, we consistently observe that roughly 80% of the effort goes not into the agent's
core intelligence, but into the infrastructure, security, and validation needed to make it
reliable and safe.
Skipping these final steps could cause several problems. For example:
• A customer service agent is tricked into giving products away for free because you
forgot to set up the right guardrails.
• A user discovers they can access a confidential internal database through your agent
because authentication was improperly configured.
• An agent generates a large consumption bill over the weekend, but no one knows why
because you didn't set up any monitoring.
• A critical agent that worked perfectly yesterday suddenly stops, but your team is
scrambling because there was no continuous evaluation in place.
These aren't just technical problems; they are major business failures. And while principles
from DevOps and MLOps provide a critical foundation, they aren't enough on their own.
Deploying agentic systems introduces a new class of challenges that require an evolution
in our operational discipline. Unlike traditional ML models, agents are autonomously
interactive, stateful, and follow dynamic execution paths.
This creates unique operational headaches that demand specialized strategies:
• Dynamic Tool Orchestration: An agent's "trajectory" is assembled on the fly as it picks
and chooses tools. This requires robust versioning, access control, and observability for a
system that behaves differently every time.
• Scalable State Management: Agents can remember things across interactions.
Managing session and memory securely and consistently at scale is a complex systems
design problem.
• Unpredictable Cost & Latency: An agent can take many different paths to find an
answer, making its cost and response time incredibly hard to predict and control without
smart budgeting and caching.
To navigate these challenges successfully, you need a foundation built on three
key pillars: Automated Evaluation, Automated Deployment (CI/CD), and
Comprehensive Observability.
This whitepaper is your step-by-step playbook for building that foundation and navigating
the path to production! We'll start with the pre-production essentials, showing you how to
set up automated CI/CD pipelines and use rigorous evaluation as a critical quality check.
From there, we'll dive into the challenges of running agents in the wild, covering strategies for
scaling, performance tuning, and real-time monitoring. Finally, we'll look ahead to the exciting
world of multi-agent systems with the Agent-to-Agent protocol and explore what it takes to
get them communicating safely and effectively.
Practical Implementation Guide
Throughout this whitepaper, practical examples reference the Google Cloud Platform
Agent Starter Pack 1 — a Python package providing production-ready Generative AI
agent templates for Google Cloud. It includes pre-built agents, automated CI/CD
setup, Terraform deployment, Vertex AI evaluation integration, and built-in Google
Cloud observability. The starter pack demonstrates the concepts discussed here
with working code you can deploy in minutes.
People and Process
After all that talk of CI/CD, observability, and dynamic pipelines, why the focus on people and
process? Because the best technology in the world is ineffective without the right team to
build, manage, and govern it.
That customer service agent isn't magically prevented from giving away free products; an
AI Engineer and a Prompt Engineer design and implement the guardrails. The confidential
database isn't secured by an abstract concept; a Cloud Platform team configures the
authentication. Behind every successful, production-grade agent there is a well-orchestrated
team of specialists, and in this section, we'll introduce the key players.
Figure 1: A diagram showing that "Ops" is the intersection of people, processes, and technology
In a traditional MLOps landscape, this involves several key teams:
• Cloud Platform Team: Comprising cloud architects, administrators, and security
specialists, this team manages the foundational cloud infrastructure, security, and access
control. The team grants engineers and service accounts least-privilege roles, ensuring
access only to necessary resources.
• Data Engineering Team: Data engineers and data owners build and maintain the data
pipelines, handling ingestion, preparation, and quality standards.
• Data Science and MLOps Team: This includes data scientists who experiment with
and train models, and ML engineers who automate the end-to-end ML pipeline (e.g.,
preprocessing, training, post-processing) at scale using CI/CD. MLOps Engineers support
this by building and maintaining the standardized pipeline infrastructure.
• Machine Learning Governance: This centralized function, including product owners
and auditors, oversees the ML lifecycle, acting as a repository for artifacts and metrics to
ensure compliance, transparency, and accountability.
Generative AI introduces a new layer of complexity and specialized roles to this landscape:
• Prompt Engineers: While this role title is still evolving in the industry, these individuals
blend technical skill in crafting prompts with deep domain expertise. They define the
right questions and expected answers from a model, though in practice this work may
be done by AI Engineers, domain experts, or dedicated specialists depending on the
organization's maturity.
• AI Engineers: They are responsible for scaling GenAI solutions to production, building
robust backend systems that incorporate evaluation at scale, guardrails, and RAG/tool
integration.
• DevOps/App Developers: These developers build the front-end components and user-
friendly interfaces that integrate with the GenAI backend.
The scale and structure of an organization will influence these roles; in smaller companies,
individuals may wear multiple hats, while mature organizations will have more specialized
teams. Effectively coordinating all these diverse roles is essential for establishing a robust
operational foundation and successfully productionizing both traditional ML and generative
AI initiatives.
Figure 2: How multiple teams collaborate to operationalize both models and GenAI applications
The Journey to Production
Now that we've established the team, we turn to the process. How do we translate the work
of all these specialists into a system that is trustworthy, reliable, and ready for users?
The answer lies in a disciplined pre-production process built on a single core principle:
Evaluation-Gated Deployment. The idea is simple but powerful: no agent version should
reach users without first passing a comprehensive evaluation that proves its quality and
safety. This pre-production phase is where we trade manual uncertainty for automated
confidence, and it consists of three pillars: a rigorous evaluation process that acts as a
quality gate, an automated CI/CD pipeline that enforces it, and safe rollout strategies to
de-risk the final step into production.
Evaluation as a Quality Gate
Why do we need a special quality gate for agents? Traditional software tests are insufficient
for systems that reason and adapt. Furthermore, evaluating an agent is distinct from
evaluating an LLM; it requires assessing not just the final answer, but the entire trajectory
of reasoning and actions taken to complete a task. An agent can pass 100 unit tests for its
tools but still fail spectacularly by choosing the wrong tool or hallucinating a response. We
need to evaluate its behavioral quality, not just its functional correctness. This gate can be
implemented in two primary ways:
1. The Manual "Pre-PR" Evaluation: For teams seeking flexibility or just beginning their
evaluation journey, the quality gate is enforced through a team process. Before submitting
a pull request (PR), the AI Engineer or Prompt Engineer (or whoever is responsible
for agent behavior in your organization) runs the evaluation suite locally. The resulting
performance report—comparing the new agent against the production baseline—is then
linked in the PR description. This makes the evaluation results a mandatory artifact for
human review. The reviewer—typically another AI Engineer or the Machine Learning
Governor—is now responsible for assessing not just the code, but also the agent's
behavioral changes against guardrail violations and prompt injection vulnerabilities.
2. The Automated In-Pipeline Gate: For mature teams, the evaluation harness—built and
maintained by the Data Science and MLOps Team—is integrated directly into the CI/
CD pipeline. A failing evaluation automatically blocks the deployment, providing rigid,
programmatic enforcement of quality standards that the Machine Learning Governance
team has defined. This approach trades the flexibility of manual review for the consistency
of automation. The CI/CD pipeline can be configured to automatically trigger an evaluation
job that compares the new agent's responses against a golden dataset. The deployment is
programmatically blocked if key metrics, such as "tool call success rate" or "helpfulness,"
fall below a predefined threshold.
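As a concrete illustration, the sketch below expresses such a gate as an ordinary pytest module that the CI pipeline runs on every pull request. The metric names, thresholds, and the run_evaluation() harness are illustrative assumptions rather than a fixed API; in practice the harness would replay your golden dataset through the candidate agent, for example via a Vertex AI Evaluation job.

Python

# eval_gate_test.py -- a minimal CI evaluation gate (illustrative sketch).
import pytest

# Thresholds agreed with the Machine Learning Governance team (hypothetical values).
THRESHOLDS = {
    "tool_call_success_rate": 0.95,
    "helpfulness": 0.80,
}


def run_evaluation(golden_dataset_path: str) -> dict:
    """Placeholder for your evaluation harness.

    In a real pipeline this would replay every golden-dataset case against the
    candidate agent, score the full trajectories, and return aggregate metrics
    keyed by name.
    """
    # ... invoke the agent on each case and score the trajectories ...
    return {"tool_call_success_rate": 0.97, "helpfulness": 0.84}  # stubbed result


@pytest.mark.parametrize("metric,minimum", THRESHOLDS.items())
def test_quality_gate(metric, minimum):
    results = run_evaluation("tests/golden_dataset.json")
    assert results[metric] >= minimum, (
        f"{metric}={results[metric]:.2f} is below the {minimum:.2f} gate; "
        "the pipeline blocks this deployment."
    )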
Regardless of the method, the principle is the same: no agent proceeds to production
without a quality check. We covered the specifics of what to measure and how to build this
evaluation harness in our deep dive on Day 4: Agent Quality: Observability, Logging,
Tracing, Evaluation, Metrics, which explored everything from crafting a 'golden dataset' (a
curated, representative set of test cases designed to assess an agent's intended behavior
and guardrail compliance) to implementing LLM-as-a-judge techniques, to finally using a
service like Vertex AI Evaluation 2 to power evaluation.
The Automated CI/CD Pipeline
An AI agent is a composite system, comprising not just source code but also prompts, tool
definitions, and configuration files. This complexity introduces significant challenges: how do
we ensure a change to a prompt doesn’t degrade the performance of a tool? How do we test
the interplay between all these artifacts before they reach users?
The solution is a CI/CD (Continuous Integration/Continuous Deployment) pipeline. It is more
than just an automation script; it’s a structured process that helps different people in a team
collaborate to manage complexity and ensure quality. It works by testing changes in stages,
incrementally building confidence before the agent is released to users.
A robust pipeline is designed as a funnel. It catches errors as early and as cheaply as
possible, a practice often called "shifting left." It separates fast, pre-merge checks from more
comprehensive, resource-intensive post-merge deployments. This progressive workflow is
typically structured into three distinct phases:
1. Phase 1: Pre-Merge Integration (CI). The pipeline's first responsibility is to provide
rapid feedback to the AI Engineer or Prompt Engineer who has opened a pull request.
Triggered automatically, this CI phase acts as a gatekeeper for the main branch. It runs
fast checks like unit tests, code linting, and dependency scanning. Crucially, this is the
ideal stage to run the agent quality evaluation suite designed by Prompt Engineers.
This provides immediate feedback on whether a change improves or degrades the agent's
performance against key scenarios before it is ever merged. By catching issues here, we
prevent polluting the main branch. The PR checks configuration template 3 generated
with the Agent Starter Pack 1 (ASP) is a practical example of implementing this phase with
Cloud Build. 4
2. Phase 2: Post-Merge Validation in Staging (CD). Once a change passes all CI checks—
including the performance evaluation—and is merged, the focus shifts from code and
performance correctness to the operational readiness of the integrated system. The
Continuous Deployment (CD) process, often managed by the MLOps Team, packages
the agent and deploys it to a staging environment—a high-fidelity replica of production.
Here, more comprehensive, resource-intensive tests are run, such as load testing and
integration tests against remote services. This is also the critical phase for internal user
testing (often called "dogfooding"), where humans within the company can interact with
the agent and provide qualitative feedback before it reaches the end user. This ensures
that the agent as an integrated system performs reliably and efficiently under production-
like conditions before it is considered for release. The staging deployment template 5 from
ASP shows an example of this deployment.
3. Phase 3: Gated Deployment to Production. After the agent has been thoroughly
validated in the staging environment, the final step is deploying to production. This
is almost never fully automatic, typically requiring a Product Owner to give the final
sign-off, ensuring human-in-the-loop. Upon approval, the exact deployment artifact
that was tested and validated in staging is promoted to the production environment.
This production deployment template 6 generated with ASP shows how this final phase
retrieves the validated artifact and deploys it to production with appropriate safeguards.
Figure 3: Different stages of the CI/CD process
Making this three-phase CI/CD workflow possible requires robust automation infrastructure
and proper secrets management. This automation is powered by two key technologies:
• Infrastructure as Code (IaC): Tools like Terraform define environments programmatically,
ensuring they are identical, repeatable, and version-controlled. For example, this template
generated with Agent Starter Pack 7 provides Terraform configurations for complete
agent infrastructure including Vertex AI, Cloud Run, and BigQuery resources.
• Automated Testing Frameworks: Frameworks like Pytest execute tests and evaluations
at each stage, handling agent-specific artifacts like conversation histories, tool invocation
logs, and dynamic reasoning traces.
Furthermore, sensitive information like API keys for tools should be managed securely using
a service like Secret Manager 8 and injected into the agent's environment at runtime, rather
than being hardcoded in the repository.
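As a sketch of this pattern, the snippet below reads a tool's API key from Secret Manager at startup using the google-cloud-secret-manager client; the project ID, secret name, and environment variable are placeholders for your own values.

Python

# Fetching a tool's API key from Secret Manager at startup, instead of
# hardcoding it in the repository. Project ID and secret name are placeholders.
import os

from google.cloud import secretmanager


def get_secret(secret_id: str, project_id: str, version: str = "latest") -> str:
    client = secretmanager.SecretManagerServiceClient()
    name = f"projects/{project_id}/secrets/{secret_id}/versions/{version}"
    response = client.access_secret_version(request={"name": name})
    return response.payload.data.decode("UTF-8")


# Injected into the agent's environment at runtime; the agent's tools read it
# from the environment and the secret never appears in source control.
os.environ["WEATHER_API_KEY"] = get_secret("weather-api-key", "my-gcp-project")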
Safe Rollout Strategies
While comprehensive pre-production checks are essential, real-world application inevitably
reveals unforeseen issues. Rather than switching 100% of users at once, consider minimizing
risk through gradual rollouts with careful monitoring.
Here are four proven patterns that help teams build confidence in their deployments:
• Canary: Start with 1% of users, monitoring for prompt injections and unexpected tool
usage. Scale up gradually or roll back instantly.
• Blue-Green: Run two identical production environments. Route traffic to "blue" while
deploying to "green," then switch instantly. If issues emerge, switch back—zero downtime,
instant recovery.
• A/B Testing: Compare agent versions on real business metrics for data-driven decisions.
This can happen either with internal or external users.
• Feature Flags: Deploy code but control release dynamically, testing new capabilities with
select users first.
All these strategies share a foundation: rigorous versioning. Every component—code,
prompts, model endpoints, tool schemas, memory structures, even evaluation datasets—
must be versioned. When issues arise despite safeguards, this enables instant rollback to a
known-good state. See this as your production "undo" button!
You can deploy agents using Agent Engine 9 or Cloud Run 10 , then leverage Cloud Load
Balancing 11 for traffic management across versions or connect to other microservices. The
Agent Starter Pack 1 provides ready-to-use templates with GitOps workflows—where every
deployment is a git commit, every rollback is a git revert, and your repository becomes the
single source of truth for both current state and complete deployment history.
Building Security from the Start
Safe deployment strategies protect you from bugs and outages, but agents face a unique
challenge: they can reason and act autonomously. A perfectly deployed agent can still cause
harm if it hasn't been built with proper security and responsibility measures. This requires a
comprehensive governance strategy embedded from day one, not added as an afterthought.
Unlike traditional software that follows predetermined paths, agents make decisions. They
interpret ambiguous requests, access multiple tools, and maintain memory across sessions.
This autonomy creates distinct risks:
• Prompt Injection & Rogue Actions: Malicious users can trick agents into performing
unintended actions or bypassing restrictions.
• Data Leakage: Agents might inadvertently expose sensitive information through their
responses or tool usage.
• Memory Poisoning: False information stored in an agent's memory can corrupt all
future interactions.
Fortunately, frameworks like Google's Secure AI Agents approach 12 and the Google Secure
AI Framework (SAIF) 13 address these challenges through three layers of defense:
1. Policy Definition and System Instructions (The Agent's Constitution): The process
begins by defining policies for desired and undesired agent behavior. These are
engineered into System Instructions (SIs) that act as the agent's core constitution.
2. Guardrails, Safeguards, and Filtering (The Enforcement Layer): This layer acts as the
hard-stop enforcement mechanism.
• Input Filtering: Use classifiers and services like the Perspective API to analyze prompts
and block malicious inputs before they reach the agent.
• Output Filtering: After the agent generates a response, Vertex AI's built-in safety
filters 14 provide a final check before anything reaches the user; they can be configured
to block outputs containing specific PII, toxic language, or other harmful or policy-
violating content (a configuration sketch follows this list).
• Human-in-the-Loop (HITL) Escalation: For high-risk or ambiguous actions, the
system must pause and escalate to a human for review and approval.
3. Continuous Assurance and Testing: Safety is not a one-time setup. It requires constant
evaluation and adaptation.
• Rigorous Evaluation: Any change to the model or its safety systems must trigger a full
re-run of a comprehensive evaluation pipeline using Vertex AI Evaluation.
• Dedicated RAI Testing: Rigorously test for specific risks either by creating dedicated
datasets or using simulation agents, including Neutral Point of View (NPOV)
evaluations and Parity evaluations.
• Proactive Red Teaming: Actively try to break the safety systems through creative
manual testing and AI-driven persona-based simulation.
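As an illustration of the output-filtering layer above, the sketch below configures Vertex AI's safety filters when calling a Gemini model through the Vertex AI SDK. The project, location, model name, and thresholds are illustrative choices, and an agent framework would typically apply this configuration inside its model-calling layer.

Python

# Configuring Vertex AI's built-in safety filters as an output-filtering layer.
# The project, location, model, and thresholds below are illustrative.
import vertexai
from vertexai.generative_models import (
    GenerativeModel,
    HarmBlockThreshold,
    HarmCategory,
    SafetySetting,
)

vertexai.init(project="my-gcp-project", location="us-central1")  # placeholders

safety_settings = [
    SafetySetting(
        category=HarmCategory.HARM_CATEGORY_HARASSMENT,
        threshold=HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
    ),
    SafetySetting(
        category=HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
        threshold=HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    ),
]

model = GenerativeModel("gemini-2.0-flash")
response = model.generate_content(
    "Summarize this support ticket for the customer.",
    safety_settings=safety_settings,
)
# Blocked candidates are reported via finish_reason and safety_ratings, which the
# agent can inspect before anything is returned to the user.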
Operations in Production
Your agent is live. Now the focus shifts from development to a fundamentally different
challenge: keeping the system reliable, cost-effective, and safe as it interacts with
thousands of users. A traditional service operates on predictable logic. An agent, in
contrast, is an autonomous actor. Its ability to follow unexpected reasoning paths means it
can exhibit emergent behaviors and accumulate costs without direct oversight.
Managing this autonomy requires a different operational model. Instead of static monitoring,
effective teams adopt a continuous loop: Observe the system's behavior in real-time, Act to
maintain performance and safety, and Evolve the agent based on production learnings. This
integrated cycle is the core discipline for operating agents successfully in production.
Observe: Your Agent's Sensory System
To trust and manage an autonomous agent, you must first understand its process.
Observability provides this crucial insight, acting as the sensory system for the subsequent
"Act" and "Evolve" phases. A robust observability practice is built on three pillars that work
together to provide a complete picture of the agent's behavior:
• Logs: The granular, factual diary of what happened, recording every tool call, error,
and decision.
• Traces: The narrative that connects individual logs, revealing the causal path of why an
agent took a certain action.
• Metrics: The aggregated report card, summarizing performance, cost, and operational
health at scale to show how well the system is performing.
For example, in Google Cloud, this is achieved through the operations suite: a user's request
generates a unique ID in Cloud Trace 15 that links the Vertex AI Agent Engine 9 invocation,
model calls, and tool executions with visible durations. Detailed logs flow to Cloud Logging 16 ,
while Cloud Monitoring 17 dashboards alert when latency thresholds are exceeded. The
Agent Development Kit (ADK) 18 provides built-in Cloud Trace integration for automatic
instrumentation of agent operations.
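For example, when deploying an ADK agent to Agent Engine, tracing can be switched on as the agent is wrapped for deployment. The sketch below follows the pattern in the ADK and Agent Engine documentation; the agent definition, project, and model name are placeholders.

Python

# Enabling automatic Cloud Trace instrumentation for an ADK agent deployed on
# Agent Engine (sketch; names and project values are placeholders).
import vertexai
from google.adk.agents import Agent
from vertexai.preview import reasoning_engines

vertexai.init(project="my-gcp-project", location="us-central1")

root_agent = Agent(
    name="support_agent",
    model="gemini-2.0-flash",
    instruction="Answer customer questions about their orders.",
)

app = reasoning_engines.AdkApp(
    agent=root_agent,
    enable_tracing=True,  # emits spans for agent, model, and tool calls to Cloud Trace
)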
By implementing these pillars, we move from operating in the dark to having a clear, data-
driven view of our agent's behavior, providing the foundation needed to manage it effectively
in production. (For a full discussion of these concepts, see Agent Quality: Observability,
Logging, Tracing, Evaluation, Metrics).
Act: The Levers of Operational Control
Observations without action are just expensive dashboards. The "Act" phase is about real-
time intervention—the levers you pull to manage the agent's performance, cost, and safety
based on what you observe.
Think of "Act" as the system's automated reflexes designed to maintain stability in real-time.
In contrast, "Evolve", which will be covered later, is the strategic process of learning from
behavior to create a fundamentally better system.
Because an agent is autonomous, you cannot pre-program every possible outcome. Instead,
you must build robust mechanisms to influence its behavior in production. These operational
levers fall into two primary categories: managing the system's health and managing its risk.
Managing System Health: Performance, Cost, and Scale
Unlike traditional microservices, an agent's workload is dynamic and stateful. Managing its
health requires a strategy for handling this unpredictability.
• Designing for Scale: The foundation is decoupling the agent's logic from its state.
• Horizontal Scaling: Design the agent as a stateless, containerized service. With
external state, any instance can handle any request, enabling serverless platforms like
Cloud Run 10 or the managed Vertex AI Agent Engine Runtime 9 to scale automatically.
• Asynchronous Processing: For long-running tasks, offload work using event-
driven patterns. This keeps the agent responsive while complex jobs process in the
background. On Google Cloud, for example, a service can publish tasks to Pub/Sub 19 ,
which can then trigger a Cloud Run service for asynchronous processing.
• Externalized State Management: Since LLMs are stateless, persisting memory
externally is non-negotiable. This highlights a key architectural choice: Vertex AI Agent
Engine provides a built-in, durable Session and memory service, while Cloud Run
offers the flexibility to integrate directly with databases like AlloyDB 20 or Cloud SQL 21 .
• Balancing Competing Goals: Scaling always involves balancing three competing goals:
speed, reliability, and cost.
• Speed (Latency): Keep your agent fast by designing it to work in parallel, aggressively
caching results, and using smaller, efficient models for routine tasks.
• Reliability (Handling Glitches): Agents must handle temporary failures. When a call
fails, automatically retry, ideally with exponential backoff to give the service time to
recover (see the sketch after this list). This requires designing "safe-to-retry" (idempotent)
tools to prevent bugs like duplicate charges.
• Cost: Keep the agent affordable by shortening prompts, using cheaper models for
easier tasks, and sending requests in groups (batching).
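The sketch below illustrates the retry-with-exponential-backoff pattern from the reliability bullet above. The delays, retry budget, and the TransientToolError type are illustrative assumptions rather than any specific framework's API.

Python

# Generic retry helper with exponential backoff and jitter for flaky tool calls.
# Delays, attempt counts, and the error type are illustrative choices.
import random
import time


class TransientToolError(Exception):
    """Stand-in for the retryable errors your tools actually raise."""


def call_with_backoff(tool_fn, *args, max_attempts=4, base_delay=1.0, **kwargs):
    """Retry a transient failure. The wrapped tool must be idempotent (safe to
    retry) so that, for example, a retried payment cannot charge a customer twice."""
    for attempt in range(1, max_attempts + 1):
        try:
            return tool_fn(*args, **kwargs)
        except TransientToolError:
            if attempt == max_attempts:
                raise  # out of budget: surface the failure to the caller
            # Exponential backoff with jitter: ~1s, 2s, 4s, ... plus noise.
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.5)
            time.sleep(delay)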
Managing Risk: The Security Response Playbook
Because an agent can act on its own, you need a playbook for rapid containment. When
a threat is detected, the response should follow a clear sequence: contain, triage,
and resolve.
The first step is immediate containment. The priority is to stop the harm, typically with a
"circuit breaker"—a feature flag to instantly disable the affected tool.
Next is triage. With the threat contained, suspicious requests are routed to a human-in-the-
loop (HITL) review queue to investigate the exploit's scope and impact.
Finally, the focus shifts to a permanent resolution. The team develops a patch—like an
updated input filter or system prompt—and deploys it through the automated CI/CD pipeline,
ensuring the fix is fully tested before blocking the exploit for good.
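A minimal sketch of such a circuit breaker is shown below: a feature flag consulted before every tool invocation, so operators can disable a compromised tool instantly. The in-memory flag set and function names are stand-ins for whatever flag service or configuration store you actually use.

Python

# A feature-flag "circuit breaker" checked before every tool invocation.
# The in-memory set is a stand-in for a real flag service or config store.
DISABLED_TOOLS: set[str] = set()


def guarded_tool_call(tool_name: str, tool_fn, *args, **kwargs):
    if tool_name in DISABLED_TOOLS:
        # Fail closed: return a safe refusal instead of executing the tool.
        return {
            "status": "unavailable",
            "message": f"The '{tool_name}' capability is temporarily disabled.",
        }
    return tool_fn(*args, **kwargs)


# During an incident, containment is a single operation:
DISABLED_TOOLS.add("issue_refund")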
Evolve: Learning from Production
While the "Act" phase provides the system's immediate, tactical reflexes, the "Evolve" phase
is about long-term, strategic improvement. It begins by looking at the patterns and trends
collected in your observability data and asking a crucial question: "How do we fix the root
cause so this problem never happens again?"
This is where you move from reacting to production incidents to proactively making your
agent smarter, more efficient, and safer. You turn the raw data from the "Observe" phase into
durable improvements in your agent's architecture, logic, and behavior.
The Engine of Evolution: An Automated Path to Production
An insight from production is only valuable if you can act on it quickly. Observing that 30% of
your users fail at a specific task is useless if it takes your team six months to deploy a fix.
This is where the automated CI/CD pipeline you built in pre-production (see "The Automated CI/CD Pipeline" above)
becomes the most critical component of your operational loop. It is the engine that powers
rapid evolution. A fast, reliable path to production allows you to close the loop between
observation and improvement in hours or days, not weeks or months.
When you identify a potential improvement—whether it's a refined prompt, a new tool, or an
updated safety guardrail—the process should be:
1. Commit the Change: The proposed improvement is committed to your
version-controlled repository.
2. Trigger Automation: The commit automatically triggers your CI/CD pipeline.
3. Validate Rigorously: The pipeline runs the full suite of unit tests, security scans, and the
agent quality evaluation suite against your updated datasets.
4. Deploy Safely: Once validated, the change is deployed to production using a safe
rollout strategy.
This automated workflow transforms evolution from a slow, high-risk manual project into a
fast, repeatable, and data-driven process.
The Evolution Workflow: From Insight to Deployed Improvement
1. Analyze Production Data: Identify trends in user behavior, task success rates, and
security incidents from production logs.
2. Update Evaluation Datasets: Transform production failures into tomorrow's test cases,
augmenting your golden dataset.
3. Refine and Deploy: Commit improvements to trigger the automated pipeline—whether
refining prompts, adding tools, or updating guardrails.
This creates a virtuous cycle where your agent continuously improves with every
user interaction.
An Evolve Loop in Action
A retail agent's logs (Observe) show that 15% of users receive an error when asking
for 'similar products.' The product team Acts by creating a high-priority ticket. The
Evolve phase begins: production logs are used to create a new, failing test case for
the evaluation dataset. An AI Engineer refines the agent's prompt and adds a new,
more robust tool for similarity search. The change is committed, passes the now-
updated evaluation suite in the CI/CD pipeline, and is safely rolled out via a canary
deployment, resolving the user issue in under 48 hours.
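To make the "update evaluation datasets" step concrete, the sketch below turns the failing "similar products" request from the example into a permanent golden-dataset entry. The file path, schema, and tool name are illustrative.

Python

# Promoting a production failure into a permanent golden-dataset case.
# File path, field names, and the tool name are illustrative.
import json

new_case = {
    "id": "similar-products-error-001",
    "source": "production incident",  # provenance for auditability
    "input": "Show me products similar to the hiking boots I bought last month.",
    "expected_behavior": {
        "tool_calls": ["search_similar_products"],  # hypothetical tool name
        "must_not_contain": ["error"],
    },
}

with open("tests/golden_dataset.json") as f:
    golden = json.load(f)

golden.append(new_case)

with open("tests/golden_dataset.json", "w") as f:
    json.dump(golden, f, indent=2)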
Evolving Security: The Production Feedback Loop
While the foundational security and responsibility framework is established in pre-production
(see "Building Security from the Start"), the work is never truly finished. Security is not a static checklist; it is a dynamic,
continuous process of adaptation. The production environment is the ultimate testing
ground, and the insights gathered there are essential for hardening your agent against
real-world threats.
This is where the Observe → Act → Evolve loop becomes critical for security. The process
is a direct extension of the evolution workflow:
1. Observe: Your monitoring and logging systems detect a new threat vector. This could be
a novel prompt injection technique that bypasses your current filters, or an unexpected
interaction that leads to a minor data leak.
2. Act: The security response team immediately contains the threat (as discussed in
"Managing Risk: The Security Response Playbook").
3. Evolve: This is the crucial step for long-term resilience. The security insight is fed back
into your development lifecycle:
• Update Evaluation Datasets: The new prompt injection attack is added as a
permanent test case to your evaluation suite.
• Refine Guardrails: A Prompt Engineer or AI Engineer refines the agent's system
prompt, input filters, or tool-use policies to block the new attack vector.
• Automate and Deploy: The engineer commits the change, which triggers the full CI/
CD pipeline. The updated agent is rigorously validated against the newly expanded
evaluation set and deployed to production, closing the vulnerability.
This creates a powerful feedback loop where every production incident makes your agent
stronger and more resilient, transforming your security posture from a defensive stance to
one of continuous, proactive improvement.
To learn more about Responsible AI and securing AI Agentic Systems, please consult
the whitepaper Google's Approach for Secure AI Agents 12 and the Google Secure AI
Framework (SAIF) 13 .
Beyond Single-Agent Operations
You've mastered operating individual agents in production and can ship them at high velocity.
But as organizations scale to dozens of specialized agents—each built by different teams
with different frameworks—a new challenge emerges: these agents can't collaborate. The
next section explores how standardized protocols can transform these isolated agents into
an interoperable ecosystem, unlocking exponential value through agent collaboration.
A2A - Reusability and Standardization
You've built dozens of specialized agents across your organization. The customer service
team has their support agent. Analytics built a forecasting system. Risk management
created fraud detection. But here's the problem: these agents can't talk to each other,
whether because they were built with different frameworks, in different projects, or on
different clouds altogether.
This isolation creates massive inefficiency. Every team rebuilds the same capabilities. Critical
insights stay trapped in silos. What you need is interoperability—the ability for any agent
to leverage any other agent's capabilities, regardless of who built it or what framework
they used.
To solve this, a principled approach to standardization is required, built on two distinct but
complementary protocols. While the Model Context Protocol (MCP 22 ), which we covered
in detail on Agent Tools and Interoperability with MCP, provides a universal standard for
tool integration, it is not sufficient for the complex, stateful collaboration required between
intelligent agents. This is the problem the Agent2Agent (A2A 23 ) protocol, now governed by
the Linux Foundation, was designed to solve.
The distinction is critical. When you need a simple, stateless function like fetching weather
data or querying a database, you need a tool that speaks MCP. But when you need to
delegate a complex goal, such as "analyze last quarter's customer churn and recommend
three intervention strategies," you need an intelligent partner that can reason, plan, and act
autonomously via A2A. In short, MCP lets you say, "Do this specific thing," while A2A lets you
say, "Achieve this complex goal."
A2A Protocol: From Concept to Implementation
The A2A protocol is designed to break down organizational silos and enable seamless
collaboration between agents. Consider a scenario where a fraud detection agent spots
suspicious activity. To understand the full context, it needs data from a separate transaction
analysis agent. Without A2A, a human analyst must manually bridge this gap—a process
that could take hours. With A2A, the agents collaborate automatically, resolving the issue
in minutes.
The first step of the collaboration is discovering the right agent to delegate to - this is made
possible through Agent Cards, 24 which are standardized JSON specifications that act as a
business card for each agent. An Agent Card describes what an agent can do, its security
requirements, its skills, and how to reach out to it (url), allowing any other agent in the
ecosystem to dynamically discover its peers. See example Agent Card below:
JSON
{
  "name": "check_prime_agent",
  "version": "1.0.0",
  "description": "An agent specialized in checking whether numbers are prime",
  "capabilities": {},
  "securitySchemes": {
    "agent_oauth_2_0": {
      "type": "oauth2"
    }
  },
  "defaultInputModes": ["text/plain"],
  "defaultOutputModes": ["application/json"],
  "skills": [
    {
      "id": "prime_checking",
      "name": "Prime Number Checking",
      "description": "Check if numbers are prime using efficient algorithms",
      "tags": ["mathematical", "computation", "prime"]
    }
  ],
  "url": "http://localhost:8001/a2a/check_prime_agent"
}
Snippet 1: A sample agent card for the check_prime_agent
Adopting this protocol doesn't require an architectural overhaul. Frameworks like the ADK
simplify this process significantly (docs 25 ). You can make an existing agent A2A-compatible
with a single function call, which automatically generates its AgentCard and makes it
available on the network.
Python
# Example using ADK: Exposing an agent via A2A
from google.adk.agents import Agent
from google.adk.a2a.utils.agent_to_a2a import to_a2a

# Your existing agent
root_agent = Agent(
    name='hello_world_agent',
    # ... your agent code ...
)

# Make it A2A-compatible
a2a_app = to_a2a(root_agent, port=8001)

# Serve with uvicorn:
#   uvicorn agent:a2a_app --host localhost --port 8001

# Or serve with Agent Engine:
#   from vertexai.preview.reasoning_engines import A2aAgent
#   from google.adk.a2a.executor.a2a_agent_executor import A2aAgentExecutor
#   a2a_agent = A2aAgent(
#       agent_executor_builder=lambda: A2aAgentExecutor(agent=root_agent)
#   )
Snippet 2: Using the ADK's to_a2a utility to wrap an existing agent and expose it for A2A communication
Once an agent is exposed, any other agent can consume it by referencing its AgentCard. For
example, a customer service agent can now query a remote product catalog agent without
needing to know its internal workings.
Python
# Example using ADK: Consuming a remote agent via A2A
from google.adk.agents.remote_a2a_agent import RemoteA2aAgent

prime_agent = RemoteA2aAgent(
    name="prime_agent",
    description="Agent that handles checking if numbers are prime.",
    agent_card=(
        "http://localhost:8001/a2a/check_prime_agent"
        "/.well-known/agent-card.json"
    ),
)
Snippet 3: Using the ADK's RemoteA2aAgent class to connect to and consume a remote agent
This unlocks powerful, hierarchical compositions. A root agent can be configured to
orchestrate both a local sub-agent for a simple task and a remote, specialized agent via A2A,
creating a more capable system.
Python
# Example using ADK: Hierarchical agent composition
from google.adk.agents import Agent
from google.adk.agents.remote_a2a_agent import RemoteA2aAgent

# ADK Local sub-agent for dice rolling
roll_agent = Agent(
    name="roll_agent",
    instruction="You are an expert at rolling dice."
)

# ADK Remote A2A agent for prime checking
prime_agent = RemoteA2aAgent(
    name="prime_agent",
    agent_card="http://localhost:8001/.well-known/agent-card.json"
)

# ADK Root orchestrator combining both
root_agent = Agent(
    name="root_agent",
    instruction="""Delegate rolling dice to roll_agent, prime checking
    to prime_agent.""",
    sub_agents=[roll_agent, prime_agent]
)
Snippet 4: Using a remote A2A agent (prime_agent) as a sub-agent within a hierarchical agent structure in
the ADK
However, enabling this level of autonomous collaboration introduces two non-negotiable
technical requirements. First is distributed tracing, where every request carries a unique
trace ID, which is essential for debugging and maintaining a coherent audit trail across
multiple agents. Second is robust state management. A2A interactions are inherently
stateful, requiring a sophisticated persistence layer for tracking progress and ensuring
transactional integrity.
A2A is best suited for formal, cross-team integrations that require a durable service
contract. For tightly coupled tasks within a single application, lightweight local sub-agents
often remain a more efficient choice. As the ecosystem matures, new agents should be
built with native support for both protocols, ensuring every new component is immediately
discoverable, interoperable, and reusable, compounding the value of the whole system.
How A2A and MCP Work Together
Figure 4: A2A and MCP collaboration at a glance
A2A and MCP are not competing standards; they are complementary protocols designed
to operate at different levels of abstraction. The distinction depends on what an agent is
interacting with. MCP is the domain of tools and resources—primitives with well-defined,
structured inputs and outputs, like a calculator or a database API. A2A is the domain of other
agents—autonomous systems that can reason, plan, use multiple tools, and maintain state to
achieve complex goals.
The most powerful agentic systems use both protocols in a layered architecture. An
application might primarily use A2A to orchestrate high-level collaboration between multiple
intelligent agents, while each of those agents internally uses MCP to interact with its own
specific set of tools and resources.
A practical analogy is an auto repair shop staffed by autonomous AI agents.
1. User-to-Agent (A2A): A customer uses A2A to communicate with the "Shop Manager"
agent to describe a high-level problem: "My car is making a rattling noise."
2. Agent-to-Agent (A2A): The Shop Manager engages in a multi-turn diagnostic
conversation and then delegates the task to a specialized "Mechanic" agent, again
using A2A.
3. Agent-to-Tool (MCP): The Mechanic agent now needs to perform specific actions. It
uses MCP to call its specialized tools: it runs scan_vehicle_for_error_codes() on a
diagnostic scanner, queries a repair manual database with get_repair_procedure(),
and operates a platform lift with raise_platform().
4. Agent-to-Agent (A2A): After diagnosing the issue, the Mechanic agent determines a part
is needed. It uses A2A to communicate with an external "Parts Supplier" agent to inquire
about availability and place an order.
In this workflow, A2A facilitates the higher-level, conversational, and task-oriented
interactions between the customer, the shop's agents, and external suppliers. Meanwhile,
MCP provides the standardized plumbing that enables the mechanic agent to reliably use its
specific, structured tools to do its job.
Registry Architectures: When and How to Build Them
Why do some organizations build registries while others don't need them? The answer lies in
scale and complexity. When you have fifty tools, manual configuration works fine. But when
you reach five thousand tools distributed across different teams and environments, you face
a discovery problem that demands a systematic solution.
A Tool Registry uses a protocol like MCP to catalog all assets, from functions to APIs.
Instead of giving agents access to thousands of tools, you create curated lists, leading to
three common patterns:
• Generalist agents: Access the full catalog, trading speed and accuracy for scope.
• Specialist agents: Use predefined subsets for higher performance.
• Dynamic agents: Query the registry at runtime to adapt to new tools.
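The toy sketch below shows how these three patterns differ in practice; the registry contents and the filtering helper are entirely hypothetical, standing in for an MCP-backed catalog service.

Python

# A toy tool registry illustrating generalist, specialist, and dynamic access.
# The catalog and the filtering helper are hypothetical stand-ins for an
# MCP-backed registry service.
TOOL_REGISTRY = {
    "get_weather":    {"team": "platform",  "tags": ["public", "read-only"]},
    "query_sales_db": {"team": "analytics", "tags": ["internal", "read-only"]},
    "issue_refund":   {"team": "payments",  "tags": ["internal", "write"]},
    # ... thousands more in a real deployment ...
}


def tools_for(tags=None):
    """Return tool names, optionally filtered to a curated subset by tag."""
    if tags is None:
        return list(TOOL_REGISTRY)  # generalist: full catalog, scope over precision
    return [
        name for name, meta in TOOL_REGISTRY.items()
        if set(tags) <= set(meta["tags"])
    ]


generalist_tools = tools_for()                    # everything
specialist_tools = tools_for(tags=["read-only"])  # curated subset, higher performance
# A dynamic agent would call tools_for(...) at runtime as the catalog evolves.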
The primary benefit is human discovery—developers can search for existing tools before
building duplicates, security teams can audit tool access, and product owners can
understand their agents' capabilities.
An Agent Registry applies the same concept to agents, using formats like A2A's
AgentCards. It helps teams discover and reuse existing agents, reducing redundant work.
This also lays the groundwork for automated agent-to-agent delegation, though this remains
an emerging pattern.
Registries offer discovery and governance at the cost of maintenance. You can
consider starting without one and only build it when your ecosystem's scale demands
centralized management!
Decision Framework for Registries
Tool Registry: Build when tool discovery becomes a bottleneck or security requires
centralized auditing.
Agent Registry: Build when multiple teams need to discover and reuse specialized
agents without tight coupling.
Putting It All Together: The
AgentOps Lifecycle
We can now assemble these pillars into a single, cohesive reference architecture! The life
cycle begins in the developer's inner loop—a phase of rapid local testing and prototyping
to shape the agent's core logic. Once a change is ready, it enters the formal pre-production
engine, where automated evaluation gates validate its quality and safety against a
golden dataset. From there, safe rollouts release it to production, where comprehensive
observability captures the real-world data needed to fuel the continuous evolution loop,
turning every insight into the next improvement.
For a comprehensive walkthrough of operationalizing AI agents, including evaluation,
tool management, CI/CD standardization, and effective architecture designs, watch the
AgentOps: Operationalize AI Agents video 26 on the official Google Cloud YouTube channel.
Figure 5: AgentOps core capabilities, environments, and processes
Conclusion: Bridging the Last Mile
with AgentOps
Moving an AI prototype to a production system is an organizational transformation that
requires a new operational discipline: AgentOps.
Most agent projects fail in the "last mile" not due to technology, but because the operational
complexity of autonomous systems is underestimated. This guide maps the path to bridge
that gap. It begins with establishing People and Process as the foundation for governance.
Next, a Pre-Production strategy built on evaluation-gated deployment automates high-
stakes releases. Once live, a continuous Observe → Act → Evolve loop turns every user
interaction into a potential insight. Finally, Interoperability protocols scale the system by
transforming isolated agents into a collaborative, intelligent ecosystem.
The immediate benefits—like preventing a security breach or enabling a rapid rollback—
justify the investment. But the real value is velocity. Mature AgentOps practices allow teams
to deploy improvements in hours, not weeks, turning static deployments into continuously
evolving products.
Your Path Forward
• If you're starting out, focus on the fundamentals: build your first evaluation dataset,
implement a CI/CD pipeline, and establish comprehensive monitoring. The Agent Starter
Pack is a great place to start—it creates a production-ready agent project in minutes with
these foundations already built-in.
• If you're scaling, elevate your practice: automate the feedback loop from production
insight to deployed improvement and standardize on interoperable protocols to build a
cohesive ecosystem, not just point solutions.
The next frontier is not just building better individual agents, but orchestrating sophisticated
multi-agent systems that learn and collaborate. The operational discipline of AgentOps is the
foundation that makes this possible.
We hope this playbook empowers you to build the next generation of intelligent, reliable, and
trustworthy AI. Bridging the last mile is therefore not the final step in a project, but the first
step in creating value!
Endnotes
1. https://github.com/GoogleCloudPlatform/agent-starter-pack
2. https://cloud.google.com/vertex-ai/docs/evaluation/introduction
3. https://github.com/GoogleCloudPlatform/agent-starter-pack/blob/example-agent/example-agent/.cloudbuild/pr_checks.yaml
4. https://cloud.google.com/build
5. https://github.com/GoogleCloudPlatform/agent-starter-pack/blob/example-agent/example-agent/.cloudbuild/staging.yaml
6. https://github.com/GoogleCloudPlatform/agent-starter-pack/blob/example-agent/example-agent/.cloudbuild/deploy-to-prod.yaml
7. https://github.com/GoogleCloudPlatform/agent-starter-pack/blob/example-agent/example-agent/terraform
8. https://cloud.google.com/secret-manager
9. https://cloud.google.com/agent-builder/agent-engine/overview
10. https://cloud.google.com/run
11. https://cloud.google.com/load-balancing/docs/https/traffic-management
12. https://research.google/pubs/an-introduction-to-googles-approach-for-secure-ai-agents/
13. https://safety.google/cybersecurity-advancements/saif/
14. https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/configure-safety-attributes
15. https://cloud.google.com/trace
16. https://cloud.google.com/logging
17. https://cloud.google.com/monitoring
18. https://google.github.io/adk-docs/observability/cloud-trace/
19. https://cloud.google.com/pubsub
20. https://cloud.google.com/alloydb
21. https://cloud.google.com/sql
22. https://modelcontextprotocol.io/
23. https://a2a-protocol.org/latest/specification/
24. https://a2a-protocol.org/latest/specification/#5-agent-discovery-the-agent-card
25. https://google.github.io/adk-docs/a2a/
26. https://www.youtube.com/watch?v=kJRgj58ujEk