Debugging with Production Neighbors – Powered by SLATE

Software development is an iterative and staged process that needs validation and testing at function, component, and service levels. In the case of microservice-based architecture, it becomes far more important to develop in conjunction with dependent services. Microservice-based architecture provides distinct advantages that allow us to scale, maintain, and abstract responsibilities. The more abstraction, the easier it is for us to develop and define business logic.

SLATE is an E2E testing tool that bridges the gap by allowing services under test to be deployed and work along with production upstream and downstream services. This allows developers to generate test requests mirroring production call flow yet target services under test. Such functionality facilitates various use cases, including feature development within a production environment or replicating production bugs, which often entail troubleshooting both code and configuration. To simplify troubleshooting and bring it closer to the local experience, we have developed features that enable debugging of services deployed in the SLATE environment.

In this blog we’ll explore different debugging options developed on SLATE that emulate the behavior of services under test with production upstream and downstream services.

Let us look at the following three high-level options in detail:

Remote debugging a SLATE-deployed instance
Local debugging on a laptop/devpod machine
Debugging issues through filtered monitoring

Debugging using logs is a fundamental practice that provides insights into a program’s execution. Logs enable developers to identify issues. However, inefficient logging can clutter the system with irrelevant information, complicating rather than aiding the debugging process.

A staging environment is a developer-controlled environment that mirrors the production setup.

While staging environments are very beneficial, they may still differ from the live environment and provide false confidence with a longer turnaround time.

Local debugging is essential for faster iteration when testing a service in isolation. However, debugging user scenarios can be challenging because it is hard to debug multiple services at the same time.

Testing and debugging on SLATE relies on logs from the service being tested. Depending solely on logs for understanding complex processes isn’t practical. Additionally, adding new logs requires a new deployment, causing delays. Remote debugging can address these issues by letting developers step through statements and monitor variables, eliminating the need for commit and deployment iterations. Because SLATE works alongside production infrastructure, any solution needs to balance security and developer experience.

This brings a need to enhance visibility into the code for runtime debugging, achieved through breakpoints, step-ins, or dynamic tracepoints. Remote debugging is limited to SLATE instances handling test requests to ensure production security.

Deploy a debuggable binary/code on a SLATE container
Ability to add breakpoints and tracepoints to a service under test
Ability to see values of different params on control hitting a breakpoint
Create a seamless developer experience similar to remote debugging
Design solutions that comply with security and privacy requirements

SLATE leverages the production infrastructure to generate containers, compile code, and execute services. However, modifications were required to the build and deployment infrastructure to facilitate debugging functionalities for services deployed on SLATE. This involved three significant enhancements. Firstly, enabling the generation of builds with integrated debugging tools and functionalities. Secondly, configuring software execution with remote debugging options. Thirdly, facilitating developer access to remote containers by allocating and exposing ports from said containers.

The current deployment pipeline is not flexible enough to generate both debuggable and production binaries. To generate and deploy a debuggable binary, multiple components of the pipeline must recognize the type of binary and configure their features accordingly. This diagram indicates the components that would be involved during the feature development.

Fig 1: Modifications to the deployment pipeline to support debugging for SLATE.

The SLATE container gets created alongside the production host. To be able to connect to the debugger, we have to expose a new debug port, similar to a gRPC/HTTP port. Currently UP is responsible for allocating random ports and mapping them to host ports. The new port is exposed only for debuggable SLATE deployments, and SLATE, by design, handles the test requests. Exposing this new port needs a security review. The diagram below indicates the high-level interactions.
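
As a rough illustration of the port handling described above, the sketch below asks the operating system for free host ports and adds a debug port only when the deployment is marked debuggable. The function names, the port labels, and the `debuggable` flag are hypothetical and do not reflect UP's actual interface.

```python
import socket


def free_port() -> int:
    """Ask the OS for a currently unused TCP port on the loopback interface."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
        # The port is released when the socket closes, so a real allocator
        # would have to reserve it; this sketch only shows the selection step.
        return sock.getsockname()[1]


def allocate_ports(debuggable: bool) -> dict[str, int]:
    """Return a label -> host port mapping.

    The 'debug' entry exists only for debuggable deployments (hypothetical flag).
    """
    ports = {"grpc": free_port(), "http": free_port()}
    if debuggable:
        ports["debug"] = free_port()
    return ports
```

Calling `allocate_ports(debuggable=False)` yields no debug entry, which mirrors the idea that production deployments never expose the extra port.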

Fig 2: Allocation and safely exposing debug ports.

To improve the security and avoid malicious access, SLATE debugging needs to be access controlled. This would ensure that only the service owners would be able to connect the debugger. The diagram below indicates the access control that would limit access to only the LDAP users of the service.

Fig 3: Password-based SSH tunneling to the remote host from the developer machines.

The debugger runs the application within a dedicated debugging server
The process blocks, awaiting attachment by the debugger client
The debugger process listens on a specific TCP/IP network port, referred to as a debug port

Debugging clients (e.g., VSCode, GoLand, JetBrains) connect via the debug port. Clients issue commands for various debugging tasks like setting breakpoints, displaying local variables and function arguments, printing CPU register contents, etc.
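
The server and client roles described above can be modeled with plain TCP sockets. This is a minimal sketch, not a real debugger: the one-line text command and the `ack` reply are an invented protocol, and the function names are hypothetical. It only shows a process that blocks until a client attaches on the debug port and then answers a command.

```python
import socket
import threading


def serve_one_debug_session(server: socket.socket, attached: threading.Event) -> None:
    """Block until a debugger client connects, then acknowledge one command."""
    conn, _addr = server.accept()  # blocks: the process waits for attachment
    with conn:
        attached.set()  # a real debug server would start or resume the program here
        command = conn.recv(1024).decode()  # e.g. "break handler.go:42"
        conn.sendall(f"ack {command}".encode())


def attach_and_send(port: int, command: str) -> str:
    """Play the debugging client: connect to the debug port and send one command."""
    with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
        client.sendall(command.encode())
        return client.recv(1024).decode()
```

To try it, create a listener with `socket.create_server(("127.0.0.1", 0))`, run `serve_one_debug_session` in a thread, and call `attach_and_send` with the listener's port.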

Remote debugging enables debugging on diverse environments, configurations, or architectures. It is useful for troubleshooting specific scenarios or hardware/software-related issues that cannot be replicated locally.

During debugging sessions, users attach the debugger to the application to intervene in program execution and gather debug information in order to identify and resolve bugs. If every user were allowed to do this, sensitive service information would be exposed. Restricting access to LDAP users (service developers/owners) is therefore important to ensure a minimum level of security.

For remote debugging, a secure SSH connection is established between the local and remote systems. This allows for local port forwarding, which redirects debug requests through an SSH tunnel. The tunnel ensures encrypted communication and secure data transmission.
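
The local port forwarding described above maps onto the standard OpenSSH `-L` option. The sketch below only builds the command line; the host name and port numbers passed in are placeholders, and the default account name is an assumption taken from the “slatedev” account this post mentions.

```python
def ssh_tunnel_command(remote_host: str, debug_port: int, local_port: int,
                       user: str = "slatedev") -> list[str]:
    """Build an OpenSSH command that forwards a local port to the remote debug port."""
    return [
        "ssh",
        "-N",  # do not run a remote command; only forward ports
        "-L", f"{local_port}:localhost:{debug_port}",  # local port forwarding
        f"{user}@{remote_host}",
    ]
```

With the tunnel running, a debugging client would connect to `localhost:<local_port>` on the developer machine and the traffic would travel encrypted to the debug port on the remote host.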

To begin an SSH connection, users need the correct password linked to the “slatedev” account. This password is a randomly generated 16-digit code stored in a file within the service container. It is generated during the container’s startup, before the main service application runs. The password is accessible only to the container access group, which is the service owner LDAP group. LDAP users can retrieve the password through Compute CLI, which enables them to establish SSH connections and perform debugging tasks. Compute CLI denies password access to non-LDAP users.
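
A minimal sketch of the password step, assuming the container access group maps to the file's POSIX group: it generates a 16-digit code with a cryptographically secure source and stores it in a file that only the owner and group can read. The function name, file path, and permission bits are illustrative assumptions, not the actual SLATE startup logic.

```python
import os
import secrets
import string


def write_debug_password(path: str, length: int = 16) -> str:
    """Generate a random numeric password and store it in a group-readable file."""
    password = "".join(secrets.choice(string.digits) for _ in range(length))
    # Create the file readable by the owner only, then widen to the group.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(password)
    os.chmod(path, 0o640)  # owner read/write, group read, no access for others
    return password
```

Running this before the main service process starts matches the ordering described above, so the password exists by the time a developer asks for it.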

Remote debugging on production infra restricts dynamic modifications, so it is limited to read-only debugging
Large iteration time, as each change involves build, deploy, and test

Remote debugging allows for read-only debugging on production infrastructure. Being in production infra allows for seamless connections with upstream and downstream services/tools. For a developer, it is just as important to have a debuggable environment with fast iteration for fixing and testing changes. This gap can be filled by creating a local debugging experience that connects to production upstream and downstream services. SLATE Attach fills this gap and allows for rapid development by attaching local environments.

The main goal is to reduce the code-deploy-test cycle (and hence, the time to validate iterative changes), by providing E2E testing with local development instances (laptop or dev machine), ensuring production isolation and safety.

The iteration cycle in this context is the time between making a code change and validating it. The smaller the iteration cycle, the more efficient the use of developers’ time for end-to-end validation of subsequent changes.

Fig 4: Iteration steps for development using SLATE.

Iterative development generates a build binary at a faster pace
Reduced code-deploy-test cycle
Faster identification and resolution of local, E2E failures
Faster setup time
Avoid the need for service changes or onboarding

This design aims to introduce a SLATE proxy that handles all the test requests aimed at SLATE instances for local debugging. These requests will then be redirected to the appropriate local developer machine for debugging and development. This allows users to iterate faster and improve developers’ productivity.

This feature is enabled mainly in two contexts in the SLATE environment lifecycle:

SLATE Control plane that maps local laptop/devpod to a slate environment
Test Request Data plane that redirects the requests to developers’ laptops

The main feature of the control plane is to enable services running on local laptops or devpods to attach to a SLATE environment. The local laptop/devpod that intends to run the service has to attach its local environment credentials to a SLATE environment so that test requests are routed locally. The prerequisite for this attachment is to create a SLATE environment. This allows the mappings in the routing control DB and the local routing DB to be updated.

Fig 5: Request Call flow for testing code running on developer machines

User initiates SLATE attach from local laptop/devpod
The SLATE CLI calls Attach() API of SLATE Backend
SLATE Backend fetches the Proxy information (host:port) from SLATE Proxy
SLATE Backend updates the routing override in routing control DB using the fetched proxy info
User initiates the SSH session using the Cerberus CLI
Cerberus gateway adds the mapping of deputized tenancy/UUID to the laptop credentials in Flipr DB and creates an SSH session for the laptop

Routing Control DB maps test tenancy to routing overrides and user account UUIDs to test tenancy. It stores the SLATE Proxy host:port against the service under test and ensures that all requests targeting a particular SLATE environment reach SLATE Proxy. SLATE Proxy finally routes the request to the development instance running on the user’s machine.

Local routing DB contains the credentials of the development instance that has been attached to the SLATE environment. SLATE Proxy interacts with the local routing DB to fetch routing credentials and finally routes the request to the service under test running in the local environment.
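
The two stores described above can be pictured as a pair of lookup tables. The sketch below is a simplified in-memory model: keying both tables by (environment, service), the `ControlPlane` class, and its method names are assumptions made for illustration, and the real databases hold richer tenancy and credential data.

```python
from dataclasses import dataclass, field


@dataclass
class ControlPlane:
    # Routing control DB: (environment, service) -> SLATE Proxy "host:port"
    routing_control_db: dict[tuple[str, str], str] = field(default_factory=dict)
    # Local routing DB: (environment, service) -> developer machine "host:port"
    local_routing_db: dict[tuple[str, str], str] = field(default_factory=dict)

    def attach(self, env: str, service: str, proxy_addr: str, local_addr: str) -> None:
        """Record that a local instance of `service` is attached to `env`."""
        key = (env, service)
        self.routing_control_db[key] = proxy_addr
        self.local_routing_db[key] = local_addr

    def resolve(self, env: str, service: str) -> list[str]:
        """Return the hops for a test request: SLATE Proxy first, then the local instance."""
        key = (env, service)
        if key not in self.routing_control_db:
            return []  # nothing is attached for this environment and service
        return [self.routing_control_db[key], self.local_routing_db[key]]
```

Keeping the proxy address and the local address in separate tables reflects the split above: production routing only needs to know about the proxy, while the proxy alone knows where the developer machine is.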

This section mainly talks about the flow of test requests from different clients (mobile, studio, web, etc.). The data plane mainly involves two entities: the routing override header and the host tenancy mapping. The diagram below indicates how different test requests reach a local laptop through the SLATE proxy. The control plane ensures that the routing override and host mapping are maintained in different databases.

Fig 6: Proxy setup for routing test requests to developer machines.

The figure above shows the flow of a test request targeted at a local laptop with production upstreams and downstreams:

Test account request originates from mobile client
E2E test proxy retrieves routing override and injects the routing header to the test request
The test request propagates through production services via Mutley until a call targets service 3
The request is redirected to SLATE proxy, as the routing override holds the SLATE proxy host:port against service 3
SLATE proxy forwards the request to an open port on Cerberus-gateway based on a host:port config in the Cerberus-deputy Flipr namespace, set by the user when running the Cerberus CLI
The Cerberus-gateway forwards the request to the user’s local development machine for the user to debug
From the local laptop, the request will be finally forwarded to production downstreams through Cerberus
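
The routing decision in the steps above can be sketched as a header check at each hop. The header name `x-routing-override`, the `service=host:port` encoding, and both function names are hypothetical; the sketch only shows that calls to other services keep their production destination while calls to the overridden service go to the SLATE proxy.

```python
ROUTING_HEADER = "x-routing-override"  # hypothetical header name


def inject_override(headers: dict[str, str], service: str, proxy_addr: str) -> dict[str, str]:
    """Return a copy of the headers carrying a 'service=host:port' routing override."""
    updated = dict(headers)
    updated[ROUTING_HEADER] = f"{service}={proxy_addr}"
    return updated


def next_hop(target_service: str, headers: dict[str, str],
             production_hosts: dict[str, str]) -> str:
    """Pick the destination for a call: the override if it names this service, else production."""
    override = headers.get(ROUTING_HEADER, "")
    service, separator, address = override.partition("=")
    if separator and service == target_service:
        return address
    return production_hosts[target_service]
```

Because the header travels with the request, every production service on the path can make the same decision without knowing anything about the developer's machine.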

Running a service locally may not be feasible for some complex services, as they need support for some dependencies like spanners that can only exist in production infra
Local debugging is limited to test requests, because it allows requests to be changed dynamically; this restriction keeps production traffic safe
Requests time out when a debug request waits too long on the local machine
Plug-and-play development environment to improve developer productivity
Ability to create local experiences that co-work with production for developers
Increase Developer Velocity: Production debugging can help developers identify and fix issues more efficiently

Fig 7: Impact figures for improving developer velocity using SLATE attach feature.

SLATE Sniffer to debug issues by monitoring

Remote and local debugging mainly support debugging with test requests. There is also a need for observability on production, beyond the logs that show up in the uMonitor tool. We aim to provide this observability precisely and on demand using SLATE Sniffer. The main goals of SLATE Sniffer include:

Capture requests and responses filtered by service and UUID
Ability to support and filter on Production and Test requests
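
The filtering goals above can be illustrated with a small in-memory filter. The `Exchange` record, its fields, and the `sniff` function are hypothetical and say nothing about how SLATE Sniffer actually captures traffic; the sketch only shows filtering by service and UUID, with an optional split between production and test requests.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Exchange:
    service: str
    uuid: str
    is_test: bool
    request: str
    response: str


def sniff(exchanges: Iterable[Exchange], service: str, uuid: str,
          is_test: Optional[bool] = None) -> list[Exchange]:
    """Keep exchanges for one service and UUID.

    Pass is_test=True or is_test=False to narrow the result to test or
    production traffic; leave it as None to keep both.
    """
    return [
        exchange for exchange in exchanges
        if exchange.service == service
        and exchange.uuid == uuid
        and (is_test is None or exchange.is_test == is_test)
    ]
```

Filtering on both keys keeps the captured data narrow, which is what makes on-demand observability practical next to regular logs.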

Our objective is to enhance the SLATE platform, positioning it as the primary tool for debugging production issues. The debugging features integrated into SLATE strike a balance between security and developers’ requirements. SLATE has introduced a new paradigm for developers’ code-related activities and service bootstrapping. We are looking forward to collaborating with different teams to shift the quality left and create visibility on potential issues at the early stage of development.