AUTHNZ — Authentication and Authorization in Micro-Service Architecture

Tanya Sharma
Myntra Engineering
Mar 29, 2022

Have you ever wondered how a large organization with hundreds of microservices gets them to work together by establishing trust between them? Services running within an organization's production infrastructure need strict control over who is accessing what. Handling authentication and authorization across services in a heterogeneous environment is a challenging problem to solve at scale.

In this article, we are going to look at the Authentication and Authorization (AUTHNZ) system in the distributed microservice architecture of one of India's leading fashion e-commerce destinations, Myntra. The highly dynamic and elastic environment broke accepted perimeter-security assumptions, requiring service-level trust that is agnostic of the underlying network. As organizations adopt new technologies, such as containers, microservices, cloud computing, and serverless functions, one trend is clear: there are a larger number of smaller pieces of software. This increases the number of potential vulnerabilities an attacker can exploit and makes managing perimeter defenses increasingly impractical, which makes a Zero Trust policy an integral part of the solution.

We will talk in detail about how the AUTHNZ system works, its components, and how it introduces a Zero Trust policy at Myntra to defend against modern adversaries.

What is AUTHNZ, and what is its impact on Myntra's architecture?

AUTHNZ is a one-stop solution for authentication and authorization of service-to-service interactions at Myntra. The platform removes the trust issues between two services and the underlying infra by introducing a Zero Trust policy and enforcing it for every service-level call. It also removes the overhead of using static API tokens to authorize calls, which always carry the risk of being spoofed.

Capabilities of the platform:

  • Registration of all services through a self-serve tool.
  • Attestation of the nodes on which services are running.
  • Authentication of the services.
  • Centralized policy management for the resources.
  • Authorization of access requests for the resources.

High-Level Architecture

The entire platform can be divided into two main components:

  1. Authentication: Responsible for validating the identity of the nodes and the services
    - Spire Server
    - Spire Agent
    - Service Registrar
  2. Authorization: Responsible for resource access control.
    - Policy Portal
    - Policy DB
    - Policy Distributor
    - Authorization Agent (AuthZ)
    - Open Policy Agent (OPA)

High-Level Architecture of the AuthNZ platform

Authentication using SPIFFE/SPIRE

The scope for Authentication covers the following:

  • Provide identity to a node.
  • Provide identity to a service residing on a node.
  • Verify and validate the identity of an interacting service or source.
  • Expire and rotate identities.

Key Terminologies

SPIRE is a production-ready implementation of the SPIFFE APIs that performs node and workload attestation in order to securely issue identities (SVIDs) to workloads, and verify the identities (SVIDs) of other workloads, based on a predefined set of conditions.

A SPIRE deployment is composed of a SPIRE Server and one or more SPIRE Agents.

  • A server acts as a signing authority for identities issued to a set of workloads via agents. It also maintains a registry of workload identities and the conditions that must be verified in order for those identities to be issued.
  • Agents expose the SPIFFE Workload API locally to workloads and must be installed on each node on which a workload is running.

SPIFFE ID

A SPIFFE ID is a string that uniquely identifies a workload. It takes the following format:

               spiffe://trustdomain.com/serviceA

The trust domain corresponds to the trusted root of a system.
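
To make the format concrete, here is a minimal sketch, assuming the open-source go-spiffe v2 library and the example trust domain above (illustrative values, not Myntra's actual ones), of how a Go workload could parse a SPIFFE ID and read its trust domain and path:

    package main

    import (
        "fmt"
        "log"

        "github.com/spiffe/go-spiffe/v2/spiffeid"
    )

    func main() {
        // Parse the example SPIFFE ID from above; the trust domain and path
        // are illustrative, not production values.
        id, err := spiffeid.FromString("spiffe://trustdomain.com/serviceA")
        if err != nil {
            log.Fatalf("invalid SPIFFE ID: %v", err)
        }

        fmt.Println("trust domain:", id.TrustDomain().String()) // trustdomain.com
        fmt.Println("path:", id.Path())                         // /serviceA
    }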

SVID

An SVID is a document with which a workload proves its identity to a resource or caller. An SVID is considered valid if it has been signed by an authority within the SPIFFE ID's trust domain. The supported formats are X.509 certificates and JWTs. At Myntra, we use JWTs because the load balancer in use, HAProxy, would have to switch to L4 load balancing if X.509 certificates were used, and could then no longer route based on request headers.

Workload API

The Workload API is an API exposed locally by each agent that allows workloads to retrieve their SPIFFE ID, SVIDs, and trust bundles. The Workload API is platform agnostic and can identify running services at the process level as well as the kernel level, which makes it suitable for use with container schedulers such as Kubernetes.
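
As a concrete illustration of what retrieving an SVID over the Workload API looks like from the workload's side, here is a minimal sketch using the open-source go-spiffe v2 client. The agent socket path and the audience value are assumptions made for illustration only:

    package main

    import (
        "context"
        "fmt"
        "log"
        "time"

        "github.com/spiffe/go-spiffe/v2/svid/jwtsvid"
        "github.com/spiffe/go-spiffe/v2/workloadapi"
    )

    func main() {
        ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
        defer cancel()

        // Connect to the local SPIRE agent's Workload API.
        // The socket path is an assumption; use whatever your agent exposes.
        client, err := workloadapi.New(ctx,
            workloadapi.WithAddr("unix:///run/spire/sockets/agent.sock"))
        if err != nil {
            log.Fatalf("unable to reach the Workload API: %v", err)
        }
        defer client.Close()

        // Ask the agent for a JWT-SVID scoped to the service we intend to call.
        // The audience value is illustrative.
        svid, err := client.FetchJWTSVID(ctx, jwtsvid.Params{
            Audience: "spiffe://trustdomain.com/serviceA",
        })
        if err != nil {
            log.Fatalf("unable to fetch JWT-SVID: %v", err)
        }

        fmt.Println("my SPIFFE ID:", svid.ID.String())
        // Marshal returns the signed JWT, which can be attached to outgoing
        // requests, for example in an Authorization header.
        fmt.Println("token:", svid.Marshal())
    }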

Now that the key terminologies are clearly defined, let’s see how a workload is able to fetch its SVID in the form of a JWT.

Workflow for fetching SVIDs

Workflow for fetching SVIDs by a service
  1. The Registration API is called by a self-serve platform to populate the identity registry with the required SPIFFE IDs and the relevant selectors for each service.
  2. The node agent authenticates with the SPIRE server's node attestation plugin using Azure MSI. The SPIRE server nodes require a `reader` RBAC role in Azure.
  3. The node attestor in the SPIRE server validates the provided identification document based on the mechanism used. Upon successful validation, the SPIRE server sends back the set of SPIFFE IDs that can be issued to the node, along with their process selector policies.
  4. When a workload starts running on the node, it first calls the node agent, asking ‘Who am I?’.
  5. Based on the process selectors (e.g. the Unix user) the node agent received in the previous step, and using the workload attestors, the agent decides which SPIFFE ID to issue to the workload. It generates a key pair for it and sends a CSR (Certificate Signing Request) to the SPIRE server.
  6. The SPIRE server responds to the node agent with the signed SVID for the workload, along with the trust bundles indicating which other workloads this workload can trust.
  7. Upon receiving the response from the SPIRE server, the node agent hands over the received SVID, trust bundles, and the generated private key to the workload. The private key never leaves the node to which its workload belongs.
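
On the receiving side, the called service validates the JWT-SVID presented by the caller against the trust bundle obtained from its own local agent. A minimal sketch, again assuming go-spiffe v2 and illustrative values for the socket path and expected audience:

    package main

    import (
        "context"
        "fmt"
        "log"
        "time"

        "github.com/spiffe/go-spiffe/v2/svid/jwtsvid"
        "github.com/spiffe/go-spiffe/v2/workloadapi"
    )

    func main() {
        ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
        defer cancel()

        // Socket path assumed, as before.
        client, err := workloadapi.New(ctx,
            workloadapi.WithAddr("unix:///run/spire/sockets/agent.sock"))
        if err != nil {
            log.Fatalf("unable to reach the Workload API: %v", err)
        }
        defer client.Close()

        // Trust bundles tell us which signing authorities to accept.
        bundles, err := client.FetchJWTBundles(ctx)
        if err != nil {
            log.Fatalf("unable to fetch trust bundles: %v", err)
        }

        token := "<JWT-SVID presented by the caller>" // placeholder, not a real token

        // Verify the token's signature against the trust bundle and check the
        // expected audience (illustrative value).
        svid, err := jwtsvid.ParseAndValidate(token, bundles,
            []string{"spiffe://trustdomain.com/serviceA"})
        if err != nil {
            log.Fatalf("caller could not be authenticated: %v", err)
        }

        // The verified SPIFFE ID of the caller is then fed into the
        // authorization step described in the next section.
        fmt.Println("authenticated caller:", svid.ID.String())
    }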

What is the authorization problem?

Let's suppose there is a client service, Service-B, that wants to access some resources of Service-A. Service-A needs some way to check whether Service-B is allowed to access that particular resource. Since the resources belong to Service-A itself, it needs a mechanism to define access-control policies on which authorization decisions can be based.

Generalizing the problem to find a common solution, we need a way to define and enforce rules (policies) that state whether a client C can or cannot access a resource R of a server S, for every tuple (C, R, S) in the ecosystem. In an ecosystem built entirely on a microservice architecture, it is easy to imagine how many combinations are possible.

Why do we need a separate authorization system?

Services can define rules in their own codebase by adding different authorization keys for different resources or by defining users for role-based access. However, defining these policies at the application end is expensive and painful to maintain. The keys also need to be secured in a vault and rotated to avoid malicious use. This is a fundamental problem for any organization, which is why we need a separate, platform-agnostic authorization system.

Let's see how we are going to solve the authorization problem using the Open Policy Agent (OPA).

Authorization using Open Policy Agent

Functionalities of AuthZ

The AuthZ components allow service owners to define policies for their services. These policies are stored in a secure database, which decouples policy management from the services themselves.

Upon receiving a request for a resource, the service consults the local AuthZ agent, which uses OPA to make the authorization decision. OPA evaluates general policies, written in the declarative language Rego, against individual requests.

Components involved in the process of Authorization

The above diagram depicts the main components involved in this process of authorization. Let’s get into the details of each of these components:

  1. Policy Creation Platform: A self-serve platform for the creation of service policies.
    A policy is a set of conditions that need to be satisfied to get authorized. By defining policies, service owners decide who can talk to their services, and at what level. The platform currently supports RESTful and gRPC resource policy generation. A policy consists of three main parameters:
    - Request method: GET, POST, PUT, or a gRPC call
    - Resource path: API endpoint, or method name in the case of gRPC
    - Client service: the caller service
    The system also handles endpoints that contain path parameters and query parameters. All the defined policies are stored in a highly available and secured database.
  2. Policy Distributor: A platform that distributes service policies to the respective nodes where services are running. It stores the policies in a Redis cache for faster access and serves them to AuthZ Agents whenever they fetch them.
    The Policy Distributor regularly checks and updates the policies in the cache so that the latest version of every policy is available to services. Whenever the Policy Distributor receives a call from an AuthZ Agent, it fetches the service's policies from the cache, creates the Rego policy template for the service, and sends this template back to the AuthZ Agent, which uses it for decision-making. The Policy Distributor is scalable and highly available, so it can handle requests originating from thousands of AuthZ Agents.

Policy Distributor

3. AuthZ Agent: An interface between OPA and the service, responsible for encapsulating the OPA layer and sending the authorization response back to the service.

AuthZ agent workflow

Key features:

  • Runs in the same environment as the service.
  • Fetches the policy template, scoped to the service running on the same node, from the Policy Distributor.
  • Reloads the policy template in OPA whenever a new policy is added for the service.
  • Exposes a REST endpoint that the service calls to get the authorization decision, as sketched below.
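
To make this integration concrete, here is a hedged sketch of what such a call could look like from the service's side. The agent's port, path, and request/response fields below are hypothetical, invented purely for illustration; they are not the actual AuthZ Agent contract:

    package main

    import (
        "bytes"
        "encoding/json"
        "fmt"
        "log"
        "net/http"
        "time"
    )

    // authzRequest and authzResponse are hypothetical shapes for the local
    // AuthZ Agent's decision endpoint; the real contract may differ.
    type authzRequest struct {
        CallerID string `json:"caller_id"` // verified SPIFFE ID of the caller
        Method   string `json:"method"`    // request method, e.g. GET
        Path     string `json:"path"`      // resource path being accessed
    }

    type authzResponse struct {
        Allow bool `json:"allow"`
    }

    func isAllowed(callerID, method, path string) (bool, error) {
        body, err := json.Marshal(authzRequest{CallerID: callerID, Method: method, Path: path})
        if err != nil {
            return false, err
        }

        httpClient := &http.Client{Timeout: 2 * time.Second}
        // The AuthZ Agent runs on the same node, so a localhost endpoint is
        // assumed here; the port and path are invented for this sketch.
        resp, err := httpClient.Post("http://127.0.0.1:9191/v1/authorize",
            "application/json", bytes.NewReader(body))
        if err != nil {
            return false, err
        }
        defer resp.Body.Close()

        var decision authzResponse
        if err := json.NewDecoder(resp.Body).Decode(&decision); err != nil {
            return false, err
        }
        return decision.Allow, nil
    }

    func main() {
        ok, err := isAllowed("spiffe://trustdomain.com/serviceB", "GET", "/v1/orders")
        if err != nil {
            log.Fatalf("authorization check failed: %v", err)
        }
        fmt.Println("allowed:", ok)
    }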

4. Open Policy Agent (OPA): An open-source, general-purpose policy engine that unifies policy enforcement across any tech stack. It acts as a host-local cache for policy decisions, holding policy templates and data in memory. These policies are used to make authorization decisions.

OPA provides a high-level declarative language (Rego) that lets you specify policy as code, and simple APIs to offload policy decision-making from your software, e.g. deciding whether client C can access resource R of service S. In this example, OPA solves policy decision-making for every (C, R, S) tuple, decoupling decisions from enforcement.
It receives a query from the AuthZ Agent and generates a decision by evaluating the query against the policies and data it holds.

Rego policy template example for OPA
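
The original figure is not reproduced here, so below is a purely illustrative sketch of what a template of this shape might look like, together with how an AuthZ-Agent-style component could load it into a host-local OPA and query it over OPA's documented REST API (PUT /v1/policies and POST /v1/data). The Rego package name, rule logic, and endpoint values are assumptions, not Myntra's actual template:

    package main

    import (
        "bytes"
        "encoding/json"
        "fmt"
        "log"
        "net/http"
        "strings"
    )

    // An illustrative Rego template (classic, pre-1.0 Rego syntax) in the
    // spirit of the policies described above: allow a request when the
    // caller, method, and path match one of the service's policy entries.
    // In production the Policy Distributor would also push the policy
    // entries as OPA data; here the data document is left empty, so the
    // query simply falls through to the default.
    const regoTemplate = `
    package authz.servicea

    default allow = false

    allow {
        some i
        policy := data.policies[i]
        policy.client == input.caller_id
        policy.method == input.method
        policy.path == input.path
    }
    `

    func main() {
        opa := "http://127.0.0.1:8181" // host-local OPA; address assumed

        // Load (or reload) the policy module via OPA's Policy API.
        req, err := http.NewRequest(http.MethodPut, opa+"/v1/policies/servicea",
            strings.NewReader(regoTemplate))
        if err != nil {
            log.Fatal(err)
        }
        req.Header.Set("Content-Type", "text/plain")
        putResp, err := http.DefaultClient.Do(req)
        if err != nil {
            log.Fatalf("failed to load policy: %v", err)
        }
        putResp.Body.Close()

        // Ask for a decision via OPA's Data API.
        input := map[string]interface{}{
            "input": map[string]string{
                "caller_id": "spiffe://trustdomain.com/serviceB",
                "method":    "GET",
                "path":      "/v1/orders",
            },
        }
        body, _ := json.Marshal(input)
        resp, err := http.Post(opa+"/v1/data/authz/servicea/allow",
            "application/json", bytes.NewReader(body))
        if err != nil {
            log.Fatalf("failed to query OPA: %v", err)
        }
        defer resp.Body.Close()

        var out struct {
            Result bool `json:"result"`
        }
        if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
            log.Fatal(err)
        }
        fmt.Println("allow:", out.Result)
    }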

Infrastructure key points for AuthNZ

  • Scale: Solving AuthNZ at the scale of two independent data centers, each with more than a thousand hosts, required building a decentralized system. SVIDs are signed at a rate of tens of thousands of signings per day per data center. Each host has its own dedicated agents for AuthNZ.
  • Complexity: As infrastructure grows heterogeneously, organizations tend to adopt a variety of new technologies and run services over multiple platforms. We support a heterogeneous environment of VMs as well as Kubernetes to run our production services. The solution for this heterogeneous environment consists of a single SPIRE server cluster and a Policy Distributor cluster catering to both platforms, with the agents running as sidecar containers on the Kubernetes clusters.
  • Availability: All AuthNZ components run in HA clusters to eliminate single points of failure.
  • Disaster Recovery: Authentication of services and nodes and authorization of resources are needed in our DR region as well, so the AuthNZ platform runs isolated clusters that serve the DR region independently, ensuring availability at all times.
  • Centralized logging, monitoring, and backup: The platform supports an effective centralized logging mechanism using Scribe and EFK, where real-time logs can be viewed for any agent. Various monitoring plugins (for example, Prometheus and StatsD) can be used to monitor the agents as well as the central servers. The metrics are visualized in Grafana to identify anomalies efficiently.
  • Improving latency that might arise due to the addition of AuthNZ: Although the SPIRE agent caches SVID tokens in memory, clients can additionally cache SVIDs locally as part of their own process for faster access (a small sketch follows this list). A similar mechanism is used to cache authorization policy decisions and save the overhead of repeated AuthZ calls.
  • Rollout Propagation: The platform has been rolled out in a planned, backward-compatible manner and is currently integrated with 100+ services.
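
As referenced in the latency point above, a client-side SVID cache can be as simple as an in-process map keyed by audience that refreshes a token shortly before it expires. The sketch below assumes go-spiffe v2; the cache structure and refresh margin are illustrative only:

    package svidcache

    import (
        "context"
        "sync"
        "time"

        "github.com/spiffe/go-spiffe/v2/svid/jwtsvid"
        "github.com/spiffe/go-spiffe/v2/workloadapi"
    )

    // Cache is an illustrative in-process cache of JWT-SVIDs keyed by audience.
    // It asks the local agent for a new token only when the cached one is
    // missing or close to expiry.
    type Cache struct {
        mu     sync.Mutex
        client *workloadapi.Client
        svids  map[string]*jwtsvid.SVID
    }

    func New(client *workloadapi.Client) *Cache {
        return &Cache{client: client, svids: make(map[string]*jwtsvid.SVID)}
    }

    // Token returns a cached JWT for the given audience, refreshing it when it
    // expires within the next minute. The one-minute margin is an assumption.
    func (c *Cache) Token(ctx context.Context, audience string) (string, error) {
        c.mu.Lock()
        defer c.mu.Unlock()

        if svid, ok := c.svids[audience]; ok && time.Until(svid.Expiry) > time.Minute {
            return svid.Marshal(), nil
        }

        svid, err := c.client.FetchJWTSVID(ctx, jwtsvid.Params{Audience: audience})
        if err != nil {
            return "", err
        }
        c.svids[audience] = svid
        return svid.Marshal(), nil
    }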

What’s next?

Now that we have seen the platform architecture and understood it in detail, here are a few things planned for the future:

  • Enabling AuthNZ for messaging systems like Kafka.
  • Extending AuthNZ to datastores and Azure blobs.
  • Securing SSH connections.
  • Enabling mTLS for service-to-service communication to secure SVID transfer.

Thank you for reading. We would love to hear your thoughts; please leave your feedback in the comments.

I'm also excited to hear from the rest of the community about the challenges they've faced and how they've overcome them; please share your experiences in the comments section below. Stay tuned for further updates.

Co-author: Mohammad Basit
Credits: Thanks to Vivekanand and Ashutosh Sharma for their review and support.
