Meet Ottr: A Serverless Public Key Infrastructure Framework
Ottr is a serverless Public Key Infrastructure framework that handles end-to-end certificate rotations without the use of an agent. This post provides an overview of Ottr, including a sample reference architecture and its logical and network flows, and highlights the benefits of the solution. For installation instructions, skip to the Open Source section of the article.
Introduction
Managing certificates for Public Key Infrastructure (PKI) is a difficult problem to solve at scale for any organization. While there are a number of agent-based solutions to automate certificate rotations for Linux and Windows distributions, the process to broker certificates for network infrastructure commonly involves either manual intervention from engineering teams or use of enrollment protocols such as Certificate Management Protocol (CMP), Simple Certificate Enrollment Protocol (SCEP), or Enrollment over Secure Transport (EST), which all have their security issues.
We built Ottr at Airbnb to be a scalable and configurable serverless framework on AWS with little operational overhead or reliance on enrollment protocols. Ottr can be extended to handle end-to-end certificate rotations for any hosts (e.g., network infrastructure, Linux, Windows) capable of managing their own X.509 certificates from a remote session (e.g., API, SSH, SSM Agent).
Background
PKI governs the issuance of digital certificates to protect sensitive data, provide unique digital identities, and ensure secure end-to-end communication. Certificate Authorities (CA) are responsible for brokering these X.509 certificates and own the policies, practices, and procedures for vetting recipients and the issuing process. The CA used to generate the digital certificate can be from a Private CA, which your organization manages, or a Public CA, such as Let’s Encrypt, which is managed by the Internet Security Research Group (ISRG).
At Airbnb, engineers are responsible for ensuring that end-to-end encryption is in place for compute nodes as well as for firewalls, load balancers, and other network devices. The diagram below illustrates the typical process for certificate reissue.
As you can see, this is a heavily manual process requiring an approval step, which creates operational overhead for multiple teams. The details are broken down below:
- Generate Private Key and Certificate Signing Request (CSR): A CSR is a cryptographically signed request that contains information around organization details as well as the Common Name (CN) and Subject Alternative Names (SANs) for which the certificate will be valid. A CSR is typically generated using OpenSSL, where a Private Key is created on the target device (and never leaves the host) and the associated Public Key is embedded within the CSR.
- Send CSR to Certificate Authority (CA): In order for a CSR to be signed by the CA, the domain must be validated. This can be done through a number of different ways (e.g., HTTP-01, DNS-01 Challenges). The CA can either be a Private CA for which your organization controls the trust chain, or a Public CA such as Let’s Encrypt whose chain of trust is outside of your control.
- Approve CSR: Due to the sensitive nature of the certificate request process, an approval will typically be required from a security team to allow the CA to generate the certificate for the CSR that was submitted.
- Download Certificate: After approval, the certificate, intermediate certificate, root certificate, or full chain will be available from your CA and can be downloaded in a base64 format (e.g., .pem, .cer, .p7b).
- Upload Certificate: Depending on the platform, the full chain certificate is then uploaded to the target device in a supported format (e.g., .pem), and the device or service is restarted if applicable.
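As a rough illustration of the first step above, the sketch below assembles an OpenSSL command that creates a private key and a CSR in a single invocation. The function name, file paths, and the RSA 2048 key size are assumptions made for this example; the `-addext` flag requires OpenSSL 1.1.1 or newer.

```python
from typing import List


def build_csr_command(common_name: str, sans: List[str],
                      key_path: str = "device.key",
                      csr_path: str = "device.csr") -> List[str]:
    """Return an openssl argv that creates a new key pair and a CSR."""
    # Repeat the CN as the first SAN, since modern clients validate against SANs.
    names = [common_name] + [s for s in sans if s != common_name]
    san_ext = "subjectAltName=" + ",".join(f"DNS:{name}" for name in names)
    return [
        "openssl", "req", "-new",
        "-newkey", "rsa:2048",   # assumption: RSA 2048; pick what your CA policy requires
        "-nodes",                # key is left unencrypted so the device can load it unattended
        "-keyout", key_path,     # the private key stays on the host
        "-out", csr_path,        # only the CSR is sent to the CA
        "-subj", f"/CN={common_name}",
        "-addext", san_ext,
    ]
```

On a host with OpenSSL installed, the returned list can be passed to `subprocess.run(..., check=True)`; only the resulting CSR file needs to leave the device.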
Why Ottr?
When we were first designing Ottr, Airbnb needed a framework to manage X.509 certificates for hosts that could not run agents to do so themselves; we needed a solution that would be customizable and scalable, while still emphasizing security. The diagram below illustrates how certificate reissue works with Ottr.
There are many advantages of this new framework:
- Serverless: No underlying infrastructure to manage, which means we do not have to patch or harden new servers.
- Limited Dependencies: The only major dependency is the ACME Client (acme.sh), which is well-maintained.
- Customizable: Ottr is modular in design, meaning it provides developers the ability to build custom integrations when additional platforms are introduced to the infrastructure. Developers can use Certificate Authorities outside of Let’s Encrypt, so long as they support the ACME protocol.
- Scalable: Ability to perform thousands of certificate rotations per day (subject to the rate limit the CA sets).
- Security: Infrastructure security is a default; the Terraform modules that build Ottr have hardened configurations and follow the principle of least privilege.
- Automated: Ottr handles the end-to-end certificate rotation lifecycle without any manual intervention.
- Portability: Ottr builds 100+ resources through Terraform that are easily configurable through modules and deployable across any AWS environment.
- Cost: Ottr can be used with a Private or Public CA (e.g., Let’s Encrypt) running ACME at no additional cost.
- Error Handling: Provides instantaneous feedback through Slack on any potential errors during runtime.
- Open Source: Anyone can contribute, and new platform support can be introduced as the framework matures.
Getting Under the Hood
In this section, we’ll dive into the different components that comprise Ottr and explain how they connect together to abstract the complexities of PKI from the end user.
High Level Diagram
Ottr Architecture
Let’s take a look at the architecture of Ottr and how each component works in relation to the overall flow:
- CloudWatch Event: Automated entrypoint that triggers the Lambda Router at a configurable interval (e.g., once per day).
- Ottr API: Alternative entrypoint that can be used to execute one-off certificate rotations.
- Lambda Synchronizer: Aggregates host metadata from datacenters and/or AWS used to update the DynamoDB database via the Ottr API.
- Lambda Router: Scans the DynamoDB database and determines which hosts are eligible for certificate rotations and forwards data to Step Function.
- Step Function: Processes a batch of device data in parallel from the Lambda Router or API and executes an ECS Container for each host that is targeted for a certificate rotation.
- ECS Container: Pulls down a platform-specific image from Elastic Container Registry (ECR) based on the ECS Task Definition metadata element that is retrieved from the Step Function.
- Lambda Handler: In cases where a container runtime error occurs, there is an external integration with Slack that will provide device details and a link to the entry within CloudWatch Logs.
Container Runtime:
- Establish connection to device to generate a Public/Private Key Pair and CSR on the device; pull the CSR onto the container filesystem.
- ACME Client binds the organization’s ACME credentials on the container and sends the CSR to the CA (e.g., Let’s Encrypt) to begin the certificate signing flow.
- ACME Client writes DNS TXT Record(s) to the DNS Subdelegate Zone in Route53 for each Common Name (CN) and Subject Alternative Name (SAN) from the CSR.
- CA validates domain ownership through a DNS-01 Challenge; when validated, a certificate is generated and the ACME Client writes the fullchain certificate to the container filesystem.
- Depending on platform logic, the certificate is applied to the device and a number of validation checks are performed.
- Upon success, the new certificate expiration date is updated for the device in the DynamoDB database.
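The container runtime steps above can be read as a small pipeline. The sketch below is not Ottr's actual code: the `Platform` interface, the `rotate` function, and the two injected callables are hypothetical names used only to show the ordering of the steps and where platform-specific logic plugs in.

```python
from datetime import datetime
from typing import Callable, Protocol


class Platform(Protocol):
    """Hypothetical per-platform hooks (e.g., one implementation per device type)."""

    def generate_csr(self, hostname: str) -> str: ...
    def install_certificate(self, hostname: str, fullchain_pem: str) -> None: ...
    def validate(self, hostname: str) -> datetime: ...


def rotate(hostname: str,
           platform: Platform,
           sign_csr: Callable[[str], str],
           record_expiration: Callable[[str, datetime], None]) -> datetime:
    """Run one end-to-end rotation and return the new expiration date."""
    # 1. The key pair and CSR are created on the device; only the CSR is pulled back.
    csr_pem = platform.generate_csr(hostname)
    # 2-4. The ACME client submits the CSR, completes DNS-01, and returns the full chain.
    fullchain_pem = sign_csr(csr_pem)
    # 5. Platform logic applies the certificate and runs its validation checks.
    platform.install_certificate(hostname, fullchain_pem)
    not_after = platform.validate(hostname)
    # 6. Only after success is the new expiration written to the database.
    record_expiration(hostname, not_after)
    return not_after
```

Because any exception raised by a step stops the pipeline before `record_expiration` runs, a failed rotation leaves the stored expiration date untouched in this sketch, so the host stays eligible for the next run.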
Database
The API is not only an alternative entrypoint for Ottr, but it is also the preferred endpoint for managing assets within the DynamoDB database. The elements within the database provide device details that are used both to determine when a certificate is expiring as well as the metadata used to map a host to a platform specific ECS Task Definition for runtime logic.
Database Asset Output Example
Task Routing
During the routing process, the database is first scanned to build a list of devices that have a certificate expiration within 30 days. That list is further narrowed down based on whether each host has valid Route53 Records within the DNS Subdelegate Zone. If both conditions are true, the logic moves on to map each host to a corresponding ECS Task Definition based on the Routing Configuration that is set.
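A minimal sketch of the two eligibility checks described above, assuming a simplified `Device` record and a precomputed set of hostnames that have valid records in the subdelegate zone. Both are hypothetical stand-ins for the DynamoDB scan and the Route53 lookup.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Set


@dataclass
class Device:
    hostname: str                      # hypothetical field names
    certificate_expiration: datetime


def eligible_for_rotation(devices: Iterable[Device],
                          delegated_hosts: Set[str],
                          now: datetime,
                          window_days: int = 30) -> List[Device]:
    """Keep devices expiring within the window that also have subdelegate records."""
    cutoff = now + timedelta(days=window_days)
    return [
        d for d in devices
        # Already-expired certificates also fall before the cutoff and are included.
        if d.certificate_expiration <= cutoff and d.hostname in delegated_hosts
    ]
```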
Following the routing configuration example below, if a PAN-OS device is running 9.x.x, has a model of PA-XXXX, and has its Certificate Authority set to Let’s Encrypt, an ECS Task of otter-panos-9x-lets-encrypt will be returned. This routing logic enables end users to perform different types of device rotation logic all under one platform.
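To make the mapping concrete, here is a hedged sketch of how a routing lookup of this kind could work. The rule schema, field names, and wildcard matching are assumptions for illustration and may differ from Ottr's real configuration format; only the two task definition names come from this article.

```python
from fnmatch import fnmatchcase
from typing import Dict, List, Optional

# Hypothetical routing rules: each field is a shell-style pattern.
ROUTING_RULES: List[Dict[str, str]] = [
    {"platform": "panos", "os_version": "9.*", "model": "PA-*",
     "certificate_authority": "lets_encrypt",
     "task_definition": "otter-panos-9x-lets-encrypt"},
    {"platform": "linux", "os_version": "*", "model": "*",
     "certificate_authority": "lets_encrypt",
     "task_definition": "otter-linux-aws-ssm-lets-encrypt"},
]

MATCH_FIELDS = ("platform", "os_version", "model", "certificate_authority")


def resolve_task_definition(device: Dict[str, str],
                            rules: List[Dict[str, str]] = ROUTING_RULES) -> Optional[str]:
    """Return the task definition of the first rule whose patterns all match."""
    for rule in rules:
        if all(fnmatchcase(device.get(f, ""), rule[f]) for f in MATCH_FIELDS):
            return rule["task_definition"]
    return None  # no route: the host is skipped rather than guessed at
```

First-match-wins ordering keeps the behavior predictable when a broad rule and a narrow rule both match; more specific rules simply go earlier in the list.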
Routing Configuration Example
After the router builds the mappings between hosts and task definitions from the routing configuration, the payload is sent into the Step Function and gets processed as a Map, which is used to run a set of steps for each element. The Step Function launches an ECS Fargate Container that pulls an ECR Image defined within the ECS Task Definition. The process is performed in parallel with a max concurrency of the number of Elastic Network Interfaces (ENI) available within the subnet. If there are more containers that are required to run than the available number of ENIs, the Step Function will queue jobs until the previous executions finish and network interfaces become available.
Looking at the Step Function input below, it shows that test.example.com has an ECS Task Definition of otter-panos-9x-lets-encrypt while test.airbnb.com has an ECS Task Definition of otter-linux-aws-ssm-lets-encrypt. Although both the platforms and the domains are different, Ottr can execute both these rotations in parallel independently of each other because a dedicated container is spun up for each host.
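The Map-based fan-out can be sketched as an Amazon States Language definition built in Python. This is an illustrative approximation rather than Ottr's deployed state machine: the state names, the `assets` input key, the `hostname` and `task_definition` fields, the container name `ottr`, and the `HOSTNAME` environment variable are all assumptions.

```python
import json
from typing import Any, Dict, List


def build_state_machine(cluster_arn: str, subnets: List[str],
                        max_concurrency: int) -> Dict[str, Any]:
    """Build a Map state that runs one Fargate task per host in the input."""
    run_task = {
        "Type": "Task",
        # .sync makes the state wait until the ECS task finishes.
        "Resource": "arn:aws:states:::ecs:runTask.sync",
        "Parameters": {
            "LaunchType": "FARGATE",
            "Cluster": cluster_arn,
            "TaskDefinition.$": "$.task_definition",  # chosen per host by the router
            "NetworkConfiguration": {"AwsvpcConfiguration": {"Subnets": subnets}},
            "Overrides": {"ContainerOverrides": [{
                "Name": "ottr",  # hypothetical container name
                "Environment": [{"Name": "HOSTNAME", "Value.$": "$.hostname"}],
            }]},
        },
        "End": True,
    }
    return {
        "StartAt": "RotateEachHost",
        "States": {"RotateEachHost": {
            "Type": "Map",
            "ItemsPath": "$.assets",
            "MaxConcurrency": max_concurrency,  # e.g., ENIs available in the subnet
            "Iterator": {"StartAt": "RunRotationTask",
                         "States": {"RunRotationTask": run_task}},
            "End": True,
        }},
    }


# Hypothetical input carrying the two hosts discussed above.
SAMPLE_INPUT = {"assets": [
    {"hostname": "test.example.com", "task_definition": "otter-panos-9x-lets-encrypt"},
    {"hostname": "test.airbnb.com", "task_definition": "otter-linux-aws-ssm-lets-encrypt"},
]}
```

Serializing the result with `json.dumps` yields a definition string of the shape Step Functions expects; each element of `assets` becomes one independent iteration, which is why the two hosts can rotate in parallel.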
One of the core security design decisions of the service was to limit access to Route53 (DNS). Ottr performs domain validation using a DNS-01 challenge, which means that the ACME Client needs to write a DNS TXT Record to _acme-challenge.[FQDN] in order for the Certificate Authority to validate ownership of the domain. The security concern is we do not want to provide access that allows write permissions across the organization’s primary hosted zone. While Ottr may only require the ability to write a TXT record to _acme-challenge.test.example.com, as of the time of this writing, AWS does not provide the granularity necessary to specify write access to TXT Records only, which would mean Ottr would be granted access to write any record types including A, CNAME, PTR, and MX Records to your organization’s domain(s).
To limit access, we introduced DNS subdelegation to the ACME Client. When the infrastructure for Ottr is built, a new Route53 Hosted Zone, such as example-acme.com, is created based on the configuration. When the ACME Client sends the Certificate Signing Request (CSR) to the Certificate Authority (CA), the CA will subsequently look for the TXT record within the zone set in the challenge-alias field, which will be example-acme.com.
What this means is that before the domain validation process occurs, DNS needs to be configured with a CNAME mapping from your host test.example.com to example-acme.com, so that challenge records are forwarded to the subdelegate zone.
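As a sketch of how the alias could be wired up, the helpers below follow acme.sh's documented DNS alias mode, in which the CNAME is placed on the `_acme-challenge` label of the host and points at the same label in the alias zone. The function names and the CSR path are hypothetical; `--signcsr`, `--csr`, `--dns dns_aws`, and `--challenge-alias` are existing acme.sh options, but whether Ottr invokes them exactly this way is an assumption.

```python
from typing import List, Tuple


def challenge_cname(fqdn: str, alias_zone: str) -> Tuple[str, str]:
    """Return (record name, CNAME target) needed in the primary zone."""
    return (f"_acme-challenge.{fqdn}", f"_acme-challenge.{alias_zone}")


def build_signcsr_command(csr_path: str, alias_zone: str) -> List[str]:
    """Return an acme.sh argv that signs an existing CSR via DNS-01 alias mode."""
    return [
        "acme.sh", "--signcsr",
        "--csr", csr_path,                 # CSR generated on the device
        "--dns", "dns_aws",                # Route53 DNS API hook shipped with acme.sh
        "--challenge-alias", alias_zone,   # TXT records are written here instead
    ]
```

For the host used in this article, `challenge_cname("test.example.com", "example-acme.com")` yields the pair `_acme-challenge.test.example.com` and `_acme-challenge.example-acme.com`.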
Terraform DNS Module
DNS CNAME Record Mapping
By adding this mapping, all TXT records are written and read within example-acme.com. As a result the permissions within Ottr can be limited to read only for domains such as example.com and write permissions would be granted to the subdelegate zone example-acme.com.
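The permission split described above could be expressed as an IAM policy document along these lines. This is a simplified sketch, not the policy from Ottr's Terraform modules: the statement IDs and the zone ID parameters are placeholders, and a real deployment would likely need additional read actions.

```python
from typing import Any, Dict


def build_route53_policy(primary_zone_id: str,
                         subdelegate_zone_id: str) -> Dict[str, Any]:
    """Read-only on the primary zone, record writes only on the subdelegate zone."""
    def zone_arn(zone_id: str) -> str:
        return f"arn:aws:route53:::hostedzone/{zone_id}"

    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadPrimaryZone",
                "Effect": "Allow",
                "Action": ["route53:ListResourceRecordSets"],
                "Resource": zone_arn(primary_zone_id),
            },
            {
                "Sid": "WriteSubdelegateZone",
                "Effect": "Allow",
                # Write access exists only where the challenge TXT records live.
                "Action": ["route53:ChangeResourceRecordSets",
                           "route53:ListResourceRecordSets"],
                "Resource": zone_arn(subdelegate_zone_id),
            },
        ],
    }
```

With this shape, a compromised container could at worst alter records in example-acme.com, which holds nothing but validation TXT records.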
External Integrations
By default, Ottr includes an external integration with Slack for error handling. After each ECS Task is completed, a new certificate expiration is added to the database pending a successful run. If an error occurs in the container runtime, it results in a notification being generated in Slack to provide operational teams instantaneous feedback and a link that directly points you to the CloudWatch Logs of the failed task. While Slack is the default integration, custom logic can be written if your organization prefers to use other platforms or triaging methods for error handling.
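A custom error handler might look something like the sketch below, which formats a failure message and posts it to a Slack incoming webhook. The function names and message layout are assumptions; the webhook URL and the CloudWatch Logs link are supplied by the caller rather than derived here.

```python
import json
import urllib.request
from typing import Dict


def build_slack_message(hostname: str, task_definition: str,
                        error: str, log_url: str) -> Dict[str, str]:
    """Format a failure notification with a direct link to the task's logs."""
    return {"text": (
        f":warning: Certificate rotation failed for {hostname}\n"
        f"Task definition: {task_definition}\n"
        f"Error: {error}\n"
        f"Logs: <{log_url}|CloudWatch Logs>"   # Slack link syntax: <url|label>
    )}


def post_to_slack(webhook_url: str, message: Dict[str, str],
                  timeout: float = 10.0) -> int:
    """POST the message to a Slack incoming webhook and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Swapping `post_to_slack` for a call to another paging or ticketing system is the kind of custom logic the paragraph above refers to; the message builder can stay the same.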
Network Architecture
Ottr Network Data Flow
Let’s take a look at network connectivity for routes outside of the AWS infrastructure. After the Step Function determines valid hosts to perform certificate rotations on, it will execute a container for each device in parallel. These containers are launched, depending on ENI availability, in one of two subnets that are predefined when building the infrastructure with Terraform. After connecting to the device and retrieving the CSR, the ACME Client sends the CSR to the Certificate Authority. By default this is Let’s Encrypt, but Ottr has the capability of integrating with any CA that supports the ACME Protocol.
During the signing process, the ACME Client requires routes to both the Certificate Authority and Cloudflare DNS endpoints. Since Let’s Encrypt is the default CA, access to the production endpoint acme-v02.api.letsencrypt.org and the staging endpoint acme-staging-v02.api.letsencrypt.org is required over port 443 (HTTPS). Cloudflare DNS (cloudflare-dns.com) is also used to poll DNS status using DNS over HTTPS (DoH) to determine when the DNS TXT record used for domain validation has been posted by the ACME Client. With DoH, DNS queries run over TLS, which improves security because the queries are encrypted rather than sent in plaintext over UDP. Performance also improves because the ACME Client polls DNS rather than sleeping for a fixed time before querying DNS to validate the domain.
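The polling behavior can be approximated with Cloudflare's JSON DoH API, as in the sketch below. This illustrates the idea and is not the ACME client's own implementation: the function names, retry count, and interval are assumptions, and the fetcher is injectable so the loop can be exercised without network access.

```python
import json
import time
import urllib.parse
import urllib.request
from typing import Any, Callable, Dict, List

DOH_ENDPOINT = "https://cloudflare-dns.com/dns-query"
TXT_RECORD_TYPE = 16  # numeric DNS type for TXT


def fetch_doh_json(name: str, record_type: str = "TXT") -> Dict[str, Any]:
    """Query Cloudflare DoH over HTTPS and return the decoded JSON answer."""
    query = urllib.parse.urlencode({"name": name, "type": record_type})
    request = urllib.request.Request(
        f"{DOH_ENDPOINT}?{query}", headers={"Accept": "application/dns-json"})
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def txt_values(response: Dict[str, Any]) -> List[str]:
    """Extract TXT strings, dropping the surrounding quotes DoH returns."""
    return [answer["data"].strip('"')
            for answer in response.get("Answer", [])
            if answer.get("type") == TXT_RECORD_TYPE]


def wait_for_txt(name: str, expected: str,
                 fetch: Callable[[str], Dict[str, Any]] = fetch_doh_json,
                 attempts: int = 30, interval: float = 10.0,
                 sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until the expected TXT value is visible or attempts run out."""
    for attempt in range(attempts):
        if expected in txt_values(fetch(name)):
            return True          # proceed immediately, no fixed wait
        if attempt < attempts - 1:
            sleep(interval)
    return False
```

Compared with a single fixed sleep, the loop returns as soon as the record is visible, which is the performance benefit described above.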
Results and ROI
From deploying Ottr within Airbnb, our organization has realized several benefits. We’ve seen returns on investment due both to time saved and to the reduced operational overhead for engineering teams. Since the introduction of Ottr at the beginning of the year, thousands of certificate rotations have been performed without any human intervention. This has alleviated a pain point for multiple teams, including Operations, which was responsible for monitoring and triaging tickets for expired certificates; Engineering, which was responsible for the manual certificate rotation process; and Security, which was involved in request approvals.
Another important return was related to improvements in security. By having Ottr act as a broker between the CA and host, engineers would no longer need to make changes to DNS records to validate domain ownership. This resulted in the reduction of AWS IAM permissions across a number of teams, improving least privilege. In addition, Ottr provides a repeatable framework in which the private key never leaves the host during its lifespan, rather than having engineers generate a private key locally and upload it to the host.
Most importantly, Ottr performs full certificate rotations at more frequent intervals rather than certificate renewals, which means the private key is replaced on each execution. This results in shorter certificate lifespans, which, in the case of a private key compromise, shortens the window during which data can be decrypted.
Conclusion
Although Public Key Infrastructure can be a complex problem to solve at scale, Ottr was built to abstract a number of challenges associated with certificate provisioning while also providing additional benefits around operations and security.
By open sourcing Ottr, we hope to create a community to share, collaborate, and expand the framework to help fit the needs of other organizations. If you’re interested in helping protect people and data, Airbnb Security is hiring. Check out our open positions and apply today!
Open Source
Setup
Ottr is now open sourced on GitHub. You can begin building the infrastructure by going to the Setup resource page and learn more about our current implementations through the Supported Platforms link.
Contributing
Please feel free to reach out or submit pull requests with any suggestions. If you are currently leveraging Ottr and running rotations against platforms that currently aren’t supported, please view the contributions page and consider helping add your implementations to the platform!
Credits and Contributions:
Ben Paradis (Staff Security Engineer, Airbnb)
Aaron von Hungen (Senior Security Program Manager)
John Borromeo (Senior Network Engineer, Airbnb)
Ryan Diers (Security Engineer, Airbnb)
Sean Corcran (Senior Systems Engineer, Airbnb)
Jeff Nanney (Staff Network Architect, Airbnb)
Mark Vlcek (Security Engineer, Airbnb)
Zeeshan Khadim (Former Manager, Airbnb)
Tina Nguyen (Senior Project Manager, Airbnb)
Development Community Supporting acme.sh
All product names, logos, and brands are property of their respective owners. All company, product and service names used in this website are for identification purposes only. Use of these names, logos, and brands does not imply endorsement.
Appendix
All trademarks are the property of their registered owners; Airbnb claims no responsibility for nor proprietary interest in them.