Sisyphus and the CVE Feed: Vulnerability Management at Scale

Keziah Plattner
The Airbnb Tech Blog
13 min read · Aug 10, 2022


Authors
Keziah Perez Sonder Plattner, Senior Software Engineer
Kadia Mashal, Engineering Manager

Introduction

Every engineer knows that security is a never-ending problem. Until we delete all our code and move into a cottage in the woods, we have to accept that there is no such thing as 100% secure software. You could be doing everything perfectly, and a publicly known vulnerability (CVE) could emerge for the most up-to-date version of a third-party library in your infrastructure. Things are secure until they are not. As with Sisyphus, the boulder will never reach the top of the hill.

Rather than eliminating vulnerabilities, the goal of a vulnerability management program should be to quickly and effectively detect and respond to the barrage of threats that surface every day. There are many scanners and vendor tools that purport to solve the problem. But with scanners comes a never-ending flood of CVE reports, which slows down our ability to remediate in a timely manner.

Vulnerability Management Lifecycle

If you are new to vulnerability management, here are the basics of the lifecycle.

Fig. 1: The Vulnerability Management Lifecycle

Detection

Find potential vulnerabilities in our infrastructure, anywhere from CVEs to insecure configurations.

Risk Assessment

Apply a risk framework to the findings to identify true positives and weed out non-applicable vulnerabilities.

Reporting

Find the team and/or person best suited to address each finding and track progress in a methodical way. In addition, centrally track all vulnerabilities in order to have a full view of our attack surface.

Remediation and Prevention

Promptly remediate the vulnerability and invest in work to prevent the vulnerability from being introduced in the first place.

Objectives

In building a vulnerability management program, we want to focus on the following:

  • Visualize Known Attack Surface: We can’t properly assess risk if we don’t know our vulnerability status in the first place.
  • Speed: Detection, reporting, and remediation should be completed in a timely manner.
  • Prioritization: Focus on the highest-priority vulnerabilities before tackling less important or harder-to-exploit ones.
  • Scaling: Support a constantly evolving and expanding cloud infrastructure.

Gaps and Challenges with Standard Industry Solutions

Standard industry advice leans on out-of-the-box vendor deployments, manual risk assessment, and operationally-heavy reporting processes. Automation, if it exists, relies on limited vendor functionality without the flexibility to adjust to unique attributes of our environment. This led to major challenges in accomplishing our objectives.

Lack of Vendor Agnostic Solutions

As our infrastructure expands, the number of vulnerability types we want to track grows along with it. A variety of scanning solutions are needed to cover our bases, but they come with different setups and reporting processes. And there may be future scanning solutions that will work better for us, so we don’t want to lock ourselves into a single vendor or solution.

Noisy and Inaccurate Severity Ratings

In our experience, the majority of scanners provide inaccurate risk scoring. Vulnerability bulletins and assessments like the basic Common Vulnerability Scoring System (CVSS) may describe a worst-case scenario that is difficult to exploit or sensationalize an issue, leading to inflated severity. And when a major zero-day vulnerability is found, it’s more likely to be identified by an anonymous Twitter user than a scanner.

Additionally, internal mitigations can lower the impact of a vulnerability, and generic scanners rarely have ways to add internal context to customize risk ratings. Asset information and the location of the vulnerability play a massive role in determining its severity.

Operational Work

Many vulnerability management solutions assume the need for human intervention in the process. However, humans make mistakes, and every manual step leads to slower remediation times. Spending time on onerous tasks like manually assessing risk severity or making tickets takes away from our time to focus on root-cause remediation.

Guiding Principles

Before developing our solution, we wanted to establish guiding principles. While there will always be exceptions, our goal is to keep these principles as our north star.

Limit the need for human intervention by reducing false positives.
It can be tempting to design a solution that catches close to 100% of true positives.

However, when maximizing true positives, it’s inevitable that the false positive rate will increase too. False positives create onerous manual work for both the security team and the owning engineering team. We don’t want to “cry wolf” on vulnerabilities that are not worth addressing. Doing so does not scale and breaks trust in the security org. Not to mention that relying on manual triage to verify alert severity slows down response time, leaving vulnerabilities open for longer.

In addition, the idea that manually checking vulnerabilities will result in higher accuracy doesn’t take into account human error and alert fatigue. No human or automated process will ever get 100% true positive accuracy, and that knowledge must be built into the solution instead of fighting a losing battle against the barrage of noisy alerts.

Pair detection with preventative measures.
Now that we have established that true positives will slip through the cracks, we need to address how to handle those cases. We pair our detection and reporting workflows with preventative solutions that address the root cause of the vulnerabilities. For example, if we make sure to have a regular patch cadence, then all vulnerabilities–including lower-priority ones–will be addressed within a reasonable timeframe. If we fix a flawed design, we will reduce the number of vulnerabilities in the first place.

Build relationships.

All the automation in the world won’t be useful if we antagonize other engineering teams, rather than empowering and incentivizing them to remediate problems. Developer productivity matters — we don’t want to create an onerous system that frustrates developers and makes security the enemy. We want to avoid blocking solutions unless the priority calls for it, and we want to focus engineering efforts on our highest priority gaps rather than spreading engineers thin across many low and medium level vulnerabilities. Of course, there are always exceptions here–sometimes a vulnerability is severe enough to require being paged in the middle of the night. But we want to be sparing with that approach.

It helps to get teams on board by pairing security fixes with other benefits, or working it into existing workflows so the process is essentially invisible to outside engineers.

We have tried top-down solutions in the past, but nothing has been as effective as treating engineering teams as our partners and seeking their input as stakeholders, making it a mutually beneficial process.

Maintain accountability.
Detecting vulnerabilities doesn’t help if there is no accountability to fix them. We want to make sure they are correctly attributed to business units so that leadership is held responsible for fixing vulnerabilities. Keeping business units accountable means that they will be incentivized to allocate resources towards the problem instead of keeping security issues on the backburner.

There should be a company-wide Service-Level Agreement (SLA) based on the severity of the vulnerability that is part of the success metrics for each org. If an SLA cannot be met, we offer automated SLA extension requests to allow teams to make adjustments. We avoid exceptions except when absolutely necessary and periodically review the reasoning.
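
To make this concrete, here is a minimal sketch of how a severity-based SLA could be expressed in code. The tiers and day counts below are purely illustrative, not our actual policy:

```python
from datetime import date, timedelta

# Illustrative severity-to-SLA mapping; real windows are set by company policy.
SLA_DAYS = {"critical": 7, "high": 30, "medium": 90, "low": 180}

def sla_due_date(severity: str, detected_on: date) -> date:
    """Compute the remediation deadline for a finding based on its severity."""
    return detected_on + timedelta(days=SLA_DAYS[severity])

print(sla_due_date("high", date(2022, 8, 10)))  # 2022-09-09
```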

Building an Automated, Vendor-Agnostic Vulnerability Management Pipeline

Given the challenges, the standard industry advice just didn’t work for our use case. So, we decided to create our own engineering solution.

Fig. 2: Automated Vulnerability Pipeline

Step 1: Aggregate and Process Vulnerabilities

First, we turned the barrage of alerts from our scanners into a standardized format, centralized in a single place, instead of having each scanning tool siloed from the others.

In order to cleanly track vulnerabilities throughout the pipeline, we developed a UUID-generating process for every vulnerability type we encounter. Each UUID maps to both the vulnerability and the asset it is found in. The exact key can change per vulnerability type–for example, for our third-party packages, we track vulnerabilities by asset + package name + package version. It is less important to individually track every CVE present in the asset, given that fixing a package version will address every related CVE. When we detect that an asset is no longer using a specific package or version, we can feel confident that the vulnerability is no longer present. Unifying multiple CVEs under one UUID also makes them easier to manage.

This also helps when scanners have some overlap. Instead of using the limited vendor features directly, we leverage their APIs to ingest the results so that we can process the alerts according to our needs, which allows us to deduplicate repetitive alerts and verify fixes. We can then combine results as needed or add additional information.
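
As a rough illustration of this keying scheme (not our exact implementation), the sketch below derives a deterministic UUID from asset + package name + package version and uses it to merge overlapping findings from multiple scanners. The namespace, schema, and merge rules are assumptions for the example:

```python
import uuid
from dataclasses import dataclass, field

# Hypothetical namespace so the same key always yields the same UUID (uuid5).
VULN_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "vuln-management.example")

@dataclass
class Finding:
    asset: str                 # service, host, or image identifier
    package: str               # third-party package name
    version: str               # vulnerable package version
    cve_ids: set = field(default_factory=set)
    severity: float = 0.0      # scanner-reported severity

def vuln_uuid(f: Finding) -> uuid.UUID:
    """Asset + package + version always maps to the same UUID."""
    return uuid.uuid5(VULN_NAMESPACE, f"{f.asset}:{f.package}:{f.version}")

def deduplicate(findings: list) -> dict:
    """Merge results from overlapping scanners into one record per UUID."""
    merged = {}
    for f in findings:
        uid = vuln_uuid(f)
        if uid in merged:
            merged[uid].cve_ids |= f.cve_ids                      # union of CVEs
            merged[uid].severity = max(merged[uid].severity, f.severity)
        else:
            merged[uid] = f
    return merged
```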

Step 2: Contextualize Risk

Our next step was to automate risk assessment by taking the default severity calculation provided by the scanner and integrating additional context.

All companies have different infrastructure setups and mitigation strategies. For example, vulnerabilities involving DDoS aren’t as impactful if the load balancers have existing mitigations. On the other hand, a vulnerability in an application that handles PII can be much more severe than initially assessed by a scanner.

To improve the accuracy of our risk assessment, we take into account the following:

  • Internal mitigations: Certain vulnerability types may not be relevant in our infrastructure.
  • Common Vulnerability Scoring System (CVSS) vector: The type of exploit is important. For example, if the attack requires local privileges and user interaction, the odds of exploitation are significantly lower, even if the impact is severe. Breaking down the CVSS base score by attack vector allowed us to gain a better understanding of the vulnerability risk.
  • Asset risk: Is the asset public facing? Or is it an internal service that handles low-priority metadata? Does it work with PII? How critical is it to production flows?
  • Multiple external scoring systems and metadata: Is there a difference between the National Vulnerability Database (NVD) rating, the Red Hat rating, and the vendor rating? Is this a package that hasn’t been updated in 3 years? Is this an old, unmaintained third-party package?

It’s critical that this step be automated as much as possible. We don’t want to have engineers reading through every CVE guide to determine how severe a vulnerability is.
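
As a rough sketch of what that automation could look like, the function below adjusts a scanner-provided base score using the CVSS attack vector and a few asset attributes. The weights and attribute names are illustrative assumptions, not our production algorithm:

```python
def contextualized_severity(base_score: float,
                            attack_vector: str,
                            asset: dict,
                            mitigated: bool = False) -> float:
    """Adjust a scanner-provided CVSS base score (0-10) with internal context.

    The weights below are placeholders; a real algorithm would be tuned
    against historical triage decisions.
    """
    score = base_score

    # An existing internal mitigation can sharply reduce effective risk.
    if mitigated:
        score *= 0.3

    # Local attacks requiring user interaction are harder to exploit at scale.
    if attack_vector in ("LOCAL", "PHYSICAL"):
        score *= 0.6

    # Asset context: public exposure and sensitive data raise the stakes.
    if asset.get("public_facing"):
        score *= 1.3
    if asset.get("handles_pii"):
        score *= 1.2

    return min(round(score, 1), 10.0)

# Example: a "critical" CVE on an internal, already-mitigated service.
print(contextualized_severity(9.8, "NETWORK", {"public_facing": False}, mitigated=True))  # 2.9
```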

Depending on a company’s situation, tracking asset risk context may require engineering resourcing from both the security org and other engineering teams. This could involve engineering work to gather metadata from the codebase or cloud infra provider, or sorting through large existing datasets and compiling the most useful information for reference.

While that information may not always be available depending on the stage of the company, keeping track of asset information is a long-term investment that will pay dividends as a security program matures.

There’s no need to wait until all the relevant information is available–even a partial dataset is helpful to start, and the risk algorithm can be tuned as more data comes in. For example, in a microservice architecture, owners could fill out a YAML file with attributes of the service on creation, and eventually graduate to more automated evaluation of service metadata.
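
For example, a hypothetical per-service metadata file could be parsed during the pipeline run to feed asset attributes into the risk algorithm. The field names below are illustrative, and the sketch assumes PyYAML is available:

```python
import yaml  # assumes PyYAML is installed

# Illustrative attributes a service owner might fill out at creation time.
SERVICE_YAML = """
name: listing-search
owner_team: search-infra
public_facing: true
handles_pii: false
tier: production-critical
"""

service = yaml.safe_load(SERVICE_YAML)
print(service["public_facing"])  # True -> feeds the asset-risk context above
```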

Step 3: Reporting and Remediation

Once we’d standardized the format of the vulnerabilities and tracked them via UUIDs, we created a generic reporting service. Shared logic for ticket creation, closing, and metadata tracking is handled in the service so that we don’t need to constantly reinvent the wheel. We will go into more detail on the reporting service in the Implementation section.

Step 4: Verification

Once the vulnerability has been marked as fixed by the owner, we want to programmatically verify that it is truly gone. As our guiding principle states, humans are always the weak link in a process. Anyone can mistakenly close a ticket as fixed, and we can’t have 100% confidence unless we verify that the vulnerability is truly gone.

For scanners that report state daily, once a UUID is no longer present, we can mark the vulnerability as verified. For more complicated ones, we can write separate jobs that pass the UUID status to the reporting service so tickets can be closed and/or verified. For example, if we are tracking a vulnerability ad-hoc, we can collect information from deployment pipelines to ensure that the patch has been successfully deployed.
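
A simplified version of that daily verification check might look like the following, where `reporting_service` stands in for whatever ticketing integration is in use:

```python
def verify_fixes(open_tickets: dict, todays_uuids: set, reporting_service) -> None:
    """Close tickets whose vulnerability UUID has disappeared from the daily scan.

    open_tickets maps vulnerability UUID -> ticket ID; reporting_service is a
    placeholder client for the ticketing system.
    """
    for vuln_uuid, ticket_id in open_tickets.items():
        if vuln_uuid not in todays_uuids:
            # The asset no longer reports this package/version: verified fixed.
            reporting_service.close(ticket_id, resolution="verified")
        else:
            # Still present: the fix was incomplete, reopen for follow-up.
            reporting_service.reopen(ticket_id, reason="vulnerability still detected")
```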

Implementation and Scaling

We want to share the specific implementation of our pipeline; however, it is worth noting that many different technologies could be leveraged to handle similar logic.

Detection

Vulnerability scanners can be performance heavy. Moreover, deploying many vendor solutions and/or agents can increase an organization’s attack surface. It is key to understand what data the vulnerability scanner will provide and consider whether an already rolled-out agent (e.g., Osquery, AWS Systems Manager Agent (SSM)) could provide similar output. So aside from a few necessary traditional vendor solutions, we primarily leveraged agents that were already being used for other purposes to identify security vulnerabilities.

Airflow

Our process is primarily implemented in Airflow, the open-source workflow orchestration tool that Airbnb released in 2015. Airflow allows us to create scheduled jobs with upstream and downstream dependency management. Datasets can be tracked on any timeframe, and it is easy to backfill or rerun jobs on specific dates (in contrast to a standard cronjob).

The steps are implemented in Directed Acyclic Graphs (DAGs) that can be chained together. We use this to create a data processing pipeline to ingest the alerts and process them as mentioned in our pipeline explanation. Shared data like internal risk context, ingested vulnerability feeds, and NVD/Red Hat scoring criteria can be easily accessed at any point of the pipeline for writing contextualized risk logic.
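
As a minimal illustration (not our actual DAG), a chained pipeline in Airflow 2.x could look like this, with placeholder callables standing in for the ingest, risk assessment, and reporting steps described above:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables standing in for the real pipeline steps.
def ingest_scanner_results(**context): ...
def contextualize_risk(**context): ...
def report_vulnerabilities(**context): ...

with DAG(
    dag_id="vuln_pipeline_example",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest", python_callable=ingest_scanner_results)
    assess = PythonOperator(task_id="contextualize_risk", python_callable=contextualize_risk)
    report = PythonOperator(task_id="report", python_callable=report_vulnerabilities)

    # Upstream/downstream dependencies: ingest -> assess -> report.
    ingest >> assess >> report
```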

Reporting Service

Fig. 3: Reporting Service Logic

Now that we had the process for a single type of vulnerability, we wanted to be able to easily scale for any new type of vulnerability that we start tracking.

This is where the importance of the reporting service comes in. Our reporting service doesn’t care about the details of the vulnerability. All it needs is the data stored in a table with the expected schema and the client-provided callbacks to create the ticket copy.
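
To illustrate that contract, here is a hedged sketch of what a table row plus a client-provided ticket-copy callback might look like. The interface below is hypothetical, not the real service API:

```python
from typing import Callable, Dict

# Hypothetical row shape each client pipeline writes to its results table, e.g.
# {"uuid": ..., "severity": ..., "owner_team": ..., "details": {...}}
VulnRow = Dict[str, object]

class ReportingService:
    """Vulnerability-type-agnostic ticket handling (illustrative only)."""

    def __init__(self):
        self._ticket_copy_callbacks: Dict[str, Callable[[VulnRow], str]] = {}

    def register(self, vuln_type: str, make_ticket_copy: Callable[[VulnRow], str]) -> None:
        # Clients only supply how to render the ticket text for their rows.
        self._ticket_copy_callbacks[vuln_type] = make_ticket_copy

    def file_ticket(self, vuln_type: str, row: VulnRow) -> str:
        copy = self._ticket_copy_callbacks[vuln_type](row)
        # Shared logic (assignment, SLAs, dedup, metadata) would live here.
        return copy

# A new vulnerability source registers its own ticket copy.
svc = ReportingService()
svc.register("third_party_package",
             lambda row: f"Upgrade {row['details']['package']} on {row['details']['asset']}")
```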

The modular nature of Airflow DAGs and the reporting service makes it simple to add a new vulnerability source into our systems. Our team doesn’t have to be responsible for writing every ingest / risk assessment process, and external teams using our pipeline don’t have to worry about managing the shared vulnerability tracking logic. Shared metadata can be reused over and over again, so things like risk assessment logic get easier every time.

Fig. 4: How we scale our reporting service for any number of alert types.

Results

Scaling

Before we standardized on this system, the vulnerability management team had to be much more involved in the vulnerability tracking process, creating a bottleneck. It was difficult to have a different team own a specific kind of scanner or vulnerability type, as we had to be deeply involved in how it was tracked.

After we rolled out this process, the number of vulnerabilities we were able to manage increased dramatically. Multiple security teams were able to integrate with our pipeline for their own purposes, and unifying the functionality across our org allowed us to automatically get metrics that give us insight into our full attack surface.

Operational Work

Our work on contextualizing risk also made an enormous difference in managing the operational work around risk assessment.

Fig. 5: False positive rate over time

For example, when we first deployed a new scanner, a large percentage of the alerts were false positives. We spent a lot of time going through the tickets daily to filter out the noise and identify the highest priority CVEs. Over the course of several months, we tuned our risk assessment algorithm to take in different kinds of criteria aside from the default severity score provided by our scanner. Now, we only occasionally have to manually review tickets to validate severity (primarily criticals, which can be difficult to distinguish from highs), and we trust that alerts are most likely accurate.

Case Study: Log4Shell

A side effect of having a scalable vulnerability management system is that it is significantly easier to react quickly during critical incidents.

Like the rest of the internet, we had to act fast when the Log4Shell vulnerability in log4j was disclosed. Several years ago, this work would have been both operationally and programmatically difficult, and would have required significant time and resources. However, our new pipeline allowed us to respond much more quickly. We simply wrote a new DAG to track all services running Java and passed it into the reporting service. While our engineers were busy patching the services, we wrote a second DAG to programmatically detect whether a service had been patched. We were then able to confidently verify the status of a service, while reopening tickets for incorrectly fixed ones.
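
As a simplified sketch of the core check in that second DAG, with a hypothetical service inventory and a configurable minimum safe version (the required version depended on the guidance at the time):

```python
def parse_version(v: str) -> tuple:
    """Turn '2.17.1' into (2, 17, 1) for comparison."""
    return tuple(int(part) for part in v.split("."))

def unpatched_services(inventory: dict, minimum_safe: str = "2.17.1") -> list:
    """Return services whose deployed log4j version is below the safe threshold.

    `inventory` maps service name -> deployed log4j version; in practice this
    data came from deployment pipelines rather than a hardcoded dict.
    """
    threshold = parse_version(minimum_safe)
    return [svc for svc, version in inventory.items()
            if parse_version(version) < threshold]

print(unpatched_services({"svc-a": "2.14.1", "svc-b": "2.17.1"}))  # ['svc-a']
```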

Takeaways and Suggestions

Vulnerability management is a hard problem to solve, and getting to a solution that works best in a custom environment takes time. It is key for organizations to prioritize the problem space and invest in automation so they can quickly address their known attack surface.

Vulnerability management should be treated as an engineering problem, not as an operational problem. If you have not yet adopted this approach, hopefully the benefits we have described convince you to take steps towards this goal. And just like every engineering solution, you will learn to adjust your approach as you gather more datasets and metrics about your environment.

On top of detection and metrics tracking, it’s important to prioritize automation to address vulnerability root causes because metrics alone will not reduce attack surface. Prevention is always better than remediation.

Lastly, be creative with your solutions. Survey your existing tools, even ones not specifically geared towards security, to see if they can provide vulnerability insights. Your vulnerability management automation pipeline should be modular and vendor-agnostic to provide flexibility to incorporate all available data sources. The more you can reuse existing tools, the less additional attack surface you’ll need to maintain while still providing valuable signal.

Acknowledgements

Thanks to Deanna Bjorkquist who has helped drive the Vulnerability Management program and automation requirements. Thanks to Derek Wang for code excellence and feature expansion. Thanks to Christopher Barcellos for reviewing and providing feedback for our blog post. Thanks to Tina Nguyen for helping drive and make this blog post possible. Thanks to Mark Vlcek for his work on some of our scanning solutions. Thanks to the internal Airbnb Airflow team for their technology support.

****************

All product names, logos, and brands are property of their respective owners. All company, product and service names used in this website are for identification purposes only. Use of these names, logos, and brands does not imply endorsement.
