Evolution of Streaming Pipelines in Lyft’s Marketplace
The journey of evolving our streaming platform and pipeline to better scale and support new use cases at Lyft.
Background
In 2017, Lyft’s Pricing team within our Marketplace organization was using a cronjob-based Directed Acyclic Graph (DAG) to compute dynamic pricing for rides. Each unit in the DAG would run at the top of every minute, fetch the data from the previous unit, compute the result, and store it for the next unit. This paradigm had several minutes of inherent latency.
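To make the latency problem concrete, below is a minimal, hypothetical sketch of one such cron unit and its per-minute trigger. The names (`FeatureStore`, `aggregate_demand_unit`, `run_every_minute`) and the logic are illustrative assumptions, not Lyft's actual code; the point is that each unit only sees the previous unit's output from the prior run, so end-to-end latency grows roughly linearly with the depth of the DAG.

```python
# Illustrative sketch of a cronjob-based DAG unit (hypothetical names and logic).
import time


class FeatureStore:
    """Stand-in for the shared store that cron units read from and write to."""

    def __init__(self):
        self._data = {}

    def read(self, key):
        return self._data.get(key)

    def write(self, key, value):
        self._data[key] = value


def aggregate_demand_unit(store: FeatureStore) -> None:
    """One DAG unit: fetch the previous unit's output, compute, store the result."""
    raw_events = store.read("raw_ride_requests") or []
    demand_by_region = {}
    for event in raw_events:
        region = event["region"]
        demand_by_region[region] = demand_by_region.get(region, 0) + 1
    store.write("demand_by_region", demand_by_region)


def run_every_minute(unit, store: FeatureStore) -> None:
    """Crude stand-in for the cron trigger: fire at the top of each minute."""
    while True:
        unit(store)
        time.sleep(60 - time.time() % 60)
```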
As product iteration speed increased, this older infrastructure could no longer keep up with faster development cycles, primarily because every new ML feature required custom logic that took multiple weeks to develop.
The team needed better infrastructure to make the dynamic pricing system more reactive for the following reasons:
- Decrease end-to-end latency so the system could react more quickly to marketplace imbalances.
- Decrease development time and increase product iteration speed.
MVP
After much deliberation, we decided that streaming engines would be a better fit for our requirements and selected Apache Beam. Since this was an entirely new framework, we had to come up with a pipeline design that ensured functional parity with the existing system. The very first version (see Figure 1) was designed to consume events, convert data to ML features, orchestrate model executions, and sync decision variables to their respective services.
Figure 1: Initial pipeline
Implementing the pipeline was straightforward and required little custom logic, since streaming constructs provide functionality such as data sharding, aggregation, and joins out of the box that can be used directly in the pipeline. This significantly reduced the overall code size (4K LOC, compared to 10K LOC in the legacy system) and cut ML feature development time by 50%.
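As a rough illustration, a single Beam pipeline covering all four stages might look like the sketch below. The transform names, the in-memory event source, and the model stub are assumptions made for readability rather than Lyft's actual implementation, but they show how windowing and aggregation come with the framework.

```python
# Conceptual sketch of the single v1 pipeline; transform names, the in-memory
# event source, and the model stub are assumptions, not Lyft's actual code.
import apache_beam as beam
from apache_beam import window


class ExecutePricingModel(beam.DoFn):
    """Placeholder for model orchestration; a real pricing model runs here."""

    def process(self, element):
        region, demand = element
        # e.g. decision = model.predict(features)
        yield {"region": region, "price_multiplier": 1.0 + 0.01 * demand}


def run():
    with beam.Pipeline() as p:
        (
            p
            # Stand-in event source; in production this reads a stream.
            | "ReadEvents" >> beam.Create([{"region": "sfo", "type": "ride_request"}])
            | "KeyByRegion" >> beam.Map(lambda e: (e["region"], 1))
            # Out-of-the-box constructs (windowing, aggregation) replace the
            # custom sharding and joining logic of the legacy cron DAG.
            | "Window" >> beam.WindowInto(window.FixedWindows(60))
            | "DemandPerRegion" >> beam.combiners.Count.PerKey()
            | "RunModel" >> beam.ParDo(ExecutePricingModel())
            # Stand-in for syncing decision variables to downstream services.
            | "SyncDecisions" >> beam.Map(print)
        )


if __name__ == "__main__":
    run()
```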
Before rollout, we added metric parity checks and various alarms to avoid surprises. Still, the initial rollout didn't go entirely smoothly: we ran into data skewness and higher memory utilization. Once we overcame those issues, we rolled out to smaller markets and demonstrated 100% data parity and 60% lower end-to-end latency.
Productionizing the v1 pipeline
After rolling out to larger markets, we realized that the pipeline was not scaling well. End-to-end latency kept increasing (mainly because of the model's higher memory consumption and CPU-intensive execution), which hurt the pipeline's throughput. We tuned the configuration and tried various optimizations to achieve higher throughput, but these efforts were not effective. The problem was architectural: the pipeline was doing too many things. We had to go back to the drawing board and redesign it for better throughput and scalability.
The pipeline was functionally segregated into two components: feature generation and model execution. The two components had different performance requirements, so keeping them in a single pipeline and optimizing that one pipeline for both was not the right approach. Their operational requirements differed as well: feature generation rarely changed once a feature was fully developed, whereas models were iterated on and deployed much more frequently.
We followed a microservice-style architecture in the new streaming pipeline design and split the pipeline into two (see Figure 2). The first pipeline was for feature generation and focused on high performance. This pipeline ingests tens of millions of events per second and processes them into machine learning features. We optimized it for higher throughput and configured it to run on a beefier cluster to accommodate these requirements.
The second pipeline was optimized for ML model execution and orchestration. It executes the models in an isolated environment so as not to interfere with the pipeline’s performance. This pipeline is configured on a relatively smaller cluster, since it is only consuming tens of thousands of events per minute.
Figure 2: Separate pipelines for feature generation and model execution
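The sketch below illustrates the model-execution side of this split, under the assumption that the two pipelines exchange feature records over a message bus; the `load_model` helper and the model's `predict()` API are hypothetical stubs. Deferring model loading to `DoFn.setup()` approximates the isolation described above: the heavy model initialization happens once per worker on the smaller cluster, well away from the high-throughput feature generation pipeline.

```python
# Sketch of the model-execution pipeline after the split; load_model and the
# model's predict() API are hypothetical stubs, not Lyft's actual code.
import apache_beam as beam


def load_model(name):
    """Stand-in for fetching a trained model from a model store."""
    class _StubModel:
        def predict(self, features):
            return 1.0
    return _StubModel()


class RunPricingModel(beam.DoFn):
    """Executes the pricing model for each incoming feature record."""

    def setup(self):
        # Heavy, memory-hungry initialization happens once per worker on the
        # smaller model-execution cluster, isolated from feature generation.
        self._model = load_model("pricing_model")

    def process(self, feature_record):
        region, features = feature_record
        yield {"region": region, "price_multiplier": self._model.predict(features)}


def run_model_execution_pipeline(feature_source):
    """Smaller cluster: tens of thousands of feature records per minute."""
    with beam.Pipeline() as p:
        (
            p
            | "ReadFeatures" >> feature_source    # e.g. a streaming read
            | "RunModel" >> beam.ParDo(RunPricingModel())
            | "SyncDecisions" >> beam.Map(print)  # stand-in for service sync
        )
```

For local experimentation, `feature_source` could be as simple as `beam.Create([("sfo", {"avg_wait_time": 42.0})])`; in production it is a streaming read from the bus that the feature generation pipeline publishes to.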
This new pipeline architecture worked very well, and it allowed us to scale the pipelines independently and manage different release cycles effectively. We were able to roll out these pipelines to all of our markets: The end-to-end latency was stable (< 1 minute) and it maintained a stable throughput even during peak load (New Year’s Eve, Halloween etc.).
Simplifying pipeline creation
We saw an explosion of use cases at Lyft after our system became generally available. Our platform excelled at real-time feature computation, but new users found the framework unfamiliar and difficult to use.
To improve usability, we created YAML-based pipelines: users define UDFs with our DSL in a YAML file (see Figure 3), which our system then translates into streaming pipelines. Users got acquainted with the new setup easily and were able to run streaming pipelines with only minimal Beam experience.
Creating a new pipeline now takes a few days as compared to weeks. Teams outside of the Marketplace organization also quickly adopted it, and over 15 pipelines were set up in no time.
Figure 3: YAML-based feature pipeline
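The snippet below sketches the idea behind this translation layer. The YAML keys, the UDF registry, and the `YamlFeaturePipeline` transform are hypothetical stand-ins for our internal DSL (which this post does not spell out); they simply show how a declarative definition can be parsed and expanded into ordinary Beam transforms. It assumes PyYAML is available for parsing.

```python
# Hypothetical sketch of translating a YAML pipeline definition into Beam
# transforms. The DSL keys and translation logic are illustrative only.
import yaml  # PyYAML, assumed available
import apache_beam as beam
from apache_beam import window

EXAMPLE_PIPELINE_YAML = """
name: avg_eta_by_region
key_by: region
udf: extract_eta        # user-defined extraction function, referenced by name
aggregation: mean
window_seconds: 60
"""

# User-defined functions referenced by name from the YAML file.
UDF_REGISTRY = {
    "extract_eta": lambda event: float(event["eta_seconds"]),
}

# Aggregations supported by this toy DSL.
AGGREGATIONS = {
    "mean": beam.combiners.MeanCombineFn,
    "count": beam.combiners.CountCombineFn,
}


class YamlFeaturePipeline(beam.PTransform):
    """Composite transform built from a parsed YAML spec."""

    def __init__(self, spec):
        super().__init__()
        self._spec = spec
        self._udf = UDF_REGISTRY[spec["udf"]]
        self._agg = AGGREGATIONS[spec["aggregation"]]

    def expand(self, events):
        spec = self._spec
        return (
            events
            | "KeyAndExtract" >> beam.Map(
                lambda e: (e[spec["key_by"]], self._udf(e)))
            | "Window" >> beam.WindowInto(
                window.FixedWindows(spec["window_seconds"]))
            | "Aggregate" >> beam.CombinePerKey(self._agg())
        )


def run(spec_text=EXAMPLE_PIPELINE_YAML):
    spec = yaml.safe_load(spec_text)
    with beam.Pipeline() as p:
        (
            p
            | "ReadEvents" >> beam.Create(
                [{"region": "sfo", "eta_seconds": 240.0}])  # stand-in source
            | spec["name"] >> YamlFeaturePipeline(spec)
            | "Publish" >> beam.Map(print)  # stand-in for the real sink
        )
```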
Re-architecting for scale
YAML-based pipelines further accelerated pipeline development, but we realized that scaling further in terms of the number of pipelines would be operationally heavy and difficult to manage.
We wanted to centralize and own the features, which required defining simple guiding principles:
- Share the infrastructure between different feature pipelines as much as possible to reduce infrastructure overhead and cost.
- Define the standards for various features to encourage feature sharing, promote shorter development cycles, and reduce operational overhead.
We defined feature standards that encourage feature sharing across different products. Furthermore, these features are now produced from shared pipelines that drive down infrastructure and development costs.
To build such pipelines, we decomposed the feature generation pipeline into two types (see Figure 4). The first type handles event ingestion, filtration, hydration, and metadata tagging; it produces high-quality signals and publishes them to Kafka topics. The second type consumes those Kafka topics and aggregates the data into standard ML features.
Figure 4: Standard feature pipelines
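Here is a rough sketch of this decomposition, assuming Beam's cross-language Kafka connectors and illustrative topic names (`ride_events`, `marketplace_signals`, `standard_features`): the first pipeline filters, hydrates, and tags raw events into signals, and the shared second pipeline aggregates those signals into standard features. The hydration lookup and the example feature are invented for illustration.

```python
# Sketch of the two pipeline types in Figure 4; topic names, the hydration
# lookup, and the example feature are illustrative assumptions. Kafka IO is a
# cross-language transform and needs a runner/expansion service that supports it.
import json

import apache_beam as beam
from apache_beam import window
from apache_beam.io.kafka import ReadFromKafka, WriteToKafka

KAFKA_CONFIG = {"bootstrap.servers": "kafka:9092"}  # assumed broker address


def hydrate(event):
    """Stand-in for hydration and metadata tagging."""
    event["geohash"] = "9q8yy"          # e.g. looked up from a location service
    event["source"] = "ride_events_v2"  # metadata tag used by consumers
    return event


def run_signal_pipeline():
    """Ingest raw events, filter, hydrate, and publish high-quality signals."""
    with beam.Pipeline() as p:
        (
            p
            | "ReadRaw" >> ReadFromKafka(
                consumer_config=KAFKA_CONFIG, topics=["ride_events"])
            | "Decode" >> beam.Map(lambda kv: json.loads(kv[1].decode("utf-8")))
            | "FilterValid" >> beam.Filter(lambda e: e.get("region") is not None)
            | "Hydrate" >> beam.Map(hydrate)
            | "Encode" >> beam.Map(
                lambda e: (e["region"].encode(), json.dumps(e).encode()))
            | "PublishSignals" >> WriteToKafka(
                producer_config=KAFKA_CONFIG, topic="marketplace_signals")
        )


def run_standard_feature_pipeline():
    """Shared pipeline: aggregate signals into standard, shareable ML features."""
    with beam.Pipeline() as p:
        (
            p
            | "ReadSignals" >> ReadFromKafka(
                consumer_config=KAFKA_CONFIG, topics=["marketplace_signals"])
            | "Decode" >> beam.Map(lambda kv: json.loads(kv[1].decode("utf-8")))
            | "KeyByGeohash" >> beam.Map(lambda e: (e["geohash"], 1))
            # Example standard feature: demand count per geohash, 5-minute
            # sliding windows advancing every minute.
            | "Window" >> beam.WindowInto(window.SlidingWindows(300, 60))
            | "DemandPerGeohash" >> beam.combiners.Count.PerKey()
            | "Encode" >> beam.Map(
                lambda kv: (kv[0].encode(), json.dumps(kv[1]).encode()))
            | "PublishFeatures" >> WriteToKafka(
                producer_config=KAFKA_CONFIG, topic="standard_features")
        )
```

Because the signal topics are shared, multiple feature pipelines and product teams can consume the same cleaned events without re-implementing ingestion.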
With this system, we were able to keep feature parity with our legacy pipelines and deprecate 40% of similar features in favor of shareable features. The remaining feature sets have since been onboarded to the new system, and all new features use it as well. We have achieved our usability goals with the new pipelines: The development process requires 30% less work as compared to the earlier pipeline, and product teams can navigate our systems and contribute as needed.
Road ahead
Enabling easier pipeline creation resulted in more pipelines being created, which came with its own operational challenges:
- The pipeline DAG makes it harder to identify upstream failures, especially when they are a couple of hops away.
- It is hard to quickly detect pipeline failures or data failures.
- Identifying a root cause takes several minutes, which is not acceptable given our SLOs.
- Downstream services need to know immediately when a feature's quality has degraded so they can switch over to a fallback.
All of this calls for smart tooling that can quickly identify root causes and provide a 360° view of the data and the pipeline DAG. To address this, we are building a control plane and a smart alerting system. The control plane notifies downstream services about data degradation, while the alerting system can automatically page downstream pipeline owners and provide insights into the failure. We are also building tools that proactively detect issues in upstream data and notify the relevant teams. This will help reduce the operational burden on the product and platform teams.
Closing notes
It has been an exciting journey building, learning, and organically evolving our systems for growth. Each iteration provided better scale, but also exposed shortcomings. To be sure, the evolution of our system is not done yet, and the future holds many opportunities and new surprises. With this in mind, we have set a long-term roadmap to build the next generation of platforms and develop smarter tools.
We are actively looking for talented engineers who can help us build the next generation of the platform. I’d personally love to connect with you to share what we are building and discuss opportunities that excite you.
Relevant posts
- Check out our talk at Strata Data Conference 2019 (Magic behind your Lyft ride prices) to understand our thought process of choosing Apache Beam.
- Get insights on how we overcame data skewness during our initial rollout.
- Learn more about various use cases of streaming at Lyft in this blog post.