Productionizing Envoy Mobile at Lyft
Envoy Mobile is an ambitious open source initiative to bring the power of Envoy Proxy to mobile apps. It aims to deliver unparalleled observability, cutting-edge technology, control, and consistency in the mobile networking space.
We’re pleased to share that nearly all network requests from Lyft’s mobile apps are now handled by Envoy Mobile, matching or exceeding the performance of the previous libraries on Lyft’s top-level business metrics and serving billions of requests each day.
This took years of development and months of rigorous analysis comparing the performance of Envoy Mobile to the libraries it was destined to replace (URLSession on iOS and OkHttp on Android), and slowly rolling out on apps and platforms to avoid degrading the user experience or health of the Lyft ecosystem. We first rolled out on our iOS Rider app in December 2021, and incrementally enabled all our other publicly available apps on iOS and Android in the months that followed.
Our initial goal was to match the performance of our previous solutions, but in some areas we’ve seen significant improvements that simply would not have been possible without Envoy Mobile. At the end of this post, we’ll share our next set of goals for the project.
Enhanced Observability
Our ability to observe the health of our client-side networking system previously required building hooks into the respective networking libraries, sampling requests at a very low rate, and sending them back to an analytics ingestion service for processing. This meant we captured only a tiny fraction of networking metrics, and the data was slow to become available and actionable.
In contrast, Envoy Mobile has a rich set of stats that it inherits from the Envoy Proxy project, which is well regarded in the industry for its extensive observability features. These stats can be emitted directly to a gRPC or statsd endpoint, which allows us to observe the full scale of our mobile operations in near real-time.
In fact, on at least 3 occasions in the last few months, we identified production incidents using the stats emitted by Envoy Mobile which weren’t detected by our pre-existing observability solutions that mostly rely on costly analytics events. Outside of incidents, these stats have also been very helpful in understanding the state of our overall system.
Envoy Mobile stats we see at a glance
Reduced Footprint
A recurring but somewhat unexpected result in our experiments was that app hangs, ANRs (Application Not Responding errors), and OOM (out-of-memory) crashes dropped significantly when using Envoy Mobile.
- OOM crashes were reduced by 69.3%, and app hangs were reduced by 47.9% when rolling out on our iOS Driver app.
- ANRs were reduced by 30% when rolling out on our Android bike and scooter partner apps.
Graph of the iOS watchdog terminating our application due to high memory pressure
Frequent Improvements
Envoy Mobile extends the Envoy Proxy project, which sees frequent improvements from a diverse group of experts representing some of the world’s largest companies. Not only do we shorten our iteration cycle for shipping library improvements compared to waiting for yearly OS updates, but we also deploy these improvements to all of Lyft’s supported OS versions so that all of our users benefit. For example, Apple only issues significant URLSession updates annually with its major iOS releases, whereas we pull in changes from Envoy Proxy and Envoy Mobile every week with our rolling releases.
Full Control
With Envoy Mobile being open source, when we encounter a question or a problem, we can inspect the source code, step through functions in a debugger, profile in Xcode Instruments or Android Studio, and discuss with other library contributors and users to ultimately fix the issue in the project. This is a significant upgrade both to how we debug networking issues and to the power and flexibility we have to fix them.
Cross-Platform
Now that both our servers and our iOS and Android apps use a common networking library, yet still preserve comfortable idiomatic APIs, we can confidently make changes knowing that both the client and server ends of the communication channel will be compatible since they’re using the same code. For example, when compressing HTTP payloads, using exactly the same library on the client and server can help reduce the risk of edge case incompatibilities.
Lessons Learned
Mobile Network Conditions Vary Significantly
Seasoned mobile developers know that users have varying levels of connectivity depending on where they are, if they have access to WiFi, what cellular technology they’re connected to, and how strong that signal is.
What was surprising to us was the degree of variability across cell carriers and across geographical regions.
To illustrate, here’s a graph of the average success rate over 30 days of a basic ping endpoint across 5 major US wireless communications service providers:
Success rate by carrier and networking library. For this comparison, OkHttp version 4.6.0 and URLSession from iOS 12 to 15 were used against recent versions of Envoy Mobile
A few observations:
- The success rate can vary by up to 10% across carriers with the same OS/library configuration.
- The success rate can vary by up to 10% across libraries using the same carrier.
- The best performing networking library varies across carriers.
The fact that Envoy Mobile outperforms OkHttp in 4 out of 5 carriers, and URLSession in 1 out of 5 carriers, gives us confidence that we can continue to measure, analyze, and improve these metrics.
Android Always Uses IPv6 Dual Stack Sockets
This took us quite a long time to figure out. We were seeing lower connectivity rates on Android devices for certain carriers and regions and did not have any theories as to the cause of the degraded experience for months. We kept adding more telemetry in the hopes of finding the root of the issue, and eventually some leads panned out. This additional telemetry uncovered that only IPv4 addresses were failing. Finally some progress!

We decided to look at Android source code (yay OSS!) to check if the Android networking layer handled IPv4/IPv6 connections in a special way. After some source code spelunking, we found this gem buried in the Android source code: it turns out that Android always uses dual stack sockets. In other words, it translates all IPv4 addresses to IPv6 before it tries to connect to them.

By applying the same address mapping transformation at the Envoy Mobile layer, we were able to mimic what the Android OS does and fix the issue for good.
By the time we rolled out a fix candidate, we had developed a good set of metrics to validate that this set of connectivity issues had been resolved.
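To make the address mapping concrete, here is a minimal sketch in Python. It assumes the transformation in question is the standard IPv4-mapped IPv6 form (`::ffff:a.b.c.d`) that dual stack sockets generally use; the post does not show the actual Envoy Mobile change, and the helper name `to_ipv4_mapped` is hypothetical.

```python
import ipaddress


def to_ipv4_mapped(address: str) -> ipaddress.IPv6Address:
    """Return the IPv6 form of an address, mapping IPv4 into ::ffff:0:0/96.

    Hypothetical helper for illustration; not Envoy Mobile's implementation.
    """
    parsed = ipaddress.ip_address(address)
    if isinstance(parsed, ipaddress.IPv6Address):
        # Already IPv6: nothing to map.
        return parsed
    # An IPv4-mapped address is 80 zero bits, 16 one bits, then the
    # 32-bit IPv4 address.
    return ipaddress.IPv6Address((0xFFFF << 32) | int(parsed))


mapped = to_ipv4_mapped("192.0.2.1")
# The original IPv4 address can be recovered from the mapped form.
print(mapped.ipv4_mapped)
```

With every address in this form, a single IPv6 socket on a dual stack host can reach both IPv4 and IPv6 peers, which is presumably why mimicking the OS behavior resolved the IPv4-only failures.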
Monitoring Networking Health is Tricky
How does one measure the health of a mobile app’s networking connectivity if networking itself is needed to gather that telemetry? How can one prevent the overhead of that telemetry from degrading the very system it’s observing?
We were mostly lucky in that the Envoy Proxy project already had a robust yet lean stats system to emit a wealth of metrics, but the fact remains that when chasing a production issue impacting less than 1% of users with no clear common pattern, it’s hard to find that needle in the networking stack of billions of requests.
This led us and others at Lyft to build some truly advanced remote debugging tools, but more on that later.
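On the overhead question raised above, one common approach is to aggregate counters in memory and send them in periodic batches instead of reporting every request individually. The sketch below illustrates that general idea only. It is not Envoy's stats implementation; the class name and the stat names are hypothetical, and the output uses the widely used statsd counter line format `name:value|c`.

```python
from collections import Counter
from typing import Callable, List


class BufferedCounters:
    """Accumulates counter increments locally and emits them in batches."""

    def __init__(self, send: Callable[[List[str]], None]) -> None:
        # `send` is whatever transport delivers a batch of lines (assumed).
        self._send = send
        self._pending: Counter = Counter()

    def increment(self, name: str, delta: int = 1) -> None:
        # Cheap in-memory update; no network I/O on the request path.
        self._pending[name] += delta

    def flush(self) -> None:
        # Called on a timer in a real system; one batch per interval.
        if not self._pending:
            return
        lines = [f"{name}:{value}|c" for name, value in sorted(self._pending.items())]
        self._pending.clear()
        self._send(lines)


batches: List[List[str]] = []
counters = BufferedCounters(batches.append)
counters.increment("http.requests")
counters.increment("http.requests")
counters.increment("http.failures")
counters.flush()
print(batches)
```

The trade-off is that a batch is lost if the flush itself fails, so the flush interval bounds both the telemetry overhead and the amount of data at risk.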
Closing the Gap
Support for OS-level proxies is under development and we’re in the process of validating VPN support on Android, so we currently fall back to URLSession and OkHttp for those users. These configurations account for less than 2% of our user base, and we are aiming to enable Envoy Mobile for these users before the end of the year.
As far as Lyft’s rollout is concerned, we’re currently keeping a 2% holdout group on alternative libraries for comparison purposes, and we’re developing ways to remove the holdout while remaining confident that we won’t introduce regressions.
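A holdout group like the one described above is often implemented by hashing a stable identifier into buckets, so that a given user stays in the same group across sessions. The following is a minimal sketch of that technique, not Lyft's experimentation system; the function name, the salt, and the use of a user ID as the key are all assumptions.

```python
import hashlib

HOLDOUT_PERCENT = 2  # the post mentions a 2% holdout group


def in_holdout(user_id: str, salt: str = "envoy-mobile-holdout") -> bool:
    """Deterministically place roughly HOLDOUT_PERCENT% of users in the holdout.

    Hypothetical helper: the salt and the choice of key are illustrative.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    # Reduce the first 8 bytes of the hash to a bucket in [0, 100).
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < HOLDOUT_PERCENT


# Users in the holdout would keep the alternative library; the rest use
# Envoy Mobile.
library = "alternative" if in_holdout("user-123") else "envoy_mobile"
print(library)
```

Because the assignment depends only on the salt and the identifier, the client can compute it locally, and changing the salt reshuffles which users fall into the holdout.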
Looking Forward
Over the last year, we’ve been working closely with Google Cloud to add richer functionality to Envoy Mobile. In that time, Google Cloud has added features such as brotli and HTTP/3 support that we’re looking forward to experimenting with in the upcoming year.
Google Cloud sees Envoy Mobile as an exciting new way to get the latency benefits of HTTP/3 on Android and iOS, as well as to leverage existing benefits of Envoy Proxy such as its rich observability and extensibility. They’re also hard at work adding xDS support to Envoy Mobile to allow for dynamic configuration reloads, for use cases such as running mobile experiments, as well as enhancing Envoy Proxy’s existing xDS functionality with client-specific features, such as local configuration caching.
At Lyft, we have some improvements planned that should make it easier for existing users of URLSession and OkHttp to integrate Envoy Mobile. Stay tuned.
How Can You Get Involved?
We recently released version 0.5, which incorporates all of the fixes and improvements that contributed to the performance we’ve seen in our apps. We encourage you to take it for a spin!
Come join us on Slack, on the weekly community calls, or come work with us at Lyft on Envoy Mobile and other exciting projects.
Additional Resources
For more insight into why we’ve invested so much into building a new mobile networking library, please see our previous posts on this blog. We first Unveiled Envoy Mobile, shared Lyft’s Journey through Mobile Networking, did a Deep Dive on Envoy Mobile v0.2, and announced Envoy Mobile Joining the CNCF.