Balancing quality and coverage with our data validation framework
At Dropbox, we store data about how people use our products and services in a Hadoop-based data lake. Various teams rely on the information in this data lake for all kinds of business purposes—for example, analytics, billing, and developing new features—and our job is to make sure that only good quality data reaches the lake.
Our data lake is over 55 petabytes in size, and quality is always a big concern when working with data at this scale. The features we build, the decisions we make, and the financial results we report all hinge on our data being accurate and correct. But with so much data to sift through, quality problems can be incredibly hard to find—if we even know they exist in the first place. It's the data engineering equivalent of looking for a black cat in a dark room.
Accelerating our A/B experiments with machine learning
Like many companies, Dropbox runs experiments that compare two product versions, A and B, against each other to understand what works best for our users. When a company generates revenue from selling advertisements, analyzing these A/B experiments can be done promptly: did a user click on an ad or not? However, at Dropbox we sell subscriptions, which makes analysis more complex. What is the best way to analyze A/B experiments when a user’s experience over several months can affect their decision to subscribe?
For example, let’s say we wanted to measure the effect of a change in how we onboard a new trial user on the first day of their trial. We could pick some metric that is available immediately—such as the number of files uploaded—but this might not be well correlated with user satisfaction. We could wait 90 days to see if the user converts and continues on a paid subscription, but that takes a long time. Is there a metric that is both available immediately and highly correlated with user satisfaction?
We found that, yes, there is a better metric: eXpected Revenue (XR). Using machine learning, we can make a prediction about the probable value of a trial user over a two-year period, measured as XR. This prediction is made a few days after the start of a trial, and it is highly correlated with user satisfaction. With machine learning we can now draw accurate conclusions from A/B experiments in a matter of days instead of months—meaning we can run more experiments every year, giving us more opportunities to make the Dropbox experience even better for our users.
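As a rough illustration of how a predicted metric like XR could be compared across two experiment arms, here is a minimal sketch. It is not Dropbox's analysis pipeline: the function name `compare_arms`, the sample values, and the choice of Welch's t-statistic are all assumptions made for illustration.

```python
import math
from statistics import mean, variance


def compare_arms(xr_a, xr_b):
    """Return (difference in mean XR, Welch's t-statistic) for two arms.

    xr_a and xr_b are lists of per-user predicted XR values
    (hypothetical inputs; the real prediction model is not shown here).
    """
    mean_a, mean_b = mean(xr_a), mean(xr_b)
    # Sample variances (n - 1 denominator), one per arm.
    var_a, var_b = variance(xr_a), variance(xr_b)
    # Standard error of the difference without assuming equal variances.
    std_err = math.sqrt(var_a / len(xr_a) + var_b / len(xr_b))
    diff = mean_b - mean_a
    return diff, diff / std_err


# Made-up predicted XR values for a handful of trial users in each arm.
arm_a = [10.0, 12.0, 11.0, 13.0, 9.0]
arm_b = [12.0, 14.0, 13.0, 15.0, 11.0]
diff, t_stat = compare_arms(arm_a, arm_b)
print(f"mean XR lift: {diff:.2f}, t-statistic: {t_stat:.2f}")
```

Because the predicted values are available a few days into a trial, a comparison like this could run long before 90-day conversion data exists.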
Increasing Magic Pocket write throughput by removing our SSD cache disks
When Magic Pocket adopted SMR drives in 2017, one of the design decisions was to use SSDs as a write-back cache for live writes. The main motivation was that SMR disks have a reputation for being slower for random writes than their PMR counterparts. To compensate, live writes to Magic Pocket were committed to SSDs first and acknowledgements were sent to upstream services immediately. An asynchronous background process would then flush a set of these random writes to SMR disks as sequential writes. Using this approach, Magic Pocket was able to support higher disk densities while maintaining our durability and availability guarantees.
The design worked well for us over the years. Our newer generation storage platforms were able to support disks with greater density (14-20 TB per disk). A single storage host, with more than 100 such data disks and a single SSD, was able to support 1.5-2 PB of raw data. But as data density increased, we started to hit limits with maximum write throughput per host. This was primarily because all live writes would pass through a single SSD.
We found each host's write throughput was limited by the max write throughput of its SSD. Even the adoption of NVMe-based SSDs wasn't enough to keep up with Magic Pocket’s scale. While a typical NVMe-based SSD can handle up to 15-20 Gbps in write throughput, this was still far lower than the cumulative disk throughput of hundreds of disks on a single one of our hosts.
This bottleneck only became more apparent as the density of our storage hosts increased. While higher density storage hosts meant we needed fewer servers, our throughput remained unchanged—meaning our SSDs had to handle even more writes than before to keep up with Magic Pocket’s needs.
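The write-back path described above, where live writes land on an SSD, are acknowledged immediately, and are later flushed to SMR disks as sequential writes, can be sketched as a toy model. This is an illustrative assumption, not Magic Pocket's implementation: the class name `WriteBackCache`, its methods, and the in-memory stand-ins for the SSD and SMR disk are all hypothetical.

```python
from collections import deque


class WriteBackCache:
    """Toy model: a staging queue (the 'SSD') in front of an
    append-only log (the 'SMR disk')."""

    def __init__(self):
        self.ssd = deque()   # staged writes, in arrival order
        self.smr_log = []    # append-only, so writes land sequentially

    def write(self, key, data):
        # Commit to the staging area and acknowledge right away,
        # before the data has reached the slower disk.
        self.ssd.append((key, data))
        return "ack"

    def flush(self, batch_size):
        # Background step: drain up to batch_size staged writes and
        # append them to the log as one sequential run.
        flushed = 0
        while self.ssd and flushed < batch_size:
            self.smr_log.append(self.ssd.popleft())
            flushed += 1
        return flushed


cache = WriteBackCache()
for key in ("c", "a", "b"):
    cache.write(key, b"block-" + key.encode())
moved = cache.flush(batch_size=2)
print(moved, [k for k, _ in cache.smr_log], len(cache.ssd))
```

Note that every write passes through the single staging queue, which mirrors why one SSD per host became the throughput ceiling.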
Future-proofing our metadata stack with Panda, a scalable key-value store
Metadata is crucial for serving user requests. It also takes up a lot of space—and as we’ve grown, so has the amount of metadata we’ve had to store. This isn’t a bad problem to have, but we knew it was only a matter of time before our metadata stack would need an overhaul.
Dropbox operates two large-scale metadata storage systems powered by sharded MySQL. One is the Filesystem, which contains metadata related to files and folders. The other is Edgestore, which powers all other internal and external Dropbox services. Both operate at a massive scale. They run on thousands of servers, store petabytes of data on SSDs, and serve tens of millions of queries per second with single-digit millisecond latency.
Everything in its write place: Cloud storage abstraction with Object Store
Dropbox originally used Amazon S3 and the Hadoop Distributed File System (HDFS) as the backbone of its data storage infrastructure. Although we migrated user file data to our internal block storage system Magic Pocket in 2015, Dropbox continued to use S3 and HDFS as a general-purpose store for other internal products and tools. Among these use cases were crash traces, build artifacts, test logs, and image caching.
Using these two legacy systems as generic blob storage caused many pain points—the worst of which was the cost inefficiency of using S3’s API. For instance, crash traces wrote many objects which were rarely accessed unless specifically needed for an investigation, generating a large PUT bill. Caches built against S3 burned pricey GET requests with each cache miss.
Defending against SSRF attacks (with help from our bug bounty program)
Over the past few years, server-side request forgery (SSRF) has received an increasing amount of attention from security researchers. With SSRF, an attacker can retarget a request to internal services and exploit the implicit trust within the network. It often escalates into a critical vulnerability, and in 2021 it was among the top ten web application security risks identified by the Open Web Application Security Project. At Dropbox, it’s the Application Security team’s responsibility to guard against and address SSRF in a scalable manner, so that our engineers can deliver products securely and with as little friction as possible.
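One common building block of SSRF defenses is refusing to fetch URLs that resolve to internal addresses. The sketch below shows that idea only; it is not a description of Dropbox's tooling. The function names `is_internal_address` and `is_safe_url`, and the injectable `resolve` parameter, are assumptions made for illustration.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_internal_address(ip_text):
    """True if the address is one a server-side fetch should not reach."""
    # Drop any IPv6 zone suffix such as "%eth0" before parsing.
    ip = ipaddress.ip_address(ip_text.split("%")[0])
    return (ip.is_private or ip.is_loopback or ip.is_link_local
            or ip.is_reserved or ip.is_multicast or ip.is_unspecified)


def is_safe_url(url, resolve=socket.getaddrinfo):
    """Allow only http(s) URLs whose host resolves solely to public IPs."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        infos = resolve(parts.hostname, parts.port or 443)
    except (socket.gaierror, ValueError):
        return False
    # Each getaddrinfo entry ends with a sockaddr tuple; item 0 is the IP.
    addresses = [info[4][0] for info in infos]
    return bool(addresses) and not any(
        is_internal_address(addr) for addr in addresses
    )


print(is_internal_address("169.254.169.254"))
print(is_safe_url("ftp://example.com/file"))
```

A check like this is only sound if the request is then sent to the same address that was validated; otherwise a second DNS lookup could return a different, internal address.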
We’re using TTVC to measure performance on the web—and now you can too
Nobody likes waiting for software. Snappy, responsive interfaces make us happy, and research shows there’s a relationship between responsiveness and attention. But maintaining fast-feeling websites often requires tradeoffs. This might mean diverting resources from the development of new features, paying off technical debt, or other engineering work. The key to justifying such diversions is connecting the dots between performance and business outcomes, something we can do through measurement.
Over the last year, we’ve been rethinking the way we track page load performance on the web at Dropbox. After identifying a few gaps in our existing metrics, we decided we needed a more objective, user-focused way to define page load performance so that we could more reliably and meaningfully compare experiences across products. We thought a relatively new page load metric called Time To Visually Complete (TTVC) could work well.
There was just one problem: Browsers don’t yet report the moment a page becomes visually complete. If we wanted to adopt TTVC as our new primary performance metric, we would have to fill that gap. So we built a small library to allow us to track TTVC as our users experience it in the real world. That library is @dropbox/ttvc—and we’re excited to be open-sourcing this work!
Fighting the forces of clock skew when syncing password payloads
A good password manager should be able to securely store, sync, and even autofill your username and password when logging into websites and apps. A password manager like…Dropbox Passwords!
When we released Dropbox Passwords in the summer of 2020, it was important we ensured that a user’s logins would always be available, and up to date, on any device they used. Luckily, Dropbox has some experience here, and we were able to leverage our existing syncing infrastructure to copy a user’s encrypted password info, known as a payload, from one device to another. However, while implementing this crucial component, we encountered an unexpected syncing issue where, sometimes, out-of-date login items would overwrite more recent changes.
Eventually we found a solution that built on prior Dropbox syncing work. But it also involved contemplating the very nature of time itself.
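To make the failure mode concrete, here is a minimal sketch of one way clock skew can let a stale edit win: a merge that trusts each device's own wall-clock timestamp. This is an assumption for illustration, not a description of how Dropbox Passwords merges payloads; the `Edit` class, the `last_writer_wins` function, and all values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Edit:
    value: str
    true_time: float     # when the edit really happened (not knowable in practice)
    device_clock: float  # timestamp recorded by the device's own clock


def last_writer_wins(a, b):
    # Naive merge: keep whichever edit carries the later device timestamp.
    return a if a.device_clock >= b.device_clock else b


# The first device's clock runs 300 seconds fast; the second is accurate.
old_edit = Edit("old-password", true_time=100.0, device_clock=400.0)
new_edit = Edit("new-password", true_time=200.0, device_clock=200.0)

winner = last_writer_wins(old_edit, new_edit)
print(winner.value)
```

Here the older edit wins because its device's clock was ahead, even though the other edit happened later in real time.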
Extending Magic Pocket Innovation with the first petabyte scale SMR drive deployment
Magic Pocket, the exabyte-scale custom infrastructure we built to drive efficiency and performance for all Dropbox products, is an ongoing platform for innovation. We continually look for opportunities to increase storage density, reduce latency, improve reliability, and lower costs. The next step in this evolution is our new deployment of specially configured servers filled to capacity with high-density SMR (Shingled Magnetic Recording) drives.
Dropbox is the first major tech company to adopt SMR technology, and we’re currently adding hundreds of petabytes of new capacity with these high-density servers at significant cost savings over conventional PMR (Perpendicular Magnetic Recording) drives. Off the shelf, SMR drives have the reputation of being slower to write to than conventional drives. So the challenge has been to benefit from the cost savings of the denser drives without sacrificing performance. After all, our new products support active collaboration between small teams all the way up to the largest enterprise customers. That’s a lot of data to write, and the experience has to be fast.
Lossless compression with Brotli in Rust for a bit of Pied Piper on the backend
In HBO’s Silicon Valley, lossless video compression plays a pivotal role for Pied Piper as they struggle to stream HD content at high speed.
Inspired by Pied Piper, we created our own version of their algorithm, Pied Piper, at Hack Week. In fact, we’ve extended that work and have a bit-exact, lossless media compression algorithm that achieves extremely good results on a wide array of images. (Stay tuned for more on that!)
However, to help our users sync and collaborate faster, we also need to work with a standardized compression format that already ships with most browsers. In that vein, we’ve been working on open source improvements to the Brotli codec, which will make it possible to ship bits to our business customers using 4.4% less of their bandwidth than through gzip.
Rewriting the heart of our sync engine
Over the past four years, we've been working hard on rebuilding our desktop client's sync engine from scratch. The sync engine is the magic behind the Dropbox folder on your desktop computer, and it's one of the oldest and most important pieces of code at Dropbox. We're proud to announce today that we've shipped this new sync engine (codenamed "Nucleus") to all Dropbox users.
Rewriting the sync engine was really hard, and we don’t want to blindly celebrate it, because in many environments it would have been a terrible idea. It turned out that this was an excellent idea for Dropbox but only because we were very thoughtful about how we went about this process. In particular, we’re going to share reflections on how to think about a major software rewrite and highlight the key initiatives that made this project a success, like having a very clean data model.
Why we built a custom Rust library for Capture
Dropbox Capture is a new visual communication tool designed to make it easy for teams to asynchronously share their work using screen recordings, video messages, screenshots, or GIFs. There's no formal onboarding required, and you can start sharing your ideas in seconds. In fact, simplicity is key to the Capture experience, and it's a value that also extends down to the development of Capture’s underlying code.
Optimizing payments with machine learning
It’s probably happened to you at some point: You go to use a service for which you believe you’ve got a paid subscription, only to find that it’s been canceled for non-payment. That’s not only bad for you, the customer: It causes negative feelings about the brand, it disrupts what should be a steady flow of revenue to the business, and a customer who finds themselves shut off might decide not to come back.
At Dropbox, we found that applying machine learning to our handling of customer payments has made us better at keeping subscribers happily humming along.
How image search works at Dropbox
Photos are among the most common types of files in Dropbox, but searching for them by filename is even less productive than it is for text-based files. When you're looking for that photo from a picnic a few years ago, you surely don't remember that the filename set by your camera was 2017-07-04 12.37.54.jpg.
Instead, you look at individual photos, or thumbnails of them, and try to identify objects or aspects that match what you’re searching for—whether that’s to recover a photo you’ve stored, or perhaps discover the perfect shot for a new campaign in your company’s archives. Wouldn’t it be great if Dropbox could pore through all those images for you instead, and call out those which best match a few descriptive words that you dictated? That’s pretty much what our image search does.
In this post we’ll describe the core idea behind our image content search method, based on techniques from machine learning, then discuss how we built a performant implementation on Dropbox’s existing search infrastructure.
Detecting memory leaks in Android applications
Keeping sync fast with automated performance regression detection