公司:dropbox
Dropbox 于2007年5月由麻省理工学院学生德鲁·休斯顿和阿拉什·费道斯创立,时名Evenflow, Inc.,于2009年10月更名为Dropbox,总部位于美国加利福尼亚州旧金山。
Dropbox通过免费增值模式营运,提供线上存储服务,通过云计算实现互联网文件同步,用户可以存储并共享文件和文件夹。在云存储领域的竞争对手包括谷歌公司的Google Drive、微软公司的OneDrive和亚马逊公司的AWS等。
Dropbox于2018年3月23日在美国纳斯达克上市交易,股票代码是DBX,发行3600万股股票,发行价21美元。Dropbox在2017年营业收入11.1亿美元,注册用户超过5亿,净亏损1.117亿美元。
Selecting a model for semantic search at Dropbox scale
Nautilus is our search engine for finding documents and other files in Dropbox. Here's how it created the foundation for us to build better search functionality to understand more nuanced queries.
What’s new with Robinhood, our in-house load balancing service
Robinhood is the internal Dropbox load balancing service we deployed in 2020. It is responsible for routing all our internal traffic between servers to balance service load. Before we built Robinhood,…
How we use Lakera Guard to secure our LLMs
From search to organization, rapid advancements in artificial intelligence (AI) have made it easier for Dropbox users to discover and interact with their files. However, these advancements can also introduce new security challenges. Large Language Models (LLMs), integral to some of our most recent intelligent features, are also susceptible to various threats—from data breaches and adversarial attacks to exploitation by malicious actors. While hundreds of millions of users already trust Dropbox to protect their content, ensuring the security and integrity of these models is essential for maintaining that trust.
Customizing scopes in the OAuth app authorization flow
Learn how to configure and customize which scopes your app requests during the Dropbox OAuth 2 app authorization flow.
Meet Chrono, our scalable, consistent, metadata caching solution
Efficiently storing and retrieving metadata has long been a core focus for metadata teams at Dropbox, and in recent years we've made significant improvements to our foundational layer. By building our own incrementally scalable key-value storage, we’ve been able to avoid necessary doubling of our hardware fleet. As we scale, our evolution continues—but there are still challenges we cannot ignore.
At our current rate of growth, we knew there would come a time when the volume of read queries per second (QPS) could not be sustainably served by our existing underlying storage machines. Caching was the obvious solution—but not necessarily an easy one.
Bringing AI-powered answers and summaries to file previews on the web
Dropbox offers a handful of features that use machine learning to understand content much like a human would. For example, Dropbox can generate summaries and answer questions about files when those…
Bye Bye Bye...: Evolution of repeated token attacks on ChatGPT models
Building on prior prompt injection research, we recently discovered a new training data extraction vulnerability involving OpenAI’s chat completion models.
From AI to sustainability, why our latest data centers use 400G networking
Dropbox最近推出了使用400G以太网技术的数据中心架构。他们采用了400G-DAC互连的简化版本的MDF机架和跨数据中心的数据中心互连。通过优化光纤网络,支持从100G到400G的连接,并使用密集波分复用系统,提供了更高的带宽和更低的成本。他们在采用400G技术过程中学到了一些重要的教训,例如对组件进行细致的测试,确保向后兼容性,并制定了应对供应链不稳定因素的备用计划。这些经验将有助于他们在未来的发展中应对新的挑战。
Is this a date? Using ML to identify date formats in file names
File names play a vital role in facilitating effective communication and organization among teams. Files with cryptic or nonsensical names can quickly lead to chaos—whereas a well-structured naming system can streamline workflows, improve collaboration, and ensure easy retrieval of information. Consistency in naming across different files and file types enables teams to find and share content more efficiently, saving time and reducing frustration.
To make it easier for our users to organize and find their files, Dropbox has an automated feature called naming conventions. With this feature, users can set rules around how files should be named, and files uploaded to a specific folder will automatically be renamed to match the preferred convention. For example, files could be renamed to include a keyword or date.
How Edison is helping us build a faster, more powerful Dropbox on the web
How Dropbox re-wrote its core web serving stack for the next decade—sunsetting technical debt accrued over 13 years, and migrating high-traffic surfaces to a future-proofed platform ready for the company’s multi-product evolution.
Investigating the impact of HTTP3 on network latency for search
Dropbox is well known for storing users’ files—but it’s equally important we can retrieve content quickly when our users need it most. For the Retrieval Experiences team, that means building a search experience that is as fast, simple, and powerful as possible. But when we conducted a research study in July 2022, one of the most common complaints was that search was still too slow. If search was faster, these users said, they would be more likely to use Dropbox on a regular basis.
Dont you (forget NLP): Prompt injection with control characters in ChatGPT
Like many companies, Dropbox has been experimenting with large language models (LLMs) as a potential backend for product and research initiatives. As interest in leveraging LLMs has increased in recent months, the Dropbox Security team has been advising on measures to harden internal Dropbox infrastructure for secure usage in accordance with our AI principles. In particular, we’ve been working to mitigate abuse of potential LLM-powered products and features via user-controlled input.
Injection attacks that manipulate inputs used in LLM queries have been one such focus for Dropbox security engineers. For example, an adversary who is able to modify server-side data can then manipulate the model’s responses to a user query. In another attack path, an abusive user may try to infer information about the application’s instructions in order to circumvent server-side prompt controls for unrestricted access to the underlying model.
As part of this work, we recently observed some unusual behavior with two popular large language models from OpenAI, in which control characters (like backspace) are interpreted as tokens. This can lead to situations where user-controlled input can circumvent system instructions designed to constrain the question and information context. In extreme cases, the models will also hallucinate or respond with an answer to a completely different question.
How we reduced the size of our JavaScript bundles by 33%
When was the last time you were about to click a button on a website, only to have the page shift—causing you to click the wrong button instead? Or the last time you rage-quit a page that took too long to load?
These problems are only amplified in applications as rich and interactive as ours. The more front-end code is written to support more complex features, the more bytes are sent to the browser to be parsed and executed, and the worse performance can get.
At Dropbox, we understand how incredibly annoying such experiences can be. Over the past year, our web performance engineering team narrowed some of our performance problems down to an oft-overlooked culprit: the module bundler.
Balancing quality and coverage with our data validation framework
At Dropbox, we store data about how people use our products and services in a Hadoop-based data lake. Various teams rely on the information in this data lake for all kinds of business purposes—for example, analytics, billing, and developing new features—and our job is to make sure that only good quality data reaches the lake.
Our data lake is over 55 petabytes in size, and quality is always a big concern when working with data at this scale. The features we build, the decisions we make, and the financial results we report all hinge on our data being accurate and correct. But with so much data to sift through, quality problems can be incredibly hard to find—if we even know they exist in the first place. It's the data engineering equivalent of looking for a black cat in a dark room.
Accelerating our A/B experiments with machine learning
Like many companies, Dropbox runs experiments that compare two product versions—A and B—against each other to understand what works best for our users. When a company generates revenue from selling advertisements, analyzing these A/B experiments can be done promptly; did a user click on an ad or not? However, at Dropbox we sell subscriptions, which makes analysis more complex. What is the best way to analyze A/B experiments when a user’s experience over several months can affect their decision to subscribe?
For example, let’s say we wanted to measure the effect of a change in how we onboard a new trial user on the first day of their trial. We could pick some metric that is available immediately—such as the number of files uploaded—but this might not be well correlated with user satisfaction. We could wait 90 days to see if the user converts and continues on a paid subscription, but that takes a long time. Is there a metric that is both available immediately and highly correlated with user satisfaction?
We found that, yes, there is a better metric: eXpected Revenue (XR). Using machine learning, we can make a prediction about the probable value of a trial user over a two-year period, measured as XR. This prediction is made a few days after the start of a trial, and it is highly correlated with user satisfaction. With machine learning we can now draw accurate conclusions from A/B experiments in a matter of days instead of months—meaning we can run more experiments every year, giving us more opportunities to make the Dropbox experience even better for our users.
Increasing Magic Pocket write throughput by removing our SSD cache disks
When Magic Pocket adopted SMR drives in 2017, one of the design decisions was to use SSDs as a write-back cache for live writes. The main motivation was that SMR disks have a reputation for being slower for random writes than their PMR counterparts. To compensate, live writes to Magic Pocket were committed to SSDs first and acknowledgements were sent to upstream services immediately. An asynchronous background process would then flush a set of these random writes to SMR disks as sequential writes. Using this approach, Magic Pocket was able to support higher disk densities while maintaining our durability and availability guarantees.
The design worked well for us over the years. Our newer generation storage platforms were able to support disks with greater density (14-20 TB per disk). A single storage host—with more than 100 such data disks and a single SSD—was able to support 1.5-2 PBs of raw data. But as data density increased, we started to hit limits with maximum write throughput per host. This was primarily because all live writes would pass through a single SSD.
We found each host's write throughput was limited by the max write throughput of its SSD. Even the adoption of NVMe-based SSD drives wasn't enough to keep up with Magic Pocket’s scale. While a typical NVMe based SSD can handle up to 15-20 Gbps in write throughput, this was still far lower than the cumulative disk throughput of hundreds of disks on a single one of our hosts.
This bottleneck only became more apparent as the density of our storage hosts increased. While higher density storage hosts meant we needed fewer servers, our throughput remained unchanged—meaning our SSDs had to handle even more writes than before to keep up with Magic Pocket’s needs.