Bringing AI-powered answers and summaries to file previews on the web

Dropbox offers a handful of features that use machine learning to understand content much like a human would. For example, Dropbox can generate summaries and answer questions about files when those files are previewed on the web. Instead of asking a coworker for the gist of last week’s all-hands meeting, Dropbox can provide a summary of the video, and a user can ask questions about its contents—all from the file preview. We recently expanded AI-powered summarization and Q&A to handle multiple files simultaneously, too.

(As part of our commitment to responsibly using AI, Dropbox abides by a set of AI principles; you can visit our Help Center to learn more. These features are still in early access, and not yet available to all users. These features are also optional, and can be turned on or off for you or your team.)

Both our summarization and Q&A features leverage large language models (LLMs) to find, compare, and consolidate the content of the file. An LLM works by ingesting content as text, transforming the ideas contained within it into a numerical representation, and comparing those numerical representations against both the input query and an internal corpus of knowledge to answer the question. This effectively enables a computer to consume and compare information semantically, rather than lexically.

For knowledge workers suffering from information overload, we can use machine learning to get them the answers they need—without them having to remember exactly how a piece of information was worded, or where it might be contained within a file. This is what we’ve done with file previews on the web.

Extracting text and embeddings with Riviera

The first part of this process is, of course, retrieving the text. Luckily, we already have a framework for transforming basically any file type to text.

Dropbox is capable of turning complex file types like CAD drawings into formats that are easily consumable by web browsers, such as PDF. Historically we have used this system for file previews, but we also use it to power features like transcription and Dropbox Replay. Internally, we call this system Riviera.

At a high level, Riviera consists of a frontend which routes requests through one or more plugins that convert a file from one type to another. Each plugin runs in a jail, a container designed to run third-party code and tools safely and in an isolated manner. The framework maintains a graph of possible conversions, and is capable of chaining together multiple plugins into a multi-step pipeline to perform even more complex transformations. We currently support conversions between about 300 file types, and the system crunches through about 2.5 billion requests—totaling nearly an exabyte, or one billion gigabytes, of data—per day.
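As a rough illustration (not Riviera's actual code or plugin names), the conversion graph can be thought of as plugins keyed by input and output type, with a breadth-first search finding a chain of plugins between any two types:

    from collections import deque

    # Hypothetical plugin registry keyed by (input type, output type);
    # Riviera's real plugin and type names are not shown here.
    PLUGINS = {
        ("mp4", "aac"): "extract_audio",
        ("aac", "txt"): "transcribe",
        ("txt", "embedding"): "embed_text",
        ("docx", "pdf"): "render_pdf",
        ("pdf", "txt"): "extract_text",
    }

    def find_conversion_path(src: str, dst: str) -> list[str] | None:
        """Breadth-first search over the conversion graph for a chain of plugins."""
        queue, seen = deque([(src, [])]), {src}
        while queue:
            current, path = queue.popleft()
            if current == dst:
                return path
            for (inp, out), plugin in PLUGINS.items():
                if inp == current and out not in seen:
                    seen.add(out)
                    queue.append((out, path + [plugin]))
        return None

    print(find_conversion_path("mp4", "embedding"))
    # ['extract_audio', 'transcribe', 'embed_text']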

In order to apply machine learning to file previews, the conversions we are interested in are those that convert any file type to raw text. In the case of video, text extraction might look something like:

Video (.mp4) -> Audio (.aac) -> Transcript (.txt)

Some of the conversions in Riviera can be quite expensive to compute on the fly, so it also includes a sophisticated caching layer that allows us to reuse conversions between plugins. Each stage of the pipeline is cached, so intermediate results can be reused.
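A minimal sketch of what stage-level caching could look like, assuming a cache keyed on the file's content hash and the target type (the real Riviera cache is shared and far more sophisticated):

    import hashlib

    # Hypothetical in-memory cache; the real system uses a shared, persistent store.
    _cache: dict[tuple[str, str], bytes] = {}

    def convert_with_cache(content: bytes, target_type: str, convert) -> bytes:
        """Run one pipeline stage, reusing a previously computed result when possible."""
        key = (hashlib.sha256(content).hexdigest(), target_type)
        if key not in _cache:
            _cache[key] = convert(content)
        return _cache[key]

    # Because every intermediate stage is cached, a later request (say, Q&A) can
    # reuse the transcript computed for an earlier request (say, a summary).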

In the world of LLMs, the mathematical representation of the semantic meaning of text is called the embedding. Riviera treats embeddings like any other file conversion, so in the pipeline above, we can append:

Video (.mp4) -> Audio (.aac) -> Transcript (.txt) -> AIEmbedding

By separating the embedding generation from the actual summary generation we can reuse the cache features built into Riviera. If a user wants to summarize a video, then ask some follow-up questions, we only have to generate the transcript and embeddings once.

Video (.mp4) -> Audio (.aac) -> Transcript (.txt) -> AIEmbedding -> Summary
Video (.mp4) -> Audio (.aac) -> Transcript (.txt) -> AIEmbedding -> Q&A

The input for the embeddings plugin typically consists of text data extracted from various file types such as documents, videos, or audio files. In the case of video content, for example, the plugin may receive the transcript generated from the audio track of the video.

The process of converting text into embeddings involves using language models that take the text input and produce a vector representing that text. In our implementation, we split the text into paragraph-sized chunks and calculate an embedding for each chunk. By doing this, we effectively increase the granularity of the information stored within a file: instead of having just a single embedding for an entire file, we have multiple embeddings for different sections of the text. This higher granularity allows us to capture more detailed and nuanced information. We apply the same chunking and embedding method for both summaries and Q&A, ensuring they share the same embedding cache within Riviera.
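As a simplified sketch, the chunking and per-chunk embedding might look like the following, where embed() is a stand-in for whatever embedding model is in use:

    def chunk_text(text: str, max_chars: int = 1000) -> list[str]:
        """Split text into roughly paragraph-sized chunks."""
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        chunks, current = [], ""
        for para in paragraphs:
            if current and len(current) + len(para) > max_chars:
                chunks.append(current)
                current = ""
            current = (current + "\n\n" + para).strip()
        if current:
            chunks.append(current)
        return chunks

    def embed_chunks(text: str, embed) -> list[tuple[str, list[float]]]:
        """Return (chunk, embedding) pairs; `embed` stands in for the model call."""
        return [(chunk, embed(chunk)) for chunk in chunk_text(text)]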

While the actual LLM processing happens inside the summary and Q&A plugins, treating embeddings, summaries, and queries as file conversions inside Riviera allows us to operate these features at Dropbox scale.

The high level architecture of our file previews surface, with new machine learning components highlighted

A common use case for LLMs is to concisely summarize large amounts of text. When negotiating a contract, for example, a person might copy the text from a contract PDF, paste it into an LLM-powered chat prompt, and ask the LLM to summarize it in understandable terms. While adding this feature into Dropbox may have been useful as-is, we decided that we didn’t want to stop there. Dropbox users store long documents, videos, and other rich media files in Dropbox, and a summarization feature only gets more useful as the length of the file increases. We wanted to unlock this feature for all of our users' files, no matter the format or length.

First, we needed to define the qualities of a good summary. A contract to purchase a home might include many different concepts. There might be a long description of payment terms, wire instructions, good-faith deposits, and escrow accounts. It also might have a long description of contingencies and when they can be triggered. A good summary might simply say “this document is an agreement to purchase a home for X amount, and it has finance and inspection contingencies.” In other words, we defined a good summary as one that can identify all the different ideas or concepts in a document and give the reader the gist of each.

Language models enable this to be implemented algorithmically using embeddings, which allow passages to be compared semantically. King and Queen might be relatively close together (they are both regal), King and Dog might be somewhere in the middle (they are both living beings, perhaps), while King and Toaster have very little in common. Language model embeddings allow us to compare passages on thousands of dimensions that are all learned through a long training process. These embeddings in turn enable efficient summarization of large files of essentially unlimited length. 
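Closeness in this embedding space is typically measured with cosine similarity. Here is a toy example with made-up three-dimensional vectors (real embeddings have hundreds or thousands of learned dimensions):

    import math

    def cosine_similarity(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

    # Toy vectors for illustration only; real embeddings are learned by the model.
    king, queen, toaster = [0.9, 0.8, 0.1], [0.85, 0.75, 0.15], [0.1, 0.2, 0.9]

    print(cosine_similarity(king, queen))    # close to 1.0: semantically similar
    print(cosine_similarity(king, toaster))  # much lower: little in common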

Our summarization plugin takes the chunks and associated embeddings from the embeddings plugin and uses k-means clustering to group the text chunks from the file into clusters in this multi-dimensional embedding space. With this method, we can organize data into distinct groups, or clusters, based on their characteristics, so that chunks with similar content are placed in the same cluster. We then identify major clusters (the main ideas of the file) and concatenate a representative chunk from the corresponding text from each cluster into one blob—the context. Finally, we generate a summary of that context via an LLM.
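A minimal sketch of that selection step using scikit-learn; the chunk texts and embeddings come from the embeddings plugin, and the cluster count and distance details here are illustrative rather than Dropbox's actual parameters:

    import numpy as np
    from sklearn.cluster import KMeans

    def representative_context(chunks: list[str], embeddings: np.ndarray, k: int = 8) -> str:
        """Cluster chunk embeddings and keep the chunk closest to each cluster center."""
        k = min(k, len(chunks))
        kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)
        picked = set()
        for center in kmeans.cluster_centers_:
            picked.add(int(np.argmin(np.linalg.norm(embeddings - center, axis=1))))
        # Keep chunks in their original order so the context reads naturally.
        return "\n\n".join(chunks[i] for i in sorted(picked))

    # The resulting context is then sent to the LLM with a summarization prompt.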

We found k-means clustering was better than alternatives such as a summary-of-summaries approach in a couple of ways:

  • Higher diversity of topics. With a summary-of-summaries (map-reduce) approach, the many intermediate summaries often repeat similar information, and combining them loses a significant amount of the file's overall content. Using k-means clustering, we found that summaries covered a broader range of topics—approximately 50% more than with map-reduce, since we search for chunks that are semantically dissimilar to one another.
  • Lower chance of hallucinations. Every call made to the LLM is an opportunity for hallucination, so summarizing summaries compounds the problem; each map-reduce call also lacks the context provided by the other chunks. When the LLM instead receives all of the selected context in a single call, the likelihood of hallucination decreases significantly. Using the k-means technique, pinpointing errors in the final output—especially when comparing between LLMs or models—becomes much easier because there is just a single LLM call to evaluate.

Our Q&A plugin works in a similar manner to the summarization plugin, but in a somewhat opposite way. The Q&A plugin takes in the embeddings and text chunks from the embedding plugin and generates a new embedding for the user question. Then for each chunk of file text it computes the distance to the query text embedding. By calculating the closeness of each chunk to the query in vector space, the most relevant chunks are selected.

In the summarization plugin, text chunks are selected for dissimilarity; in the Q&A plugin, they are selected for similarity to the query text. These text chunks, along with the query, are sent to the language model to generate the answer.
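A sketch of that retrieval step, again treating embed() as a stand-in for the embedding model and using cosine similarity as the relevance score:

    import numpy as np

    def most_relevant_chunks(question: str, chunks: list[str],
                             embeddings: np.ndarray, embed, top_k: int = 10) -> list[str]:
        """Rank chunks by cosine similarity to the question and keep the closest ones."""
        q = np.asarray(embed(question))
        scores = embeddings @ q / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(q))
        best = np.argsort(scores)[::-1][:top_k]
        return [chunks[i] for i in best]

    # The selected chunks plus the question form the prompt context for the LLM,
    # and their locations in the file are returned to the user as sources.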

The language model uses this context to generate the answer. When a query is received, the model interprets the question and identifies its key elements and intent, then examines the text chunks, which serve as additional context, to extract the relevant information and discern how the different pieces fit together. By integrating the context from the text chunks with the specifics of the query, and leveraging the large-scale language patterns it learned during training, the model can produce a response that is coherent, accurate, and tailored to the question.

The relevant chunk locations are then returned to the user as sources, allowing them to reference the specific parts of the file that contributed to the answer.

As an add-on to both the summarization and Q&A plugins, we also request context-relevant follow-up questions from the LLM. In testing, we found that follow-up questions allow the user to more naturally learn about a file and the topic they are interested in. To gather these follow-up questions, we use function calling and structured outputs to request them from the LLM at the same time we request the summary or the answer to the initial question.
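As a rough sketch of what a combined request could look like (the client, schema, and prompt here are hypothetical, not Dropbox's actual implementation), a single structured-output call can return the answer and the follow-up questions together:

    # Hypothetical response schema and client; the real prompt, client, and model
    # used by Dropbox are not shown here.
    RESPONSE_SCHEMA = {
        "type": "object",
        "properties": {
            "answer": {"type": "string"},
            "follow_up_questions": {
                "type": "array",
                "items": {"type": "string"},
                "maxItems": 3,
            },
        },
        "required": ["answer", "follow_up_questions"],
    }

    def answer_with_follow_ups(llm, question: str, context: str) -> dict:
        """Single structured-output call returning both the answer and follow-ups."""
        prompt = (
            "Answer the question using only the provided context, and suggest up to "
            f"three follow-up questions.\n\nContext:\n{context}\n\nQuestion: {question}"
        )
        return llm.complete(prompt, response_schema=RESPONSE_SCHEMA)  # assumed client API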

Expanding to multiple files

The first iteration of intelligent summaries and Q&A was limited to one file at a time—but we knew we could do better for our customers. We wanted to make LLM-powered understanding possible across collections of files, and not just individual documents.

Expanding these understanding features to multiple files within Dropbox involved a significant evolution of our infrastructure, UI, and algorithms. Building on our existing file processing pipeline, we expanded Riviera’s capabilities to handle multiple files simultaneously inside of a single plugin. The embeddings plugin would still run separately for every file, but the final summarization or Q&A plugin call would need to take in multiple files. The question was: from which subset of the files selected by the user would we extract the relevant information?

When testing this feature, we found that some questions were quite direct (“What is Dropbox?”) while some were quite broad (“Can you explain this contract like I’m 5?”). For best results we found that we needed to tailor our response accordingly. Direct questions can often be answered in a single sentence, while the answers to broader questions are typically longer and potentially sourced from a wider context. The challenge was determining which type of answer was required. Put another way: How should we determine the number of relevant chunks or files to use when answering a request or query?

We eventually came to the conclusion that this was a trick question. You cannot determine whether a question is direct or broad from the question alone; you also need the context the answer is pulled from. The question “What is Dropbox?” could be direct if asked about a list of tech companies, but broad if asked about the Dropbox Wikipedia page.

To solve this problem, we took advantage of power-law dynamics to determine the number of relevant chunks to send to the LLM.

Line A is a direct question, while line B is a broad question

Our solution takes the max and min relevance scores from the top 50 text chunks related to the user query, as calculated through the embeddings, and cuts off the bottom 20% of that spread. This works well because, as shown in the diagram above, direct questions have a steeper power-law curve than broad questions.

For Example A above, let’s say that the most relevant chunk has a score of 0.9 and the least relevant has a score of 0.2. In this case, everything below a score of 0.34 (the 20th percentile of that spread) is discarded. Since the slope is steep, over half of the chunks are discarded, leaving about 15. For Example B—a broader question—let’s say that the most relevant chunk has a score of 0.5 and the least relevant has a score of 0.2. In this case, everything below a 0.26 is discarded, leaving about 40. Since the slope is flatter, more chunks are chosen to send to the LLM.
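A small sketch of that cutoff, reproducing the numbers from the two examples above:

    def relevance_cutoff(scores: list[float], fraction: float = 0.2) -> float:
        """Cut off the bottom 20% of the spread between max and min relevance."""
        top = sorted(scores, reverse=True)[:50]  # top 50 chunks by relevance
        return min(top) + fraction * (max(top) - min(top))

    def select_chunks(chunks_with_scores: list[tuple[str, float]]) -> list[str]:
        cutoff = relevance_cutoff([s for _, s in chunks_with_scores])
        return [c for c, s in chunks_with_scores if s >= cutoff]

    print(relevance_cutoff([0.9, 0.2]))  # ~0.34: the steep, direct question (A)
    print(relevance_cutoff([0.5, 0.2]))  # ~0.26: the flatter, broad question (B)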

Another example could be a quarterly earnings report. For the question “what were the financial results?” many chunks would have medium relevance, resulting in something similar to the B curve above, since it is a broad question. For a question like “how much was spent on real estate?” there would be a few very relevant chunks and many non-relevant chunks—like the A curve above, since it is a more direct question. The first question requires more chunks to answer fully than the second.

This algorithm allowed us to strategically determine which files and chunks within those files were relevant, and thus which context to send to the LLM. Direct questions get a smaller amount of highly relevant context, while broad questions are given more context to expand on the topic.

Building the machine learning capabilities for file understanding within Dropbox involved a series of strategic decisions, technical challenges, and optimizations to ensure efficient and accurate performance and a scalable cost model. These include: 

  • Real-time processing. We had to decide between calculating embeddings and AI responses ahead of time or in real time. Computing these transformations ahead of time allows for lower latency, but we settled on computing responses in real time because of one main factor: it allows the user to choose which files to share with the LLM—and only those files. Privacy and security are top priorities at Dropbox, so this was an easy choice. As a side benefit, computing these requests in real time also saves us the cost of pre-computing requests that a user might never make.
  • Segmentation and clustering. We also found numerous benefits to the segmentation and clustering of requests, versus sending full file text to the LLM. By only sending the most relevant parts of a file to the LLM in a single request, we could return summaries and answers with lower latency and lower cost. Allowing the LLM to focus on the context that matters most to the user’s request also yielded better quality results; we quickly learned through testing that sending garbage into the LLM means garbage out from the LLM.
  • Chunk priority calculation. To optimize token usage and ensure the most relevant information is included in the summary or answer, we calculate a priority tier for each chunk. The first two chunks of the file are given top priority, followed by k-means clustering to select semantically dissimilar chunks for a summary, or chunks semantically similar to the question for Q&A (see the sketch after this list). This approach maximizes the breadth of topics covered in the summary and enhances the quality and relevancy of the answer.
  • Embracing embeddings. This was a crucial decision. Embeddings enabled us to compare passages efficiently in a multi-dimensional space, allowing for more accurate summarization and question answering. For multi-file actions, it also meant that we could pick and choose which files were most relevant for a query out of a list of multiple files. Even though there is a cost in latency and compute to gather these embeddings for each file, they enable a much higher quality response from the LLM.
  • Cached embeddings. In our initial version, embeddings were not cached, resulting in multiple calls to the LLM for the same document. In the subsequent version, embeddings are cached, reducing the number of API calls and improving performance. This optimization also allows summaries and Q&A to use the same chunks and embeddings, further enhancing efficiency.
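A simplified sketch of that chunk prioritization, assuming the ranked chunk indices (cluster representatives for summaries, or question-similar chunks for Q&A) and the chunk budget are computed elsewhere:

    def prioritize_chunks(chunks: list[str], ranked_indices: list[int], budget: int) -> list[int]:
        """Pick chunk indices within a budget: the file's first two chunks form the
        top priority tier, then chunks are added in ranked order."""
        selected: list[int] = []
        for i in [0, 1] + ranked_indices:
            if len(selected) >= budget:
                break
            if i < len(chunks) and i not in selected:
                selected.append(i)
        return selected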

All of this work has yielded significant improvements in terms of cost and latency since the start of the project. Cost-per-summary dropped by 93%, and cost-per-query dropped by 64%. The p75 latency of summaries decreased from 115 seconds to 4 seconds, and the p75 latency for queries decreased from 25 seconds to 5 seconds. These optimizations have not only made summarization and Q&A more affordable for Dropbox, but more responsive for our users.

File previews with AI-powered summaries and Q&A are available in early access on select Dropbox plans. Visit our Help Center to learn more.
 

~ ~ ~
 

If building innovative products, experiences, and infrastructure excites you, come build the future with us! Visit dropbox.com/jobs to see our open roles, and follow @LifeInsideDropbox on Instagram and Facebook to see what it's like to create a more enlightened way of working.
