Unlocking knowledge sharing for videos with RAG

At Vimeo, we’re highly invested in providing the most powerful way for our users to pull value from their videos. We believe that one effective way to achieve this is by enabling our viewers to converse with video content in natural language. While in the past, such a task could have been deemed as too ambitious, generative AI has revolutionized the field and has made such chat interactions almost commonplace.

Consider, for example, the 15-minute video of Steve Jobs’ 2005 Stanford Commencement Address. Say you haven’t watched the video before, and you don’t have 15 minutes to watch it now, but you need a summary of the information that the video contains. You can start by asking our video Q&A: “What is the video about?” In return you get a brief description of the video and some relevant playable moments that you can click to watch. You also get some related questions that you can select to dig deeper into any given topic, or you can type new questions of your own. See Figure 1 for details.

Figure 1. The Vimeo video Q&A system can summarize the content of a lengthy video instantly, link to key moments, and suggest additional questions.

The Vimeo video Q&A system is implemented using one of the most recent developments in generative AI called retrieval augmented generation, or RAG for short. This method has become so popular lately that many, including NVIDIA, Custom GPT, and ZS, have crowned 2024 as “the year of RAG.” In this blog post, we present how we apply the RAG approach to videos and explore some of the unique problems we face in this specific domain.

Why RAG?

The novelty of RAG lies in its ability to supply the large language model, or LLM, with context from a specific domain or dataset that the LLM wouldn’t otherwise have knowledge of.

RAG has several important advantages over using a standard LLM on its own. It greatly reduces hallucinated responses, since the answer is grounded in the context provided to the LLM. Moreover, providing specific context to the LLM is especially important for business use cases or when dealing with private user data or knowledge bases: an example of the former is a customer support bot, while an example of the latter is extracting information from medical reports.

Another advantage of RAG is that it enables you to deal with long texts efficiently, since it retrieves only a small relevant portion of the text as context for the LLM. The alternative would be to provide the LLM with the full-length text every time it processes a request, which is impractical for chatbots, where fast response time is essential.

RAG is especially useful for answering questions over a specific textual database. To achieve this, the text is first split into small chunks, and then each chunk is embedded and stored in a vector database. When the chatbot receives a question from a user, the question is first converted into a vector embedding and then queried against the vector database to get the best matches via a nearest-neighbor search. Finally, the matches form a context for the LLM, which uses it to generate an answer to the question. So the retrieval component of RAG retrieves the context from the vector database, and that context then augments the prompt given to the LLM for the generation component.
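To make this flow concrete, here’s a minimal sketch of the retrieve-then-generate loop in Python. The `embed_fn` and `llm_fn` callables and the in-memory `index` are stand-ins for whatever embedding model, LLM, and vector store you use; this is an illustration, not Vimeo’s production code.

```python
from typing import Callable, Dict, List
import numpy as np

def answer_question(
    question: str,
    index: List[Dict],                      # items: {"text": str, "vector": np.ndarray}
    embed_fn: Callable[[str], np.ndarray],  # placeholder for your embedding model
    llm_fn: Callable[[str], str],           # placeholder for your LLM call
    top_k: int = 5,
) -> str:
    # Retrieval: embed the question and rank stored chunks by cosine similarity.
    q = embed_fn(question)
    scored = sorted(
        index,
        key=lambda item: float(
            np.dot(q, item["vector"]) / (np.linalg.norm(q) * np.linalg.norm(item["vector"]))
        ),
        reverse=True,
    )

    # Augmentation: the best-matching chunks become the context for the prompt.
    context = "\n\n".join(item["text"] for item in scored[:top_k])

    # Generation: the LLM answers strictly from the retrieved context.
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm_fn(prompt)
```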

RAG is typically used to process documents, but in principle any data that can be transformed into text can be processed by RAG. For example, the audio track of a video file can be transcribed into a text file of what is said, and an image can be represented by a textual description of the visual information that the image contains. Figure 2 shows a diagram of the entire RAG process in the context of video. This is the overall approach that we took when designing our video Q&A system.

Figure 2. Diagram of how the Q&A system works. The video transcript is processed and saved in a vector database. When we get a question from the user, we retrieve the relevant context from the database and show it to the LLM to generate the final answer.

As a first step in implementing our chatbot, we needed to decide how to represent the video content. We decided to rely solely on the transcript of the spoken words in the video. We currently don’t use any visual information to build the summary, although we plan to add it in the future to enhance our product.

We chose to use the transcript for two main reasons. The first is that each video uploaded to Vimeo is already transcribed to generate automatic closed captions, so there’s no need to implement anything new. The second reason is that we decided to focus our product mainly on knowledge-sharing videos, such as meetings, lectures, presentations, and tutorials; our experiments show that the transcript itself is already able to answer most of the important questions about the video in these cases.

So the transcript of the video is the textual entity that we process, chunk, and represent in the vector database. We retrieve from it the context used to answer questions about the video. As we discuss later, one of the main issues we have to solve is how exactly to process and chunk the transcript. A valid question here is: will splitting the transcript into small overlapping chunks yield reasonable results for a wide range of questions?

When you’re building a free-text user interface, you need to account for the wide variety of inputs it can receive while also anticipating the most common ones. In the context of videos, encouraging viewers to pose questions revealed an interesting pattern. Several question types consistently popped up in our testing:

  • General. Questions about what’s happening in the video in general or asking to summarize the video as a whole or specific topics in it.
  • Speakers. Questions about the speakers in the video or about what a specific speaker is saying.
  • Specific details. Questions about specific details mentioned in the video or wanting to navigate to a specific part of the video.

Notice that answering the first two types of questions requires using large portions of the video or perhaps even the entire video, while answering the third type of question typically requires a small portion of the video, maybe just a couple sentences that contain the requested detail.

A bottom-up approach for transcript database registration

A classical approach that is common in machine learning is called bottom-up processing. It originates from cognitive science. The idea is that, when processing a signal, operations are performed iteratively from a bottom layer of representation to higher levels of representation that have a larger semantic meaning. The bottom levels are obtained using operations with a small context window, whereas the higher levels are obtained using larger context windows. This is achieved by merging and processing neighboring results — in other words, grouping results that are close in time or space — to obtain a new level of representation. In this way, at each level of analysis, you effectively enlarge the size of the context window. Figure 3 shows how we apply this bottom-up approach to our transcript registration process.

Figure 3. Diagram of how the video transcript is processed and saved in the vector database. We use several sizes of context windows, summarize long context, and create a description for the entire video.

We start with a standard chunking of the transcript into groups of sentences containing 100 to 200 words (corresponding to 1–2 minutes of playback time). These chunks form the bottom level of processing. We then take larger chunks of 500 words and use an LLM to summarize each of them into roughly 100 words, asking the LLM to focus on specific details, numbers, dates, and speaker names and to include them in the summary. These summaries form the middle level of representation. The bottom and middle levels are good for answering questions about specific details, whereas the middle level can also answer questions about entire sections or topics in the video. Then we take all the summaries together and generate a description of the entire video that focuses on the most important topics. This is the highest level of representation; it captures the video as a whole but contains almost no information about specific details, most of which are lost at this stage.
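As a rough illustration of this bottom-up pass (the word counts follow the numbers above, but the `summarize_fn` helper is a placeholder for your LLM call, and the sentence-boundary handling we actually use is omitted), the three levels might be built like this:

```python
from typing import Callable, Dict, List

def chunk_words(words: List[str], size: int) -> List[str]:
    """Split a word list into consecutive chunks of roughly `size` words."""
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def build_levels(transcript: str, summarize_fn: Callable[[str, str], str]) -> Dict:
    words = transcript.split()

    # Bottom level: raw transcript chunks of ~100-200 words (1-2 minutes of speech).
    bottom = chunk_words(words, 150)

    # Middle level: ~500-word chunks summarized down to ~100 words each,
    # keeping specific details, numbers, dates, and speaker names.
    middle = [
        summarize_fn(chunk, "Summarize in about 100 words; keep details, numbers, dates, and speaker names.")
        for chunk in chunk_words(words, 500)
    ]

    # Top level: one description of the entire video, built from the summaries.
    top = summarize_fn("\n".join(middle), "Describe the whole video, focusing on the most important topics.")

    return {"bottom": bottom, "middle": middle, "top": top}
```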

Finally, we take all the textual representations we get from all levels and register them into one large vector database. Each item in the database consists of:

  • The textual representation of the transcript chunk, either the raw transcript text (for the 100- to 200-word chunks) or the processed text summary (for the 500-word chunks and the video description).
  • The vector embedding of the text representation.
  • The original word timings of the transcript chunk; that is, the times at which each word starts and ends in the video.
  • The start and end timestamps of the transcript chunk; for example, minutes 3:00–5:00 in the video.
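For concreteness, one way to model such a record is a simple dataclass; the field names below are illustrative, not our actual schema:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TranscriptChunkRecord:
    # Raw transcript text (100- to 200-word chunks) or the LLM-processed summary
    # (500-word chunks and the whole-video description).
    text: str
    # Vector embedding of `text`, used for nearest-neighbor retrieval.
    embedding: List[float]
    # Per-word (start, end) timings from the original transcript, in seconds.
    word_timings: List[Tuple[float, float]]
    # Start and end of the chunk within the video, in seconds (e.g., 180.0 to 300.0).
    start_time: float
    end_time: float
```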

Detecting speakers without facial recognition

Since we feed our video RAG only the transcript and want to avoid facial recognition methods, we rely on the technique of speaker indexing to cluster the video’s audio into different speakers.

In the context of answering questions about knowledge-sharing videos, particularly formats such as corporate or sales meetings, questions are often aimed at particular speakers. However, we can’t refer to auxiliary sources to access name data for these speakers: this information isn’t provided as metadata, nor is it associated with or uploaded alongside the video content in any manner. Moreover, the video transcript is unstructured: it simply transcribes what everyone is saying without specifying who’s saying what.

Consider the following example from The Lion King (1994), where Pumbaa, a warthog, says, “I ate like a pig,” to which the lion king Simba replies, “Pumbaa, you are a pig.”

The goal is to convert the unstructured transcript into a structured dialog that includes the names of the speakers. The first step in achieving this is to segment the conversation into different speakers, which can be done by clustering the audio. The result is that each sentence is associated with a specific numerical speaker ID:

Speaker 1: “I ate like a pig.”

Speaker 2: “Pumbaa, you are a pig.”

The next step is to identify the name associated with each speaker ID. This presents us with a host of challenges. Unstructured transcripts consisting of free speech often feature multiple name references in a single sentence or mention a speaker who isn’t included in the conversation. In informal discussions, some speakers might not be identified by name at all, or various names or nicknames could be used to refer to the same individual, and so on.

Our research shows that the most common place to find the names of speakers in conversational videos is during conversation transitions from one person to another. These transitions, characterized by the handover of conversational turns between participants, hold a lot of name data. These are the areas in the conversation where it’s most common to find a speaker’s self-introduction, one speaker presenting the next speaker, or a speaker thanking the previous speaker. Therefore, we focus on identifying and isolating speaker transitions and extracting the names of speakers from these transitions.
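Assuming diarization has already produced an ordered list of (speaker ID, sentence) pairs, isolating these transitions is simply a scan for points where the speaker ID changes. A minimal sketch (the names and data shapes here are hypothetical):

```python
from typing import List, Tuple

# Hypothetical diarization output: (speaker_id, sentence) pairs in spoken order.
Utterance = Tuple[int, str]

def find_transitions(utterances: List[Utterance]) -> List[Tuple[Utterance, Utterance]]:
    """Return consecutive utterance pairs where the speaker ID changes."""
    return [
        (prev, curr)
        for prev, curr in zip(utterances, utterances[1:])
        if prev[0] != curr[0]
    ]

# With the dialog from the post:
dialog = [(1, "I ate like a pig."), (2, "Pumbaa, you are a pig.")]
assert find_transitions(dialog) == [((1, "I ate like a pig."), (2, "Pumbaa, you are a pig."))]
```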

For example, if you look at the conversation transition between Pumbaa and Simba, you can see that Simba is referring to Pumbaa by name, so it’s possible to conclude the name of the first speaker:

Pumbaa: “I ate like a pig.”

Speaker 2: “Pumbaa, you are a pig.”

Note that you can disregard or mask Pumbaa’s statement without affecting name identification:

Pumbaa: [ — — — ]

Speaker 2: “Pumbaa, you are a pig.”

This is just one of the prompts that we use to detect the names of speakers (see Figure 4).

Figure 4. Diagram of how the speaker detection algorithm works. We use ChatGPT to analyze the text around speaker transitions and associate names to speakers if possible.

The name detection mechanism is based on votes generated using an LLM. For every transition, we request the LLM to identify speaker names by analyzing each side of the transition individually. This is achieved by masking the first or second speaker based on the prompt type. Masking is effective in reducing the noise of irrelevant names and reducing the context provided to the LLM.

We also ask for an identification based on all transitions attributed to a single speaker. Each identification corresponds to one vote. We then apply a decision function that aggregates the votes, filters out erroneous ones, and determines whether the speaker’s name corresponding to the numerical ID can be established with a high level of confidence.

You might be wondering why we need so many votes. Based on our experience, when decisions are made based on just one vote, there’s a significant increase in the likelihood of misidentification. In our view, assigning a name to a speaker ID should only be done when there is a high degree of certainty in the accuracy of the identification. We prefer to leave the speaker’s name unspecified rather than attributing an incorrect name.
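To illustrate, here is a minimal sketch of such a vote-aggregation step; the thresholds and tie handling are illustrative choices, not our production logic:

```python
from collections import Counter
from typing import List, Optional

def decide_speaker_name(
    votes: List[Optional[str]],   # one name (or None) per LLM vote for this speaker ID
    min_votes: int = 2,           # illustrative: require at least this many named votes
    min_share: float = 0.6,       # illustrative: the winning name must dominate
) -> Optional[str]:
    """Return a name only when the votes clearly agree; otherwise leave the speaker unnamed."""
    named = [v.strip().lower() for v in votes if v]
    if len(named) < min_votes:
        return None
    name, count = Counter(named).most_common(1)[0]
    return name if count / len(named) >= min_share else None

# decide_speaker_name(["Pumbaa", None, "pumbaa", "Timon"]) -> "pumbaa"
```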

If you look closely at Figure 4, you see that we use four different prompts to get several name votes for each speaker ID. The first prompt asks if the LLM can identify the name of Speaker 2 from what Speaker 1 just said, like this:

Speaker 1: “Take it away, Rachel.”

Speaker 2: [ — — — ]

The second prompt asks if the LLM can identify the name of Speaker 1 from what Speaker 2 says immediately afterward, like this:

Speaker 1: [ — — — ]

Speaker 2: “Thanks, David.”

That’s the pattern that gives us Pumbaa’s name.

The third prompt asks the LLM if the speaker is giving a self-introduction:

Speaker 2: “Hi, everyone, great to be here. My name is Rachel.”

The fourth prompt asks the LLM to identify the name of a specific speaker by passing all transitions of this speaker at once. That’s how to determine Simba’s name, which the LLM can’t ascertain from what either Pumbaa or Simba says in the current context. But Simba’s name does come up in a different context earlier and later in the movie:

Speaker 2: “Whoa! Nice one, Simba.”

Speaker 1: “Thanks. Man, I’m stuffed.”

Speaker 3: “Me too. I ate like a pig.”

Speaker 1: “Pumbaa, you are a pig.”

Speaker 3: “Simba, what do you think?”

Speaker 1: “Well, I don’t know.”

So from these transitions of Speaker 1, the LLM can deduce that Speaker 1 is actually Simba.

The result is a structured transcript:

Pumbaa: “I ate like a pig.”

Simba: “Pumbaa, you are a pig.”

Finding accurate reference points in the video

Since we’re dealing with videos, there’s significant value in not only answering the question in a textual fashion but also referring the viewer to watch moments in the video that support the answer (see Figure 5). These playable moments can convey additional information that doesn’t necessarily appear in the textual answer, including details that were left out, but also visual information, tone of voice, music, emotions, and other potentially relevant factors.

Figure 5. Relevant playable moments are all cued up when the viewer asks a question about a video.

In our experiments, separating the tasks of answering the question and finding references for the answer leads to much better performance than trying to perform both tasks in the same prompt; at least, that was our observation when testing with ChatGPT 3.5. We assume that the LLM suffers from some capacity issues when trying to perform both at once.

We use two prompts: one prompt that answers the question, followed by a second prompt that finds the relevant quotes in the transcript given the question and answer. Both prompts refer to the same context retrieved from the database. The first prompt uses the text that was stored in the database (whether it’s a transcript text or a processed text summary as described above). The second prompt uses the original transcript text that was stored for it in the database, since it’s designed to find quotes that actually appear in the video.

To make the second prompt more efficient, we embed the chat answer into a vector representation and then compute its similarity with all the retrieved matches that formed our original context for the question. We then form a new context consisting only of the matches with sufficient similarity to the chat answer. This also gives us higher confidence that the extracted quotes actually align with the chat answer.
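A minimal sketch of that filtering step could look like the following; the similarity threshold is illustrative, and `embed_fn` stands in for the same embedding model used at registration time:

```python
from typing import Callable, Dict, List
import numpy as np

def filter_matches_by_answer(
    answer: str,
    matches: List[Dict],                    # chunks originally retrieved for the question
    embed_fn: Callable[[str], np.ndarray],  # same embedding model used for retrieval
    min_similarity: float = 0.75,           # illustrative threshold, not a tuned value
) -> List[Dict]:
    """Keep only the retrieved chunks that are close to the generated answer."""
    a = embed_fn(answer)
    a = a / np.linalg.norm(a)
    kept = []
    for match in matches:
        v = np.asarray(match["vector"], dtype=float)
        if float(np.dot(a, v / np.linalg.norm(v))) >= min_similarity:
            kept.append(match)
    return kept
```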

Guiding the viewers on how to ask questions

To achieve a more engaging viewer experience, we populate our chatbot with several pregenerated questions and answers. The goal is to let viewers who don’t know what questions to ask, or who just feel overwhelmed by having to think of one, experience our Q&A product. We compute these pregenerated questions and answers during the transcript registration process, in the same way that we generate the video description: we take all the summary chunks together and ask the LLM to generate the most important question-answer pairs about the video (see Figure 6).

Figure 6. The Q&A system also generates its own questions.
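As a sketch of how this pregeneration step could look (the prompt wording and the `llm_fn` helper are illustrative assumptions, not our exact prompt):

```python
from typing import Callable, List

def pregenerate_qa(summaries: List[str], llm_fn: Callable[[str], str], n_pairs: int = 5) -> str:
    """Ask the LLM for the most important question-answer pairs about the video."""
    prompt = (
        "Below are summaries of consecutive sections of a video.\n\n"
        + "\n\n".join(summaries)
        + f"\n\nWrite the {n_pairs} most important questions a viewer might ask about this video, "
        "each followed by a concise answer grounded only in the summaries."
    )
    return llm_fn(prompt)
```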

In addition, if the viewer already asked a question or clicked one of the pregenerated questions, we want to keep them engaged by suggesting more related questions that can be asked on the same topic. The problem with this is that we need to be sure that the related questions that have been generated are indeed questions that can be answered in the context of the video. Since this is a closed-world scenario, if we just ask the LLM to create related questions given only the user question, we have no guarantee that those can be answered by our RAG-based chatbot.

Therefore, we also use RAG to create these questions. We start by embedding the chat answer into a vector representation, just as described earlier. We then query the vector database and retrieve the relevant context based on its similarity to the chat answer. Some of the retrieved matches are new, previously unseen matches that are close to the chat answer even though they weren’t necessarily close to the input question. This lets us widen the context to topics that are close to the original question. Finally, we ask the LLM to create related questions that have the following characteristics:

  • The questions are related to the topic of the original viewer question.
  • The questions aren’t already answered in the chat answer. We don’t want to suggest trivial questions that the viewer already knows the answer to from the current chat response.
  • The questions can be answered from the given retrieved context. We don’t want to suggest an unanswerable question.

See Figure 7 for an example.

Figure 7. Based on the question that the viewer asks, our system suggests related questions.
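Putting these steps together, a sketch of the related-question flow might look like this; `index`, `embed_fn`, and `llm_fn` are the same kind of placeholders used in the earlier sketches, and the prompt wording is illustrative:

```python
from typing import Callable, Dict, List
import numpy as np

def suggest_related_questions(
    question: str,
    answer: str,
    index: List[Dict],                      # the per-video vector database
    embed_fn: Callable[[str], np.ndarray],
    llm_fn: Callable[[str], str],
    top_k: int = 8,
) -> str:
    # Retrieve context around the *answer*, which can surface chunks that the
    # original question did not match but that cover nearby topics.
    a = embed_fn(answer)
    scored = sorted(
        index,
        key=lambda item: float(
            np.dot(a, item["vector"]) / (np.linalg.norm(a) * np.linalg.norm(item["vector"]))
        ),
        reverse=True,
    )
    context = "\n\n".join(item["text"] for item in scored[:top_k])

    prompt = (
        f"Context from the video:\n{context}\n\n"
        f"The viewer asked: {question}\n"
        f"They received this answer: {answer}\n\n"
        "Suggest follow-up questions that stay on the same topic, are not already "
        "answered above, and can be answered from the context."
    )
    return llm_fn(prompt)
```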

In conclusion

In this blog post, we were happy to share our journey with the latest advances in LLMs and RAG and how we used them to build Vimeo’s new video Q&A system. We aimed to show the unique aspects of solving this problem in the video domain, including the need to detect speaker names and to answer questions across a wide range of context lengths in the video. We also showed how we keep the viewer engaged by automatically generating new questions for them to ask. We really enjoyed delving into this fascinating and emerging field and hope to foster more innovation in this area.

Acknowledgments

I would like to thank Yedidya Hyams and Naama Ben-Dor for their great work on this project. It certainly took long days and nights of prompt battling, endless testing, and a healthy sense of perfectionism to get this product to where it is today.
