Build a Powerful Multi-Modal Search with Voyager-3 and LangGraph

Embedding images and text in the same vector space allows us to perform highly accurate searches over multimodal content such as web pages, PDF files, magazines, books, brochures, and papers. Why is this technique so interesting? Because once text and images live in the same space, you can search for and retrieve text related to a particular image and vice versa. For example, if you search for a cat, you will find pictures showing a cat, but you will also get the texts that refer to those images, even if the text never explicitly says the word "cat".

Let me show the difference between a traditional text-embedding similarity search and a search over a multi-modal embedding space:

Example Question: What does the magazine say about cats?

Screenshot from a photography magazine — OUTDOOR

Regular Similarity Search Answer

The search results provided do not contain specific information about cats. They mention animal portraits and photography techniques, but there is no explicit mention of cats or details related to them.

As shown in the image above, the word “cat” is not mentioned; there is just an image and an explanation of how to take pictures of animals. The regular similarity search yielded no results since the word “cat” was not mentioned.

Multi-Modal Search Answer

The magazine features a portrait of a cat, highlighting the detailed capture of its facial features and character. The text emphasizes how well-done animal portraits can delve into the subject’s soul and create an emotional connection with the viewer through compelling eye contact.

Using the multi-modal search, we find the picture of the cat and then link the correlated text to it. Feeding that data to the model enables it to answer the question and understand the context much better.

How to build a multimodal embedding and retrieval pipeline

Now, I will describe how such a pipeline works in a few steps:

  1. We will extract text and images from a PDF file using Unstructured, a powerful Python library for data extraction.
  2. We will use the Voyager Multimodal 3 model to create multimodal vectors for text and images within the same vector space.
  3. We will insert that into a vector store (Weaviate).
  4. Finally, we will perform a similarity search and compare the results of text and images.

Step 1: Set up a vector store and extract both images and text from a file (PDF)

Here, we have to do some manual work. Weaviate is normally a very easy-to-use vector store that converts data and adds embeddings automatically when you insert it. However, there is no plugin for Voyager Multimodal 3, so we have to calculate the embeddings ourselves. In this case, we create a collection without defining a vectorizer module.

import weaviate
from weaviate.classes.config import Configure

client = weaviate.connect_to_local()

collection_name = "multimodal_demo"

client.collections.delete(collection_name)

try:
    client.collections.create(
        name=collection_name,
        vectorizer_config=Configure.Vectorizer.none()
    )
    collection = client.collections.get(collection_name)
except Exception:
    collection = client.collections.get(collection_name)

Here I run a local Weaviate instance in a Docker container.

Step 2: Extracting documents and images from the PDF

This is a crucial step for the process to work. Here, we will take a PDF containing both text and pictures. Then, we will extract the content (images and text) and store it in relevant chunks. So, each chunk will be a list of elements containing strings (the actual text) and Python PIL images.

We will use the Unstructured library to do some heavy lifting, but we still need to write some logic and configure the library parameters.

from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title

elements = partition(
    filename="./files/magazine_sample.pdf",
    strategy="hi_res",
    extract_image_block_types=["Image", "Table"],
    extract_image_block_to_payload=True
)

chunks = chunk_by_title(elements)

Here, we must use the strategy hi_res and export the images to the payload using extract_image_block_to_payload, as we need this information later for our actual embeddings. Once we have all our elements extracted, we will group them into chunks based on the titles within the documents.

Check the Unstructured documentation on chunking for more information.
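Before embedding anything, it can help to sanity-check what the chunking produced. The short loop below is just an illustrative sketch that assumes the chunks list created above; it prints each chunk's element type and the start of its text.

# Illustrative: inspect the first few chunks produced by chunk_by_title
for chunk in chunks[:3]:
    chunk_dict = chunk.to_dict()
    print(chunk_dict["type"], "-", chunk_dict["text"][:80])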

In the following script, we will use those chunks to output two lists:

  1. A list containing the objects we will send to Voyager 3 to create our vectors
  2. A list containing the metadata extracted by Unstructured. This metadata is necessary because we must add it to the vector store. It will provide us with extra attributes for filtering and tell us something about the retrieved data.

from unstructured.staging.base import elements_from_base64_gzipped_json
import PIL.Image
import io
import base64

embedding_objects = []
embedding_metadatas = []

for chunk in chunks:
    embedding_object = []
    metadata_dict = {
        "text": chunk.to_dict()["text"],
        "filename": chunk.to_dict()["metadata"]["filename"],
        "page_number": chunk.to_dict()["metadata"]["page_number"],
        "last_modified": chunk.to_dict()["metadata"]["last_modified"],
        "languages": chunk.to_dict()["metadata"]["languages"],
        "filetype": chunk.to_dict()["metadata"]["filetype"]
    }

    # The chunk text is always part of the embedding input
    embedding_object.append(chunk.to_dict()["text"])

    # If the chunk carries original elements, pull out any embedded images
    if "orig_elements" in chunk.to_dict()["metadata"]:
        base64_elements_str = chunk.to_dict()["metadata"]["orig_elements"]
        eles = elements_from_base64_gzipped_json(base64_elements_str)
        image_data = []

        for ele in eles:
            if ele.to_dict()["type"] == "Image":
                base64_image = ele.to_dict()["metadata"]["image_base64"]
                image_data.append(base64_image)
                pil_image = PIL.Image.open(io.BytesIO(base64.b64decode(base64_image)))

                # Downscale large images before embedding
                if pil_image.size[0] > 1000 or pil_image.size[1] > 1000:
                    ratio = min(1000 / pil_image.size[0], 1000 / pil_image.size[1])
                    new_size = (int(pil_image.size[0] * ratio), int(pil_image.size[1] * ratio))
                    pil_image = pil_image.resize(new_size, PIL.Image.Resampling.LANCZOS)

                embedding_object.append(pil_image)

        metadata_dict["image_data"] = image_data

    embedding_objects.append(embedding_object)
    embedding_metadatas.append(metadata_dict)

The result of this script will be a list of lists whose contents look similar to this:

[['FROM\n\nON LOCATION KIRKJUFELL, ICELAND', <PIL.Image.Image image mode=RGB size=1000x381>, <PIL.Image.Image image mode=RGB size=526x1000>],

['This iconic mountain was on the top of our list of places to shoot in Iceland, and we had seen many images taken from the nearby waterfalls long before we went there. So this was the first place we headed to at sunrise - and we weren't disappointed. The waterfalls provided the perfect foreground interest for this image (top), and Kirkjufell is a perfect pointed mountain from this viewpoint. We spent an hour or two simply exploring these waterfalls, finding several different viewpoints.']]

Step 3: Vectorizing the extracted data

In this step, we take the chunks we created in the previous step and send them to Voyager using the Voyager Python package. It returns a list of embedding vectors, which we can then store in Weaviate.

from dotenv import load_dotenv
import voyageai

load_dotenv()

vo = voyageai.Client()

inputs = embedding_objects

result = vo.multimodal_embed(
    inputs,
    model="voyage-multimodal-3",
    truncation=False
)

If we access result.embeddings, we get a list of lists containing all the calculated embedding vectors:

[[-0.052734375, -0.0164794921875, 0.050048828125, 0.01348876953125, -0.048095703125, …]]
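As a small sanity check, assuming only the result object from above, you can confirm that there is one vector per chunk; the length of each vector is determined by the model's output dimension.

# One embedding per chunk; each embedding is a flat list of floats
print(len(result.embeddings), len(result.embeddings[0]))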

We can now store this embedding data inside Weaviate in a single batch using the `batch.add_object` method. Note that we also add the metadata in the properties parameter.

with collection.batch.dynamic() as batch:
    for i, data_row in enumerate(embedding_objects):
        batch.add_object(
            properties=embedding_metadatas[i],
            vector=result.embeddings[i]
        )

Step 4: Querying the data

We can now perform a similarity search and query the data. This is easy since this flow is similar to regular similarity searches over text embeddings. Because Weaviate doesn’t have a module for Voyager multimodal, we have to calculate the vectors for our search query ourselves before passing the search vectors to Weaviate to perform the similarity search.

from weaviate.classes.query import MetadataQuery

question = "What does the magazine say about waterfalls?"

# Embed the query with the same multimodal model before searching
vector = vo.multimodal_embed([[question]], model="voyage-multimodal-3")

response = collection.query.near_vector(
    near_vector=vector.embeddings[0],
    limit=2,
    return_metadata=MetadataQuery(distance=True)
)

from IPython.display import display

for o in response.objects:
    print(o.properties['text'])
    for image_data in o.properties['image_data']:
        img = PIL.Image.open(io.BytesIO(base64.b64decode(image_data)))
        width, height = img.size

        # Downscale large images before displaying them in the notebook
        if width > 500 or height > 500:
            ratio = min(500 / width, 500 / height)
            new_size = (int(width * ratio), int(height * ratio))
            img = img.resize(new_size)

        display(img)
    print(o.metadata.distance)

The picture below shows that searching for waterfalls returns both text and images related to the query. As you can see, the photos show a waterfall, but the text doesn't explicitly mention one. The text describes a picture that contains a waterfall, which is why it is also retrieved. This is not possible with regular text embedding searches.

A picture displaying the search results from the similarity search
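Because the Unstructured metadata was stored as object properties, the same vector search can also be narrowed with a metadata filter. The snippet below is only a sketch of that idea; the page number is an arbitrary example value.

# Sketch: combine the multimodal vector search with a metadata filter
from weaviate.classes.query import Filter

filtered_response = collection.query.near_vector(
    near_vector=vector.embeddings[0],
    limit=2,
    filters=Filter.by_property("page_number").equal(7),  # example value
    return_metadata=MetadataQuery(distance=True)
)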

Step 5: Adding this to an entire retrieval pipeline

Now that we have extracted text and images from the magazine, created embeddings for them, added them to Weaviate, and set up our similarity search, I will add this to a complete retrieval pipeline. For this example, I will be using LangGraph. The user will ask a question about this magazine, and the pipeline will answer it. Now that all the groundwork is done, this part is as easy as setting up a typical retrieval pipeline over regular text.

I have abstracted some of the logic we discussed in the previous sections into other modules. This is an example of how I would integrate this into a LangGraph pipeline.

from typing import Annotated, List, Sequence
from typing_extensions import TypedDict

from langchain_core.documents import Document
from langchain_core.messages import AIMessage, BaseMessage
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages

# BaseNodes, Weaviate, and logger come from the helper modules mentioned above.

class MultiModalRetrievalState(TypedDict):
    messages: Annotated[Sequence[BaseMessage], add_messages]
    results: List[Document]
    base_64_images: List[str]

class RAGNodes(BaseNodes):
    def __init__(self, logger, mode="online", document_handler=None):
        super().__init__(logger, mode)
        self.weaviate = Weaviate()
        self.mode = mode

    async def multi_modal_retrieval(self, state: MultiModalRetrievalState, config):
        collection_name = config.get("configurable", {}).get("collection_name")
        self.weaviate.set_collection(collection_name)

        print("Running multi-modal retrieval")
        print(f"Searching for {state['messages'][-1].content}")

        results = self.weaviate.similarity_search(
            query=state["messages"][-1].content, k=3, type="multimodal"
        )

        return {"results": results}

    async def answer_question(self, state: MultiModalRetrievalState, config):
        print("Answering question")
        llm = self.llm_factory.create_llm(mode=self.mode, model_type="default")
        include_images = config.get("configurable", {}).get("include_images", False)

        chain = self.chain_factory.create_multi_modal_chain(
            llm,
            state["messages"][-1].content,
            state["results"],
            include_images=include_images,
        )

        response = await chain.ainvoke({})
        message = AIMessage(content=response)

        return {"messages": message}

class GraphConfig(TypedDict):
    mode: str = "online"
    collection_name: str
    include_images: bool = False

graph_nodes = RAGNodes(logger)

graph = StateGraph(MultiModalRetrievalState, config_schema=GraphConfig)

graph.add_node("multi_modal_retrieval", graph_nodes.multi_modal_retrieval)
graph.add_node("answer_question", graph_nodes.answer_question)

graph.add_edge(START, "multi_modal_retrieval")
graph.add_edge("multi_modal_retrieval", "answer_question")
graph.add_edge("answer_question", END)

multi_modal_graph = graph.compile()

__all__ = ["multi_modal_graph"]

The code above results in the following graph:

A visual representation of the graph created

In this trace, you can see that both the contents and the images were sent to OpenAI to answer the question.
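For reference, here is a minimal sketch of how the compiled graph could be invoked; the config keys mirror the GraphConfig definition above, and the collection name is the one created in Step 1.

# Minimal usage sketch (assumed invocation pattern, run inside an async context)
from langchain_core.messages import HumanMessage

output = await multi_modal_graph.ainvoke(
    {"messages": [HumanMessage(content="What does the magazine say about cats?")]},
    config={"configurable": {"collection_name": "multimodal_demo", "include_images": True}},
)

print(output["messages"][-1].content)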

Conclusion

Multimodal embeddings open up possibilities for integrating and retrieving information across different data types within the same embedding space, such as text and images. By combining cutting-edge tools like the Voyager Multimodal 3 model, Weaviate, and LangGraph, we’ve demonstrated how to build a robust retrieval pipeline capable of understanding and linking content more intuitively than traditional text-only methods.

This approach significantly improves search and retrieval accuracy for diverse data sources like magazines, brochures, and PDFs. It also demonstrates how multimodal embeddings provide richer, context-aware insights, connecting images to descriptive texts even when explicit keywords are absent. This tutorial should give you a starting point to explore and adapt these techniques for your own projects.

Example Notebook: https://github.com/vectrix-ai/vectrix-graphs/blob/main/examples/multi-model-embeddings.ipynb

Don't hesitate to post a comment below if you have any questions; I'll be more than happy to answer them.

You can also contact me on LinkedIn or check out our company: https://vectrix.ai
