Building a Deep Research Agent from Scratch

👋 I am Aurimas. I write the SwirlAI Newsletter with the goal of presenting complicated Data-related concepts in a simple and easy-to-digest way. My mission is to help You UpSkill and keep You updated on the latest news in GenAI, MLOps, Data Engineering, Machine Learning and overall Data space.

The big topic of last month was Deep Research Agents, which every large player in the LLM industry is building and trying to monetise. The big catalyst for this was the emergence of the DeepSeek R1 reasoning model and its open-source nature.

In this episode of the Newsletter we are going to build such a Deep Research Agentic system from scratch. It will allow us to strengthen our fundamental knowledge of how these upcoming agentic systems actually function under the hood.

We are also going to run our system utilising one of the previously mentioned LLMs, DeepSeek R1.

This brings us to our sponsor of today: SambaNova. The platform will allow us to run and test our system for free. Registering on the platform will give you $5 of credits with no credit card details needed, which will be enough to run the project.

![SambaNova](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F028517d4-05d7-44d1-9d01-41c0d2c218e1_3972x660.png)

SambaNova provides a selection of models in the Llama, Qwen and DeepSeek families through their APIs for applications, and a Playground for exploratory purposes. When it comes to DeepSeek models, they provide both distilled versions and the non-distilled 671-billion-parameter version. In the examples we will be running the 671B version, but you can always switch to other versions.

Check it out

Thank you for helping keep SwirlAI Newsletters free for everyone!

To put it simply, Deep Research Agents are systems capable of running in-depth research on a predefined topic. Usually, this includes at least the following steps:

  • Planning the research - this could mean creating an outline of the research report that will eventually become the output of the system.

  • Splitting the plan into manageable steps.

  • Performing deep research on individual sections of the report. This means reasoning about the data needed to provide a comprehensive analysis and utilising web search tools to support it.

  • Reflecting on the data generated in different steps of the research and improving the results.

  • Summarising the retrieved data and coming back with the final Research Report.

Today we will implement all of the above without using any LLM Orchestration framework.

You can find the code (Notebook included for interactive learning) in my “AI Engineers Handbook” GitHub repository:

GitHub Repository

Follow and Star the repository if you like the content!

For a step-by-step explanation, continue reading.

The picture below represents what we are going to build. Here are the steps that the resulting system will perform:

  1. A user will provide a query or topic to be researched.

  2. An LLM will create an outline of the final report that it will be aiming for. It will be instructed to produce no more than a certain number of paragraphs.

  3. Each paragraph description will be fed into a research process separately to produce a comprehensive set of information to be used in report construction. The research process is described in detail in the next section.

  4. All of the information will be fed into a summarisation step that will construct the final report, including a conclusion.

  5. The report will then be delivered to the user in Markdown form.

![Deep Research Agent Topology](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e3d9c09-537d-4b0f-a198-5745fe2194b5_1932x1337.png)

Let’s zoom into the research step defined in the previous paragraph:

  1. Once we have the outline of each paragraph, it will be passed to an LLM to construct Web Search queries that best enrich the information needed.

  2. The LLM will output the search query and the reasoning behind it.

  3. We will execute a Web search against the query and retrieve the top relevant results.

  4. The results will be passed to the Reflection step, where an LLM will reason about any missed nuances and try to come up with a search query that would enrich the initial results.

  5. This process will be repeated n times in an attempt to get the best set of information possible.

![Research Step Topology](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd105c4b-1aa7-4281-bcb2-b2bfde71f04d_1673x1028.png)

Before we go into the implementation stage, let’s do some technical hygiene. If you haven’t done so yet, go to the SambaNova Cloud console and get your API key; we will use it to explore some important output characteristics of the DeepSeek R1 model family.

You can register here.

Go to the “APIs” tab; you will be prompted to log in, so do that. Don’t worry, adding your credit card details is optional; we will be able to run this project without doing that.

To run queries against the API we will use the OpenAI client. If you don’t have it yet, simply run:

pip install openai

For the project we will use a non-distilled DeepSeek-R1 version with 671B parameters.

If you can’t access it yet by the time of reading the newsletter, join the waitlist and switch the API endpoint and model version to a smaller distilled version.

Make sure that your SAMBANOVA_API_KEY is exported as an environment variable and run the following in your console or notebook:

import os
import openai

client = openai.OpenAI(
    api_key=os.environ.get("SAMBANOVA_API_KEY"),
    base_url="https://preview.snova.ai/v1",
)

response = client.chat.completions.create(
    model="DeepSeek-R1",
    messages=[{"role":"system","content":"You are a helpful assistant"},
              {"role":"user","content":"Tell me something interesting about human species"}],
    temperature=1
)

print(response.choices[0].message.content)

You should see something similar to:

<think>
Okay, so I'm trying to ... <REDACTED>
</think>

The human species is distinguished by the remarkable cognitive abilities of the brain, which underpin a array of unique traits. Our brain's advanced structure and function enable complex thought, language, and social organization. These capabilities have driven innovation, art, and the creation of intricate societies, setting humans apart in their ability to adapt, innovate, and create beyond any other species. This cognitive prowess is the cornerstone of human achievement and our profound impact on the world.

Reasoning tokens will always be included in the answer. While it is interesting to see the thinking process, what we need in our system is the answer only. This is where we can create a hygiene function that strips everything up to and including the closing </think> tag.

def remove_reasoning_from_output(output):
    # Keep only the text after the closing </think> tag.
    return output.split("</think>")[-1].strip()

Simple yet useful.
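Applied to the response from above:

```python
# Strip the <think> ... </think> block, keeping only the final answer.
answer = remove_reasoning_from_output(response.choices[0].message.content)
print(answer)
```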

Great! We have now set up the SambaNova account and understood the output structure of the DeepSeek R1 family of models. Let’s go implement the Deep Research Agent.

First, we will need to define the state of the entire system. It will continuously evolve while the Agent is running in the environment and will be used selectively by different parts of the system.

Let’s relate the state to the Stages of the Agentic system:

![Topology State](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c697471-715b-4b6a-8b70-0b7db1c13b60_1980x1412.png)

  • Stage 1 will be the creation of the outline, where the report structure is planned and the state evolved. We will start with an empty state and evolve it into something similar to the following (the research field is described in Stage 2):

    {
        "report_title": "Report Title",
        "paragraphs": [
            {
                "title": "Paragraph Title",
                "content": "Paragraph Content",
                "research": <...>
            },
            {
                "title": "Paragraph Title",
                "content": "Paragraph Content",
                "research": <...>
            }
        ]
    }
    

    The state can be implemented cleanly using Python dataclasses. The above would look like:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Paragraph:
        title: str = ""
        content: str = ""
        # Note: the Research dataclass (defined in Stage 2 below) must be
        # declared before Paragraph when running this code.
        research: Research = field(default_factory=Research)

    @dataclass
    class State:
        report_title: str = ""
        paragraphs: List[Paragraph] = field(default_factory=list)
    
  • Stage 2 is where we will iterate on the state of each paragraph, changing the “research” field of each one. We will use the following structure for the research state per paragraph:

    {
        "search_history": [{"url": "some url", "content": "some content"}],
        "latest_summary": "summary of the combined search history",
        "reflection_iteration": 1
    }
    

    ![Research Step](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae2194f5-e847-4f07-9bf4-cf5f5c3baa0f_1673x1028.png)

    search_history - we will store all of the searches we perform in a list. We want both the url and the content so that we can deduplicate search results and refer to the links later when forming the final report.

    latest_summary - the summarised version of the paragraph given all of the search results. It will be used in the reflection step to figure out if more search is needed, and passed on to the summarisation and report creation step.

    reflection_iteration - tracks the current reflection iteration and forces a stop when the limit is reached.

    Again, we can implement the research state via dataclasses:

    @dataclass
    class Search:
        url: str = ""
        content: str = ""
    
    @dataclass
    class Research:
        search_history: List[Search] = field(default_factory=list)
        latest_summary: str = ""
        reflection_iteration: int = 0
    

Different versions of the models will have varying consistency in the answers they produce. I experimented with DeepSeek-R1 a fair amount, and the following prompt seemed to produce consistently well-formatted outputs:

output_schema_report_structure = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "content": {"type": "string"}
        }
    }
}

SYSTEM_PROMPT_REPORT_STRUCTURE = f"""
You are a Deep Research assistant. Given a query, plan a structure for a report and the paragraphs to be included.
Make sure that the ordering of paragraphs makes sense.
Once the outline is created, you will be given tools to search the web and reflect for each of the sections separately.
Format the output in json with the following json schema definition:

<OUTPUT JSON SCHEMA>
{json.dumps(output_schema_report_structure, indent=2)}
</OUTPUT JSON SCHEMA>

Title and content properties will be used for deeper research.
Make sure that the output is a json object matching the output json schema defined above.
Only return the json object, no explanation or additional text.
"""

![Paragraph Structure State](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb01ea8db-e2bb-436d-b602-e6edacda0da8_1362x1307.png)

Let’s run a sample query with the above system prompt:

response = client.chat.completions.create(
    model="DeepSeek-R1",
    messages=[{"role":"system","content":SYSTEM_PROMPT_REPORT_STRUCTURE},
              {"role":"user","content":"Tell me something interesting about human species"}],
    temperature=1
)

print(response.choices[0].message.content)

You will get something similar to:

```json
[
  {
    "title": "Introduction to Human Adaptability",
    "content": "Humans possess a unique capacity for adaptability, which has been crucial in their survival and dominance across various environments. This introduction sets the stage for exploring the different facets of human adaptability."
  },
  ...
  <REDACTED>
  ...
  {
    "title": "Conclusion: The Role of Adaptability in Human Survival",
    "content": "Adaptability has been a cornerstone of human survival and evolution, enabling us to face challenges and explore new frontiers, offering insights into future potential."
  }
]
```

These json fences surrounding the output are not helpful, as we will need to transform the output into a Python dictionary. Here is a very simple function that strips them:

def clean_json_tags(text):
    # Strip the ```json fences the model wraps around its output.
    return text.replace("```json\n", "").replace("\n```", "")

Here is how to get the properly cleaned output:

json.loads(clean_json_tags(remove_reasoning_from_output(response.choices[0].message.content)))

We can now use the above as input to our global state directly.

STATE = State()

report_structure = json.loads(clean_json_tags(remove_reasoning_from_output(response.choices[0].message.content)))

for paragraph in report_structure:
    STATE.paragraphs.append(Paragraph(title=paragraph["title"], content=paragraph["content"]))

We will be using Tavily to perform web search. You can get your token here.

The tool for it is very simple:

import os

from tavily import TavilyClient

def tavily_search(query, include_raw_content=True, max_results=5):
    # Create the client with the API key from the environment.
    tavily_client = TavilyClient(api_key=os.getenv("TAVILY_API_KEY"))

    return tavily_client.search(query,
                                include_raw_content=include_raw_content,
                                max_results=max_results)

Each function call will return up to max_results web search results, and for each search result it will return the following (see the sketch after this list):

  • Title of the search result.

  • URL of the search result.

  • Summary of the content.

  • Full content of the page if possible. We want this for best results.
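To make this concrete, here is a minimal sketch of reading those fields from the response dictionary (the field names follow Tavily’s Python client at the time of writing; treat the exact shape as an assumption):

```python
results = tavily_search("DeepSeek R1 reasoning model")

for result in results["results"]:
    print(result["title"])        # title of the search result
    print(result["url"])          # URL of the page
    print(result["content"])      # short summary of the content
    print(result["raw_content"])  # full page content, can be None
```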

Once we get the results from the Web Search tool, we can add all of them to the global state directly without calling any LLM, but we will need to make sure that we are updating the proper element in the list of paragraphs.

Basically, given the structure we defined:

{
    "report_title": "Report Title",
    "paragraphs": [
        {
            "title": "Paragraph Title",
            "content": "Paragraph Content",
            "research": <...>
        },
        {
            "title": "Paragraph Title",
            "content": "Paragraph Content",
            "research": <...>
        }
    ]
}

And given that we are currently researching the i-th paragraph, we will need to update the field .paragraphs[i].research.

Remembering the structure of the research field:

{
    "search_history": [{"url": "some url", "content": "some content"}],
    "latest_summary": "summary of the combined search history",
    "reflection_iteration": 1
}

Here is a handy function that will update the state correctly, given the Tavily search results, the index of the paragraph and the state object:

def update_state_with_search_results(search_results, idx_paragraph, state):
    # Append every search result to the paragraph's search history.
    for search_result in search_results["results"]:
        search = Search(url=search_result["url"], content=search_result["raw_content"])
        state.paragraphs[idx_paragraph].research.search_history.append(search)

    return state

We will be appending to the search history. The latest_summary and reflection_iteration fields will need some work by an LLM and will be discussed in Part 5: Reflecting.
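Since we keep URLs precisely so that we can deduplicate results, here is an optional, hedged variant of the update function (not part of the original walkthrough) that skips already-seen URLs and results without raw content:

```python
def update_state_with_search_results_dedup(search_results, idx_paragraph, state):
    # Collect URLs already present in this paragraph's search history.
    seen_urls = {s.url for s in state.paragraphs[idx_paragraph].research.search_history}

    for search_result in search_results["results"]:
        # Skip duplicates and results where Tavily could not fetch full content.
        if search_result["url"] in seen_urls or not search_result["raw_content"]:
            continue
        search = Search(url=search_result["url"], content=search_result["raw_content"])
        state.paragraphs[idx_paragraph].research.search_history.append(search)
        seen_urls.add(search_result["url"])

    return state
```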

To plan the first instance of a search, I found the following prompt to produce consistently good results:

input_schema_first_search = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "content": {"type": "string"}
    }
}

output_schema_first_search = {
    "type": "object",
    "properties": {
        "search_query": {"type": "string"},
        "reasoning": {"type": "string"}
    }
}

SYSTEM_PROMPT_FIRST_SEARCH = f"""
You are a Deep Research assistant. You will be given a paragraph of a report, its title and expected content, in the following json schema definition:

<INPUT JSON SCHEMA>
{json.dumps(input_schema_first_search, indent=2)}
</INPUT JSON SCHEMA>

You can use a web search tool that takes a 'search_query' as parameter.
Your job is to reflect on the topic and provide the most optimal web search query to enrich your current knowledge.
Format the output in json with the following json schema definition:

<OUTPUT JSON SCHEMA>
{json.dumps(output_schema_first_search, indent=2)}
</OUTPUT JSON SCHEMA>

Make sure that the output is a json object matching the output json schema defined above.
Only return the json object, no explanation or additional text.
"""

We ask for reasoning in the output schema just to force more thought around the query. While it is most likely overkill for a reasoning model, it might be a good idea with a regular LLM. And while we are using DeepSeek R1 for this, we don’t necessarily need to: reasoning models are specifically handy in the first step of the Deep Research Agent, where planning of the report structure is required.

Given that we now have a list of paragraphs planned with their content and descriptions, we can feed the output of Part 3 directly into the prompt. This is how it would look:

from dataclasses import asdict

response = client.chat.completions.create(
    model="DeepSeek-R1",
    messages=[{"role":"system","content":SYSTEM_PROMPT_FIRST_SEARCH},
              {"role":"user","content":json.dumps(asdict(STATE.paragraphs[0]))}],
    temperature=1
)

print(response.choices[0].message.content)

STATE.paragraphs[0] points to the first paragraph’s state, where the research field is still empty.

I got the following for my first search plan:

{"search_query": "Homo sapiens characteristics basic biological traits cognitive abilities behavioral traits"}

I can directly input the query into my search tool:

tavily_search("Homo sapiens characteristics basic biological traits cognitive abilities behavioral traits")

The first summary step is different from the upcoming Reflection step: there is nothing yet to reflect on, and this step will produce exactly that. The following prompt works relatively well:

input_schema_first_summary = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "content": {"type": "string"},
        "search_query": {"type": "string"},
        "search_results": {
            "type": "array",
            "items": {"type": "string"}
        }
    }
}

output_schema_first_summary = {
    "type": "object",
    "properties": {
        "paragraph_latest_state": {"type": "string"}
    }
}

SYSTEM_PROMPT_FIRST_SUMMARY = f"""
You are a Deep Research assistant. You will be given a search query, search results and the paragraph of a report that you are researching, in the following json schema definition:

<INPUT JSON SCHEMA>
{json.dumps(input_schema_first_summary, indent=2)}
</INPUT JSON SCHEMA>

Your job is to write the paragraph as a researcher using the search results to align with the paragraph topic and structure it properly to be included in the report.
Format the output in json with the following json schema definition:

<OUTPUT JSON SCHEMA>
{json.dumps(output_schema_first_summary, indent=2)}
</OUTPUT JSON SCHEMA>

Make sure that the output is a json object matching the output json schema defined above.
Only return the json object, no explanation or additional text.
"""

We now need to provide data to the LLM in the following format:

{
    "title": "Title",
    "content": "Content",
    "search_query": "Search Query",
    "search_results": []
}

We can construct the json from data we already have.

Given the response from:

search_results = tavily_search("Homo sapiens characteristics basic biological traits cognitive abilities behavioral traits")

The json would look like:

input = {
    "title": "Introduction to Human Adaptability",
    "content": "Humans possess a unique capacity for adaptability, which has been crucial in their survival and dominance across various environments. This introduction sets the stage for exploring the different facets of human adaptability.",
    "search_query": "Homo sapiens characteristics basic biological traits cognitive abilities behavioral traits",
    "search_results": [result["raw_content"][0:20000] for result in search_results["results"] if result["raw_content"]]
}

We can then run:

response = client.chat.completions.create(
    model="DeepSeek-R1",
    messages=[{"role":"system","content": SYSTEM_PROMPT_FIRST_SUMMARY},
              {"role":"user","content":json.dumps(input)}],
    temperature=1
)

print(remove_reasoning_from_output(response.choices[0].message.content))

You will get something similar to:

{
  "paragraph_latest_state": "Homo sapiens, the species to which modern humans belong, represents a unique and fascinating chapter in the evolutionary narrative of life on Earth. As the only living species within the Homo genus, Homo sapiens are distinguished by a combination of biological, cognitive, and behavioral traits that set us apart from other primates and extinct human relatives. Our biological characteristics include a large and structurally advanced brain, with a neocortex that has expanded significantly compared to our evolutionary ancestors. This anatomical development has enabled exceptional cognitive abilities, such as complex problem-solving, abstract thought, and the capacity for language and symbolic communication. Behaviorally, Homo sapiens exhibit sophisticated social structures, cultural practices, and technological innovations, which have been critical in shaping our ability to adapt to diverse environments and thrive as a species. These traits collectively underscore the intricate interplay between biology and behavior that defines the human condition."
}

This is what we will update the STATE.paragraphs[0].research.latest_summary field with. We will also reflect on the continuously updated latest state of the paragraph summary as we move forward in Part 6.
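A minimal sketch of that update, reusing the cleaning helpers from earlier:

```python
first_summary = json.loads(clean_json_tags(
    remove_reasoning_from_output(response.choices[0].message.content)))

# Store the drafted paragraph text as the paragraph's latest summary.
STATE.paragraphs[0].research.latest_summary = first_summary["paragraph_latest_state"]
```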

We now have the latest state of the report paragraph content and will use it to improve the content by prompting our LLM to reflect on the text and look for any angles it might have missed while drafting the piece.

Here is a prompt that does a great job:

input_schema_reflection = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "content": {"type": "string"},
        "paragraph_latest_state": {"type": "string"}
    }
}

output_schema_reflection = {
    "type": "object",
    "properties": {
        "search_query": {"type": "string"},
        "reasoning": {"type": "string"}
    }
}

SYSTEM_PROMPT_REFLECTION = f"""
You are a Deep Research assistant. You are responsible for constructing comprehensive paragraphs for a research report. You will be provided with the paragraph title, the planned content summary, and the latest state of the paragraph that you have already created, all in the following json schema definition:

<INPUT JSON SCHEMA>
{json.dumps(input_schema_reflection, indent=2)}
</INPUT JSON SCHEMA>

You can use a web search tool that takes a 'search_query' as a parameter.
Your job is to reflect on the current state of the paragraph text and think if you haven't missed some critical aspect of the topic and provide the most optimal web search query to enrich the latest state.
Format the output in json with the following json schema definition:

<OUTPUT JSON SCHEMA>
{json.dumps(output_schema_reflection, indent=2)}
</OUTPUT JSON SCHEMA>

Make sure that the output is a json object matching the output json schema defined above.
Only return the json object, no explanation or additional text.
"""

For the run we are currently implementing, the input would look like this:

input = {"paragraph_latest_state": "Homo sapiens, the species to which modern humans belong, represents a unique and fascinating chapter in the evolutionary narrative of life on Earth. As the only living species within the Homo genus, Homo sapiens are distinguished by a combination of biological, cognitive, and behavioral traits that set us apart from other primates and extinct human relatives. Our biological characteristics include a large and structurally advanced brain, with a neocortex that has expanded significantly compared to our evolutionary ancestors. This anatomical development has enabled exceptional cognitive abilities, such as complex problem-solving, abstract thought, and the capacity for language and symbolic communication. Behaviorally, Homo sapiens exhibit sophisticated social structures, cultural practices, and technological innovations, which have been critical in shaping our ability to adapt to diverse environments and thrive as a species. These traits collectively underscore the intricate interplay between biology and behavior that defines the human condition.",
            "title": "Introduction",
            "content": "The human species, Homo sapiens, is one of the most unique and fascinating species on Earth. This section will introduce the basic characteristics of humans and set the stage for exploring interesting aspects of the species."}

As before, let’s run the following:

response = client.chat.completions.create(
    model="DeepSeek-R1",
    messages=[{"role":"system","content": SYSTEM_PROMPT_REFLECTION},
              {"role":"user","content":json.dumps(input)}],
    temperature=1
)

print(remove_reasoning_from_output(response.choices[0].message.content))

The output:

{
  "search_query": "Recent research on Homo sapiens evolution, interaction with other human species, and factors contributing to their success",
  "reasoning": "The current paragraph provides a good overview of Homo sapiens' characteristics but lacks depth on evolutionary history and interactions with other species. Including recent research on these topics will enhance the paragraph's comprehensiveness and provide up-to-date information."
}

Now we run the query, append the new results to the paragraph state and combine the new results with the latest paragraph state.

After running the search query from the reflection step:

search_results = tavily_search("Recent research on Homo sapiens evolution, interaction with other human species, and factors contributing to their success")

We can update the search state of the paragraph with:

STATE = update_state_with_search_results(search_results, 0, STATE)

We now run steps 6 and 7 in a loop for a specified number of reflection iterations.
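Here is a minimal sketch of that loop for a single paragraph, assuming the helpers defined above; the final re-summarisation prompt (merging new search results into paragraph_latest_state) is left out, as it closely mirrors SYSTEM_PROMPT_FIRST_SUMMARY:

```python
NUM_REFLECTIONS = 2  # the "n" from the research step topology
idx = 0              # paragraph we are currently researching

for i in range(NUM_REFLECTIONS):
    STATE.paragraphs[idx].research.reflection_iteration = i + 1

    # Step 6: reflect on the current paragraph state and plan a new search.
    reflection_input = {
        "title": STATE.paragraphs[idx].title,
        "content": STATE.paragraphs[idx].content,
        "paragraph_latest_state": STATE.paragraphs[idx].research.latest_summary,
    }
    response = client.chat.completions.create(
        model="DeepSeek-R1",
        messages=[{"role": "system", "content": SYSTEM_PROMPT_REFLECTION},
                  {"role": "user", "content": json.dumps(reflection_input)}],
        temperature=1,
    )
    reflection = json.loads(clean_json_tags(
        remove_reasoning_from_output(response.choices[0].message.content)))

    # Step 7: run the planned search and fold the results into the state.
    search_results = tavily_search(reflection["search_query"])
    STATE = update_state_with_search_results(search_results, idx, STATE)
    # ...then re-summarise and overwrite
    # STATE.paragraphs[idx].research.latest_summary with the merged text.
```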

We repeat the steps from Parts 4 - 7 for each paragraph that was planned in Part 2. Once we have the final states of all paragraphs ready, we can stitch the whole thing together. We will do that with an LLM and produce a nicely formatted Markdown document. Here is the prompt:

input_schema_report_formatting = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "paragraph_latest_state": {"type": "string"}
        }
    }
}

SYSTEM_PROMPT_REPORT_FORMATTING = f"""
You are a Deep Research assistant. You have already performed the research and constructed final versions of all paragraphs in the report.
You will get the data in the following json format:

<INPUT JSON SCHEMA>
{json.dumps(input_schema_report_formatting, indent=2)}
</INPUT JSON SCHEMA>

Your job is to format the Report nicely and return it in MarkDown.
If a Conclusion paragraph is not present, add one to the end of the report based on the latest state of the other paragraphs.
"""

Run:

report_data = [{"title": paragraph.title, "paragraph_latest_state": paragraph.research.latest_summary} for paragraph in STATE.paragraphs]

response = client.chat.completions.create(
    model="DeepSeek-R1",
    messages=[{"role":"system","content": SYSTEM_PROMPT_REPORT_FORMATTING},
              {"role":"user","content":json.dumps(report_data)}],
    temperature=1
)

print(remove_reasoning_from_output(response.choices[0].message.content))
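To keep the result around, you can write the Markdown straight to a file:

```python
final_report = remove_reasoning_from_output(response.choices[0].message.content)

# Persist the report so it can be rendered or shared.
with open("report.md", "w") as f:
    f.write(final_report)
```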

And that’s it, you have yourself a Deep Research report on the topic you provided.

Congratulations! You have successfully implemented a Deep Research Agent from scratch.

If you want to take a look at a more polished implementation, you can find it in my GitHub repository:

GitHub Repository

Also, there are many nuances to take into account that could make the system more stable:

  • It is not easy to make the system produce consistently well-formatted json outputs, as reasoning models are known not to be the best at structured outputs.

  • Knowing the above, it might make sense to use different models for different tasks in the system topology; we really only need a reasoning model for the first planning step.

  • There are many improvements that could be implemented in how we search the web and how we rank the retrieved results.

  • The number of Reflection steps could be made dynamic, with the LLM choosing whether more are needed.

  • We could return links that were used when searching the web and provide references in the report for each paragraph.

If you want to follow SambaNova work, they have recently released some work on Deep Research agents as well. You can find more information here.

