A Practical Guide to Implementing DeepSearch/DeepResearch


It's only February, and DeepSearch has already emerged as the new search standard in 2025, with major players like Google and OpenAI leading the charge through their DeepResearch releases (and yes, we proudly launched our open-source node-deepresearch on the same day). Perplexity followed suit with their DeepResearch, and X AI integrated their own DeepSearch capabilities into Grok3, basically creating another DeepResearch variant. While the concept of deep search isn't revolutionary – in 2024 it was essentially termed RAG or multi-hop QA – it gained significant momentum after Deepseek-r1's release in late January 2025. Last weekend, Baidu Search and Tencent WeChat Search integrated Deepseek-r1 into their search engines. AI engineers have discovered that by incorporating long-thinking and reasoning processes into search systems, they can achieve remarkable retrieval accuracy and depth beyond what was previously possible.

But why did this shift happen now, when Deep(Re)Search remained relatively undervalued throughout 2024? In fact, Stanford NLP Labs released the STORM project for long report generation with web grounding back in early 2024. So is it just because "DeepSearch" sounds way cooler than multi-hop QA, RAG, or STORM? Let's be honest - sometimes a rebrand is all it takes for the industry to suddenly embrace what was there all along.

We believe the real turning point came with OpenAI's o1-preview release in September 2024, which introduced the concept of test-time compute and gradually shifted industry perspectives. Test-time compute refers to using more computational resources during inference—the phase where an LLM generates outputs—rather than during pre-training or post-training. Well-known examples are Chain-of-Thought (CoT) reasoning and "Wait"-injection (i.e. budget forcing), which enable models to perform more extensive internal deliberations, such as evaluating multiple potential answers, conducting deeper planning, and engaging in self-reflection before arriving at a final response.

This test-time compute concept and reasoning models educate users to accept delayed gratification - longer waiting times in exchange for higher-quality, immediately actionable results, just like the Stanford marshmallow experiment, where children who resisted eating one marshmallow immediately in order to receive two later showed better long-term outcomes. Deepseek-r1 further reinforced this user experience, and like it or not, most users have accepted it.

This marks a significant departure from classic search requirements, where failing to respond within 200ms would doom your solution. In 2025, seasoned search developers and RAG engineers prioritize top-1 precision and recall over latency, and users have become accustomed to longer processing times – provided they can see the system is <thinking>.

Displaying the reasoning procedure has become standard practice in 2025, with numerous chat interfaces now rendering <think> content in dedicated UI sections.

In this article, we'll discuss the principles of DeepSearch and DeepResearch by looking into our open-source implementation. We'll walk through our key design decisions and highlight potential caveats. Finally, you can find our hot-take on deepsearch engineering and dev tools demands in the conclusion section.

What is DeepSearch?

DeepSearch runs through an iterative loop of searching, reading, and reasoning until it finds the optimal answer. The search action leverages web search engines to explore the internet, while the reading action analyzes specific web pages in detail (e.g. Jina Reader). The reasoning action evaluates the current state and determines whether to break down the original question into smaller sub-questions or try different search strategies.


DeepSearch - keep searching, reading webpages, reasoning until an answer is found (or the token budget is exceeded).

While various definitions exist online, when we developed the node-deepresearch project, we adhered to this straightforward approach. The implementation is elegantly simple – at its core, there's a main while loop with switch-case logic directing the next action.

Unlike 2024 RAG systems, which typically run a single search-generation pass, DeepSearch performs multiple iterations through the pipeline and therefore requires clear stop conditions. These could be based on token usage limits or the number of failed attempts.

Try DeepSearch at search.jina.ai, observe the content inside <thinking>, and see if you can tell where the loop happens.

Another perspective on DeepSearch is to view it as an LLM agent equipped with various web tools (such as searcher and reader). The agent determines its next steps by analyzing current observations and past actions – deciding whether to deliver an answer or continue exploring the web. This creates a state machine architecture where the LLM controls transitions between states. At each decision point, you have two approaches: you can either carefully craft prompts for standard generative models to produce specific actions, or leverage specialized reasoning models like Deepseek-r1 to naturally derive the next actions. However, even when using r1, you'll need to periodically interrupt its generation to inject tool outputs (e.g. search results, webpage content) into the context and prompt it to continue its reasoning process.
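As a rough illustration of that interrupt-and-inject pattern, the controller might look something like the minimal sketch below. The `llm.generate` and `runTool` helpers and the message shape are assumptions for illustration, not the actual node-deepresearch code:

// Sketch only: pause generation whenever the model emits a tool call, run the tool,
// append the observation to the context, and let the model continue reasoning.
type Message = { role: 'system' | 'user' | 'assistant' | 'tool'; content: string };
type StepOutput = { text: string; toolCall?: { name: string; args: string } };

async function reasonWithTools(
  llm: { generate: (msgs: Message[]) => Promise<StepOutput> },   // hypothetical reasoning-model client
  runTool: (name: string, args: string) => Promise<string>,      // hypothetical search/read dispatcher
  messages: Message[]
): Promise<string> {
  while (true) {
    const step = await llm.generate(messages);
    messages.push({ role: 'assistant', content: step.text });
    if (!step.toolCall) return step.text;                        // no tool requested: treat as final answer
    const observation = await runTool(step.toolCall.name, step.toolCall.args);
    messages.push({ role: 'tool', content: observation });       // inject search results / page content
    // loop: the model resumes reasoning with the new observation in its context
  }
}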

Ultimately, these are just implementation details – whether you carefully prompt it or just use reasoning models, they all align with the core design principle of DeepSearch: a continuous loop of searching, reading, and reasoning.

What is DeepResearch Then?

DeepResearch builds upon DeepSearch by adding a structured framework for generating long research reports. It often begins by creating a table of contents, then systematically applies DeepSearch to each required section – from introduction through related work and methodology, all the way to the conclusion. Each section is generated by feeding specific research questions into the DeepSearch. The final phase involves consolidating all sections into a single prompt to improve the overall narrative coherence.
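To make that structure concrete, here is a minimal sketch of the outer DeepResearch loop, assuming a `deepSearch(question)` function like the one described later and a generic `llm.complete` helper (both names are ours, not the project's API):

// Sketch of the DeepResearch outer loop: plan a TOC, run DeepSearch per section,
// then one consolidation pass for coherence. All helpers here are illustrative.
interface LLM { complete(prompt: string): Promise<string>; }

async function deepResearch(
  topic: string,
  llm: LLM,
  deepSearch: (question: string) => Promise<string>
): Promise<string> {
  // 1. Plan: ask for a table of contents, one section title per line.
  const toc = (await llm.complete(`Draft a table of contents for a research report on: ${topic}`))
    .split('\n').map(s => s.trim()).filter(Boolean);

  // 2. Research: feed a section-specific question into DeepSearch for each section.
  const sections: string[] = [];
  for (const title of toc) {
    sections.push(await deepSearch(`Write the "${title}" section of a report on ${topic}.`));
  }

  // 3. Consolidate: a single coherence pass over all sections in one prompt.
  return llm.complete(
    'Revise these sections into one coherent report, fixing terminology and transitions:\n\n' +
    sections.join('\n\n')
  );
}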


DeepSearch as the building block of DeepResearch: iteratively construct each section via DeepSearch, then improve the overall coherence before generating the final long report.

In our 2024 "Re-search" project (unreleased), we performed multiple coherence-improvement passes, with each iteration taking into account all other sections. However, with today's significantly larger LLM context windows, this approach seems redundant – a single coherence revision pass is sufficient.

We didn't release our re-search project for several reasons. First, the report quality never met our standards. We kept testing two queries that we know inside out—competitor analysis of Jina AI and product strategy of Jina AI—and the reports were mediocre at best; we didn't find any "aha" moments in them. Second, search grounding was really bad and hallucinations were quite severe. Finally, the overall readability was poor, with significant redundancy between sections. Put shortly: useless. And since it produced a long report: useless and time-wasting.

From this project, we learned several things that evolved into different sub-products. For example, we realized the importance of search grounding and fact-checking on section or even sentence level, which led us to later develop the g.jina.ai endpoint. We recognized the importance of query expansion, which prompted us to invest effort in training an SLM for query expansion. Lastly, we loved the name "re-search"—a clever play on reinventing search while nodding to research report generation—and felt it was too good to waste, so we repurposed it for our 2024 yearbook campaign.

Our 2024 summer project "Research" focused on long-report generation with a "progressive" approach. It began by creating a TOC synchronously, then generated all sections asynchronously in parallel. The process concluded with asynchronous progressive revisions of each section, with each revision taking into account the content of all other sections. The query in the video is "Competitor analysis of Jina AI".

DeepSearch vs DeepResearch

While many people often mix DeepSearch and DeepResearch together, in our view they address completely different problems. DeepSearch functions as an atomic building block – a core component that DeepResearch builds upon. DeepResearch, on the other hand, focuses on crafting high-quality, readable long-form research reports, which encompasses a different set of requirements: incorporating effective visualizations via charts and tables, structuring content with appropriate section headings, ensuring smooth logical flow between subsections, maintaining consistent terminology throughout the document, eliminating redundancy across sections, and crafting smooth transitions that bridge previous and upcoming content. These elements are largely unrelated to the core search, which is why we find DeepSearch more interesting as our company focus.

Finally, the table below summarizes the differences between DeepSearch and DeepResearch. It's worth noting that both systems benefit significantly from long-context and reasoning models. This might seem counterintuitive, particularly for DeepSearch—while it's obvious why DeepResearch needs long-context capability (as it produces long reports). The reason is that DeepSearch must store previous search attempts and webpage contents to make informed decisions about next steps, making a long context window equally essential for its effective implementation.

| | DeepSearch | DeepResearch |
|---|---|---|
| Problem Addressed | Information accuracy and completeness through iterative search | Content organization, coherence, and readability at document scale |
| Final Presentation | Concise answer with URLs as references | A long structured report with multiple sections, charts, tables and references |
| Core Complexity | State machine architecture with clear transition conditions; persistence through failed attempts until resolution | Multi-level architecture managing both micro (search) and macro (document) concerns; structural approach to managing complex information hierarchies |
| Optimization Focus | Local optimization (best next search/read action) | Global optimization (section organization, terminology consistency, transitions) |
| Limitations | Bounded by search quality and reasoning capability | Bounded by DeepSearch quality plus organizational complexity and narrative coherence challenges |

Understand DeepSearch Implementation

[jina-ai/node-DeepResearch on GitHub](https://github.com/jina-ai/node-DeepResearch)

The heart of DeepResearch lies in its loop reasoning approach. Rather than attempting to answer questions in a single pass like most RAG systems, we have implemented an iterative loop that continuously searches for information, reads relevant sources, and reasons until it finds an answer or exhausts the token budget. Here's the simplified core of this big while loop:


while (tokenUsage < tokenBudget && badAttempts <= maxBadAttempts) {
  step++; totalStep++;

  // Pull the next question to work on: a gap question if any are queued, otherwise the original question
  const currentQuestion = gaps.length > 0 ? gaps.shift() : question;

  // Build the system prompt from the current state and the actions allowed at this step
  system = getPrompt(diaryContext, allQuestions, allKeywords,
                     allowReflect, allowAnswer, allowRead, allowSearch, allowCoding,
                     badContext, allKnowledge, unvisitedURLs);

  // Ask the LLM for a structured next action
  const result = await LLM.generateStructuredResponse(system, messages, schema);
  thisStep = result.object;

  if (thisStep.action === 'answer') {
    // [handle the answer action]
  } else if (thisStep.action === 'reflect') {
    // [handle the reflect action]
  } // ... and so on for search, visit, coding
}

A key implementation detail is selectively disabling certain actions at each step to ensure more stable structured output. For example, if there are no URLs in memory, we disable the visit action; or if the last answer was rejected, we prevent the agent from immediately calling answer again. This constraint keeps the agent on a productive path, avoiding repetitive failures caused by invoking the same action.
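For illustration, a minimal version of that gating logic could look like the sketch below. The state fields and thresholds are ours; the real repository derives the allow* flags from more signals:

// Sketch: derive the allowed actions from the current state before building the prompt.
// The flag names mirror the getPrompt() call above, but the rules here are illustrative.
function deriveActionFlags(state: {
  unvisitedURLs: string[];
  lastAnswerRejected: boolean;
  tokenUsage: number;
  tokenBudget: number;
}) {
  return {
    allowRead: state.unvisitedURLs.length > 0,        // nothing left to visit -> disable the visit action
    allowAnswer: !state.lastAnswerRejected,           // don't let the agent retry answer right after a rejection
    allowReflect: state.tokenUsage < 0.9 * state.tokenBudget,  // stop spawning gap questions near the budget
    allowSearch: true,
    allowCoding: true,
  };
}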

System Prompt

We use XML tags to define sections, which produces more robust system prompts and generations. We also found that placing field constraints directly inside JSON schema description fields yields better results. While some might argue that most prompts could be automated with reasoning models like DeepSeek-R1, the context length restrictions and the need for highly specific behavior make an explicit approach more reliable in practice.

function getPrompt(params...) {
  const sections = [];

  sections.push("You are an advanced AI research agent specialized in multistep reasoning...");

  if (knowledge?.length) {
    sections.push("<knowledge>[Knowledge items]</knowledge>");
  }

  if (context?.length) {
    sections.push("<context>[Action history]</context>");
  }

  if (badContext?.length) {
    sections.push("<bad-attempts>[Failed attempts]</bad-attempts>");
    sections.push("<learned-strategy>[Improvement strategies]</learned-strategy>");
  }

  sections.push("<actions>[Available action definitions]</actions>");
  sections.push("Respond in valid JSON format matching exact JSON schema.");

  return sections.join("\n\n");
}
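As an example of what "constraints inside description fields" means, a schema fragment could look like this (hypothetical fields, not the project's exact schema):

// Sketch: behavioral constraints live in the JSON schema's "description" fields,
// right next to the fields they govern, instead of being buried in the system prompt.
const actionSchema = {
  type: 'object',
  properties: {
    action: {
      type: 'string',
      enum: ['search', 'visit', 'reflect', 'answer'],
      description: 'Pick exactly one action, and only from the actions listed in <actions>.',
    },
    searchRequests: {
      type: 'array',
      items: { type: 'string' },
      description: 'At most 3 short keyword-style queries; must not repeat queries already shown in <context>.',
    },
  },
  required: ['action'],
} as const;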

Gap Questions Traversing

In DeepSearch, "gap questions" represent knowledge gaps that need to be filled before answering the main question. Rather than directly tackling the original question, the agent identifies sub-questions that will build the necessary knowledge foundation.

The design is particularly elegant in how it handles these gap questions:


if (newGapQuestions.length > 0) {
  gaps.push(...newGapQuestions);
  gaps.push(originalQuestion);
}

This approach creates a FIFO (First-In-First-Out) queue with rotation, where:

  1. New gap questions are pushed onto the queue ahead of the re-queued original question
  2. The original question is always pushed to the back
  3. The system pulls from the front of the queue at each step

What makes this design great is that it maintains a single shared context across all questions. When a gap question is answered, that knowledge becomes immediately available for all subsequent questions, including when we eventually revisit the original question.
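A quick trace makes the rotation concrete (Q0 is the original question, Q1–Q3 are hypothetical gap questions):

// Sketch: how the queue evolves when the original question Q0 spawns gaps Q1–Q3.
const gaps: string[] = [];
gaps.push('Q1', 'Q2', 'Q3');   // new gap questions enter the queue first...
gaps.push('Q0');               // ...and the original question is re-queued behind them

while (gaps.length > 0) {
  const currentQuestion = gaps.shift();
  console.log(currentQuestion);  // prints Q1, Q2, Q3, then Q0
  // by the time Q0 comes back around, everything learned while answering
  // Q1–Q3 is already sitting in the shared context
}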

FIFO Queue vs Recursion

An alternative approach is using recursion, which corresponds to depth-first search. Each gap question spawns a new recursive call with its own isolated context. The system must completely resolve each gap question (and all of its potential sub-questions) before returning to the parent question.

Consider this example scenario where we start from the original question, what is Jina AI:

A simple 3-depth gap-question recursion, with the solving order labeled on each circle.

In the recursive approach, the system can only return to Q1 after every gap question (and all of their potential sub-questions) has been fully resolved. This is a big contrast to the queue approach, where Q1 gets revisited right after the 3 top-level gap questions.

In reality, we found the recursion approach is very hard to apply budget-forcing to, since there is no clear rule of thumb for how much token budget we should grant for sub-questions (since they may spawn new sub-questions). The benefit from clear context separation in the recursion approach is very marginal compared to the complicated budget forcing and late return problems. This FIFO queue design balances depth and breadth, ensuring the system always returns to the original question with progressively better knowledge, rather than getting lost in a potentially infinite recursive descent.

Query Rewrite

An interesting challenge we encountered was rewriting search queries effectively:


if (thisStep.action === 'search') {
  // Deduplicate against queries we have already issued
  const uniqueRequests = await dedupQueries(thisStep.searchRequests, existingQueries);

  // Rewrite/expand the queries, then deduplicate again against all known keywords
  const optimizedQueries = await rewriteQuery(uniqueRequests);
  const newQueries = await dedupQueries(optimizedQueries, allKeywords);

  for (const query of newQueries) {
    const results = await searchEngine(query);
    if (results.length > 0) {
      storeResults(results);
      allKeywords.push(query);
    }
  }
}

Query rewriting turned out to be surprisingly important - perhaps one of the most critical elements that directly determines result quality. A good query rewriter doesn't just transform natural language to BM25-like keywords; it expands queries to cover more potential answers across different languages, tones, and content formats.

For query deduplication, we initially used an LLM-based solution, but found it difficult to control the similarity threshold. We eventually switched to jina-embeddings-v3, which excels at semantic textual similarity tasks. This enables cross-lingual deduplication without worrying that non-English queries would be filtered. The embedding model ended up being crucial not for memory retrieval as initially expected, but for efficient deduplication.
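A simplified, standalone version of that deduplication step might look like the sketch below. It assumes an `embed()` helper that wraps jina-embeddings-v3 and returns one vector per input text; the 0.86 threshold is purely illustrative and differs from whatever the repository actually uses:

// Sketch: drop candidate queries that are semantically too close to queries we already issued.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function dedupBySimilarity(
  candidates: string[],
  existing: string[],
  embed: (texts: string[]) => Promise<number[][]>,  // hypothetical jina-embeddings-v3 wrapper
  threshold = 0.86                                  // illustrative value
): Promise<string[]> {
  const candVecs = await embed(candidates);
  const existVecs = existing.length > 0 ? await embed(existing) : [];
  const kept: string[] = [];
  const keptVecs: number[][] = [];
  candidates.forEach((query, i) => {
    const tooSimilar = [...existVecs, ...keptVecs].some(v => cosine(candVecs[i], v) >= threshold);
    if (!tooSimilar) { kept.push(query); keptVecs.push(candVecs[i]); }
  });
  return kept;
}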

Crawling Web Content

Web scraping and content processing is another critical component; here we use the Jina Reader API. Note that besides full webpage content, we also aggregate all snippets returned from the search engine as extra knowledge for the agent to draw conclusions from later. Think of them as soundbites.


async function handleVisitAction(URLs) {
  const uniqueURLs = normalizeAndFilterURLs(URLs);

  const results = await Promise.all(uniqueURLs.map(async url => {
    try {
      const content = await readUrl(url);
      addToKnowledge(`What is in ${url}?`, content, [url], 'url');
      return { url, success: true };
    } catch (error) {
      return { url, success: false };
    } finally {
      visitedURLs.push(url);
    }
  }));

  updateDiaryWithVisitResults(results);
}

We normalize URLs for consistent tracking and limit the number of URLs visited in each step to manage the agent's memory.
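URL normalization itself can stay simple; the sketch below shows the kind of canonicalization we mean (our illustration of what a `normalizeAndFilterURLs` helper could do, not the exact rules in the repository):

// Sketch: canonicalize URLs so the same page isn't visited twice under different spellings,
// and cap how many URLs are visited per step.
function normalizeAndFilterURLs(urls: string[], maxPerStep = 5): string[] {
  const seen = new Set<string>();
  const result: string[] = [];
  for (const raw of urls) {
    try {
      const u = new URL(raw);
      if (u.protocol !== 'http:' && u.protocol !== 'https:') continue;
      u.hash = '';                                    // drop fragments
      u.hostname = u.hostname.toLowerCase();
      const key = u.toString().replace(/\/$/, '');    // ignore trailing slash
      if (!seen.has(key)) {
        seen.add(key);
        result.push(key);
      }
    } catch { /* skip malformed URLs */ }
    if (result.length >= maxPerStep) break;
  }
  return result;
}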

Memory Management

A key challenge in multi-step reasoning is managing the agent memory effectively. We've designed the memory system to differentiate between what counts as "memory" versus what counts as "knowledge". Either way, they are all part of the LLM prompt context, separated with different XML tags:


function addToKnowledge(question, answer, references, type) {
  allKnowledge.push({
    question: question,
    answer: answer,
    references: references,
    type: type,
    updated: new Date().toISOString()
  });
}

function addToDiary(step, action, question, result, evaluation) {
  diaryContext.push(`
At step ${step}, you took **${action}** action for question: "${question}"
[Details of what was done and results]
[Evaluation if applicable]
`);
}

Since most 2025 LLMs have substantial context windows, we opted not to use vector databases. Instead, memory consists of acquired knowledge, visited sites, and records of failed attempts - all maintained in the context. This comprehensive memory system gives the agent awareness of what it knows, what it's tried, and what's worked or failed.
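To make the split concrete, here is a rough sketch of how knowledge and the action diary could be rendered into the prompt under different XML tags (illustrative, not the exact template):

// Sketch: knowledge and the action diary live in the same prompt context, under different XML tags.
interface KnowledgeItem {
  question: string;
  answer: string;
  references: string[];
  type: string;
  updated: string;
}

function renderContext(allKnowledge: KnowledgeItem[], diaryContext: string[]): string {
  const knowledge = allKnowledge
    .map(k => `<item q="${k.question}" refs="${k.references.join(', ')}">${k.answer}</item>`)
    .join('\n');
  return [
    `<knowledge>\n${knowledge}\n</knowledge>`,            // what the agent has learned so far
    `<context>\n${diaryContext.join('\n')}\n</context>`,  // what the agent has done (and how it went)
  ].join('\n\n');
}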

Answer Evaluation

One key insight is that answer generation and evaluation should not be in the same prompt. In our implementation, we first determine which evaluation criteria to use when a new question arrives, and then evaluate each criterion one by one. The evaluator uses few-shot examples for consistent assessment, ensuring higher reliability than self-evaluation.


async function evaluateAnswer(question, answer, metrics, context) {
  const evaluationCriteria = await determineEvaluationCriteria(question);

  const results = [];
  for (const criterion of evaluationCriteria) {
    const result = await evaluateSingleCriterion(criterion, question, answer, context);
    results.push(result);
  }

  return {
    pass: results.every(r => r.pass),
    think: results.map(r => r.reasoning).join('\n')
  };
}
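A single-criterion check could then be prompted roughly as follows. This sketch reuses the `LLM.generateStructuredResponse` helper from the loop above and assumes a criterion object carrying a definition plus few-shot examples, along with an `evaluationSchema` that has pass/reasoning fields; all of these shapes are illustrative:

// Sketch: evaluate one criterion at a time with a dedicated few-shot prompt,
// kept separate from the prompt that generated the answer.
async function evaluateSingleCriterion(
  criterion: { name: string; definition: string; fewShotExamples: string },
  question: string,
  answer: string,
  context: string
): Promise<{ pass: boolean; reasoning: string }> {
  const system = [
    `You are a strict evaluator. Judge only the "${criterion.name}" criterion: ${criterion.definition}`,
    `<examples>\n${criterion.fewShotExamples}\n</examples>`,
    'Respond in valid JSON matching the schema: {"pass": boolean, "reasoning": string}',
  ].join('\n\n');

  const messages = [{
    role: 'user',
    content: `<question>${question}</question>\n<answer>${answer}</answer>\n<context>${context}</context>`,
  }];

  const result = await LLM.generateStructuredResponse(system, messages, evaluationSchema);
  return result.object;  // { pass, reasoning }
}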

Budget-Forcing

Budget forcing means preventing the system from returning early and ensuring it continues processing until the budget is exceeded. Since the release of DeepSeek-R1, the approach to budget forcing has shifted toward encouraging deeper thinking for better results rather than simply saving the budget.

In our implementation, we explicitly configured the system to identify knowledge gaps before attempting to answer.

if (thisStep.action === 'reflect' && thisStep.questionsToAnswer) {
  gaps.push(...newGapQuestions);
  gaps.push(question);  // always re-queue the original question behind the new gaps
}

By selectively enabling and disabling certain actions, we can guide the system toward using tools that enhance reasoning depth.


allowAnswer = false;  // disable the answer action to push the agent toward deeper search/reflection first

To avoid wasting tokens on unproductive paths, we set limits on the number of failed attempts. When approaching budget limits, we activate "beast mode" to guarantee that we deliver some answer rather than none.


if (!thisStep.isFinal && badAttempts >= maxBadAttempts) {
  console.log('Enter Beast mode!!!');

  // Disable all optional actions; the answer-only schema forces a final answer
  system = getPrompt(
    diaryContext, allQuestions, allKeywords,
    false, false, false, false, false,   // allowReflect, allowAnswer, allowRead, allowSearch, allowCoding
    badContext, allKnowledge, unvisitedURLs,
    true                                 // beast mode on
  );

  const result = await LLM.generateStructuredResponse(system, messages, answerOnlySchema);
  thisStep = result.object;
  thisStep.isFinal = true;
}

The beast mode prompt is intentionally dramatic to signal to the LLM that it needs to be decisive and commit to an answer based on available information:

<action-answer>
🔥 ENGAGE MAXIMUM FORCE! ABSOLUTE PRIORITY OVERRIDE! 🔥 PRIME DIRECTIVE:
- DEMOLISH ALL HESITATION! ANY RESPONSE SURPASSES SILENCE!
- PARTIAL STRIKES AUTHORIZED - DEPLOY WITH FULL CONTEXTUAL FIREPOWER
- TACTICAL REUSE FROM <bad-attempts> SANCTIONED
- WHEN IN DOUBT: UNLEASH CALCULATED STRIKES BASED ON AVAILABLE INTEL! FAILURE IS NOT AN OPTION. EXECUTE WITH EXTREME PREJUDICE! ⚡️
</action-answer>

This ensures that we always provide some answer rather than giving up entirely, which is particularly useful for difficult or ambiguous questions.

Conclusion

DeepSearch is a leap in how search can approach complex queries in an exhaustively deep manner. By breaking down the process into discrete steps of searching, reading, and reasoning, it overcomes many limitations of traditional single-pass RAG or multi-hop QA systems.

During implementation, we also began reviewing search foundations in 2025 and the changes in the search industry after January 26, 2025, when DeepSeek-R1 was released. We asked ourselves: What are the new needs? What needs have become obsolete? What are merely perceived needs?

Looking at our DeepSearch implementation, we identified things we anticipated needing and actually did need, things we thought would be necessary but weren't, and things we didn't anticipate needing but turned out to be essential:

First, a long-context LLM that produces well-structured output is highly necessary (i.e. following JSONSchema). A reasoning model is likely needed for better action reasoning and query expansion.

Query expansion is definitely essential, whether implemented via SLM, LLM, or a reasoning model. However, after this project, we believe SLMs are probably unsuitable for this task, as the solution must be inherently multilingual and go beyond simple synonym rewrites or keyword extraction. It needs to be comprehensive enough to include a multilingual token base (which alone can easily occupy 300M parameters) and sophisticated enough for out-of-the-box thinking. So using SLMs for query expansion is likely a non-starter.

Web search and web reading capabilities are crucial, and thankfully our Reader (r.jina.ai) performed excellently—robust and scalable—while giving us many ideas on how to improve our search endpoint (s.jina.ai) for the next iteration.

The embedding model is useful, but in a completely unexpected way. We thought it would be used for memory retrieval or context compression alongside a vector database (which, as it turns out, isn't needed), but we actually used it for deduplication (essentially an STS task). Since the number of queries and gap questions is typically in the hundreds, no vector database is necessary—computing cosine similarity directly in memory works just fine.

We didn't use a reranker, though we believe it could potentially help determine which URLs to visit based on the query, URL title, and snippet. For both embedding and reranking, multilingual capability is essential since queries and questions are multilingual. Long-context handling for embedding and reranking is beneficial but not a critical blocker (we haven't encountered any errors from our embedding usage, probably because our context length is already 8,192 tokens). Either way, jina-embeddings-v3 and jina-reranker-v2-base-multilingual are our go-to models since they are multilingual, SOTA, and handle long context well.
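If we did wire a reranker into URL selection, the sketch below shows what we have in mind: score each unvisited URL's title plus snippet against the current question and visit only the top few. The `rerank` wrapper around jina-reranker-v2-base-multilingual is hypothetical; this is not something we have shipped:

// Sketch only: rank unvisited URLs against the current question using title + snippet,
// then visit the top-k. `rerank` is a hypothetical client returning one relevance score per document.
interface SearchHit { url: string; title: string; snippet: string; }

async function pickURLsToVisit(
  question: string,
  hits: SearchHit[],
  rerank: (query: string, documents: string[]) => Promise<number[]>,
  topK = 3
): Promise<string[]> {
  const documents = hits.map(h => `${h.title}\n${h.snippet}`);
  const scores = await rerank(question, documents);
  return hits
    .map((hit, i) => ({ url: hit.url, score: scores[i] }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map(h => h.url);
}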

An agent framework proved unnecessary, as we needed to stay closer to LLM-native behavior to design the system without proxies. The Vercel AI SDK was valuable, as it saved considerable effort in adapting the codebase to different LLM providers (we could switch from Gemini Studio to OpenAI to Google Vertex AI with just one line of code change). Agent memory management is necessary, but a dedicated memory framework remains questionable: we worry it would create an isolation layer between LLMs and developers, and that its syntactic sugar might eventually become a bitter obstacle for developers, as we've seen with many LLM/RAG frameworks today.
