Anthropic-Style Citations with Any LLM

Image sourced from https://www.anthropic.com/news/introducing-citations-api and edited by the author

Anthropic’s new Citations feature for Claude recently went viral because it lets you attach references to your AI’s answers automatically — yet it’s only available for Claude. If your pipeline runs on ChatGPT, a local open-source model, or something else, you’re out of luck with the official approach.

That’s why I put together this article: to show how you can roll your own Anthropic-style citation system, step by step, for any LLM you want. We’ll store chunks in a vector DB, retrieve them, pass them to the LLM with instructions on how to produce <CIT> tags referencing specific sentences, and then parse the final answer to display a neat, interactive UI for each citation. Yes, it’s a bit messy—and, if I had my choice, I’d use Anthropic’s built-in feature. But if you can’t, here’s your alternative.

Note: Anthropic likely uses a single-pass approach (as we do here) to generate both the final answer and the citations inline. Another approach is two-pass: first the model writes an answer, then we ask it to label each snippet with references. That can be more accurate, but it is also more complex and slower. For many use cases, inline citations are enough.

1. The Architecture at a Glance

Below is a quick look at how our do-it-yourself citation system works:

  1. User Query: We ask, say, “How does Paul Graham decide what to work on?”
  2. Vector DB Search: We embed the query, search in KDB.AI for relevant text chunks.
  3. Chunks: The top hits are split by sentences (we do naive splitting or some advanced method) to allow fine-grained references like sentences='2-4'.
  4. LLM Prompt: We instruct the model to produce an answer that includes inline tags (<CIT chunk_id='0' sentences='1-3'>…</CIT>) around specific phrases.
  5. LLM Output: The single output includes both the final text and embedded citation tags.
  6. Parser: We parse those tags out, map them back to the original chunk sentences, and build metadata (like the exact snippet the model claims is from chunk 0, sentences 1–3).
  7. UI: Finally, we show an interactive popover or tooltip in the user’s browser, letting them see the reference text from the chunk.
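To make step 5 concrete, here is what a tagged model answer might look like. The answer string below is an illustrative sample, not real model output, but it shows that the tag format is trivially machine-readable:

```python
import re

# Illustrative sample of the tag format the LLM is asked to emit (step 5)
answer = (
    "Graham argues that <CIT chunk_id='0' sentences='1-3'>"
    "you should work on what you would build for yourself</CIT>."
)

# Extract (chunk_id, sentence range, cited snippet) triples with a regex
tags = re.findall(r"<CIT chunk_id='(\d+)' sentences='(\d+-\d+)'>(.*?)</CIT>", answer)
print(tags)
# [('0', '1-3', 'you should work on what you would build for yourself')]
```

Steps 6 and 7 then turn these triples into structured metadata and an interactive UI.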

The result will be the following UI, with hoverable sentences in the style of Gemini:

Image source: author

2. Full Code: Start to Finish

If you want to try it yourself in Colab, check out this notebook.

We’ll build a single-pass inline citation approach similar to what Anthropic likely uses under the hood. Note that much of the complexity comes from wanting to cite not only chunks, but fine-grained sentences within those chunks. I do this because displaying the exact sentences to the user is usually a good idea. Without this requirement the code becomes substantially simpler, and you can easily modify what follows to return chunk-level citations instead.

2.1 Setup and Dependencies

We’ll rely on:

  • kdbai_client to store chunks in the KDB.AI vector database.
  • fastembed, a library for generating local embeddings quickly.
  • llama-index to parse Paul Graham’s dataset.
!pip install llama-index fastembed kdbai_client onnxruntime==1.19.2

import os
from getpass import getpass
import kdbai_client as kdbai
import time
from llama_index.core import Document, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
import pandas as pd
from fastembed import TextEmbedding
import openai
import textwrap

2.2 Connecting to KDB.AI

We store all data in KDB.AI — each chunk along with its 384-dimensional embedding. This setup allows us to perform vector similarity searches to quickly identify the most relevant chunks.

You can sign up for KDB.AI Server for free here: https://trykdb.kx.com/kdbai/signup/

os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")  # prompt instead of hardcoding
fastembed = TextEmbedding()
KDBAI_TABLE_NAME = "paul_graham"

# start session with KDB.AI Server
session = kdbai.Session(endpoint="http://localhost:8082")
database = session.database("default")

# Drop table if exists
try:
    database.table(KDBAI_TABLE_NAME).drop()
except kdbai.KDBAIException:
    pass
schema = [
    dict(name="text", type="bytes"),
    dict(name="embedding", type="float32s")
]
indexes = [dict(name="flat_index", column="embedding", type="flat", params=dict(metric="L2", dims=384))]
table = database.create_table(KDBAI_TABLE_NAME, schema=schema, indexes=indexes)

2.3 Data Prep: Paul Graham Essays

We fetch the Paul Graham essays, parse them into ~500 token chunks with 100 token overlap to preserve context:

!mkdir -p ./data
!llamaindex-cli download-llamadataset PaulGrahamEssayDataset --download-dir ./data

node_parser = SentenceSplitter(chunk_size=500, chunk_overlap=100)
essays = SimpleDirectoryReader(input_dir="./data/source_files").load_data()
docs = node_parser.get_nodes_from_documents(essays)
len(docs)

We embed each chunk with a local model:

embedding_model = TextEmbedding()
documents = [doc.text for doc in docs]
embeddings = list(embedding_model.embed(documents))

records_to_insert_with_embeddings = pd.DataFrame({
    "text": [d.encode('utf-8') for d in documents],
    "embedding": embeddings
})

table.insert(records_to_insert_with_embeddings)

2.4 RAG Implementation

Our data is now in our table, and we can query it:

query = "How does Paul Graham decide what to work on?"
query_embedding = list(embedding_model.embed([query]))[0].tolist()

search_results = table.search({"flat_index": [query_embedding]}, n=10)
search_results_df = search_results[0]
df = pd.DataFrame(search_results_df)
df.head(5)

Image source: author

We have the 10 chunks most relevant to our query. Next, we’ll feed them to the LLM.

2.5 The Citation Pipeline Code

Here is the part that does the heavy lifting. Much of this code is not for the actual citation generation, but instead to meaningfully display the result, which is tedious in Python.

First, we need to import some more libraries.


import os
import re
import json
import openai
import pandas as pd
from typing import List, Dict, Any
from IPython.display import display, HTML

Step 1: Prepare Data (Splitting Text into Sentences)

Before calling the LLM, we need a way to reference individual sentences within retrieved text chunks. This function splits a text chunk into sentences and assigns metadata like start and end character offsets.

################################################################################
# STEP 1: PREPARE DATA
################################################################################

def parse_chunk_into_sentences(chunk_text: str) -> List[Dict[str, Any]]:
    """
    Splits 'chunk_text' into naive 'sentences' with start/end offsets.
    Returns a list of dicts like:
      {
        "sentence_id": int,
        "text": str,
        "start_char": int,
        "end_char": int
      }
    """
    # Simple regex split on '.' that keeps the period. A robust approach
    # might use spaCy or NLTK, but this suffices for demonstration.
    raw_parts = re.split(r'(\.)', chunk_text)

    # We'll combine text + punctuation
    combined = []
    for i in range(0, len(raw_parts), 2):
        text_part = raw_parts[i].strip()
        punct = ""
        if i+1 < len(raw_parts):
            punct = raw_parts[i+1]
        if text_part or punct:
            combined_text = (text_part + punct).strip()
            if combined_text:
                combined.append(combined_text)

    sentences = []
    offset = 0
    for s_id, s_txt in enumerate(combined, start=1):
        start_char = offset
        end_char = start_char + len(s_txt)
        sentences.append({
            "sentence_id": s_id,
            "text": s_txt,
            "start_char": start_char,
            "end_char": end_char
        })
        offset = end_char + 1  # assume space or newline after each
    return sentences

We split into sentences so that the LLM can not only cite specific chunks, but return the exact sentences in the chunk that are relevant.
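To see the offsets in action, here is a condensed version of the same naive splitting logic, applied to a two-sentence string:

```python
import re

def parse_chunk_into_sentences(chunk_text):
    # Condensed version of the splitter above: split on '.', keep the period,
    # and record 1-based sentence IDs with character offsets.
    parts = re.split(r'(\.)', chunk_text)
    combined = []
    for i in range(0, len(parts), 2):
        text = parts[i].strip()
        punct = parts[i + 1] if i + 1 < len(parts) else ""
        if (text + punct).strip():
            combined.append((text + punct).strip())
    sentences, offset = [], 0
    for sid, s in enumerate(combined, start=1):
        sentences.append({"sentence_id": sid, "text": s,
                          "start_char": offset, "end_char": offset + len(s)})
        offset += len(s) + 1  # assume one space or newline between sentences
    return sentences

sents = parse_chunk_into_sentences("I wrote essays. I also wrote Lisp.")
print(sents[0]["text"])                              # I wrote essays.
print(sents[1]["sentence_id"], sents[1]["start_char"])  # 2 16
```

The offsets let us later highlight the exact character span inside the original chunk.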

Step 2: Call OpenAI to Generate a Response with Citations

Now that we can reference individual sentences, let’s query the LLM and instruct it to generate citations inline.

################################################################################
# STEP 2: CALL OPENAI WITH A ROBUST SYSTEM PROMPT
################################################################################

def call_openai_with_citations(chunks: List[str], user_query: str) -> str:
    """
    Asks the LLM to produce a single continuous answer,
    referencing chunk_id + sentences range as:
      <CIT chunk_id='N' sentences='X-Y'>...some snippet...</CIT>.
    """

    # If you want, set your API key in code or rely on environment variable
    # openai.api_key = "sk-..."
    if not openai.api_key and "OPENAI_API_KEY" in os.environ:
        openai.api_key = os.environ["OPENAI_API_KEY"]

    # We'll craft a robust system prompt with examples
    system_prompt = (
        "You have a collection of chunks from a single document, each chunk may have multiple sentences.\n"
        "Please write a single continuous answer to the user's question.\n"
        "When you reference or rely on a specific portion of a chunk, cite it as:\n"
        "  <CIT chunk_id='N' sentences='X-Y'>the snippet of your final answer</CIT>\n"
        "Where:\n"
        "  - N is the chunk index.\n"
        "  - X-Y is the range of sentence numbers within that chunk. Example: 'sentences=2-4'.\n"
        "  - The text inside <CIT> is part of your answer, not the original chunk text.\n"
        "  - Keep your answer minimal in whitespace. Do not add extra spaces or line breaks.\n"
        "  - Only add <CIT> tags around the key phrases of your answer that rely on some chunk.\n"
        "    E.g. 'He stated <CIT chunk_id='3' sentences='1-2'>it was crucial to experiment early</CIT>.'\n\n"
        "Remember: The text inside <CIT> is your final answer's snippet, not the chunk text itself.\n"
        "The user question is below."
    )

    # We just show the user the chunk texts:
    chunks_info = "\n\n".join(
        f"[Chunk {i}] {chunk}" for i, chunk in enumerate(chunks)
    )

    # We create the conversation
    messages = [
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": f"{chunks_info}\n\nQuestion: {user_query}\n"
        }
    ]

    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        temperature=0.3,
        max_tokens=1024
    )
    return response.choices[0].message.content

This function sends a query to OpenAI, instructing it to generate a response that includes citations inline. The prompt explicitly directs the model to use <CIT> tags to mark references, ensuring each citation includes both the corresponding chunk_id and the specific sentence range (sentences=X-Y). For example, OpenAI might return a response like:

Paul Graham suggests that <CIT chunk_id='2' sentences='1-2'>choosing work should be based on curiosity</CIT>.

This approach ensures that the final answer is self-contained and properly annotated, allowing for precise attribution of information.

Step 3: Parse the LLM Response to Extract Citations

Once OpenAI returns a response, we need to parse the citation tags and extract structured data.

################################################################################
# STEP 3: PARSE THE LLM RESPONSE
################################################################################

def parse_response_with_sentence_range(response_text: str) -> Dict[str, Any]:
    """
    Produce a single block with:
    {
      "type": "text",
      "text": <the final answer minus CIT tags but with snippet inline>,
      "citations": [
        {
          "chunk_id": int,
          "sentences_range": "X-Y",
          "answer_snippet": snippet,
          "answer_snippet_start": int,
          "answer_snippet_end": int
        },
        ...
      ]
    }
    """
    pattern = re.compile(
        r'(.*?)<CIT\s+chunk_id=[\'"](\d+)[\'"]\s+sentences=[\'"](\d+-\d+)[\'"]>(.*?)(?:</CIT>|(?=<CIT)|$)',
        re.DOTALL
    )
    final_text = ""
    citations = []
    idx = 0

    while True:
        match = pattern.search(response_text, idx)
        if not match:
            # leftover
            leftover = response_text[idx:]
            final_text += leftover
            break

        text_before = match.group(1)
        chunk_id_str = match.group(2)
        sent_range = match.group(3)
        snippet = match.group(4)

        final_text += text_before

        start_in_answer = len(final_text)
        final_text += snippet
        end_in_answer = len(final_text)

        citations.append({
            "chunk_id": int(chunk_id_str),
            "sentences_range": sent_range,
            "answer_snippet": snippet,
            "answer_snippet_start": start_in_answer,
            "answer_snippet_end": end_in_answer
        })

        idx = match.end()

    return {
        "type": "text",
        "text": final_text,
        "citations": citations
    }

This function extracts and structures citations from the LLM response by identifying <CIT> tags using a regex pattern. It removes these tags from the final text while storing metadata like chunk_id, sentence range, and snippet position separately. The output is a dictionary with the cleaned response and a list of citations, enabling precise mapping of references for user-friendly display.
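A condensed version of this parser, run on a hand-written sample answer, shows the cleaned text and offsets we get back:

```python
import re

# Same pattern as in parse_response_with_sentence_range above
CIT_RE = re.compile(
    r"(.*?)<CIT\s+chunk_id=['\"](\d+)['\"]\s+sentences=['\"](\d+-\d+)['\"]>(.*?)(?:</CIT>|(?=<CIT)|$)",
    re.DOTALL,
)

def strip_citations(response_text):
    # Condensed version of the parser: strip <CIT> tags, keep the snippet
    # inline, and record where each snippet lands in the final answer.
    final_text, citations, idx = "", [], 0
    while (m := CIT_RE.search(response_text, idx)):
        final_text += m.group(1)
        start = len(final_text)
        final_text += m.group(4)
        citations.append({"chunk_id": int(m.group(2)),
                          "sentences_range": m.group(3),
                          "answer_snippet": m.group(4),
                          "answer_snippet_start": start,
                          "answer_snippet_end": len(final_text)})
        idx = m.end()
    return final_text + response_text[idx:], citations

text, cits = strip_citations(
    "He said <CIT chunk_id='2' sentences='1-2'>curiosity drives the choice</CIT>."
)
print(text)                                            # He said curiosity drives the choice.
print(cits[0]["chunk_id"], cits[0]["sentences_range"])  # 2 1-2
```

Note the alternation `(?:</CIT>|(?=<CIT)|$)` at the end of the pattern: it tolerates a model that forgets a closing tag by stopping at the next `<CIT>` or at the end of the text.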

Step 4: Matching Cited Sentences and Finding Character Ranges

Once we have extracted citations from the LLM response, we need to match them back to the original text chunks. This step ensures that each reference in the answer corresponds accurately to its source. The function below looks up the cited chunk_id and sentence range, retrieves the relevant sentences from our indexed text, and records their exact character offsets. This allows us to display precise references without including irrelevant information.

################################################################################
# STEP 4: MATCH CITED SENTENCES + FIND CHAR RANGES IN CHUNK
################################################################################

def gather_sentence_data_for_citations(block: Dict[str, Any], sentence_map: Dict[int, List[Dict[str, Any]]]) -> Dict[str, Any]:
    """
    For each citation, parse the chunk_id + sentences='X-Y'.
    Gather the text of those sentences from 'sentence_map[chunk_id]'
    and record their combined text plus start/end offsets in the chunk.
    """
    for c in block["citations"]:
        c_id = c["chunk_id"]
        sent_range = c["sentences_range"]
        try:
            start_sent, end_sent = map(int, sent_range.split("-"))
        except ValueError:
            # malformed range from the model; fall back to the first sentence
            start_sent, end_sent = 1, 1

        # get the sentence list for that chunk
        sents_for_chunk = sentence_map.get(c_id, [])
        # filter the range
        relevant_sents = [s for s in sents_for_chunk if start_sent <= s["sentence_id"] <= end_sent]

        if relevant_sents:
            combined_text = " ".join(s["text"] for s in relevant_sents)
            chunk_start_char = relevant_sents[0]["start_char"]
            chunk_end_char = relevant_sents[-1]["end_char"]
        else:
            combined_text = ""
            chunk_start_char = -1
            chunk_end_char = -1

        c["chunk_sentences_text"] = combined_text
        c["chunk_sentences_start"] = chunk_start_char
        c["chunk_sentences_end"] = chunk_end_char

    return block

################################################################################
# STEP 5: BUILD HTML FOR DISPLAY
################################################################################

def build_html_for_block(block: Dict[str, Any]) -> str:
    """
    Build an HTML string that underlines each snippet in the final answer
    and shows a tooltip with 'chunk_sentences_text' plus start/end offsets.
    """
    css = """
    <style>
    body {
      font-family: Arial, sans-serif;
      margin: 20px;
      line-height: 1.6;
    }
    .tooltip {
      position: relative;
      text-decoration: underline dotted;
      cursor: help;
    }
    .tooltip .tooltiptext {
      visibility: hidden;
      width: 400px;
      background: #f9f9f9;
      color: #333;
      text-align: left;
      border: 1px solid #ccc;
      border-radius: 4px;
      padding: 10px;
      position: absolute;
      z-index: 1;
      top: 125%;
      left: 50%;
      transform: translateX(-50%);
      opacity: 0;
      transition: opacity 0.3s;
    }
    .tooltip:hover .tooltiptext {
      visibility: visible;
      opacity: 1;
    }
    </style>
    """

    full_text = block["text"]
    citations = sorted(block["citations"], key=lambda x: x["answer_snippet_start"])

    html_parts = [f"<!DOCTYPE html><html><head><meta charset='UTF-8'>{css}</head><body>"]
    cursor = 0

    for cit in citations:
        st = cit["answer_snippet_start"]
        en = cit["answer_snippet_end"]

        if st > cursor:
            html_parts.append(full_text[cursor:st])

        snippet_text = full_text[st:en]

        # Build tooltip with chunk sentences
        tooltip_html = f"""
        <span class="tooltip">
          {snippet_text}
          <span class="tooltiptext">
            <strong>Chunk ID:</strong> {cit["chunk_id"]}<br>
            <strong>Sentence Range:</strong> {cit["sentences_range"]}<br>
            <strong>Chunk Sentences Offset:</strong> {cit["chunk_sentences_start"]}-{cit["chunk_sentences_end"]}<br>
            <strong>Chunk Sentences Text:</strong> {cit["chunk_sentences_text"]}
          </span>
        </span>
        """
        html_parts.append(tooltip_html)
        cursor = en

    if cursor < len(full_text):
        html_parts.append(full_text[cursor:])

    html_parts.append("</body></html>")
    return "".join(html_parts)

def display_html_block(block: Dict[str, Any]):
    from IPython.display import display, HTML
    html_str = build_html_for_block(block)
    display(HTML(html_str))

Why This Step Matters

By linking each citation back to its exact sentence and character offsets, we ensure that the references displayed in the final answer are both accurate and contextually relevant. This prevents citations from being too broad or misleading, making the results more transparent and trustworthy.

Now that the references are structured correctly, the next step is to build a visual representation that allows users to interact with the citations.
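Here is the matching step in isolation, on a tiny hypothetical `sentence_map` (the chunk text and offsets below are made up for illustration):

```python
# Hypothetical parsed sentences for chunk 2, as produced by step 1
sentence_map = {2: [
    {"sentence_id": 1, "text": "Curiosity matters.", "start_char": 0,  "end_char": 18},
    {"sentence_id": 2, "text": "Follow it.",         "start_char": 19, "end_char": 29},
    {"sentence_id": 3, "text": "Then ship.",         "start_char": 30, "end_char": 40},
]}
# A citation as extracted from the LLM response in step 3
citation = {"chunk_id": 2, "sentences_range": "1-2"}

# The matching logic of gather_sentence_data_for_citations, for one citation
start, end = map(int, citation["sentences_range"].split("-"))
relevant = [s for s in sentence_map[citation["chunk_id"]]
            if start <= s["sentence_id"] <= end]
citation["chunk_sentences_text"] = " ".join(s["text"] for s in relevant)
citation["chunk_sentences_start"] = relevant[0]["start_char"]
citation["chunk_sentences_end"] = relevant[-1]["end_char"]

print(citation["chunk_sentences_text"])  # Curiosity matters. Follow it.
```

The enriched citation now carries both the source sentences and their character span in the chunk, which is exactly what the tooltip in step 5 displays.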

2.6 Running the Pipeline

Let’s put it all together with a main function that runs RAG on our query and displays the result.

################################################################################
# PUTTING IT ALL TOGETHER
################################################################################

def main(df, user_query: str):
    """
    Full pipeline:
      1) We'll parse each chunk into sentences.
      2) We'll call openai with a robust system prompt for <CIT> usage.
      3) We'll parse the LLM's response for chunk_id + sentences='X-Y'.
      4) We'll gather the chunk sentences text, produce a final block with citations.
      5) We'll build HTML and display in Colab.
    """

    # 1) Prepare chunk data
    # - We'll assume df has columns: chunk_id, text
    # - We'll parse each chunk into sentence_map
    sentence_map = {}
    chunk_texts = []
    for i, row in df.iterrows():
        c_id = row["chunk_id"]
        c_txt = row["text"]
        # The table stores text as UTF-8 bytes; decode back to str if needed
        if isinstance(c_txt, (bytes, bytearray)):
            c_txt = c_txt.decode("utf-8")
        # build the sentence parse
        sents = parse_chunk_into_sentences(c_txt)
        sentence_map[c_id] = sents
        # Store chunk_texts in chunk_id order. If chunk_id is not sequential
        # from 0..N you might use a dict, but we take the simplest approach
        # and expand the list as needed.
        if len(chunk_texts) <= c_id:
            chunk_texts.extend([""]*(c_id - len(chunk_texts)+1))
        chunk_texts[c_id] = c_txt

    # 2) Call LLM
    answer_text = call_openai_with_citations(chunk_texts, user_query)

    # 3) Parse the response
    block = parse_response_with_sentence_range(answer_text)

    # 4) Enrich each citation with chunk sentences
    block = gather_sentence_data_for_citations(block, sentence_map)

    # 5) Display final result
    print("----- JSON OUTPUT -----")
    print(json.dumps({"content": [block]}, indent=2, ensure_ascii=False))

    display_html_block(block)
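One wrinkle before calling `main`: the search results DataFrame from section 2.4 stores text as UTF-8 bytes and has no `chunk_id` column, while `main` expects integer `chunk_id` and string `text` columns. A minimal sketch of the adaptation (the two stand-in rows replace the real search output):

```python
import pandas as pd

# Hypothetical stand-in for the KDB.AI search output, which stores text as bytes
raw = pd.DataFrame({"text": [b"First chunk.", b"Second chunk."]})

# Adapt it to the shape main() expects: integer chunk_id plus decoded text
rag_df = pd.DataFrame({
    "chunk_id": range(len(raw)),
    "text": [t.decode("utf-8") for t in raw["text"]],
})
print(rag_df["text"].tolist())  # ['First chunk.', 'Second chunk.']
# then: main(rag_df, query)
```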

Now we get:

  1. The final consolidated answer from the LLM, minus the <CIT> tags in the displayed text.
  2. Underlined snippets (where <CIT> was) that show a tooltip with the chunk’s exact text.

Image source: author

As you can see, when we hover over a sentence, we see the exact chunk it cites, as well as the specific sentences of that chunk that are relevant to the answer. This granularity means we can display the source text to the user without showing irrelevant information.

Although a lot of this code is for displaying the RAG answer with citations in a meaningful way, we end up with JSON that can be displayed much more easily in a web app:

Image source: author

3. Why This Single-Pass, Inline Approach?

A common alternative is a two-step approach:

  1. Ask the model to produce the best final answer, no references.
  2. Pass that final answer plus the top chunks to the model again, asking “where did each piece come from?”

Pros: Possibly more accurate references.
Cons: Double the LLM calls, more complicated parsing, and you might have to handle partial overlaps.
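If you do want to experiment with the two-pass variant, the second call mostly comes down to a different prompt. A hedged sketch of that attribution prompt (the function and wording are illustrative, not from the pipeline above):

```python
def build_attribution_prompt(answer: str, chunks: list[str]) -> str:
    # Hypothetical second-pass prompt: the model labels an already-written
    # answer with <CIT> tags instead of generating citations inline.
    chunk_block = "\n".join(f"[Chunk {i}] {c}" for i, c in enumerate(chunks))
    return (
        "Given the chunks below and a finished answer, wrap each answer phrase "
        "that relies on a chunk in <CIT chunk_id='N' sentences='X-Y'> tags. "
        "Do not change the answer's wording.\n\n"
        f"{chunk_block}\n\nAnswer: {answer}"
    )

prompt = build_attribution_prompt("Curiosity drives his choices.",
                                  ["He follows curiosity.", "He ships early."])
print(prompt.splitlines()[-1])  # Answer: Curiosity drives his choices.
```

The output of that second call can then be fed to the same `parse_response_with_sentence_range` parser, since the tag format is unchanged.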

Anthropic likely uses a single-pass approach for Citations because it’s simpler, and if your model is well-trained, the references can still be quite accurate. But you might see an occasional mismatch. That’s life in RAG.

4. Wrap-Up

We overcame the main limitation: Anthropic’s built-in Citations feature is great if you rely on Claude, but if you want to replicate it on GPT-4o or any other model, you can absolutely do so by:

  1. Splitting each chunk into sentences for fine-grained references (optional).
  2. Telling the LLM to label each snippet of the final answer with <CIT> tags referencing chunk ID + sentence range.
  3. Parsing that output and building an interactive UI.

Yes, this code is more complicated than toggling a single parameter in Anthropic’s API — and you’ll see edge cases. But it works with any LLM. One day, maybe OpenAI (or a library) will release official citations for GPT. Until then, you’ve got a blueprint for building your own.

Happy citing! If you have questions or run into interesting challenges, feel free to reach out. And if you would like cutting-edge RAG/LLM content injected regularly into your feed, follow me on LinkedIn!
