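Core Algorithms and Source Code of Generative AI Large Language Models (LLMs): Pre-training

Original article by 素数 (李阳), 支付宝体验科技 (Alipay Experience Technology)

January 23, 2025, 04:34

With the rapid development of artificial intelligence, Large Language Models (LLMs) have shown strong potential across many fields. This article examines the core algorithms of the LLM pre-training stage and their key source-code implementations, aiming to explain the underlying principles and implementation details systematically. By analyzing the architecture, the training method, and the code in detail, we hope to offer a useful reference for learners and researchers working on large models; interested readers can follow the technical thread of the article to explore further.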

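1. Introduction

Since the arrival of OpenAI's GPT models [1-3], worldwide enthusiasm for large language models (LLMs) has kept rising; more recently, the reasoning ability of OpenAI o1 [4] has been striking. The release of the Llama series [5-7] further energized the open-source community. In particular, Llama3 (Meta, 2024) [7], a state-of-the-art family of open-weight models, has markedly narrowed the gap to the top proprietary models. These developments show that a growing number of competitive LLMs are pursuing breakthroughs similar to OpenAI GPT, including Qwen, DeepSeek, Llama, Mistral, and Gemma, which are also released with open weights. At the Apsara Conference (云栖大会) on September 19, 2024, Alibaba Cloud released the Tongyi Qianwen Qwen2.5 model series; its performance is reported to be on par with Llama3-405B, making Qwen2.5-72B one of the strongest open-source models worldwide.

LLM training usually goes through two main stages [7-10]. The first is pre-training, in which the model is trained at scale to predict the next token autoregressively. The second is post-training, in which the pre-trained model undergoes supervised fine-tuning (SFT) followed by preference alignment, such as direct preference optimization (DPO) and reinforcement learning from human feedback (RLHF), to better match human preferences. This article focuses on the pre-training stage and, using the Qwen2.5 series as the running example, analyzes the core algorithms of the pre-trained model and their code implementation.

Purpose of this article: a) to explain the underlying principles and implementation details systematically, with a detailed analysis of LLM architecture, training method, and code; b) to offer a useful reference for learners and researchers working on large models, so that interested readers can follow the technical thread to explore further.

Structure of this article: Section 2 reviews how mainstream LLM architectures evolved; Section 3 analyzes the core pre-training algorithms and their code using the Qwen2.5 series; Section 4 gives an overview of the pre-training process; Section 5 concludes.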

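2. Mainstream LLM Architectures

2.1 Mainstream LLM Architectures

Following the success and influence of the GPT series, NLP has moved toward generative large language models, and the decoder architecture represented by GPT [1-3], Llama [5-7], and Qwen [8-10] has become the mainstream design. To describe it clearly, we first need the classic Transformer [11] encoder-decoder architecture that preceded it. Figure 1 compares the functional modules and structure of the two; details follow.

The Encoder-Decoder architecture is the classic model structure in NLP, widely used for machine translation and many other tasks; the original Transformer uses it. The encoder processes the input with bidirectional self-attention, while the decoder uses cross-attention together with causally masked self-attention and generates the output autoregressively. Pre-trained language models built on this architecture perform well on many natural language understanding and generation tasks, but today only a few large language models, such as T5 [12, 13], still adopt it.

The Decoder-only architecture is the mainstream structure of current LLMs. It has two variants: the Causal Decoder and the Prefix Decoder. The Causal Decoder reuses the Transformer decoder with two important changes. First, it does not explicitly separate input from output; it uses unidirectional, causally masked self-attention, so each token attends only to the tokens before it and to itself, and the output sequence is predicted autoregressively. Second, since there is no encoder, the cross-attention module that attends to the encoder output is removed. The Prefix Decoder modifies the causal mask and borrows from the encoder-decoder design: the input (prefix) is encoded with bidirectional self-attention, while the output uses unidirectional causal self-attention. Unlike an encoder-decoder, a prefix decoder shares its parameters between encoding and decoding rather than splitting into separate encoder and decoder stacks. Note that "decoder architecture" usually means the causal decoder, as in GPT, Llama, and Qwen; the prefix decoder is less common, with the GLM [14] series as an example.

Figure 1: Encoder-Decoder vs. Decoder-only LLM architectures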
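
The difference between the two decoder variants comes down to the attention mask. The following is a minimal sketch, assuming PyTorch and a boolean convention where `True` means "query position i may attend to key position j"; the helper names `causal_mask` and `prefix_mask` are illustrative and not taken from any library.

```python
import torch


def causal_mask(seq_len: int) -> torch.Tensor:
    """Causal decoder: position i may attend to positions j <= i only."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))


def prefix_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    """Prefix decoder: bidirectional inside the prefix, causal for the rest."""
    mask = causal_mask(seq_len)
    mask[:prefix_len, :prefix_len] = True  # prefix tokens see the whole prefix
    return mask


causal = causal_mask(4)
prefix = prefix_mask(4, prefix_len=2)
print(causal.int())
print(prefix.int())
```

With a prefix of length 2, the only entry that changes is row 0, column 1: the first prefix token may now look at the second one, while generated positions remain strictly causal.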

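2.2 Why the Decoder Architecture

Looking at current LLM architectures, the major vendors are converging on the Decoder-only design. The vast majority of LLMs have very similar structures, which have effectively become the standard for large models. The table below lists some typical LLM architectures and their detailed configurations.

Table 1: LLM architecture configurations (L is the number of layers, N the number of attention heads, H the hidden-state size). Data source: [15]

Since most language tasks can be cast as prediction conditioned on the input, decoder-only LLMs may have an advantage. Some studies also show that decoder-only LLMs transfer naturally to certain tasks without fine-tuning, simply by autoregressively predicting the next token.

In terms of inference efficiency, the Decoder-only architecture can keep reusing the key-value cache (KV-Cache) [16], which makes it particularly effective in multi-turn dialogue. Because each token's representation depends only on the tokens before it, the keys and values cached for earlier turns stay valid as the conversation grows; this is hard to achieve with an Encoder-Decoder architecture.

In addition, the growing maturity of this architecture owes much to the lessons learned in OpenAI's early explorations [1-3], which pushed the technique forward.

The open-source Qwen and Llama series are typical Decoder-only models. We take our own product, Qwen2.5, as the example and analyze its architecture and core algorithms in detail.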

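2.3 Qwen2.5 Model Architecture

Figure 2 shows the structure of the Qwen2.5-72B model on Hugging Face. Reading the model code in the Qwen2 files of Hugging Face Transformers gives a clearer picture of the functional modules of the Qwen series, shown in Figure 3.

Section 3 covers the core algorithms and code of these modules: token encoding, positional encoding, normalization, attention, and the feed-forward network (FFN).

Figure 2: Overview of the Hugging Face Qwen2.5-72B model structure

[https://huggingface.co/Qwen/Qwen2.5-72B-Instruct]

Figure 3: Core algorithm modules of the Qwen2.5 pre-training stage

[https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen2/modeling_qwen2.py]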

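3. Core Algorithms of the Qwen2.5 Pre-training Stage

3.1 Tokenizer

A tokenizer converts a text string into smaller units called "tokens" and is one of the basic components of a large language model. Depending on the tokenization strategy, these units can be words, word roots, subwords, or characters. The most commonly used approaches are BPE (Byte-Pair Encoding), WordPiece, and SentencePiece (a toolkit that implements BPE and Unigram tokenization). Traditional tokenization is dictionary-based, but on large corpora the dictionary becomes very large and hard to maintain. BPE addresses this by learning frequent subword combinations from the corpus: it iteratively merges the most frequent adjacent pair into a new token, building a vocabulary of bounded size. Many LLM tokenizers use BPE, which has become one of the classic algorithms in NLP; its effectiveness has been confirmed by strong language models such as the GPT series, RoBERTa, Llama, and Qwen, while BERT and T5 use the closely related WordPiece and SentencePiece tokenizers, respectively. Since many introductions to BPE are available online, this article does not describe it in detail.

Like earlier Qwen releases, Qwen2.5 uses a tokenizer based on Byte-level BPE (B-BPE) [17]. B-BPE extends BPE by treating bytes as the basic symbols for merging, which gives finer-grained segmentation and removes the out-of-vocabulary problem. The tokenizer has high encoding efficiency, with a better compression rate than the alternatives, which helps Qwen2.5's multilingual capability. Models of all sizes share one vocabulary with 151,643 regular tokens plus a set of special control tokens (3 in earlier Qwen versions; the Qwen2.5 technical report states the set was expanded to 22). See the model's vocab file: https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/raw/main/tokenizer.json.

The example below encodes Chinese text into tokens and shows how the BPE results of Qwen and GPT-4o differ.

Figure 4: Qwen2.5 vs. GPT-4o BPE encoding results

Notes: 1. Qwen BPE encoding test page [https://dashscope.console.aliyun.com/tokenizer]

2. GPT-4o BPE encoding test page [https://platform.openai.com/tokenizer]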
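
To make the merge loop concrete, here is a toy byte-level BPE trainer in plain Python. It is a deliberately simplified sketch and not the Qwen tokenizer: it assumes a single string as the corpus, applies no pre-tokenization, and the function name `train_bpe` is hypothetical. It only illustrates "start from bytes, repeatedly merge the most frequent adjacent pair".

```python
from collections import Counter


def train_bpe(text: str, num_merges: int):
    """Toy byte-level BPE: start from UTF-8 bytes, greedily merge frequent pairs."""
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:  # nothing repeats any more, stop merging
            break
        merges.append((left, right))
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)  # replace the pair by one new token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges


text = "大模型大模型大语言模型"
tokens, merges = train_bpe(text, num_merges=10)
print(len(text.encode("utf-8")), "bytes ->", len(tokens), "tokens")
```

Because the base symbols are bytes, any input can be encoded without an "unknown" token, and concatenating the tokens always restores the original byte string.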

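3.2 Positional Encoding

3.2.1 Why Positional Encoding Is Needed

Unlike RNNs and CNNs, the Transformer relies on positional encoding as a key component; its purpose is to let the model know the relative or absolute position of each token in the input sequence. Pure attention cannot capture input order, that is, it cannot distinguish tokens at different positions. There are two common ways to address this:

Absolute positional encoding: inject position information directly into the input. This is the most common approach; a vector representing the position is added to each token's embedding so the model can tell positions apart.
Relative positional encoding: adjust the self-attention mechanism so that it can distinguish positions. Relative position information is introduced during the attention computation, so the model can better capture the relative positions between tokens.

3.2.2 Absolute Positional Encoding

Sinusoidal positional encoding (Sinusoidal PE) is the explicit solution proposed in Google's paper "Attention is All You Need" [11]:

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$

The annotated implementation follows. Two points to note:

The positional encoding matrix is deterministic and not trainable; position information is added to the token embedding at the input stage.
Dropout is applied to the combined encoding, and the result is fed into the attention module.

Sinusoidal absolute positional encoding, annotated (this is the original Transformer scheme; GPT-2 itself uses learned position embeddings):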

import math

import torch
from torch import nn


class PositionalEncoding(nn.Module):
    def __init__(self, max_len: int, d_model: int, dropout_prob: float = 0.1):
        super().__init__()
        self.max_len = max_len
        self.d_model = d_model      # assumed to be even
        self.dropout = nn.Dropout(dropout_prob)

    def forward(self, x: torch.Tensor):
        '''
            x: token embeddings, shape (B, seq_len, d_model)
        '''
        # Use the actual sequence length of x. A decoder input shifted by one token, for example,
        # is shorter than max_len, so the table is built for seq_len rows.
        seq_len = x.shape[1]
        p_encodings = torch.zeros(seq_len, self.d_model, device=x.device)     # (seq_len, d_model)
        # Position indices 0, 1, 2, ..., seq_len - 1
        position = torch.arange(0, seq_len, dtype=torch.float32, device=x.device).unsqueeze(dim=-1)   # (seq_len, 1)
        '''
            p / 10000 ** (2i/d_model) = p * 10000 ** (-2i/d_model)
            take the log of 10000 ** (-2i/d_model) ==> (-2i/d_model) * math.log(10000) = 2i * (-math.log(10000)/d_model) = 2i * tmp1 = tmp2
            then exponentiate ==> tmp3 = torch.exp(tmp2)
        '''
        tmp1 = - math.log(10000.0) / self.d_model
        tmp2 = torch.arange(0, self.d_model, 2, dtype=torch.float32, device=x.device) * tmp1    # 2i * tmp1, (d_model/2,)
        tmp3 = torch.exp(tmp2).unsqueeze(dim=0)     # (1, d_model/2)

        # (seq_len, 1) @ (1, d_model/2) -> (seq_len, d_model/2)
        # Same outer-product idea as the Llama implementation
        value = torch.matmul(position, tmp3)
        p_encodings[:, 0::2] = torch.sin(value)    # even indices 2i: sin
        p_encodings[:, 1::2] = torch.cos(value)    # odd indices 2i+1: cos
        # Add a batch dimension (1, seq_len, d_model); the positional encoding is not trainable
        p_encodings = p_encodings.unsqueeze(0).requires_grad_(False)

        # token embedding + positional encoding
        x = x + p_encodings
        return self.dropout(x)

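3.2.3 Rotary Position Embedding (RoPE)

The core idea of Rotary Position Embedding (RoPE) [18] is to realize relative positional encoding by means of absolute positional encoding. This is both theoretically elegant and practical; it is the main reason RoPE can be extended to linear attention, for example. RoPE is widely adopted by mainstream LLMs such as Qwen and Llama, and practice has confirmed its advantages. To achieve "relative positional encoding through absolute positional encoding", the authors assume that absolute position information is added to the attention query and key by

$$
\tilde{q}_m = f(q, m), \qquad \tilde{k}_n = f(k, n),
$$

so that $\tilde{q}_m$ and $\tilde{k}_n$ carry the positions $m$ and $n$. Since the core operation of attention is the inner product, we want the inner product to carry the relative position, so the authors require the identity

$$
\langle f(q, m),\, f(k, n) \rangle = g(q, k, m - n).
$$

Using complex-number (rotation) arguments, the authors obtain the rotary position matrix in $d$-dimensional space:

$$
R^{d}_{\Theta,m} =
\begin{pmatrix}
\cos m\theta_0 & -\sin m\theta_0 & 0 & 0 & \cdots & 0 & 0 \\
\sin m\theta_0 & \cos m\theta_0 & 0 & 0 & \cdots & 0 & 0 \\
0 & 0 & \cos m\theta_1 & -\sin m\theta_1 & \cdots & 0 & 0 \\
0 & 0 & \sin m\theta_1 & \cos m\theta_1 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & 0 & \cdots & \cos m\theta_{d/2-1} & -\sin m\theta_{d/2-1} \\
0 & 0 & 0 & 0 & \cdots & \sin m\theta_{d/2-1} & \cos m\theta_{d/2-1}
\end{pmatrix},
\qquad \theta_i = 10000^{-2i/d}
$$

The query-key inner product then becomes

$$
(R^{d}_{\Theta,m}\, q)^{\top} (R^{d}_{\Theta,n}\, k) = q^{\top} (R^{d}_{\Theta,m})^{\top} R^{d}_{\Theta,n}\, k = q^{\top} R^{d}_{\Theta,n-m}\, k,
$$

which depends only on the relative position $n-m$.

Because this matrix is sparse, a direct matrix product wastes computation. In practice the rotation is computed with element-wise multiplications and an addition, which is also easier to implement; the Qwen and Llama code is based on this form:

$$
R^{d}_{\Theta,m}\, x =
\begin{pmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_{d-2} \\ x_{d-1} \end{pmatrix}
\otimes
\begin{pmatrix} \cos m\theta_0 \\ \cos m\theta_0 \\ \cos m\theta_1 \\ \cos m\theta_1 \\ \vdots \\ \cos m\theta_{d/2-1} \\ \cos m\theta_{d/2-1} \end{pmatrix}
+
\begin{pmatrix} -x_1 \\ x_0 \\ -x_3 \\ x_2 \\ \vdots \\ -x_{d-1} \\ x_{d-2} \end{pmatrix}
\otimes
\begin{pmatrix} \sin m\theta_0 \\ \sin m\theta_0 \\ \sin m\theta_1 \\ \sin m\theta_1 \\ \vdots \\ \sin m\theta_{d/2-1} \\ \sin m\theta_{d/2-1} \end{pmatrix}
$$

In summary, RoPE is efficient to compute, easy to implement, convenient for length extrapolation, and compatible with linear attention. It also has a long-range decay property: the attention weight between two tokens tends to shrink as their distance grows. This helps the model weigh positions by their relative distance when handling long-range dependencies.

Points to note in the Qwen2 RoPE implementation:

Careful readers may notice that the rotation in the code differs from the matrix formula above: `rotate_half` pairs dimension $i$ with dimension $i + d/2$ instead of the adjacent pair $(2i, 2i+1)$. The two are equivalent up to a fixed permutation of the dimensions, and the change is presumably made for ease of implementation.
Unlike absolute positional encoding, which acts at the input stage, RoPE acts inside the attention computation. The query and key are rotated, and their inner product then contains their relative position; this is the elegant part of RoPE.
With absolute positional encoding in the original Transformer, position information enters the query, key, and value matrices alike. With RoPE, only the query and key carry position information; the value does not. The attention section below returns to this difference.
The length of the positional encoding also differs. Absolute positional encoding is added to the input x, so its length equals that of x, namely $d_{model}$. RoPE acts at the attention stage and must match the per-head dimension of multi-head attention, so its dimension is $head\_dim = d_{model} / num\_heads$.

RoPE, annotated: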

import torch


# Half-rotation helper: split the last dimension into halves (x1, x2) and return (-x2, x1)
def rotate_half(x):
    """
       Rotates half the hidden dims of the input.
       Applied to query and key. Dimension i is paired with dimension i + d/2, whereas the
       derivation in the paper pairs adjacent dimensions (2i, 2i+1).
    """
    x1 = x[..., : x.shape[-1] // 2]         # first half of the last dimension
    x2 = x[..., x.shape[-1] // 2 :]         # second half of the last dimension
    return torch.cat((-x2, x1), dim=-1)

# Compute the cos / sin tables of the rotation
def compute_default_rope_parameters(base: int = 10000, head_dim: int = 16, max_sequence: int = 8):
    # Compute the inverse frequencies
    # theta_{i} = base ^ (-2i/d), i = 0, 1, ..., d/2 - 1
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.int64).float() / head_dim))   # (head_dim/2,)

    # Position indices [0, 1, 2, ..., max_sequence-1]
    position_ids = torch.arange(0, max_sequence).float()

    # Outer product of two 1-D vectors -> (max_sequence, head_dim//2), entries m * theta_{i}
    freq_half = torch.outer(position_ids, inv_freq)

    # rotate_half pairs dimension i with i + head_dim/2, so both halves must carry the same angles:
    # [theta_0 ... theta_{d/2-1}, theta_0 ... theta_{d/2-1}]. An interleaved layout
    # (theta_0, theta_0, theta_1, theta_1, ...) would not match rotate_half.
    freq = torch.cat((freq_half, freq_half), dim=-1)    # (max_sequence, head_dim)

    # cos / sin, expanded from 2-D to 4-D for the attention computation
    # Note: the dimension here is head_dim (per attention head), not the model dimension of x
    cos, sin = freq.cos()[None, None, :, :], freq.sin()[None, None, :, :]
    return cos, sin     # (1, 1, seq_len, head_dim), broadcast later to (B, heads, seq_len, head_dim)


# Copied from transformers.models.llama.modeling_llama.apply_rotary_pos_emb
# In the attention computation, apply the rotary position encoding to query and key
def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
    """Applies Rotary Position Embedding to the query and key tensors.

    Args:
        q (`torch.Tensor`): The query tensor.  (B, heads, seq_len, head_dim)
        k (`torch.Tensor`): The key tensor.    (B, heads, seq_len, head_dim)
        cos (`torch.Tensor`): The cosine part of the rotary embedding.
        sin (`torch.Tensor`): The sine part of the rotary embedding.
        position_ids (`torch.Tensor`, *optional*):
            Deprecated and unused.
        unsqueeze_dim (`int`, *optional*, defaults to 1):
            The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
            sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
            that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
            k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
            cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
            the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
    Returns:
        `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
    """
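    # cos and sin produced by compute_default_rope_parameters are already 4-D, (1, 1, seq_len, head_dim),
    # so the unsqueeze of the upstream implementation is not needed here:
    # cos = cos.unsqueeze(unsqueeze_dim)
    # sin = sin.unsqueeze(unsqueeze_dim)
    # Apply the rotary position encoding to query and key: (B, heads, seq_len, head_dim)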
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
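
The relative-position property can be checked numerically. The sketch below is self-contained and assumes the half-split layout used by `rotate_half` (angles duplicated as `[θ, θ]`); the helper `rope` is a hypothetical single-vector version written for this check, not a library function. It verifies that the rotated inner product is the same for position pairs (3, 7) and (103, 107), which share the offset 4.

```python
import torch


def rotate_half(x):
    x1, x2 = x[..., : x.shape[-1] // 2], x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)


def rope(x, pos, base=10000.0):
    """Rotate one vector x (even dimension d) to position pos, half-split layout."""
    d = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float64) / d))
    angles = pos * inv_freq                       # (d/2,)
    angles = torch.cat((angles, angles), dim=-1)  # same angle for dims i and i + d/2
    return x * angles.cos() + rotate_half(x) * angles.sin()


torch.manual_seed(0)
q = torch.randn(16, dtype=torch.float64)
k = torch.randn(16, dtype=torch.float64)

score_a = torch.dot(rope(q, 3), rope(k, 7))      # offset 4
score_b = torch.dot(rope(q, 103), rope(k, 107))  # offset 4, shifted by 100
score_c = torch.dot(rope(q, 3), rope(k, 8))      # offset 5
print(score_a.item(), score_b.item(), score_c.item())
```

The first two scores agree up to floating-point error, while the third differs, which is exactly the identity ⟨f(q, m), f(k, n)⟩ = g(q, k, m − n).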

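3.3 Pre-LayerNorm (Pre-LN)

Normalization and residual connections are standard operations in deep learning and are also widely used in search, recommendation, and advertising algorithms; this section introduces them briefly.

3.3.1 BatchNorm (BN)

Since the 2015 paper "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift" [19], Batch Normalization (BN) has become a standard component of deep learning, widely used in computer vision and in search, recommendation, and advertising. Before BN, training deep models commonly suffered from poor convergence, slow training, and Internal Covariate Shift (ICS). Figure 5 shows the BN computation: compute the mean and standard deviation; normalize; apply an affine transform; update the moving averages of the global mean and standard deviation. During training, a BN layer computes the mean and standard deviation of each mini-batch. At inference time, where there is often only a single sample, it uses the global moving averages stored during training instead. The implementation is straightforward; see the source of `torch.nn.BatchNorm1d`.

Figure 5: BN computation [19]

3.3.2 LayerNorm (LN)

Layer Normalization (LN) is a normalization technique for deep networks that is particularly suited to RNNs and Transformers. LN normalizes over the feature dimension of each sample rather than over the batch dimension, so it does not depend on the batch size and handles variable-length sequences and small batches well. LN also keeps no global mean or standard deviation statistics, which avoids BN's sensitivity to small batches. These properties make LN, in one form or another, the normalization used by current LLMs such as the GPT series, Llama, and Qwen.

In practice there are two slightly different LayerNorm placements: Post-LN [11] and Pre-LN [20]. In Post-LN, shown in Figure 6(a), the residual connection comes first and layer normalization second; this is the scheme of the original Transformer paper. The input passes through a sublayer (self-attention or the feed-forward network), the residual is added, and layer normalization is applied last. Pre-LN, shown in Figure 6(b), reverses the order: the input is normalized first, then passes through the sublayer, and the residual is added at the end. The study in [20] shows that Pre-LN converges better and generalizes well. Many LLMs, including GPT-3, Llama, and Qwen, adopt Pre-LN, and practice has confirmed its good behavior.

Figure 6: Post-LN vs Pre-LN [11, 20]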
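
The two placements differ only in where the normalization sits relative to the residual addition. A minimal sketch, assuming PyTorch and a generic `sublayer` (here a plain `nn.Linear` standing in for attention or the FFN); the class names are illustrative.

```python
import torch
from torch import nn


class PostLNBlock(nn.Module):
    """Post-LN: y = LN(x + Sublayer(x))."""

    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return self.norm(x + self.sublayer(x))


class PreLNBlock(nn.Module):
    """Pre-LN: y = x + Sublayer(LN(x)); the residual path stays un-normalized."""

    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return x + self.sublayer(self.norm(x))


torch.manual_seed(0)
x = torch.randn(2, 5, 8)
y_post = PostLNBlock(8, nn.Linear(8, 8))(x)
y_pre = PreLNBlock(8, nn.Linear(8, 8))(x)
```

In the Pre-LN block the identity path from input to output is never rescaled, which is one common explanation for its more stable gradients in deep stacks.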

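Conventional layer normalization (LN) normalizes by mean and standard deviation. Most current LLMs, however, use RMSNorm (Root Mean Square Normalization), for example Qwen, Llama, and Gemma. RMSNorm was proposed by Biao Zhang and Rico Sennrich in the 2019 paper "Root Mean Square Layer Normalization" [21] to simplify and speed up mean-and-standard-deviation normalization. RMSNorm drops the mean and normalizes directly by the root mean square of the features, which reduces the computation. After normalization, a learnable scale can be applied to add flexibility. The RMSNorm formula is:

$$
\mathrm{RMSNorm}(x) = \frac{x}{\mathrm{RMS}(x)} \odot W, \qquad
\mathrm{RMS}(x) = \sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}
$$

Note for the Qwen2 implementation: the affine transform after RMSNorm has no bias parameter in Qwen (Llama has none either).

Qwen2RMSNorm: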

class Qwen2RMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        """
        Qwen2RMSNorm is equivalent to T5LayerNorm
        """
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        '''
            hidden_states: input x, (B, seq_len, d_model)
            Normalization acts on the last dimension d_model
            RMS normalization: x / RMS(x) * W
        '''
        input_dtype = hidden_states.dtype
        hidden_states = hidden_states.to(torch.float32)
        # Mean of squares over the last dimension d_model
        variance = hidden_states.pow(2).mean(-1, keepdim=True)   # (B, seq_len, 1)
        # RMS normalization; torch.rsqrt computes the reciprocal square root
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        # Scale by the learned weight, no bias
        return self.weight * hidden_states.to(input_dtype)
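
A small worked example makes the formula concrete. The function `rms_norm` below is a standalone sketch of the same computation (not the Hugging Face class), applied to the vector [3, 4] with unit weights.

```python
import torch


def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """x / RMS(x) * weight over the last dimension; no mean subtraction, no bias."""
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return weight * (x / rms)


x = torch.tensor([[3.0, 4.0]])
y = rms_norm(x, torch.ones(2))
# RMS(x) = sqrt((9 + 16) / 2) = sqrt(12.5), so y is x divided by about 3.5355
print(y)
```

After normalization the mean of squares is 1, but the mean of the vector itself is not forced to 0, which is the difference from standard LayerNorm.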

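3.4 Attention

In the Transformer, attention is the key component that captures dependencies within a sequence. It lets every element of the input interact with the others; these interactions are weighted by importance and summed to produce a new representation, which markedly improves performance on NLP tasks. Recurrent neural networks (RNNs) and long short-term memory networks (LSTMs) often struggle with long-range dependencies because they process the sequence step by step, which tends to cause vanishing or exploding gradients and hurts training. Self-attention processes all elements of the sequence in parallel and lets the model connect any two positions directly, overcoming this limitation. Beyond NLP, attention is also widely used in search, recommendation, and advertising systems. Thanks to its strength in capturing context and long-range dependencies, attention has become an indispensable part of modern deep learning architectures.

The original Transformer uses Multi-Head Attention (MHA) [11]. To reduce memory use and computation further, Multi-Query Attention (MQA) [23], Grouped-Query Attention (GQA) [24], and FlashAttention were proposed later.

3.4.1 MHA

Multi-Head Attention (MHA) uses several parallel attention heads to capture different feature subspaces. Each head has its own query, key, and value matrices and computes attention separately, which improves the expressiveness and generalization of the model.

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
$$

$$
\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q},\, KW_i^{K},\, VW_i^{V}), \qquad
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}
$$

Here $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$ are three learnable weight matrices and $h$ is the number of heads. Attention is computed for all heads in parallel; the head outputs are concatenated and passed through a linear transform with learnable weight matrix $W^{O}$ to give the final output. Figure 7 shows the matrix operations in detail.

Figure 7: Detailed matrix operations of attention [22]

The Transformer input sequence consists of 𝑙 tokens, each represented by a 𝑑-dimensional vector, forming a 𝑑×𝑙 matrix. The MHA module first multiplies the sequence by three different weight matrices $W^{Q}$, $W^{K}$, $W^{V}$ to produce the query, key, and value activations. These are split into ℎ heads, each with hidden dimension 𝑑/ℎ. Each chunk goes to one of the ℎ attention heads, where the query and key chunks are multiplied along the hidden dimension to give an 𝑙×𝑙 activation matrix. This matrix passes through Softmax (producing the attention scores) and is multiplied by the value chunk, yielding an activation of hidden dimension 𝑑/ℎ. The activations of all heads are then concatenated along the hidden dimension into an activation of dimension 𝑑 and projected to the same dimension by a final linear layer with weight matrix $W^{O}$.

Overall, the MHA module contains six linear operations. Four are matrix multiplications that apply weights to activations (the $W^{Q}$, $W^{K}$, $W^{V}$, and $W^{O}$ projections); the other two are multiplications between activations (Q×K, and attention scores ×V). The former are called projection matmuls and the latter activation-to-activation (act-to-act) matmuls, because they behave differently at run time. Self-attention maps well onto parallel hardware such as GPUs and TPUs, which is what makes efficient training at large parameter counts possible.

To preserve the autoregressive property, attention also needs a causal mask [11]: when generating the next token, the model may only consider the current token and those before it, which keeps generation ordered and consistent. During training, the causal mask removes the attention weights on future tokens, so each position can only attend to tokens already produced. Training therefore matches inference, where future tokens do not yet exist, and the model cannot "peek at the future". For a sequence of length L, the L×L causal mask matrix is

$$
M_{ij} =
\begin{cases}
0, & j \le i \\
-\infty, & j > i
\end{cases}
\qquad\text{i.e.}\qquad
M =
\begin{pmatrix}
0 & -\infty & \cdots & -\infty \\
0 & 0 & \cdots & -\infty \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 0
\end{pmatrix}
$$

MHA with the causal mask is then computed as:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}} + M\right)V
$$

Points to note in the implementation:

Qwen adds a bias to the Q, K, and V linear projections, whereas the MHA formulation in the original Transformer paper has no bias term, and neither does attention in the Llama series. My personal guess is that the Qwen team found the Q/K/V biases helpful for model expressiveness.
Qwen applies no dropout to the attention scores after softmax (dropout probability 0), whereas the original Transformer uses dropout with probability 0.1.
Increasing the number of heads does not increase the parameter count or the matrix-multiplication cost, because head_dim * num_heads = hidden_size and the hidden size is fixed.
In attn_weights = attn_weights + causal_mask, causal_mask is the (L, L) causal mask matrix: masked entries hold -inf (in the code, the smallest finite value of the dtype) and entries that take part in attention hold 0. After softmax, the masked attention scores are 0.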
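
The masked formula can be exercised on a tiny example. This is a single-head sketch under simple assumptions (no batch dimension, no projections, no dropout); `causal_attention` is an illustrative name, not part of the Qwen2 code.

```python
import math

import torch


def causal_attention(q, k, v):
    """softmax(q k^T / sqrt(d) + M) v for one head; q, k, v have shape (L, d)."""
    seq_len, d = q.shape
    scores = q @ k.T / math.sqrt(d)                       # (L, L)
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    mask = torch.zeros(seq_len, seq_len).masked_fill(future, float("-inf"))
    weights = torch.softmax(scores + mask, dim=-1)        # masked entries become 0
    return weights @ v, weights


torch.manual_seed(0)
q, k, v = torch.randn(3, 4, 8).unbind(0)                  # three (4, 8) tensors
out, weights = causal_attention(q, k, v)
print(weights)
```

Every row of `weights` sums to 1, the strict upper triangle is exactly 0, and the first output row equals `v[0]` because the first token can only attend to itself.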

Qwen2Attention：

class Qwen2Attention(nn.Module):
    """
    Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer
    and "Generating Long Sequences with Sparse Transformers".
    """

    def __init__(self, config: Qwen2Config, layer_idx: Optional[int] = None):
        super().__init__()
        self.config = config
        self.layer_idx = layer_idx
        if layer_idx is None:
            logger.warning_once(
                f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will "
                "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` "
                "when creating this class."
            )

        self.hidden_size = config.hidden_size
        self.num_heads = config.num_attention_heads
        self.head_dim = self.hidden_size // self.num_heads
        self.num_key_value_heads = config.num_key_value_heads
        self.num_key_value_groups = self.num_heads // self.num_key_value_heads
        self.rope_theta = config.rope_theta
        self.is_causal = True
        self.attention_dropout = config.attention_dropout

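        '''
            layer_idx: index of this decoder layer; the KV cache uses it to locate this layer's entries (required when caching); Qwen2Config default num_hidden_layers=32
            hidden_size: token embedding size, Qwen2Config default hidden_size=4096
            num_heads: number of attention heads, Qwen2Config default num_attention_heads=32
            num_key_value_heads: number of key/value heads, used for grouped-query attention
            attention_dropout: not used by Qwen (0.0), whereas the original Transformer uses 0.1
            rope_theta: Qwen2.5 sets rope_theta=1000000.0; the common default is 10000.0
        '''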

        if (self.head_dim * self.num_heads) != self.hidden_size:
            raise ValueError(
                f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
                f" and `num_heads`: {self.num_heads})."
            )

        # Four projection matmuls (weights applied to activations)
        # Note: Qwen adds a bias to the Q, K and V projections; the MHA formulation in the original paper has no bias term
        self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True)
        self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True)
        self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True)
        self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)

        # 旋转位置编码 RoPE
        self.rotary_emb = Qwen2RotaryEmbedding(config=self.config)

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_value: Optional[Cache] = None,
        output_attentions: bool = False,
        use_cache: bool = False,
        cache_position: Optional[torch.LongTensor] = None,
        position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,  # will become mandatory in v4.46
    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
        '''
            hidden_states: (B, seq_len, hidden_size), the input tensor x
        '''

        bsz, q_len, _ = hidden_states.size()

        # Self-attention projections: q is (B, seq_len, num_heads * head_dim); k and v are (B, seq_len, num_key_value_heads * head_dim)
        # Note: num_heads * head_dim == hidden_size, so the query keeps the input width, which simplifies the multi-head reshapes below
        query_states = self.q_proj(hidden_states)
        key_states = self.k_proj(hidden_states)
        value_states = self.v_proj(hidden_states)

        # reshape: (B, seq_len, heads * head_dim) -> (B, seq_len, heads, head_dim) -> (B, heads, seq_len, head_dim)
        # Classic MHA (Multi-Head Attention): q/k/v have the same number of heads, i.e. num_heads == num_key_value_heads
        # MQA (Multi-Query Attention) and GQA (Grouped-Query Attention): k/v usually have fewer heads than q, which saves memory
        # Keys and values may have fewer heads than queries, but keys and values must have the same number of heads as each other
        query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
        key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
        value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)

        if position_embeddings is None:
            logger.warning_once(
                "The attention layers in this model are transitioning from computing the RoPE embeddings internally "
                "through `position_ids` (2D tensor with the indexes of the tokens), to using externally computed "
                "`position_embeddings` (Tuple of tensors, containing cos and sin). In v4.46 `position_ids` will be "
                "removed and `position_embeddings` will be mandatory."
            )
            # RoPE: compute cos / sin of the rotation angles; the first argument (value_states) only supplies dtype and device
            cos, sin = self.rotary_emb(value_states, position_ids)
        else:
            cos, sin = position_embeddings

        # RoPE: inject position information into the query and key matrices
        # Note: the value matrix carries no positional encoding (not needed); this is the mainstream practice in LLMs
        # query: (B, num_heads, seq_len, head_dim) -> (B, num_heads, seq_len, head_dim)
        # key: (B, num_key_value_heads, seq_len, head_dim) -> (B, num_key_value_heads, seq_len, head_dim)
        query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)

        # KV cache, used at inference time; training processes the whole sequence in parallel and needs no KV cache
        if past_key_value is not None:
            cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}  # Specific to RoPE models
            key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)

        # repeat k/v heads if n_kv_heads < n_heads
        # num_key_value_groups = self.num_heads // self.num_key_value_heads
        # key / value: (B, num_key_value_heads, seq_len, head_dim) -> (B, num_heads, seq_len, head_dim), so q, k and v have the same number of heads in the matmuls
        key_states = repeat_kv(key_states, self.num_key_value_groups)
        value_states = repeat_kv(value_states, self.num_key_value_groups)

        # Attention: softmax(Q @ K.T / sqrt(head_dim)) @ V
        # (B, num_heads, seq_len, head_dim) @ (B, num_heads, head_dim, seq_len) -> (B, num_heads, seq_len, seq_len)
        # attention_mask: causal mask + padding mask, broadcastable to attn_weights; masked entries hold a very large negative value, the rest hold 0
        attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
        if attention_mask is not None:  # no matter the length, we just slice it
            causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
            attn_weights = attn_weights + causal_mask   # (B, num_heads, seq_len, seq_len)

        # upcast attention to fp32
        attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
        attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training)
        # (B, num_heads, seq_len, seq_len) @ (B, num_heads, seq_len, head_dim)-> (B, num_heads, seq_len, head_dim)
        attn_output = torch.matmul(attn_weights, value_states)

        if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
            raise ValueError(
                f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
                f" {attn_output.size()}"
            )

        # Merge the heads: (B, num_heads, seq_len, head_dim) -> (B, seq_len, num_heads, head_dim) -> (B, seq_len, num_heads * head_dim)
        attn_output = attn_output.transpose(1, 2).contiguous()
        attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)

        # Output projection of self-attention, without bias; (B, seq_len, num_heads * head_dim) == (B, seq_len, hidden_size)
        attn_output = self.o_proj(attn_output)

        if not output_attentions:
            attn_weights = None

        return attn_output, attn_weights, past_key_value

Causal mask (causal_mask) module:

def _make_causal_mask(
        input_ids_shape: torch.Size,
        dtype: torch.dtype,
        device: torch.device,
        past_key_values_length: int = 0,
        sliding_window: Optional[int] = None,
    ):
        """
        Make causal mask used for uni-directional (causal) self-attention.
        input_ids_shape: shape of the input ids, usually (batch_size, sequence_length)
        dtype: data type of the mask
        past_key_values_length: length of the cached past keys/values, default 0
        """
        bsz, tgt_len = input_ids_shape

        # Create an L * L mask filled with the smallest finite float of dtype (torch.float16 or torch.float32)
        # torch.finfo() returns the numeric properties of a floating-point type, such as max, min and precision
        mask = torch.full((tgt_len, tgt_len), torch.finfo(dtype).min, device=device)
        mask_cond = torch.arange(mask.size(-1), device=device)

        '''
            mask_cond < (mask_cond + 1).view(mask.size(-1), 1): builds a boolean matrix that is True on the lower triangle (including the diagonal) and False on the upper triangle
            (tgt_len,) < (tgt_len,1) broadcast  [[0, 1, 2, 3],   <    [[1, 1, 1, 1],
                                                [0, 1, 2, 3],         [2, 2, 2, 2],
                                                [0, 1, 2, 3],         [3, 3, 3, 3],
                                                [0, 1, 2, 3]]         [4, 4, 4, 4]]
            masked_fill_ works in place on mask: positions that are True are set to 0, all others keep torch.finfo(dtype).min (standing in for -inf)
        '''
        mask.masked_fill_(mask_cond < (mask_cond + 1).view(mask.size(-1), 1), 0)
        mask = mask.to(dtype)

        '''
            An alternative construction of the causal mask with torch.triu:
            # mask = torch.ones(tgt_len,tgt_len) * float('-inf')
            mask = torch.ones(tgt_len,tgt_len) * torch.finfo(torch.float32).min
            # torch.triu keeps the strict upper triangle and sets the diagonal and lower triangle to 0
            mask = torch.triu(mask, diagonal=1)
        '''

        if past_key_values_length > 0:
            mask = torch.cat([torch.zeros(tgt_len, past_key_values_length, dtype=dtype, device=device), mask], dim=-1)

        # add lower triangular sliding window mask if necessary
        if sliding_window is not None:
            diagonal = past_key_values_length - sliding_window - 1

            context_mask = torch.tril(torch.ones_like(mask, dtype=torch.bool), diagonal=diagonal)
            mask.masked_fill_(context_mask, torch.finfo(dtype).min)

        return mask[None, None, :, :].expand(bsz, 1, tgt_len, tgt_len + past_key_values_length)

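3.4.2 MQA & GQA

Multi-Query Attention (MQA) [23] is an optimized attention mechanism that differs notably from standard multi-head attention (MHA). MQA shares the Keys and Values across all heads and keeps only the Queries head-specific. It works as follows:

Shared Keys and Values: all attention heads in MQA use the same Keys and Values. However many heads there are, only one set of Keys and Values has to be computed and cached, which greatly reduces the KV-cache size and the memory traffic during decoding.
Independent query weights: although Keys and Values are shared, each query head still computes its own attention weights. This preserves diversity and flexibility in how the model reads the input.

The approach aims to improve efficiency and reduce memory requirements for large models in practice, with only a small impact on model quality.

To combine the efficiency of MQA [23] with the quality of MHA [11], researchers proposed Grouped-Query Attention (GQA) [24]. GQA divides the query heads into groups, and the heads within a group share one key head and one value head, that is, the same K/V projections. This reduces the K/V computation and cache accordingly. By balancing efficiency and quality, GQA has become an important optimization of modern attention: it keeps most of the expressiveness of the model while improving speed and resource usage. GQA was adopted in Llama2 and is used in Qwen1.5 and later versions. Figure 8 illustrates the differences among MHA, MQA, and GQA.

Figure 8: Simplified illustration of multi-head attention (MHA), multi-query attention (MQA), and grouped-query attention (GQA) [24]

MHA, MQA, and GQA differ little in code. MQA and GQA replicate the Keys and Values so that their number of heads matches the Queries. The `repeat_kv()` function was added for this purpose: it repeats each key/value head so that every query head can access its corresponding keys and values when attention is computed. Points to note in the implementation:

.expand() adds a dimension as a view (stride 0) and produces the new shape without copying data, so the expanded tensor itself takes no extra memory.

.repeat() physically copies the tensor data and takes extra memory. Note, however, that the `.reshape()` that follows `.expand()` in `repeat_kv()` cannot be expressed as a view, so it materializes the repeated tensor, just as `torch.repeat_interleave` would. The memory savings of MQA and GQA come from the smaller K/V projections and from the KV cache, which stores only num_key_value_heads heads, not from the replication step.

MQA/GQA KV replication: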

# Copied from transformers.models.llama.modeling_llama.repeat_kv
def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    """
    This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
    num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
    """
    '''
        hidden_states: the attention key or value tensor
        n_rep: each key/value head is repeated n_rep times so that the number of heads matches the query
        Used for MQA (Multi-Query Attention) and GQA (Grouped-Query Attention), which reduce KV memory
    '''
    batch, num_key_value_heads, slen, head_dim = hidden_states.shape
    # Standard MHA: q, k and v have the same number of heads
    if n_rep == 1:
        return hidden_states

    # MQA / GQA case: several query heads share the same key head and value head
    # .expand "copies" each KV head n_rep times as a stride-0 view without allocating memory
    # num_key_value_heads * n_rep == num_heads, so q, k and v end up with the same number of heads
    # Note: the reshape below cannot stay a view and materializes the repeated tensor; the memory
    # saving of MQA/GQA comes from the smaller K/V projections and the smaller KV cache
    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
    return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
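
The behavior of `expand` followed by `reshape` can be verified directly. The sketch below repeats the function in a self-contained form and checks three things: the result equals `torch.repeat_interleave` along the head dimension, the intermediate expanded view shares storage with the input, and the final reshaped tensor does not.

```python
import torch


def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    batch, num_kv_heads, slen, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    expanded = hidden_states[:, :, None, :, :].expand(batch, num_kv_heads, n_rep, slen, head_dim)
    return expanded.reshape(batch, num_kv_heads * n_rep, slen, head_dim)


kv = torch.randn(2, 2, 5, 4)                                  # 2 key/value heads
expanded_view = kv[:, :, None, :, :].expand(2, 2, 4, 5, 4)    # view, stride 0 on dim 2
out = repeat_kv(kv, n_rep=4)                                  # matches 8 query heads

print(expanded_view.stride())                  # the repeated dimension has stride 0
print(out.data_ptr() == kv.data_ptr())         # False: reshape copied the data
```

In other words, the replication step itself is not where GQA saves memory; kernels such as FlashAttention or PyTorch's fused attention can avoid even this copy by handling the head grouping internally.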

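3.4.3 FlashAttention

Self-attention is the core of the Transformer, but it is expensive, especially on long sequences. Its computational complexity is $O(n^2 d)$, where n is the sequence length and d the embedding dimension. Computing self-attention efficiently has therefore become an important research direction.

FlashAttention [25-27] is an efficient algorithm for exact self-attention. Through memory-aware tiling, block-wise computation, parallelism across tasks, and kernel fusion, it produces the same result as standard attention while sharply reducing reads and writes to GPU high-bandwidth memory and avoiding materialization of the full n×n attention matrix; the number of arithmetic operations stays on the same order. This makes it well suited to LLM training and inference. The underlying algorithm is a block-wise computation of the attention softmax, an extension of Online softmax [28, 29].

Overall, FlashAttention accelerates the attention softmax computation and can be combined with MHA, MQA, or GQA; the Qwen2.5 series uses GQA together with FlashAttention.

The block-wise softmax relies mainly on a rescaling trick for exponentials. Beyond that, the implementation involves details of GPU architecture and scheduling.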
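
The rescaling trick can be shown in a few lines of plain Python. This is a scalar sketch of the online-softmax recurrence only, with no GPU tiling; `online_softmax_weighted_sum` and `reference` are illustrative names. It accumulates the softmax-weighted sum of values block by block and rescales the running statistics whenever a larger maximum appears.

```python
import math


def online_softmax_weighted_sum(scores, values, block_size=2):
    """Compute sum_i softmax(scores)_i * values_i one block at a time."""
    running_max = float("-inf")
    denom = 0.0   # running sum of exp(score - running_max)
    acc = 0.0     # running sum of exp(score - running_max) * value
    for start in range(0, len(scores), block_size):
        s_blk = scores[start:start + block_size]
        v_blk = values[start:start + block_size]
        new_max = max(running_max, max(s_blk))
        scale = math.exp(running_max - new_max)   # exp(-inf) = 0.0 on the first block
        denom = denom * scale + sum(math.exp(s - new_max) for s in s_blk)
        acc = acc * scale + sum(math.exp(s - new_max) * v for s, v in zip(s_blk, v_blk))
        running_max = new_max
    return acc / denom


def reference(scores, values):
    """Standard softmax over the full score vector, for comparison."""
    m = max(scores)
    e = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(e, values)) / sum(e)


scores = [1.0, 3.0, 2.0, 5.0, 0.5]
values = [10.0, 20.0, 30.0, 40.0, 50.0]
blockwise = online_softmax_weighted_sum(scores, values)
full = reference(scores, values)
print(blockwise, full)
```

The two results agree, which is why attention can be computed tile by tile without ever storing the full row of scores.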

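3.5 Feed Forward Network (FFN)

To learn complex functions and features, the Transformer includes a feed-forward network layer (Feed Forward Network, FFN). Given an input $ X $, the original Transformer FFN consists of two linear transforms and a nonlinear activation. The activation introduces a nonlinear mapping that increases expressiveness and helps capture complex interactions:

$$
\mathrm{FFN}(X) = \sigma(XW_1 + b_1)\,W_2 + b_2
$$

Qwen2 instead uses a gated variant (SwiGLU) with three bias-free linear transforms, matching the code below:

$$
\mathrm{FFN}(X) = \big(\mathrm{SiLU}(XW_{gate}) \odot XW_{up}\big)\,W_{down}
$$

Qwen2 FFN layer: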

# Copied from transformers.models.mistral.modeling_mistral.MistralMLP with Mistral->Qwen2
class Qwen2MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.hidden_size = config.hidden_size
        self.intermediate_size = config.intermediate_size
        self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)

        '''
            Qwen and Llama use very similar settings, intermediate_size ≈ 2.7 * hidden_size, which may be a generally effective choice for LLMs
            GPT-style FFN has two linear maps w1, w2 with 2dl = 2d*(4d) = 8d^2 parameters, i.e. the hidden width l defaults to 4 * hidden_size;
            Qwen / Llama FFN has three linear maps w1, w2, w3 with 3dl parameters; matching the GPT count gives 3dl = 8d^2;
            hence intermediate_size = 8/3 * hidden_size ≈ 2.7 * hidden_size
            Note: none of the FFN linear maps has a bias
        '''
        # Up-projection to intermediate_size; BERT, T5 and GPT-2 set intermediate_size = 4 * hidden_size
        self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        # Down-projection back to hidden_size (same as the input dimension)
        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)

        # Activation inside the gate; config.hidden_act is "silu" for Qwen2, and SiLU plus the gating in forward() forms SwiGLU
        self.act_fn = ACT2FN[config.hidden_act]

    def forward(self, hidden_state):
        '''
            hidden_state: (B, seq_len, hidden_size)
            SwiGLU (Swish-Gated Linear Unit): down_proj(SiLU(gate_proj(x)) * up_proj(x))
        '''
        return self.down_proj(self.act_fn(self.gate_proj(hidden_state)) * self.up_proj(hidden_state))
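
The 8/3 sizing argument in the comment above can be checked with concrete numbers. The sketch below assumes d = 768 (chosen only because 8d/3 is then an integer) and uses a functional `swiglu_ffn` helper, which is an illustrative name and not part of the Qwen2 code.

```python
import torch
import torch.nn.functional as F


def swiglu_ffn(x, w_gate, w_up, w_down):
    """Gated FFN without biases: (SiLU(x W_gate) * (x W_up)) W_down."""
    return (F.silu(x @ w_gate) * (x @ w_up)) @ w_down


d = 768
gpt_params = 2 * d * (4 * d)          # two matrices, hidden width 4d
gated_width = 8 * d // 3              # 2048 for d = 768
gated_params = 3 * d * gated_width    # three matrices, hidden width 8d/3

torch.manual_seed(0)
x = torch.randn(2, 5, d)
y = swiglu_ffn(
    x,
    torch.randn(d, gated_width) * 0.02,
    torch.randn(d, gated_width) * 0.02,
    torch.randn(gated_width, d) * 0.02,
)
print(gpt_params, gated_params, y.shape)
```

Both layouts come to the same parameter count here, so the gated FFN gains its third matrix at no extra parameter cost when the hidden width is 8d/3.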

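4. Qwen2.5 Pre-training Process

4.1 Unsupervised Pre-training

In large-scale pre-training, language modeling is the task used by most current LLMs. Its core is "predict the next token", and it is the usual objective for decoder-based language models such as GPT, PaLM, Llama, and Qwen. Although next-token prediction looks simple, with enough pre-training data LLMs learn the regularities and patterns of natural language effectively. Given an unlabeled corpus of tokens $\mathcal{U} = \{u_1, \ldots, u_n\}$, we use the standard language modeling objective and maximize the following likelihood [1]:

$$
L_1(\mathcal{U}) = \sum_{i} \log P(u_i \mid u_{i-k}, \ldots, u_{i-1};\, \Theta)
$$

where $k$ is the size of the context window and $\Theta$ denotes the model parameters.

LLM training usually uses Teacher Forcing [11]: the ground-truth token at the current step is fed as the input of the next step, instead of the token the model generated (predicted) at the previous step. The model therefore receives correct context at every step, which speeds up learning and improves convergence and generation quality. Teacher Forcing is widely used in sequence generation models such as RNNs, LSTMs, and Transformers. At inference time, however, the model depends entirely on its own outputs rather than on ground-truth tokens. This train-inference mismatch (often called exposure bias) can degrade performance at inference. To mitigate it, researchers proposed scheduled sampling, a curriculum-learning strategy: the share of model-generated tokens fed back as input is increased gradually, reducing reliance on the ground truth, so that training moves from easy to hard and the model adapts step by step to inference conditions and becomes more robust on complex tasks.

4.2 Knowledge Distillation

The public technical reports say little about how the small Qwen2.5 models (for example 1.5B and 7B) were trained, but a common approach is Knowledge Distillation. For example, the small Gemma2 [31] models that Google open-sourced in 2024 were trained with knowledge distillation; compared with training from scratch, distillation is reported to improve small-model performance by about 10%. Meta used knowledge distillation to transfer System2 knowledge into System1 [32]. Xiaohan Xu et al. [33] give a comprehensive survey of knowledge distillation for LLMs; interested readers can see the [github project](https://github.com/Tebmer/Awesome-Knowledge-Distillation-of-LLMs). Jianping Gou et al. [34] survey distillation techniques in general; LLMs commonly use logit distillation, shown in Figure 9. Knowledge distillation can fairly be called a tailor-made recipe for training large language models.

Figure 9: Logit-based knowledge distillation

Given a large model used as the teacher (for example Qwen2.5-72B), the smaller model is trained by distilling the probability that the teacher assigns to each token $x$ given its context $x_c$. The objective minimizes the negative log-likelihood (cross-entropy) between the teacher and student probabilities [31]:

$$
\min_{P_S} \sum_{x} -P_T(x \mid x_c)\, \log P_S(x \mid x_c)
$$

where $P_T$ and $P_S$ are the teacher and student distributions.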
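
As a rough illustration of this objective, the sketch below computes the teacher-student cross-entropy from two logit tensors with PyTorch. It is an assumption-laden minimal version: no temperature, no mixing with the hard-label loss, random tensors in place of real model outputs, and the name `distillation_loss` is hypothetical.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """Mean over positions of -sum_x P_T(x | x_c) * log P_S(x | x_c)."""
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()


torch.manual_seed(0)
teacher_logits = torch.randn(2, 3, 10)   # (batch, seq_len, vocab)
student_logits = torch.randn(2, 3, 10)

loss = distillation_loss(student_logits, teacher_logits)
lower_bound = distillation_loss(teacher_logits, teacher_logits)  # teacher entropy
print(loss.item(), lower_bound.item())
```

The loss is smallest when the student reproduces the teacher distribution exactly, in which case it equals the teacher's entropy; in real training the teacher logits would be computed without gradients.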

4.3 模型参数量计算

针对 Qwen2.5 系列，假设词表大小为 𝑉，模型包含 𝐿 层解码器，token embedding 编码维度大小为 𝐻，前馈网络层的中间状态维度大小为 𝐻′ 。

Embedding 层. 将词表中每个 token 映射到 𝐻 维向量，因此输入编码层参数量 .
self-attention. 假设采用传统的多头注意力（MHA），，因此一层参数量，L 层参数量. 假设采用 GQA，, 总参数量要少于 MHA 参数量。
FFN. 由三个线性变换组成，如 3.5 部分介绍，一层 FFN 参数量，L 层参数量 .
LN. 每一层解码器还包含两个 RMSNorm 操作, L 层参数量；另外，接近输出层还有一次 LN 操作，参数量 .
输出层. 线性变换，将解码器输出映射到词表大小 𝑉 的维度上，使用 softmax 归一化后预测下一个 token 概率分布，参数量 .

综上所述，Qwen2.5 系列基于传统 MHA 模型总参数量：

而基于 GQA 的模型总参数量（该公式也适用于传统 MHA 模型参数量计算，kv_heads=attention_heads）：

以 Qwen2.5-72B 模型为例，解码器模型结构设置：解码器层数 L=80，词表大小 V=152064，embedding 编码维度 H (hidden_size)=8192，𝐻′(intermediate_size)=29568，num_attention_heads=64，num_key_value_heads=8。基GQA 模型总参数量=72705384448，即 72.7B 与公布的模型参数量一致。假如采用传统的 MHA，参数量为82B，即 GQA vs MHA 减少了12%参数量。

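It is also worth noting that because Qwen2.5 adopts GQA, the attention layers have fewer parameters; to keep the overall parameter count stable, the FFN intermediate dimension 𝐻′ is enlarged. For the Qwen2.5-72B model above, H'/H = 3.6 > 2.7 (2.7 being roughly the 8/3 ratio typical of SwiGLU-based FFNs).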

Figure 10: Qwen2.5 series model configurations

4.4 Qwen2ForCausalLM

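Building on the sub-module algorithms introduced in Section 3, it is straightforward to assemble them into a complete LLM:

Qwen2DecoderLayer: a single decoder layer, containing attention, FFN, LN, and residual connections;
Qwen2Model: the token Embedding layer plus the stack of decoder layers;
Qwen2ForCausalLM: the autoregressive (causal) LLM, including the logits and cross-entropy loss computation.

This part of the code is not complicated. When reading the Qwen2ForCausalLM class, note the following:

Autoregression: under the Teacher Forcing strategy, the logits drop their last position (logits[..., :-1, :]) and the labels drop their first token (labels[..., 1:]), so that the prediction at each position lines up with the token that follows it (the counterpart of "Output Shifted Right").

Output shifted right: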

# Note: the logits drop their last position
# (batch_size, sequence_length, vocab_size) -> (batch_size, sequence_length-1, vocab_size)
shift_logits = logits[..., :-1, :].contiguous()

# Note: the labels drop their first token
# The [..., 1:] slice removes the first element of the label sequence, so that the prediction at each position is aligned with the actual label at the next position
# (batch_size, sequence_length) -> (batch_size, sequence_length-1)
shift_labels = labels[..., 1:].contiguous()

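CrossEntropyLoss: cross-entropy loss. The original Transformer paper introduced label smoothing [11] as a regularizer, and with smoothed targets the loss is commonly implemented as a KL divergence loss.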
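The shift-then-cross-entropy step can be checked on plain Python lists. This is a simplified single-sequence sketch; `shifted_cross_entropy` is a hypothetical helper, and the value -100 mirrors the default `ignore_index` of PyTorch's `CrossEntropyLoss`. It assumes at least one label is not ignored.

```python
import math

IGNORE_INDEX = -100  # labels with this value are skipped by the loss

def shifted_cross_entropy(logits, labels):
    """logits: [seq_len][vocab] floats; labels: [seq_len] ints."""
    shift_logits = logits[:-1]  # predictions made at positions 0 .. n-2
    shift_labels = labels[1:]   # targets are the tokens at positions 1 .. n-1
    losses = []
    for row, target in zip(shift_logits, shift_labels):
        if target == IGNORE_INDEX:
            continue
        # log-sum-exp with the max subtracted for numerical stability
        m = max(row)
        log_z = m + math.log(sum(math.exp(z - m) for z in row))
        losses.append(log_z - row[target])
    return sum(losses) / len(losses)

# Position t puts almost all of its mass on token t+1, so the loss is near zero
aligned_logits = [[0.0, 10.0, 0.0], [0.0, 0.0, 10.0], [10.0, 0.0, 0.0]]
aligned_loss = shifted_cross_entropy(aligned_logits, [0, 1, 2])

# Uniform logits over a vocabulary of 4 give log(4) per token
uniform_loss = shifted_cross_entropy([[0.0] * 4] * 3, [0, 1, 2])
```

The last row of the logits is never scored, because there is no following token to compare it with.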

A single decoder layer:

# A single decoder layer module
class Qwen2DecoderLayer(nn.Module):
    def __init__(self, config: Qwen2Config, layer_idx: int):
        super().__init__()
        self.hidden_size = config.hidden_size

        if config.sliding_window and config._attn_implementation != "flash_attention_2":
            logger.warning_once(
                f"Sliding Window Attention is enabled but not implemented for `{config._attn_implementation}`; "
                "unexpected results may be encountered."
            )

        # One of three candidate attention implementations, selected by config._attn_implementation
        self.self_attn = QWEN2_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx)
        # FFN
        self.mlp = Qwen2MLP(config)
        # LN
        self.input_layernorm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        self.post_attention_layernorm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_value: Optional[Tuple[torch.Tensor]] = None,
        output_attentions: Optional[bool] = False,
        use_cache: Optional[bool] = False,
        cache_position: Optional[torch.LongTensor] = None,
        position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,  # will become mandatory in v4.46
        **kwargs,
    ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
        """
        Args:
            hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
            attention_mask (`torch.FloatTensor`, *optional*): attention mask of size
                `(batch, sequence_length)` where padding elements are indicated by 0.
            output_attentions (`bool`, *optional*):
                Whether or not to return the attentions tensors of all attention layers. See `attentions` under
                returned tensors for more detail.
            use_cache (`bool`, *optional*):
                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
                (see `past_key_values`).
            past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
            cache_position (`torch.LongTensor` of shape `(sequence_length)`, *optional*):
                Indices depicting the position of the input sequence tokens in the sequence.
            position_embeddings (`Tuple[torch.FloatTensor, torch.FloatTensor]`, *optional*):
                Tuple containing the cosine and sine positional embeddings of shape `(batch_size, seq_len, head_dim)`,
                with `head_dim` being the embedding dimension of each attention head.
            kwargs (`dict`, *optional*):
                Arbitrary kwargs to be ignored, used for FSDP and other methods that injects code
                into the model
        """
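        '''
            Transformer decoder-only layer: self-attention plus an FFN, with residual connections and layer normalization to stabilize training.
            Decoder-only LLMs such as the Qwen and LLaMA series work in much the same way.
            Normalization is applied first (Pre-LN); this is the scheme used by the Qwen and LLaMA series.
            The original Transformer paper uses Post-Layer Normalization (Post-LN), where LN comes after the residual connection.
            For a comparison of Pre-LN and Post-LN, see the paper "On Layer Normalization in the Transformer Architecture".
            Pre-LN trains more stably and converges faster, so it is the mainstream choice today; the Qwen and LLaMA series both use Pre-LN.
        '''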

        residual = hidden_states
        # Apply layer normalization first (Pre-LN)
        hidden_states = self.input_layernorm(hidden_states)

        # Self-attention, using Grouped-Query Attention (GQA)
        hidden_states, self_attn_weights, present_key_value = self.self_attn(
            hidden_states=hidden_states,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_value=past_key_value,
            output_attentions=output_attentions,
            use_cache=use_cache,
            cache_position=cache_position,
            position_embeddings=position_embeddings,
        )
        # Residual connection: x + sublayer(x)
        hidden_states = residual + hidden_states

        # Fully Connected  FFN
        residual = hidden_states
        hidden_states = self.post_attention_layernorm(hidden_states)
        hidden_states = self.mlp(hidden_states)
        hidden_states = residual + hidden_states

        outputs = (hidden_states,)

        if output_attentions:
            outputs += (self_attn_weights,)

        if use_cache:
            outputs += (present_key_value,)

        return outputs
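The Pre-LN ordering used by the decoder layer above differs from Post-LN only in where the normalization sits relative to the residual connection. The sketch below isolates that placement on plain lists. It is illustrative only: `rms_norm` here has no learned gain, and RMSNorm is used for both variants to keep the comparison minimal, whereas the original Transformer uses LayerNorm.

```python
import math

def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned gain: x / sqrt(mean(x^2) + eps)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def pre_ln_block(x, sublayer):
    """Pre-LN: x + sublayer(norm(x)); the residual path is left untouched."""
    out = sublayer(rms_norm(x))
    return [a + b for a, b in zip(x, out)]

def post_ln_block(x, sublayer):
    """Post-LN: norm(x + sublayer(x)); the norm sits on the residual path."""
    out = sublayer(x)
    return rms_norm([a + b for a, b in zip(x, out)])

zero_sublayer = lambda h: [0.0] * len(h)  # stand-in for attention or FFN
x = [3.0, 4.0]
pre = pre_ln_block(x, zero_sublayer)
post = post_ln_block(x, zero_sublayer)
```

With a sublayer that outputs zeros, Pre-LN passes the input through unchanged, while Post-LN still rescales it. That identity path through the residual stream is one intuition for why Pre-LN is easier to optimize in deep stacks.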

The model body (it outputs hidden states; the logits are computed from them in Qwen2ForCausalLM):

@add_start_docstrings(
    "The bare Qwen2 Model outputting raw hidden-states without any specific head on top.",
    QWEN2_START_DOCSTRING,
)
class Qwen2Model(Qwen2PreTrainedModel):
    """
    Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`Qwen2DecoderLayer`]
    Builds the complete Qwen2 pre-trained model body; the result is a BaseModelOutputWithPast dataclass, which stores and returns the model outputs.
    Args:
        config: Qwen2Config
    """

    def __init__(self, config: Qwen2Config):
        super().__init__(config)
        # Index of the padding token
        self.padding_idx = config.pad_token_id
        # Vocabulary size (tokens) = 152064
        self.vocab_size = config.vocab_size
        # Token embedding layer: maps every token in the vocabulary to a hidden_size-dimensional vector
        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)

        # Stack of num_hidden_layers decoder layers; each layer is a separate Qwen2DecoderLayer instance
        # Note: nn.ModuleList stores nn.Module instances and registers them with the parent module; it can be iterated and indexed like a regular list
        self.layers = nn.ModuleList(
            [Qwen2DecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
        )
        # Attention backend selected in the config (eager, sdpa, or flash_attention_2)
        self._attn_implementation = config._attn_implementation
        # RMSNorm
        self.norm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        # Rotary position embedding (RoPE)
        self.rotary_emb = Qwen2RotaryEmbedding(config=config)

        self.gradient_checkpointing = False
        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self):
        return self.embed_tokens

    def set_input_embeddings(self, value):
        self.embed_tokens = value

    @add_start_docstrings_to_model_forward(QWEN2_INPUTS_DOCSTRING)
    def forward(
        self,
        input_ids: torch.LongTensor = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_values: Optional[List[torch.FloatTensor]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        cache_position: Optional[torch.LongTensor] = None,
    ) -> Union[Tuple, BaseModelOutputWithPast]:

        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        use_cache = use_cache if use_cache is not None else self.config.use_cache

        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        if (input_ids is None) ^ (inputs_embeds is not None):
            raise ValueError(
                "You cannot specify both input_ids and inputs_embeds at the same time, and must specify either one"
            )

        # If gradient checkpointing is enabled in training mode, disable the cache; the KV cache is only used to speed up inference
        if self.gradient_checkpointing and self.training:
            if use_cache:
                logger.warning_once(
                    "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
                )
                use_cache = False

        # kept for BC (non `Cache` `past_key_values` inputs)
        # Cache handling: if a legacy-format cache is passed in, emit a warning and convert it to the new format
        return_legacy_cache = False
        if use_cache and not isinstance(past_key_values, Cache):
            return_legacy_cache = True
            if past_key_values is None:
                past_key_values = DynamicCache()
            else:
                past_key_values = DynamicCache.from_legacy_cache(past_key_values)
                logger.warning_once(
                    "We detected that you are passing `past_key_values` as a tuple of tuples. This is deprecated and "
                    "will be removed in v4.47. Please convert your cache or use an appropriate `Cache` class "
                    "(https://huggingface.co/docs/transformers/kv_cache#legacy-cache-format)"
                )
        # Build token embeddings from input_ids
        if inputs_embeds is None:
            inputs_embeds = self.embed_tokens(input_ids)

        # Set cache_position and position_ids
        if cache_position is None:
            past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
            cache_position = torch.arange(
                past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1], device=inputs_embeds.device
            )
        if position_ids is None:
            position_ids = cache_position.unsqueeze(0)

        # Update the causal mask
        causal_mask = self._update_causal_mask(
            attention_mask, inputs_embeds, cache_position, past_key_values, output_attentions
        )

        hidden_states = inputs_embeds

        # create position embeddings to be shared across the decoder layers
        # RoPE: compute the rotary position embeddings
        position_embeddings = self.rotary_emb(hidden_states, position_ids)

        # decoder layers
        all_hidden_states = () if output_hidden_states else None
        all_self_attns = () if output_attentions else None
        next_decoder_cache = None

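        for decoder_layer in self.layers:      # iterate over all decoder layers
            if output_hidden_states:           # if all hidden states are requested, append to the tuple
                all_hidden_states += (hidden_states,)

            # How the decoder layer is called:
            # with gradient checkpointing in training mode, call it through _gradient_checkpointing_func, which saves memory;
            # otherwise call the decoder layer directly.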
            if self.gradient_checkpointing and self.training:
                layer_outputs = self._gradient_checkpointing_func(
                    decoder_layer.__call__,
                    hidden_states,
                    causal_mask,
                    position_ids,
                    past_key_values,
                    output_attentions,
                    use_cache,
                    cache_position,
                    position_embeddings,
                )
            else:
                layer_outputs = decoder_layer(
                    hidden_states,
                    attention_mask=causal_mask,
                    position_ids=position_ids,
                    past_key_value=past_key_values,
                    output_attentions=output_attentions,
                    use_cache=use_cache,
                    cache_position=cache_position,
                    position_embeddings=position_embeddings,
                )
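            # Update hidden_states with this layer's output; layer_outputs is a tuple whose first element is hidden_states
            hidden_states = layer_outputs[0]

            # If caching is enabled, keep this layer's key/value states. The index depends on whether attention weights are returned
            if use_cache:
                next_decoder_cache = layer_outputs[2 if output_attentions else 1]

            # Collect the attention weights if requested
            if output_attentions:
                all_self_attns += (layer_outputs[1],)

        # After all decoder layers have run, apply the final LN (RMSNorm)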
        hidden_states = self.norm(hidden_states)

        # add hidden states from the last decoder layer
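        # Append the hidden states of the last decoder layer to all_hidden_states, so the final hidden states are included in the output
        if output_hidden_states:
            all_hidden_states += (hidden_states,)

        next_cache = next_decoder_cache if use_cache else None
        # If a legacy-format cache is expected, convert next_cache back to the legacy format (for backward compatibility)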
        if return_legacy_cache:
            next_cache = next_cache.to_legacy_cache()

        if not return_dict:
            return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)

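        # BaseModelOutputWithPast: a dataclass that stores and returns the model outputs.
        # It inherits from ModelOutput and is designed for models whose outputs may include past key/values (which can be used to speed up sequential decoding)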
        return BaseModelOutputWithPast(
            last_hidden_state=hidden_states,
            past_key_values=next_cache,
            hidden_states=all_hidden_states,
            attentions=all_self_attns,
        )

    # Copied from transformers.models.llama.modeling_llama.LlamaModel._update_causal_mask
    def _update_causal_mask(
        self,
        attention_mask: torch.Tensor,
        input_tensor: torch.Tensor,
        cache_position: torch.Tensor,
        past_key_values: Cache,
        output_attentions: bool,
    ):
        if self.config._attn_implementation == "flash_attention_2":
            if attention_mask is not None and 0.0 in attention_mask:
                return attention_mask
            return None

        # For SDPA, when possible, we will rely on its `is_causal` argument instead of its `attn_mask` argument, in
        # order to dispatch on Flash Attention 2. This feature is not compatible with static cache, as SDPA will fail
        # to infer the attention mask.
        past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
        using_static_cache = isinstance(past_key_values, StaticCache)

        # When output attentions is True, sdpa implementation's forward method calls the eager implementation's forward
        if self.config._attn_implementation == "sdpa" and not using_static_cache and not output_attentions:
            if AttentionMaskConverter._ignore_causal_mask_sdpa(
                attention_mask,
                inputs_embeds=input_tensor,
                past_key_values_length=past_seen_tokens,
                is_training=self.training,
            ):
                return None

        dtype, device = input_tensor.dtype, input_tensor.device
        min_dtype = torch.finfo(dtype).min
        sequence_length = input_tensor.shape[1]
        if using_static_cache:
            target_length = past_key_values.get_max_length()
        else:
            target_length = (
                attention_mask.shape[-1]
                if isinstance(attention_mask, torch.Tensor)
                else past_seen_tokens + sequence_length + 1
            )

        # In case the provided `attention` mask is 2D, we generate a causal mask here (4D).
        causal_mask = _prepare_4d_causal_attention_mask_with_cache_position(
            attention_mask,
            sequence_length=sequence_length,
            target_length=target_length,
            dtype=dtype,
            device=device,
            min_dtype=min_dtype,
            cache_position=cache_position,
            batch_size=input_tensor.shape[0],
        )

        if (
            self.config._attn_implementation == "sdpa"
            and attention_mask is not None
            and attention_mask.device.type == "cuda"
            and not output_attentions
        ):
            # Attend to all tokens in fully masked rows in the causal_mask, for example the relevant first rows when
            # using left padding. This is required by F.scaled_dot_product_attention memory-efficient attention path.
            # Details: https://github.com/pytorch/pytorch/issues/110213
            causal_mask = AttentionMaskConverter._unmask_unattended(causal_mask, min_dtype)

        return causal_mask
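The core of the causal mask built in `_update_causal_mask` can be sketched without tensors. This is a simplified illustration that ignores batching and padding; `build_causal_mask` is a hypothetical helper, and `min_value` plays the role of `torch.finfo(dtype).min` in the code above.

```python
def build_causal_mask(cache_position, target_length, min_value=float("-inf")):
    """Additive mask of shape [query_len][target_length].

    A query at absolute position p may attend to key positions 0 .. p;
    later key positions receive min_value so that softmax ignores them.
    """
    return [
        [0.0 if key_pos <= query_pos else min_value for key_pos in range(target_length)]
        for query_pos in cache_position
    ]

m = float("-inf")
# Prefill: three new tokens at positions 0, 1, 2 and no cache yet
prefill_mask = build_causal_mask([0, 1, 2], target_length=3)
# Decoding: one new token at position 3 attends to the three cached keys and itself
decode_mask = build_causal_mask([3], target_length=4)
```

During prefill the mask is lower triangular; during cached decoding it collapses to a single row with no masked entries, which is why the SDPA and FlashAttention paths can often skip building it.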

Qwen2ForCausalLM:

class Qwen2ForCausalLM(Qwen2PreTrainedModel):
    '''
        Used for the causal language model (Causal LM) pre-training task
    '''

    _tied_weights_keys = ["lm_head.weight"]

    def __init__(self, config):
        super().__init__(config)

        # Initialize the base Qwen2Model
        self.model = Qwen2Model(config)
        self.vocab_size = config.vocab_size

        # Create the language model head (lm_head): a linear layer that maps hidden states to vocabulary-sized outputs
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self):
        return self.model.embed_tokens

    def set_input_embeddings(self, value):
        self.model.embed_tokens = value

    def get_output_embeddings(self):
        return self.lm_head

    def set_output_embeddings(self, new_embeddings):
        self.lm_head = new_embeddings

    def set_decoder(self, decoder):
        self.model = decoder

    def get_decoder(self):
        return self.model

    @add_start_docstrings_to_model_forward(QWEN2_INPUTS_DOCSTRING)
    @replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
    def forward(
        self,
        input_ids: torch.LongTensor = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_values: Optional[List[torch.FloatTensor]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        labels: Optional[torch.LongTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        cache_position: Optional[torch.LongTensor] = None,
        num_logits_to_keep: int = 0,
    ) -> Union[Tuple, CausalLMOutputWithPast]:
        r"""
        Args:
            labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
                Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
                config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
                (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.

            num_logits_to_keep (`int`, *optional*):
                Calculate logits for the last `num_logits_to_keep` tokens. If `0`, calculate logits for all
                `input_ids` (special case). Only last token logits are needed for generation, and calculating them only for that
                token can save memory, which becomes pretty significant for long sequences or large vocabulary size.

        Returns:

        Example:

        ```python
        >>> from transformers import AutoTokenizer, Qwen2ForCausalLM

        >>> model = Qwen2ForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
        >>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)

        >>> prompt = "Hey, are you conscious? Can you talk to me?"
        >>> inputs = tokenizer(prompt, return_tensors="pt")

        >>> # Generate
        >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
        >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
        "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
        ```"""

        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
        # Call the base Qwen2Model to get its outputs
        outputs = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
            cache_position=cache_position,
        )

        # Take the last-layer hidden states
        hidden_states = outputs[0]
        if labels is None and not is_torchdynamo_compiling():
            logger.warning_once(
                "Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)"
            )
        # Only compute necessary logits, and do not upcast them to float if we are not computing the loss
        # TODO: remove the float() operation in v4.46
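        # Apply the language model head (lm_head) to the hidden states to produce the logits.
        # Note: only the logits of the last num_logits_to_keep tokens are computed (during generation this skips the prompt positions);
        # with the default value 0, the slice -0: keeps every position, which is what the training loss needs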
        logits = self.lm_head(hidden_states[:, -num_logits_to_keep:, :]).float()

        loss = None
        if labels is not None:
            # Upcast to float if we need to compute the loss to avoid potential precision issues
            logits = logits.float()
            # Shift so that tokens < n predict n: predict the next token in the sequence
            # Teacher Forcing: drop the last position of the logits and the first token of the labels

            # Note: the logits drop their last position
            # (batch_size, sequence_length, vocab_size) -> (batch_size, sequence_length-1, vocab_size)
            shift_logits = logits[..., :-1, :].contiguous()

            # Note: the labels drop their first token
            # The [..., 1:] slice removes the first element of the label sequence, so that the prediction at each position is aligned with the actual label at the next position
            # (batch_size, sequence_length) -> (batch_size, sequence_length-1)
            shift_labels = labels[..., 1:].contiguous()

            # Flatten the tokens
            loss_fct = CrossEntropyLoss()
            # Note: the reshape merges the batch and sequence dimensions, so each token prediction is treated as an independent sample
            shift_logits = shift_logits.view(-1, self.config.vocab_size)
            shift_labels = shift_labels.view(-1)
            # Enable model parallelism
            shift_labels = shift_labels.to(shift_logits.device)
            loss = loss_fct(shift_logits, shift_labels)     # cross-entropy loss

        if not return_dict:
            output = (logits,) + outputs[1:]
            return (loss,) + output if loss is not None else output

        return CausalLMOutputWithPast(
            loss=loss,
            logits=logits,
            past_key_values=outputs.past_key_values,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )

    # Copied from transformers.models.llama.modeling_llama.LlamaForCausalLM.prepare_inputs_for_generation
    def prepare_inputs_for_generation(
        self,
        input_ids,
        past_key_values=None,
        attention_mask=None,
        inputs_embeds=None,
        cache_position=None,
        position_ids=None,
        use_cache=True,
        num_logits_to_keep=None,
        **kwargs,
    ):
        # If we have cache: let's slice `input_ids` through `cache_position`, to keep only the unprocessed tokens
        # Exception 1: when passing input_embeds, input_ids may be missing entries
        # Exception 2: some generation methods do special slicing of input_ids, so we don't need to do it here
        if past_key_values is not None:
            if inputs_embeds is not None:  # Exception 1
                input_ids = input_ids[:, -cache_position.shape[0] :]
            elif input_ids.shape[1] != cache_position.shape[0]:  # Default case (the "else", a no op, is Exception 2)
                input_ids = input_ids[:, cache_position]

        if attention_mask is not None and position_ids is None:
            # create position_ids on the fly for batch generation
            position_ids = attention_mask.long().cumsum(-1) - 1
            position_ids.masked_fill_(attention_mask == 0, 1)
            if past_key_values:
                position_ids = position_ids[:, -input_ids.shape[1] :]

                # This `clone` call is needed to avoid recapturing cuda graphs with `torch.compile`'s  `mode="reduce-overhead`, as otherwise the input `position_ids` would have various stride during the decoding. Here, simply using `.contiguous()` is not sufficient as in the batch size = 1 case, `position_ids` is already contiguous but with varying stride which retriggers a capture.
                position_ids = position_ids.clone(memory_format=torch.contiguous_format)

        # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
        if inputs_embeds is not None and cache_position[0] == 0:
            model_inputs = {"inputs_embeds": inputs_embeds, "input_ids": None}
        else:
            # The clone here is for the same reason as for `position_ids`.
            model_inputs = {"input_ids": input_ids.clone(memory_format=torch.contiguous_format), "inputs_embeds": None}

        if isinstance(past_key_values, StaticCache) and attention_mask.ndim == 2:
            if model_inputs["inputs_embeds"] is not None:
                batch_size, sequence_length, _ = model_inputs["inputs_embeds"].shape
                device = model_inputs["inputs_embeds"].device
            else:
                batch_size, sequence_length = model_inputs["input_ids"].shape
                device = model_inputs["input_ids"].device

            dtype = self.lm_head.weight.dtype
            min_dtype = torch.finfo(dtype).min

            attention_mask = _prepare_4d_causal_attention_mask_with_cache_position(
                attention_mask,
                sequence_length=sequence_length,
                target_length=past_key_values.get_max_length(),
                dtype=dtype,
                device=device,
                min_dtype=min_dtype,
                cache_position=cache_position,
                batch_size=batch_size,
            )

        if num_logits_to_keep is not None:
            model_inputs["num_logits_to_keep"] = num_logits_to_keep

        model_inputs.update(
            {
                "position_ids": position_ids,
                "cache_position": cache_position,
                "past_key_values": past_key_values,
                "use_cache": use_cache,
                "attention_mask": attention_mask,
            }
        )
        return model_inputs
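Two small details of the generation path above are easy to misread: how `position_ids` are derived from a left-padded `attention_mask`, and why `num_logits_to_keep=0` keeps all logits. The sketch below mirrors both on plain lists; the helper names are hypothetical and only a single sequence is handled.

```python
def position_ids_from_mask(attention_mask):
    """Mirror of `attention_mask.cumsum(-1) - 1`, with padded slots then filled with 1."""
    position_ids = []
    running = 0
    for m in attention_mask:
        running += m
        position_ids.append(running - 1 if m == 1 else 1)
    return position_ids

def keep_last(hidden_states, num_logits_to_keep):
    """Mirror of `hidden_states[:, -num_logits_to_keep:, :]` along the sequence axis."""
    return hidden_states[-num_logits_to_keep:]

# Left padding: two pad tokens followed by three real tokens
pos = position_ids_from_mask([0, 0, 1, 1, 1])

states = ["h0", "h1", "h2"]
all_states = keep_last(states, 0)  # -0 equals 0, so the slice starts at the beginning
last_state = keep_last(states, 1)  # only the final position, as needed for generation
```

Real tokens get positions starting from 0 regardless of how much padding precedes them, and the value placed in padded slots does not matter because those slots are masked out of attention.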

5. Summary

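This article gives a systematic summary of the techniques used in the pre-training stage of today's large language models. Because AI is moving quickly, a large number of recent research papers, reports, and GitHub repositories [35-38] were consulted during writing; thanks are due to the authors of that prior work. Using the Qwen2.5 series as the running example, the article analyzes the core algorithms behind each functional module of the Transformer decoder architecture and walks through the corresponding code. Even so, given the limits of the author's own ability, the text may contain misunderstandings or misstatements, and corrections are welcome. For reasons of space, not every pre-training architecture detail is covered, for example DeepSeek-V3 [39], which is built on a Mixture-of-Experts (MoE) framework.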

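Beyond the Transformer architecture of the pre-training stage, the post-training stage also involves many key techniques, such as general instruction tuning, reinforcement learning from human feedback, safety, and agents. The recent progress of OpenAI o1 [4] in reasoning is striking; reasoning is an exciting research direction, and the author plans to study it in depth.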

References:

[1] Improving Language Understanding by Generative Pre-Training, 2018.

[2] Language Models are Unsupervised Multitask Learners, 2019.

[3] Training Language Models to Follow Instructions with Human Feedback, 2022.

[4] Learning to Reason with LLMs, https://openai.com/index/learning-to-reason-with-llms.

[5] LLaMA: Open and Efficient Foundation Language Models, 2023.

[6] Llama 2: Open Foundation and Fine-Tuned Chat Models, 2023.

[7] The Llama 3 Herd of Models, 2024.

[8] Qwen Technical Report, 2023.

[9] Qwen2 Technical Report, 2024.

[10] Qwen2.5-Coder Technical Report, 2024.

[11] Attention Is All You Need, 2017.

[12] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019.

[13] Scaling Instruction-Finetuned Language Models, 2022.

[14] GLM: General Language Model Pretraining with Autoregressive Blank Infilling, 2022.

[15] A Survey of Large Language Models, 2023.

[16] Keep the Cost Down: A Review on Methods to Optimize LLM’s KV Cache Consumption, 2024.

[17] Neural Machine Translation with Byte-Level Subwords, 2019.

[18] RoFormer: Enhanced Transformer with Rotary Position Embedding, 2021.

[19] Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015.

[20] On Layer Normalization in the Transformer Architecture, 2020.

[21] Fast Transformers with Clustered Attention, 2019.

[22] Full Stack Optimization of Transformer Inference: a Survey, 2023.

[23] Fast Transformer Decoding: One Write-Head is All You Need, 2019.

[24] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints, 2023.

[25] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness, 2022.

[26] FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning, 2023.

[27] FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision, 2024.

[28] Online normalizer calculation for softmax, 2018.

[29] From Online Softmax to FlashAttention, 2023.

[30] Swish: a Self-Gated Activation Function, 2017.

[31] Gemma 2: Improving Open Language Models at a Practical Size, 2024.

[32] Distilling System 2 into System 1, 2024.

[33] A Survey on Knowledge Distillation of Large Language Models, 2024.

[34] Knowledge Distillation: A Survey, 2020.

[35] https://github.com/huggingface/transformers

[36] https://github.com/openai/gpt-2

[37] https://github.com/meta-llama/llama3

[38] https://github.com/qwenlm/qwen2.5

[39] https://github.com/deepseek-ai/deepseek-v3/blob/main/deepseek_v3.pdf
