Transformers & Large Language Models

CME 295 – Transformers & Large Language Models
VIP Cheatsheet: Transformers & Large Language Models
Afshine Amidi and Shervine Amidi – Stanford University – March 23, 2025 (Spring 2025)
https://cme295.stanford.edu

This VIP cheatsheet gives an overview of what is in the "Super Study Guide: Transformers & Large Language Models" book, which contains ∼600 illustrations over 250 pages and covers the following concepts in depth. You can find more details at https://superstudy.guide.

1 Foundations

1.1 Tokens

❒ Definition – A token is an indivisible unit of text, such as a word, subword or character, and is part of a predefined vocabulary.

Remark: The unknown token [UNK] represents unknown pieces of text, while the padding token [PAD] is used to fill empty positions to ensure consistent input sequence lengths.

❒ Tokenizer – A tokenizer T divides text into tokens of an arbitrary level of granularity, e.g. "this teddy bear is reaaaally cute" → T → "this teddy bear is [UNK] cute [PAD] ... [PAD]".

Here are the main types of tokenizers:

• Word (e.g. "teddy", "bear")
  Pros: easy to interpret, short sequences. Cons: large vocabulary size, word variations not handled.
• Subword (e.g. "ted", "##dy", "bear")
  Pros: word roots leveraged, intuitive embeddings. Cons: increased sequence length, tokenization more complex.
• Character / Byte (e.g. "t", "e", "d", "d", "y")
  Pros: no out-of-vocabulary concerns, small vocabulary size. Cons: much longer sequence length, patterns hard to interpret because too low-level.

Remark: Byte-Pair Encoding (BPE) and Unigram are commonly used subword-level tokenizers.

1.2 Embeddings

❒ Definition – An embedding is a numerical representation of an element (e.g. token, sentence) and is characterized by a vector x ∈ R^n.

❒ Similarity – The cosine similarity between two tokens t_1, t_2 is quantified by:

    similarity(t_1, t_2) = (t_1 · t_2) / (||t_1|| ||t_2||) = cos(θ) ∈ [−1, 1]

The angle θ characterizes the similarity between the two tokens: embeddings are similar when θ ≈ 0 (e.g. "cute teddy bear" vs "cute teddy bear"), dissimilar when θ ≈ π (e.g. "cute teddy bear" vs "unpleasant teddy bear"), and independent when θ ≈ π/2 (e.g. "cute teddy bear" vs "airplane").

Remark: Approximate Nearest Neighbors (ANN) and Locality Sensitive Hashing (LSH) are methods that approximate the similarity operation efficiently over large databases.

2 Transformers

2.1 Attention

❒ Formula – Given a query q, we want to know which key k the query should pay "attention" to with respect to the associated value v (e.g. in the sentence "a cute teddy bear is reading .", the query for "teddy" is compared against the keys of every token to weight their values).

Attention can be efficiently computed using matrices Q, K, V that contain the queries q, keys k and values v respectively, along with the dimension d_k of the keys:

    attention = softmax( QK^T / √d_k ) V

❒ MHA – A Multi-Head Attention (MHA) layer performs attention computations across multiple heads, then projects the result in the output space. It is composed of h attention heads, each with its own matrices W_Q^i, W_K^i, W_V^i that project the input queries, keys and values to obtain Q, K and V. The combined head outputs are projected to the output space using the matrix W_O.

Remark: Grouped-Query Attention (GQA) and Multi-Query Attention (MQA) are variations of MHA that reduce computational overhead by sharing keys and values across attention heads.
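To make the attention formula above concrete, here is a minimal NumPy sketch of scaled dot-product attention. It is a toy illustration only: the dimensions and random Q, K, V inputs are made up, and no masking or multi-head logic is included.

import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability before exponentiating
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_queries, n_keys) similarity scores
    weights = softmax(scores, axis=-1)  # each row sums to 1: attention weights
    return weights @ V                  # (n_queries, d_v) weighted sum of values

# Toy example: 7 tokens, keys/queries of dimension d_k = 4, values of dimension d_v = 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(7, 4))
K = rng.normal(size=(7, 4))
V = rng.normal(size=(7, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (7, 8)

In an MHA layer, this computation is simply repeated h times on head-specific projections of the input before the results are combined and projected with W_O.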
2.2 Architecture

❒ Overview – The Transformer is a landmark model relying on the self-attention mechanism and is composed of encoders and decoders. Encoders compute meaningful embeddings of the input that are then used by decoders to predict the next token in the sequence.

Remark: Although the Transformer was initially proposed as a model for translation tasks, it is now widely used across many other applications.

❒ Components – The encoder and decoder are two fundamental components of the Transformer and have different roles:

• Encoder: self-attention (queries, keys and values all come from the input) followed by a feed-forward neural network. Encoded embeddings encapsulate the meaning of the input.
• Decoder: masked self-attention, then cross-attention (queries come from the decoder, keys and values from the encoder), then a feed-forward neural network. Decoded embeddings encapsulate the meaning of both the input and the output predicted so far.

❒ Position embeddings – Position embeddings inform the model of where each token is in the sentence and are of the same dimension as the token embeddings. They can either be arbitrarily defined or learned from the data.

Remark: Rotary Position Embeddings (RoPE) are a popular and efficient variation that rotates query and key vectors to incorporate relative position information.

2.3 Variants

❒ Encoder-only – Bidirectional Encoder Representations from Transformers (BERT) is a Transformer-based model composed of a stack of encoders that takes some text as input and outputs meaningful embeddings, which can later be used in downstream classification tasks.

A [CLS] token is added at the beginning of the sequence to capture the meaning of the whole sentence. Its encoded embedding is often used in downstream tasks, such as sentiment extraction, e.g. "[CLS] my teddy bear is cute" → N× encoder → linear + softmax → "positive".

❒ Decoder-only – Generative Pre-trained Transformer (GPT) is an autoregressive Transformer-based model that is composed of a stack of decoders. Contrary to BERT and its derivatives, GPT treats all problems as text-to-text problems, e.g. translating the en-US input "my teddy bear is cute ." into the fr-FR output "[BOS] mon ours en peluche est ...", one token at a time.

Remark: Encoder-decoder models, like T5, are also autoregressive and share many characteristics with decoder-only models.

Most of the current state-of-the-art LLMs rely on a decoder-only architecture, such as the GPT series, LLaMA, Mistral, Gemma, DeepSeek, etc.

2.4 Optimizations

❒ Attention approximation – Attention computations are in O(n^2), which can be costly as the sequence length n increases. There are two main methods to approximate these computations:

• Sparsity: self-attention does not happen across the whole sequence but only between the more relevant tokens, as sketched below.
• Low-rank: the attention formula is simplified into a product of low-rank matrices, which brings down the computational burden.

❒ Flash attention – Flash attention is an exact method that optimizes attention computations by cleverly leveraging GPU hardware, using the fast Static Random-Access Memory (SRAM) for matrix operations before writing results to the slower High Bandwidth Memory (HBM).

Remark: In practice, this reduces memory usage and speeds up computations.
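As a toy illustration of the sparsity approximation mentioned above, the sketch below restricts each query to a local sliding window of keys. This is a generic illustrative mask, not a specific published method, and it still materializes the full n×n score matrix for clarity, whereas real sparse-attention implementations avoid computing the masked entries.

import numpy as np

def local_attention(Q, K, V, window=2):
    # Sliding-window ("sparse") attention: token i only attends to tokens j with |i - j| <= window
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)                  # (n, n) full score matrix, for clarity only
    idx = np.arange(n)
    mask = np.abs(idx[:, None] - idx[None, :]) > window
    scores = np.where(mask, -np.inf, scores)         # disallowed positions get zero weight
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Toy usage: 8 tokens, each attending to at most itself and 2 neighbors on each side
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
print(local_attention(x, x, x, window=2).shape)  # (8, 16)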
3 Large language models

3.1 Overview

❒ Definition – A Large Language Model (LLM) is a Transformer-based model with strong NLP capabilities. It is "large" in the sense that it typically contains billions of parameters.

❒ Lifecycle – An LLM is trained in 3 steps: pretraining (learn generalities about language), finetuning (learn specific tasks) and preference tuning (demote bad answers).

Finetuning and preference tuning are post-training approaches that aim at aligning the model to perform certain tasks.

3.2 Prompting

❒ Context length – The context length of a model is the maximum number of tokens that can fit into its input. It typically ranges from tens of thousands to millions of tokens.

❒ Decoding sampling – Token predictions are sampled from the predicted probability distribution p_i, which is controlled by the hyperparameter temperature T:

    p_i = exp(x_i / T) / Σ_{j=1..n} exp(x_j / T)

Remark: High temperatures lead to more creative outputs whereas low temperatures (T ≪ 1) lead to more deterministic ones.

❒ Chain-of-thought – Chain-of-Thought (CoT) is a reasoning process in which the model breaks down a complex problem into a series of intermediate steps. This helps the model generate the correct final response. Tree of Thoughts (ToT) is a more advanced version of CoT.

Remark: Self-consistency is a method that aggregates answers across CoT reasoning paths.

3.3 Finetuning

❒ SFT – Supervised FineTuning (SFT) is a post-training approach that aligns the behavior of the model with an end task. It relies on high-quality input-output pairs aligned with that task.

Remark: If the SFT data consists of instructions, this step is called "instruction tuning".

❒ PEFT – Parameter-Efficient FineTuning (PEFT) is a category of methods used to run SFT efficiently. In particular, Low-Rank Adaptation (LoRA) approximates the learnable weights W ∈ R^{d×k} by freezing W_0 and learning low-rank matrices B ∈ R^{d×r} and A ∈ R^{r×k} instead, with r ≪ min(d, k):

    W ≈ W_0 + B × A

Remark: Other PEFT techniques include prefix tuning and adapter layer insertion.

3.4 Preference tuning

❒ Reward model – A Reward Model (RM) is a model that predicts how well an output ŷ aligns with desired behavior given the input x. Best-of-N (BoN) sampling, also called rejection sampling, is a method that uses a reward model to select the best response among N generations ŷ_1, ŷ_2, ..., ŷ_N from the model f:

    k = argmax_{i ∈ [[1,N]]} r(x, ŷ_i)

❒ Reinforcement learning – Reinforcement Learning (RL) is an approach that leverages the RM and updates the model f based on the rewards r(x, ŷ) of its generated outputs. If the RM is based on human preferences, this process is called Reinforcement Learning from Human Feedback (RLHF).

Proximal Policy Optimization (PPO) is a popular RL algorithm that incentivizes higher rewards while keeping the model close to the base model to prevent reward hacking.

Remark: There are also supervised approaches, like Direct Preference Optimization (DPO), that combine RM and RL into one supervised step.

3.5 Optimizations

❒ Mixture of experts – A Mixture of Experts (MoE) is a model that activates only a portion of its neurons at inference time. It is based on a gate G and experts E_1, ..., E_n:

    ŷ = Σ_{i=1..n} G(x)_i E_i(x)

MoE-based LLMs use this gating mechanism in their FFNNs.

Remark: Training an MoE-based LLM is notoriously challenging, as mentioned in the LLaMA paper, whose authors chose not to use this architecture despite its inference-time efficiency.

❒ Distillation – Distillation is a process where a (small) student model S is trained on the prediction outputs of a (big) teacher model T. It is trained using the KL divergence loss:

    KL(ŷ_T || ŷ_S) = Σ_i ŷ_T^(i) log( ŷ_T^(i) / ŷ_S^(i) )

Remark: Training labels are considered "soft" labels since they represent class probabilities.
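A minimal sketch of this distillation loss on made-up logits is shown below. The optional temperature argument reuses the same softening idea as the sampling temperature of section 3.2; it is an illustrative choice, not part of the KL formula above.

import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax: p_i = exp(x_i / T) / sum_j exp(x_j / T)
    z = logits / T
    z -= z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_kl(teacher_logits, student_logits, T=1.0):
    # KL(y_T || y_S) = sum_i y_T^(i) * log(y_T^(i) / y_S^(i)), computed on "soft" labels
    y_t = softmax(teacher_logits, T)
    y_s = softmax(student_logits, T)
    return float(np.sum(y_t * np.log(y_t / y_s)))

# Toy example with made-up logits over a 4-token vocabulary
teacher = np.array([2.0, 1.0, 0.2, -1.0])
student = np.array([1.5, 1.2, 0.1, -0.5])
print(distillation_kl(teacher, student, T=2.0))  # small positive value; 0 iff the distributions match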
❒ Quantization – Model quantization is a category of techniques that reduces the precision of the model weights while limiting the impact on the resulting model performance. As a result, this reduces the model's memory footprint and speeds up its inference.

Remark: QLoRA is a commonly used quantized variant of LoRA.

4 Applications

4.1 LLM-as-a-Judge

❒ Definition – LLM-as-a-Judge (LaaJ) is a method that uses an LLM to score given outputs according to some provided criteria (e.g. scoring "Teddy bears are the cutest" a 10/10 on a "cuteness" criterion). Notably, it is also able to generate a rationale for its score, which helps with interpretability.

Contrary to pre-LLM era metrics such as Recall-Oriented Understudy for Gisting Evaluation (ROUGE), LaaJ does not need any reference text, which makes it convenient for evaluating any kind of task. In particular, LaaJ shows strong correlation with human ratings when it relies on a big, powerful model (e.g. GPT-4), as it requires reasoning capabilities to perform well.

Remark: LaaJ is useful for quick rounds of evaluations, but it is important to monitor the alignment between LaaJ outputs and human evaluations to make sure there is no divergence.

❒ Common biases – LaaJ models can exhibit the following biases:

• Position bias: favors the first position in pairwise comparisons. Solution: average the metric over randomized positions.
• Verbosity bias: favors more verbose content. Solution: add a penalty on the output length.
• Self-enhancement bias: favors outputs generated by the judge model itself. Solution: use a judge built from a different base model.

A remedy to these issues can be to finetune a custom LaaJ, but this requires a lot of effort.

Remark: The list of biases above is not exhaustive.

4.2 RAG

❒ Definition – Retrieval-Augmented Generation (RAG) is a method that allows the LLM to access relevant external knowledge to answer a given question. This is particularly useful if we want to incorporate information past the LLM's pretraining knowledge cut-off date.

Given a knowledge base D and a question, a Retriever fetches the most relevant documents, then Augments the prompt with the relevant information before Generating the output.

Remark: The retrieval stage typically relies on embeddings from encoder-only models.

❒ Hyperparameters – The knowledge base D is initialized by chunking the documents into chunks of size n_c and embedding each chunk into a vector in R^d.

4.3 Agents

❒ Definition – An agent is a system that autonomously pursues goals and completes tasks on a user's behalf. It may use different chains of LLM calls to do so.

❒ ReAct – Reason + Act (ReAct) is a framework that allows for multiple chains of LLM calls to complete complex tasks. It is composed of the steps below:

• Observe: synthesize previous actions and explicitly state what is currently known.
• Plan: detail what tasks need to be accomplished and what tools to call.
• Act: perform an action via an API or look for relevant information in a knowledge base.

Remark: Evaluating an agentic system is challenging. However, this can still be done both at the component level via local inputs-outputs and at the system level via chains of calls.

4.4 Reasoning models

❒ Definition – A reasoning model is a model that relies on CoT-based reasoning traces to solve more complex tasks in math, coding and logic. Examples of reasoning models include OpenAI's o series, DeepSeek-R1 and Google's Gemini Flash Thinking.

Remark: DeepSeek-R1 explicitly outputs its reasoning trace between <think> tags.

❒ Scaling – Two types of scaling methods are used to enhance reasoning capabilities:

• Train-time scaling: run RL for longer to let the model learn how to produce CoT-style reasoning traces before giving an answer (performance improves with the number of RL steps).
• Test-time scaling: let the model think longer before providing an answer, e.g. with budget-forcing keywords such as "Wait" (performance improves with CoT length).
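To make the retrieval stage of section 4.2 concrete, here is a toy cosine-similarity retriever. It is only a sketch: chunking is a naive fixed-size split, and embed() is a hypothetical stand-in (a pseudo-random vector keyed on the text) rather than a real encoder-only embedding model, so the retrieved chunks are not semantically meaningful.

import numpy as np

def chunk(document, n_c=200):
    # Naive fixed-size chunking: split the document into pieces of at most n_c characters
    return [document[i:i + n_c] for i in range(0, len(document), n_c)]

def embed(text, d=64):
    # Hypothetical stand-in for an embedding model mapping text to a vector in R^d (illustration only)
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=d)

def retrieve(question, chunks, top_k=2):
    # Rank chunks by cosine similarity between the question embedding and each chunk embedding
    q = embed(question)
    scores = [q @ embed(c) / (np.linalg.norm(q) * np.linalg.norm(embed(c))) for c in chunks]
    best = np.argsort(scores)[::-1][:top_k]
    return [chunks[i] for i in best]

# Toy usage: build the knowledge base D, Retrieve, then Augment the prompt before Generating
D = chunk("Teddy bears were named after Theodore Roosevelt. They became popular in the early 1900s.", n_c=50)
question = "Where does the name 'teddy bear' come from?"
context = "\n".join(retrieve(question, D))
prompt = f"Answer using the context below.\nContext:\n{context}\n\nQuestion: {question}"
print(prompt)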
