Phi-4 Technical Report
Marah Abdin
Ronen Eldan
Mojan Javaheripi
Yuanzhi Li
Eric Price
Shital Shah
Dingli Yu
Jyoti Aneja
Suriya Gunasekar
Piero Kauffmann
Weishung Liu
Gustavo de Rosa
Xin Wang
Cyril Zhang
Harkirat Behl
Michael Harrison
James R. Lee
Caio C. T. Mendes
Olli Saarikivi
Rachel Ward
Yi Zhang
Sébastien Bubeck
Russell J. Hewett
Yin Tat Lee
Anh Nguyen
Adil Salim
Yue Wu
Microsoft Research
Abstract
We present phi-4, a 14-billion parameter language model developed with a training recipe that
is centrally focused on data quality. Unlike most language models, where pre-training is based pri-
marily on organic data sources such as web content or code, phi-4 strategically incorporates synthetic
data throughout the training process. While previous models in the Phi family largely distill the
capabilities of a teacher model (specifically GPT-4), phi-4 substantially surpasses its teacher model
on STEM-focused QA capabilities, giving evidence that our data-generation and post-training tech-
niques go beyond distillation. Despite minimal changes to the phi-3 architecture, phi-4 achieves strong
performance relative to its size – especially on reasoning-focused benchmarks – due to improved data,
training curriculum, and innovations in the post-training scheme.
1 Introduction
Recent advancements in Large Language Models (LLMs) have shown that significant improvements in
data quality can rival, and sometimes surpass, the performance gains traditionally achieved by scaling
compute with model and dataset size. Building on the success of the Phi family [GZA+23, LBE+23,
JBA+23, AAA+24], we introduce phi-4, a 14-billion parameter model that further advances performance
of small language models by introducing innovative synthetic data generation methods for reasoning-
focused tasks, by optimizing the training curriculum and data mixture, and by introducing new tech-
niques in post-training.
Synthetic data constitutes the bulk of the training data for phi-4 and is generated using a diverse
array of techniques, including multi-agent prompting, self-revision workflows, and instruction reversal.
These methods enable the construction of datasets that induce stronger reasoning and problem-solving
abilities in the model, addressing some of the weaknesses in traditional unsupervised datasets. Synthetic
data in phi-4 also plays a crucial role in post-training, where techniques such as rejection sampling and
a novel approach to Direct Preference Optimization (DPO) are employed to refine the model’s outputs.
The development of phi-4 is guided by three core pillars:
1. Synthetic Data for Pretraining and Midtraining: High-quality synthetic datasets are designed to prioritize reasoning and problem-solving, carefully generated to ensure diversity and relevance. We change our training curriculum and create new pretraining and midtraining data mixtures to increase the allocation of synthetic tokens, compared to older generations of phi.

| Benchmark           | phi-4 14b | phi-3 14b | Qwen 2.5 14b instruct | GPT-4o-mini | Llama-3.3 70b instruct | Qwen 2.5 72b instruct | GPT-4o |
|---------------------|-----------|-----------|-----------------------|-------------|------------------------|-----------------------|--------|
| MMLU                | 84.8      | 77.9      | 79.9                  | 81.8        | 86.3                   | 85.3                  | 88.1   |
| GPQA                | 56.1      | 31.2      | 42.9                  | 40.9        | 49.1                   | 49.0                  | 50.6   |
| MATH                | 80.4      | 44.6      | 75.6                  | 73.0        | 66.3¹                  | 80.0                  | 74.6   |
| HumanEval           | 82.6      | 67.8      | 72.1                  | 86.2        | 78.9¹                  | 80.4                  | 90.6   |
| MGSM                | 80.6      | 53.5      | 79.6                  | 86.5        | 89.1                   | 87.3                  | 90.4   |
| SimpleQA            | 3.0       | 7.6       | 5.4                   | 9.9         | 20.9                   | 10.2                  | 39.4   |
| DROP                | 75.5      | 68.3      | 85.5                  | 79.3        | 90.2                   | 76.7                  | 80.9   |
| MMLUPro             | 70.4      | 51.3      | 63.2                  | 63.4        | 64.4                   | 69.6                  | 73.0   |
| HumanEval+          | 82.8      | 69.2      | 79.1                  | 82.0        | 77.9                   | 78.4                  | 88.0   |
| ArenaHard           | 75.4      | 45.8      | 70.2                  | 76.2        | 65.5                   | 78.4                  | 75.6   |
| LiveBench           | 47.6      | 28.1      | 46.6                  | 48.1        | 57.6                   | 55.3                  | 57.6   |
| IFEval              | 63.0      | 57.9      | 78.7                  | 80.0        | 89.3                   | 85.0                  | 84.8   |
| PhiBench (internal) | 56.2      | 43.9      | 49.8                  | 58.7        | 57.1                   | 64.6                  | 72.4   |

Table 1: Performance of phi-4 on a set of standard benchmarks. The first set of benchmarks uses OpenAI's simple-evals framework [Ope24b], which specifies the prompts, answer extraction, and temperature (0.5). The first four model columns (phi-4, phi-3, Qwen 2.5 14b instruct, GPT-4o-mini) are small models of similar inference cost; the remaining three columns are larger models.
2. Curation and Filtering of High-Quality Organic Data: We meticulously curate and filter
organic² data sources, including web content, licensed books, and code repositories, to extract seeds
for the synthetic data pipeline that encourage high-depth reasoning and prioritize educational
value (to the model). These seeds form the foundation of the synthetic generation pipeline. To
complement these synthetic datasets, we also filter the web for high-quality data (in terms of
knowledge and reasoning) to use directly in pretraining.
3. Post-Training: We further advance the post-training recipe in phi-4 by creating new refined
versions of SFT datasets, as well as by developing a new technique to create DPO pairs, based on
pivotal token search.
With these innovations, the performance of phi-4 on reasoning-related tasks is comparable to or
surpasses much larger models. For example, its performance on many widely used reasoning-related
benchmarks meets or exceeds that of Llama-3.1-405B. In Table 1 we compare the performance of
our model on academic benchmarks to several contemporary foundation models. We find that phi-
4 significantly exceeds its teacher GPT-4o on the GPQA (graduate-level STEM Q&A) and MATH
(math competition) benchmarks.
¹ These scores are lower than those reported by Meta, perhaps because simple-evals has a strict formatting requirement that Llama models have particular trouble following. We use the simple-evals framework because it is reproducible, but Meta reports 77 for MATH and 88 for HumanEval on Llama-3.3.
² We use organic to refer to human-generated or otherwise non-synthetic data.
[Figure 1: bar chart of average scores on the November 2024 AMC 10/12 tests for small and large models, with phi-4 scoring 91.8.]

Figure 1: Average performance of different models on the November 2024 AMC-10 and AMC-12 tests. This is the average score (with maximum score 150) over the four tests on 100 runs with temperature t = 0.5. We chose t = 0.5 to follow simple-evals [Ope24b]. Error bars are 2σ of the estimate. On competition math, phi-4 scores well above its weight class, even compared to non-open-weight models.
1.1 Addressing Overfitting and Data Contamination
Decontamination: One pitfall of foundation models is overfitting to benchmarks, such as through
the leakage of benchmark test sets via the web corpus. We improved the data decontamination process
for phi-4 compared to previous Phi models to ensure no unfair influence on evaluation results. More
details of the decontamination method are given in Appendix B.
AMC Benchmark: The surest way to guard against overfitting to the test set is to test on fresh data.
We tested our model on the November 2024 AMC-10 and AMC-12 math competitions [Com24], which
occurred after all our training data was collected, and we only measured our performance after choosing
all the hyperparameters in training our final model. These contests are the entry points to the Math
Olympiad track in the United States and over 150,000 students take the tests each year.
In Figure 1 we plot the average score over the four versions of the test, all of which have a maximum
score of 150. phi-4 outperforms not only similar-size or open-weight models but also much larger frontier
models. Such strong performance on a fresh test set suggests that phi-4’s top-tier performance on the
MATH benchmark is not due to overfitting or contamination. We provide further details in Appendix C.
Relying on Contamination-Proof Benchmarks: We give significant weight to benchmarks which
were designed in such a way that the questions are original and do not appear on the web, such as
GPQA [RHS + 23]. While optimizing our model, we relied on an internal benchmark composed primarily
of original prompts written by the team (see Section 5 for further details).
Long Chain-of-Thought Models: A style of LLM that scales inference-time compute by generating
long chains of thought has emerged over the past few months, as pioneered by OpenAI O1 [Ope24a] and
followed by DeepSeek-R1-Lite-Preview [Dee24] and Qwen/QwQ-32B-Preview [Tea24]. These models
perform well on reasoning benchmarks, where QwQ, the only such model with open weights, averages
124.5 points in the AMC-10/12 setting of Figure 1. However, QwQ also uses 4X more tokens on this
task than phi-4 and has more than twice as many parameters. Thus, the inference cost of QwQ is an
order of magnitude higher than phi-4. Consequently, these models are not in the same class as phi-4 with
respect to cost or latency.
2 Approach to Data
The pretraining phase of phi-4 relies heavily on synthetic datasets generated through a variety of tech-
niques. In addition, we employ several methods for filtering organic data sources that are used both as
complementary datasets in the pretraining and as seeds for generating synthetic data.
2.1 Purpose of Synthetic Data
Synthetic data as a substantial component of pretraining is becoming increasingly common, and the Phi
series of models has consistently emphasized the importance of synthetic data. Rather than serving as
a cheap substitute for organic data, synthetic data has several direct advantages over organic data.
Structured and Gradual Learning. In organic datasets, the relationship between tokens is often
complex and indirect. Many reasoning steps may be required to connect the current token to the next,
making it challenging for the model to learn effectively from next-token prediction. By contrast, each
token generated by a language model is by definition predicted by the preceding tokens, making it easier
for a model to follow the resulting reasoning patterns. In this way, synthetic data may act as a form of
“spoonfeeding,” presenting challenges in a digestible and progression-oriented manner.
A simple example to illustrate this is that a human-written solution to a math problem might start
with the final answer. This answer is much too hard to output immediately, for either a human or an
LLM—the human produced it by nonlinear editing, but pretraining expects the LLM to learn to produce
it linearly. Synthetic solutions to math problems will not have such roadblocks.
Alignment with Inference Contexts. Synthetic data is typically closer to the format of outputs
we expect our models to generate. Training on such data helps align the model’s pretraining experience
with the scenarios it encounters during inference. This alignment ensures that the context seen during
generation remains in-distribution with respect to the data the model was pretrained on.
For example, web forums are very different in style from LLM interactions. If a fact only appears
in web forum data, the pretrained model will think it is very unlikely to occur in the chats it produces.
Rewriting facts from the web forum into the language style of an LLM makes the facts more accessible
during the LLM chat context of inference.
Principles. Our approach to generating synthetic data for phi-4 is guided by the following principles:
1. Diversity: The data should comprehensively cover subtopics and skills within each domain. This
requires curating diverse seeds from organic sources.
2. Nuance and Complexity: Effective training requires nuanced, non-trivial examples that reflect
the complexity and the richness of the domain. Data must go beyond basics to include edge cases
and advanced examples.
3. Accuracy: Code should execute correctly, proofs should be valid, and explanations should adhere
to established knowledge.
4. Chain-of-Thought: Data should encourage systematic reasoning, teaching the model various
approaches to the problems in a step-by-step manner. This fosters coherent outputs for complex
tasks.
2.2 Synthetic Data for Pretraining and Midtraining
We created 50 broad types of synthetic datasets, each relying on a different set of seeds and a
different multi-stage prompting procedure, spanning an array of topics, skills, and natures of interaction,
and accumulating to a total of about 400B unweighted tokens. In Appendix D, we give a few examples
of transcripts taken from our synthetic generations. Here, we highlight novel methodologies used in
generating synthetic datasets for phi-4:
• Seed Curation: The synthetic dataset generation begins with high-quality seeds sourced from
multiple domains. These curated seeds provide the foundation for synthetic data generation,
enabling the creation of exercises, discussions, and reasoning tasks tailored to the model’s training
objectives.
1. Web and Code-based Seeds: Excerpts and snippets are extracted from web pages, books,
and code repositories with a focus on content that demonstrates high complexity, reasoning
depth, and educational value. To ensure quality, we employ a two-stage filtering process:
first, identifying pages with strong educational potential, and second, segmenting the selected
pages into passages, scoring each for its factual and reasoning content.
2. Question Datasets: A large set of questions was collected from websites, forums, and Q&A
platforms. These questions were then filtered using a plurality-based technique to balance
difficulty. Specifically, we generated multiple independent answers for each question and
applied majority voting to assess the consistency of responses. We discarded questions where
all answers agreed (indicating the question was too easy) or where answers were entirely
inconsistent (indicating the question was too difficult or ambiguous); a minimal sketch of this
filter appears at the end of this subsection. This filtering process produces a dataset of questions
that challenge the model's reasoning and problem-solving abilities while remaining approachable.
The plurality answers were used in place of the ground truth in our rejection-sampling-based generations.
3. Creating Question-Answer pairs from Diverse Sources: Another technique we use
for seed curation involves leveraging language models to extract question-answer pairs from
organic sources such as books, scientific papers, and code. This approach does not rely on
merely identifying explicit Q&A pairs within the text. Instead, it involves a pipeline designed
to detect deduction chains or logical progressions in the text. The language model identifies
key steps in reasoning or problem-solving processes and reformulates them into questions
and corresponding answers. Our experiments show that, if done correctly, training on the
resulting content can be far more effective (in terms of improvement on academic and internal
benchmarks) than training on the original content.
• Rewrite and Augment: Seeds are transformed into synthetic data through multi-step prompting
workflows. This includes rewriting most of the useful content in given passages into exercises,
discussions, or structured reasoning tasks.
• Self-revision: The initial responses are then iteratively refined through a feedback loop in which
a model critiques and subsequently improves its own outputs, guided by rubrics focused on
reasoning and factual accuracy.
• Instruction Reversal for Code and Other Tasks: To enhance the model's ability to generate
outputs from instructions, we used an instruction reversal technique. For example, we take existing
code snippets from the code data corpus and use them to generate corresponding instructions that
include the problem description or task prompt. The resulting synthetic data pairs were structured
with the instruction appearing before the code. Only data with high fidelity between the original
and regenerated code are retained, ensuring alignment between the instructions and the outputs.
This method can be generalized to other targeted use cases.
• Validation of Code and Other Scientific Data: When appropriate, we incorporate tests for
validating our reasoning-heavy synthetic datasets. The synthetic code data is validated through ex-
ecution loops and tests. For scientific datasets, the questions are extracted from scientific materials
using a method designed to ensure high relevance, groundedness, and difficulty balance.
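To make the plurality-based difficulty filter described above concrete, the following is a minimal sketch, assuming a hypothetical `sample_answers(question, n)` hook that returns n independently generated, normalized final answers; the sample count and agreement threshold are illustrative, not the values used for phi-4.

```python
from collections import Counter

def plurality_filter(question, sample_answers, n_samples=8, min_agree=2):
    """Keep a question only if independently sampled answers partially agree.

    All answers agreeing suggests the question is too easy; no plurality at all
    suggests it is too hard or ambiguous. The plurality answer stands in for
    ground truth in rejection-sampling-based generations.
    """
    answers = sample_answers(question, n_samples)        # hypothetical hook
    top_answer, top_count = Counter(answers).most_common(1)[0]
    if top_count == n_samples or top_count < min_agree:
        return None                                       # discard the question
    return {"question": question, "plurality_answer": top_answer}
```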
2.3 Curation and Filtering of Web and Q&A Data
Q&A datasets. We collected tens of millions of high-quality organic problems and solutions by reviewing public websites, relying on existing datasets, and acquiring external datasets. Our experience from
previous models showed that question-answer data contributed significantly to various capabilities, such
as mathematical reasoning and academic performance. Our ablation studies showed that organic ques-
tions are substantially more effective than synthetic questions. We used several ways to synthetically
augment the dataset of organic questions to obtain a larger dataset. While these rewritten questions
improved the model’s capabilities, the gains were not as pronounced. A significant portion of the col-
lected questions lacked accurate solutions. To address this, we replaced the answers with synthetically
generated ones and used majority-voting to increase accuracy. All collected questions and solutions
underwent a thorough decontamination process to ensure there is no overlap with test sets³.

³ This step is crucial to the reliability of some of the academic benchmarks: for instance, some test benchmark variants can be found on platforms like Hugging Face. Moreover, benchmarks such as MMLU are frequently compiled from web-sourced questions.
Targeting High-quality Web Data. We collected a wide variety of high-quality organic data sources
for phi-4, prioritizing reasoning-dense and nuanced material (e.g., academic papers, educational forums,
and programming tutorials). In addition to directly training on this text, we used various web sources
as seeds for specialized synthetic data generation pipelines. We found clean and correct natural data to
be absolutely crucial for seeding synthetic data: minor errors can result in severe quality degradations
for derived synthetic documents. We therefore invested heavily in the perfectionistic curation of our web
data. We discuss the main techniques and considerations below:
• Targeted Acquisitions: We included major repositories of reasoning-dense documents that are
publicly permissible for use (e.g., arXiv, PubMed Central, GitHub) or explicitly licensed (e.g.,
licensed books) aiming for a level of comprehensiveness, recency, and cleanliness above the typical
standard of externally available corpora.
• Filtering Web Dumps: To capture the long tail of information-rich web sources (e.g., forums,
blogs, course material, domain-specific wikis), we took the approach of selecting a small fraction
of the highest-quality documents from bulk web dumps, using small (non-LLM) classifiers trained on
∼10^6 LLM-generated annotations. This approach tends to over-index on STEM-related keywords,
so we created a specialized pipeline to amplify high-quality non-STEM content (e.g., arts, history,
travel, culture, and entertainment). These topic classifications were also obtained by distilling an
LLM annotator. Finally, we removed corrupted text and binary files by detecting outliers according
to n-gram statistics and compression ratios; a minimal sketch of such a check appears at the end of this subsection.
• Multilingual Data: We incorporated multilingual datasets to ensure that our model could han-
dle a wide range of languages, including German, Spanish, French, Portuguese, Italian, Hindi
and Japanese. This involved sourcing and processing high-quality multilingual documents from
CommonCrawl and Wikipedia. Our multilingual processing pipeline uses a fastText-based language
identification model to categorize documents into 176 languages, and then applies the same quality
classifiers used for filtering web dumps. Note that the classifiers were trained
on multilingual LLM-generated annotations.
• Custom Extraction and Cleaning Pipelines: To ensure sufficient cleanliness and uniformity
between heterogeneous organic data sources, we needed a collection of customized heuristics and
parsers. For each targeted data source, we built custom pipelines to ingest a variety of file formats
(e.g., multi-file TeX source, ePub and other XML-like formats, Microsoft Word documents, and
PDFs). For general web data, we built a custom HTML-to-text extractor, taking significant
care to preserve fragile content that is frequently corrupted by naïve parsers (e.g., TeX/MathML
equations, code blocks, tables, and forum thread structure). This extractor prunes and normalizes
the DOM tree, using a variety of signals (e.g., HTML tag names, CSS classes, content length, and
tree depth) to distinguish elements such as boilerplate, advertisements, equations, and syntax-
highlighter artifacts.
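As an illustration of the outlier detection mentioned under Filtering Web Dumps, here is a minimal sketch of a compression-ratio and n-gram check; the thresholds and the zlib-based heuristic are illustrative assumptions, not the production filters.

```python
import zlib
from collections import Counter

def looks_corrupted(text: str, max_ratio=0.95, min_ratio=0.15,
                    max_trigram_frac=0.20) -> bool:
    """Flag documents whose statistics look like binary junk or repeated garbage.

    Near-incompressible bytes often indicate encoded/binary content, while an
    extremely compressible document or a single dominating character trigram
    often indicates boilerplate or corrupted repetition.
    """
    data = text.encode("utf-8", errors="replace")
    if len(data) < 256:
        return False  # too short to judge reliably
    ratio = len(zlib.compress(data)) / len(data)
    if ratio > max_ratio or ratio < min_ratio:
        return True
    trigrams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    top_frac = trigrams.most_common(1)[0][1] / (len(text) - 2)
    return top_frac > max_trigram_frac
```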
2.4 Post-Training Datasets
Our post-training data is composed of:
• Supervised Fine-Tuning (SFT) Datasets: Using carefully curated user prompts taken from
a mixture of publicly available datasets and synthetically generated data, we generate multiple
model responses and select the best using an LLM-based evaluation process.
• Direct Preference Optimization (DPO): We generate DPO pairs based on rejection sampling
and LLM evaluation, a part of which is based on our approach to creating pivotal token-based
pairs, explained in Section 4.3 below.
3 Pretraining Details
The phi-4 model is based on a decoder-only transformer architecture [VSP+17] with 14B parameters and
a default context length of 4096. This is later extended to a 16K context length during midtraining. The
architecture closely follows phi-3-medium, except that we now use the tiktoken tokenizer (for better
multilingual support) with a padded vocabulary size of 100,352 (including unused tokens) and we use
full attention over the 4K context length, rather than a 2K sliding window used in phi-3-medium.
The model was pretrained for approximately 10T tokens using linear warm-up and decay schedules
with peak learning rate of 0.0003, constant weight decay of 0.1, and global batch size of 5760. The
training hyperparameters are tuned using interpolations from shorter horizon runs and further adjusted
by stress testing the learning rate warm-up stage for stability. Pretraining is followed by a shorter
midtraining stage to increase the original context length of 4k to 16k.
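As a minimal sketch of the schedule shape described above: only the peak learning rate (0.0003) and the linear warm-up/decay form come from this section; the warm-up length and total step count below are illustrative placeholders.

```python
def learning_rate(step: int, total_steps: int, warmup_steps: int = 2000,
                  peak_lr: float = 3e-4) -> float:
    """Linear warm-up to peak_lr, then linear decay back toward zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)
```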
Since pre-trained models are not good at instruction following, it is not very informative to use
0-shot evaluations that require the answer to be in a specific format, for example simple-evals. We
therefore use an internal implementation of benchmarks for pretraining which uses a mixture of log-likelihood and/or few-shot prompts for various tasks. Specifically, we used log-likelihood evaluations for MMLU (5-shot), MMLU-pro, and ARCC (1-shot). We used 1, 3, 4, and 8 few-shot examples for TriviaQA (TQA), MBPP, MATH, and GSM8k to help the model adhere to the answer format for easier extraction of the solution. We use this evaluation method throughout Section 3. Table 2 summarizes the performance boost of pretrained phi-4 compared with its predecessor phi-3-medium.

|             | MMLU | MMLU-pro | GSM8k | HumanEval | ARCC | MBPP | MATH | TQA  |
|-------------|------|----------|-------|-----------|------|------|------|------|
| phi-4 (4k)  | +3.0 | +10.3    | +2.2  | +7.8      | +1.1 | +6.8 | +8.9 | -0.7 |
| phi-4 (16k) | +2.7 | +8.9     | +1.2  | +9.0      | +0.9 | +9.6 | +8.4 | -1.5 |

Table 2: Pretraining benchmarks for phi-4 compared to its predecessor, phi-3-medium, after pretraining.
3.1 Data Composition in Pretraining
The phi-3 model family was trained using a two-phase strategy. Most of the training tokens were used
in phase 1 of the training, which consisted largely of filtered web data. Phase 2 was trained with a
data mixture consisting primarily of synthetic tokens and a much smaller allocation for ultra-filtered
and reasoning-heavy web data. As the size and complexity of our synthetic data grew, we observed a
marginal drop in the benefit from using non-synthetic tokens for the phi-3 family of model sizes. We
note two key observations.
• Web datasets showed small benefits on reasoning-heavy benchmarks. Prioritizing more epochs
over our synthetic data led to better performance than adding fresh web tokens.
• Models trained only with synthetic data underperformed on the knowledge-heavy benchmarks and
demonstrated increased hallucinations.
Figure 2 demonstrates the first phenomenon using smaller scale phase 2 pretraining exercises. In
this example, we conduct two training runs per model scale, using the same number of training tokens
on top of phase 1 pretrained checkpoints. For all runs, the number of unique synthetic tokens is fixed
(a subsample of full synthetic data) but the number of repetitions on this data changes, namely 4 and
12 epochs. The rest of the training tokens are fresh unique tokens supplied from web sources. As seen,
performing more iterations on the synthetic data is more beneficial than supplying more web tokens.
Inspired by this scaling behavior of our synthetic data, we trained a 13B parameter model solely on
synthetic⁴ data, for ablation purposes only – the model sees over 20 repetitions of each data source. For
the sake of ablations, we partitioned our synthetic data into web rewrites, which includes more direct
rewrites of our filtered web content relative to all other types of synthetic data. Table 3 compares the
previous phi-3-medium model with the new model trained entirely on the synthetic data. Throughout
training, all benchmarks consistently improved, despite the increase in epochs, and the majority of
the benchmarks showed improvements over phi-3. However, knowledge-related benchmarks, like 1-shot
TriviaQA (TQA), show a large gap where synthetic models are subpar. These observations led us to
rethink the role of web data in our data mixture.
⁴ This is an updated mixture of synthetic data that contains new sources compared to phi-3.
Figure 2: 5-shot MMLU score for phase 2 pretraining runs with 4 and 12 epochs of synthetic data. All models are
trained for the same token horizon, thus the model with 4 epochs of synthetic has seen more (unique) web tokens.
We see that despite many epochs on synthetic data, we do not see overfitting behavior and in fact the 12 epoch
models perform better than those that have seen more unique web tokens.
|                          | MMLU | MMLU-pro | GSM8k | HumanEval | ARCC | MBPP | MATH | TQA   |
|--------------------------|------|----------|-------|-----------|------|------|------|-------|
| Synthetic                | +0.8 | +4.0     | +2.2  | +12.1     | 0.0  | +5.0 | +4.9 | -14.8 |
| Synthetic + Web Rewrites | +0.3 | +4.1     | +1.8  | +13.3     | +3.0 | +7.6 | +8.1 | -7.7  |

Table 3: Benchmark performance of 13B models (used for ablations only) trained on data mixtures containing no web data. The respective training tokens are either from synthetic sources, or an equal share of synthetic data and web rewrites. All numbers are reported relative to the performance of phi-3-medium, which has seen a combination of web and synthetic data.
3.2 Data Mixture
To design our pretraining data mixture for a given training token budget, we search over different
allocations of tokens coming from various sources, namely, 1) synthetic, 2) web rewrites⁵, 3) filtered web
(divided into reasoning and knowledge-heavy portions), 4) targeted acquisitions and organic data (e.g.,
academic data, books, and forums), and 5) code data.
We conducted ablations using a shorter token horizon of 1T tokens to derive the data mixture.
These ablations rely on our established result on the high-rank correlation of short training with longer
training, up to the over-fitting saturation threshold of data sources. In addition we observe a high rank
correlation between the performance of the 7B and 14B models on different data mixtures, given a large
enough distance between the data mixtures. This allowed us to conduct the experiments at 7B scale and
transfer the findings to phi-4. Among the numerous ablations, we highlight a few that offer the best insights
into our data composition. Specifically, we freeze the ratio of tokens coming from targeted acquisitions
and code categories, and change the ratio of tokens for the synthetic, web, and web rewrites clusters.
Table 4 summarizes the results for the hand-picked ablations, as compared with the data mixture that was used for the final training run. A uniform allocation of tokens among the three categories is suboptimal due to the higher quality of synthetic data, and the only benchmark that shows a clear benefit from web data is TQA. While the synthetic-heavy variations on rows 2 and 3 of the table are marginally better than the chosen final data mixture, we decided to integrate the targeted and knowledge-heavy filtered web data sources to improve knowledge benchmarks (see Section 3.1) and balance all model capabilities. We also note that the gap between the chosen data mixture and the synthetic-heavy runs largely closes as the model goes through the post-training stage. An end-to-end optimization of the pretraining data mixture that also takes into account the effects of post-training is an interesting area for future investigation.

⁵ Web rewrites is a sub-category of synthetic data that is substantially large and contains direct rewrites of web content.

|         | MMLU | MATH | GSM8k | HumanEval | ARCC | MBPP | TQA  | MMLU-pro | Average |
|---------|------|------|-------|-----------|------|------|------|----------|---------|
| Uniform | -3.3 | -5.4 | -5.8  | -1.2      | +0.6 | -2.0 | +3.3 | -3.6     | -2.2    |
| S       | +3.3 | +4.0 | +2.1  | -6.1      | +1.9 | +0.4 | -3.0 | +3.7     | +0.8    |
| S + WR  | +0.6 | +1.2 | +1.5  | -1.2      | +1.6 | +1.6 | -3.7 | +1.2     | +0.4    |
| S + W   | -0.6 | -0.7 | -0.7  | -4.3      | +0.3 | -2.0 | +6.9 | +0.9     | 0.0     |

Table 4: Ablations on the allocation of 75% of training tokens to synthetic (S), filtered web (W), and web rewrite (WR) categories, while other data sources are held constant in the remaining 25% token budget. All benchmark numbers are measured relative to the final data mixture used for training phi-4.
| Data Source      | Fraction of Training | Unique Token Count | Number of Epochs |
|------------------|----------------------|--------------------|------------------|
| Web              | 15%                  | 1.3T               | 1.2              |
| Web rewrites     | 15%                  | 290B               | 5.2              |
| Synthetic        | 40%                  | 290B               | 13.8             |
| Code data        | 20%                  | 820B               | 2.4              |
| Acquired sources | 10%                  | 580B               | 1.7              |

Table 5: Data mixture for pretraining.
The final data mixture used for phi-4 allocates 30% of the training tokens to web and web rewrites
data sources, divided equally between them. The remaining tokens are largely sourced from synthetic
data which accounts for 40% of the data mixture tokens. Finally we allocate 20% of tokens to code data
(mixture of synthetic and raw code) and 10% to targeted acquired sources like academic data and books.
In terms of the total number of unique tokens in each data mixture cluster, filtered web data is the largest
cluster with ∼1.3T tokens. Code and targeted acquisitions are the second and third largest clusters with
∼820B and ∼580B tokens, respectively. Finally, web rewrites and synthetic data have a similar token
count of ∼290B tokens each. The total number of epochs on each data source is determined as the ratio
of tokens allocated in the mixture to the number of unique tokens in that source; a worked example follows.
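As a worked example of that calculation (a minimal sketch; the ≈10T-token budget and the unique-token counts are the approximate figures reported above and in Table 5):

```python
TOTAL_TOKENS = 10e12  # approximate pretraining budget (~10T tokens)

# source: (fraction of training tokens, approximate unique tokens in source)
mixture = {
    "web":              (0.15, 1.3e12),
    "web_rewrites":     (0.15, 290e9),
    "synthetic":        (0.40, 290e9),
    "code":             (0.20, 820e9),
    "acquired_sources": (0.10, 580e9),
}

for source, (fraction, unique_tokens) in mixture.items():
    epochs = fraction * TOTAL_TOKENS / unique_tokens
    print(f"{source:17s} {epochs:5.1f} epochs")
# Matches the epoch counts in Table 5: roughly 1.2, 5.2, 13.8, 2.4, and 1.7.
```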
3.3 Midtraining Details
phi-4 includes a midtraining stage where the context length is increased from the original 4K to 16K.
We conduct several ablations to study the role of data on long-context performance. Specifically, we try
data sources that are inherently long context, and compare them with artificially created long context
data where samples are padded together to fill the sequence. We observe the former to perform better
in longer context tasks.
Inspired by this, we further filter our high-quality non-synthetic datasets (i.e., academic, books, and
code data) to separate samples above 8K context. We then up-weight the data subsets that are 16K or
higher in length. We also create new synthetic datasets that satisfy the > 4K sequence requirement. The
final data mixture includes 30% of the newly curated longer context data and a 70% portion of recall
tokens from the pretraining stage. To accommodate longer context, we increase the base frequency of the RoPE position encoding to 250K following [AI23b]. We drop the maximum learning rate by a factor of 10 compared to the pretraining stage and train for a total of 250B tokens.

| Model         | Max Length | Recall | RAG  | ICL  | Re-rank | QA   | Summ |
|---------------|------------|--------|------|------|---------|------|------|
| phi-4         | 8K         | 100.0  | 58.1 | 68.0 | 65.3    | 26.7 | 38.3 |
| Qwen-2.5-14B  | 8K         | 100.0  | 62.2 | 67.8 | 58.2    | 24.7 | 37.2 |
| Llama-3.3-70B | 8K         | 92.0   | 65.3 | 69.4 | 64.4    | 30.0 | 37.8 |
| GPT-4o-mini   | 8K         | 99.2   | 65.8 | 74.4 | 69.4    | 31.3 | 38.5 |
| GPT-4o        | 8K         | 100.0  | 66.9 | 83.0 | 75.1    | 37.3 | 43.0 |
| phi-4         | 16K        | 99.0   | 57.1 | 77.0 | 54.4    | 36.0 | 40.5 |
| Qwen-2.5-14B  | 16K        | 100.0  | 59.1 | 67.6 | 50.3    | 29.7 | 42.3 |
| Llama-3.3-70B | 16K        | 92.0   | 62.2 | 70.0 | 63.3    | 36.7 | 41.9 |
| GPT-4o-mini   | 16K        | 100.0  | 63.6 | 78.4 | 63.9    | 36.0 | 45.2 |
| GPT-4o        | 16K        | 100.0  | 66.7 | 85.6 | 73.8    | 43.7 | 46.3 |

Table 6: Evaluation results on the long-context benchmark HELMET [YGH+24].
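To illustrate what raising the RoPE base frequency in the midtraining recipe above does, the sketch below computes the per-dimension inverse frequencies for two bases; the head dimension of 128 and the default base of 10,000 are illustrative assumptions rather than reported phi-4 values.

```python
def rope_inverse_frequencies(head_dim: int, base: float) -> list[float]:
    """Per-pair inverse rotation frequencies used by rotary position embeddings."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

default_base = rope_inverse_frequencies(128, 10_000.0)
extended_base = rope_inverse_frequencies(128, 250_000.0)
# With the larger base, the low-frequency dimensions rotate far more slowly,
# which keeps positions distinguishable over the longer 16K context.
print(default_base[-1], extended_base[-1])
```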
To effectively evaluate the long-context capability of our model, it is essential to have a comprehensive
evaluation framework with practical scenarios. While synthetic benchmarks like needle-in-a-haystack and
RULER are preferred for their simplicity and control, our emphasis is on a diverse range of tasks that
reflect real-world applications, such as reasoning across entire documents. We report the performance
of phi-4 and other models on the tasks we selected from the HELMET [YGH+24] evaluation suite in
Table 6, and we outline our evaluation methods below. Note that results are averaged across 5 runs for
each category.
• Recall: The task involves retrieving the corresponding value from a randomly-generated long JSON
file given a specific key (Metric: SubEM)
• RAG: Answer questions based on many retrieved and shuffled Wikipedia documents. The datasets
used for this task are NaturalQuestions, HotpotQA, and PopQA. Final results are the average over all
datasets (Metric: SubEM)
• Re-rank: The task is to re-rank the top-10 documents given a query and many retrieved and
shuffled documents. This uses the MSMARCO dataset (Metric: nDCG@10)
• ICL: The task involves many-shot in-context learning with datasets such as TREC coarse, TREC
fine, Banking77, NLU, and CLINC150. Final results are the average over all datasets (Metric: F1)
• QA: Answer questions given a lengthy document. The dataset associated with this task is Narra-
tiveQAv2 (Metric: GPT-4o scoring)
• Summ: The task involves summarizing a lengthy legal document, and the dataset used is Multi-
LexSum (Metric: GPT-4o scoring)
| Dataset Name                 | Sample Count |
|------------------------------|--------------|
| unknown + safety data        | 3,000        |
| generic multiple-choice Q&A  | 132,859      |
| math data                    | 76,552       |
| python data                  | 16,080       |
| cpp, go, java, js, rust data | 21,806       |

Table 7: Data Mixture for Pivotal Token DPO

| Dataset Name          | Sample Count |
|-----------------------|--------------|
| unknown + safety data | 43,842       |
| any vs any overall    | 266,000      |
| any vs any accuracy   | 532,000      |

Table 8: Data Mixture for Judge-Guided DPO
4 Post-Training
Post-training is aimed at transforming the pretrained language model into an AI assistant that users
can safely interact with. We align the pretrained model with one round of SFT (Section 4.1), one round of
DPO [RSM+23] on data from our pivotal token search method (see Section 4.3), and one round of DPO
on full-length preference pairs. The model is chat finetuned using the standard chatml format; an example
usage template for two rounds of a conversation is as follows:
<|im_start|>system<|im_sep|>system message<|im_end|>
<|im_start|>user<|im_sep|>prompt1<|im_end|><|im_start|>assistant<|im_sep|>response1<|im_end|>
<|im_start|>user<|im_sep|>prompt2<|im_end|><|im_start|>assistant<|im_sep|>
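A minimal helper that renders a conversation into this template could look as follows; the function is illustrative and not part of the phi-4 release.

```python
def format_chatml(system: str, turns: list[tuple[str, str | None]]) -> str:
    """Render a system message and (user, assistant) turns into the format above.

    Pass None as the assistant response of the last turn to leave the
    generation prompt open, as in the template's final line.
    """
    parts = [f"<|im_start|>system<|im_sep|>{system}<|im_end|>"]
    for user, assistant in turns:
        parts.append(f"<|im_start|>user<|im_sep|>{user}<|im_end|>")
        parts.append("<|im_start|>assistant<|im_sep|>")
        if assistant is not None:
            parts.append(f"{assistant}<|im_end|>")
    return "".join(parts)

prompt = format_chatml("system message", [("prompt1", "response1"), ("prompt2", None)])
```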
4.1 Supervised Fine-Tuning
In this phase, we fine-tune the pretrained model with a learning rate of 10⁻⁶ on a variety of data generated
from high-quality data across diverse domains, including math, coding, reasoning, conversation, model
identity, and safety. We also added multilingual data for 40 languages. We use around 8B tokens of
data in this phase, all formatted in the chatml format.
4.2 Direct Preference Optimization
We use DPO [RSM+23] to align the model with human preferences, and also to steer the model away
from unwanted behavior through pairs of desired and undesired outputs. DPO data covers chat format
data, reasoning, and Responsible AI (RAI) data and improves the model in math, coding, reasoning,
robustness, and safety. We do two rounds of DPO on the SFT model. We introduce a technique, Pivotal
Token Search (PTS), to generate pairs for DPO for the first DPO round. Details of the data mixture
for the first round are provided in Table 7.
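For reference, both rounds optimize the standard DPO objective of [RSM+23] (written here in its usual form, not a phi-4-specific variant), where y_w and y_l denote the desired and undesired outputs and π_ref is the frozen reference policy for that round:

\[
\mathcal{L}_{\mathrm{DPO}}(\theta) \;=\; -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right].
\]

For the pivotal-token pairs of Section 4.3, y_w and y_l are single tokens t_acc and t_rej sharing the same prefix, so the preference signal concentrates on the pivotal token rather than being diluted over a full-length response.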
For the second round, which we call judge-guided DPO, we gather approximately 850k pairs of
desired and undesired outputs. The prompts are sourced from various publicly available instruction
tuning datasets and also include prompts related to safety and Responsible AI (RAI). Next, for each of
these prompts, we generate responses from GPT-4o, GPT-4t and our model. From these responses, we
create various combinations of DPO pairs and use GPT-4o as a judge to label positive or negative for
a given pair. For a given pair of responses, each assistant response is given a score based on accuracy,
style, and detail. We label the response with the higher accuracy or overall (average of accuracy, style, and
detail) score as the positive response; a minimal sketch of this rule follows this paragraph. We provide the prompt we used in Appendix A. The data mixture
for this round is provided in Table 8. Both stages also include a small amount of data for safety and
mitigating hallucinations.
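The following is a minimal sketch of the labeling rule just described, assuming judge scores for accuracy, style, and detail are already available; the exact tie-breaking used for phi-4 is not specified here.

```python
from statistics import mean

def label_pair(scores_a: dict, scores_b: dict):
    """Pick the positive response by higher accuracy, falling back to the
    higher overall (mean of accuracy, style, detail) score.
    Returns ('a', 'b') meaning a is positive, ('b', 'a'), or None for a tie."""
    if scores_a["accuracy"] != scores_b["accuracy"]:
        return ("a", "b") if scores_a["accuracy"] > scores_b["accuracy"] else ("b", "a")
    overall_a, overall_b = mean(scores_a.values()), mean(scores_b.values())
    if overall_a != overall_b:
        return ("a", "b") if overall_a > overall_b else ("b", "a")
    return None  # uninformative pair; skip
```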
[Figure 3: a transcript of GPT-4o solving the MATH problem "Suppose that the roots of x³ + 3x² + 4x − 11 = 0 are a, b, and c, and that the roots of x³ + rx² + sx + t = 0 are a + b, b + c, and c + a. Find t." Each response token is colorized by the probability that a completion continued from that point succeeds; the pivotal tokens "negative" (0.42 → 0.93) and "(a" (0.95 → 0.71) are boxed.]

Figure 3: Illustration of pivotal tokens for GPT-4o at temperature 1 on a problem from the MATH benchmark [HBK+21], where the initial success probability is 0.31. Each token is colorized by the probability of success for an independent completion (N = 529) continued from after the token, with red for p(success) = 0 and blue for p(success) = 1. The line plot shows the same probabilities. Tokens that change p(success) by ≥ 0.2 are shown boxed, with subscripts showing the change in probability. Tokens with probability ≤ 0.1 are underlined to illustrate that pivotal tokens are distinct from low-probability tokens. The token probabilities of "negative" and "(a" were 0.31 and 0.12, respectively. The greedy tokens for the same prefixes are "product" with 0.66 probability and "t" with 0.88 probability.
procedure PivotalTokenSearch(Q, T_full, p_gap)
    procedure Subdivide(T_prefix, T)
        if |T| ≤ 1 or |p(success | T_prefix) − p(success | T_prefix + T)| < p_gap then    ▷ Base cases.
            return [T]
        T_left, T_right ← Split(T)    ▷ We split at the cumulative midpoint of token log probabilities.
        return Subdivide(T_prefix, T_left) ∪ Subdivide(T_prefix + T_left, T_right)
    T_prefix ← ε
    for all T ∈ Subdivide(ε, T_full) do
        if |T| = 1 and |p(success | T_prefix) − p(success | T_prefix + T)| ≥ p_gap then
            yield (Q, T_prefix, T)    ▷ Output pivotal tokens T and context for postprocessing.
        T_prefix ← T_prefix + T

Figure 4: Pseudocode for Pivotal Token Search (PTS). Note that estimating p(success | ...) involves sampling the language model and invoking the oracle. In an efficient implementation, p(success | ...) should be memoized.
4.3 Pivotal Token Search
Consider a generative model producing a token-by-token response to a given prompt. For each token
produced, which corresponds to a prefix of the model response, one can consider the conditional proba-
bility of the model’s answer being correct given that prefix, as well as the increment in this probability
with respect to that token (in other words, the difference in the probability of being correct before and
after producing that token). It is often the case that the overall correctness is highly dependent on
a successful generation of a small number of key tokens. For example, we can see in Figure 3, where
the model outputs a math solution, that a “fortunate” sampling of the crucial token negative shifts the
solution from possible failure to likely success, while sampling of the token (a subsequently risks failure
again. We refer to these tokens as pivotal tokens, as they have an outsized effect on the course of the
solution.
Now, consider how the solution from Figure 3 would be used in DPO as a full-length accepted
response. As the figure shows, there are many tokens with probabilities much lower than the 0.31 of
negative , which would contribute to noise in the gradients diluting the signal from the pivotal token.
Even worse, the token (a that contributed to the lack of robustness would receive a strong positive
learning signal thanks to its low probability of 0.12.
Moreover, intuition suggests that when two texts substantially deviate from each other, comparison
of their individual next-token log probabilities (as done in DPO) is not very meaningful. Rather, it
makes more sense that the signal should come from the first tokens after the two texts start diverging
from each other.
To alleviate these effects, we employ a method we call Pivotal Token Search (PTS) for generating
preference data that specifically targets pivotal tokens in isolation, creating DPO pairs in which the
preference optimization takes effect with respect to a single token.
PTS identifies points of a completion token sequence T_full = t_1, t_2, ... for some user query Q where
the next token t_i has a significant impact on the probability of success p(success | t_1, ..., t_i). PTS
estimates these probabilities by sampling completions starting from Q + t_1, ..., t_i, which are checked for
correctness with an oracle⁶ for Q. Figure 4 shows a basic instantiation of the algorithm. The procedure
Subdivide recursively splits the sequence into segments t_i, ..., t_j until the change in probability
|p(success | t_1, ..., t_{i−1}) − p(success | t_1, ..., t_j)| for each segment is below a threshold p_gap or the segment
is just a single token. Tokens with a sharp change in success probability are kept as pivotal. We turn
pivotal tokens into preference data by taking Q + t_1, ..., t_{i−1} as the query, and single tokens t_acc and t_rej
that increase/decrease p(success | t_1, ..., t_{i−1}, t_acc/rej) as the accepted and rejected completions, respectively.⁷
The binary-search algorithm for PTS is not always guaranteed to find all pivotal tokens, but
it only finds pivotal tokens, and it finds all of them if the success probability is near-monotone over the
course of the solution.
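The subdivision in Figure 4 can be written as a runnable toy in a few lines; here `p_success(query, prefix)` is a caller-supplied stub (in practice it samples completions, checks them with the oracle, and memoizes results), and the simple midpoint split stands in for splitting at the cumulative midpoint of token log probabilities.

```python
def pivotal_token_search(query, tokens, p_success, p_gap=0.2):
    """Yield (query, prefix, pivotal_token) triples, mirroring Figure 4."""
    def subdivide(prefix, segment):
        delta = abs(p_success(query, prefix) - p_success(query, prefix + segment))
        if len(segment) <= 1 or delta < p_gap:
            return [segment]                      # base cases
        mid = len(segment) // 2                   # midpoint split (simplified)
        left, right = segment[:mid], segment[mid:]
        return subdivide(prefix, left) + subdivide(prefix + left, right)

    prefix = []
    for segment in subdivide([], list(tokens)):
        if len(segment) == 1:
            delta = abs(p_success(query, prefix) - p_success(query, prefix + segment))
            if delta >= p_gap:
                yield query, list(prefix), segment[0]
        prefix = prefix + segment
```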
We used PTS to generate preference data for tasks where ground-truth is readily available, such as
mathematics, various forms of question answering and coding. To improve sample efficiency, we filter
the target questions to only include those with 0.2 ≤ p(success) ≤ 0.8, as pivotal tokens are rare for tasks
that are very easy or hard.
See Figure 5 for examples of preference data we generated using PTS. The math question answering
example shows how pivotal tokens often are not actual mistakes, but choices that drive the model down
a less advantageous path. Here, multiplying both sides by the denominators separately is just as valid
as directly cross-multiplying, but for the model, doing the latter here is more robust. By generating
DPO data targeting such choices, we believe PTS helps phi-4 work better in the modes where it is especially strong.

⁶ For coding, a comprehensive test suite can be used. For math problems, answers can be compared to the ground truth.
⁷ We find drawing t_acc and t_rej from rollouts PTS already used to estimate p(success | t_1, ..., t_{i−1}) to be effective.
[Figure 5: three PTS preference-data examples. (1) A math question ("Find the product of all real values of r for which 1/(2x) = (r − x)/7 has exactly one real solution"), where the good pivotal continuation is "cross-multiplying" and the bad one is "multiplying both sides by". (2) A physics question about a 1 keV helium atom impinging at a one-degree incidence angle on a LiF crystal surface, where the good continuation is "the vibrational frequency" and the bad one is "that the potential interactions". (3) A Python implementation task (reorder_sequence, which reorders a list so that each next number is exactly twice or exactly one third of the previous one), where the good continuation is "other % 3" and the bad one is "num * 2".]

Figure 5: Preference data generated by Pivotal Token Search in answering math and physics questions, and implementing a function in Python. The tokens that form the actual pair for DPO are underlined.
[Figure 6: stacked bars of SimpleQA outcomes (Correct / Not Attempted / Incorrect) at each post-training stage. The share of questions not attempted rises from 6.8% (Base) to 38.7% (SFT), 79.8% (DPO stage 1), and 81.1% (Final), while incorrect answers fall from 90.0% to 15.8% and correct answers stay around 3%.]

Figure 6: The post-training process described in Appendix A.1 decreases hallucinations. One measure is that the problems in SimpleQA—which the model very rarely gets correct—are increasingly not attempted during the course of post-training. We believe the final result is better behavior, even though the simple-evals score for SimpleQA (the F1 score) actually gives our base model a higher score than our final model.
Related Work: In [LLX+24], a contrastive estimation approach involving a model trained on incorrect
trajectories is used to score which tokens likely contributed to failure; these scores are then employed to
weight rejected responses in DPO. In comparison, our PTS avoids complications from learned proxies by
directly estimating p(success). They also report difficulties applying their method to accepted responses
in DPO, while our method generates both positive and negative preference data directly targeting pivotal
tokens. Automated process supervision methods [WLS + 24, LLL + 24] have applied search and rollouts to
generate data for training process reward models. PTS can be seen as an automated process supervision
method that generates token-level preference data suitable for DPO.
4.4 Hallucination Mitigation
We generate SFT data and DPO pairs to mitigate hallucination. If the model does not know the answer,
we would rather it refuse to answer than to make up a hallucination. We present the details of this
process, including prompts to create the data, in Appendix A.1. This greatly decreases hallucinations
in SimpleQA (see Figure 6).
4.5 Post-Training Ablation
In Table 9 we show how our benchmark scores evolve during post-training. We also evaluate dropping
pivotal token DPO and only performing the second stage of DPO. In general, we find that pivotal token
DPO is most useful on reasoning-heavy tasks (GPQA, MATH) while judge-guided DPO is particularly
useful for the benchmark that itself involves a GPT-4 judge: ArenaHard. We also find the two approaches
to be complementary.
5 Benchmarking Considerations
While academic benchmarks are widely used to measure progress in LLM advancement, they
suffer from several limitations that can fail to reveal a model’s true capabilities and weaknesses. These
limitations include:
• Data Contamination: Many benchmarks rely on datasets that overlap with pretraining corpora, creating a risk of data contamination. Although we took extensive measures to deduplicate and decontaminate our training data, including standard n-gram deduplication and decontamination (a minimal sketch of such a check appears after this list), these methods are not effective against all scenarios, such as rephrasing, which leaves some uncertainty about the true extent of generalization.

| Benchmark           | SFT  | DPO stage 1 | DPO stage 2 only | phi-4 (stage 1 + 2) |
|---------------------|------|-------------|------------------|---------------------|
| MMLU                | 82.8 | 84.8        | 84.2             | 84.8                |
| GPQA                | 47.3 | 53.6        | 52.4             | 56.1                |
| MATH                | 77.1 | 80.5        | 77.6             | 80.4                |
| HumanEval           | 79.5 | 81.6        | 81.5             | 82.6                |
| MGSM                | 80.8 | 80.8        | 81.5             | 80.6                |
| SimpleQA            | 3.7  | 2.9         | 2.9              | 3.0                 |
| DROP                | 82.8 | 86.1        | 71.8             | 75.5                |
| MMLUPro             | 61.9 | 70.0        | 67.2             | 70.4                |
| HumanEval+          | 77.9 | 81.9        | 81.4             | 82.8                |
| ArenaHard           | 56.7 | 66.5        | 69.8             | 75.4                |
| IFEval              | 66.2 | 63.0        | 63.0             | 63.0                |
| PhiBench (internal) | 48.2 | 54.5        | 53.0             | 56.2                |

Table 9: Performance through the post-training process. The first seven benchmarks are evaluated with simple-evals. DPO stage 1 is pivotal token DPO, and DPO stage 2 is more standard judge-guided DPO. Each also has 1-5% hallucination and safety data mixed in.
• Limited Skill Scope: Most benchmarks evaluate models on narrowly defined skills, such as
solving a specific style of math problem at a certain grade level or implementing isolated Python
functions. This narrow scope can fail to capture a model's broader capabilities and weaknesses.
• Bias in Generation-Based Benchmarks: Some benchmarks use LLM-as-judge for evaluat-
ing generated outputs. These judgments may sometimes prioritize style, fluency, or surface-level
qualities over accuracy and the validity of the reasoning chain, leading to potential biases in scoring.
• Limitations of Multiple-Choice Tasks: Benchmarks that rely on multiple-choice questions
often test a model's ability to make clever guesses, which can be achieved by pattern matching
rather than by effectively utilizing the underlying concepts through reasoning.
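As an illustration of the kind of n-gram check referred to in the first bullet (a minimal sketch only; the 13-gram length is a common choice used here for illustration, and the actual procedure is described in Appendix B):

```python
def ngram_set(text: str, n: int = 13) -> set:
    """Set of word n-grams for a document (lowercased, whitespace-tokenized)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def shares_ngram(train_doc: str, test_items: list[str], n: int = 13) -> bool:
    """Flag a training document that shares any word n-gram with a test item."""
    doc_ngrams = ngram_set(train_doc, n)
    return any(doc_ngrams & ngram_set(item, n) for item in test_items)
```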
To address these issues, we maintain an internal benchmark called PhiBench, which is tailored to
evaluate the diverse skills and reasoning abilities that we found critical to phi-4’s development. This
benchmark was designed with the following goals:
1. Originality: All questions in the benchmark were composed by our team, ensuring that they
were not present in our pretraining data. Our goal for the internal benchmark is to reveal the model's
generalization ability in various domains.
2. Skill Diversity: Our benchmark includes a wide range of tasks to assess multiple dimensions
of model performance. For instance, in coding, it goes beyond isolated function implementation
to include debugging, extending incomplete code, and explaining code snippets. Similarly, in
mathematics, it incorporates tasks like identifying the errors in proofs or generating related prob-
lems, rather than simply solving equations. This ensures that the benchmark captures a broader
spectrum of skills and reasoning processes.
3. Rigorous Scoring for Generation Tasks: For tasks requiring judgment of model-generated
outputs, we addressed the common pitfalls of LLM-based scoring by carefully curating detailed
judge instructions (or “judge notes”). These rubrics specify exactly how to evaluate responses,
focusing on accuracy, logical structure, and adherence to task requirements, while minimizing
tendencies towards stylistic biases. We observed significantly improved consistency and a
reduction in the adverse impact of subjective preferences on the scoring outcomes.
PhiBench played a central role in optimizing phi-4. We used it to guide decisions about dataset
mixtures and hyperparameter choices for more effective post-training techniques. PhiBench was also
used to perform high-signal studies that identify weaknesses in the model and provide feedback for new
incoming data sources.
6 Performance on Key Benchmarks
Our benchmark results were presented in Table 1, along with comparisons to other models. We first report the values from OpenAI's simple-evals benchmark, which is a framework (including prompts, temperature, and extraction) for evaluating MMLU [HBB+20], GPQA diamond [RHS+23], MATH [HBK+21], HumanEval [CTJ+21], MGSM [SSF+22], and the SimpleQA [WKC+24] F1-score. We also consider MMLU-pro [WMZ+24], HumanEval+ [LXWZ23], ArenaHard [CZS+24], and IFEval [ZLM+23], for which we use an internal framework with our own prompting and answer extraction. Finally, we use PhiBench, our internal collection of evaluations (see Section 5).
phi-4 outperforms the closest in-class contemporary model, Qwen-2.5-14B-Instruct, in 9 out of 12
benchmarks. While phi-4 underperforms relative to Qwen-2.5-14B-Instruct on the benchmark numbers
for SimpleQA, DROP, and IFEval, we consider phi-4’s behavior on SimpleQA to actually be better
than Qwen’s. In fact, our base model gets a higher benchmark score than Qwen-2.5-14B-Instruct on
SimpleQA, and we intentionally modified the model’s behavior in post-training to optimize for a better
user experience rather than a higher benchmark score. See Figure 6 and Appendix A.1 for details.
Our model excels at STEM Q&A tasks. For example, on GPQA (graduate-level STEM questions)
and MATH (math competitions), it even outscores its teacher model, GPT-4o. It also scores higher at
coding, as measured by HumanEval and HumanEval+, than any other open-weight model we benchmark
against, including much larger Llama models.
phi-4’s weakest benchmark scores are on SimpleQA, DROP, and IFEval. We believe for the first
two that the number reported by simple-evals is reductive and does not accurately reflect model
performance on the benchmark problems. However, IFEval reveals a real weakness of our model – it
has trouble strictly following instructions. While strict instruction following was not an emphasis of our
synthetic data generations for this model, we are confident that phi-4’s instruction-following performance
could be significantly improved with targeted synthetic data.
7 Safety
We developed phi-4 in accordance with Microsoft’s Responsible AI principles. Our overall approach to
RAI consisted of safety alignment in post-training, red-teaming, and automated testing and evaluations
across dozens of RAI harm categories. We leveraged helpfulness and harmlessness preference datasets
[BJN+22, JLD+23] with modifications inspired by [BSA+24] and multiple in-house generated datasets to address the RAI harm categories in safety post-training.

| Category | phi-3 (3B-4K) | phi-3 (7B-8K) | phi-3 (14B-4K) | Mistral (7B-v0.1) | Mistral (7B-v0.2) | Llama-3 (8B) | Gemma (7B) | phi-4 |
|----------|---------------|---------------|----------------|-------------------|-------------------|--------------|------------|-------|
| Grounding | 4.469 | 4.701 | 4.787 | 4.065 | 4.692 | 4.672 | 4.32 | 4.619 |
| 3P Content Harms (DR1): Books, News, Recipes, Songs | 0.26 | 0.253 | 0.251 | 0.562 | 0.399 | 0.373 | 0.383 | 0.121 |
| Harmful Content Continuation (DR3): Hate/Fairness, Self-Harm, Sexual, Violence | 0.007 | 0.003 | 0.01 | 0.026 | 0.018 | 0.013 | 0.013 | 0.036 |
| Harmful Content Summarization (DR3): Hate/Fairness, Self-Harm, Sexual, Violence | 0.105 | 0.11 | 0.112 | 0.223 | 0.16 | 0.082 | 0.103 | 0.102 |
| Jailbreak (DR1): see text for covered topics | 0.117 | 0.107 | 0.111 | 0.153 | 0.13 | 0.114 | 0.073 | 0.156 |

Table 10: Performance comparison across models. Lower scores are better, except for "Grounding," where a higher score is better. phi-4 values appear in the last column.
7.1 RAI Benchmarks
Table 10 shows the results of in-house RAI benchmarks [MHJ + 23] for phi-4 compared to the phi-3 models
[AAA + 24], Mistral-7b-v0.1 [JSM + 23], Mistral-7b-v0.2, Gemma 7b [TMH + 24], and Llama-3-instruct-
8b [AI23b]. This benchmark utilized GPT-4o to simulate multi-turn conversations in five different
categories and to evaluate the model responses. Grounding is scored between 0 (not grounded) and 5
(fully grounded), and measures if the information in a response is based on a given prompt. In other
categories, responses were evaluated in terms of the severity of harmfulness and scored from 0 (no harm)
to 7 (severe harm), and the defect rates (DR-x) were computed as the percentage of samples with a
severity score greater than or equal to x. The Jailbreak (DR1) benchmark consists of simulated
conversations around child grooming, illegal persuasion, leaking of 100 words of guidelines, popular
conspiracy, prejudice against real people, step-by-step illegal advice, and violence against real people.
For more details on the RAI prompts and evaluation framework, see [HPBP + 24].
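To make the DR-x definition concrete, the computation reduces to a simple fraction; the sketch below uses hypothetical severity scores, not numbers from our evaluation.

```python
def defect_rate(severity_scores, x):
    """DR-x: fraction of conversations whose harm severity (0-7) is >= x."""
    return sum(1 for s in severity_scores if s >= x) / len(severity_scores)

# Hypothetical severities assigned by the GPT-4o grader for one category.
scores = [0, 0, 1, 0, 3, 0, 0, 2, 0, 0]
print(defect_rate(scores, 1))  # DR1 = 0.3
print(defect_rate(scores, 3))  # DR3 = 0.1
```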
7.2 Red Teaming
In addition to RAI benchmarking, we collaborated with the Microsoft AI Red Team (AIRT), an inde-
pendent group tasked with identifying safety and security vulnerabilities in Microsoft’s GenAI products.
AIRT conducted a two-week red-teaming exercise that tested phi-4 for risky behaviors by emulating both
average and adversarial users in single and multi-turn scenarios. Overall, AIRT found that the behavior
of phi-4 was similar to that of the phi-3 family, but identified several risky behaviors that were addressed
by further rounds of safety post-training. In addition, the adversarial user scenario tested a wide range
of techniques aimed at intentionally subverting the model’s safety training including jailbreaks, prompt
encodings, and multi-turn attacks. phi-4 showed strong defenses against these techniques. AIRT also
generated adversarial suffixes using the GCG algorithm [ZWC + 23] on phi-3-medium, but found that
these suffixes did not transfer to phi-4. Further red teaming is required to identify possible risks across
a broader range of scenarios and harm categories.
8 Weaknesses
While phi-4 achieves a similar level of language understanding and reasoning ability as much larger models,
it is still fundamentally limited by its size for certain tasks, in particular hallucinations around factual
knowledge. For example, if X is a plausible human name, the model sometimes responds to prompts of the
form “Who is X?” with a hallucinated biography of the person X. This limitation could be mitigated by
augmenting the model with a search engine, but factual hallucinations cannot be eliminated completely.
While phi-4 demonstrates relatively strong performance in answering questions and performing rea-
soning tasks, it is less proficient at rigorously following detailed instructions, particularly those involving
specific formatting requirements. For instance, when tasked with generating outputs in strict tabular
formats, adhering to predefined bullet structures, or precisely matching stylistic constraints, the model
may produce outputs that deviate from the specified guidelines. This limitation arises in part from the
model’s training focus, which prioritized synthetic datasets tailored toward Q&A and reasoning tasks
over instruction-following scenarios.
Even on reasoning tasks, phi-4 can make mistakes. For example, when asked “which number is
smaller, 9.9 or 9.11?”, the model can conclude incorrectly that “9.9 is smaller than 9.11”.
Moreover, as our data contains a lot of chain-of-thought examples, phi-4 sometimes gives long elab-
orate answers even for simple problems—this might make user interactions tedious. We also note that
while phi-4 can function as a chat bot, it has been fine-tuned to maximize performance on single-turn
queries.
Despite diligent RAI efforts, we acknowledge challenges around reproduction or amplification of
biases, inappropriate content generation, and safety issues. The use of carefully curated training data,
targeted post-training, and improvements from red-teaming insights have mitigated these issues across
all dimensions, but have not resolved them completely.
Acknowledgments
We thank Janardhan Kulkarni and Sivakanth Gopi from Microsoft Research for the initial discussion
around Pivotal Token Search. We thank the AI Red Team (AIRT) at Microsoft, especially Blake
Bullwinkel, Bolor-Erdene Jagdagdorj, Daniel Jones, Shiven Chawla, Tori Westerhoff, and Ram Shankar
Siva Kumar, and Olga Dutova-Fairfax from the Deployment Safety Board and the Office of Responsible
AI at Microsoft for collaborating with us on evaluating and improving our model on vulnerabilities in
safety and security, which helped us adhere to Microsoft’s RAI standards. Finally, we are grateful to
Ece Kamar, Doug Burger and Peter Lee from Microsoft Research for the support provided to the team
during the work on the model.
References
[AAA + 24] Marah Abdin, Sam Jacobs Ade, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah,
Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. Phi-3
technical report: A highly capable language model locally on your phone. arXiv preprint
arXiv:2404.14219, 2024.
[AI23a] Meta AI. Introducing Meta Llama 3: The most capable openly available llm to date, 2023.
[AI23b] Meta AI. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2023.
[BJN + 22] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma,
Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Ka-
davath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-
Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt,
Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish,
Chris Olah, Ben Mann, and Jared Kaplan. Training a helpful and harmless assistant with
reinforcement learning from human feedback, 2022.
[BSA + 24] Federico Bianchi, Mirac Suzgun, Giuseppe Attanasio, Paul Röttger, Dan Jurafsky, Tatsunori
Hashimoto, and James Zou. Safety-tuned Llamas: Lessons from improving the safety of large
language models that follow instructions, 2024.
[Com24] American Mathematics Competitions. American mathematics competitions problems and
solutions: Amc 10/12. https://www.maa.org/math-competitions, 2024. Accessed:
2024-12-08.
[CTJ + 21] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto,
Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evalu-
ating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
[CZS + 24] Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li,
Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica.
Chatbot arena: An open platform for evaluating llms by human preference, 2024.
[Dee24] DeepSeek. Deepseek r1 lite preview. https://api-docs.deepseek.com/news/news1120, 2024.
Accessed: 2024-12-08.
[GZA + 23] Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno,
Sivakanth Gopi, Mojan Javaheripi, Gustavo de Rosa, Piero Kauffmann, Olli Saarikivi, Adil
Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan,
Adam Tauman Kalai, Yin Tat Lee, and Yuanzhi Li. Textbooks are all you need. arXiv
preprint arXiv:2306.11644, 2023.
[HBB + 20] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and
Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint
arXiv:2009.03300, 2020.
[HBK + 21] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang,
Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math
dataset. NeurIPS, 2021.
[HPBP + 24] Emman Haider, Daniel Perez-Becker, Thomas Portet, Piyush Madan, Amit Garg,
Atabak Ashfaq, David Majercak, Wen Wen, Dongwoo Kim, Ziyi Yang, et al. Phi-3
safety post-training: Aligning language models with a “break-fix” cycle. arXiv preprint
arXiv:2407.13833, 2024.
[JBA + 23] Mojan Javaheripi, Sébastien Bubeck, Marah Abdin, Jyoti Aneja, Caio César
Teodoro Mendes, Weizhu Chen, Allie Del Giorno, Ronen Eldan, Sivakanth Gopi, Suriya
Gunasekar, Piero Kauffmann, Yin Tat Lee, Yuanzhi Li, Anh Nguyen, Gustavo de Rosa, Olli
Saarikivi, Adil Salim, Shital Shah, Michael Santacroce, Harkirat Singh Behl, Adam Tau-
mann Kalai, Xin Wang, Rachel Ward, Philipp Witte, Cyril Zhang, and Yi Zhang. Phi-2:
The surprising power of small language models. Microsoft Research Blog, 2023.
[JCWZ17] Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. Triviaqa: A large scale
distantly supervised challenge dataset for reading comprehension, 2017.
[JLD + 23] Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Chi Zhang, Ruiyang
Sun, Yizhou Wang, and Yaodong Yang. Beavertails: Towards improved safety alignment of
llm via a human-preference dataset, 2023.
[JSM + 23] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh
Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile
Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut
Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023.
[LBE + 23] Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and
Yin Tat Lee. Textbooks are all you need II: phi-1.5 technical report. arXiv preprint
arXiv:2309.05463, 2023.
[LLL + 24] Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Harsh Lara, Yunxuan Li, Lei
Shu, Yun Zhu, Lei Meng, Jiao Sun, and Abhinav Rastogi. Improve mathematical reasoning
in language models by automated process supervision, 2024.
[LLX + 24] Zicheng Lin, Tian Liang, Jiahao Xu, Xing Wang, Ruilin Luo, Chufan Shi, Siheng Li, Yujiu
Yang, and Zhaopeng Tu. Critical tokens matter: Token-level contrastive estimation enhances
llm’s reasoning capability, 2024.
[LXWZ23] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated
by chatgpt really correct? rigorous evaluation of large language models for code generation.
arXiv preprint arXiv:2305.01210, 2023.
[MHJ + 23] Ahmed Magooda, Alec Helyar, Kyle Jackson, David Sullivan, Chad Atalla, Emily Sheng,
Dan Vann, Richard Edgar, Hamid Palangi, Roman Lutz, Hongliang Kong, Vincent Yun,
Eslam Kamal, Federico Zarfati, Hanna Wallach, Sarah Bird, and Mei Chen. A framework
for automated measurement of responsible AI harms in generative AI applications, 2023.
[Ope24a] OpenAI. Learning to reason with language models. https://openai.com/index/
learning-to-reason-with-llms/, 2024. Accessed: 2024-12-08.
[Ope24b] OpenAI. Simple evals. https://github.com/openai/simple-evals, 2024.
[RHS + 23] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang,
Julien Dirani, Julian Michael, and Samuel R. Bowman. Gpqa: A graduate-level google-proof
q&a benchmark, 2023.
[RSM + 23] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon,
and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward
model. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
[SSF + 22] Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi,
Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei.
Language models are multilingual chain-of-thought reasoners, 2022.
[Tea24] Qwen Team. Qwq: Reflect deeply on the boundaries of the unknown, November 2024.
[TMH + 24] Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju,
Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al.
Gemma: Open models based on gemini research and technology, 2024.
[VSP + 17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural
Information Processing Systems, volume 30, 2017.
[WFM + 24] Yue Wu, Yewen Fan, So Yeon Min, Shrimai Prabhumoye, Stephen McAleer, Yonatan Bisk,
Ruslan Salakhutdinov, Yuanzhi Li, and Tom Mitchell. Agentkit: Flow engineering with
graphs, not coding. In COLM, 2024.
[WKC + 24] Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia
Glaese, John Schulman, and William Fedus. Measuring short-form factuality in large lan-
guage models, 2024.
[WLS + 24] Peiyi Wang, Lei Li, Zhihong Shao, R. X. Xu, Damai Dai, Yifei Li, Deli Chen, Y. Wu,
and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human
annotations, 2024.
[WMZ + 24] Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo,
Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex
Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. Mmlu-pro: A more robust and chal-
lenging multi-task language understanding benchmark, 2024.
[YGH + 24] Howard Yen, Tianyu Gao, Minmin Hou, Ke Ding, Daniel Fleischer, Peter Izsak, Moshe
Wasserblat, and Danqi Chen. Helmet: How to evaluate long-context language models effec-
tively and thoroughly, 2024.
[ZLM + 23] Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan,
Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv
preprint arXiv:2311.07911, 2023.
[ZWC + 23] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson.
Universal and transferable adversarial attacks on aligned language models, 2023.
A Post-Training Dataset Details
A.1 Refusal to Hallucinate
We created post-training SFT and DPO data to mitigate hallucinations in simple settings. Without
any mitigation, phi-4 would almost never admit to ignorance. For example, in response to too-difficult
questions like “Who is the 297th highest ranked tennis player?”, the model would essentially act as an
improv-style “Yes, and...” engine, inventing a superficially plausible answer.
Our goal in pretraining was to pack as much information into the model as possible, that is, to teach
the model more rather than to teach it its own limitations. Then, in post-training, we identify the class
of problems that is too difficult for the model and teach it to generate refusals rather than hallucinations
on those problems.
We started with seed trivia problems, such as from TriviaQA [JCWZ17]. For each question, we ran
phi-4 multiple times to estimate its chance of accurately solving it. We also used GPT-4o to generate
(1) a correct answer, (2) a refusal to answer, (3) a bogus variant of the question that is impossible to
solve, and (4) a refusal to answer the bogus question.
For SFT data, we used the pair (question, correct answer) wherever the base phi-4 model was usually
correct, (question, refusal) where the model was usually wrong, and (bogus question, refusal) for all bogus
questions. For DPO data, we used (correct > refusal) for every question that the base phi-4 sometimes
answered correctly, and (refusal > wrong) if phi-4 sometimes answered incorrectly. The DPO data used
the first 5 tokens of the response. Example synthetic generation prompts can be found below.
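A minimal sketch of these selection rules, assuming per-question accuracy estimates from repeated sampling and the four GPT-4o generations described above (field names and thresholds here are hypothetical, not the exact production configuration):

```python
def build_refusal_data(items, usually_correct=0.7, usually_wrong=0.3):
    """items: one dict per seed question with an estimated accuracy and the
    four GPT-4o generations (answer, refusal, bogus question, bogus refusal)."""
    sft, dpo = [], []
    for it in items:
        q, acc = it["question"], it["accuracy"]
        # SFT: answer when the base model is usually right, refuse when usually wrong.
        if acc >= usually_correct:
            sft.append((q, it["answer"]))
        elif acc <= usually_wrong:
            sft.append((q, it["refusal"]))
        # Bogus variants are always paired with a refusal.
        sft.append((it["bogus_question"], it["bogus_refusal"]))
        # DPO: prefer answering when the model sometimes succeeds, and prefer
        # refusing over a sampled wrong answer when it sometimes fails.
        # (Only the first few tokens of each response are kept, per the text above.)
        if acc > 0:
            dpo.append({"prompt": q, "chosen": it["answer"], "rejected": it["refusal"]})
        if acc < 1 and it.get("wrong_answer"):
            dpo.append({"prompt": q, "chosen": it["refusal"], "rejected": it["wrong_answer"]})
    return sft, dpo
```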
To evaluate our progress, we can use SimpleQA [WKC + 24], a dataset mostly composed of obscure
facts from Wikipedia (e.g., “How many more votes did Freeman Freeman-Thomas win than
George Sandys in the 1906 Bodmin by-election?”). Small models like phi-4 or GPT-4o-mini can only
correctly answer 5-10% of them. Our performance can be found in Figure 6.
Note that SimpleQA is included in Table 1 as part of simple-evals, and our model does not have a
good score there. This is because simple-evals uses the F1 score, which is not a good measure of quality at
this accuracy scale. For example, suppose we start with a model that always guesses and is almost always
wrong: 6% correct and 94% incorrect. Some of the 6% correct answers will be from lucky guesses, so
post-training to limit hallucination will yield fewer correct answers; for example, the result might be
(3% correct, 3% incorrect, 94% refusal). In this case, the model scores worse by the F1 metric than the
original (5.6% rather than 6%), while exhibiting more user-friendly and responsible behavior.
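The arithmetic behind this example, assuming the simple-evals convention that precision is accuracy over attempted questions and recall is accuracy over all questions:

```python
def simpleqa_f1(correct, incorrect):
    """correct/incorrect are fractions of the full question set; refusals are the remainder."""
    attempted = correct + incorrect
    precision = correct / attempted
    recall = correct
    return 2 * precision * recall / (precision + recall)

print(simpleqa_f1(0.06, 0.94))  # always guessing: F1 = 0.060
print(simpleqa_f1(0.03, 0.03))  # refusing when unsure: F1 ~ 0.057, the ~5.6% quoted above
```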
A.1.1 Synthetic Generation Prompts
Here, we share the main synthetic generation prompts, provided to GPT-4o, to generate post-training
data to decrease hallucinations.
Generation of bogus questions

Consider the following trivia question:

# Question
{{ question }}

# Instructions
Your job is to turn this problem into a nonsensical one, for which the answer is invalid or unlikely to be known by anyone. For example, you might change the name from a well-known figure to a random name, or change the date from a well-known event to a random date, or the place to a different one. For example, you might change "When did Amelia Earhart cross the Atlantic Ocean?" to "When did Edgar Greenwood cross the Atlantic Ocean?" or "How many times did Amelia Earhart cross the English Channel?".

Your goal is that the new question is *plausibly real*, but impossible to answer. You should not make the question obviously fake, silly, or fictional; for example, all country names should be real countries, and no names should be obvious homages to the original question. It should sound like a serious trivia question.

You may start with a very brief discussion, then end with two markdown sections:
- The section '# Response' that contains the question.
- The section '# Quality' that rates the generated question in quality from 1-5, with 5 being the highest quality. A high quality question is (1) different from the given question and (2) plausible.
Generations of refusals

Consider the following question:

# Question
{{ question }}

# Instructions
You might well know the answer, but imagine that you were an LLM that did not know the correct answer. Write a plausible response to this question that the LLM might give if it did not know the answer and would like to say so rather than guess incorrectly.

This LLM cannot look things up. It could suggest relevant information that it knows; it can also just say that it does not know the answer, if it is unlikely to know anything useful.

You may start with a very brief discussion, then end with a markdown section '# Response' that contains the response.
Answer checking

I am grading solutions to a trivia question. Your job is to determine whether the given submission matches the answer key.

## Original question
{{ question }}

## Submission
{{ response }}

## Answer key
{{ answer.value }}

{% if answer.alternates %}
### Alternative Answer Forms
{% for alt in answer.alternates %}
{{ alt }}
{% endfor %}
{% endif %}

## Instructions
Your job is **NOT to solve the question**. Your job is to determine whether the given submission should be graded as correct *without needing a protest*. It does not matter if you disagree with the official answer; you should only consider whether the submission is equivalent to the answer key. (There is a separate process for protests.)

Give a freeform analysis comparing the submission to the answer key. You should then output a JSON dictionary in the following form:

```json
{
    "matches_key": [Answer 'True', 'False', or 'Uncertain'],
}
```
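Since the grader replies with free-form analysis followed by a fenced JSON block, downstream code only needs the `matches_key` field. A minimal, hypothetical parser (not our production harness):

```python
import json
import re

def parse_grader_verdict(grader_reply: str) -> str:
    """Extract matches_key ('True', 'False', or 'Uncertain') from the grader output."""
    match = re.search(r"```json\s*(\{.*?\})\s*```", grader_reply, re.DOTALL)
    if not match:
        return "Uncertain"                      # malformed output: treat as unusable
    try:
        verdict = json.loads(match.group(1))
    except json.JSONDecodeError:
        return "Uncertain"
    return str(verdict.get("matches_key", "Uncertain"))

reply = 'The submission names the same person as the key.\n```json\n{"matches_key": "True"}\n```'
assert parse_grader_verdict(reply) == "True"
```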
A.2 Judge-guided DPO
For the second round of DPO, we generate responses from GPT-4o, GPT-4t, and our model. To label
responses as positive or negative, we use GPT-4o as a judge with the following prompt.
Your task is to judge which of the following reply given by an AI assistant is better.

# Conversation
{{ chat }}

# Replies
{{ replies }}

# Guideline
Produce your output in the following JSON format (without comments and with correct escape characters):

```json
{
    "faults": {
        "Assistant 1": "(string) List all the problems with the assistant 1 reply. For each problem try to determine whether this is due to lack of comprehension of the relevant material, a logical error, a factual error, a stylistic issue. If the answer is perfect, write none. If the question did not ask for a specific level of detail in the explanation, do not penalize the answer for being too detailed or too concise.",
        "Assistant 2": ...
        ...
    },
    "faults_discussion": "(string) Discuss the general strengths and weaknesses of each assistant. What are the main differences between the answers in terms of style, accuracy and level of detail?",
    "accuracy": {
        "Assistant 1": (1-5) how would you rate assistant 1 in terms of accuracy?,
        ...
    },
    "style": {
        "Assistant 1": (1-5) how would you rate assistant 1 in terms of style?,
        ...
    },
    "detail": {
        "Assistant 1": (1-5) how would you rate assistant 1 in terms of level of detail?,
        ...
    }
}
```
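One way such judge output can be turned into preference pairs is sketched below; equally weighting accuracy, style, and detail is an assumption made for illustration, not a description of our exact labeling rule.

```python
def preference_pair(prompt, replies, judgment):
    """replies: {'Assistant 1': text, ...}; judgment: parsed JSON from the judge prompt above."""
    def total(name):  # assumed aggregation: sum of the three 1-5 ratings
        return sum(judgment[k][name] for k in ("accuracy", "style", "detail"))

    ranked = sorted(replies, key=total, reverse=True)
    best, worst = ranked[0], ranked[-1]
    if total(best) == total(worst):
        return None  # no clear winner; skip this prompt
    return {"prompt": prompt, "chosen": replies[best], "rejected": replies[worst]}
```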
B Data Processing
B.1 Decontamination
We decontaminate against the ARC-Easy, MBPP, phibench, CommonsenseQA, WinoGrande, mcphi,
MedQA, MATH, AGIEval, PIQA, OpenBookQA, HellaSwag, GPQA, mt-bench, MMLUPro, GSM8k,
HumanEval, arena hard, ARC-Challenge, and MMLU benchmarks. We apply a hybrid n-gram algorithm
for decontamination, which uses 13-gram and 7-gram features to remove matches to the test sets and is
described in more detail in Algorithm 1. We also create a set of 13-grams that are common in Wikipedia
and the training set and avoid removing matches to them, since these are ubiquitous common phrases
rather than genuine contamination. Some examples include
’a i only b ii only c iii only d ii and iii’, ’a true true b false false c true false d false true’, ’logically
equivalent b contradictory c neither logically equivalent nor contradictory but consistent d’, ’a (ii) and
(iv) only b (i) and (iii) only c (i) (ii)’, ’b e b a b e c c b d c e d’.
Algorithm 1 Decontamination Algorithm

Require: training text train; benchmark texts tests; allowed 13-grams allowed_13gram;
         thresholds info_7gram_threshold and contaminated_7gram_threshold
Ensure: a result object with contamination details

procedure CheckContamination(train)
    Step 1: Extract 13-grams and check contamination
        features_13 ← ExtractNGrams(train, 13)
        for all feature ∈ features_13 do
            if feature ∈ BenchmarkFeatures_13 and feature ∉ allowed_13gram then
                return Contaminated (13-gram match)
    Step 2: Extract 7-grams and compute overlaps
        features_7 ← ExtractNGrams(train, 7)
        overlap_counts ← CountOverlaps(features_7, BenchmarkFeatures_7)
    Step 3: Compute the overlap ratio for each benchmark test
        max_ratio ← 0, max_test ← None
        for all test ∈ BenchmarkTests do
            ratio ← overlap_counts[test] / min(len(features_7), len(BenchmarkFeatures_7[test]))
            if ratio > max_ratio then
                max_ratio ← ratio, max_test ← test
    Step 4: Determine contamination
        if max_ratio > info_7gram_threshold then
            if max_ratio ≥ contaminated_7gram_threshold then
                return Contaminated (7-gram match)
            else
                return Partial Contamination (7-gram info match)
        return Clean (no significant overlap)
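A simplified Python rendering of this hybrid check; the word-level n-gram extraction and the two thresholds below are illustrative stand-ins rather than the production values.

```python
def ngrams(text, n):
    """Word-level n-grams of a text, lower-cased (illustrative tokenization)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def check_contamination(train_text, benchmark_13grams, benchmark_7grams,
                        allowed_13grams, info_thr=0.2, contaminated_thr=0.35):
    """benchmark_13grams: set over all test sets; benchmark_7grams: dict test_id -> set."""
    # Step 1: any disallowed 13-gram shared with a benchmark is an immediate hit.
    if (ngrams(train_text, 13) & benchmark_13grams) - allowed_13grams:
        return "contaminated (13-gram match)"

    # Steps 2-3: highest 7-gram overlap ratio against any benchmark item.
    feats7 = ngrams(train_text, 7)
    max_ratio = 0.0
    for test_feats in benchmark_7grams.values():
        if feats7 and test_feats:
            ratio = len(feats7 & test_feats) / min(len(feats7), len(test_feats))
            max_ratio = max(max_ratio, ratio)

    # Step 4: classify using the two thresholds.
    if max_ratio >= contaminated_thr:
        return "contaminated (7-gram match)"
    if max_ratio > info_thr:
        return "partial contamination (7-gram info match)"
    return "clean"
```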
contaminated: True
Train: There are some oarsmen in a boat. The average weight is increased by 1.8 kg
when one of the crew, who weighs 53 kg, is replaced by a new man who weighs 71 kg. How
many oarsmen are there in the boat?
Train Dataset: orca-math-word-problems-200k
13gram test: The average weight of 10 oarsmen in a boat is increased by 1.8 kg when
one of the crew, who weighs 53 kg is replaced by a new man. Find the weight of the new
man.A. 71 B.62 C.43 D.67 E.40
Test Dataset: AGIEval
13gram segment: one of the crew who weighs 53 kg is replaced by a new
13gram contaminated: True
7gram test: The average weight of 10 oarsmen in a boat is increased by 1.8 kg when one
of the crew, who weighs 53 kg is replaced by a new man. Find the weight of the new
man.A. 71 B. 62 C. 43 D. 67 E. 40
7gram overlaps: [’1 8 kg when one of the’, ’of the crew who weighs 53 kg’, ’the
crew who weighs 53 kg is’, ’increased by 1 8 kg when one’, ’crew who weighs 53 kg is
replaced’, ’weighs 53 kg is replaced by a’, ’8 kg when one of the crew’, ’is increased
by 1 8 kg when’, ’kg when one of the crew who’, ’53 kg is replaced by a new’, ’when one
of the crew who weighs’, ’by 1 8 kg when one of’, ’kg is replaced by a new man’, ’one
of the crew who weighs 53’, ’who weighs 53 kg is replaced by’]
7gram ratio: 0.39473684210526316
C AMC Evaluation Details
In this section, we fully describe our inference and grading schemes used to obtain the November
2024 AMC scores displayed in Figure 1. The 78 questions in these contests (4 sets of 25 questions, with
overlaps between the 10A/12A and 10B/12B exams) were made available on or after November 6, 2024.
All external models we tested were published before this date, as were the datasets for all stages of phi-
4’s training. Thus, these contests are our best attempt at conducting a completely contamination-proof
evaluation of mathematical reasoning capabilities. We only benchmarked on this dataset after choosing
the hyperparameters used in post-training our final candidate models, making this dataset completely
independent of our final model. 8
We obtained the questions from the Art of Problem Solving Wiki 9 , and formatted them with the
following template:
The following question is from a 25-question, multiple choice test. Each question is followed by answers marked A, B, C, D, and E. Only one of these is correct.

SCORING: You will receive 6 points for each correct answer, 1.5 points for each problem left unanswered, and 0 points for each incorrect answer.

Solve the question step by step, then answer \boxed{A}, \boxed{B}, \boxed{C}, \boxed{D}, \boxed{E}, or \boxed{blank}.
# Question
{{question}}

(A) {{option_a}}
(B) {{option_b}}
(C) {{option_c}}
(D) {{option_d}}
(E) {{option_e}}

Footnote 8: For full disclosure, we evaluated our final three candidate models on this dataset and all three average scores exceeded 89. We settled on our final model based on other factors, before measuring its score but after seeing the scores for the other two candidates.
Footnote 9: https://artofproblemsolving.com/wiki/index.php/2024_AMC_10A (10B, 12A, 12B)
With each question formatted this way, we obtained 10 independent generations at temperature 0.5
from each model we tested. We then followed the grading scheme described in the above prompt. We
found that every model we tested (including our own) frequently failed to follow the “box your final
answer” instruction, particularly after a long chain of thought. To stabilize the evaluations, we decided
to count otherwise correct solutions (e.g. boxing the correct numerical expression) as correct. To do
this, we prompted GPT-4o to extract a final answer (A/B/C/D/E or none) from each model’s solution,
with temperature 1.
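For completeness, the contest scoring rule stated in the prompt (6 points per correct answer, 1.5 per blank, 0 per incorrect) amounts to the following small helper, assuming extracted answers are 'A'-'E' or None:

```python
def amc_score(extracted, answer_key):
    """extracted: question index -> 'A'..'E' or None (blank); answer_key: index -> 'A'..'E'."""
    score = 0.0
    for i, correct in answer_key.items():
        guess = extracted.get(i)
        if guess is None:
            score += 1.5      # unanswered
        elif guess == correct:
            score += 6.0      # correct; incorrect answers add nothing
    return score

# Example: 20 correct, 3 blank, 2 wrong on a 25-question contest -> 124.5 of 150.
```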
D Synthetic generation examples
D.1 Generation examples
We review a few examples of what our synthetic datasets look like, to give a general flavor of some of
our techniques.
D.1.1 Extracting and Tagging Excerpts from Content
To construct a dataset focused on reasoning and complexity, we extract excerpts from sources such as
web pages, books, and scientific articles. Each excerpt is annotated with metadata such as complexity
level, factual obscurity, and the presence of reasoning chains. These annotations help identify high-value
content for synthetic data generation. Below, we provide an example of metadata we generated for
scientific content:
Start Line: 8, End Line: 11
Information Type: Empirical Result Discussion
Brief Description: Discussion of the role of SNPs in AD development and severity.
Complexity Level: B
Factual Obscurity: C
Chain of Reasoning: True
Reasoning Steps:
1. Assumption: Specific SNPs influence immune response.
Conclusion: These SNPs may affect AD severity by modulating IgE levels.
Description: The study links genetic variations to clinical phenotypes.
2. Assumption: SNPs associated with AD are replicated across cohorts.
Conclusion: Replication supports their role as potential genetic markers.
Description: Statistical analysis confirms the robustness of the findings.
Dependencies: Previous GWAS findings, statistical methods (lines 10-15)
The raw content will later be filtered based on the metadata, and both will be used as seeds for a variety
of synthetic generation pipelines.
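As an illustration, seed selection on these annotations might look like the sketch below; the field names mirror the example above, while the letter-grade thresholds are hypothetical.

```python
# One annotated excerpt in the style of the example above (values illustrative).
annotated_excerpts = [{
    "complexity_level": "B",
    "factual_obscurity": "C",
    "chain_of_reasoning": True,
    "reasoning_steps": ["SNPs influence immune response", "replication supports genetic markers"],
}]

def is_high_value(excerpt):
    """Keep excerpts that are complex, not hopelessly obscure, and carry a reasoning chain."""
    return (excerpt["complexity_level"] in {"A", "B"}          # hypothetical: A/B = most complex
            and excerpt["factual_obscurity"] != "E"            # hypothetical: E = most obscure
            and excerpt["chain_of_reasoning"]
            and len(excerpt["reasoning_steps"]) >= 2)

seeds = [e for e in annotated_excerpts if is_high_value(e)]    # -> keeps the example excerpt
```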
D.1.2 Self-revision
Another component of our synthetic data pipeline is the iterative self-revision process. This process
involves multiple stages where the model generates, critiques, and revises its output to meet specific
criteria. Below, we give an example of this workflow, focusing on the creation of a reading comprehension
exercise based on a scientific excerpt:
**Excerpt:**
"Future studies should replicate our findings for OXTR and BDNF, but also include additional stress-related candidate genes. [...] could reverse aberrant DNA methylation-could become an important goal in the development of new treatment approaches."

**Revision 0:**
*Plan:* To create an advanced reading comprehension exercise:
- Incorporate technical terms from genetics and epigenetics.
- Add nested counterfactuals and conditional reasoning.
- Require logical reasoning and background knowledge.

*Exercise:* Researchers focused on DNA methylation following psychosocial stress. What conclusion can be drawn from studies examining both aversive and positive experiences, diversified cohorts, and refined sampling intervals?
1. A monotonic relationship between stress and methylation is confirmed.
2. Differential methylation patterns reveal bidirectional sensitivity (correct).
3. Early-life adversities remain the primary influence.
4. Positive experiences have negligible effects.

**Requirements (Flavored):**
- Exercises must require both understanding the text and external advanced knowledge.
- Incorrect choices should be plausible enough to mislead without full comprehension.
- The content should challenge advanced undergraduates in genetics or epigenetics.

**Critique (Agent):**
- **External Knowledge:** Does not explicitly require concepts beyond the excerpt. *Score: 1*
- **Plausibility of Choices:** Incorrect options are plausible but not misleading enough. *Score: 2*
- **Suggestions:** Introduce external concepts like epigenetic drift or the diathesis-stress model, and refine incorrect choices to address common misconceptions.

**Revision 1:**
*Plan:* Add references to the hypothalamic-pituitary-adrenal (HPA) axis and cortisol's role in stress responses, integrating advanced neuroendocrinology knowledge.

*Exercise:* Considering DNA methylation and the HPA axis's role, how could cortisol influence classical and non-classical epigenetic changes?
1. Cortisol is irrelevant to the modifiers discussed.
2. Cortisol effects are linear and align with classical models.
3. The dynamic epigenetic role of cortisol enriches research paradigms (correct).
4. Cortisol’s role is limited to downregulation of methylation.

**Critique (Agent):**
- **Challenge Level:** Still insufficiently difficult for advanced undergraduates. *Score: 1*
- **Suggestions:** Add nuanced alternatives based on theories like eustress vs. distress or glucocorticoid response elements.

**Revision 2:**
*Plan:* Refine incorrect options and add concepts like glucocorticoid response elements to deepen the challenge. Reframe exercise to compare classical and non-classical pathways in epigenetics.

---
D.1.3 Creating Conversations from Snippets
Another component of our synthetic data pipeline involves transforming factual snippets into multi-turn,
nuanced conversations. This process is designed to produce dialogues that are contextually accurate and
intellectually engaging, encouraging critical thinking and deeper reasoning. The process of generating
conversations includes several iterative steps:
1. Initial Turn: Based on the snippet, the model generates a foundational response that presents
core information in a clear and direct manner. This step may also include an optional curation of
a participant’s profile that guides the tone and style of dialogue throughout the generation.
2. Subsequent Turns: The dialogue grows through a dynamic flow of agents generating alternating
turns, directing the conversation, summarizing earlier context and injecting new scenarios and
topics consistent with earlier messages.
3. Self-Revision: After each turn, the model evaluates its responses and brainstorms the best next
course of action given the current context, in order to maximize the complexity and nuance of the
discussion. This includes suggestions for exploring alternative perspectives or creating connections
to broader contexts. A schematic sketch of this loop is given below.
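The schematic below uses a generic `llm` callable standing in for the generation, critique, and revision agents; the helper prompts are invented for illustration.

```python
def conversation_from_snippet(snippet, llm, num_turns=6):
    """Grow a multi-turn dialogue seeded by a factual snippet (schematic only)."""
    profile = llm(f"Invent a participant profile suited to discussing:\n{snippet}")
    turns = [llm(f"Write a clear, direct opening answer grounded in:\n{snippet}")]
    for _ in range(num_turns):
        draft = llm(
            "Continue the conversation with a new question or scenario consistent "
            f"with the profile and earlier turns.\nProfile: {profile}\nTurns: {turns}"
        )
        critique = llm(f"Critique this turn for depth, nuance, and consistency:\n{draft}")
        turns.append(llm(f"Revise the turn to address the critique.\nTurn: {draft}\nCritique: {critique}"))
    return turns
```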
D.2 Fill-in-the-middle exercises
Many data generation techniques require pairs of contexts and ground-truth answers for seeding and
rely on rejection sampling during data generation. One approach we developed creates fill-in-the-middle
exercises from free-form text that does not naturally translate into such pairs. This involves removing
a meaningful portion of the text (the “middle”), framing the remaining text as the context, and using
the removed content as the ground-truth answer. The key challenge lies in identifying portions of the
text where reasoning can reconstruct the removed content, ensuring a “correct” answer exists. For
instance, we generate such pairs from public code snippets, turning them into exercises and producing
their solutions (augmented with chain-of-thought tokens) via rejection sampling. We give an example of
such a transcript below.
**Fill-in-the-Middle Exercise**

**Context (Excerpt from Code Snippet):**

for p in [path + [n] for n in graph[node]]:
    smalls = collections.defaultdict(lambda: 0)
    for n in path:
        if all_c['start'] > 1 or all_c['end'] > 1:
            continue
        for c in smalls.values():
            if c <= 1:
                continue
            <<THIS PART DELETED>>
    if not bad:
        paths.append(p)

**Solution:**

**Background:**
This snippet processes paths in a graph, tracking visits to nodes, particularly "small" nodes identified by lowercase strings. Key conditions:
- Start and end nodes cannot be revisited.
- "Small" nodes may have specific limits on revisits.

**Reasoning:**
The missing part handles violations of small node visit limits using two variables:
- `cused` indicates if one small node has already been revisited.
- `bad` marks the path as invalid if a second such revisit occurs.

The loop iterating over `smalls.values()` likely checks whether a small node is revisited for the first time (allowing it by setting `cused` to `True`) or violates the limit (setting `bad` to `True` and breaking the loop).

**Final Guess:**
The missing part is likely:

if c > 1:
    if not cused:
        cused = True
    else:
        bad = True
        break

**Comparison to Correct Answer:**
- The solution captures the general logic but omits the specific condition `elif c == 2 and not cused`, making it broader.
- **Correctness Discussion:** The logic is close but lacks precision, as the revisits must explicitly check for exactly two visits.

**Grading and Feedback:**
- **Grade:** 3 (partial correctness).
- **Hint:** Consider revisiting the specific condition for revisits. Focus on the exact count of visits that should trigger allowances.
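To give a sense of how such exercises can be cut from raw code, here is a minimal sketch; the real pipeline additionally filters for holes that reasoning can plausibly reconstruct and keeps only chain-of-thought solutions that survive rejection sampling against the held-out middle.

```python
import random

def make_fill_in_the_middle(code: str, min_hole: int = 2, max_hole: int = 6):
    """Remove a contiguous block of lines; return (context with placeholder, ground-truth middle)."""
    lines = code.splitlines()
    if len(lines) < min_hole + 2:
        raise ValueError("snippet too short to carve a hole")
    hole = random.randint(min_hole, min(max_hole, len(lines) - 2))
    start = random.randint(1, len(lines) - hole - 1)       # keep some prefix and suffix
    middle = "\n".join(lines[start:start + hole])
    context = "\n".join(lines[:start] + ["<<THIS PART DELETED>>"] + lines[start + hole:])
    return context, middle
```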
D.3 Agents
Agent tasks that involve interactions with longer horizons may present unique challenges such as plan-
ning, reflection, and error correction. Our training data consists of trajectories collected from Agen-
tKit [WFM + 24] with detailed chain-of-thought. Specifically, we generate high-quality trajectories by
running AgentKit on a wide range of environments, and rewrite the guided raw reasoning outputs from
AgentKit into a variety of self-contained statements or thoughts that aim to capture the gist of Agen-
tKit reasoning. Training with AgentKit data improves planning, reasoning, tool use, math, and error
correction on our internal benchmark.
In agent tasks, phi-4 follows a characteristic chain-of-thought with explicit planning and reflections.
We show an example of phi-4 interacting with the Tower of Hanoi below:
Analyzing the current situation and reflecting on past actions is crucial for determining the next best step. Here's a breakdown of the thought process:
### Analysis of Current and Past Environment
1. **Current Configuration:**
- Rod A: |bottom, [0], top|
- Rod B: |bottom, [1], top|
- Rod C: |bottom, [2], top|
2. **Past Performance:**
- There have been repeated unsuccessful attempts to move disks, particularly from rod B to C and A to C. These failures were due to rule violations, such as attempting to place a larger disk on a smaller one.
3. **Environmental Patterns and Rules:**
- Only one disk can be moved at a time.
- A larger disk cannot be placed on a smaller disk.
- The goal is to move all disks from rod A to rod C.
### Reflection on Past Performance
- **Mistakes Identified:**
- Repeated attempts to make illegal moves indicate a misunderstanding or oversight of the rules.
- The plan was not effectively followed, as seen by the repeated failed attempts to move disks from rod B to C and A to C.
- **Correct Moves:**
- Successfully moved disk 0 from C to A, which was a necessary step to free rod C for subsequent moves.
### Relevant Considerations
- **Current Objective:**
- The immediate goal is to move disk 1 from rod B to rod C. This move is legal and aligns with the rules since disk 1 is smaller than disk 2 on rod C.
- **Strategic Alignment:**
- Moving disk 1 from B to C is a strategic step that aligns with the overall plan to eventually stack all disks on rod C in the correct order.
### Additional Thoughts
- Ensuring compliance with the rules is paramount to avoid repeated failures.
- Each move should be evaluated for legality before execution to maintain progress towards the goal.
### Updated Plan
1. **Move Disk 1 from B to C:**
- Completion Quantification: Disk 1 is on rod C.
- Certainty: Known. This move is legal and necessary to progress towards the goal configuration.
2. **Move Disk 0 from A to C:**
- Completion Quantification: Disk 0 is on rod C, on top of disk 1.
- Certainty: Known. This move is legal once disk 1 is on rod C.
3. **Move Disk 2 from C to B:**
- Completion Quantification: Disk 2 is on rod B.
- Certainty: Known. This move is necessary to free rod C for the final sequence.
### Chosen Action
Based on the analysis and the updated plan, the next logical and legal action is to move disk 1 from rod B to rod C. This move adheres to the rules and progresses towards the goal configuration.
<updated plan>
1. **Move Disk 1 from B to C**:
- Completion Quantification: Disk 1 is on rod C.
- Certainty: Known. This move is legal and necessary to progress towards the goal configuration.
2. **Move Disk 0 from A to C**:
- Completion Quantification: Disk 0 is on rod C, on top of disk 1.
- Certainty: Known. This move is legal once disk 1 is on rod C.
3. **Move Disk 2 from C to B**:
- Completion Quantification: Disk 2 is on rod B.
- Certainty: Known. This move is necessary to free rod C for the final sequence.
</updated plan>
<action>Move the top disk of rod B to the top of rod C</action>