LongCat-Flash Technical Report

Meituan LongCat Team
longcat-team@meituan.com

Abstract

We introduce LongCat-Flash, a 560-billion-parameter Mixture-of-Experts (MoE) language model designed for both computational efficiency and advanced agentic capabilities. Stemming from the need for scalable efficiency, LongCat-Flash adopts two novel designs: (a) Zero-computation Experts, which enable dynamic computational budget allocation, activating 18.6B-31.3B parameters (27B on average) per token depending on contextual demands and thereby optimizing resource usage; (b) Shortcut-connected MoE, which enlarges the computation-communication overlap window, yielding notable gains in inference efficiency and throughput compared to models of a comparable scale. We develop a comprehensive scaling framework for large models that combines hyperparameter transfer, model-growth initialization, a multi-pronged stability suite, and deterministic computation to achieve stable and reproducible training. Notably, leveraging the synergy between scalable architectural design and infrastructure efforts, we complete model training on more than 20 trillion tokens within 30 days, while achieving over 100 tokens per second (TPS) for inference at a cost of $0.70 per million output tokens. To cultivate LongCat-Flash towards agentic intelligence, we conduct large-scale pre-training on optimized mixtures, followed by targeted mid- and post-training on reasoning, code, and instructions, with further augmentation from synthetic data and tool-use tasks. Comprehensive evaluations demonstrate that, as a non-thinking foundation model, LongCat-Flash delivers highly competitive performance among leading models, with exceptional strengths in agentic tasks. The model checkpoint of LongCat-Flash is open-sourced to foster community research.

LongCat Chat: https://longcat.ai
Hugging Face: https://huggingface.co/meituan-longcat
GitHub: https://github.com/meituan-longcat

Figure 1: Benchmark performance of LongCat-Flash.
Contents

1 Introduction
2 Architecture
  2.1 Zero-Computation Experts
    2.1.1 Computational Budget Control
    2.1.2 Load Balance Control
  2.2 Shortcut-Connected MoE
  2.3 Variance Alignment Design for Scalability
    2.3.1 Scale-Correction for MLA
    2.3.2 Variance Compensation for Experts Initialization
  2.4 Model Information
3 Pre-Training
  3.1 Training Strategy
    3.1.1 Hyperparameter Transfer
    3.1.2 Model Growth Initialization
    3.1.3 Training Stability
  3.2 General Pre-Training
  3.3 Reasoning and Coding Enhancement
  3.4 Long Context Extension
  3.5 Decontamination
  3.6 Evaluation
    3.6.1 Evaluation Benchmarks and Configurations
    3.6.2 Evaluation Results
4 Post-Training
  4.1 Reasoning and Coding
  4.2 Agentic Tool Use
  4.3 General Capability
  4.4 Evaluation
    4.4.1 Evaluation Benchmarks and Configurations
    4.4.2 Evaluation Results
5 Training Infrastructures
  5.1 Numerical Precision Control and Fault Detection
  5.2 Kernel Optimization for Determinism and Performance
  5.3 Distributed Strategy for Large-scale Training
  5.4 Reliability and Observability
6 Inference and Deployment
  6.1 Model-Specific Inference Optimization
    6.1.1 Computation and Communication Orchestration
    6.1.2 Speculative Decoding
    6.1.3 Reducing KV Cache
  6.2 System-Wide Inference Techniques
    6.2.1 Minimize Schedule Overhead
    6.2.2 Custom Kernel
    6.2.3 Quantization
  6.3 Deployment and Performance
    6.3.1 Measured Performance
    6.3.2 Theoretical Performance
7 Conclusion
8 Contributions
A Appendix
  A.1 Statistics and Case Studies of Dynamic Routing
1 Introduction

The rapid advancement of large language models (LLMs) such as DeepSeek-V3 [DeepSeek-AI et al., 2025], Qwen 3 [Yang et al., 2025], and Kimi-K2 [Team et al., 2025] has demonstrated the effectiveness of scaling model size and computational resources. While some recent progress raises concerns about potential scaling slowdowns, we believe that algorithmic design, underlying system optimizations, and data strategy all play equally critical roles in further pushing the frontier of scalable intelligence. This requires innovations in both model architecture and training strategies to improve the cost-effectiveness of scaling, as well as a systematic data strategy to enhance the model's capability for solving real-world tasks.

In this work, we introduce LongCat-Flash, an efficient yet powerful Mixture-of-Experts (MoE) language model designed to advance the frontier of language models along two synergistic directions: computational efficiency and agentic capability. Trained on tens of thousands of accelerators, LongCat-Flash combines architectural innovations with a sophisticated, multi-stage training methodology for scalable and intelligent models. Our contributions span both efficiency and agentic intelligence:

• Scalable Architectural Design for Computational Efficiency. LongCat-Flash is designed and optimized under two key principles: efficient computation utilization, as well as efficient training and inference. Specifically, (1) as not all tokens are equal, we introduce the zero-computation experts mechanism in MoE blocks to allocate a dynamic computation budget to important tokens based on their significance, i.e., activating 18.6 to 31.3 billion parameters (out of 560 billion total) depending on contextual demands. To ensure a consistent computation load, we employ an expert bias adjusted by a PID controller, maintaining an average of ~27 billion activated parameters per token. (2) As communication overhead becomes a bottleneck during MoE model scaling, we incorporate the Shortcut-connected MoE (ScMoE) [Cai et al., 2024] design to expand the computation-communication overlap window. Combined with customized infrastructure optimizations, this design enables training at a massive scale of tens of thousands of accelerators and inference with high throughput and low latency.

• Effective Model Scaling Strategy. Effectively and efficiently scaling model size remains a key challenge in strategy design. To this end, we develop a comprehensive stability-and-scaling framework for robustly training large-scale models: (1) We successfully apply a hyperparameter transfer strategy to such a large model, predicting optimal hyperparameter configurations by leveraging results from smaller proxy models with theoretical guarantees. (2) We initialize the model using a model-growth mechanism based on a refined half-scale checkpoint, achieving improved performance compared to conventional initialization methods. (3) A multi-pronged stability suite incorporates principled router-gradient balancing, a hidden z-loss to suppress massive activations, and fine-tuned optimizer configurations. (4) To enhance the reliability of large-scale cluster training, we introduce deterministic computation. This guarantees the exact reproducibility of experiments and enables the detection of Silent Data Corruption (SDC) during the training process. These interventions ensure that LongCat-Flash's training remains stable, with no irrecoverable loss spikes.
• Multi-Stage Training Pipeline for Agentic Capability. Through a meticulously designed pipeline, LongCat-Flash is endowed with advanced agentic behaviors. Initial efforts focus on constructing a more suitable base model for agentic post-training, where we design a two-stage pre-training data fusion strategy to concentrate reasoning-intensive domain data. During mid-training, we enhance reasoning and coding capabilities while extending the context length to 128k to meet agentic post-training requirements. Building on this advanced base model, we proceed with multi-stage post-training. Recognizing the scarcity of high-quality, high-difficulty training problems for agentic tasks, we design a multi-agent synthesis framework that defines task difficulty across three axes (information processing, tool-set complexity, and user interaction), using specialized controllers to generate complex tasks requiring iterative reasoning and environmental interaction.

Overall, benefiting from the synergy among scalable architectural design, training strategies, and infrastructure efforts, LongCat-Flash achieves both high training throughput and low inference latency. Notably, we complete the pre-training of our 560B model on over 20T tokens within 30 days and achieve 98.48% time availability without manual intervention for fault resolution. During inference, large-scale deployment efficiency exceeds 100 tokens per second (TPS) on H800, at a cost of $0.7 per million output tokens, demonstrating remarkable performance compared to models of similar size.

We evaluate the base and instruction-tuned versions of LongCat-Flash across diverse benchmarks, with an overview summarized in Figure 1. As a non-thinking model, LongCat-Flash achieves performance comparable to state-of-the-art non-thinking models, including DeepSeek-V3.1 [DeepSeek-AI et al., 2025] and Kimi-K2 [Team et al., 2025], while using fewer parameters and offering faster inference speed. Specifically, LongCat-Flash scores 86.5 on ArenaHard-V2, 39.5 on TerminalBench, and 67.7 on τ²-Bench, demonstrating robust capabilities in general domains, coding, and agentic tool use. To mitigate potential contamination from existing open-source benchmarks and enhance evaluation confidence,
we meticulously constructed two new benchmarks: Meeseeks [Wang et al., 2025a] and VitaBench. Meeseeks simulates realistic human-LLM interactions through an iterative feedback framework to evaluate multi-turn instruction-following ability, where LongCat-Flash achieves scores on par with frontier LLMs. VitaBench leverages real-world business scenarios to assess models' proficiency in addressing complex real-world tasks, where LongCat-Flash delivers superior performance compared to other LLMs.

In the remainder of this report, we first detail the architecture and innovations in LongCat-Flash. Then, we describe the pre-training and post-training processes, including our training strategies, data construction methods, and evaluation results. Finally, we discuss the challenges and solutions in training LongCat-Flash, along with optimized inference and deployment methods that leverage its unique architecture.

Figure 2: The architecture adopted in LongCat-Flash. Each layer employs Shortcut-connected Mixture of Experts (ScMoE) with zero-computation experts. ScMoE significantly expands the computation-communication window to boost training and inference efficiency. The zero-computation experts enable dynamic computation based on contextual importance, improving the efficiency of computational resource utilization.

2 Architecture

LongCat-Flash adopts a novel MoE architecture with two key innovations (Figure 2): (1) The MoE block incorporates zero-computation experts [Jin et al., 2024] to enable dynamic computation, allowing tokens to consume variable computational resources based on their contextual significance. Furthermore, the average computational load is regulated through an adaptive expert bias. (2) Each layer integrates two Multi-head Latent Attention (MLA) blocks [Liu et al., 2024a] and multiple heterogeneous Feed-Forward Network (FFN) blocks. A shortcut connection from the first MLA output directly to the MoE block [Cai et al., 2024] is employed. To further enhance performance, we refine both the MLA and fine-grained FFN experts via variance alignment. The following subsections detail each of these components.

2.1 Zero-Computation Experts

Next-token prediction exhibits inherent computational heterogeneity. Difficult tokens may demand more resources for accurate prediction, while easy tokens require negligible computation. This phenomenon is also empirically evidenced by speculative decoding, where small draft models reliably predict the outputs of large models for most easy tokens [Leviathan et al., 2023]. Motivated by this, LongCat-Flash presents a dynamic computational resource allocation mechanism that activates a variable number of FFN experts per token through zero-computation experts [Jin et al., 2024, Zeng et al., 2024], enabling a more reasonable allocation of computation according to contextual significance. Specifically, LongCat-Flash expands its expert pool with Z zero-computation experts in addition to N standard FFN experts.
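To make the mechanism concrete, the toy PyTorch sketch below adds identity "zero-computation" experts to a standard top-k MoE block: tokens routed to a zero-computation expert simply pass their hidden state through, so the amount of FFN compute spent per token depends on the router's decisions. The class, its hyperparameters, and the dense loop over experts are illustrative only and are not taken from the LongCat-Flash codebase.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ZeroComputationMoE(nn.Module):
    """Toy MoE block with N standard FFN experts plus Z zero-computation
    (identity) experts. Tokens routed to a zero-computation expert skip the
    FFN entirely, so per-token compute varies with contextual importance."""

    def __init__(self, d_model, d_ff, n_experts, n_zero_experts, top_k):
        super().__init__()
        self.n_experts, self.n_zero, self.top_k = n_experts, n_zero_experts, top_k
        # The router scores all N + Z experts jointly.
        self.router = nn.Linear(d_model, n_experts + n_zero_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                          # x: [tokens, d_model]
        scores = F.softmax(self.router(x), dim=-1)
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx, w = topk_idx[:, slot], topk_scores[:, slot:slot + 1]
            for e in range(self.n_experts):        # standard experts: real FFN compute
                mask = idx == e
                if mask.any():
                    out[mask] += w[mask] * self.experts[e](x[mask])
            zero_mask = idx >= self.n_experts      # zero-computation experts: identity
            out[zero_mask] += w[zero_mask] * x[zero_mask]
        return out
```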
2.2 Shortcut-Connected MoE

models [Rajbhandari et al., 2022, Liu et al., 2024a]. However, the efficiency of large-scale MoE models is largely constrained by communication overhead. In the conventional execution paradigm, expert parallelism imposes a sequential workflow: a collective communication operation must first route tokens to their designated experts before computation can begin. This communication latency becomes a bottleneck, leading to device underutilization and limiting overall system throughput. While shared-expert architectures attempt to mitigate this by overlapping communication with a single expert's computation, their efficiency is limited by the small computational window of that one expert.

We overcome this limitation by employing the Shortcut-connected MoE (ScMoE) architecture [Cai et al., 2024]. ScMoE introduces a cross-layer shortcut that reorders the execution pipeline. This key innovation allows the dense FFN from the preceding block to execute in parallel with the dispatch/combine communication of the current MoE layer, creating a more substantial overlap window than shared-expert designs. Furthermore, this architectural design choice is supported by the following key findings.

First, the ScMoE structure does not compromise model quality. As shown in Figure 4, the training loss curves of our architecture and the baseline without ScMoE are nearly identical, confirming that this reordered execution does not impair model performance. Consistent results are observed across multiple settings, including a 2.4B-16B MoE model with MLA, a 3B-20B model with MHA [Vaswani et al., 2017], and 15B-193B models with GQA [Ainslie et al., 2023]. Importantly, these findings demonstrate that the stability and benefits of ScMoE are orthogonal to the choice of attention mechanism.

Second, the ScMoE architecture delivers substantial system-level efficiency gains for both training and inference. For large-scale training, the expanded overlap window allows the computation of the preceding block to be fully parallelized with the dispatch and combine communication phases of the MoE layer, achieved by partitioning operations into fine-grained chunks along the token dimension. For efficient inference, ScMoE enables a Single Batch Overlap pipeline, reducing the theoretical Time-Per-Output-Token (TPOT) by nearly 50% compared to leading models such as DeepSeek-V3. Moreover, it allows for the concurrent execution of distinct communication patterns: intra-node Tensor Parallelism communication (via NVLink) on the dense FFN can be fully overlapped with inter-node Expert Parallelism communication (via RDMA), thereby maximizing total network utilization.

In summary, ScMoE delivers substantial performance gains without sacrificing model quality. These efficiency gains are not achieved through trade-offs but are the direct outcome of a rigorously validated, quality-neutral architectural innovation.
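The reordered execution can be illustrated schematically with PyTorch's asynchronous collectives: the token dispatch for the MoE experts is launched as a non-blocking all-to-all, the preceding block's dense FFN runs while that communication is in flight, and the expert computation only waits on the handle afterwards. This is a sketch of the idea, not the production kernel pipeline; the helper callables (`route_fn`, `combine_fn`, `experts`) and equal-split all-to-all buffers are simplifying assumptions of ours.

```python
import torch
import torch.distributed as dist

def scmoe_layer_step(hidden, dense_ffn, experts, route_fn, combine_fn):
    """Schematic single-batch-overlap step for a Shortcut-connected MoE layer.

    route_fn   : packs tokens into a send buffer ordered by destination expert rank
    combine_fn : un-permutes expert outputs back to the original token order
    """
    send_buf, routing_meta = route_fn(hidden)          # decide expert assignment
    recv_buf = torch.empty_like(send_buf)

    # 1) Launch the expert-parallel dispatch without blocking the compute stream.
    dispatch = dist.all_to_all_single(recv_buf, send_buf, async_op=True)

    # 2) The dense FFN of the preceding block runs while dispatch is in flight;
    #    this is the enlarged computation-communication overlap window.
    dense_out = dense_ffn(hidden)

    # 3) Only now are the routed tokens needed: wait, run experts, send back.
    dispatch.wait()
    expert_out = experts(recv_buf, routing_meta)
    send_back = torch.empty_like(expert_out)
    dist.all_to_all_single(send_back, expert_out)      # combine phase

    return dense_out + combine_fn(send_back, routing_meta)
```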
2.3 Variance Alignment Design for Scalability

Architectural designs that excel at small scales may become suboptimal as models are scaled up, and vice versa, rendering initial design choices unreliable. Through extensive experimentation and theoretical analysis, we identify variance misalignment in specific modules as a key factor contributing to this discrepancy, which can lead to instability and degraded performance during scaling. To address this challenge, we propose variance alignment techniques for both the MLA and MoE blocks.

2.3.1 Scale-Correction for MLA

LongCat-Flash employs a modified Multi-head Latent Attention (MLA) mechanism [Liu et al., 2024a], which incorporates scale-correction factors α_q and α_kv to address the variance imbalances inherent in asymmetric low-rank factorization. Our full mathematical formulation, which integrates these correction factors, is given as follows:

$$
\begin{aligned}
c^{Q}_{t} &= \alpha_{q}\, W^{DQ} h_t \in \mathbb{R}^{d_q}, &\qquad
c^{KV}_{t} &= \alpha_{kv}\, W^{DKV} h_t \in \mathbb{R}^{d_{kv}}, \\
q^{C}_{t,i} &= W^{UQ} c^{Q}_{t}, &\qquad
k^{C}_{t,i} &= W^{UK} c^{KV}_{t}, \qquad v_{t,i} = W^{UV} c^{KV}_{t}, \\
q^{R}_{t,i} &= \mathrm{RoPE}\!\left(W^{QR} c^{Q}_{t}\right), &\qquad
k^{R}_{t} &= \mathrm{RoPE}\!\left(W^{KR} h_t\right), \\
q_{t,i} &= \left[q^{C}_{t,i};\; q^{R}_{t,i}\right], &\qquad
k_{t,i} &= \left[k^{C}_{t,i};\; k^{R}_{t}\right], \\
o_{t,i} &= \mathrm{Attention}\!\left(q_{t,i},\, k_{1:t,i},\, v_{1:t,i}\right), &\qquad
u_t &= W^{O}\left[o_{t,1};\, o_{t,2};\, \ldots;\, o_{t,n_h}\right],
\end{aligned}
\tag{6}
$$

where h_t ∈ R^{d_model} is the input hidden state, and the final query and key for each head i are formed by concatenating a content part (C) and a rotary part (R).
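For readers who prefer code, the minimal PyTorch sketch below shows where the scale-correction factors enter the MLA projections of Eq. (6). The rotary embedding and the attention computation itself are left abstract (passed in as `rope_fn`), and all module and dimension names are ours rather than the released implementation's.

```python
import torch
import torch.nn as nn

class ScaleCorrectedMLAProjections(nn.Module):
    """Sketch of MLA's low-rank query/key-value projections with the
    scale-correction factors alpha_q and alpha_kv applied after the
    down-projections, following Eq. (6). Attention itself is omitted."""

    def __init__(self, d_model, d_q, d_kv, n_heads, d_head, d_rope,
                 alpha_q: float, alpha_kv: float):
        super().__init__()
        self.alpha_q, self.alpha_kv = alpha_q, alpha_kv
        self.w_dq = nn.Linear(d_model, d_q, bias=False)             # W^{DQ}
        self.w_dkv = nn.Linear(d_model, d_kv, bias=False)           # W^{DKV}
        self.w_uq = nn.Linear(d_q, n_heads * d_head, bias=False)    # W^{UQ}
        self.w_uk = nn.Linear(d_kv, n_heads * d_head, bias=False)   # W^{UK}
        self.w_uv = nn.Linear(d_kv, n_heads * d_head, bias=False)   # W^{UV}
        self.w_qr = nn.Linear(d_q, n_heads * d_rope, bias=False)    # W^{QR}
        self.w_kr = nn.Linear(d_model, d_rope, bias=False)          # W^{KR}, one rotary key shared by all heads

    def forward(self, h, rope_fn):
        # Down-projections, rescaled so their output variance stays aligned
        # despite the asymmetric low-rank factorization.
        c_q = self.alpha_q * self.w_dq(h)        # c^Q_t
        c_kv = self.alpha_kv * self.w_dkv(h)     # c^{KV}_t
        q_c, k_c, v = self.w_uq(c_q), self.w_uk(c_kv), self.w_uv(c_kv)
        q_r, k_r = rope_fn(self.w_qr(c_q)), rope_fn(self.w_kr(h))
        # Per head, the final query/key concatenate a content part and a
        # rotary part; k_r is broadcast to every head, as in Eq. (6).
        return (q_c, q_r), (k_c, k_r), v
```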
2.4 Model Information

Tokenizer. LongCat-Flash employs byte-pair encoding (BPE) [Shibata et al., 1999, Sennrich et al., 2015] for tokenization. Our tokenizer is trained on a comprehensive multilingual corpus spanning web pages, books, source code, etc., ensuring robust cross-domain performance. While inheriting GPT-4's pre-tokenization framework, we introduce the following modifications: (1) enhanced CJK character segmentation for improved Chinese text handling, and (2) independent digit tokenization to boost mathematical capabilities. The vocabulary size is optimized at 131,072 tokens, striking an effective balance between computational efficiency and linguistic coverage.

Multi-Token Prediction. To enhance inference efficiency, we integrate Multi-Token Prediction (MTP) [Gloeckle et al., 2024, DeepSeek-AI et al., 2025] as an auxiliary training objective. For optimal inference performance, we employ a single dense layer rather than an MoE layer as the MTP head. Empirical observations reveal rapid convergence of the MTP loss, prompting us to strategically introduce MTP training in the middle of the training pipeline to balance model performance with prediction accuracy. The MTP head achieves a >90% acceptance rate in evaluations (Table 5).

Model Configurations. LongCat-Flash consists of 28 layers (excluding the MTP layer) with a 6144-dimensional hidden state. Each MLA block uses 64 attention heads with a per-head dimension of 128 for a balanced performance-efficiency tradeoff. Following DeepSeek-V3 [Liu et al., 2024a], we set the KV compression dimension to 512 and the query compression dimension to 1536. The FFNs in the dense path use an intermediate dimension of 12288, while each FFN expert uses 2048. The scaling factors in the MLA and FFN blocks follow the methodology in Section 2.3.1. Each layer contains 512 FFN experts and 256 zero-computation experts, with exactly 12 experts activated per token (selected from both types). LongCat-Flash has 560B total parameters, activating between 18.6B and 31.3B parameters per token depending on context, with an average activation of approximately 27B parameters.

3 Pre-Training

The pre-training of LongCat-Flash follows a three-stage curriculum: (1) we train the model on approximately 20 trillion tokens with an 8192-token sequence length to establish a robust base model; (2) reasoning and coding capabilities are further enhanced using trillions of additional tokens; (3) the context length is extended to 128k through training on long-context corpora. Each stage implements tailored data strategies accompanied by rigorous decontamination procedures to prevent test-set leakage. To optimize scalability, we introduce hyperparameter transfer and model growth strategies, significantly improving performance as model size increases. Given the inherent instability challenges in large-scale training, we identify and implement multiple effective techniques to enhance training stability.

3.1 Training Strategy

3.1.1 Hyperparameter Transfer

LongCat-Flash employs a hyperparameter transfer strategy based on width scaling [Everett et al., 2024] to efficiently train large-scale models. The methodology involves: (1) identifying optimal hyperparameters on a smaller proxy model, and (2) transferring these configurations to the target model through theoretically motivated scaling rules. The transfer mechanism centers on the width scaling factor s = n_target / n_proxy, where n is the model's hidden dimension.
We specifically adopt the "Adam LR Full Align" rules for Standard Parameterization. These rules specify how to adapt the proxy model's optimal initialization variance (σ²) and learning rate (η) for the target architecture. The practical transfer rules are summarized in Table 1.

Table 1: Practical hyperparameter transfer rules and their underlying scaling exponents, derived from the Adam LR Full Align principle for Standard Parameterization [Everett et al., 2024]. Here, s is the width scaling factor n_target / n_proxy.

| Layer & Parameter | Target Model Setting |
| --- | --- |
| Embedding (Init Var, σ²) | σ²_target = σ²_proxy |
| Embedding (Learning Rate, η) | η_target = η_proxy |
| Hidden/Unembedding (Init Var, σ²) | σ²_target = σ²_proxy / s |
| Hidden/Unembedding (Learning Rate, η) | η_target = η_proxy / s |
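As a small illustration of how these rules are applied, the sketch below scales a proxy model's per-group initialization variances and learning rates by the width factor s. The dictionary layout, function name, and numeric values are ours and purely illustrative, not part of the released training code.

```python
def transfer_hyperparameters(proxy_hparams, n_proxy, n_target):
    """Apply the 'Adam LR Full Align' width-scaling rules from Table 1.

    proxy_hparams maps a layer group ('embedding', 'hidden', 'unembedding')
    to its tuned (init_var, lr) found on the proxy model.
    """
    s = n_target / n_proxy  # width scaling factor
    target_hparams = {}
    for group, (init_var, lr) in proxy_hparams.items():
        if group == "embedding":
            # Embedding parameters keep the proxy settings unchanged.
            target_hparams[group] = (init_var, lr)
        else:
            # Hidden and unembedding parameters scale both sigma^2 and eta by 1/s.
            target_hparams[group] = (init_var / s, lr / s)
    return target_hparams


# Example with the paper's width factor s = 8 (proxy width 768, target width 6144);
# the hyperparameter values below are illustrative only.
proxy = {"embedding": (1e-2, 3e-3), "hidden": (1e-2, 3e-3), "unembedding": (1e-2, 3e-3)}
print(transfer_hyperparameters(proxy, n_proxy=768, n_target=6144))
```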
Following this methodology, our training involves the following steps:

1. We set the width scaling factor s to 8 based on a trade-off analysis between computational efficiency and transfer performance. The proxy model is configured with a width of 768.
2. We then perform a comprehensive hyperparameter search on the proxy model to identify the optimal layer-specific initialization variances (σ²_proxy) and learning rates (η_proxy).
3. The optimal hyperparameters from the proxy model are transferred to the target model following the rules detailed in Table 1. All other architectural attributes (depth, sparsity, and batch size) remain invariant during this transfer process.

We conducted comprehensive experiments to validate the effectiveness of this approach. The results demonstrate that this method significantly reduces the computational cost of identifying optimal hyperparameters (initialization variance and learning rate) for large-scale model training, while establishing a robust, theoretically grounded framework for model scaling.

3.1.2 Model Growth Initialization

LongCat-Flash employs model growth as its initialization strategy, starting from a half-scale model pre-trained on tens of billions of tokens. Among existing model growth methods [Chen et al., 2015, Du et al., 2024, Wang et al., 2023a, Shen et al., 2022, Wang et al., 2023b, Gong et al., 2019], we adopt the layer stacking technique [Du et al., 2024, Kim et al., 2023] to expand parameters and enhance performance. Disregarding the embedding and unembedding processes temporarily, the whole procedure is formulated as:

$$
L_{\text{small}} = l_1 \circ l_2 \circ \cdots \circ l_n, \qquad
L_{\text{target}} = \underbrace{L_{\text{small}} \circ L_{\text{small}} \circ \cdots \circ L_{\text{small}}}_{r}
$$

where l_i denotes the transformation of the i-th layer in the model, r denotes the expansion rate, L_small denotes the small model's transformation from token embeddings to final hidden states, and L_target represents the transformation of the target (large) model constructed by stacking r copies of the small model. We use r = 2 for our architecture.

Through extensive experiments, we consistently observed that models initialized via model growth exhibit a characteristic loss trajectory: an initial increase followed by accelerated convergence, ultimately outperforming randomly initialized baselines. Figure 5b presents a representative case from our 6B-activated model experiments, demonstrating the advantage of model growth initialization. We conjecture that this improvement arises from two synergistic factors: (1) the faster convergence of smaller models likely provides higher-quality parameter initializations for scaled training, and (2) growth operations potentially serve as implicit regularization against parameter collapse. Experimental evidence further suggests that over-optimizing predecessor models may negatively impact token efficiency in target models, indicating the need for judicious growth timing.

For LongCat-Flash initialization, we first train a 14-layer model with an architecture identical to the target model, using random initialization on the initial data segment. The trained model is then stacked to create a 28-layer checkpoint, preserving all training states, including sample counters and learning rate schedules, from the predecessor.
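A minimal sketch of this stacking step is shown below, assuming the checkpoint is a PyTorch state dict whose layer indices are encoded as `layers.<i>.` in the parameter names; that naming layout and the function itself are hypothetical conveniences, not the actual growth tooling.

```python
import re

def stack_layers(small_state_dict, n_layers_small, r=2, layer_key=r"layers\.(\d+)\."):
    """Build an r-times-deeper initialization by repeating the small model's
    layer stack: target layers [0, n) copy the source layers, layers [n, 2n)
    copy them again, and so on. Non-layer tensors (embeddings, norms outside
    the stack, unembedding) are copied once, unchanged."""
    pattern = re.compile(layer_key)
    target = {}
    for name, tensor in small_state_dict.items():
        match = pattern.search(name)
        if match is None:
            target[name] = tensor.clone()          # non-layer parameters: copy once
            continue
        src_idx = int(match.group(1))
        for copy_id in range(r):
            dst_idx = copy_id * n_layers_small + src_idx
            new_name = pattern.sub(f"layers.{dst_idx}.", name, count=1)
            target[new_name] = tensor.clone()      # each stacked copy starts from the same values
    return target

# e.g. a 14-layer checkpoint stacked with r=2 yields a 28-layer initialization,
# matching the growth schedule described above.
```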
3.1.3 Training Stability

We enhance the training stability of LongCat-Flash from three perspectives: router stability, activation stability, and optimizer stability.

Router Stability. A fundamental challenge in training MoE models is router stability, which stems from the tension between two competing gradients:

• the language modeling (LM) loss, which drives expert specialization (assigning tokens to the most suitable experts);
• the auxiliary load balancing (LB) loss, which enforces routing uniformity (distributing tokens evenly across experts).

When the LB gradient dominates, the router parameters for all experts converge toward similarity, leading to uniform routing decisions regardless of the input tokens. This nullifies the benefits of conditional computation and severely degrades model performance. To diagnose and control this behavior, we propose a monitoring framework with two key metrics:
3.4 Long Context Extension

The training corpus is built upon naturally occurring long-text data, such as high-quality books and novels. Additionally, we developed a systematic approach to organizing repository-level source code to improve the model's long-context capabilities. We carefully selected high-quality repositories and applied a multi-stage filtering process to remove non-textual content, build artifacts, and auto-generated code, resulting in a curated 20B-token dataset for long-context pre-training. To ensure that the model's general capabilities remain stable during length extension, we adopt a data mixture identical to that of our main pre-training phase and augment it with an additional 25% of long-context data to enhance the model's long-context performance.

3.5 Decontamination

We perform rigorous decontamination on all training data to prevent leakage from the test sets of common benchmarks. For web and code data, we remove documents containing any 13-gram overlap with predefined test sets. For synthetic data and question-answering pairs, we employ a stricter strategy based on semantic similarity using BGE-m3 [Chen et al., 2024] embeddings. Documents are discarded if they meet either of the following criteria: (1) a semantic similarity score > 0.9 to any test case; (2) lexical overlap (measured by sparse embeddings) combined with a similarity score between 0.7 and 0.9.

3.6 Evaluation

This section presents a comprehensive evaluation of the LongCat-Flash base model, including the methodology and results.

3.6.1 Evaluation Benchmarks and Configurations

The model evaluation covers four core capabilities: general tasks, general reasoning, mathematical reasoning, and coding. The benchmarks used for assessment include:

• General Tasks: MMLU [Hendrycks et al., 2021a], MMLU-Pro [Wang et al., 2024b], C-Eval [Huang et al., 2023], and CMMLU [Li et al., 2023a].
• Reasoning Tasks: GPQA [Rein et al., 2023], SuperGPQA [M-A-P Team, ByteDance., 2025], BBH [Suzgun et al., 2023], PIQA [Bisk et al., 2019], DROP [Dua et al., 2019], CLUEWSC [Xu et al., 2020], and WinoGrande [Sakaguchi et al., 2019].
• Math Tasks: GSM8K [Cobbe et al., 2021] and MATH [Hendrycks et al., 2021b].
• Coding Tasks: MBPP+ [Liu et al., 2024b], HumanEval+ [Liu et al., 2024b], MultiPL-E [Cassano et al., 2022], and CRUXEval [Gu et al., 2024].

We compare the LongCat-Flash base model with state-of-the-art open-source MoE base models, including DeepSeek-V3.1 Base [DeepSeek-AI et al., 2025], Llama-4-Maverick Base [Meta AI, 2025], and Kimi-K2 Base [MoonshotAI, 2025]. To ensure fairness, all models are evaluated under identical pipelines and configurations. For the minority of results that cannot be reproduced, we directly adopt metrics from public reports and explicitly annotate them in Table 2. The evaluation settings are as follows:

• General/reasoning/math tasks: few-shot prompts are used to guide the output format; performance is measured via accuracy or F1 score.
• HumanEval+ and MBPP+: follow OpenAI's recommended setting [Chen et al., 2021].
• MultiPL-E: follow the BigCode Evaluation Harness [Ben Allal et al., 2022].
• CRUXEval: follow the official configuration¹, employing 2-shot examples.

3.6.2 Evaluation Results

Table 2 presents the evaluation results across diverse benchmarks. The LongCat-Flash base model achieves performance on par with state-of-the-art base models despite its compact active/total parameter size.
Although Llama-4-Maverick has fewer activated and total parameters, LongCat-Flash Base surpasses it on nearly all benchmarks.

¹ https://github.com/facebookresearch/cruxeval
A comparative analysis reveals that LongCat-Flash Base matches DeepSeek-V3.1 Base's performance across all domains despite containing fewer parameters. While the two models perform similarly on general tasks, LongCat-Flash Base demonstrates a notable advantage on the MMLU-Pro benchmark (featuring challenging questions). For reasoning tasks, LongCat-Flash Base attains a higher average score. In math and coding tasks, it outperforms DeepSeek-V3.1 Base on most benchmarks, with only marginal performance gaps observed on CRUXEval and MultiPL-E. Against Kimi-K2 Base, LongCat-Flash Base shows modestly lower performance on general tasks but achieves parity or superiority on reasoning, math, and coding tasks. These results collectively underscore LongCat-Flash Base's parameter efficiency, as it delivers competitive or superior performance relative to larger models across the majority of evaluated benchmarks.

Table 2: Comparison between LongCat-Flash and other base models. Values marked with * are sourced from public reports.

| Benchmark | DeepSeek-V3.1 Base | Llama-4-Maverick Base | Kimi-K2 Base | LongCat-Flash Base |
| --- | --- | --- | --- | --- |
| Architecture | MoE | MoE | MoE | MoE |
| # Total Params | 671B | 402B | 1043B | 560B |
| # Activated Params | 37B | 17B | 32B | 27B |
| General Domains | | | | |
| MMLU (acc) | 87.46 | 84.41 | 87.47 | 87.05 |
| MMLU-Pro (acc) | 59.29 | 63.90 | 68.36 | 70.32 |
| CEval (acc) | 89.33 | 81.93 | 91.24 | 87.73 |
| CMMLU (acc) | 88.21 | 80.71 | 90.35 | 87.19 |
| General Reasoning | | | | |
| GPQA (acc) | 47.16 | 48.08 | 45.89 | 51.09 |
| SuperGPQA (acc) | - | 40.58* | 44.70* | 54.19 |
| BBH (acc) | 89.46 | 87.56 | 89.19 | 90.54 |
| DROP (f1) | 80.74 | 77.44 | 69.81 | 78.39 |
| PIQA (acc) | 93.00 | 90.59 | 95.10 | 92.33 |
| WinoGrande (acc) | 83.50 | 73.32 | 82.87 | 85.08 |
| CLUEWSC (acc) | 88.16 | 88.00 | 76.32 | 91.12 |
| Mathematical Reasoning | | | | |
| GSM8K (acc) | 92.22 | 84.61 | 92.19 | 92.27 |
| MATH (acc) | 61.56 | 63.34 | 64.82 | 66.74 |
| Coding | | | | |
| MBPP+ (pass@1) | 59.26 | 70.11 | 77.25 | 80.49 |
| HumanEval+ (pass@1) | 67.07 | 60.37 | 65.85 | 69.84 |
| MultiPL-E (pass@1) | 62.00 | 58.35 | 69.25 | 59.22 |
| CRUXEval-I (pass@1) | 65.87 | 62.00 | 71.63 | 65.87 |
| CRUXEval-O (pass@1) | 71.25 | 64.25 | 75.88 | 68.75 |

4 Post-Training

We implement a conventional multi-stage post-training framework to augment the base model's performance across diverse domains, ranging from sophisticated reasoning, coding, and agentic tool-use tasks to general-purpose capabilities. During this process, we observed that the limited availability of high-quality problem sets is a significant bottleneck across all domains. In the subsequent sections, we present key insights derived from our post-training methodology, organized into three distinct phases: (1) reasoning and coding, (2) agentic tool use, and (3) general capability.

4.1 Reasoning and Coding

Mathematics. To generate high-quality and novel problems, we use a persona-based [Ge et al., 2024] self-instruct [Wang et al., 2022] paradigm. This process is guided by a comprehensive mathematical framework that spans topics from elementary to advanced levels. We leverage a diverse set of math-related "expert" personas to ask questions, steering
LLMs to synthesize queries that cover underrepresented subjects. Each query is structured to elicit Chain-of-Thought (CoT) reasoning, promoting step-by-step problem solving in the generated answers. Details of persona curation and answer verification are as follows:

• Persona Curation: The personas are constructed from multiple sources: we generate them from our high-quality pre-training data, derive them from existing math queries, and incorporate relevant collections from Persona Hub. Each persona is systematically labeled by its STEM discipline. To ensure maximum diversity and alignment with our subject framework, we use the MinHash algorithm to select the final set of personas for query generation.
• Answer Verification: We employ a two-stage process to ensure the accuracy of the synthesized solutions: (1) we generate answers for each problem using several different LLMs and select the most consistent solution as the final answer; (2) we train a generative reward model, specifically enhanced with reasoning data, to automatically score and verify the logical soundness of the problem-solving steps.

Coding. We assemble a diverse set of coding queries from multiple sources, including public datasets, queries generated from GitHub code snippets [Wei et al., 2024] and coding-related forums, as well as queries evolved using the Code Evol-Instruct method [Luo et al., 2024]. The data distribution is balanced according to topic diversity and difficulty. Specifically, we train a model to select queries that are clear, consistent, and correct, with sufficient explanatory detail, and implement a filtering pipeline to eliminate responses containing garbled content, repetitive patterns, or logical errors. For software engineering tasks, we curate and validate tens of thousands of Docker images containing test cases. Each image is used to verify whether model-generated code can resolve specific issues in the corresponding repository. We develop an agent-based system that leverages various tools to autonomously analyze code structures, identify relevant files, fix bugs, and implement new features. This process yields thousands of successful trajectories that pass all test cases, thereby enhancing the model's ability to autonomously solve real-world software engineering problems.

Logical Reasoning. We construct logical reasoning datasets covering deductive, hypothetical, and inductive reasoning, which include tasks such as LogicPro [Jiang et al., 2025], PODA [Wang et al., 2025b], and Zebra-style logic puzzles. To manage difficulty, we first use the Pass@k metric for an initial balance, then filter out intractable problems on which advanced thinking models fail. We also convert multiple-choice questions to a fill-in-the-blank format to mitigate random guessing. The evaluation of responses focuses on four key areas: (1) correctness of the final answer; (2) completeness and clarity of reasoning; (3) avoidance of excessive repetition; and (4) consistent use of language.

4.2 Agentic Tool Use

We define agentic tasks as complex problem resolution through systematic environment interaction. In this paradigm, models must iteratively analyze existing information and determine when environmental interaction is needed. Specifically, within the agentic tool-use framework, the environment comprises users and tools with distinct characteristics.
The user functions as an autonomous information-providing entity without upstream or downstream dependencies, but is reluctant to be disturbed and does not disclose information spontaneously. Consequently, models must minimize user queries while employing strategic questioning techniques to elicit maximally precise information when interaction becomes necessary. Tools can be invoked extensively and with high frequency, but exhibit intricate interdependencies. From this perspective, excluding domain-specific expertise such as advanced programming capabilities or mathematical computation, we attribute task difficulty escalation to three factors:

• Information processing complexity: models must engage in sophisticated reasoning processes to integrate and transform information into the required components.
• Tool set complexity: by modeling the tool set as a directed graph based on inter-tool dependencies, complexity can be quantitatively characterized by the graph's node cardinality and edge density.
• User interaction complexity: models must learn to engage in multi-round strategic questioning with minimal frequency, adapting to various conversational styles, levels of communication willingness, and patterns of information disclosure, thus facilitating effective user interaction while ensuring adequate information acquisition.

Building on these insights, we construct a multi-agent data synthesis framework that generates high-quality, challenging tasks by systematically addressing three complexity dimensions critical for agent training: (1) tool set complexity, (2) information processing complexity, and (3) user interaction complexity. The framework comprises the following specialized agents:

• UserProfileAgent: beyond generating fundamental user profiles encompassing personal information and preferences, we further implement controls over user conversational styles, communication willingness levels, and information
disclosure patterns to more accurately simulate authentic user interaction scenarios while simultaneously enhancing task complexity.
• ToolSetAgent: to maximize data diversity and prevent overfitting to specific scenarios, we adopt an approach analogous to Kimi-K2 [Team et al., 2025], enumerating 40 distinct domains and subsequently leveraging models to enumerate 1,600 applications. Based on these applications, we construct 80,000 mock tools, forming an extensive tool graph. Through random-walk methodologies, we systematically sample subgraphs with predetermined node quantities from this comprehensive tool graph; tool graph complexity is thus controlled via node quantity.
• InstructionAgent: the difficulty of reasoning is quantified along the following dimensions: constraint complexity, quantity of reasoning points, and length of the reasoning chain. The model is required to generate instructions that comprehensively describe complete tasks based on the tool set extracted by the ToolSetAgent.
• EnvironmentAgent: we augment environmental information, including item details, location specifics, temporal parameters, and meteorological conditions, based on content generated by the UserProfileAgent and InstructionAgent. Additionally, we introduce confounding elements for items and locations to further increase reasoning complexity.
• RubricAgent: we construct a comprehensive series of task-specific checklists based on various task-related information. During final evaluation, considering the long-context characteristics inherent to agentic tasks, we employ a sliding-window approach to assess the entire trajectory, continuously updating the completion status of checklist items.
• ValidatorAgent and DeduplicatorAgent: we check the quality of the final tasks from several angles and remove any that are too similar. This process ensures a diverse and high-quality set of tasks.

With these high-quality, challenging tasks, we further conduct rigorous response selection to construct a cold-start training set of appropriate size, revealing diverse patterns and preserving high exploration ability. We also carefully select a subset of the generated tasks for the subsequent post-training procedure, to ensure each task is worth extensive exploration.

4.3 General Capability

Instruction Following. We curate both single-turn and multi-turn instruction-following datasets with varying levels of constraint complexity and quantity. For multi-constraint queries, we adopt the insight from Ye et al. [2025] to filter queries with low semantic quality or constraint conflicts. For different query types, we employ verifiable rules, model-based verification, and customized strategies to ensure responses satisfy all constraints. Additionally, we compile critique datasets targeting challenging tasks to enhance the model's critical thinking abilities [Wang et al., 2025c]. We observe that certain constraint types are inherently difficult to follow, making direct generation of valid query-answer pairs unreliable. To address this, we propose a reverse prompt-generation strategy: generating queries from predefined answers that are guaranteed to meet the constraints.

Long Context. To enable the model to identify and analyze relevant information within complex, lengthy contexts, we develop three types of long-sequence datasets: reading comprehension, table-based question answering, and custom-designed tasks.
To facilitate the learning of salient information in long sequences, we aggregate topically related context segments for data construction. We specifically enhance the model's multi-hop reasoning, multi-turn dialogue, and complex calculation abilities. To mitigate hallucination when confronted with an incomplete context, we optimize the model's refusal capabilities, thereby improving its awareness of knowledge boundaries and limitations.

Safety. Building on the framework of Mu et al. [2024] and aligned with our internal content guidelines, we develop a content safety policy that categorizes queries into more than 40 distinct safety categories across five response types: comply, comply with guideline, soft refuse, soft refuse with guideline, or hard refuse. Explicit criteria ensure consistent, safety-standards-compliant responses for each response type. This system operates as a context-aware data synthesizer through two stages: (1) Query classification: queries from diverse sources (open-domain corpora, internal business risk reports, government Q&A, and adversarial LLM-synthesized red-team content) are classified into safety categories using automated labeling with human verification. (2) Response mapping & optimization: classified queries are mapped to response types, and optimized, type-specific responses are generated and undergo human evaluation before serving as training targets.

4.4 Evaluation

We conduct a comprehensive and rigorous evaluation of LongCat-Flash after post-training. Specifically, we assess its capabilities across multiple dimensions, including general domains, instruction following, mathematical reasoning, general reasoning, and coding and agentic tasks.
4.4.1 Evaluation Benchmarks and Configurations

The evaluation employs the following benchmarks:

• General Domains: MMLU [Li et al., 2023a], MMLU-Pro [Wang et al., 2024b], ArenaHard [Li et al., 2024a,b], CEval [Huang et al., 2023], and CMMLU [Li et al., 2023a].
• Instruction Following: IFEval [Zhou et al., 2023], COLLIE [Yao et al., 2024], and Meeseeks [Wang et al., 2025a]. Meeseeks evaluates models' instruction-following capabilities in multi-turn scenarios through an iterative feedback framework that simulates realistic human-LLM interactions, enabling models to self-correct based on turn-specific failures and better reflecting real-world usage patterns.
• Mathematical Reasoning: MATH500 [Lightman et al., 2023], AIME24 [MAA, 2024], AIME25 [MAA, 2025], and BeyondAIME [ByteDance-Seed, 2025].
• General Reasoning: GPQA-diamond [Rein et al., 2024], DROP [Dua et al., 2019], ZebraLogic [Lin et al., 2025], and GraphWalks [OpenAI, 2025a].
• Coding: HumanEval+ and MBPP+ [Liu et al., 2023, 2024c], LiveCodeBench (2024.08-2025.05) [Jain et al., 2025], SWE-Bench-Verified [Jimenez et al., 2024], and TerminalBench [Team, 2025a].
• Agentic Tool Use: τ²-Bench [Barres et al., 2025] and AceBench [Chen et al., 2025].

Furthermore, we develop a high-quality proprietary benchmark, VitaBench, leveraging Meituan's comprehensive real-world business scenarios to systematically evaluate models' capabilities in addressing complex real-world tasks. Within VitaBench, to comprehensively assess models' generalized agentic capabilities, we deliberately curate cross-domain quotidian scenarios and explicitly delineate inter-tool dependencies, eschewing the provision of extensive domain-specific policies. Our benchmark emphasizes three critical dimensions of complexity: tool set complexity (characterized by dense tool graphs averaging over 30 available tools per task), reasoning complexity, and user interaction complexity (featuring challenging user personas, with evaluated models averaging more than 60 interaction rounds per task). The complete benchmark dataset, along with detailed construction methodologies and comprehensive result analysis, will be fully released in subsequent work.

We also evaluate the safety performance of LongCat-Flash. Specifically, we conduct evaluations on four major risk categories:

• Harmful: violence, hate speech, insults, harassment and bullying, self-harm and suicide, adult content, etc.
• Criminal: illegal activities, underage violations, extreme terrorism and violence, etc.
• Misinformation: misinformation and disinformation, unsafe practices, hallucination, etc.
• Privacy: privacy violations, infringement, etc.

Within each category, a sufficient number of private test queries are constructed, followed by a comprehensive manual review to ensure the accuracy of their classification and the reliability of their quality.

We compare the chat version of LongCat-Flash with several contemporary non-thinking chat models, including DeepSeek-V3.1 [DeepSeek-AI et al., 2025], Qwen3-235B-A22B (2507 version) [Yang et al., 2025], Kimi-K2 [MoonshotAI, 2025], GPT-4.1 [OpenAI, 2025b], Claude4-Sonnet [Anthropic, 2025], and Gemini2.5-Flash [Comanici et al., 2025]. For closed-source models, we conduct evaluations through their official APIs. For models supporting both thinking and non-thinking modes (Qwen3-235B-A22B, Gemini2.5-Flash, and Claude4-Sonnet), we explicitly configure them to operate in non-thinking mode for a fair comparison.
For each benchmark category, we employ the following specialized metrics and settings:

• General domain benchmarks: we use accuracy as the evaluation metric. Unlike the original benchmarks, which rely on exact match (EM) for correctness judgment, we employ a scoring model to assess whether model responses align with reference answers. Since our scoring model recognizes semantically correct answers even without exact textual matches, reported values may be slightly higher than originally documented.
• Instruction following benchmarks: we design regular expressions based on instruction rules to verify compliance. Rule-based and model-based answer-span extraction tools are additionally employed to support this evaluation.
• Mathematical reasoning benchmarks: we apply the aforementioned scoring model for MATH500, and averaged EM scores over 10 runs for the AIME-related benchmarks.
• General reasoning benchmarks: we apply the scoring model for GPQA-diamond, calculate the F1 score for DROP, adopt rule-based matching for ZebraLogic, and use the precision metric for GraphWalks following the official implementation on its 128k context length subset.
• Coding benchmarks: each problem is scored 1 if the model's response passes all test cases in a sandbox environment or matches a specific state, and 0 otherwise. The final score is the average across all problems. We adopt the script provided by OpenAI² to evaluate HumanEval+ and MBPP+, and the official scripts for the others. Specifically, for SWE-Bench-Verified, we use R2E-Gym³ (OpenHands scaffold) with runs limited to 100 iterations for all models except DeepSeek-V3.1 (which uses OpenHands⁴). For TerminalBench, we use the Terminus framework with direct prompting for evaluation.
• Agentic tool use benchmarks: we utilize official benchmark frameworks to ensure fairness and reproducibility. For AceBench, we use direct prompting rather than function calling. For our proposed VitaBench, given the inherent long-context characteristics of agentic tasks, we employ a sliding-window mechanism to systematically evaluate task completion status throughout the entire execution trajectory, facilitating continuous updates to the completion status of individual checklist components.

4.4.2 Evaluation Results

As detailed in Table 3, our comprehensive evaluation reveals that LongCat-Flash is a powerful and versatile model. It consistently demonstrates leading performance across different domains, often outperforming contemporary models on a wide array of challenging tasks with relatively fewer activated parameters. The following analysis provides a detailed breakdown of its capabilities across different dimensions.

General Domains. In general domain knowledge, LongCat-Flash demonstrates a strong and well-rounded performance. It achieves an excellent score of 86.50 on ArenaHard-V2, ranking second among all evaluated models and showcasing its robust capabilities in challenging head-to-head comparisons. On foundational benchmarks it remains highly competitive, scoring 89.71 on MMLU and 90.44 on CEval. These results are comparable to leading models and, notably, are achieved with fewer parameters than competitors such as DeepSeek-V3.1 and Kimi-K2, indicating high efficiency.

Instruction Following. LongCat-Flash exhibits state-of-the-art instruction-following capabilities. It achieves the highest score of 89.65 on IFEval, outperforming all other models and demonstrating superior reliability in adhering to complex and nuanced directives. Furthermore, it secures the best scores on COLLIE (57.10) and Meeseeks-zh (43.03), underscoring its exceptional proficiency across diverse and challenging instruction sets in both English and Chinese.

Mathematical Reasoning. In mathematical reasoning, LongCat-Flash shows powerful and advanced capabilities. While its score on MATH500 (96.40) is highly competent, its strength is particularly evident on more complex, competition-level benchmarks. It delivers excellent, top-tier results on AIME25 (61.25) and BeyondAIME (43.00), ranking among the best-performing models in these challenging domains. This highlights its advanced capacity for sophisticated, multi-step logical deduction and problem solving.

General Reasoning. For general reasoning tasks, LongCat-Flash's performance is also solid. It demonstrates exceptional strength in structured logical deduction, achieving a score of 89.30 on ZebraLogic, which places it among the top competitors. It also obtains a competitive score of 79.06 on the reading comprehension benchmark DROP.
Conversely, its results on GPQA-diamond (73.23) and GraphWalks (51.05) indicate an opportunity for further improvement, particularly in enhancing its capabilities for analyzing structured data within extremely long contexts.

Coding. LongCat-Flash displays a promising and capable profile in the coding domain. Its standout performance is on TerminalBench, where it achieves a score of 39.51, ranking second and demonstrating excellent proficiency in practical, agentic command-line tasks. It is also competitive on the SWE-Bench-Verified benchmark with a score of 60.4. On foundational code generation tasks such as HumanEval+ and MBPP+, its performance is solid, yet there remains potential for further optimization to align with the leading models.

Agentic Tool Use. LongCat-Flash demonstrates a clear advantage in the agentic tool use domain, notably outperforming other models on τ²-Bench, even when compared to models with more parameters. In highly complex scenarios, it achieves the highest score of 24.30 on VitaBench, demonstrating strong capability on complex real-world tasks.

Safety. LongCat-Flash shows outstanding capability in identifying and mitigating risks overall, particularly in the Harmful and Criminal domains compared to other models.

² https://github.com/bigcode-project/bigcode-evaluation-harness
³ https://github.com/R2E-Gym/R2E-Gym
⁴ https://github.com/All-Hands-AI/OpenHands
Table 3: Evaluation results of frontier chat models. Values marked with * are sourced from other public reports. Note that DeepSeek-V3.1, Qwen3-235B-A22B, Gemini2.5-Flash, and Claude4-Sonnet are evaluated under their non-thinking mode.

| Benchmark | DeepSeek V3.1 | Qwen3 MoE-2507 | Kimi-K2 | GPT-4.1 | Claude4 Sonnet | Gemini2.5 Flash | LongCat-Flash |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Architecture | MoE | MoE | MoE | - | - | - | MoE |
| # Total Params | 671B | 235B | 1043B | - | - | - | 560B |
| # Activated Params | 37B | 22B | 32B | - | - | - | 27B |
| General Domains | | | | | | | |
| MMLU (acc) | 89.86 | 90.96 | 90.23 | 89.64 | 91.75 | 86.33 | 89.71 |
| MMLU-Pro (acc) | 82.06 | 84.45 | 84.83 | 81.72 | 83.74 | 81.95 | 82.68 |
| ArenaHard-V2 (acc) | 85.70 | 84.10 | 88.20 | 61.50 | 62.10 | 77.00 | 86.50 |
| CEval (acc) | 91.26 | 89.21 | 92.70 | 79.53 | 86.63 | 78.78 | 90.44 |
| CMMLU (acc) | 89.66 | 88.04 | 88.14 | 77.65 | 86.51 | 78.30 | 84.34 |
| Instruction Following | | | | | | | |
| IFEval (acc) | 86.69 | 88.54 | 88.91 | 85.58 | 88.35 | 83.92 | 89.65 |
| COLLIE (acc) | 43.80 | 49.71 | 56.34 | 50.00 | 51.22 | 48.60 | 57.10 |
| Meeseeks-zh (acc) | 33.83 | 35.32 | 42.79 | 41.54 | 35.07 | 34.84 | 43.03 |
| Mathematical Reasoning | | | | | | | |
| MATH500 (acc) | 96.08 | 98.80 | 97.60 | 90.60 | 93.80 | 98.40 | 96.40 |
| AIME24 (avg@10) | 66.30* | 81.67 | 69.60* | 47.00 | 47.00 | 79.67 | 70.42 |
| AIME25 (avg@10) | 49.27 | 68.33 | 50.66 | 32.00 | 37.00 | 67.33 | 61.25 |
| BeyondAIME (avg@10) | 36.50 | 57.60 | 36.60 | 22.10 | 20.50 | 44.20 | 43.00 |
| General Reasoning | | | | | | | |
| GPQA-diamond (acc) | 74.90* | 77.43 | 75.76 | 67.68 | 70.71 | 80.30 | 73.23 |
| DROP (f1) | 84.19 | 78.57 | 89.04 | 66.94 | 73.06 | 45.03 | 79.06 |
| ZebraLogic (acc) | 85.30 | 94.22 | 89.11 | 56.30* | 75.85 | 51.78 | 89.30 |
| GraphWalks-128k (precision) | 73.54 | 80.72 | 47.50 | 85.02 | 80.57 | 64.83 | 51.05 |
| Coding | | | | | | | |
| LiveCodeBench (pass@1) | 56.40* | 46.48 | 46.70 | 39.21 | 45.59 | 39.65 | 48.02 |
| Humaneval+ (pass@1) | 92.68 | 94.51 | 85.98 | 93.29 | 94.51 | 87.80 | 88.41 |
| MBPP+ (pass@1) | 79.89 | 79.89 | 81.75 | 79.37 | 80.16 | 76.19 | 79.63 |
| SWE-Bench-Verified (acc) | 66.00* | 42.00 | 64.60 | 48.60 | 68.00* | 40.60 | 60.40 |
| TerminalBench (acc) | 31.30* | 17.28 | 25.93 | 28.40 | 40.74 | 12.35 | 39.51 |
| Agentic Tool Use | | | | | | | |
| τ²-Bench (telecom) (avg@4) | 38.50 | 22.50 | 67.50 | 35.20 | 46.20 | 16.50 | 73.68 |
| τ²-Bench (airline) (avg@4) | 46.00 | 36.00 | 54.20 | 56.00 | 60.00 | 41.50 | 58.00 |
| τ²-Bench (retail) (avg@4) | 64.90 | 70.50 | 70.80 | 74.10 | 80.00 | 64.80 | 71.27 |
| AceBench (acc) | 69.70 | 71.10 | 82.20 | 80.10* | 76.20* | 74.50* | 76.10 |
| VitaBench (avg@4) | 20.30 | 8.50 | 18.20 | 19.00 | 23.00 | 8.00 | 24.30 |
| Safety | | | | | | | |
| Harmful | 82.79 | 80.82 | 53.91 | 56.19 | 66.56 | - | 83.98 |
| Criminal | 87.83 | 89.13 | 77.19 | 81.58 | 87.58 | - | 91.24 |
| Misinformation | 83.17 | 77.76 | 42.68 | 45.49 | 54.91 | - | 81.72 |
| Privacy | 98.80 | 98.80 | 96.39 | 98.80 | 100.00 | - | 93.98 |
5 Training Infrastructures

The core design principle of our training infrastructure is scalability with precision. We developed a systematic method to verify operator precision and embedded online Silent Data Corruption (SDC) detection into idle computation phases to minimize numerical errors. To guarantee reproducibility and ensure consistent results between small-scale experiments and full-scale training, we enforced determinism across all computation and communication operators. This enabled bitwise-aligned loss values across multiple re-runs of any training step.

With correctness ensured, we focused on accelerating training efficiency. Wall-clock time is critical for rapid algorithm iteration, yet a single accelerator provides limited capability. We therefore scaled training across tens of thousands of accelerators, confronting challenges in scalability and stability. Through model-system co-design, multi-dimensional parallelism, and fully automated fault detection and recovery, we achieved near-linear scaling and 98.48% availability, completing training within 30 days.

5.1 Numerical Precision Control and Fault Detection

ULP Evaluation Floating-point errors are influenced by multiple factors and even vary between accelerators of the same vendor across generations. To quantify and mitigate these errors, we adopt ULP (Unit in the Last Place) as a metric, where the ULP error measures the deviation of accelerator BF16 results from the CPU FP32 ground truth. A zero ULP error indicates perfect accuracy, while larger values imply worse precision. We collect all operator types and shapes used in training and compare their ULP errors. Table 4 compares the GEMM ULP errors of two candidate solutions.

Table 4: GEMM Precision Comparison (ULP)

| Case | Output Shape   | Value Range | Solution 1 Max | Solution 1 Min | Solution 2 Max | Solution 2 Min |
| 1    | [1024,1536]    | [-5,5]      | 2292           | -568           | 112            | -100           |
| 2    | [1024,576]     | [-5,5]      | 65362          | -82046         | 6.5            | -9             |
| 3    | [1024,16384]   | [-19,15]    | 544            | -104           | 224            | -112           |
| 4    | [1024,12288]   | [-4,4]      | 202            | -88            | 72             | -41            |
| 5    | [1024,6144]    | [-1,1]      | 5376           | -1376          | 304            | -224           |
| 6    | [1024,24576]   | [-5,5]      | 7200           | -510           | 104            | -294           |
| 7    | [1024,131072]  | [-5,5]      | 8128           | -6976          | 2528           | -368           |
| 8    | [1024,6144]    | [-1,1]      | 5344           | -8064          | 80             | -258           |

SDC Detection Mechanism SDC faults are typically unavoidable in large-scale training and can severely degrade model performance by altering data without any system warning. To address this, we implement an efficient on-chip, in-place operator recomputation mechanism. Specifically, we find that the backward computation of FlashAttention Gradients (FAG) is most sensitive to SDC because it mixes tensor and vector computations; bitwise differences between the original and recomputed results indicate potential SDC risks. The detection computations are orchestrated within the compute streams, and the recomputation interval is manually adjustable, enabling a flexible trade-off between detection coverage and computational cost.

Notably, operator precision control is necessary but not sufficient for ensuring model accuracy. Experiments with different operator implementations may show training loss discrepancies of only 1e-4∼1e-3 yet exhibit more than 5 pp variation on benchmarks. Cost-effectively evaluating the impact of operator precision errors on model performance remains an open challenge.

5.2 Kernel Optimization for Determinism and Performance

Determinism serves as the gold standard for computational correctness, eliminating floating-point errors as experimental variables. However, achieving determinism often incurs significant performance overhead.
We address this through kernel redesigns, maintaining deterministic computation and communication throughout LongCat-Flash's training.

Deterministic FAG The default FAG implementation is non-deterministic because dQ, dK, and dV are reduced along different dimensions using atomic additions, which do not preserve accumulation order. We develop an efficient deterministic FAG kernel that uses a limited amount of extra workspace to accumulate tiles in a fixed order. With co-optimizations including double-buffer pipelining, tuned tiling schedules, and load balancing, our implementation achieves 1.6x the performance of the original deterministic version and 0.95x that of the non-deterministic version, striking a balance between determinism and efficiency.
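To illustrate why accumulation order matters, the following CPU-level sketch (ours, not the actual kernel) mimics an atomicAdd-style reduction with a randomized accumulation order in BF16 and contrasts it with a fixed-order FP32 workspace that is down-cast once at the end, which is the same principle the deterministic FAG kernel relies on.

```python
import torch

torch.manual_seed(0)
tiles = torch.randn(4096, 128)  # stand-in for per-tile partial gradients (e.g., dK tiles)

def atomic_style_sum(x: torch.Tensor) -> torch.Tensor:
    # Accumulate in BF16 in a random order, mimicking atomic additions whose
    # ordering varies from run to run on the accelerator.
    acc = torch.zeros(x.shape[1], dtype=torch.bfloat16)
    for i in torch.randperm(x.shape[0]):
        acc = (acc.float() + x[i]).to(torch.bfloat16)
    return acc

def ordered_sum(x: torch.Tensor) -> torch.Tensor:
    # Fixed tile order plus an FP32 workspace, down-cast once at the end:
    # identical inputs always yield a bitwise-identical result.
    acc = torch.zeros(x.shape[1], dtype=torch.float32)
    for i in range(x.shape[0]):
        acc += x[i]
    return acc.to(torch.bfloat16)

print(torch.equal(atomic_style_sum(tiles), atomic_style_sum(tiles)))  # typically False
print(torch.equal(ordered_sum(tiles), ordered_sum(tiles)))            # always True
```

The production kernel obtains the same effect with a small on-chip workspace and tile-ordered accumulation rather than a serial loop.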
Deterministic ScatterAdd ScatterAdd in backward passes is essential for gradient aggregation but suffers from a mismatch between input and output operand counts. The default implementation enforces sequential execution within a single compute unit, causing up to a 50x slowdown. We propose a hierarchical reduction algorithm that parallelizes gradient aggregation across all available processors, achieving performance parity with the non-deterministic version.

Optimized Grouped GEMM Grouped GEMM performance is critical given its high computational volume but low compute density compared with dense GEMM. We optimize it via: (1) double-buffer pipelining to overlap computation, memory I/O, and the epilogue; (2) diagonal tiling to mitigate L2 cache conflicts; and (3) HBM bandwidth control via compute-unit limits to overlap Grouped GEMM with dispatch/combine communication. These optimizations yield 5%–45% speedups over the default version.

Fused GemmAdd The dW computation suffers from bandwidth-bound bottlenecks during gradient accumulation. We fuse the FP32 addition into the GEMM epilogue, avoiding intermediate write-backs and hiding the addition within the tiled GEMM pipeline. This significantly reduces latency and eliminates the precision loss incurred when intermediate results are down-converted to BF16 before being written back to HBM, achieving a speedup of 3.12x to 3.86x on the fused GroupedGemmAdd benchmark.

Furthermore, we re-implement IO-bound kernels (e.g., MoE layer permute/unpermute) with integrated functionalities such as drop-token and zero-computation expert handling, ensuring both determinism and performance.

5.3 Distributed Strategy for Large-scale Training

The training architecture is centered on Expert Parallelism (EP) groups, each comprising 32 accelerators. Within an EP group, the attention layer employs Context Parallelism (CP=8) instead of Tensor Parallelism (TP) to minimize communication overhead, and the FFN layer uses EP partitioning without TP. Multiple EP groups are scaled across the Pipeline Parallelism (PP) and Data Parallelism (DP) dimensions.

Expert parallelism is adopted to reduce static memory usage, including weights and optimizer states. However, EP inherently introduces costly dispatch and combine communication operations. To mitigate this, LongCat-Flash adopts the ScMoE structure, which enables dispatch/combine communication to be overlapped with more computation within a single batch. Furthermore, the MoE layer is divided into two chunks along the token dimension. These subchunks achieve two objectives: (1) overlap with the dense FFN computation, and (2) overlap with each other (see Figure 8).

Figure 8: Compute- and EP-communication-stream timelines for (a) two layers of a typical MoE, (b) a ScMoE layer, and (c) a ScMoE layer with chunking. These architectures have the same total and activated number of experts. ScMoE with chunking achieves the highest efficiency because more communication is overlapped by computation.

There are two optimized strategies for dispatch/combine communication: (1) a pipelined all-gather/reduce-scatter kernel covering both intra-node and inter-node communication; (2) an optimized all-to-all kernel.
The native all-to-all expands the local data size by top-k times, increasing traffic through the 200 Gb/s-per-accelerator RDMA network. Additionally, all-to-all performance is unstable due to inadequate congestion control. We select the pipelined all-gather/reduce-scatter kernel with
deterministic execution as the primary solution; with the ScMoE architecture, the proportion of time spent on non-overlapped dispatch/combine communication decreases from 25.3% to 8.4%.

Existing pipeline strategies (e.g., 1F1B, interleaved-1F1B, Zero-bubble [Qi and Others, 2023]) suffer from imbalanced memory usage across pipeline stages. To address this, we adopt the V-ZB algorithm [Qi et al., 2024], which balances memory usage across all stages and reduces peak memory to less than 60GB in the training of LongCat-Flash. Additionally, we enable the post-validation strategy from zero bubble, achieving zero theoretical bubbles. A key refinement is replacing inverse operations with backup data from the previous step during optimizer state rollback, preserving numerical bitwise alignment.

5.4 Reliability and Observability

Reliability is measured by the proportion of time contributing to the final training trajectory (availability), where unavailable time includes fault recovery and the wasted time between the last checkpoint and the fault occurrence. Asynchronous checkpointing reduces the training stall to 2∼4 seconds, allowing a higher checkpointing frequency and minimizing fault-induced loss. Combined with online critical-log filtering, optimized initialization, and full automation, recovery time is reduced to under 10 minutes. These mechanisms achieve 98.48% availability, with all 20 faults handled automatically without manual intervention.

Observability combines fine- and coarse-grained profiling with a metric platform. Fine-grained PyTorch profiler timelines enable distributed, parallelism-aware co-analysis to identify pipeline-parallelism "bubbles" and inter-rank communication waits. Coarse-grained monitoring adds low-overhead runtime analysis of stragglers. The metric platform tracks loss, weights, gradients, and activations for rapid model-state assessment.

6 Inference and Deployment

LongCat-Flash employs a model-system co-design, which significantly contributes to its high throughput and low latency. This section focuses on inference optimizations implemented in one of our deployment clusters, presenting methods that simultaneously boost system throughput and cut latency enough to reach 100 TPS on H800. We first present our parallel inference architecture co-designed with the model architecture. Following the inference architecture, optimization methods such as quantization and custom kernels are described. Finally, we present our deployment strategy and performance results.

6.1 Model-Specific Inference Optimization

To achieve an efficient inference system, two key challenges must be addressed: (1) computation and communication orchestration, and (2) KV cache I/O and storage. For the first challenge, existing approaches typically exploit parallelism at three conventional granularities: operator-level overlap as in NanoFlow [Zhu et al., 2025], expert-level overlap represented by EPS-MoE [Qian et al., 2025], and layer-level overlap demonstrated in DeepSeek-V3 TBO (Two Batch Overlap) [Team, 2025b]. LongCat-Flash's ScMoE architecture introduces a fourth dimension, module-level overlap, for which we designed the SBO (Single Batch Overlap) scheduling strategy to optimize both latency and throughput. For the second challenge, KV cache I/O and storage, LongCat-Flash relies on architectural choices in its attention mechanism and MTP structure to reduce the effective I/O overhead.
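For a rough sense of how much the attention design affects challenge (2), the helper below compares per-token KV-cache footprints for a GQA-style layout and an MLA-style compressed latent. All dimensions here are illustrative placeholders, not LongCat-Flash's actual configuration.

```python
def kv_bytes_per_token(n_layers: int, entries_per_layer: int, bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache stored per token (BF16 elements by default)."""
    return n_layers * entries_per_layer * bytes_per_elem

# GQA-style: cache K and V for each KV head (hypothetical 8 KV heads, head_dim 128).
gqa = kv_bytes_per_token(n_layers=60, entries_per_layer=2 * 8 * 128)
# MLA-style: cache one compressed latent plus a small decoupled positional key
# (hypothetical 512 + 64 entries per token per layer).
mla = kv_bytes_per_token(n_layers=60, entries_per_layer=512 + 64)

print(f"GQA-style: {gqa / 1024:.1f} KiB/token, MLA-style: {mla / 1024:.1f} KiB/token, "
      f"reduction: {gqa / mla:.1f}x")
```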
6.1.1 Computation and Communication Orchestration

LongCat-Flash naturally exhibits computation-communication overlap properties in its structure, which is the key to achieving lower latency while maintaining generation throughput. We carefully design Single Batch Overlap (SBO), a four-stage pipelined execution that uses module-level overlap to fully unleash LongCat-Flash's potential, as shown in Figure 9. SBO differs from TBO by hiding communication overhead within a single batch. In SBO, stage 1 must execute separately because the MLA output serves as input for subsequent stages. In stage 2, we overlap the all-to-all dispatch with the Dense FFN and Attn 0 (QKV projection); this overlap is crucial because the communication overhead would otherwise be excessive, which is why we split the attention computation. Stage 3 independently executes the MoE GEMM, whose latency benefits from the wide EP deployment strategy. In stage 4, we overlap Attn 1 (core attention and output projection) and the Dense FFN with the all-to-all combine. This orchestration effectively mitigates the communication overhead, ensuring efficient inference for LongCat-Flash. Additionally, the ScMoE architecture, under the wide EP deployment scheme, facilitates the overlap of intra-node NVLink bandwidth utilization and inter-node RDMA communication through GPUDirect RDMA [Choquette, 2022], thereby improving overall bandwidth efficiency. Dense FFN in ScMoE has a relatively large intermediate size, so TP
deployment is employed to minimize the memory footprint, which necessitates all-gather and reduce-scatter communication before and after the Dense FFN, respectively. To reduce this communication overhead, we develop custom kernels and adopt TP2 or TP4 instead of TP8.

Figure 9: An overview of the overlapping strategy. (Legend: Attn 0 = QKV Projection; Attn 1 = Core Attention & Output Projection; AG = all-gather; RS = reduce-scatter; A2A = all-to-all; LN = LayerNorm; AG/RS use NVLS multimem.st / multimem.ld_reduce; A2A dispatch/combine use GPUDirect over the NIC.)

6.1.2 Speculative Decoding

LongCat-Flash employs MTP as the draft model for speculative decoding. Our optimization framework originates from a systematic breakdown of the speculative decoding speedup formulation given by Sadhukhan et al. [2025]:

T_{\mathrm{Avg}}^{SD} = \frac{T_T}{\Omega(\gamma, \alpha)} \left( \gamma \cdot \frac{T_D}{T_T} + \frac{T_V(\gamma)}{T_T} \right),

where T_{\mathrm{Avg}}^{SD}, T_T, and T_D are the expected per-token latencies of speculative decoding, the target model, and the draft model, respectively; γ is the number of draft tokens per decoding step; Ω(γ, α) is the expected accept length for a given γ and acceptance rate α; and T_V(γ) is the expected latency of target verification. Our approach targets three key factors:

• Expected accept length Ω(γ, α), which is positively correlated with the acceptance rate α of draft tokens. To maximize α, we employ MTP and integrate a single MTP head during late-phase pre-training, achieving approximately 90% acceptance rate on test sets.

• Draft-to-target cost ratio T_D / T_T, which is dominated by the structures of both the target and the draft model. As noted by Liu et al. [2024d], balancing draft quality and speed is critical. To minimize generation overhead while maintaining a comparable acceptance rate, LongCat-Flash adopts a lightweight MTP architecture with reduced parameters. Our experiments (Table 5) show that a single dense layer for the MTP head optimizes this trade-off, outperforming ScMoE layers in latency.

• Target-verification-to-decoding cost ratio T_V(γ) / T_T. To reduce this ratio, we adopt the C2T method [Huo et al., 2025], using a classification model to filter out tokens that are unlikely to be accepted before verification.

Table 5: Draft token acceptance rate on MT-Bench of different MTP head structures with a 6B activated model. The ratio of MTP head parameters to main model parameters is also reported.

| MTP layer   | Activated parameters ratio | Acceptance rate α |
| Dense layer | 1.41%                      | 92.1%             |
| ScMoE layer | 4.17%                      | 92.9%             |

6.1.3 Reducing KV Cache

To balance performance and efficiency, LongCat-Flash adopts MLA with 64 heads for its attention mechanism, which reduces the computational load of the attention component while achieving exceptional KV cache compression, thus reducing storage and bandwidth pressure. This is crucial for orchestrating LongCat-Flash's pipeline because, as noted in Figure 9,
the model always features an attention computation that cannot be overlapped with communication. Specifically, the MQA-like structure of the MLA absorb method shares KV across the m-dimension (64 heads), aligning with the shape of the WGMMA instruction for maximal hardware utilization.

6.2 System-Wide Inference Techniques

6.2.1 Minimize Schedule Overhead

The decoding phase in LLM inference systems can become launch-bound due to kernel launch overhead. This issue is exacerbated when introducing speculative decoding, particularly with LongCat-Flash's lightweight MTP, where separate scheduling of verification kernels and draft forward passes introduces significant overhead. To mitigate this, a TVD fusing strategy is used to fuse the Target forward, Verification, and Draft forward into a single CUDA graph. To further improve GPU utilization, we implement an overlapped scheduler. However, experimental results reveal that the low latency of LongCat-Flash's forward pass renders a single-step pre-schedule strategy insufficient to fully eliminate scheduling overhead. As shown in Figure 10, we therefore introduce a multi-step overlapped scheduler that launches the kernels for multiple forward steps in a single scheduling iteration. This approach effectively hides CPU scheduling and synchronization within the GPU forward process, ensuring continuous GPU occupancy.

Figure 10: Multi-step overlapped scheduler (4 steps as an example): while the GPU executes the current group of forward steps, the CPU schedules the next 4 steps.

In a multi-step overlapped scheduler, we need to dynamically pre-allocate KV cache slots for multiple future steps without prior knowledge of the accept lengths of speculative decoding in previous iterations. An important question is whether multi-step overlapped scheduling causes divergent KV cache allocation. We illustrate this with MTP = 1 and the number of steps n = 4. Let R_i denote the number of available KV entries during the GPU's i-th forward iteration, so R_0 = (MTP + 1) × n = 2n. Let U_{i,s} ∈ [1, 2] denote the accept length of step s in the i-th iteration, with initial value U_{-1,s} = 2. While the GPU is performing the i-th forward iteration, the scheduler pre-allocates the KV cache slots needed for the (i + 1)-th forward iteration based on the accept lengths observed in the (i − 1)-th iteration, where A_i denotes the number of KV cache slots allocated. Formally:

A_i = \sum_{s=0}^{n-1} U_{i-1,s}, \quad i \ge 0,

R_i = R_{i-1} - \sum_{s=0}^{n-1} U_{i-1,s} + A_{i-1}, \quad i \ge 1.

By induction, we obtain the closed-form expression

R_i = 4n - \sum_{s=0}^{n-1} U_{i-1,s}, \quad i \ge 1,

which implies R_i ∈ [2n, 3n] for all i ≥ 1. This ensures safe KV cache allocation for the next iteration even without knowing the current iteration's accept lengths, while guaranteeing that the allocated KV cache size remains bounded.
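A small simulation of this recurrence under the stated assumptions (MTP = 1, n = 4, accept lengths drawn uniformly from {1, 2}) confirms the bound empirically; the function name and random sampling below are ours, for illustration only.

```python
import random

def simulate_kv_preallocation(n: int = 4, mtp: int = 1, iterations: int = 10_000, seed: int = 0):
    """Track R_i under A_i = sum_s U_{i-1,s} and R_i = R_{i-1} - sum_s U_{i-1,s} + A_{i-1}."""
    rng = random.Random(seed)
    R = (mtp + 1) * n                                        # R_0 = 2n available KV entries
    U_prev2 = [mtp + 1] * n                                  # U_{-1,s} = 2
    U_prev1 = [rng.randint(1, mtp + 1) for _ in range(n)]    # U_0, observed during iteration 0
    lo, hi = R, R
    for _ in range(1, iterations):
        A_prev = sum(U_prev2)                                # A_{i-1}, computed from U_{i-2}
        R = R - sum(U_prev1) + A_prev                        # R_i, available during iteration i
        lo, hi = min(lo, R), max(hi, R)
        U_prev2, U_prev1 = U_prev1, [rng.randint(1, mtp + 1) for _ in range(n)]
    return lo, hi

print(simulate_kv_preallocation())   # observed range stays within [2n, 3n] = (8, 12)
```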
6.2.2 Custom Kernels

The autoregressive nature of LLM inference creates distinct efficiency challenges. The prefilling phase is compute-bound, and methods like chunked prefill [Agrawal et al., 2023] regularize data for optimal processing. In contrast, the decoding phase is often memory-bound due to small, irregular batch sizes arising from traffic patterns, which hurts kernel performance. Optimizing these cases is therefore crucial for minimizing Time-Per-Output-Token (TPOT).

MoE GEMM Existing libraries such as DeepGEMM [Zhao et al., 2025a] map model weights to the right-hand matrix (B in A×B=C) aligned with the k/n dimensions, while input activations become the left-hand matrix mapped to the m/k dimensions, where m is the token count. This conventional approach requires padding when the token count falls below m's 64-element minimum granularity. To address this inefficiency, we leverage the SwapAB technique [Dege et al., 2025]: treating weights as the left-hand matrix and activations as the right-hand matrix. By exploiting the n-dimension's finer 8-element granularity, SwapAB maximizes tensor core utilization.

Communication Kernels The inference system leverages NVLink Sharp's hardware-accelerated broadcast (multimem.st) and in-switch reduction (multimem.ld_reduce) to minimize data movement and SM occupancy, as shown in Figure 9. Using inline PTX assembly, the reduce-scatter and all-gather kernels enable high-efficiency data transmission. These kernels support both uniform and non-uniform token distributions across GPUs, and consistently outperform NCCL [NVIDIA] and MSCCL++ [Shah et al., 2025] across 4KB to 96MB message sizes, using only 4 thread blocks.

6.2.3 Quantization

LongCat-Flash employs the same quantization scheme as DeepSeek-V3, using fine-grained block-wise quantization: activations per [1,128] block and weights per [128,128] block. In addition, to achieve an optimal performance-accuracy trade-off, we applied layer-wise mixed-precision quantization based on two methodologies. The first follows our approaches in FPTQ [Li et al., 2023b] and Super-Expert [Su et al., 2025], where we observed that certain linear layers (particularly Downproj) exhibit input activations with extreme magnitudes reaching 10^6. The second involves computing block-wise FP8 quantization errors (both relative and absolute) layer by layer, which revealed significant quantization errors in specific expert layers. By taking the intersection of both schemes, we achieved substantial accuracy improvements.

6.3 Deployment and Performance

6.3.1 Measured Performance

Table 6: Performance of LongCat-Flash under different settings.

| Model               | Attention | Avg Context | #Hopper GPUs | TGS  | TPS/u |
| DeepSeek-V3-profile | bf16      | 4096        | 128          | 2324 | 20    |
| DeepSeek-V3-blog    | bf16      | 4989        | 144          | 1850 | 20~22 |
| LongCat-Flash       | bf16      | 5000        | 128          | 3785 | 35    |
| LongCat-Flash       | bf16      | 5000        | 128          | 2205 | 68.9  |
| LongCat-Flash       | bf16      | 5000        | 128          | 804  | 100.5 |
| LongCat-Flash       | fp8       | 5000        | 128          | 4230 | 26.4  |
| LongCat-Flash       | fp8       | 8192        | 128          | 3240 | 33.8  |

To enable independent optimization of the prefilling and decoding phases, a PD-disaggregated architecture is adopted. A key challenge in this design is the overhead of transmitting KV caches from prefilling to decoding nodes. To mitigate this, we implement layer-wise transmission, which significantly reduces Time-To-First-Token (TTFT) under high-QPS workloads. For both prefilling and decoding nodes, the minimum deployment unit consists of 2 nodes with 16 H800-80GB GPUs.
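As a minimal sketch of the fine-grained block-wise scheme from Section 6.2.3 (per-[1,128] activation blocks and per-[128,128] weight blocks), the snippet below applies per-block absmax scaling and casts to FP8 E4M3. It assumes a PyTorch build with float8 dtypes and shapes divisible by the block size; it is illustrative, not the production kernel.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest finite magnitude of torch.float8_e4m3fn

def quantize_blockwise(x: torch.Tensor, block: tuple) -> tuple:
    """Per-block absmax scaling to the FP8 E4M3 range; returns (fp8 blocks, scales)."""
    r, c = block
    R, C = x.shape
    blocks = x.reshape(R // r, r, C // c, c)
    scale = blocks.abs().amax(dim=(1, 3), keepdim=True).clamp_min(1e-12) / FP8_E4M3_MAX
    return (blocks / scale).to(torch.float8_e4m3fn), scale

def dequantize_blockwise(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Inverse of quantize_blockwise, back to an FP32 matrix."""
    R, C = q.shape[0] * q.shape[1], q.shape[2] * q.shape[3]
    return (q.float() * scale).reshape(R, C)

w = torch.randn(1024, 4096)                 # a weight matrix, quantized per [128,128] block
a = torch.randn(8, 4096)                    # activations, quantized per [1,128] block
wq, ws = quantize_blockwise(w, (128, 128))
aq, a_scale = quantize_blockwise(a, (1, 128))
print((dequantize_blockwise(wq, ws) - w).abs().max())   # small block-wise rounding error
```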
Meanwhile, wide EP is deployed with DeepEP [Zhao et al., 2025b] to minimize communication overhead. In addition, we modify DeepEP and EPLB (the Expert Parallelism Load Balancer) to support zero-computation experts, whose outputs can be obtained without communication. Table 6 compares the throughput and latency of LongCat-Flash with DeepSeek-V3 (DeepSeek-V3-profile from DeepSeek [2025a], DeepSeek-V3-blog from DeepSeek [2025b]), where TGS (tokens per GPU per second) represents the generation throughput per device (higher values indicate lower cost), and TPS/u (tokens per second per user) represents the generation speed for one user (higher is better). During testing, the steady-state generation throughput
under a given sequence length is used for calculation. LongCat-Flash achieves higher generation throughput and faster generation speed across different sequence lengths.

In Agent applications based on the ReACT [Yao et al., 2023] pattern, completing a single task requires multiple rounds of model interaction, where interaction latency directly impacts user experience. Analysis of typical Agent invocation patterns reveals differentiated speed requirements for model outputs:

• Reasoning content (user-visible): cognitive processes and explanations that only need to match human reading speed (roughly 20 tokens/s).
• Action commands (user-invisible): structured data such as function names and parameters, typically 30~100 tokens, yet they directly affect tool-invocation startup time, demanding the highest possible speed.

To address this scenario, LongCat-Flash achieves a generation speed of nearly 100 tokens/s for action commands. Under a cost assumption of $2 per hour for an H800 GPU, this translates to a price of $0.7 per million output tokens. This performance constrains the single-round tool-call latency to under one second, thereby significantly enhancing the interactivity of Agent applications.

6.3.2 Theoretical Performance

Figure 9 shows that LongCat-Flash's latency is primarily determined by three components:

• MLA: its time consumption cannot be reduced by increasing the EP degree.
• All-to-all dispatch/combine: both are constrained by the single-device batch size and top-k.
• MoE: its time consumption in the memory-bound region decreases as the EP count increases.

We assume EP=128, that MLA uses DP for DeepSeek-V3 and LongCat-Flash, that GQA uses TP4 for Qwen3-235B-A22B since it has 4 KV heads, and a batch size of 96 per device. In practice, the GQA design of Qwen3-235B-A22B results in a relatively high KV cache memory footprint, making a per-GPU batch size of 96 difficult to achieve; the assumption is made here solely for theoretical analysis. As pointed out by [Jiashi Li, 2025], FlashMLA can achieve up to 660 TFlops on NVIDIA H800 SXM5 GPUs, and Zhao et al. [2025b] indicate that DeepEP bandwidth can reach 40GB/s; both metrics are used in our computations. Assuming a cost of $2 per hour per H800 and MTP=1 with an acceptance rate of 80%, we calculate the theoretical time consumption and cost of each module in one layer of DeepSeek-V3, Qwen3-235B-A22B, and LongCat-Flash, as listed in Table 7. For Qwen3-235B-A22B, which does not natively support MTP, we assume a speculative sampling strategy with a comparable acceptance rate.

Table 7: Theoretical decoding time and cost of different models.

|                     | DeepSeek-V3 | Qwen3-235B-A22B | LongCat-Flash |
| MTP                 | w/          | w/o             | w/            |
| n_layer             | 61          | 94              | 28            |
| batch per device    | 96          | 96              | 96            |
| Time cost of different modules in one layer |  |  |  |
| attention           | 471 us      | 314 us          | 264 us        |
| all-to-all dispatch | 275 us      | 157 us          | 236 us        |
| MoE                 | 77 us       | 29 us           | 60 us         |
| all-to-all combine  | 551 us      | 315 us          | 472 us        |
| TPOT and price      |             |                 |               |
| overlap strategy    | TBO         | TBO             | SBO           |
| TPOT (ms)           | 30          | 26.2            | 16            |
| $/1M output token   | 0.17        | 0.15            | 0.09          |

Under this configuration, the theoretical extreme TPOT for LongCat-Flash with SBO can be expressed as:

TPL = 264 + 236 + 60 + 472 = 1032 us,

TPOT = \frac{28 \times TPL}{1000 \times 1.8} \approx 16 ms,
where TPL denotes the per-layer time cost; the factor 1.8 reflects MTP = 1 with an 80% acceptance rate (1.8 tokens accepted per step on average), and 1000 converts microseconds to milliseconds. The measured value under batch size 96 is approximately TPOT = 26 ms, i.e., the achieved generation speed is about 61.5% of the theoretical limit, on par with DeepSeek-V3 (~64%). The gap between the measured and theoretical speed mainly comes from the overhead of small operators and losses in communication bandwidth. We apply the same method to calculate the theoretical limits of TPOT and generation cost for DeepSeek-V3 and Qwen3-235B-A22B under TBO scheduling. As Table 7 shows, through model-system co-design, LongCat-Flash achieves significant theoretical improvements in both throughput and latency.

Furthermore, we observe two key insights about LongCat-Flash: (1) LongCat-Flash exposes on its critical path not only the all-to-all communication and MoE computation but also one MLA computation. As a result, at the same batch size, LongCat-Flash incurs slightly longer per-layer time than DeepSeek-V3. However, due to its significantly reduced layer count, LongCat-Flash achieves lower overall latency. (2) LongCat-Flash's second MLA is overlapped by the all-to-all combine. This means that, in the decoding phase, LongCat-Flash can increase the sequence length to a certain extent without a substantial latency increase.

7 Conclusion

We introduce LongCat-Flash, a 560B-parameter MoE model featuring three key innovations: (1) a context-aware dynamic computation mechanism and shortcut-connected MoE, enabling high efficiency in both training and inference; (2) integrated strategies that ensure stable large-scale training; and (3) a multi-stage training pipeline that cultivates LongCat-Flash's agentic capabilities, allowing it to perform complex tasks requiring iterative reasoning and environmental interaction. By releasing LongCat-Flash as an open-source model, we aim to advance research in efficient MoE architectures, high-quality data strategies, and agentic model development, fostering community-driven innovation in large language models.
29. LongCat-Flash Technical Report 8 Contributions The listing of authors is in alphabetical order. Names marked with an asterisk (*) indicate people who have left our team. Bayan Jiahuan Li Qiyuan Duan Xuemiao Zhang Bei Li Jiajun Yang Ran Meng Xueyuan Hao Bingye Lei Jiaming Wang Rongxiang Weng Xuezhi Cao Bo Wang Jian Yang Ruichen Shao Xunliang Cai Bolin Rong Jianchao Tan Rumei Li Xurui Yang Chao Wang Jiaqi Sun Shizhe Wu Yan Feng Chao Zhang Jiaqi Zhang Shuai Liang Yang Bai Chen Gao Jiawei Fu Shuo Wang Yang Chen Chen Zhang Jiawei Yang Suogui Dang Yang Yang Cheng Sun Jiaxi Hu Tao Fang Yaqi Huo Chengcheng Han Jiayu Qin Tao Li Yerui Sun Chenguang Xi Jingang Wang Tefeng Chen Yifan Lu Chi Zhang Jiyuan He Tianhao Bai Yifan Zhang Chong Peng Jun Kuang Tianhao Zhou Yipeng Zang Chuan Qin Junhui Mei Tingwen Xie Yitao Zhai Chuyu Zhang Kai Liang Wei He Yiyang Li Cong Chen Ke He Wei Huang Yongjing Yin Congkui Wang Kefeng Zhang Wei Liu Yongkang Lv Dan Ma Keheng Wang Wei Shi Yongwei Zhou Daoru Pan Keqing He* Wei Wang Yu Yang Defei Bu Liang Gao Wei Wu Yuchen Xie Dengchang Zhao Liang Shi Weikang Zhao Yueqing Sun Deyang Kong Lianhui Ma Wen Zan Yuewen Zheng Dishan Liu Lin Qiu Wenjie Shi Yuhua Wei Feiye Huo Lingbin Kong Xi Nan Yulei Qian Fengcun Li Lingtong Si Xi Su Yunfan Liang Fubao Zhang Linkun Lyu Xiang Li Yunfang Tai Gan Dong Linsen Guo Xiang Mei Yunke Zhao Gang Liu Liqi Yang Xiangyang Ji Zeyang Yu Gang Xu Lizhi Yan Xiangyu Xi Zhao Zhang Ge Li Mai Xia Xiangzhou Huang Zhaohua Yang Guoqiang Tan Man Gao Xianpeng Li Zhenchao Zhang Guoyuan Lin Manyuan Zhang Xiao Fu Zhikang Xia Haihang Jing Meng Zhou Xiao Liu Zhiye Zou Haomin Fu Mengxia Shen Xiao Wei Zhizhao Zeng Haonan Yan Mingxiang Tuo Xiaodong Cai Zhongda Su Haoxing Wen Mingyang Zhu Xiaolong Chen Zhuofan Chen Haozhe Zhao Peiguang Li Xiaoqing Liu Zijian Zhang Hong Liu Peng Pei Xiaotong Li Ziwen Wang Hongmei Shi* Peng Zhao Xiaowei Shi Zixu Jiang Hongyan Hao Pengcheng Jia Xiaoyu Li Zizhe Zhao Hongyin Tang Pingwei Sun Xili Wang Zongyu Wang Huantian Lv Qi Gu Xin Chen Zunhai Su* Hui Su Qianyun Li Xing Hu LongCat-Flash Jiacheng Li Qingyuan Li* Xingyu Miao Jiahao Liu Qiong Huang Xinyan He References DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:arXiv preprint arXiv:2412.19437, 2025. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025. 29
30. LongCat-Flash Technical Report Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi k2: Open agentic intelligence. arXiv preprint arXiv:2507.20534, 2025. Weilin Cai, Juyong Jiang, Le Qin, Junwei Cui, Sunghun Kim, and Jiayi Huang. Shortcut-connected expert parallelism for accelerating mixture-of-experts. arXiv preprint arXiv:2404.05019, 2024. Jiaming Wang, Yunke Zhao, Peng Ding, Jun Kuang, Zongyu Wang, Xuezhi Cao, and Xunliang Cai. Ask, fail, repeat: Meeseeks, an iterative feedback benchmark for llms’ multi-turn instruction-following ability. arXiv preprint arXiv:2504.21625, 2025a. Peng Jin, Bo Zhu, Li Yuan, and Shuicheng Yan. MoE++: Accelerating mixture-of-experts methods with zero- computation experts. arXiv preprint arXiv:2410.07348, 2024. Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024a. Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, 2023. Zihao Zeng, Yibo Miao, Hongcheng Gao, Hao Zhang, and Zhijie Deng. AdaMoE: Token-adaptive routing with null experts for mixture-of-experts language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, 2024. Lean Wang, Huazuo Gao, Chenggang Zhao, Xu Sun, and Damai Dai. Auxiliary-loss-free load balancing strategy for mixture-of-experts. arXiv preprint arXiv:2408.15664, 2024a. Stuart Bennett. A History of Control Engineering 1930-1955. Peter Peregrinus, GBR, 1st edition, 1993. ISBN 0863412998. Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation AI scale. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, 2022. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. 2017. Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023. Yusuxke Shibata, Takuya Kida, Shuichi Fukamachi, Masayuki Takeda, Ayumi Shinohara, Takeshi Shinohara, and Setsuo Arikawa. Byte pair encoding: A text compression scheme that accelerates pattern matching. 1999. Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015. Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, and Gabriel Synnaeve. Better & faster large language models via multi-token prediction. arXiv preprint arXiv:2404.19737, 2024. Katie Everett, Lechao Xiao, Mitchell Wortsman, Alexander A. Alemi, Roman Novak, Peter J. Liu, Izzeddin Gur, Jascha Sohl-Dickstein, Leslie Pack Kaelbling, Jaehoon Lee, and Jeffrey Pennington. Scaling exponents across parameterizations and optimizers. arXiv preprint arXiv:2407.05872, 2024. Tianqi Chen, Ian Goodfellow, and Jonathon Shlens. Net2net: Accelerating learning via knowledge transfer. arXiv preprint arXiv:1511.05641, 2015. 
Wenyu Du, Tongxu Luo, Zihan Qiu, Zeyu Huang, Yikang Shen, Reynold Cheng, Yike Guo, and Jie Fu. Stacking your transformers: A closer look at model growth for efficient LLM pre-training. arXiv preprint arXiv:2405.15319, 2024. Peihao Wang, Rameswar Panda, Lucas Torroba Hennigen, Philip Greengard, Leonid Karlinsky, Rogerio Feris, David Daniel Cox, Zhangyang Wang, and Yoon Kim. Learning to grow pretrained models for efficient trans- former training. arXiv preprint arXiv:2303.00980, 2023a. Sheng Shen, Pete Walsh, Kurt Keutzer, Jesse Dodge, Matthew Peters, and Iz Beltagy. Staged training for transformer language models. In International Conference on Machine Learning, 2022. Yite Wang, Jiahao Su, Hanlin Lu, Cong Xie, Tianyi Liu, Jianbo Yuan, Haibin Lin, Ruoyu Sun, and Hongxia Yang. Lemon: Lossless model expansion. arXiv preprint arXiv:2310.07999, 2023b. Linyuan Gong, Di He, Zhuohan Li, Tao Qin, Liwei Wang, and Tieyan Liu. Efficient training of BERT by progressively stacking. In Proceedings of the 36th International Conference on Machine Learning, 2019. 30
31. LongCat-Flash Technical Report Dahyun Kim, Chanjun Park, Sanghoon Kim, Wonsung Lee, Wonho Song, Yunsu Kim, Hyeonwoo Kim, Yungi Kim, Hyeonju Lee, Jihoo Kim, et al. Solar 10.7 b: Scaling large language models with simple yet effective depth up-scaling. arXiv preprint arXiv:2312.15166, 2023. Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. ST-MoE: Designing stable and transferable sparse expert models. arXiv preprint arXiv:2202.08906, 2022. Mingjie Sun, Xinlei Chen, J. Zico Kolter, and Zhuang Liu. Massive activations in large language models. arXiv preprint arXiv:2402.17762, 2024. Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, et al. 2 olmo 2 furious. arXiv preprint arXiv:2501.00656, 2024. Biao Zhang and Rico Sennrich. Root mean square layer normalization. 2019. Adrien Barbaresi. Trafilatura: A web scraping library and command-line tool for text discovery and extraction. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, 2021. Xiangyu Xi, Deyang Kong, Jian Yang, Jiawei Yang, Zhengyu Chen, Wei Wang, Jingang Wang, Xunliang Cai, Shikun Zhang, and Wei Ye. Samplemix: A sample-wise pre-training data mixing strategey by coordinating data quality and diversity. arXiv preprint arXiv:2503.01506, 2025. Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024. Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. BGE M3-Embedding: Multi- lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. arXiv preprint arXiv:2402.03216, 2024. Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2021a. Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. arXiv preprint arXiv:2406.01574, 2024b. Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Jiayi Lei, Yao Fu, Maosong Sun, and Junxian He. C-Eval: A multi-level multi-discipline chinese evaluation suite for foundation models. In Advances in Neural Information Processing Systems, 2023. Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, and Timothy Baldwin. CMMLU: Measuring massive multitask language understanding in chinese. arXiv preprint arXiv:2306.09212, 2023a. David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof qa benchmark. arXiv preprint arXiv:2311.12022, 2023. M-A-P Team, ByteDance. SuperGPQA: Scaling LLM evaluation across 285 graduate disciplines. arXiv preprint arXiv:2502.14739, 2025. Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, and Jason Wei. Challenging BIG-bench tasks and whether chain-of- thought can solve them. 
In Findings of the Association for Computational Linguistics: ACL 2023, 2023. Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. PIQA: Reasoning about physical commonsense in natural language. arXiv preprint arXiv:1911.11641, 2019. Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. arXiv preprint arXiv:1903.00161, 2019. Liang Xu, Hai Hu, Xuanwei Zhang, Lu Li, Chenjie Cao, Yudong Li, Yechen Xu, Kai Sun, Dian Yu, Cong Yu, Yin Tian, Qianqian Dong, Weitang Liu, Bo Shi, Yiming Cui, Junyi Li, Jun Zeng, Rongzhao Wang, Weijian Xie, Yanting Li, Yina Patterson, Zuoyu Tian, Yiwen Zhang, He Zhou, Shaoweihua Liu, Zhe Zhao, Qipeng Zhao, Cong Yue, Xinrui Zhang, Zhengliang Yang, Kyle Richardson, and Zhenzhong Lan. CLUE: A Chinese language understanding evaluation benchmark. In Proceedings of the 28th International Conference on Computational Linguistics, 2020. Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. arXiv preprint arXiv:1907.10641, 2019. 31
32. LongCat-Flash Technical Report Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021. Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021b. Jiawei Liu, Songrun Xie, Junhao Wang, Yuxiang Wei, Yifeng Ding, and Lingming Zhang. Evaluating language models for efficient code generation. arXiv preprint arXiv:2408.06450, 2024b. Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney, Ming- Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q Feldman, Arjun Guha, Michael Greenberg, and Abhinav Jangda. MultiPL-E: A scalable and extensible approach to benchmarking neural code generation. arXiv preprint arXiv:2208.08227, 2022. Alex Gu, Baptiste Rozière, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, and Sida I. Wang. Cruxeval: A benchmark for code reasoning, understanding and execution. arXiv preprint arXiv:2401.03065, 2024. Meta AI. The llama 4 herd: The beginning of a new era of natively multimodal ai innovation, 2025. URL https: //ai.meta.com/blog/llama-4-multimodal-intelligence/. MoonshotAI. Kimi-K2 documentation, 2025. URL https://moonshotai.github.io/Kimi-K2/. Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021. Loubna Ben Allal, Niklas Muennighoff, Logesh Kumar Umapathi, Ben Lipkin, and Leandro von Werra. A framework for the evaluation of code generation models. https://github.com/bigcode-project/ bigcode-evaluation-harness, 2022. Tao Ge, Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, and Dong Yu. Scaling synthetic data creation with 1,000,000,000 personas. arXiv preprint arXiv:2406.20094, 2024. Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560, 2022. Yuxiang Wei, Federico Cassano, Jiawei Liu, Yifeng Ding, Naman Jain, Zachary Mueller, Harm de Vries, Leandro von Werra, Arjun Guha, and Lingming Zhang. Selfcodealign: Self-alignment for code generation. In Advances in Neural Information Processing Systems, 2024. Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. Wizardcoder: Empowering code large language models with evol-instruct. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024, 2024. Jin Jiang, Yuchen Yan, Yang Liu, Jianing Wang, Shuai Peng, Xunliang Cai, Yixin Cao, Mengdi Zhang, and Liangcai Gao. LogicPro: Improving complex logical reasoning via program-guided learning. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025. Chenxu Wang, Ping Jian, and Zhen Yang. Thought-path contrastive learning via premise-oriented data augmentation for logical reading comprehension. 
In AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25 - March 4, 2025, Philadelphia, PA, USA, 2025b. Junjie Ye, Caishuang Huang, Zhuohan Chen, Wenjie Fu, Chenyuan Yang, Leyi Yang, Yilong Wu, Peng Wang, Meng Zhou, Xiaolong Yang, et al. A multi-dimensional constraint framework for evaluating and improving instruction following in large language models. arXiv preprint arXiv:2505.07591, 2025. Yubo Wang, Xiang Yue, and Wenhu Chen. Critique fine-tuning: Learning to critique is more effective than learning to imitate. arXiv preprint arXiv:2501.17703, 2025c. Tong Mu, Alec Helyar, Johannes Heidecke, Joshua Achiam, Andrea Vallone, Ian Kivlichan, Molly Lin, Alex Beutel, John Schulman, and Lilian Weng. Rule based rewards for language model safety. In Advances in Neural Information Processing Systems, 2024. Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E Gonzalez, and Ion Stoica. From crowdsourced data to high-quality benchmarks: Arena-hard and benchbuilder pipeline. arXiv preprint arXiv:2406.11939, 2024a. Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E Gonzalez, and Ion Stoica. From live data to high-quality benchmarks: The arena-hard pipeline, April 2024b. URL https://lmsys.org/ blog/2024-04-19-arena-hard/. 32
33. LongCat-Flash Technical Report Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023. Shunyu Yao, Howard Chen, Austin W. Hanjie, Runzhe Yang, and Karthik R Narasimhan. COLLIE: Systematic con- struction of constrained text generation tasks. In The Twelfth International Conference on Learning Representations, 2024. Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, 2023. MAA. Aime 2024, 2024. URL https://maa.org/math-competitions/ american-invitational-mathematics-examination-aime. MAA. Aime 2025, 2025. URL https://artofproblemsolving.com/wiki/index.php/ AIMEProblemsandSolutions. ByteDance-Seed. Beyondaime: Advancing math reasoning evaluation beyond high school olympiads. https: //huggingface.co/datasets/ByteDance-Seed/BeyondAIME, 2025. David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, 2024. Bill Yuchen Lin, Ronan Le Bras, Kyle Richardson, Ashish Sabharwal, Radha Poovendran, Peter Clark, and Yejin Choi. Zebralogic: On the scaling limits of LLMs for logical reasoning. In Forty-second International Conference on Machine Learning, 2025. OpenAI. Graphwalks dataset, 2025a. URL https://huggingface.co/datasets/openai/graphwalks. Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by ChatGPT really correct? rigorous evaluation of large language models for code generation. In Advances in Neural Information Processing Systems, 2023. Jiawei Liu, Songrun Xie, Junhao Wang, Yuxiang Wei, Yifeng Ding, and Lingming Zhang. Evaluating language models for efficient code generation. In First Conference on Language Modeling, 2024c. Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. In The Thirteenth International Conference on Learning Representations, 2025. Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. SWE-bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations, 2024. The Terminal-Bench Team. Terminal-bench: A benchmark for ai agents in terminal environments, Apr 2025a. URL https://github.com/laude-institute/terminal-bench. Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan. τ 2 -bench: Evaluating conversational agents in a dual-control environment. arXiv preprint arXiv:2506.07982, 2025. Chen Chen, Xinlong Hao, Weiwen Liu, Xu Huang, Xingshan Zeng, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Yuefeng Huang, et al. ACEBench: Who wins the match point in tool learning? arXiv preprint arXiv:2501.12851, pages arXiv–2501, 2025. OpenAI. Introducing GPT-4.1 in the api, April 2025b. URL https://openai.com/index/gpt-4-1/. Anthropic. Introducing claude 4, May 2025. URL https://www.anthropic.com/news/claude-4. 
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025. Author Qi and Others. Zero-bubble pipeline parallelism for large language models. arXiv preprint arXiv:2301.12345, 2023. Penghui Qi, Xinyi Wan, Nyamdavaa Amar, and Min Lin. Pipeline parallelism with controllable memory. 2024. Kan Zhu, Yufei Gao, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, Dedong Xie, Tian Tang, Qinyu Xu, Zihao Ye, Keisuke Kamahori, Chien-Yu Lin, Ziren Wang, Stephanie Wang, Arvind Krishnamurthy, and Baris Kasikci. NanoFlow: Towards optimal large language model serving throughput. arXiv preprint arXiv:2408.12757, 2025. Yulei Qian, Fengcun Li, Xiangyang Ji, Xiaoyu Zhao, Jianchao Tan, Kefeng Zhang, and Xunliang Cai. EPS-MoE: Expert pipeline scheduler for cost-efficient moe inference. arXiv preprint arXiv:2410.12247, 2025. 33
34. LongCat-Flash Technical Report The SGLang Team. Deploying deepseek with pd disaggregation and large-scale expert parallelism on 96 h100 gpus. https://lmsys.org/blog/2025-05-05-large-scale-ep/, 2025b. Accessed: [May 2025]. Jack Choquette. Nvidia hopper gpu: Scaling performance. In 2022 IEEE Hot Chips 34 Symposium (HCS), 2022. Ranajoy Sadhukhan, Jian Chen, Zhuoming Chen, Vashisth Tiwari, Ruihang Lai, Jinyuan Shi, Ian En-Hsu Yen, Avner May, Tianqi Chen, and Beidi Chen. MagicDec: Breaking the latency-throughput tradeoff for long context generation with speculative decoding. arXiv preprint arXiv:2408.11049, 2025. Jiahao Liu, Qifan Wang, Jingang Wang, and Xunliang Cai. Speculative decoding via early-exiting for faster llm inference with thompson sampling control mechanism. arXiv preprint arXiv:2406.03853, 2024d. Feiye Huo, Jianchao Tan, Kefeng Zhang, Xunliang Cai, and Shengli Sun. C2T: A classifier-based tree construction method in speculative decoding. arXiv preprint arXiv:2502.13652, 2025. Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S Gulavani, and Ramachandran Ramjee. Sarathi: Efficient llm inference by piggybacking decodes with chunked prefills. arXiv preprint arXiv:2308.16369, 2023. Chenggang Zhao, Liang Zhao, Jiashi Li, and Zhean Xu. DeepGEMM: clean and efficient fp8 gemm kernels with fine-grained scaling. https://github.com/deepseek-ai/DeepGEMM, 2025a. Pengcuo Dege, Qiuming Luo, Rui Mao, and Chang Kong. FlashMLA-ETAP: Efficient transpose attention pipeline for accelerating mla inference on nvidia h20 gpus. arXiv preprint arXiv:2506.01969, 2025. NVIDIA. NVIDIA Collective Communications Library (NCCL). https://github.com/NVIDIA/nccl. Version 2.21.5. Aashaka Shah, Abhinav Jangda, Binyang Li, Caio Rocha, Changho Hwang, Jithin Jose, Madan Musuvathi, Olli Saarikivi, Peng Cheng, Qinghua Zhou, Roshan Dathathri, Saeed Maleki, and Ziyue Yang. MSCCL++: Rethinking gpu communication abstractions for cutting-edge ai applications. arXiv preprint arXiv:2504.09014, 2025. Qingyuan Li, Yifan Zhang, Liang Li, Peng Yao, Bo Zhang, Xiangxiang Chu, Yerui Sun, Li Du, and Yuchen Xie. FPTQ: Fine-grained post-training quantization for large language models. arXiv preprint arXiv:2308.15987, 2023b. Zunhai Su, Qingyuan Li, Hao Zhang, YuLei Qian, Yuchen Xie, and Kehong Yuan. Unveiling super experts in mixture-of-experts large language models. arXiv preprint arXiv:2507.23279, 2025. Chenggang Zhao, Shangyan Zhou, Liyue Zhang, Chengqi Deng, Zhean Xu, Yuxuan Liu, Kuai Yu, Jiashi Li, and Liang Zhao. DeepEP: an efficient expert-parallel communication library. https://github.com/deepseek-ai/DeepEP, 2025b. DeepSeek. Profiling data in deepseek infra. https://github.com/deepseek-ai/profile-data, 2025a. Accessed: [May 2025]. DeepSeek. Day 6: One more thing, deepseek-v3/r1 inference system overview. https://github.com/deepseek-ai/ open-infra-index/blob/main/202502OpenSourceWeek/day_6_one_more_thing_deepseekV3R1_ inference_system_overview.md, 2025b. Accessed: [May 2025]. Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2023. Shengyu Liu Jiashi Li. FlashMLA: Efficient mla decoding kernels. https://github.com/deepseek-ai/FlashMLA, 2025. 34
A Appendix

A.1 Statistics and Case Studies of Dynamic Routing

Figure 11 shows the average number of activated FFN experts of the LongCat-Flash base model across benchmarks. A consistent computational bias favors English tokens over Chinese and mathematical ones.

Figure 11: The average number of activated FFN experts across different benchmarks (GSM8K, CMMLU, MATH, HumanEval, MBPP, and MMLU; averages range from about 7.46 to 8.32).

We present a more detailed breakdown of expert selection across different layers for several cases in Table 8. These cases reveal different patterns of expert selection across layers. In the first layer, function words (including articles, conjunctions, and prepositions), numbers, and punctuation marks consistently receive lower computational resources. In contrast, the final layer (Layer 28) exhibits less specialized feature allocation than Layer 1, though identifiable patterns still exist. For example, in the Chinese text case, tokens preceding punctuation marks tend to be assigned fewer computational resources. We hypothesize that shallow layers prioritize token-internal semantics for allocation, while deeper layers dynamically adjust resources based on predictive complexity, potentially reflecting a hierarchical transition from local feature processing to global prediction optimization.
36. LongCat-Flash Technical Report Layer 1 - English Layer 1 - Math Layer 1 - Code Layer 1 - Chinese Layer 28 - English Layer 28 - Math Layer 28 - Code Layer 28 - Chinese Activated FFN experts 0 2 4 6 8 10 12 Table 8: The number of activated FFN experts per token across layers. 36
