Kwaipilot: Putting Kuaishou's Code LLM into Practice in R&D


1. Kwaipilot: Putting Kuaishou's Code LLM into Practice in R&D. Speaker: 詹子正 (Zhan Zizheng), head of code LLM algorithms at Kuaishou

3. 01 What is Kwaipilot?


6. 02 What is wrong with existing code models?

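7. Why build our own model?
   Problems large models run into:
   • Small context window: writing code involves many dependencies, and it is hard to feed the model sufficiently rich code-structure information.
   • Complex code dependencies: real R&D scenarios are complex; current projects often have many dependencies and require a lot of prior knowledge.
   • Long training and inference time: LLM inference is costly and slow, especially with long contexts, which conflicts with the real-time requirements of code writing.
   Technical solutions:
   • Long Context: the in-house One Attention algorithm extrapolates the context window 8x losslessly, with no training.
   • Context Rich: retrieval and program analysis raise the information density of the input context: (1) call flow and data flow; (2) parent/child classes and dependency packages; (3) similar code and auxiliary snippets. Everything that helps a human programmer is something the model needs.
   • MoE (mixture-of-experts network): balances experience against cost, and speed against performance.
   Data flywheel: open-source models (StarCoder, CodeLlama, DeepSeek-Coder, StarCoderV2) → model selection → fine-tune on private code (business code, middleware), with data augmentation, data labeling* and data feedback.
   Evaluation results far ahead of open-source SOTA (chart: kwaipilot vs. CodeLlama (Meta)) on template code, line-level generation, function-block generation, logic-block generation, cross-file generation, and the weighted total.
   After real deployment the user adoption rate kept rising (chart label: 35%; x-axis from 2023-10-16 to 2024-02-05).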

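8. What makes a model strong is fundamentally the data, not the architecture
   • Retrieval and program analysis raise the information density of the input context: (1) call flow and data flow; (2) parent/child classes and dependency packages; (3) similar code and auxiliary snippets (RAG for Code); (4) …
   • Everything that helps a human programmer is something the model needs.
   • The in-house "Context-Rich" training technique builds logical reasoning chains; across evaluation sets the model improves on general-purpose models by ~70%.
   Academic output: Prompt-based Code Completion via Multi-Retrieval Augmented Generation (TOSEM, CCF-A) https://arxiv.org/abs/2405.07530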
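
The "similar code" channel of context enrichment can be illustrated with a minimal sketch. It assumes token-set Jaccard similarity as the ranking signal and a character budget for the prompt; both are stand-ins chosen for illustration, not Kwaipilot's retriever. The names `tokens`, `jaccard`, and `build_prompt` are hypothetical.

```python
import re

def tokens(code: str) -> set[str]:
    """Split code into a set of identifier-like tokens."""
    return set(re.findall(r"[A-Za-z_]\w*", code))

def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two code fragments."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def build_prompt(unfinished: str, candidates: list[str], budget_chars: int = 400) -> str:
    """Prepend the most similar snippets (as comments) until the budget is used up."""
    # Assumption: similarity to the unfinished code is the only ranking signal.
    ranked = sorted(candidates, key=lambda c: jaccard(unfinished, c), reverse=True)
    parts, used = [], 0
    for snippet in ranked:
        commented = "\n".join("# " + line for line in snippet.splitlines())
        block = "# Similar snippet:\n" + commented + "\n"
        if used + len(block) > budget_chars:
            break
        parts.append(block)
        used += len(block)
    return "".join(parts) + unfinished
```

A production system would also draw on the other channels listed on the slide (call flow, class hierarchy, dependency packages) and would rank with a learned model rather than a token overlap.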

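9. LLM as Agent
   Scenarios: internal technical documentation lookup, paper key-point summaries, code interpretation and execution, R&D workflows.
   • Internal documentation lookup: combines traditional search with newer RAG techniques, backed by long-context capability, to build a domain-aware on-call service over internal documents. Compared with a conventional vector-database Q&A setup, it is cheaper to integrate, faster to reflect updates, and more accurate. AI understands and intercepts user questions, cutting the on-call support cost of internal tool platforms.
   • R&D tools: the model's comprehension and tool-use abilities quickly surface information relevant to an R&D task, through paper lookup and summaries, public and private knowledge retrieval, and code interpretation and execution.
   • Internal tools: some internal tools are hard to pick up, with high onboarding and human on-call costs. The model learns domain knowledge and the basics of the internal tools, then dispatches them from natural language, giving developers a more immersive environment and lowering the barrier and time cost of using them.
   Technical solutions:
   • Long Context: the in-house One Attention algorithm extrapolates the context window 8x losslessly, with no training.
   • Context Rich for Chat: retrieval and program analysis raise the quality and density of the Q&A context, which is then used to train the model further.
   • Agent (tool use): the model is trained to use internal and external tools; it recognizes user intent from natural language and invokes those tools efficiently.
   Academic output: Agentless: Demystifying LLM-based Software Engineering Agents https://arxiv.org/abs/2407.01489 (related work cited in OpenAI's official GPT-4o technical report).
   Other labels on the slide: 天工, KDev, Keep, Team.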

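10. A "coding is labeling" data flywheel and high-quality synthetic data
    • Company-wide code corpus augmentation became routine (2024 Q2): 1000+ front-line R&D engineers, 30000+ high-quality code corpus entries.
    • Using plugin log data: taking the code completion model as an example, we receive millions of records every day of suggestions being accepted or rejected during coding. How can this preference data help the LLM "self-evolve"?
    • Results: +20.35% code generation accuracy on Kuaishou's private code; +1.88pp adoption rate.
    Academic output (high-quality synthetic data): Magicoder: Empowering Code Generation with OSS-INSTRUCT (ICML, CCF-A) https://arxiv.org/pdf/2312.02120, cited in the Llama 3.1 technical paper.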
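
The slide leaves open how accept/reject logs are turned into training signal. One plausible reading, sketched below under stated assumptions, is to group log events by prompt and pair each accepted suggestion with each rejected one, giving (chosen, rejected) pairs of the kind preference-optimization methods consume. The event schema (`prompt`, `completion`, `accepted`) and the function name are hypothetical, not the plugin's actual log format.

```python
from collections import defaultdict

def build_preference_pairs(events):
    """Group completion events by prompt; pair each accepted suggestion with each rejected one."""
    by_prompt = defaultdict(lambda: {"accepted": [], "rejected": []})
    for e in events:
        key = "accepted" if e["accepted"] else "rejected"
        by_prompt[e["prompt"]][key].append(e["completion"])

    pairs = []
    for prompt, group in by_prompt.items():
        for chosen in group["accepted"]:
            for rejected in group["rejected"]:
                if chosen != rejected:  # skip contradictory logs for the same text
                    pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```

Prompts with only accepted or only rejected suggestions produce no pairs here; they could still serve as plain positive or negative examples.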

11. 03 Training a full-size SOTA code completion LLM at 1/30 of the cost

12. Technical route
    1. Pruning
    2. Knowledge Distillation
    3. Granular Upcycling
    4. CPT & Decay
    Diagram labels: Teacher Model, Student Model; 10+ TB low quality (random-init pre-train); ～0.2 TB middle quality; < 1 TB high quality.
    Why Pruning & Distill?
    Traditional pre-training:
    • low-quality data
    • huge amount of data
    • similar data, similar model
    Pruning & Distill:
    • higher-quality data
    • small amount of data
    • more consistent training
    • configurable model structure
    • model knowledge compression: half the size, 95%+ of the performance

13. Stage1: Structured Pruning
    Blocks of nonzero elements are removed from the model weights at once.
    Pruning method: neuron (MLP), attention head (MHA), embeddings (LayerNorm), depth pruning (layer).
    Importance estimation [1]:
    • intra-layer (width): activations
    • inter-layer (depth): valid loss (ppl), downstream task accuracy, …
    [Figure: a stack of transformer blocks (Layer 0 … Layer N) above the input and position embeddings; each block contains Layer Norm, Attention, Layer Norm, MLP. Width pruning acts inside a layer; depth pruning removes whole layers.]
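
Activation-based width pruning can be sketched on a toy two-layer MLP. The sketch assumes mean absolute activation over a small calibration set as the importance score and a ReLU nonlinearity; the slide only says "activations", so the exact statistic, and the helper names, are illustrative assumptions.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mlp(W1, W2, x):
    """Toy MLP: W2 @ relu(W1 @ x)."""
    return matvec(W2, relu(matvec(W1, x)))

def neuron_importance(W1, calib):
    """Mean absolute hidden activation per neuron over a calibration set."""
    scores = [0.0] * len(W1)
    for x in calib:
        for j, a in enumerate(relu(matvec(W1, x))):
            scores[j] += abs(a) / len(calib)
    return scores

def prune_mlp(W1, W2, calib, keep):
    """Keep the `keep` most important hidden neurons (rows of W1, columns of W2)."""
    scores = neuron_importance(W1, calib)
    ranked = sorted(range(len(W1)), key=lambda j: scores[j], reverse=True)
    kept = sorted(ranked[:keep])  # preserve the original neuron order
    W1_p = [W1[j] for j in kept]
    W2_p = [[row[j] for j in kept] for row in W2]
    return W1_p, W2_p, kept
```

Because whole neurons are removed, the pruned matrices stay dense and smaller, which is what distinguishes structured pruning from zeroing individual weights.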

14. Stage2: Knowledge Distillation
    Teacher model correction (fine-tune): fine-tune the open-source model; 100 ~ 200B, close to the original data.
    Losses:
    $$L_{is} = \frac{1}{l} \sum_{k \in H} \sum_{i=1}^{l} Loss_k\left(h^{t}_{ki}, h^{s}_{ki}\right)$$
    $$L_{logits} = \frac{1}{l} \sum_{k=1}^{l} Loss\left(p^{t}_{k}(x, \tau), p^{s}_{k}(x, \tau)\right)$$
    $$L = L_{CLM} + L_{logits} + \alpha \times L_{is}$$
    [Figure: teacher and student models side by side. Losses are taken between teacher and student embeddings (embedding loss), MLP inputs at Layer L+1 (MLP input loss), LM heads (LM head loss), and output logits (logits loss).]
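
The combined loss on this slide, L = L_CLM + L_logits + α × L_is, can be made concrete with a small numeric sketch. The slide does not say which function each `Loss` is; the code below assumes KL divergence for the logits term and mean squared error for the intermediate-state term, and it assumes teacher and student hidden states share a dimension (otherwise a projection would be needed). All function names are illustrative.

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    m = max(logits)
    exps = [math.exp((z - m) / tau) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_div(p, q):
    """KL(p || q) for two probability vectors."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def distill_loss(t_logits, s_logits, targets, t_hidden, s_hidden, tau=1.0, alpha=1.0):
    """t_logits/s_logits: per-token logits. t_hidden/s_hidden: {state name: per-token vectors}."""
    l = len(targets)
    # L_CLM: next-token cross-entropy of the student.
    l_clm = sum(-math.log(softmax(s)[y]) for s, y in zip(s_logits, targets)) / l
    # L_logits: teacher vs. student output distributions at temperature tau (assumed KL).
    l_logits = sum(kl_div(softmax(t, tau), softmax(s, tau))
                   for t, s in zip(t_logits, s_logits)) / l
    # L_is: sum over chosen hidden states k and tokens i (assumed MSE).
    l_is = sum(mse(ht, hs)
               for k in t_hidden
               for ht, hs in zip(t_hidden[k], s_hidden[k])) / l
    return l_clm + l_logits + alpha * l_is
```

When the student matches the teacher exactly, the two distillation terms vanish and only the language-modeling loss remains.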

15. Stage3: Granular MoE Upcycling 1: Model Split (tensor parallel)
    [Figure: two parallel branches, each running input tensor → F1 W → F1 result → F2 W → × weight; the branch outputs are summed (+) into the output tensor.]
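
The property that makes a tensor-parallel-style split usable for upcycling is that an FFN sliced along its intermediate dimension gives chunk outputs that sum to the dense output. The sketch below checks this on a gated FFN, assuming the gate form f(x1) * x2 with f(x) = x * sigmoid(x) given on the weight-scaling slide; the weight names `W_gate`, `W_up`, `W_down` are illustrative, and `chunk_num` is assumed to divide the intermediate size.

```python
import math

def silu(x):
    return x * (1.0 / (1.0 + math.exp(-x)))

def ffn(x, W_gate, W_up, W_down):
    """Gated FFN: down(silu(gate(x)) * up(x)). W_down is stored as [intermediate][out]."""
    g = [sum(w * xi for w, xi in zip(row, x)) for row in W_gate]
    u = [sum(w * xi for w, xi in zip(row, x)) for row in W_up]
    h = [silu(a) * b for a, b in zip(g, u)]
    out_dim = len(W_down[0])
    return [sum(h[j] * W_down[j][o] for j in range(len(h))) for o in range(out_dim)]

def split_ffn(W_gate, W_up, W_down, chunk_num):
    """Slice the intermediate dimension into `chunk_num` equal chunks."""
    size = len(W_gate) // chunk_num
    return [
        (W_gate[i * size:(i + 1) * size],
         W_up[i * size:(i + 1) * size],
         W_down[i * size:(i + 1) * size])
        for i in range(chunk_num)
    ]
```

Each chunk is a small FFN in its own right, so the chunks can serve as starting weights for fine-grained experts.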

16. Stage3: Granular MoE Upcycling 2: MoE Router Init
    [Figure labels: dense model FFN → dup → chunk; router: random init; probs: top-k then softmax; dup; sort.]
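
Reading the router step on this slide as top-k selection followed by softmax, the routing weights for one token can be sketched as below. The function name and the dictionary return type are illustrative choices; `active_num` mirrors the symbol used on the weight-scaling slide.

```python
import math

def route(router_logits, active_num):
    """Pick the top `active_num` experts, then softmax over only those logits."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:active_num]
    m = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - m) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}
```

Taking the softmax after the top-k step means the selected experts' weights always sum to 1, regardless of how many experts were left out.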

17. Stage3: Granular MoE Upcycling 3: Weight Scaling
    F1 result = f(x1) * x2, with f(x) = x * sigmoid(x)
    Probs = 1/(dup_num*chunk_num)
    Scale = [(Probs) * (active_num/chunk_num)]^(-2) = [(1/(dup_num*chunk_num)) * (active_num/chunk_num)]^(-2)
    [Figure labels: input tensor, X1, X2, F1 W1, F1 W2, F1 result, F2 W, output tensor; the scaled matrices are marked "* W".]
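
The Probs and Scale formulas above can be evaluated directly. The helper below is only a transcription of the slide's arithmetic, including the exponent as written there; how the resulting factor is applied to the weights is not shown on the slide and is not modeled here. The function name is illustrative.

```python
def upcycle_scale(dup_num: int, chunk_num: int, active_num: int) -> float:
    """Scale = [Probs * (active_num / chunk_num)]^(-2), with Probs = 1 / (dup_num * chunk_num)."""
    probs = 1.0 / (dup_num * chunk_num)
    return (probs * (active_num / chunk_num)) ** -2
```

For example, with `dup_num=2`, `chunk_num=4`, `active_num=2`: Probs = 1/8, the bracket is 1/8 × 1/2 = 1/16, and Scale = 16² = 256.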

18. Stage4: Multi-Stage CPT
    Schedule: Warmup → Cosine Decay → Linear Decay
    Data labels: 800B tokens, 70% source code; 50B tokens, 40% source code
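
The three-phase schedule named on this slide (warmup, cosine decay, linear decay) can be sketched as a function of the training step. Every number in the usage below is hypothetical: the slide gives token budgets, not step counts or learning-rate values, and the parameter names are illustrative.

```python
import math

def lr_at(step, warmup_steps, cosine_steps, linear_steps,
          peak_lr, cosine_end_lr, final_lr=0.0):
    """Learning rate at `step`: linear warmup, cosine decay, then linear decay."""
    if step < warmup_steps:
        # Linear warmup up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    step -= warmup_steps
    if step < cosine_steps:
        # Cosine decay from peak_lr down to cosine_end_lr.
        progress = step / cosine_steps
        return cosine_end_lr + 0.5 * (peak_lr - cosine_end_lr) * (1 + math.cos(math.pi * progress))
    step -= cosine_steps
    # Linear decay from cosine_end_lr to final_lr, then held at final_lr.
    progress = min(step / linear_steps, 1.0)
    return cosine_end_lr + (final_lr - cosine_end_lr) * progress
```

With `warmup_steps=10, cosine_steps=100, linear_steps=20, peak_lr=1.0, cosine_end_lr=0.1`, the rate climbs to 1.0 by step 9, falls along the cosine to 0.1 at step 110, and reaches 0.0 at step 130.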

19. Model capabilities
    Benchmark           Kwaipilot-Coder  Qwen2.5-Coder  OpenCoder      Qwen2.5-Coder  DeepSeek-Coder-V2  Yi-Coder
                        23BA4-Base-V1    7B-Base        8B-Base        32B-Base       Base(16BA2.4)      9B
    BigCodeBench Hard   23.3             14.2           9.5            25             8.1                14.2
    BigCodeBench Full   49.9             46.9           40.5           54             30.6               42.9
    HumanEval           82.9             61.6           66.5           65.9           40.9               53.7
    HumanEval Plus      76.2             53             63.4           59.1           34.1               46.3
    MultiPL-E           68.7             57.5           62.8           59.9           40.4               49.9
    Fill-in-the-Middle  93.3             86.2           Not supported  88.3           86.4               Not supported
    MBPP plus           64.0             62.9           68.4           68.3           59.4               40.7
    KwaiEval            67               51.2           Not supported  59.5           52.5               Not supported
    The slide also carries an "After SFT" label; the extracted text does not show which row it qualifies.
    Open-source model: https://huggingface.co/Kwaipilot/KwaiCoder-23B-A4B-v1

20. [Figure: two scatter plots of PASS@1 (greedy decoding) against model size, each marking the "best performance/size ratio" region. Points are assigned to a plot by their value range.]
    HumanEval (EvalPlus-HumanEval): KwaiPilot-Coder-23BA4-v1 (83.0), Kwai-Coder-DS-V2-Lite (75.0), OpenCoder-8B (66.5), Qwen2.5-Coder-32B (65.9), Qwen2.5-Coder-14B (64.0), Qwen2.5-Coder-7B (61.6), CodeLlama-70B (55.5), Yi-Coder-9B (53.7), CodeLlama-34B (51.8), StarCoder2-15B (46.3), Qwen2.5-Coder-1.5B (43.9), DeepSeek-Coder-V2-Lite (40.9), CodeLlama-7B (33.5), CodeGemma-2B (31.1).
    BigCodeBench-Complete-Hard: KwaiPilot-Coder-23A4-V1 (23), DeepSeek-Coder-33B (19.6), Kwai-Coder-DS-V2-Lite (18.2), CodeLlama-70B (16.2), CodeQwen1.5-7B (15.5), Llama-3-70B (14.9), Mixtral-8x22B-base (14.2), DeepSeek-Coder-6.7B (13.5), CodeLlama-34B (12.8), DeepSeek-Coder-V2-Lite (11.5), CodeLlama-13B (9.5), CodeGemma-2B (7.4), DeepSeek-Coder-1.3B (3.4).

21. BigCode Bench hard

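22. Online results on the code completion task: in a gray-release experiment on 50% of traffic, the adoption rate rose by 1-2pp and inference latency dropped by 70ms.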

23. A workflow redesigned around the model's characteristics
    [Figure labels: Coding Behavior Analysis; Incomplete Code; Debounce; Rank; Truncated Generation; Analyzer; Context RAG 2.0; API Document; Code Signature; Method Signature; Class Signature; Package Signature; Similar code snippets; Current Knowledge; Context in Repo; Dynamic Prompting; Context Score Model; Apply Filter Model; LLMs; Block FIM; Syntax Verification; Code With Completion.]
    Analogy example on the slide: def funA( ): a = A( ); a.run( ) … alongside def funB( ): b = B( ); b.run( ) …
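
Of the workflow stages named on this slide, "Syntax Verification" is the easiest to illustrate. The sketch below is one way such a filter could work for Python targets, assuming a candidate is kept only when prefix + completion + suffix parses; it uses the standard-library `ast.parse` and says nothing about how Kwaipilot implements the step. Both function names are hypothetical.

```python
import ast

def passes_syntax_check(prefix: str, completion: str, suffix: str) -> bool:
    """Return True if prefix + completion + suffix parses as Python source."""
    try:
        ast.parse(prefix + completion + suffix)
        return True
    except SyntaxError:
        return False

def filter_completions(prefix, candidates, suffix):
    """Drop candidate completions that would leave the file syntactically invalid."""
    return [c for c in candidates if passes_syntax_check(prefix, c, suffix)]
```

A check like this is cheap enough to run on every candidate before it is shown, which is what makes it useful ahead of an adoption-rate metric.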

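24. An auto-thinking code generation model for end-to-end requirement generation
    Kwaipilot-chat: instead of a manual choice between modes, the model decides for itself (autothink): simple coding questions are answered directly, complex requirements get deep thinking.
    Benchmark       Kwaipilot-chat V1-40B  DeepSeek V3-0324  DeepSeek R1  Qwen3
                    (Auto think)           671B              671B         32B
    Gsm8k           96                     86.73             /            90.83
    MBPP            92                     83.6              /            75.4
    Math 500        93.2                   92.8              /            84
    HumanEval       96.8                   92.68             /            91.46
    drop            91                     90.20             /            88.52
    LiveCodeBench   66                     48.03             65.9         65.28
    AIME            83.3                   81.25             79.8         81.25
    GPQA            66                     63.64             71.5         67.68
    Qualifier labels on the slide: "passk", "2024", "diamond" (most likely for LiveCodeBench, AIME and GPQA respectively).
    Llama4 Scout 109B: 32.8 and 57.2 (the extracted text does not show which rows these belong to).
    End-to-end requirement generation: in both thinking and non-thinking scenarios, Kwaipilot-chat-v1's auto-think mode leads existing SOTA open-source LLMs.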

25. Auto Think: answer simple questions directly; think deeply about complex ones

26. 04 End-to-end RAG further improves performance in real scenarios

27. RAG 2.0

28. An embedding model designed for code scenarios. Technical details and open-source model: OASIS: Ordered Augmentation for Self-Improving Search in Code Repositories (ACL 2025)


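30. Some reflections and lessons
    • Fine-grained MoE training: a model trained via fine-grained merging performs better than one trained from scratch, converges faster, and can break through the model's capability ceiling.
    • Pruning: many layers in existing open-source models are unnecessary; the middle-to-late layers play the larger role; model parameters are redundant; width matters more than depth.
    • Annealing: too high a learning rate keeps the loss from converging, and too low a rate keeps the model from learning; choosing the learning rate per training stage, according to the problem at hand, matters.
    • Distillation: run SFT/CPT on the teacher model before distilling; balancing the knowledge gap between teacher and student matters.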


32. THANKS. Explore the limits of AI applications