On-Device LLM Operating Systems: Architecture, Optimization, and Outlook
1. On-Device LLM Operating Systems: Architecture, Optimization, and Outlook
Mengwei Xu (徐梦炜)
2. Table of Contents
01 Mobile intelligence before LLM
02 On-device LLM
03 The changes LLM brings: App
04 The changes LLM brings: OS
05 The changes LLM brings: H/W
06 Takeaways
3.
4. Mobile intelligence before LLM
• DNNs already run on devices at large scale
DNN-embedded mobile apps
▪ Increased by almost 10x (2018 to 2021) [1,2]
▪ Downloaded billions of times in one year
▪ Include almost every high-popularity app
▪ Up to 200+ DNNs in a single app [3]
[1] Mengwei Xu, et al. “A First Look at Deep Learning Apps on Smartphones”. In WWW 2019
[2] Mario Almeida, et al. “Smart at what cost? Characterising Mobile Deep Neural Networks in the wild”. In IMC 2021.
[3] Through offline communication with application developers.
5. What we expect as mobile intelligence
A device capable of:
• Comprehending human language
• Reasoning & planning
• Zero-shot & in-context learning
• Multimodal alignment
• Instruction following
6. The opportunity: LLM
• To bring mobile devices the “next-level” intelligence
• Comprehending human language
• Reasoning & planning
• Zero-shot & in-context learning
• Multimodal alignment
• Instruction following
7. On-device LLM
• On-device LLMs handle language tasks in a way that is ..
✓ cost-efficient (important, obviously)
✓ more available (even w/o network)
✓ faster (not always)
✓ privacy-preserving (very important: LLMs can leverage almost every bit of local data)
• LLMs on devices do not obviate mega-scale LLMs in the cloud!
‒ e.g., creating music/poetry, solving hard math problems, etc.
[1] Jiajun Xu, et al. “On-Device Language Models: A Comprehensive Review”. In preprint’24.
8. On-device LLM
• We already have a mobile device that can function with high intelligence!
A mobile device that can comprehend,
reason, and plan without a cloud!
9. Call for full-stack design
• Our response: agent-model-runtime-OS co-design
Agent: computer-use agents (GUI or MCP), with a testbed [LlamaTouch, UIST’24], datasets [DroidCall, EMNLP’25][SHORTCUTSBENCH, ICLR’25], and privacy [SILENCE, NeurIPS’24]
Model: a trained-from-scratch, fully reproducible SLM family [PhoneLM, preprint’24], an any-to-any-modality mobile foundation model [M4, MobiCom’24], and efficient training techniques [FwdLLM, ATC’24][AdaFL, MobiCom’23][FeS, MobiCom’23]
Runtime: acceleration through NPUs [llm.npu, ASPLOS’25], speculative decoding [LLMCad, TMC’24], sparsity [EdgeMoE, TMC’25], early exiting [Recall, Nature Communications’25], etc.
OS: LLMaaS context management [LLMS, SenSys’26], elasticity [ELMS, MobiCom’25], and forward compatibility [LoRASuite, NeurIPS’25]
10. Demo#1: DigitalAgent
(Figure: demo spanning the agent, model, and runtime layers.)
11. Demo#2: UAV Agent
12. mllm: an NPU-centric LLM engine
Open source link: https://github.com/UbiquitousLearning/mllm
Agents: GUI agent, API/codegen agent, embodied agents
Models: LLM (Llama, Qwen, MiniCPM, etc.), VLM (Qwen-VL, LLaVA, etc.), OMNI* (MiniCPM-o, etc.), MoE (GPT-OSS, etc.), VLA* (OpenVLA, Pi, etc.)
Techniques: linear attention* (Qwen3-NEXT, etc.), speculative decoding (faster decoding, +NPU), token pruning (faster prefill), KV-cache management (paged attention, prefix cache, etc.), experts offloading (w/ storage hierarchy)
Graph-level IR + kernel-level optimization: Flash Attention 2, ARM KleidiAI, Qualcomm QNN, etc.
Heterogeneous hardware on edge: Qualcomm, Ascend, AMD, Intel, OrangePi, etc.
*: on the way
13. The changes LLM brings
Pre-LLM Era
LLM Era
App: fragmented tasks → a unified agent
OS: model-agnostic → LLM-native
H/W: CPU/GPU-centric → NPU-centric
14. The changes LLM brings
Pre-LLM Era
LLM Era
App: fragmented tasks → a unified agent
OS: model-agnostic → LLM-native
H/W: CPU/GPU-centric → NPU-centric
How to build a capable, generalized, and personalized mobile agent?
15. General approaches: API (MCP) vs. GUI
[1] Chaoyun Zhang, et al. "API Agents vs. GUI Agents: Divergence and Convergence." arXiv:2503.11069 (2025).
16. GUI Agent: Status Quo
[1] Chaoyun Zhang, et al. "Large Language Model-Brained GUI Agents: A Survey." arXiv:2411.18279 (2024).
17. GUI Agent: Status Quo
The reality: <60% accuracy, and minutes-long latency for complex tasks.
18. GUI Agent: Status Quo
(Figure: the GUI-agent pipeline, covering both prompting and training.)
• Input: task instruction + GUI representation, either a text-based view hierarchy or a pixel-based screenshot, e.g.:
<?xml version='1.0' encoding='UTF-8'?>
<node index=0 class=FrameLayout>
<node index=0 class=TextView text='21st Country Breakdown' />
<node index=1 class=TextView text='Album 2009' />
</node>
<node index=1 class=FrameLayout> </node>
• Agentic workflow: Set-of-Mark prompting, reflection and backtracking, test-time scaling, multi-agent collaboration, working memory, tool using
• Environment perception on an interactive testbed (smartphone, Android emulator): AndroidWorld, LlamaTouch
19. GUI Agent: Status Quo
(The same pipeline figure, now adding our approach, UIShift [arXiv’25]: enhancing VLM-based GUI agents through self-supervised reinforcement learning.)
20. Why self-supervised RL?
• GUI trajectories (w/o annotations) can be scaled out easily
• e.g., using approaches like MobileViews
• But human-labelled task instructions cannot
o DeepMind spent about a year collecting the AndroidControl dataset (14K tasks), which is still far from enough, and error-prone
• RL is the way to generalize
• GUI data drifts easily
• Online or offline RL? Still an open problem
21. Why the K-step GUI transition task?
• Rethinking GUI tasks:
• VLMs can understand a single GUI screen pretty well
• What is difficult is understanding/predicting GUI transitions
Easy tasks (single-image):
• Summary: the image shows..
• VQA: how many app icons in this image? It has 21..
• Grounding: where is Chrome located in the image? It is at [0.2, 0.5]..
• Other single-image tasks
Challenging tasks:
• GUI relation
• Multi-hop planning
• Complex task automation
• GUI world model
• ..
• The key is to embed GUI-to-GUI relations into the VLM/agent
22. K-step GUI transition
• Asking the VLM to predict the action that leads to a specific GUI transition
23. K-step GUI transition
• Asking the VLM to predict the action that leads to a specific GUI transition
• Trained on data without human labels
• Trained with GRPO [1]
‒ A unified sampling and scoring mechanism during data filtering and training
• Inspired by inverse dynamics in control theory
[1] Zhihong Shao, et al. “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.” arXiv:2402.03300 (2024).
24. K-step GUI transition
• Asking the VLM to predict the action that leads to a specific GUI transition
• Trained on data without human labels
• Trained with GRPO [1]: a unified sampling and scoring mechanism during data filtering and training (sketched below)
• Inspired by inverse dynamics in control theory
• We conduct small-scale experiments:
• 2K filtered K-step GUI transition samples drawn from AndroidControl, without using its task instructions
• Models: Qwen2.5-VL-7B, InternVL3-8B, Mimo-VL-7B-SFT, Mimo-VL-7B-RL
• We experiment w/ and w/o data filtering, w/ and w/o CoT reasoning
• Goal: show whether it matches SFT, and whether it can scale out
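A minimal sketch of the two core pieces: constructing a K-step transition sample from an unlabelled trajectory, and computing GRPO's group-normalized advantages (illustrative code; `Step`, the field names, and the reward handling are assumptions, not the UIShift implementation):

```python
import random
from dataclasses import dataclass

@dataclass
class Step:
    screenshot: bytes   # GUI state s_t
    action: str         # logged action at s_t, e.g. "click(0.42, 0.77)"

def make_kstep_sample(traj: list[Step], k: int = 1) -> dict:
    """Inverse-dynamics task: given (s_t, s_{t+k}), predict the action
    taken at s_t. No human task labels are needed; actions are logged
    automatically while the trajectory is collected."""
    t = random.randrange(len(traj) - k)  # requires len(traj) > k
    return {
        "observation_pair": (traj[t].screenshot, traj[t + k].screenshot),
        "reference_action": traj[t].action,
    }

def grpo_advantages(rewards: list[float]) -> list[float]:
    """GRPO: sample a group of candidate actions per prompt, reward each
    (e.g., by matching the logged reference action), then normalize the
    rewards within the group instead of using a learned value model."""
    mu = sum(rewards) / len(rewards)
    std = max((sum((r - mu) ** 2 for r in rewards) / len(rewards)) ** 0.5, 1e-6)
    return [(r - mu) / std for r in rewards]
```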
25. Results on GUI automation and grounding
• UIShift outperforms models trained with large-scale task instructions.
26. Ablation study: GRPO vs. SFT
• GRPO is more suitable than SFT for the K-step GUI transition task.
27. Our Roadmap to the GUI Agent
• Starting from a capable vision-language model
• Step 1: Self-supervised reinforcement learning
• Large quantities of trajectory data, no labelled tasks
• Building a “world model” for GUIs
• Static data + offline RL
• Step 2: Task-guided reinforcement learning
• Small-scale, human-labelled tasks and trajectories
• Real testbeds + online RL
• Step 3: Online agent design
28. The changes LLM brings
Pre-LLM Era
LLM Era
App: fragmented tasks → a unified agent
OS: model-agnostic → LLM-native
H/W: CPU/GPU-centric → NPU-centric
How should the OS better serve and manage device-wide LLM requests?
29. LLM as an OS service
• LLM integrated into the OS as a system service
• Scales to an unbounded number of tasks
• Hardware-design-friendly
• The OS gains full visibility into LLM requests
[1] Source: https://developer.android.com/ai/gemini-nano
30. Challenges of LLMaaS
• Opening new research opportunities and challenges
• Usability: How to design the LLMaaS interface (a hypothetical sketch follows this list)? How to upgrade the LLM?
• [MobiCom’24] Mobile Foundation Model as Firmware
• [NeurIPS’25] Efficient LoRA Adaptation Across Large Language Model Upgrades
• Efficiency: How to schedule, batch, and cache/reuse system-wide LLM requests? How to manage LLM context states across apps?
• [MobiCom'25] Elastic On-Device LLM Service
• [SenSys'26] LLM as a System Service on Mobile Devices
• Security: How to protect app-owned LoRA adapters? How to isolate cross-app requests?
• Etc.
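As a purely hypothetical sketch of what an LLMaaS surface might look like (every name here, `QoS`, `LLMService`, and the methods, is an illustrative assumption, not Android AICore or any shipped API):

```python
from dataclasses import dataclass
from typing import Iterator

@dataclass
class QoS:
    deadline_ms: int   # latency target the OS scheduler tries to honor
    priority: int      # e.g., foreground app above background agent

class LLMService:
    """Hypothetical OS-level LLM service: apps submit requests, while the
    OS owns the weights, scheduling, batching, and per-app context state."""

    def open_session(self, app_id: str) -> int:
        """Create an isolated per-app context; returns a session handle."""
        ...

    def generate(self, session: int, prompt: str, qos: QoS) -> Iterator[str]:
        """Stream tokens; the OS may batch across apps and reuse prefix caches."""
        ...

    def attach_adapter(self, session: int, lora_path: str) -> None:
        """Load an app-owned LoRA adapter, isolated from other apps."""
        ...
```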
31. Challenges of LLMaaS (recap of the list above; next: Usability)
32. M4: a one-size-fits-all mobile MLLM
• Can one model (as an OS service) solve all mobile AI tasks?
• M4: an any-to-any-modality MLLM
• Tested on 50 mobile AI tasks; M4 outperforms prior art on most of them
[1] Jinliang Yuan, et al. “Mobile Foundation Model as Firmware”. In MobiCom’24.
33. Challenges of LLMaaS (recap; next: Usability under LLM upgrades)
34. LoRASuite: Towards LLM Upgradability
• Upgrading the base LLM without compromising its LoRA adapters (much)
• Base LLMs upgrade along different dimensions
• CKA-based layer mapping (sketched below)
• LoRASuite achieves comparable performance with only 1% of the finetuning data
[1] Yanan Li, et al. “LoRASuite: Efficient LoRA Adaptation Across Large Language Model Upgrades”. In NeurIPS’25.
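A minimal sketch of CKA-based layer mapping (illustrative; `linear_cka`, `map_layers`, and the greedy argmax matching are assumptions sketching the idea, not the LoRASuite implementation):

```python
import numpy as np

def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA similarity between activation matrices of two layers,
    collected on the same n calibration inputs: x is (n, d1), y is (n, d2)."""
    x = x - x.mean(axis=0)  # center each feature dimension
    y = y - y.mean(axis=0)
    num = np.linalg.norm(x.T @ y, "fro") ** 2
    den = np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro")
    return float(num / den)

def map_layers(old_acts: list[np.ndarray], new_acts: list[np.ndarray]) -> list[int]:
    """For each old-model layer, pick the most CKA-similar new-model layer,
    indicating where each LoRA adapter could migrate after an upgrade."""
    return [int(np.argmax([linear_cka(o, n) for n in new_acts]))
            for o in old_acts]
```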
35. Challenges of LLMaaS (recap; next: Efficiency)
36. Serving LLM requests with different QoS
• Key idea: joint planning of token pruning and model (weight) pruning (a plan-selection sketch follows)
• Different apps demand diversified QoS
• An offline-guided, joint plan of token pruning and weight pruning
• Significant improvement over static approaches
[1] Wangsong Yin, et al. “ELMS: Elasticized Large Language Models On Mobile Devices”. In MobiCom’25.
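A minimal sketch of QoS-aware plan selection (illustrative; `Plan`, its profiled fields, and `select_plan` are assumptions, not the ELMS implementation, which plans token and weight pruning jointly and offline):

```python
from dataclasses import dataclass

@dataclass
class Plan:
    token_keep: float   # fraction of prompt tokens kept
    weight_keep: float  # fraction of model width kept
    latency_ms: float   # profiled offline on the target SoC
    accuracy: float     # profiled offline on a calibration set

def select_plan(plans: list[Plan], budget_ms: float) -> Plan:
    """Pick the most accurate (token, weight) pruning plan that still
    meets the request's latency budget; degrade gracefully otherwise."""
    feasible = [p for p in plans if p.latency_ms <= budget_ms]
    if not feasible:
        return min(plans, key=lambda p: p.latency_ms)
    return max(feasible, key=lambda p: p.accuracy)
```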
37. The changes LLM brings
Pre-LLM Era
LLM Era
App: fragmented tasks → a unified agent
OS: model-agnostic → LLM-native
H/W: CPU/GPU-centric → NPU-centric
How to serve LLM requests with low latency and high energy efficiency?
38. On-device LLM needs an LLM processor
• On-device resource scarcity is further exacerbated.
(Figure: timeline of small language model releases, 2019.02 to 2024.09: GPT2, OPT, Bloom, Galactica, Pythia, Cerebras-GPT, LaMini-GPT, Qwen, Phi-1, Stablelm-zephyr, Phi-2, TinyLlama, MobileLLaMA, *MobileLLM, Gemma, Qwen 1.5, MobiLlama, MiniCPM, Stablelm-zephyr-2, OpenELM, Phi-3-mini, recurrentgemma, Qwen 2, *PanGu, DCLM, Gemma-2, SmolLM, danube3, Fox, MiniCPM3, Phi-3.5, Qwen 2.5.)
ResNet, YOLO, LSTM, etc. (<200M):
• <100ms to process one image
• <100MB memory footprint
• Easy to quantize (integer-only)
• Static shape and cost
vs. Small Language Models (1B~5B):
• >10 sec to process one prompt on CPU
• >1GB memory footprint
• Difficult to quantize (FP required)
• Dynamic shape, and cost increases with longer prompts
[1] Zhenyan Lu, et al. “Small Language Models: Survey, Measurements, and Insights”. In preprint’24.
39. On-device LLM needs NPU
• A DSA (an “LLM processor”) is the answer for on-device LLM.
• The gap between CPU/GPU and NPU grows over time
‒ Moore’s law still holds for NPUs
• The energy-efficiency gap is even larger
[1] Jinliang Yuan, et al. “Mobile Foundation Model as Firmware”. In MobiCom’24.
40. Filling the design gap between legacy NPUs and modern LLM inference
[ASPLOS’25] Fast On-device LLM Inference with NPUs
Code at https://github.com/UbiquitousLearning/mllm
41. llm.npu: accelerating LLM prefilling with NPU
• Legacy mobile NPUs have poor support for
(1) dynamic shapes; (2) FP operations; (3) group-level quantization
• llm.npu proposes
• Chunked prefill with partial sharing (sketched below)
• Shadow outlier execution across CPU/NPU
• Out-of-order scheduling between CPU and NPU
[1] Daliang Xu, et al. “Fast On-device LLM Inference with NPUs”. In ASPLOS’25.
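A minimal sketch of the chunked-prefill idea (illustrative; `npu_forward`, `kv_cache`, and the chunk size are assumptions, not the llm.npu code): the NPU graph is compiled once for a fixed chunk length, and a variable-length prompt is processed chunk by chunk, so only the padded tail wastes compute.

```python
CHUNK = 256  # fixed prompt-chunk length the NPU graph is compiled for

def chunked_prefill(token_ids: list[int], npu_forward, kv_cache) -> None:
    """Prefill a variable-length prompt on a static-shape NPU by splitting
    it into fixed-size chunks that append to a shared KV cache."""
    for start in range(0, len(token_ids), CHUNK):
        chunk = token_ids[start:start + CHUNK]
        n_valid = len(chunk)
        chunk = chunk + [0] * (CHUNK - n_valid)  # pad only the final chunk
        # Real tokens attend to the cached prefix plus earlier tokens in
        # this chunk; padded positions are masked out and their KV entries
        # discarded by the (assumed) runtime.
        npu_forward(chunk, kv_cache, n_valid=n_valid)
```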
42. Highlighted results
Prefill speed under different prompt lengths on different devices (dataset: LongBench 2WikiMQA multi-doc QA)
Baselines: MLC-LLM (GPU), llama.cpp (CPU), MNN (CPU), PowerInfer-v2 (NPU), TFLite (GPU)
7.3×–18.4× faster than CPU baselines, and 1.3×–43.6× faster than GPU baselines, at a prompt length of 1024
Achieves >1000 tokens/second on Qwen1.5-1.8B (for the first time)
[1] Daliang Xu, et al. “Fast On-device LLM Inference with NPUs”. In ASPLOS’25.
43.
44. Filling the design gap between legacy NPUs and modern LLM training
[USENIX ATC’24] FwdLLM: Efficient Federated Finetuning of
Large Language Models with Perturbed Inferences
Code at https://github.com/UbiquitousLearning/FwdLLM
45. FwdLLM: BP-free LLM finetuning
• Key idea: leveraging the forward gradient for LLM finetuning (sketched after this slide)
g(θ) = (∇f(θ) · v) v, where
• v is a random, independent perturbation of the same size as the trainable weights θ
• ∇f(θ) · v is the directional derivative of f at θ in direction v; computing it takes only a forward pass, no backpropagation
• g(θ) is an unbiased estimator of f’s gradient
• Compared to the BP approach: legacy-NPU-compatible, and more memory-efficient
• Further optimizations:
• Variance-controlled perturbation pacing
• Discriminative perturbation sampling
• Highlighted results: federated Llama-7B finetuning on devices, with significant speedup and memory savings
[1] Mengwei Xu, et al. “FwdLLM: Efficient Federated Finetuning of Large Language Models with Perturbed Inferences”. In ATC’24.
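A minimal sketch of one forward-gradient step via forward-mode autodiff (illustrative, assuming PyTorch's `torch.func`; not the FwdLLM code, and omitting its pacing and sampling optimizations):

```python
import torch
from torch.func import functional_call, jvp

def forward_gradient_step(model, params, batch, loss_fn, lr=1e-4):
    """One BP-free update: estimate the gradient from a single forward pass."""
    # Random, independent perturbation v with the same shape as each weight
    v = {k: torch.randn_like(p) for k, p in params.items()}

    def loss_of(p):
        logits = functional_call(model, p, (batch["input_ids"],))
        return loss_fn(logits, batch["labels"])

    # jvp returns (loss, directional derivative grad_f(theta) . v) using
    # forward-mode AD only; no backpropagation graph is stored.
    _, dir_deriv = jvp(loss_of, (params,), (v,))

    # Forward gradient g = (grad_f . v) v, an unbiased estimator of grad_f
    return {k: p - lr * dir_deriv * v[k] for k, p in params.items()}
```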
46. Looking into the future..
We shall probably look for hardware-software co-evolution, e.g., the “mortal computation” proposed by Geoffrey Hinton in “Two Paths to Intelligence”.
47. The Future: full-stack design!
https://innogyan.in/2024/10/28/die-shot-of-snapdragon-8-elite-reveals-component-space-allocation/
48. The Future: full-stack design!
• A one-size-fits-all LLM
• Agent workflow (large design space)
• A tiny kernel
• An LLM processor
• A dedicated HBM unit
Time to sacrifice flexibility for efficiency!
(we still keep flexibility at the agent-workflow level)
https://innogyan.in/2024/10/28/die-shot-of-snapdragon-8-elite-reveals-component-space-allocation/
49. Takeaways
• On-device LLM is reinventing mobile devices
• A total paradigm shift of the mobile AI ecosystem
• It calls for full-stack LLM research
• OS, runtime, model, and application (agent)
50.
51. THANKS