On-Device LLM Operating Systems: Architecture, Optimization, and Outlook
1. On-Device LLM Operating Systems: Architecture, Optimization, and Outlook
Mengwei Xu (徐梦炜)
2. Table of Contents
01 Mobile intelligence before LLM
02 On-device LLM
03 The changes LLM brings: App
04 The changes LLM brings: OS
05 The changes LLM brings: H/W
06 Takeaways
3.
4. Mobile intelligence before LLM
• DNNs already run on devices at large scale
DNN-embedded mobile apps
▪ Increased by almost 10x (2018 to 2021) [1,2]
▪ Downloaded billions of times in one year
▪ Include almost every high-popularity app
▪ Up to 200+ DNNs in a single app [3]
[1] Mengwei Xu, et al. “A First Look at Deep Learning Apps on Smartphones”. In WWW 2019
[2] Mario Almeida, et al. “Smart at what cost? Characterising Mobile Deep Neural Networks in the wild”. In IMC 2021.
[3] Through offline communication with application developers.
5. What we expect as mobile intelligence
A device capable of:
• Comprehending human language
• Reasoning & planning
• Zero-shot & in-context learning
• Multimodal alignment
• Instruction following
6. The opportunity: LLM
• To bring mobile devices the “next-level” intelligence
• Comprehending human language
• Reasoning & planning
• Zero-shot & in-context learning
• Multimodal alignment
• Instruction following
7. On-device LLM
• On-device LLMs handle language tasks in a way that is ..
✓ cost-efficient (important, obviously)
✓ more available (even w/o network)
✓ faster (not always)
✓ privacy-preserving (very important: LLMs can leverage almost every bit of local data)
• LLMs on devices do not obviate mega-scale LLMs in the cloud!
‒ e.g., creating music/poetry, solving hard math problems, etc.
[1] Jiajun Xu, et al. “On-Device Language Models: A Comprehensive Review”. In preprint’24.
8. On-device LLM
• We already have a mobile device that can function with high intelligence!
A mobile device that can comprehend,
reason, and plan without a cloud!
9. Call for full-stack design
• Our response: agent-model-runtime-OS co-design
Agent: computer-use agents (GUI or MCP), with a testbed [LlamaTouch, UIST’24], datasets [DroidCall, EMNLP’25][SHORTCUTSBENCH, ICLR’25], and privacy [SILENCE, NeurIPS’24]
Model: a trained-from-scratch, fully reproducible SLM family [PhoneLM, preprint’24], an any-to-any-modality mobile foundation model [M4, MobiCom’24], and efficient training techniques [FwdLLM, ATC’24][AdaFL, MobiCom’23][FeS, MobiCom’23]
Runtime: acceleration through NPUs [llm.npu, ASPLOS’25], speculative decoding [LLMCad, TMC’24], sparsity [EdgeMoE, TMC’25], early exiting [Recall, Nature Communications’25], etc.
OS: LLMaaS context management [LLMS, SenSys’26], elasticity [ELMS, MobiCom’25], and forward compatibility [LoRASuite, NeurIPS’25]
10. Demo#1: DigitalAgent
(Figure: demo spanning the agent, model, and runtime layers.)
11. Demo#2: UAV Agent
12. mllm: an NPU-centric LLM engine
Open source link: https://github.com/UbiquitousLearning/mllm
Agents: GUI agent, API/codegen agent, embodied agents
Models: LLM (Llama, Qwen, MiniCPM, etc.), VLM (Qwen-VL, LLaVA, etc.), OMNI* (MiniCPM-o, etc.), MoE (GPT-OSS, etc.), VLA* (OpenVLA, Pi, etc.)
Techniques: linear attention* (Qwen3-NEXT, etc.), speculative decoding (faster decoding, +NPU), token pruning (faster prefill), KV-cache management (paged attention, prefix cache, etc.), experts offloading (w/ storage hierarchy)
Graph-level IR + kernel-level optimization: Flash Attention 2, ARM KleidiAI, Qualcomm QNN, etc.
Heterogeneous hardware on edge: Qualcomm, Ascend, AMD, Intel, OrangePi, etc.
*: on the way
13. The changes LLM brings
Pre-LLM Era
LLM Era
App: fragmented tasks → a unified agent
OS: model-agnostic → LLM-native
H/W: CPU/GPU-centric → NPU-centric
14. The changes LLM brings
Pre-LLM Era
LLM Era
App: fragmented tasks → a unified agent
OS: model-agnostic → LLM-native
H/W: CPU/GPU-centric → NPU-centric
How to build a capable, generalized, and personalized mobile agent?
15. General approaches: API (MCP) vs. GUI
[1] Chaoyun Zhang, et al. "API Agents vs. GUI Agents: Divergence and Convergence." arXiv:2503.11069 (2025).
16. GUI Agent: Status Quo
[1] Chaoyun Zhang, et al. "Large Language Model-Brained GUI Agents: A Survey." arXiv:2411.18279 (2024).
17. GUI Agent: Status Quo
The reality: <60% accuracy, and minutes-long latency for complex tasks.
18. GUI Agent: Status Quo
(Figure: the GUI-agent pipeline, covering both prompting and training.)
• Input: task instruction + GUI representation, either a text-based view hierarchy or a pixel-based screenshot, e.g.:
<?xml version='1.0' encoding='UTF-8'?>
<node index=0 class=FrameLayout>
<node index=0 class=TextView text='21st Country Breakdown' />
<node index=1 class=TextView text='Album 2009' />
</node>
<node index=1 class=FrameLayout> </node>
• Agentic workflow: Set-of-Mark prompting, reflection and backtracking, test-time scaling, multi-agent collaboration, working memory, tool using
• Environment perception on an interactive testbed (smartphone, Android emulator): AndroidWorld, LlamaTouch
19. GUI Agent: Status Quo
(The same pipeline figure, now adding our approach, UIShift [arXiv’25]: enhancing VLM-based GUI agents through self-supervised reinforcement learning.)
20. Why self-supervised RL?
• GUI trajectories (w/o annotations) can be scaled out easily
• e.g., using approaches like MobileViews
• But human-labelled task instructions cannot
o DeepMind spent about a year collecting the AndroidControl dataset (14K tasks), which is still far from enough, and error-prone
• RL is the way to generalize
• GUI data drifts easily
• Online or offline RL? Still an open problem
21. Why the K-step GUI transition task?
• Rethinking GUI tasks:
• VLMs can understand a single GUI screen pretty well
• What is difficult is understanding/predicting GUI transitions
Easy tasks (single-image):
• Summary: the image shows..
• VQA: how many app icons in this image? It has 21..
• Grounding: where is Chrome located in the image? It is at [0.2, 0.5]..
• Other single-image tasks
Challenging tasks:
• GUI relation
• Multi-hop planning
• Complex task automation
• GUI world model
• ..
• The key is to embed GUI-to-GUI relations into the VLM/agent
22. K-step GUI transition
• Asking the VLM to predict the action that leads to a specific GUI transition
23. K-step GUI transition
• Asking the VLM to predict the action that leads to a specific GUI transition
• Trained on data without human labels
• Trained with GRPO [1]
‒ A unified sampling and scoring mechanism during data filtering and training
• Inspired by inverse dynamics in control theory
[1] Zhihong Shao, et al. “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.” arXiv:2402.03300 (2024).
24. K-step GUI transition
• Asking the VLM to predict the action that leads to a specific GUI transition
• Trained on data without human labels
• Trained with GRPO [1]: a unified sampling and scoring mechanism during data filtering and training (sketched below)
• Inspired by inverse dynamics in control theory
• We conduct small-scale experiments:
• 2K filtered K-step GUI transition samples drawn from AndroidControl, without using its task instructions
• Models: Qwen2.5-VL-7B, InternVL3-8B, Mimo-VL-7B-SFT, Mimo-VL-7B-RL
• We experiment w/ and w/o data filtering, w/ and w/o CoT reasoning
• Goal: show whether it matches SFT, and whether it can scale out
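A minimal sketch of the two core pieces: constructing a K-step transition sample from an unlabelled trajectory, and computing GRPO's group-normalized advantages (illustrative code; `Step`, the field names, and the reward handling are assumptions, not the UIShift implementation):

```python
import random
from dataclasses import dataclass

@dataclass
class Step:
    screenshot: bytes   # GUI state s_t
    action: str         # logged action at s_t, e.g. "click(0.42, 0.77)"

def make_kstep_sample(traj: list[Step], k: int = 1) -> dict:
    """Inverse-dynamics task: given (s_t, s_{t+k}), predict the action
    taken at s_t. No human task labels are needed; actions are logged
    automatically while the trajectory is collected."""
    t = random.randrange(len(traj) - k)  # requires len(traj) > k
    return {
        "observation_pair": (traj[t].screenshot, traj[t + k].screenshot),
        "reference_action": traj[t].action,
    }

def grpo_advantages(rewards: list[float]) -> list[float]:
    """GRPO: sample a group of candidate actions per prompt, reward each
    (e.g., by matching the logged reference action), then normalize the
    rewards within the group instead of using a learned value model."""
    mu = sum(rewards) / len(rewards)
    std = max((sum((r - mu) ** 2 for r in rewards) / len(rewards)) ** 0.5, 1e-6)
    return [(r - mu) / std for r in rewards]
```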
25. Results on GUI automation and grounding
• UIShift outperforms models trained with large-scale task instructions.
26. Ablation study: GRPO vs. SFT
• GRPO is more suitable than SFT for the K-step GUI transition task.
27. Our Roadmap to the GUI Agent
• Starting from a capable vision-language model
• Step 1: Self-supervised reinforcement learning
• Large quantities of trajectory data, no labelled tasks
• Building a “world model” for GUIs
• Static data + offline RL
• Step 2: Task-guided reinforcement learning
• Small-scale, human-labelled tasks and trajectories
• Real testbeds + online RL
• Step 3: Online agent design
28. The changes LLM brings
Pre-LLM Era
LLM Era
App: fragmented tasks → a unified agent
OS: model-agnostic → LLM-native
H/W: CPU/GPU-centric → NPU-centric
How should the OS better serve and manage device-wide LLM requests?
29. LLM as an OS service
• LLM integrated into the OS as a system service
• Scales to an unbounded number of tasks
• Hardware-design-friendly
• The OS gains full visibility into LLM requests
[1] Source: https://developer.android.com/ai/gemini-nano
30. Challenges of LLMaaS
• Opening new research opportunities and challenges
• Usability: How to design the LLMaaS interface (a hypothetical sketch follows this list)? How to upgrade the LLM?
• [MobiCom’24] Mobile Foundation Model as Firmware
• [NeurIPS’25] Efficient LoRA Adaptation Across Large Language Model Upgrades
• Efficiency: How to schedule, batch, and cache/reuse system-wide LLM requests? How to manage LLM context states across apps?
• [MobiCom'25] Elastic On-Device LLM Service
• [SenSys'26] LLM as a System Service on Mobile Devices
• Security: How to protect app-owned LoRA adapters? How to isolate cross-app requests?
• Etc.
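As a purely hypothetical sketch of what an LLMaaS surface might look like (every name here, `QoS`, `LLMService`, and the methods, is an illustrative assumption, not Android AICore or any shipped API):

```python
from dataclasses import dataclass
from typing import Iterator

@dataclass
class QoS:
    deadline_ms: int   # latency target the OS scheduler tries to honor
    priority: int      # e.g., foreground app above background agent

class LLMService:
    """Hypothetical OS-level LLM service: apps submit requests, while the
    OS owns the weights, scheduling, batching, and per-app context state."""

    def open_session(self, app_id: str) -> int:
        """Create an isolated per-app context; returns a session handle."""
        ...

    def generate(self, session: int, prompt: str, qos: QoS) -> Iterator[str]:
        """Stream tokens; the OS may batch across apps and reuse prefix caches."""
        ...

    def attach_adapter(self, session: int, lora_path: str) -> None:
        """Load an app-owned LoRA adapter, isolated from other apps."""
        ...
```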
31. Challenges of LLMaaS (recap of the list above; next: Usability)
32. M4: a one-size-fits-all mobile MLLM
• Can one model (as an OS service) solve all mobile AI tasks?
• M4: an any-to-any-modality MLLM
• Tested on 50 mobile AI tasks; M4 outperforms prior art on most of them
[1] Jinliang Yuan, et al. “Mobile Foundation Model as Firmware”. In MobiCom’24.
33. Challenges of LLMaaS (recap; next: Usability under LLM upgrades)
34. LoRASuite: Towards LLM Upgradability
• Upgrading the base LLM without compromising its LoRA adapters (much)
• Base LLMs upgrade along different dimensions
• CKA-based layer mapping (sketched below)
• LoRASuite achieves comparable performance with only 1% of the finetuning data
[1] Yanan Li, et al. “LoRASuite: Efficient LoRA Adaptation Across Large Language Model Upgrades”. In NeurIPS’25.
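A minimal sketch of CKA-based layer mapping (illustrative; `linear_cka`, `map_layers`, and the greedy argmax matching are assumptions sketching the idea, not the LoRASuite implementation):

```python
import numpy as np

def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA similarity between activation matrices of two layers,
    collected on the same n calibration inputs: x is (n, d1), y is (n, d2)."""
    x = x - x.mean(axis=0)  # center each feature dimension
    y = y - y.mean(axis=0)
    num = np.linalg.norm(x.T @ y, "fro") ** 2
    den = np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro")
    return float(num / den)

def map_layers(old_acts: list[np.ndarray], new_acts: list[np.ndarray]) -> list[int]:
    """For each old-model layer, pick the most CKA-similar new-model layer,
    indicating where each LoRA adapter could migrate after an upgrade."""
    return [int(np.argmax([linear_cka(o, n) for n in new_acts]))
            for o in old_acts]
```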
35. Challenges of LLMaaS (recap; next: Efficiency)
36. Serving LLM requests with different QoS
• Key idea: joint planning of token pruning and model (weight) pruning (a plan-selection sketch follows)
• Different apps demand diversified QoS
• An offline-guided, joint plan of token pruning and weight pruning
• Significant improvement over static approaches
[1] Wangsong Yin, et al. “ELMS: Elasticized Large Language Models On Mobile Devices”. In MobiCom’25.
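A minimal sketch of QoS-aware plan selection (illustrative; `Plan`, its profiled fields, and `select_plan` are assumptions, not the ELMS implementation, which plans token and weight pruning jointly and offline):

```python
from dataclasses import dataclass

@dataclass
class Plan:
    token_keep: float   # fraction of prompt tokens kept
    weight_keep: float  # fraction of model width kept
    latency_ms: float   # profiled offline on the target SoC
    accuracy: float     # profiled offline on a calibration set

def select_plan(plans: list[Plan], budget_ms: float) -> Plan:
    """Pick the most accurate (token, weight) pruning plan that still
    meets the request's latency budget; degrade gracefully otherwise."""
    feasible = [p for p in plans if p.latency_ms <= budget_ms]
    if not feasible:
        return min(plans, key=lambda p: p.latency_ms)
    return max(feasible, key=lambda p: p.accuracy)
```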
37. The changes LLM brings
Pre-LLM Era
LLM Era
App: fragmented tasks → a unified agent
OS: model-agnostic → LLM-native
H/W: CPU/GPU-centric → NPU-centric
How to serve LLM requests with low latency and high energy efficiency?
38. On-device LLM needs an LLM processor
• On-device resource scarcity is further exacerbated.
(Figure: timeline of small language model releases, 2019.02 to 2024.09: GPT2, OPT, Bloom, Galactica, Pythia, Cerebras-GPT, LaMini-GPT, Qwen, Phi-1, Stablelm-zephyr, Phi-2, TinyLlama, MobileLLaMA, *MobileLLM, Gemma, Qwen 1.5, MobiLlama, MiniCPM, Stablelm-zephyr-2, OpenELM, Phi-3-mini, recurrentgemma, Qwen 2, *PanGu, DCLM, Gemma-2, SmolLM, danube3, Fox, MiniCPM3, Phi-3.5, Qwen 2.5.)
ResNet, YOLO, LSTM, etc. (<200M):
• <100ms to process one image
• <100MB memory footprint
• Easy to quantize (integer-only)
• Static shape and cost
vs. Small Language Models (1B~5B):
• >10 sec to process one prompt on CPU
• >1GB memory footprint
• Difficult to quantize (FP required)
• Dynamic shape, and cost increases with longer prompts
[1] Zhenyan Lu, et al. “Small Language Models: Survey, Measurements, and Insights”. In preprint’24.
39. On-device LLM needs NPU
• A DSA (an “LLM processor”) is the answer for on-device LLM.
• The gap between CPU/GPU and NPU grows over time
‒ Moore’s law still holds for NPUs
• The energy-efficiency gap is even larger
[1] Jinliang Yuan, et al. “Mobile Foundation Model as Firmware”. In MobiCom’24.
40. Filling the design gap between legacy NPUs and modern LLM inference
[ASPLOS’25] Fast On-device LLM Inference with NPUs
Code at https://github.com/UbiquitousLearning/mllm
41. llm.npu: accelerating LLM prefilling with NPU
• Legacy mobile NPUs have poor support for
(1) dynamic shapes; (2) FP operations; (3) group-level quantization
• llm.npu proposes
• Chunked prefill with partial sharing (sketched below)
• Shadow outlier execution across CPU/NPU
• Out-of-order scheduling between CPU and NPU
[1] Daliang Xu, et al. “Fast On-device LLM Inference with NPUs”. In ASPLOS’25.
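A minimal sketch of the chunked-prefill idea (illustrative; `npu_forward`, `kv_cache`, and the chunk size are assumptions, not the llm.npu code): the NPU graph is compiled once for a fixed chunk length, and a variable-length prompt is processed chunk by chunk, so only the padded tail wastes compute.

```python
CHUNK = 256  # fixed prompt-chunk length the NPU graph is compiled for

def chunked_prefill(token_ids: list[int], npu_forward, kv_cache) -> None:
    """Prefill a variable-length prompt on a static-shape NPU by splitting
    it into fixed-size chunks that append to a shared KV cache."""
    for start in range(0, len(token_ids), CHUNK):
        chunk = token_ids[start:start + CHUNK]
        n_valid = len(chunk)
        chunk = chunk + [0] * (CHUNK - n_valid)  # pad only the final chunk
        # Real tokens attend to the cached prefix plus earlier tokens in
        # this chunk; padded positions are masked out and their KV entries
        # discarded by the (assumed) runtime.
        npu_forward(chunk, kv_cache, n_valid=n_valid)
```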
42. Highlighted results
Prefill speed under different prompt lengths on different devices (dataset: LongBench 2WikiMQA multi-doc QA)
Baselines: MLC-LLM (GPU), llama.cpp (CPU), MNN (CPU), PowerInfer-v2 (NPU), TFLite (GPU)
7.3×–18.4× faster than CPU baselines, and 1.3×–43.6× faster than GPU baselines, at a prompt length of 1024
Achieves >1000 tokens/second on Qwen1.5-1.8B (for the first time)
[1] Daliang Xu, et al. “Fast On-device LLM Inference with NPUs”. In ASPLOS’25.
43.
44. Filling the design gap between legacy NPUs and modern LLM training
[USENIX ATC’24] FwdLLM: Efficient Federated Finetuning of
Large Language Models with Perturbed Inferences
Code at https://github.com/UbiquitousLearning/FwdLLM
45. FwdLLM: BP-free LLM finetuning
• Key idea: leveraging the forward gradient for LLM finetuning (sketched after this slide)
g(θ) = (∇f(θ) · v) v, where
• v is a random, independent perturbation of the same size as the trainable weights θ
• ∇f(θ) · v is the directional derivative of f at θ in direction v; computing it takes only a forward pass, no backpropagation
• g(θ) is an unbiased estimator of f’s gradient
• Compared to the BP approach: legacy-NPU-compatible, and more memory-efficient
• Further optimizations:
• Variance-controlled perturbation pacing
• Discriminative perturbation sampling
• Highlighted results: federated Llama-7B finetuning on devices, with significant speedup and memory savings
[1] Mengwei Xu, et al. “FwdLLM: Efficient Federated Finetuning of Large Language Models with Perturbed Inferences”. In ATC’24.
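A minimal sketch of one forward-gradient step via forward-mode autodiff (illustrative, assuming PyTorch's `torch.func`; not the FwdLLM code, and omitting its pacing and sampling optimizations):

```python
import torch
from torch.func import functional_call, jvp

def forward_gradient_step(model, params, batch, loss_fn, lr=1e-4):
    """One BP-free update: estimate the gradient from a single forward pass."""
    # Random, independent perturbation v with the same shape as each weight
    v = {k: torch.randn_like(p) for k, p in params.items()}

    def loss_of(p):
        logits = functional_call(model, p, (batch["input_ids"],))
        return loss_fn(logits, batch["labels"])

    # jvp returns (loss, directional derivative grad_f(theta) . v) using
    # forward-mode AD only; no backpropagation graph is stored.
    _, dir_deriv = jvp(loss_of, (params,), (v,))

    # Forward gradient g = (grad_f . v) v, an unbiased estimator of grad_f
    return {k: p - lr * dir_deriv * v[k] for k, p in params.items()}
```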
46. Looking into the future..
We shall probably look for hardware-software co-evolution, e.g., the “mortal computation” proposed by Geoffrey Hinton in “Two Paths to Intelligence”.
47. The Future: full-stack design!
https://innogyan.in/2024/10/28/die-shot-of-snapdragon-8-elite-reveals-component-space-allocation/
48. The Future: full-stack design!
• A one-size-fits-all LLM
• Agent workflow (large design space)
• A tiny kernel
• An LLM processor
• A dedicated HBM unit
Time to sacrifice flexibility for efficiency!
(we still keep flexibility at the agent-workflow level)
https://innogyan.in/2024/10/28/die-shot-of-snapdragon-8-elite-reveals-component-space-allocation/
49. Takeaways
• On-device LLM is reinventing mobile devices
• A total paradigm shift of the mobile AI ecosystem
• It calls for full-stack LLM research
• OS, runtime, model, and application (agent)
50.
51. THANKS