ChatGLM: An Alternative to ChatGPT

1. ChatGLM: An Alternative to ChatGPT Jie Tang KEG, Tsinghua University Slides available at: http://keg.cs.tsinghua.edu.cn/jietang/ or Google Jie Tang 1
2. What is ChatGLM • ChatGPT and GPT-4 have gained enormous popularity – However, the techniques behind GPT remain a secret • ChatGLM, an open-source ChatGPT alternative, aims to open up that secret – GLM-130B: an open-source LLM base model – ChatGLM-6B: a lightweight open-source ChatGPT alternative – ChatGLM-130B: not open-sourced, but available through API https://github.com/THUDM/GLM-130B https://github.com/THUDM/ChatGLM3 2
3. ChatGLM-6B: An Open-Source Alternative • ChatGLM-6B: 6.2B parameters; with INT4 quantization it needs only 6 GB of GPU memory • >600 open-source apps built on ChatGLM • >50,000 stars on GitHub • >10,000,000 downloads on Hugging Face • No. 1 on GitHub Trending (2 weeks) • No. 1 on Hugging Face Trending (2 weeks) https://github.com/THUDM/GLM-130B https://github.com/THUDM/ChatGLM3 3
4. ChatGPT vs. ChatGLM — GPT vs. GLM: ChatGPT vs. ChatGLM; DALL·E vs. CogView; Codex vs. CodeGeeX; WebGPT vs. WebGLM; GPT-4V vs. GLM-4V on the way (CogVLM, Agent…) 4
5. chatglm.ai — GLM, XDAI, GLM-130B, CodeGeeX, QAGLM, ChatGLM. Welcome to try. 5
6. Story generation 6
7. Applied Math 7
8. Coding 8
9. GLM-4V (pre-release) 9
10. “draw a dog with a hat” 10
11. Knowledge Reasoning Driven by Large Models
12. [Timeline] OpenAI's GPT lineage: GPT-1 (2018.6) → GPT-2, ~1B parameters (2019.2) → GPT-3 davinci, 100B scale (2020.5) → Codex / code-davinci-002, pre-trained on code data, and GitHub Copilot (2021.7) → WebGPT (RLHF, 2021.12) → InstructGPT (supervised FT + RLHF) → text-davinci-002 → text-davinci-003 (RLHF) → GPT-3.5 → ChatGPT (RLHF, 2022.11) → New Bing (GPT-4, 2023.2) → GPT-4 (2023.3.14). The recipe: 1. a 100B base model; 2. supervised FT; 3. RLHF.
13. [Timeline] OpenAI's GPT vs. THU & Zhipu AI's GLM. GPT side: GPT-1 → GPT-2 (billion-parameter model) → GPT-3 davinci, 100B → Codex / code-davinci-002 → GitHub Copilot → InstructGPT (supervised FT + RLHF) → text-davinci-002 / text-davinci-003 (RLHF) → WebGPT (RLHF) → GPT-3.5 → ChatGPT (RLHF) → New Bing (GPT-4) → GPT-4 (2023.3.14). GLM side: GLM-10B (ACL'22) → mGLM (multilingual) → GLM-130B, 100B scale (ICLR'23) → CodeGeeX (KDD'23; VS Code / JetBrains plugin) → QAGLM / WebGLM (KDD'23) → ChatGLM (SFT + RLHF) and ChatGLM-6B (SFT + RLHF, 2023.2) → VisualGLM / CogVLM (NeurIPS'21/22, ICLR'23).
14. General Language Model (GLM) Framework
– Autoregressive (GPT): NLU —, conditional generation —, unconditional generation ✓
– Autoencoding (BERT): NLU ✓, conditional generation ✗, unconditional generation ✗
– Encoder-Decoder (T5): NLU —, conditional generation ✓, unconditional generation —
– Autoregressive blank infilling (GLM): NLU ✓, conditional generation ✓, unconditional generation ✓
Du and Qian et al. All NLP Tasks are Generation Tasks. ACL'22. arXiv: 2103.10360
15. General Language Model (GLM)
16. General Language Model (GLM) Du and Qian et al. All NLP Tasks are Generation Tasks. ACL'22. arXiv: 2103.10360 16
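The core idea of GLM is autoregressive blank infilling: spans of the input are replaced with mask tokens, the corrupted text is read with bidirectional attention, and the masked spans are then generated autoregressively. Below is a minimal sketch of how such a training example could be constructed; the span sampling and the special-token names ([MASK], [S], [E]) are simplifications for illustration, not the released GLM preprocessing code.

```python
import random

def make_blank_infilling_example(tokens, mask_ratio=0.15,
                                 mask_token="[MASK]",
                                 start_token="[S]", end_token="[E]"):
    """Corrupt a token sequence for GLM-style autoregressive blank infilling.

    Part A (bidirectional context): the original tokens with a sampled span
    replaced by [MASK]. Part B (autoregressive targets): the masked span,
    prefixed by [S] and terminated by [E], generated left to right.
    """
    n = len(tokens)
    span_len = max(1, int(n * mask_ratio))
    start = random.randrange(0, n - span_len + 1)
    span = tokens[start:start + span_len]

    part_a = tokens[:start] + [mask_token] + tokens[start + span_len:]
    part_b = [start_token] + span + [end_token]
    # The model attends bidirectionally over part_a and autoregressively
    # predicts part_b, conditioned on part_a and previously generated tokens.
    return part_a, part_b

tokens = "GLM unifies NLU and generation in one framework".split()
print(make_blank_infilling_example(tokens))
```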
17. General Language Model (GLM) [Figure: LAMBADA results] Zeng, Liu, et al. GLM-130B: An Open Bilingual Pre-trained Model. ICLR'23 17
18. Results on Natural Language Understanding • Better than BERT, T5, RoBERTa 18
19. Results on Generation • Most importantly, one model can handle all of these tasks 19
20. Why 100B-scale model? • What is 16 mod 12? • "16 divided by 12 equals 1 remainder 4. So the answer is 4!" 1. J Wei, et al. Emergent Abilities of Large Language Models. arXiv: 2206.07682 [Figure: GPT-3 (OpenAI), LaMDA (Google)]
21. Why 100B-scale model? 1. J Wei, et al. Emergent Abilities of Large Language Models. arXiv: 2206.07682
22. Scaling Law — scaling introduces complex reasoning abilities. [Figure: performance vs. model scale (# parameters in billions)] 22
23. “Emergent abilities” • OpenAI: GPT-3 175B • Google: LaMDA 137B, PaLM 540B • Microsoft: Megatron-Turing NLG 530B • DeepMind: Gopher 280B Gif Credit: Google
24. How to train a 100B-scale LLM? • Eight months witnessed numerous challenges
o Engineering: How to train 100B-scale models from scratch? § Hygon DCU, NVIDIA A100, Ascend 910, Sunway § Frequent & random hardware failures, Megatron-DeepSpeed 3D pipeline, CUDA kernel efficiency, GPU memory overflow, 10K+ threads TCP init & comms…
o Algorithm: How to stabilize the training of 100B-scale models? § The gradient norms of embeddings, Post-LN / Pre-LN stability, dataloader state seeds, computation precision in Softmax / Attention
[Project timeline, 2021.12–2022.7: project conceived → system debugging → data preparation → algorithm/system and large-scale tests on Hygon, NVIDIA, Ascend, and Sunway → training amid stability issues → evaluations and quantization → to be continued]
http://keg.cs.tsinghua.edu.cn/glm-130b/ 24
25. Training Stability of 100B-Scale Models • Tradeoff: stability (slow) or efficiency (unstable) • Existing solutions – OPT-175B: manually adjust the LR & skip data when the loss collapses (performance drop) – BLOOM-176B: embedding norm & BF16 (performance drop, limited platform support) Sources: OPT / BLOOM / GLM-130B
26. GLM-130B: Training Stability • Attention score: compute the softmax in FP32 to avoid overflow. Attention scores grow large, exceeding FP16's range; downscaling the logits by a factor α and rescaling inside the (FP32) softmax leaves the result unchanged:
$$\operatorname{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d}}\right)=\operatorname{softmax}\!\left(\left(\frac{Q_i K_i^{\top}}{\alpha\sqrt{d}}-\max\!\left(\frac{Q_i K_i^{\top}}{\alpha\sqrt{d}}\right)\right)\times\alpha\right)$$
Zeng, Liu, et al. GLM-130B: An Open Bilingual Pre-trained Model. ICLR'23
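As a concrete illustration of the identity above, here is a minimal PyTorch sketch (not the GLM-130B kernels; the α value and function name are illustrative): the logits carry an extra 1/α factor, the softmax runs in FP32, and rescaling by α inside the shift-invariant softmax restores the original distribution.

```python
import torch

def stable_attention_probs(q, k, alpha=32.0):
    """softmax(q k^T / sqrt(d)) computed without FP16 overflow.

    The logits are built with an extra 1/alpha factor, shifted by their row
    max, cast to FP32, and multiplied back by alpha inside the softmax;
    because softmax is shift-invariant, the result equals the original
    softmax while intermediate values stay inside FP16's range.
    """
    d = q.shape[-1]
    scores = torch.matmul(q / (alpha * d ** 0.5), k.transpose(-1, -2))
    scores = scores.float()                                  # FP32 softmax
    scores = scores - scores.amax(dim=-1, keepdim=True)
    return torch.softmax(scores * alpha, dim=-1).to(q.dtype)

# In mixed-precision training q and k would be FP16 activations; FP32 tensors
# are used in this usage example only so it also runs on CPU.
q = torch.randn(2, 4, 16, 64)
k = torch.randn(2, 4, 16, 64)
probs = stable_attention_probs(q, k)
print(probs.shape, probs.sum(-1)[0, 0, 0])  # each row sums to 1
```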
27. GLM-130B: Training Stability • Embedding Layer Gradient Shrink (EGS): word_embedding = word_embedding * alpha + word_embedding.detach() * (1 - alpha) — embedding-layer gradients can be magnitudes larger than those of other layers. Zeng, Liu, et al. GLM-130B: An Open Bilingual Pre-trained Model. ICLR'23
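A minimal runnable sketch of the EGS trick (illustrative, not the GLM-130B training code): the mixed expression leaves the forward value unchanged, but only the first term carries gradient, so the embedding weights receive an alpha-scaled gradient.

```python
import torch
import torch.nn as nn

def shrink_embedding_grad(word_embedding, alpha=0.1):
    # Forward value is unchanged: alpha*x + (1-alpha)*x.detach() == x,
    # but only the first term carries gradient, so the embedding layer's
    # gradient is scaled by alpha.
    return word_embedding * alpha + word_embedding.detach() * (1 - alpha)

emb = nn.Embedding(1000, 32)
tokens = torch.randint(0, 1000, (4, 8))
out = shrink_embedding_grad(emb(tokens), alpha=0.1)
out.sum().backward()
print(emb.weight.grad.abs().sum())  # alpha times the un-shrunk gradient
```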
28. GLM-130B: Training Stability • The final training run of GLM-130B Zeng, Liu, et al. GLM-130B: An Open Bilingual Pre-trained Model. ICLR'23
29. GLM-130B Training Lessons • Zeng, Liu, et al. GLM-130B: An Open Bilingual Pre-trained Model. ICLR’23 • https://github.com/THUDM/GLM-130B
30. GLM-130B English: better than GPT-3/OPT/PaLM on MMLU, LAMBADA, BIG-bench-lite. Chinese: better than ERNIE 260B & YUAN 245B. Aug. 2022 – Mar. 2023: research-use requests from ~1,000 orgs in 70 countries • Google • Microsoft • Facebook • Stanford • MIT • UC Berkeley • CMU • Harvard • Princeton • Yale • Cornell • UIUC • Cambridge • Oxford • Huawei • Alibaba • Tencent • Baidu • Meituan • Bytedance • Didi • Xiaoice • Xiaodu • Xiaomi • Xiaopeng • Youdao • Face++ • Ping An Cap • Peking U. • Zhejiang U. • Shanghai JT U. • Fudan U. • USTC • U of CAS • Wuhan U. • Nankai U. • Hong Kong U. • CUHK • HKUST • BAAI • Zhejiang Lab • Shanghai AI Lab
31. GLM-130B in HELM Stanford's Holistic Evaluation of Language Models (HELM, Nov. 2022) https://crfm.stanford.edu/helm, 2023.03.08 Liang et al., Holistic Evaluation of Language Models. arXiv: 2211.09110
32. GLM-130B in HELM 1.Liang et al., Holistic Evaluation of Language Models. arXiv: 2211.09110
33. GLM-130B in HELM 1.Liang et al., Holistic Evaluation of Language Models. arXiv: 2211.09110
34. INT4 Quantization for RTX 3090s/2080s [Figure: GLM's INT4 weight quantization scaling law] 34
35. INT4 Quantization for RTX 3090s/2080s • GLM-130B INT4 quantization without performance degradation Zeng, Liu, et al. GLM-130B: An Open Bilingual Pre-trained Model. ICLR'23 35
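For intuition, a minimal sketch of symmetric per-row absmax weight quantization to the INT4 range; this illustrates the general technique, not GLM-130B's actual kernels, which quantize the linear-layer weights and dequantize them on the fly at inference time.

```python
import torch

def quantize_int4(weight):
    """Symmetric per-row absmax quantization of a 2-D weight matrix to the
    INT4 range [-7, 7]. Real INT4 kernels pack two values per byte; int8
    storage is used here purely for illustration."""
    scale = weight.abs().amax(dim=1, keepdim=True) / 7.0
    q = torch.clamp(torch.round(weight / scale), -7, 7).to(torch.int8)
    return q, scale.half()

def dequantize_int4(q, scale):
    # Weights are restored to floating point when the layer is applied.
    return q.float() * scale.float()

w = torch.randn(128, 256)
q, s = quantize_int4(w)
print("mean abs error:", (dequantize_int4(q, s) - w).abs().mean().item())
```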
36. GLM-130B vs. other 100B-scale models
– GPT-3 175B: GPT backbone, SSL only, no quantization, no acceleration, NVIDIA only
– OPT-175B: GPT backbone, SSL only, INT8, Megatron, NVIDIA only
– BLOOM-176B: GPT backbone, SSL only, INT8, Megatron, NVIDIA only
– GLM-130B: GLM backbone, SSL & MIP, INT8 / INT4, FasterTransformer, cross-platform (NVIDIA, Hygon DCU, Ascend 910, Sunway)
Effects:
• Performance improvement: BIG-bench-lite +5.2%, LAMBADA +2.3%, CLUE +24.3%, FewCLUE +12.8%
• Affordable serving: saves 75% GPU memory at inference; can infer on 4× RTX 3090 / 8× RTX 2080
• Fast inference: 7–8.4× faster than PyTorch; 2.5× faster than Megatron
• Diverse support: enables more diverse adoption of LLMs
36
37. Develop ChatGLM based on GLM-130B 37
38. Challenge 1: Simple NLP tasks vs. complex tasks • Simple NLP task => complex task (e.g., logical reasoning); the expected answers for the two reasoning examples are worked out in the sketch below
– QQP (paraphrase): Question1: “How is air traffic controlled?” Question2: “How do you become an air traffic controller?” Label: 0 (not duplicates). Prompt templates: “{Question1} {Question2} Pick one: These questions are duplicates or not duplicates.” / “I received the questions ‘{Question1}’ and ‘{Question2}’. Are they duplicates?” → {Choices[label]}
– XSum (summarization): Document: “The picture appeared on the wall of a Poundland store on Whymark Avenue...” Summary: “Graffiti artist Banksy is believed to be behind...” Prompt templates: “{Document} How would you rephrase that in a few words?” / “First, please read the article: {Document} Now, can you write me an extremely short abstract for it?” → {Summary}
– Math (GSM8K): Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
– Last-letter concatenation: Q: Take the last letters of the words in “Lady Gaga” and concatenate them.
38
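For reference, the two reasoning examples above work out as follows; this is a plain check of the expected answers, not model output.

```python
# GSM8K example: Roger starts with 5 balls and buys 2 cans of 3 balls each.
balls = 5 + 2 * 3
assert balls == 11

# Last-letter concatenation: take the last letter of each word and join them.
def last_letter_concat(phrase):
    return "".join(word[-1] for word in phrase.split())

assert last_letter_concat("Lady Gaga") == "ya"
print(balls, last_letter_concat("Lady Gaga"))
```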
39. Challenge 2: Static NLP vs. dynamic knowledge • GPT-3's knowledge can be limited, obsolete, and uninterpretable – Limited: limited long-tail knowledge • Example: What is the sixth highest mountain in the world? (correct answer: Cho Oyu) – Obsolete: GPT-3's knowledge is from before 2020 – Uninterpretable: no references for answers that require knowledge [Case studies: limited knowledge; obsolete knowledge] 39
40. Challenge 3: Traditional NLP vs. alignment with humans • Case study: “Explain the moon landing to a 6-year-old in a few sentences.” – Without proper “prompt engineering”, neither GPT-3 nor GLM-130B returns a satisfying answer 40
41. Develop ChatGLM based on GLM-130B — [Pipeline figure] GLM-130B (ACL'22, ICLR'23): text & code base model, ~1 TB data, 4096 context (2021.12) → augmenting with code via CodeGeeX (KDD'23), code→reasoning, ~0.4 TB, 4096 context → supervised fine-tuning → GLM-130B++ (GLM 3.5, 2022.09) → instruction following + RLHF → ChatGLM chat product (2022.12–2023.05); further augmented with web search via WebGLM (KDD'23) and image understanding via Visual-ChatGLM / CogVLM (NeurIPS'21/22, ICLR'23) 41
42. CodeGeeX • Has generated over 10M lines of code • 6B/13B parameters, 100+ programming languages • Supports both NVIDIA GPUs and Ascend 910A/B • Free VS Code and JetBrains plugins • Optimization: fused operators (LayerNorm / GELU / BatchMatMul / Add), auto-search for optimizing matrix multiplication, 257% performance improvement on Ascend 910A • Trained with over 1,000 Ascend 910A chips [Figure: comparison with code models from OpenAI, Huawei, DeepMind, Salesforce, Meta AI] 42
43. Relay Diffusion Model (RDM) https://github.com/THUDM/RelayDiffusion • RDM transforms a low-resolution image into an equivalent high-resolution one via blurring diffusion and block noise. • RDM achieves state-of-the-art FID on CelebA-HQ and sFID on ImageNet-256 (FID = 1.87)! 43
44. “draw a dog with a hat” 44
45. CogVLM • CogVLM connects a pretrained language model and an image encoder via a trainable visual expert module. ICLR'24 (submitted) 45
46. CogVLM achieves the best results on 10+ benchmarks 46
47. GLM-4V (pre-release) 47
48. WebGLM = GLM + Search Liu et al. WebGLM: Towards An Efficient Web-enhanced Question Answering System with Human Preference. KDD’23
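Conceptually, WebGLM wraps the generator in a retrieve-then-generate loop with a reference-aware prompt. A minimal sketch of that loop is below; the search, retrieve, and generate callables are hypothetical placeholders, not WebGLM's actual modules.

```python
def web_enhanced_answer(question, search, retrieve, generate, top_k=5):
    """Retrieve web references for a question and generate a cited answer.

    search(question)          -> list of candidate page snippets
    retrieve(question, cands) -> snippets ranked by relevance
    generate(prompt)          -> LLM completion
    All three callables are hypothetical placeholders for this sketch.
    """
    candidates = search(question)
    references = retrieve(question, candidates)[:top_k]
    context = "\n".join(f"[{i + 1}] {ref}" for i, ref in enumerate(references))
    prompt = (
        f"References:\n{context}\n\n"
        "Answer the question using the references above and cite them "
        "as [1], [2], ...\n"
        f"Question: {question}\nAnswer:"
    )
    return generate(prompt)
```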
49. LLM Agent
50. AgentTuning: Enabling Generalized Agent Abilities for LLMs • AgentInstruct: six agent-trajectory datasets with 1,866 high-quality CoT trajectories • AgentTuning: mix training with 20% AgentInstruct + 80% ShareGPT (a sampling sketch follows below) • Code & models: http://github.com/THUDM/AgentTuning 51
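A minimal sketch of the mix-training sampling described above (illustrative only; the toy data items and the batch-level 20/80 split are assumptions for this sketch, not the AgentTuning training code).

```python
import random

def mixed_batch(agent_instruct, sharegpt, batch_size=32, agent_ratio=0.2):
    """Sample a batch that is ~20% agent trajectories, ~80% general dialogue."""
    n_agent = round(batch_size * agent_ratio)
    batch = random.sample(agent_instruct, n_agent) \
          + random.sample(sharegpt, batch_size - n_agent)
    random.shuffle(batch)
    return batch

# Hypothetical toy data; real training mixes AgentInstruct trajectories
# with ShareGPT conversations at roughly this ratio.
agent_instruct = [f"agent_traj_{i}" for i in range(100)]
sharegpt = [f"dialogue_{i}" for i in range(1000)]
print(mixed_batch(agent_instruct, sharegpt)[:5])
```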
51. Main Results — [Figure: relative improvements of +176%, +76%, and +57%; significant improvement on in-domain tasks and good generalization on out-of-domain tasks] 52
52. ChatGLM-6B https://github.com/THUDM/ChatGLM3
• Download the model from Hugging Face: git clone https://huggingface.co/THUDM/chatglm3
• Download the demo: git clone https://github.com/THUDM/ChatGLM3 and cd ChatGLM3
• Web demo: pip install gradio, then python web_demo.py
• CLI demo: python cli_demo.py
• API server: pip install fastapi uvicorn, then python api.py
• Run ChatGLM on your own Mac (w/ Apple Silicon): model = AutoModel.from_pretrained("your local path", trust_remote_code=True).half().to('mps')
53
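Putting the steps above together, a minimal Python sketch of running the model locally, based on the conversational interface documented in the ChatGLM repositories; the exact checkpoint name and device call may differ for your setup.

```python
from transformers import AutoModel, AutoTokenizer

model_path = "THUDM/chatglm3-6b"   # or the local directory you cloned from Hugging Face
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).half()

model = model.cuda()       # NVIDIA GPU
# model = model.to('mps')  # Apple Silicon, as noted on the slide
model = model.eval()

# chat() is the conversational interface exposed by the ChatGLM model code
# loaded via trust_remote_code; it returns the reply and the updated history.
response, history = model.chat(tokenizer, "Hello! What can you do?", history=[])
print(response)
```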
53. Open LLM Research — https://github.com/THUDM [Figure: GitHub star counts across THUDM repositories: 35,471; 14,125; 7,315; 7,215; 4,850; 4,635; 3,541]
54. Bigmodel.ai — API Platform
• ChatGLM-Pro (powerful): 0.01 / 1,000 tokens — high quality; knowledge base, reasoning
• ChatGLM (flexible): 0.005 / 1,000 tokens — balanced quality and cost; news writing, abstract generation, vertical search
• ChatGLM-Lite (fast): 0.002 / 1,000 tokens — high speed, lower cost; chatting, customer service, classification, extraction
55
55. What’s the next? 56
56. Abstraction and Reasoning 1. Francois Chollet. On the Measure of Intelligence. 2019 57
57. Abstraction and Reasoning 58
58. Abstraction and Reasoning 59
59. Generative Agent • Generative agents: computational software agents that simulate believable human behavior – A “Westworld” with 25 agents; Auto-GPT; AgentGPT… 1. Joon Sung Park et al. Generative Agents: Interactive Simulacra of Human Behavior. 2023 60
61. Summary • GPT vs. GLM – ChatGPT vs. ChatGLM – DALL·E vs. CogView – Codex vs. CodeGeeX – WebGPT vs. WebGLM – GPT-4V vs. GLM-4V (CogVLM, AgentTuning…) • 2024: toward AGI 63
62. References
• Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Lei Shen, Zihan Wang, Andi Wang, Yang Li, Teng Su, Zhilin Yang, and Jie Tang. CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X. KDD'23.
• Xiao Liu, Hanyu Lai, Yu Hao, Yifan Xu, Aohan Zeng, Zhengxiao Du, Peng Zhang, Yuxiao Dong, and Jie Tang. WebGLM: Towards An Efficient Web-enhanced Question Answering System with Human Preference. KDD'23.
• Jing Zhang, Xiaokang Zhang, Daniel Zhang-Li, Jifan Yu, Zijun Yao, Zeyao Ma, Yiqi Xu, Haohua Wang, Xiaohan Zhang, Nianyi Lin, Sunrui Lu, Jie Tang, and Juanzi Li. GLM-Dialog: Noise-tolerant Pre-Training for Knowledge-grounded Dialogue Generation. KDD'23.
• Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen, Zhiyuan Liu, Peng Zhang, Yuxiao Dong, and Jie Tang. GLM-130B: An Open Bilingual Pre-trained Model. ICLR'23.
• Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers. ICLR'23.
• Ming Ding, Wendi Zheng, Wenyi Hong, and Jie Tang. CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers. NeurIPS'22.
• Jifan Yu, Xiaohan Zhang, Yifan Xu, Xuanyu Lei, Xinyu Guan, Jing Zhang, Lei Hou, Juanzi Li, and Jie Tang. XDAI: A Tuning-free Framework for Exploiting Pre-trained Language Models in Knowledge Grounded Dialogue Generation. KDD'22.
• Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. GLM: General Language Model Pretraining with Autoregressive Blank Infilling. ACL'22.
• Zixuan Ma, Jiaao He, Jiezhong Qiu, Huanqi Cao, Yuanwei Wang, Zhenbo Sun, Liyan Zheng, Haojie Wang, Shizhi Tang, Tianyu Zheng, Junyang Lin, Guanyu Feng, Zeqiang Huang, Jie Gao, Aohan Zeng, Jianwei Zhang, Runxin Zhong, Tianhui Shi, Sha Liu, Weimin Zheng, Jie Tang, Hongxia Yang, Xin Liu, Jidong Zhai, and Wenguang Chen. BaGuaLu: Targeting Brain Scale Pretrained Models with over 37 Million Cores. PPoPP'22.
• Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, and Jie Tang. CogView: Mastering Text-to-Image Generation via Transformers. NeurIPS'21.
• Junyang Lin, Rui Men, An Yang, Chang Zhou, Yichang Zhang, Peng Wang, Jingren Zhou, Jie Tang, and Hongxia Yang. M6: Multi-Modality-to-Multi-Modality Multitask Mega-transformer for Unified Pretraining. KDD'21.
64
63. Thank you! Many many collaborators from Tsinghua and Zhipu AI! https://github.com/THUDM/ 65
