ChatGLM: An Alternative to ChatGPT
1. ChatGLM: An Alternative to ChatGPT
Jie Tang
KEG, Tsinghua University
Slides available at: http://keg.cs.tsinghua.edu.cn/jietang/
or Google Jie Tang
2. What is ChatGLM
• ChatGPT and GPT-4 have gained enormous popularity
– However, the techniques behind GPT remain a secret to everyone else
• ChatGLM, an open-source ChatGPT alternative, works toward opening up that secret
– GLM-130B: an open-source LLM base model
– ChatGLM-6B: a lightweight open-source ChatGPT alternative
– ChatGLM-130B: not open-sourced, but available through an API
https://github.com/THUDM/GLM-130B
https://github.com/THUDM/ChatGLM3
3. ChatGLM-6B: An Open-Source Alternative
• ChatGLM-6B: 6.2B parameters, INT4 quantization (needs only 6 GB of memory)
• >600 open-source apps developed based on ChatGLM
• >50,000 stars on GitHub
• >10,000,000 downloads on Hugging Face
• No. 1 on GitHub Trending (2 weeks)
• No. 1 on Hugging Face Trending (2 weeks)
https://github.com/THUDM/GLM-130B
https://github.com/THUDM/ChatGLM3
4. ChatGPT vs. ChatGLM
ChatGPT   vs.  ChatGLM
DALL·E    vs.  CogView
Codex     vs.  CodeGeeX
WebGPT    vs.  WebGLM
GPT-4V    vs.  GLM-4V on the way (CogVLM, Agent…)
5. chatglm.ai
[chatglm.ai products: GLM, GLM-130B, CodeGeeX, XDAI, QAGLM, ChatGLM]
Welcome to try!
6. Story generation
7. Applied Math
8. Coding
9. GLM-4V (pre-release)
10. “draw a dog with a hat”
11. Knowledge Reasoning Driven by Large Models
12. GPT
[Timeline figure: the GPT lineage]
– GPT-1 (2018.6) → GPT-2, ~1B (2019.2) → GPT-3 davinci, 100B scale (2020.5)
– Codex / code-davinci-002, pre-trained on code data (2021.7) → GitHub Copilot
– WebGPT (RLHF, 2021.12); InstructGPT (Supervised FT + RLHF) → text-davinci-002 → text-davinci-003 (RLHF) → GPT-3.5
– ChatGPT (RLHF, 2022.11) → New Bing (GPT-4, 2023.2) → GPT-4 (2023.3.14)
Three key steps: 1. a 100B base model; 2. Supervised FT; 3. RLHF
13. OpenAI’s GPT vs. THU & Zhipu AI’s GLM
[Timeline figure comparing the two lineages, 2018.6–2023.3.14]
– OpenAI’s GPT: GPT-1 → GPT-2 (billion-parameter model) → GPT-3 davinci (100B) → Codex / code-davinci-002 → GitHub Copilot; WebGPT (RLHF) → InstructGPT (Supervised FT + RLHF) / text-davinci-002 → text-davinci-003 (RLHF) / GPT-3.5 → ChatGPT (RLHF) → New Bing (GPT-4), GPT-4
– THU & Zhipu AI’s GLM: GLM-10B (ACL’22) → mGLM (multi-lingual), GLM-130B (100B, ICLR’23) → CodeGeeX (KDD’23) with the VS Code / JetBrains CodeGeeX plugin, QAGLM, WebGLM (KDD’23), VisualGLM / CogVLM (NeurIPS’21/22, ICLR’23) → ChatGLM (SFT + RLHF) and ChatGLM-6B (SFT + RLHF)
14. General Language Model (GLM)
Framework                              NLU   Cond. Gen.   Uncond. Gen.
Autoregressive (GPT)                   —     —            √
Autoencoding (BERT)                    √     ×            ×
Encoder-Decoder (T5)                   —     √            —
Autoregressive Blank-Infilling (GLM)   √     √            √
Du and Qian et al. All NLP Tasks are Generation Tasks. ACL’22. arxiv: 2103.10360
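To make autoregressive blank infilling concrete, here is a minimal Python sketch of how a training example can be formed: a span of the input becomes a [MASK] in Part A, and the model learns to generate that span autoregressively in Part B. The helper name and the [MASK]/[sop]/[eop] token strings are illustrative, not the actual GLM preprocessing code.

def blank_infilling_example(tokens, span, mask="[MASK]", sop="[sop]", eop="[eop]"):
    # Part A: the input with the chosen span replaced by a single [MASK] token.
    # Part B: the masked span itself, generated autoregressively after [sop].
    i, j = span
    part_a = tokens[:i] + [mask] + tokens[j:]
    part_b = [sop] + tokens[i:j] + [eop]
    return part_a, part_b

tokens = "the quick brown fox jumps over the lazy dog".split()
part_a, part_b = blank_infilling_example(tokens, span=(2, 5))
print(part_a)  # ['the', 'quick', '[MASK]', 'over', 'the', 'lazy', 'dog']
print(part_b)  # ['[sop]', 'brown', 'fox', 'jumps', '[eop]']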
15. General Language Model (GLM)
16. General Language Model (GLM)
Du and Qian et al. All NLP Tasks are Generation Tasks. ACL’22. arXiv: 2103.10360
17. General Language Model (GLM)
[Results on LAMBADA]
Zeng, Liu, et al. GLM-130B: An Open Bilingual Pre-trained Model. ICLR’23
18. Results on Natural Language Understanding
• Better than BERT, T5, RoBERTa
19. Results on Generation
• Most importantly, a single model can handle all of these tasks
20. Why 100B-scale model?
• What is 16 mod 12?
• 16 divided by 12 equals 1 with a remainder of 4. So the answer is 4!
1. J Wei, et al. Emergent Abilities of Large Language Models. arXiv: 2206.07682
GPT-3 (OpenAI)    LaMDA (Google)
21. Why 100B-scale model?
1. J Wei, et al. Emergent Abilities of Large Language Models. arXiv: 2206.07682
22. Scaling Law
Scaling up models introduces complex reasoning abilities
[Figure: performance vs. model scale (# parameters in billions)]
23. “Emergent abilities”
• OpenAI: GPT-3 175B
• Google: LaMDA 137B, PaLM 540B
• Microsoft: Megatron-Turing NLG 530B
• DeepMind: Gopher 280B
GIF credit: Google
24. How to train a 100B–scale LLM?
• Eight months of training witnessed numerous challenges
o Engineering: how do we train 100B-scale models from scratch?
§ Hygon DCU, NVIDIA A100, Ascend 910, Sunway
§ Frequent and random hardware failures, Megatron-DeepSpeed 3D pipeline, CUDA kernel efficiency, GPU memory overflow, 10K+ threads for TCP init & comms…
o Algorithm: how do we stabilize the training of 100B-scale models?
§ Gradient norms of embeddings, Post-LN / Pre-LN stability, dataloader state seeds, computation precision in Softmax / Attention
[Project timeline: project conceived (2021.12) → system debugging (2022.1) → data preparation, algorithm/system tests on Hygon & NVIDIA (2022.2–2022.3) → large-scale tests on Ascend & Sunway (2022.4) → training (2022.5) → stability issues (2022.6) → evaluations & quantization (2022.7) → to be continued]
http://keg.cs.tsinghua.edu.cn/glm-130b/
25. Training Stability of 100B-Scale Models
• Trade-off: stability (slow) vs. efficiency (unstable)
• Existing solutions
– OPT-175B: manually adjusts the LR & skips data when training collapses (performance drop)
– BLOOM-176B: embedding norm & BF16 (performance drop, few platforms)
Sources: OPT / BLOOM / GLM-130B
26. GLM-130B: Training Stability
• Attention score: compute the Softmax in FP32 to avoid overflow
$$\text{softmax}\left(\frac{Q_i K_i^{\top}}{\sqrt{d}}\right) = \text{softmax}\left(\left(\frac{Q_i K_i^{\top}}{\alpha\sqrt{d}} - \max\left(\frac{Q_i K_i^{\top}}{\alpha\sqrt{d}}\right)\right) \times \alpha\right)$$

The left-hand side computes the softmax in FP32; the equivalent right-hand side scales the scores by α and subtracts the row-wise max so the intermediate values stay within FP16’s range.
Attention scores grow large, exceeding FP16’s range.
Zeng, Liu, et al. GLM-130B: An Open Bilingual Pre-trained Model. ICLR’23
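A minimal PyTorch sketch of the idea, assuming a placeholder scaling constant alpha and a generic attention layout; it combines the scaled FP16 scores with a softmax executed in FP32, and is not GLM-130B’s actual kernel.

import torch

def attention_probs_fp32_softmax(q, k, alpha=32.0):
    # q, k: FP16 tensors of shape (..., seq_len, head_dim); alpha is illustrative.
    d = q.size(-1)
    # Scaled scores stay in FP16, divided by an extra alpha to keep them in range.
    scores = torch.matmul(q, k.transpose(-2, -1)) / (alpha * d ** 0.5)
    # Subtracting the row-wise max and multiplying back by alpha recovers the
    # original (shifted) scores; softmax is invariant to the shift.
    scores = (scores - scores.amax(dim=-1, keepdim=True)) * alpha
    # Only the softmax itself runs in FP32 to avoid overflow.
    return torch.softmax(scores.float(), dim=-1).to(q.dtype)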
27. GLM-130B: Training Stability
• Embedding Layer Gradient Shrink (EGS)
word_embedding = word_embedding * alpha + word_embedding.detach() * (1 - alpha)
Embedding-layer gradients can be orders of magnitude larger than those of other layers
Zeng, Liu, et al. GLM-130B: An Open Bilingual Pre-trained Model. ICLR’23
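The one-liner above can be wrapped into a module as in the following sketch; the class name and alpha = 0.1 are assumptions for illustration rather than GLM-130B’s exact code.

import torch.nn as nn

class EmbeddingGradShrink(nn.Module):
    # A sketch of Embedding Gradient Shrink (EGS); alpha = 0.1 is an assumed value.
    def __init__(self, vocab_size, hidden_size, alpha=0.1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.alpha = alpha

    def forward(self, input_ids):
        emb = self.embedding(input_ids)
        # The forward value is unchanged; only the gradient that reaches the
        # embedding table is scaled down by alpha.
        return emb * self.alpha + emb.detach() * (1 - self.alpha)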
28. GLM-130B: Training Stability
• The final training run of GLM-130B
Zeng, Liu, et al. GLM-130B: An Open Bilingual Pre-trained Model. ICLR’23
29. GLM-130B Training Lessons
• Zeng, Liu, et al. GLM-130B: An Open Bilingual Pre-trained Model. ICLR’23
• https://github.com/THUDM/GLM-130B
30. GLM-130B
English: better than GPT-3 / OPT / PaLM on MMLU, LAMBADA, and BIG-bench-lite
Chinese: better than ERNIE 260B & YUAN 245B
Aug. 2022 – Mar. 2023: research-use requests from ~1,000 organizations in 70 countries
• Google
• Microsoft
• Facebook
• Stanford
• MIT
• UC Berkeley
• CMU
• Harvard
• Princeton
• Yale
• Cornell
• UIUC
• Cambridge
• Oxford
• Huawei
• Alibaba
• Tencent
• Baidu
• Meituan
• Bytedance
• Didi
• Xiaoice
• Xiaodu
• Xiaomi
• Xiaopeng
• Youdao
• Face++
• Ping An Cap
•Peking U.
•Zhejiang U.
•Shanghai JT U.
•Fudan U.
•USTC
•U of CAS
•Wuhan U.
•Nankai U.
•Hong Kong U.
•CUHK
•HKUST
•BAAI
•Zhejiang Lab
•Shanghai AI Lab
31. GLM-130B in HELM
Stanford’s Holistic Evaluation of Language Models (HELM, Nov. 2022)
https://crfm.stanford.edu/helm (accessed 2023.03.08)
Liang et al., Holistic Evaluation of Language Models. arXiv: 2211.09110
32. GLM-130B in HELM
1.Liang et al., Holistic Evaluation of Language Models. arXiv: 2211.09110
33. GLM-130B in HELM
1.Liang et al., Holistic Evaluation of Language Models. arXiv: 2211.09110
34. INT4 Quantization for RTX 3090s/2080s
GLM’s INT4 Weight Quantization Scaling Law
35. INT4 Quantization for RTX 3090s/2080s
• GLM-130B INT4 quantization without performance degradation
Zeng, Liu, et al. GLM-130B: An Open Bilingual Pre-trained Model. ICLR’23
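As a rough illustration of weight-only INT4 quantization, here is a minimal sketch of symmetric per-row absmax quantization and dequantization in PyTorch; the function names are hypothetical and this is not GLM-130B’s actual quantization kernel.

import torch

def quantize_int4_absmax(weight):
    # Symmetric per-output-row absmax quantization to 4-bit integers in [-7, 7].
    scale = weight.abs().amax(dim=1, keepdim=True) / 7.0
    q = torch.clamp(torch.round(weight / scale), -7, 7).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    # Recover approximate FP16 weights for the matmul at inference time.
    return (q.to(torch.float32) * scale).half()

w = torch.randn(1024, 1024)
q, scale = quantize_int4_absmax(w)
w_hat = dequantize(q, scale)
print((w - w_hat.float()).abs().mean())  # small reconstruction error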
36. GLM-130B
              Backbone   Training Objective   Quantization   Acceleration        Cross-Platform
GPT-3 175B    GPT        SSL only             —              —                   NVIDIA
OPT-175B      GPT        SSL only             INT8           Megatron            NVIDIA
BLOOM-176B    GPT        SSL only             INT8           Megatron            NVIDIA
GLM-130B      GLM        SSL & MIP            INT8 / INT4    FasterTransformer   NVIDIA, Hygon DCU, Ascend 910, Sunway

Performance improvement (effects):
• BIG-bench-lite: +5.2%
• LAMBADA: +2.3%
• CLUE: +24.3%
• FewCLUE: +12.8%
Affordable serving: saves 75% of GPU memory in inference; can run inference on 4× RTX 3090 / 8× RTX 2080
Fast inference: 7–8.4× faster than PyTorch; 2.5× faster than Megatron
Diverse support: enables more diverse adoptions of LLMs
37. Develop ChatGLM based on GLM-130B
38. Challenge 1: Simple NLP task vs. Complex task
• Simple NLP task => Complex task (e.g., logic reasoning)
QQP (Paraphrase)
Question1: How is air traffic controlled?
Question2: How do you become an air traffic controller?
Label: 0
Prompt templates: “{Question1} {Question2} Pick one: These questions are duplicates or not duplicates.” / “I received the questions "{Question1}" and "{Question2}". Are they duplicates?”
Target: {Choices[label]}

XSum (Summary)
Document: The picture appeared on the wall of a Poundland store on Whymark Avenue...
Summary: Graffiti artist Banksy is believed to be behind...
Prompt templates: “{Document} How would you rephrase that in a few words?” / “First, please read the article: {Document} Now, can you write me an extremely short abstract for it?”
Target: {Summary}

Math (GSM8K)
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?

Last Letter Concatenation
Q: Take the last letters of the words in “Lady Gaga” and concatenate them.
39. Challenge 2: Static NLP vs. Dynamic knowledge
• GPT-3’s knowledge can be limited, obsolete, and uninterpretable
– Limited: Limited long-tailed knowledge
• Example: what is the sixth-highest mountain in the world? (Answer: Cho Oyu)
– Obsolete: GPT-3’s knowledge cutoff is before 2020
– Uninterpretable: no references for answers that require knowledge
Case Study: Limited Knowledge
Case Study: Obsolete Knowledge
40. Challenge 3: Traditional NLP vs. Aligning with Humans
• Case study: “Explain the moon landing to a 6-year-old in a few sentences.”
– Without proper “prompt engineering”, neither GPT-3 nor GLM-130B returns a satisfying answer
41. Develop ChatGLM based on GLM-130B
[Pipeline figure: from GLM-130B to ChatGLM]
– GLM-130B (ACL’22, ICLR’23): text & code base model (~1 TB, 4096), 2021.12
– + CodeGeeX (KDD’23) code data (~0.4 TB, 4096): code → reasoning, yielding GLM-130B++ (GLM 3.5)
– + Supervised fine-tuning (instruction following) and RLHF → ChatGLM (chat product)
– + WebGLM (KDD’23): combining search (web)
– + Visual-ChatGLM (NeurIPS’21/22, ICLR’23): multi-modal, image understanding
Augmenting with code, alignment, web, and image understanding…
Milestones on the figure: 2021.12, 2022.09, 2022.12, 2023.05
42. CodeGeeX
• Has generated over 10 million lines of code
• 6B / 13B parameters, 100+ programming languages
• Supports both NVIDIA GPUs and Ascend 910A/B
• Free VS Code and JetBrains plugins
[Figure: comparison of CodeGeeX with code models from OpenAI, Huawei, DeepMind, Salesforce, and Meta AI]
• Optimization: operators (LayerNorm / GELU / BatchMatmul / Add); auto-search for optimizing matrix multiplication
• Performance: 257% improvement on Ascend 910A
• Trained with over 1,000 Ascend 910A accelerators
43. Relay Diffusion Model (RDM)
https://github.com/THUDM/RelayDiffusion
• RDM transforms a low-resolution image into an equivalent high-resolution one via blurring diffusion and block noise.
• RDM achieves state-of-the-art FID on CelebA-HQ and sFID on ImageNet-256 (FID = 1.87)!
44. “draw a dog with a hat”
45. CogVLM
• CogVLM connects a pretrained language model and an image encoder via a trainable visual expert module
ICLR’24 (submitted)
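A single-head sketch of the visual-expert idea described above, assuming image tokens get their own trainable QKV projection while text tokens reuse the pretrained language model’s projection; names and shapes are illustrative, not CogVLM’s implementation.

import torch
import torch.nn as nn

class VisualExpertAttention(nn.Module):
    # Single-head sketch: image tokens use a trainable QKV projection (the
    # "visual expert"), text tokens use the pretrained LM's projection.
    def __init__(self, hidden):
        super().__init__()
        self.text_qkv = nn.Linear(hidden, 3 * hidden)   # from the language model
        self.image_qkv = nn.Linear(hidden, 3 * hidden)  # trainable visual expert

    def forward(self, x, is_image):
        # x: (batch, seq, hidden); is_image: (batch, seq, 1) boolean mask.
        qkv = torch.where(is_image, self.image_qkv(x), self.text_qkv(x))
        q, k, v = qkv.chunk(3, dim=-1)
        scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
        return torch.softmax(scores, dim=-1) @ v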
46. CogVLM
Achieves the best results on 10+ benchmarks
47. GLM-4V (pre-release)
48. WebGLM = GLM + Search
Liu et al. WebGLM: Towards An Efficient Web-enhanced Question Answering System with Human Preference. KDD’23
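A minimal sketch of the retrieval-augmented pipeline the slide title suggests (web search, reference retrieval, then generation conditioned on the references); the callables search, retrieve, and generate are placeholders, not WebGLM’s actual API.

def webglm_style_answer(question, search, retrieve, generate, top_k=5):
    # Web search -> reference retrieval -> generation conditioned on references.
    pages = search(question)                        # candidate web pages
    references = retrieve(question, pages)[:top_k]  # most relevant snippets
    prompt = "\n".join(f"[{i + 1}] {ref}" for i, ref in enumerate(references))
    prompt += f"\n\nQuestion: {question}\nAnswer (cite references by number):"
    return generate(prompt)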
49. LLM Agent
50. AgentTuning: Enabling Generalized Agent Abilities for LLMs
• Six AgentInstruct trajectory datasets
– 1,866 high-quality CoTs
• AgentTuning: mix-training (see the sketch below)
– 20% AgentInstruct + 80% ShareGPT
Code & models: http://github.com/THUDM/AgentTuning
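A minimal sketch of what the 20%/80% mix can look like at the data-sampling level; the function and the list-shaped datasets are illustrative, not the AgentTuning training code.

import random

def mixed_examples(agent_instruct, sharegpt, agent_ratio=0.2):
    # Draw each training example from AgentInstruct with probability 0.2
    # and from ShareGPT with probability 0.8 (the mix described above).
    while True:
        pool = agent_instruct if random.random() < agent_ratio else sharegpt
        yield random.choice(pool)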
51. Main Results
[Results figure: +176%, +76%, +57%; significant improvement on the in-domain distribution and good generalization to the out-of-domain distribution]
52. ChatGLM-6B
https://github.com/THUDM/ChatGLM3
• Download the model from Hugging Face
– git clone https://huggingface.co/THUDM/chatglm3
• Download the demo
– git clone https://github.com/THUDM/ChatGLM3
– cd ChatGLM3
• Install and run the web demo
– pip install gradio
– python web_demo.py
• Run the CLI demo
– python cli_demo.py
• Install and run the API
– pip install fastapi uvicorn
– python api.py
• Run ChatGLM on your own Mac (w/ Apple Silicon)
– model = AutoModel.from_pretrained("your local path", trust_remote_code=True).half().to('mps')
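For reference, a minimal sketch of loading the model and chatting with it through transformers, following the repository’s documented usage; the model id THUDM/chatglm3-6b and the chat() helper come from the model’s remote code, so treat the exact names as assumptions.

from transformers import AutoModel, AutoTokenizer

# trust_remote_code pulls in ChatGLM's own modeling code from the hub.
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True).half().cuda()
model = model.eval()

# chat() is provided by the remote code, not by transformers itself.
response, history = model.chat(tokenizer, "Hello, what can you do?", history=[])
print(response)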
53. #stars
[Bar chart: GitHub star counts of THUDM open LLM repositories: 35,471; 14,125; 7,315; 7,215; 4,850; 4,635; 3,541]
Open LLM Research: https://github.com/THUDM
54. Bigmodel.ai: API Platform
• ChatGLM-Pro (Powerful): 0.01 / 1,000 tokens. High quality; knowledge base, reasoning.
• ChatGLM (Flexible): 0.005 / 1,000 tokens. Balanced effect and cost; news writing, abstract generation, vertical search.
• ChatGLM-Lite (Fast): 0.002 / 1,000 tokens. High speed, lower cost; chatting, customer service, classification, extraction.
55. What’s next?
56. Abstraction and Reasoning
1. Francois Chollet. On the Measure of Intelligence. 2019
57. Abstraction and Reasoning
58. Abstraction and Reasoning
59. Generative Agent
• Generative agents: computational software agents that simulate
believable human behavior
– A “Westworld” with 25 agents; Auto-GPT; AgentGPT…
1. Joon Sung Park et al. Generative Agents: Interactive Simulacra of Human Behavior. 2023
60.
61. Summary
• GPT vs. GLM
– ChatGPT vs. ChatGLM
– DALL·E vs. CogView
– Codex vs. CodeGeeX
– WebGPT vs. WebGLM
– GPT-4V vs. GLM-4V (CogVLM, AgentTuning…)
• 2024: toward AGI
62. References
• Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Lei Shen, Zihan Wang, Andi Wang, Yang Li, Teng Su, Zhilin Yang, and Jie Tang. CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X. KDD’23.
• Xiao Liu, Hanyu Lai, Yu Hao, Yifan Xu, Aohan Zeng, Zhengxiao Du, Peng Zhang, Yuxiao Dong, and Jie Tang. WebGLM: Towards An Efficient Web-enhanced Question Answering System with Human Preference. KDD’23.
• Jing Zhang, Xiaokang Zhang, Daniel Zhang-Li, Jifan Yu, Zijun Yao, Zeyao Ma, Yiqi Xu, Haohua Wang, Xiaohan Zhang, Nianyi Lin, Sunrui Lu, Jie Tang, and Juanzi Li. GLM-Dialog: Noise-tolerant Pre-Training for Knowledge-grounded Dialogue Generation. KDD’23.
• Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen, Zhiyuan Liu, Peng Zhang, Yuxiao Dong, and Jie Tang. GLM-130B: An Open Bilingual Pre-trained Model. ICLR’23.
• Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers. ICLR’23.
• Ming Ding, Wendi Zheng, Wenyi Hong, and Jie Tang. CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers. NeurIPS’22.
• Jifan Yu, Xiaohan Zhang, Yifan Xu, Xuanyu Lei, Xinyu Guan, Jing Zhang, Lei Hou, Juanzi Li, and Jie Tang. XDAI: A Tuning-free Framework for Exploiting Pre-trained Language Models in Knowledge Grounded Dialogue Generation. KDD’22.
• Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. GLM: General Language Model Pretraining with Autoregressive Blank Infilling. ACL’22.
• Zixuan Ma, Jiaao He, Jiezhong Qiu, Huanqi Cao, Yuanwei Wang, Zhenbo Sun, Liyan Zheng, Haojie Wang, Shizhi Tang, Tianyu Zheng, Junyang Lin, Guanyu Feng, Zeqiang Huang, Jie Gao, Aohan Zeng, Jianwei Zhang, Runxin Zhong, Tianhui Shi, Sha Liu, Weimin Zheng, Jie Tang, Hongxia Yang, Xin Liu, Jidong Zhai, and Wenguang Chen. BAGUALU: Targeting Brain Scale Pretrained Models with over 37 Million Cores. PPoPP’22.
• Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, and Jie Tang. CogView: Mastering Text-to-Image Generation via Transformers. NeurIPS’21.
• Junyang Lin, Rui Men, An Yang, Chang Zhou, Yichang Zhang, Peng Wang, Jingren Zhou, Jie Tang, and Hongxia Yang. M6: Multi-Modality-to-Multi-Modality Multitask Mega-transformer for Unified Pretraining. KDD’21.
63. Thank you!
Many many collaborators from Tsinghua and Zhipu AI!
https://github.com/THUDM/