1. Don’t teach. Incentivize.
MIT EI seminar
Hyung Won Chung
OpenAI
2. Non-goal: share specific technical knowledge and experimental results
Goal: share how I think, with AI as a running example
3. Why?
We, the technical people, focus too much on problem solving itself
In my view, more attention should go to finding great problems to solve
Great researchers are good at finding impactful problems. I think this
ability comes from having the right perspective.
I hope this talk sparks interest in developing original perspectives,
which in turn help you find better problems to solve
4. Outline
Build the scale-first perspective for AI research in general
Interpret Large Language Models with this perspective
5. [Figure from Rich Sutton: growth of compute toward brain-scale; roughly 10x more compute every 5 years]
6. Hardware is exponentially progressing
Software and algorithms should catch up
We need more scalable methods that can better leverage computation
7. The job of AI researchers is to teach machines how to “think”
One (unfortunately common) approach
Teach the machines how we think we think
But we don’t know how we think at the neuron level
So we are teaching what we don’t fully understand, in the limited language of mathematics
This approach imposes structure on the problem, which can become the limitation when
scaled up
8. Bitter lesson
Progress of AI in the past 70 years boils down to
● Develop progressively more general methods with less structure
● Add more data and computation (i.e. scale up)
http://www.incompleteideas.net/IncIdeas/BitterLesson.html
9. The more structure imposed by humans, the less scalable the method is
[Figure: performance vs. compute, for methods with less structure and with more structure]
10. Sobering observation
Clever structures imposed by human researchers typically become the bottleneck
when scaled up
What is good in the long run almost necessarily looks bad in the short term
Compute is getting cheaper faster than we are becoming better researchers
Give machines more degrees of freedom. Let them choose how they learn
11. Why are these observations not so obvious?
Researchers want to add modeling ideas because that is academically
more satisfying
Some people think “just scaling up” is not scientific or interesting
12. Ultimately what do we want to achieve with artificial intelligence?
We should focus on:
maximizing the value generated by AI while minimizing the
downside
regardless of which academic discipline achieves the goal
13. HWC’s definition of scaling
Common definition: doing the same thing with more machines
14. HWC’s definition of scaling
Common definition: doing the same thing with more machines
Scaling implicitly involves identifying the modeling assumption that
bottlenecks further scaling and replacing it with a more scalable one
15. Large Language Models (LLMs)
16. All LLMs so far use the Transformer architecture
17. Let’s take a “functional” viewpoint on the Transformer
Sequence-to-sequence mapping with a bunch of matmuls
Input: [d, n]
Output: [d, n]
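To make this functional viewpoint concrete, here is a minimal NumPy sketch (my own illustration, not code from the talk) of a single Transformer layer as a pure [d, n] -> [d, n] function built from matmuls plus a softmax. It is single-headed and omits layer norm, causal masking, and other details.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def transformer_layer(x, Wq, Wk, Wv, Wo, W1, W2):
    """Map a [d, n] array to a [d, n] array using only matmuls and a softmax."""
    q, k, v = Wq @ x, Wk @ x, Wv @ x                    # three [d, n] projections
    scores = softmax((q.T @ k) / np.sqrt(x.shape[0]))   # [n, n] attention weights
    h = x + Wo @ (v @ scores.T)                         # attention output + residual
    return h + W2 @ np.maximum(0.0, W1 @ h)             # feed-forward + residual

d, n = 8, 13
x = np.random.randn(d, n)
Wq, Wk, Wv, Wo = (np.random.randn(d, d) for _ in range(4))
W1, W2 = np.random.randn(4 * d, d), np.random.randn(d, 4 * d)
print(transformer_layer(x, Wq, Wk, Wv, Wo, W1, W2).shape)  # (8, 13): same [d, n] shape in and out
```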
18. Process
Input text: “Many words don't map to one token: indivisible.”   Shape: []
19. Process
Input text: “Many words don't map to one token: indivisible.”   Shape: []
Tokenization → [7085, 2456, 836, 470, 3975, 284, 530, 11241, 25, 773, 452, 12843, 13]   Shape: [n]
20. Process
Input text: “Many words don't map to one token: indivisible.”   Shape: []
Tokenization → [7085, 2456, 836, 470, 3975, 284, 530, 11241, 25, 773, 452, 12843, 13]   Shape: [n]
Embedding → [matrix of real-valued vectors, one d-dimensional column per token]   Shape: [d, n]
21. Process
Input text: “Many words don't map to one token: indivisible.”   Shape: []
Tokenization → [7085, 2456, 836, 470, 3975, 284, 530, 11241, 25, 773, 452, 12843, 13]   Shape: [n]
Embedding → [matrix of real-valued vectors, one d-dimensional column per token]   Shape: [d, n]
N Transformer layers → [another matrix of the same shape]   Shape: [d, n]
22. Process
Input text: “Many words don't map to one token: indivisible.”   Shape: []
Tokenization → [7085, 2456, 836, 470, 3975, 284, 530, 11241, 25, 773, 452, 12843, 13]   Shape: [n]
Embedding → [matrix of real-valued vectors, one d-dimensional column per token]   Shape: [d, n]
N Transformer layers → [another matrix of the same shape]   Shape: [d, n]
Loss function (predict next token given previous) → scalar   Shape: []
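Putting the steps above together, here is a toy, self-contained sketch of the pipeline (tokenize, embed, N layers, next-token loss). It is my own illustration, not the talk's code: a character-level tokenizer and one random layer stand in for the real components, and only the shapes [], [n], [d, n], [d, n], [] mirror the slides.

```python
import numpy as np

text = "Many words don't map to one token: indivisible."        # shape []
vocab = sorted(set(text))                                        # toy character-level vocabulary
ids = np.array([vocab.index(c) for c in text])                   # shape [n]

d = 16
E = np.random.randn(d, len(vocab))                               # embedding table, [d, vocab]
x = E[:, ids]                                                    # shape [d, n]

W = np.random.randn(d, d)
h = np.tanh(W @ x)                                               # stand-in for N Transformer layers, shape [d, n]

logits = E.T @ h                                                 # [vocab, n], a score for every possible next token
probs = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
loss = -np.log(probs[ids[1:], np.arange(len(ids) - 1)]).mean()   # predict token t+1 from position t, shape []
print(x.shape, h.shape, float(loss))
```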
23. Original sentence
24. Original sentence
Given “many”, predict the next token:
apple: 0.01
don: 0.001
…
intelligence: 0.00001
…
words: 0.02
25. Original sentence
Given “many”, predict the next token:
apple: 0.01
don: 0.001
…
intelligence: 0.00001
…
words: 0.02
Given “many words”, predict the next token:
apple: 0.00003
don: 0.03
…
intelligence: 0.00001
…
words: 0.0000001
26. Original sentence
Given “many”, predict the next token:
apple: 0.01
don: 0.001
…
intelligence: 0.00001
…
words: 0.02
Given “many words”, predict the next token:
apple: 0.00003
don: 0.03
…
intelligence: 0.00001
…
words: 0.0000001
The probability of a sentence is the product of these conditional probabilities. Maximize this.
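To make “maximize this” concrete, here is a tiny sketch (my own, with made-up probabilities) showing that maximizing the product of conditional next-token probabilities is equivalent to minimizing the summed negative log-probabilities, which is the loss actually used in training.

```python
import numpy as np

# Per-position conditional probabilities the model assigns to the actual next token,
# e.g. P("words" | "many"), P("don" | "many words"), ...  (made-up values)
p_next = np.array([0.02, 0.03, 0.4, 0.3, 0.25, 0.5, 0.1])

sentence_prob = p_next.prod()    # probability of the whole sentence (chain rule)
nll = -np.log(p_next).sum()      # negative log-likelihood, the quantity we minimize
print(sentence_prob, nll, np.isclose(np.exp(-nll), sentence_prob))  # last value: True
```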
27. Feed web-scale text data to Transformer
Sequence-to-sequence mapping with a bunch of matmuls
Input: [d_model, length]
Output: [d_model, length]
Web-scale text data
28. Somehow the model learns to perform many, many tasks when trained only with next-token prediction
Chowdhery et al. (2022)
29. Some observations on the next-token prediction task
We don’t directly teach any linguistic concepts (e.g. verb, subject,
whatever)
Simply by predicting next tokens over a large corpus, the model learns
languages
Language is learned almost as a by-product of doing this task
The model can do some “reasoning” (e.g. math, code)
30. Next token prediction as a massive implicit multitask learning
31. Next token prediction as a massive implicit multitask learning
This terrible movie was really boring
32. Next token prediction as a massive implicit multitask learning
This terrible movie was really boring
After the earnings call, the share price of Google went up by 5% from $1,000, ending at $1,050
33. Next token prediction as a massive implicit multitask learning
This terrible movie was really boring
After the earnings call, the share price of Google went up by 5% from $1,000, ending at $1,050
인공지능 연구원들은 코딩을 잘 못합니다. (Korean: “AI researchers are not good at coding.”)
34. Next token prediction as a massive implicit multitask learning
This terrible movie was really boring
After the earnings call, the share price of Google went up by 5% from $1,000, ending at $1,050
인공지능 연구원들은 코딩을 잘 못합니다. (Korean: “AI researchers are not good at coding.”)
The first law of thermodynamics is often called conservation of energy
35. Next token prediction as a massive implicit multitask learning
This terrible movie was really boring
After the earnings call, the share price of Google went up by 5% from $1,000, ending at $1,050
인공지능 연구원들은 코딩을 잘 못합니다. (Korean: “AI researchers are not good at coding.”)
The first law of thermodynamics is often called conservation of energy
BILLIONS of sentences
TRILLIONS of task types
36. Massive multitask learning hypothesis
Beyond some scale, the easiest way to do well on next-token
prediction is for the model to find a set of general skills that are
applicable to many tasks.
For example, these skills include learning languages, understanding, and
reasoning.
37. Crucially, we don’t directly teach any of these skills to the model. We
weakly incentivize the model, and the abilities emerge
Abilities that emerge are typically more general skill sets. In order for
abilities to emerge, they should be incentivized as opposed to being
directly taught
Weakly incentivizing the model requires a lot more compute; in exchange, it is a
more scalable teaching strategy
38. For a given dataset and a learning objective, there is an explicit learning signal and a set of induced
incentives
Next-token prediction with web-scale data
● explicit signal: predict next token
● induced incentive: understand language, reason, etc.
39. Example 2: Playing chess with {0, 1} reward at the end of the game
Explicit signal: win the game
Induced incentive: learn what moves are good
40. Example 3: Hallucinations
Reward structure for simple question answering scenario:
● 1 if the answer is correct and unhedged
● 0.5 if the answer is correct but hedged
● 0 if the answer is “I don’t know”
● -2 if the answer is hedged but wrong
● -4 if the answer is unhedged and wrong
Explicit signal: answer the question correctly
Induced incentive: know what you don’t know
Adapted from John Schulman’s talk
https://www.youtube.com/watch?v=hhiLw5Q_UFg
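To see how this reward structure induces “know what you don't know”, here is a small sketch (my own illustration, using the numbers above) that computes the expected reward of each answer type as a function of the model's confidence p that its answer is correct. With these particular rewards, answering outright only pays off when p is above roughly 0.8; below that, “I don't know” maximizes expected reward.

```python
def expected_rewards(p):
    """Expected reward of each answer type, given confidence p of being correct."""
    return {
        "unhedged answer": 1.0 * p + (-4.0) * (1 - p),
        "hedged answer":   0.5 * p + (-2.0) * (1 - p),
        "I don't know":    0.0,
    }

for p in [0.5, 0.7, 0.8, 0.9, 0.99]:
    rewards = expected_rewards(p)
    best = max(rewards, key=rewards.get)
    print(f"p={p:.2f}  best action: {best:>15}  {rewards}")
# Below p = 0.8 the best move is "I don't know"; above it, answering outright wins.
```

The exact threshold depends on the chosen reward values; the point is the induced incentive: the model is rewarded for calibrated abstention rather than confident guessing.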
41. Loose analogy
Give a man a fish, and you feed him for a day.
Teach a man to fish, and you feed him for a lifetime.
42. Loose analogy
Give a man a fish, and you feed him for a day.
Teach a man to fish, and you feed him for a lifetime.
Teach him the taste of fish and make him hungry
43. [Figure: time required to “give a man a fish”, “teach him how to fish”, and “teach him the taste of fish and make him hungry”]
44. [Figure: the same three approaches, comparing time required for humans and compute required for machines]
45. Small specialist models vs large generalist model
The belief that small specialist models can win on a narrow domain
assumes that there is a tradeoff between being a generalist and a
specialist
46. Specialist-generalist tradeoff doesn’t apply to machines
Such a tradeoff exists because all human beings operate with
the same time budget. Machines do not.
One model gets to enjoy a lot more compute than others
It is akin to someone having access to the “Room of Spirit and Time” from
Dragon Ball, where one year inside that room is a day outside
47. The importance of incentive structure is not new. Why now?
No amount of bananas can incentivize monkeys to do mathematical
reasoning
Threshold intelligence is necessary for the incentive structure to work
for a given problem
I think we have crossed that threshold for many tasks
48. Whether the induced incentive structure works depends on the model size
What abilities emerge depends on the model size
If the model is too small, it might just give up on learning high-
level skills such as reasoning and rely instead on heuristics-based pattern
recognition
49. Some abilities emerge with scale
Having the right perspective is crucial
50. Emergent Abilities of Large Language Models
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, et al.
51. Perspective of “yet”
52. Perspective of “yet”
This idea doesn’t work
53. Perspective of “yet”
This idea doesn’t work
This idea doesn’t work yet
54. Why is the perspective of “yet” not so obvious?
We are used to operating in an environment where underlying axioms
don’t change
You run an experiment for your new scientific idea. It doesn’t work now.
You know that it will still not work if you run it 3 years later
For language models, the most capable model serves as an “axiom” for
many research experiments run on top
55. Need for constant unlearning
Many ideas get outdated and invalidated at larger scale
We need to constantly unlearn intuitions built on such invalidated ideas
With less to unlearn, newcomers can have advantages over more
experienced ones. This is an interesting neutralizing force
56. Highly simplified view of emergent abilities
[Figure: three panels, Ability 1, Ability 2, and Ability 3 vs. scale, each with GPT-3 and GPT-4 marked on the scale axis]
57. Closing
Compute cost is decreasing exponentially
AI researchers should harness this by designing scalable methods
The current generation of LLMs relies on next-token prediction, which can be thought of as a
weak incentive structure for learning general skills such as reasoning
More generally, we should incentivize models instead of directly teaching specific skills
Emergent abilities necessitate having the right perspective, such as the perspective of “yet”, and constant unlearning
58. Thank you!
Twitter: @hwchung27
59. Don’t teach. Incentivize.
MIT EI seminar
Hyung Won Chung
OpenAI