Transformer Math 101
A lot of basic, important information about transformer language models can be computed quite simply. Unfortunately, the equations for this are not widely known in the NLP community. The purpose of this document is to collect these equations along with related knowledge about where they come from and why they matter.
Note: This post is primarily concerned with training costs, which are dominated by VRAM considerations. For an analogous discussion of inference costs with a focus on latency, check out this excellent blog post by Kipply.
Compute Requirements
The basic equation giving the cost to train a transformer model is:

$$C \approx \tau T = 6PD$$
where:
- $C$ is the compute required to train the transformer model, in total floating point operations
- $\tau$ is the aggregate throughput of your hardware setup ($\tau = (\text{No. GPUs}) \times (\text{Actual FLOPs}/\text{GPU})$), in FLOPs
- $T$ is the time spent training the model, in seconds
- $P$ is the number of parameters in the transformer model
- $D$ is the dataset size, in tokens
These equations are proposed and experimentally validated in OpenAI’s scaling laws paper and DeepMind’s scaling laws paper. Please see each paper for more information.
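As a concrete illustration, here is a minimal sketch of how these quantities plug into $C \approx \tau T = 6PD$. The model size, token count, and cluster throughput below are hypothetical example values, not numbers from the scaling laws papers.

```python
# Minimal sketch of C = 6 * P * D and the implied training time T = C / tau.
# All of the specific numbers below are hypothetical, for illustration only.

P = 20e9             # number of parameters
D = 400e9            # dataset size, in tokens
tau = 512 * 150e12   # aggregate throughput: 512 GPUs * 150 TFLOPs actual per GPU

C = 6 * P * D                # total training compute, in FLOPs
T_seconds = C / tau          # training time implied by C = tau * T
T_days = T_seconds / 86400

print(f"Compute budget C: {C:.3e} FLOPs")
print(f"Estimated training time: {T_days:.1f} days")
```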
It’s worth taking an aside and discussing the units of $C$. $C$ is a measure of total compute, but can be measured in many units such as:
- FLOP-seconds, which is in units of $[\text{FLOPs}] \times [\text{Seconds}]$
- GPU-hours, which is in units of $[\text{No. GPUs}] \times [\text{Hours}]$
- Scaling laws papers tend to report values in PetaFLOP-days, or $10^{15} \times 24 \times 3600 = 8.64 \times 10^{19}$ total floating point operations
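To make the unit bookkeeping concrete, the short sketch below expresses a total compute budget in PetaFLOP-days and GPU-hours. The per-GPU throughput it assumes is an illustrative number, not a value from this post.

```python
# Sketch: express a (hypothetical) total compute budget C in the units above.
# Assumes each GPU sustains 150 TFLOPs of actual throughput (illustrative only).

C = 4.8e22                      # total floating point operations
actual_flops_per_gpu = 150e12   # assumed sustained FLOPs per GPU

petaflop_days = C / (1e15 * 86400)       # 1 PetaFLOP-day = 8.64e19 operations
gpu_seconds = C / actual_flops_per_gpu   # [No. GPUs] * [Seconds]
gpu_hours = gpu_seconds / 3600           # [No. GPUs] * [Hours]

print(f"{petaflop_days:.1f} PetaFLOP-days")
print(f"{gpu_hours:.3e} GPU-hours")
```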
One useful distinction to keep in mind is the concept of Actual FLOPs. While GPU accelerator whitepapers usually advertise their theoretical FLOPs, these are never met in practice (especially in a distributed setting).
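One common way to quantify this gap is to compute the fraction of theoretical peak that your setup actually sustains. The sketch below shows that ratio; both throughput numbers are assumptions for illustration, not measurements from this post.

```python
# Sketch: hardware FLOPs utilization = actual (measured) throughput / theoretical peak.
# Both numbers below are hypothetical, for illustration only.

theoretical_peak_per_gpu = 312e12  # e.g. an advertised per-GPU peak, in FLOPs
measured_flops_per_gpu = 150e12    # actual throughput achieved in a training run

utilization = measured_flops_per_gpu / theoretical_peak_per_gpu
print(f"Hardware FLOPs utilization: {utilization:.1%}")  # ~48% in this example
```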