## Transformer Math 101

A lot of basic, important information about transformer language models can be computed quite simply. Unfortunately, the equations for this are not widely known in the NLP community. The purpose of this document is to collect these equations along with related knowledge about where they come from and why they matter.

Note: This post is primarily concerned with training costs, which are dominated by VRAM considerations. For an analogous discussion of inference costs with a focus on latency, check out this excellent blog post by Kipply.

## Compute Requirements

The basic equation giving the cost to train a transformer model is given by:

$$C \approx \tau T = 6PD$$

where:

• $C$ is the compute required to train the transformer model, in total floating point operations


• $\tau$ is the aggregate throughput of your hardware setup ($\tau = (\text{No. GPUs}) \times (\text{Actual FLOPs}/\text{GPU})$), in FLOP/s


• $T$ is the time spent training the model, in seconds


• $P$ is the number of parameters in the transformer model


• $D$ is the dataset size, in tokens


These equations are proposed and experimentally validated in OpenAI’s scaling laws paper and DeepMind’s scaling laws paper. Please see each paper for more information.
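As a quick sanity check on the formula, here is a minimal sketch of plugging in numbers. All model and hardware figures below are illustrative assumptions for the sake of the example, not values from this post:

```python
# Worked example of C ≈ τT = 6PD with illustrative (hypothetical) numbers.

P = 70e9    # number of parameters (e.g. a 70B-parameter model; assumption)
D = 1.4e12  # dataset size, in tokens (assumption)
C = 6 * P * D  # total training compute, in floating point operations

# Aggregate throughput τ = (No. GPUs) × (Actual FLOPs/GPU).
n_gpus = 512
actual_flops_per_gpu = 150e12  # sustained (not theoretical) FLOP/s per GPU; assumption
tau = n_gpus * actual_flops_per_gpu

# Rearranging C = τT gives the expected training time.
T_seconds = C / tau
print(f"C = {C:.3e} FLOPs")
print(f"T = {T_seconds / 86400:.1f} days")
```

Note that the throughput figure here is a *sustained* rate, which (as discussed below) is well under the theoretical peak advertised for the hardware.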

It’s worth taking an aside and discussing the units of $C$. $C$ is a measure of total compute, but can be measured by many units such as:

• FLOP-seconds, which is in units of $[\text{Floating Point Operations}/\text{second}] \times [\text{seconds}]$
• GPU-hours, which is in units of $[\text{No. GPUs}] \times [\text{hours}]$
• Scaling laws papers tend to report values in PetaFLOP-days, or $8.64 \times 10^{19}$ total floating point operations
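Converting between these units is simple arithmetic. The following sketch (reusing the illustrative compute budget from above) expresses a total compute figure in PetaFLOP-days:

```python
# 1 PetaFLOP-day = 1e15 FLOP/s sustained for one day = 8.64e19 floating point operations.
PFLOP_DAY = 1e15 * 86400

# Example compute budget from C = 6PD (illustrative values, not from this post).
C = 6 * 70e9 * 1.4e12

print(C / PFLOP_DAY)  # the same compute budget, expressed in PetaFLOP-days
```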

One useful distinction to keep in mind is the concept of Actual FLOPs. While GPU accelerator whitepapers usually advertise their theoretical FLOPs, these are never met in practice (es...