Half-Quadratic Quantization of Large Machine Learning Models

Editor's note: We are republishing a blog post from the Mobius team, originally published in 2023, that introduced a now widely used quantization algorithm. We plan to continue this line of work by collaborating with the open source community on inference optimization and will be sharing more updates soon.


~ ~ ~


Large Language Models (LLMs) have revolutionized various subfields of machine learning like natural language processing, speech recognition and computer vision, enabling machines to understand and generate outputs with unprecedented accuracy and fluency. However, one of the most critical challenges in deploying LLMs is their expensive memory requirements, for both training and inference. Quantization methods such as bitsandbytes, GPTQ and AWQ have made it possible to use large models such as the popular Llama-2 with significantly less memory, enabling the machine learning community to conduct remarkable research using a single consumer-grade GPU.


In this article, we propose a new quantization technique called Half-Quadratic Quantization (HQQ). Our approach requires no calibration data and significantly speeds up the quantization of large models, while offering compression quality competitive with that of calibration-based methods. For instance, HQQ takes less than 5 minutes to process the colossal Llama-2-70B, which is over 50x faster than the widely adopted GPTQ. Our Llama-2-70B quantized to 2-bit outperforms the full-precision Llama-2-13B by a large margin at comparable memory usage.
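To make the notion of low-bit weight quantization concrete, here is a minimal sketch of plain round-to-nearest affine quantization with a scale and zero-point. This baseline is for illustration only and is not the HQQ algorithm itself, which additionally optimizes the quantization parameters without calibration data; all function names here are hypothetical.

```python
import numpy as np

def quantize_affine(w, nbits=2):
    """Round-to-nearest affine quantization of a weight tensor.

    Generic baseline for illustration; HQQ instead solves for the
    quantization parameters via a calibration-free optimization.
    """
    qmax = 2**nbits - 1
    w_min, w_max = float(w.min()), float(w.max())
    scale = (w_max - w_min) / qmax        # step size between quantized levels
    zero = -w_min / scale                 # zero-point in the quantized domain
    q = np.clip(np.round(w / scale + zero), 0, qmax).astype(np.uint8)
    return q, scale, zero

def dequantize_affine(q, scale, zero):
    """Map quantized integers back to approximate float weights."""
    return (q.astype(np.float32) - zero) * scale

# Example: quantize a random weight matrix to 2 bits and
# measure the reconstruction error.
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)
q, scale, zero = quantize_affine(w, nbits=2)
w_hat = dequantize_affine(q, scale, zero)
err = float(np.abs(w - w_hat).mean())
```

With round-to-nearest, the per-weight error is bounded by half the step size (`scale / 2`); the point of methods like HQQ is to shrink the effective error at very low bit widths, where this naive scheme degrades quality noticeably.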

