How to run gpt-oss with vLLM

vLLM is an open-source, high-throughput inference engine designed to efficiently serve large language models (LLMs) by optimizing memory usage and processing speed. This guide will walk you through how to use vLLM to set up gpt-oss-20b or gpt-oss-120b on a server to serve gpt-oss as an API for your applications, and even connect it to the Agents SDK.

Note that this guide is meant for server applications with dedicated GPUs like NVIDIA’s H100s. For local inference on consumer GPUs, check out our Ollama guide.

Pick your model

vLLM supports both model sizes of gpt-oss:

- openai/gpt-oss-20b
- openai/gpt-oss-120b

Both models are MXFP4 quantized out of the box.

Quick Setup

  1. Install vLLM
    vLLM recommends using uv to manage your Python environment, which helps pick the right build for your setup. Learn more in their quickstart. To create a new virtual environment and install vLLM, run:
```shell
uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install --pre vllm==0.10.1+gptoss \
  --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
  --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
  --index-strategy unsafe-best-match
```
  2. Start up a server and download the model
    vLLM provides a serve command that will automatically download the model from Hugging Face and spin up an OpenAI-compatible server on localhost:8000. Run one of the following commands in a terminal session on your server, depending on your desired model size.
```shell
# For 20B
vllm serve openai/gpt-oss-20b

# For 120B
vllm serve openai/gpt-oss-120b
```
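Once the serve command reports it is listening, you can sanity-check the server from another terminal. This is a minimal sketch using only the Python standard library; it hits the standard OpenAI-compatible /v1/models listing endpoint at vLLM's default address from the step above:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # vLLM's default host and port

def models_endpoint(base_url: str) -> str:
    """Build the URL of the OpenAI-compatible model-listing endpoint."""
    return base_url.rstrip("/") + "/models"

def list_models(base_url: str = BASE_URL) -> list:
    """Ask the running server which model ids it is serving."""
    with urllib.request.urlopen(models_endpoint(base_url)) as resp:
        payload = json.load(resp)
    return [m["id"] for m in payload.get("data", [])]

# With the 20B server running, list_models() should include
# "openai/gpt-oss-20b" (vLLM serves the model under the name you passed).
```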

Use the API

vLLM exposes an OpenAI-compatible API on localhost:8000, so you can point any OpenAI client at it by swapping in that base URL.
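Because the server speaks the OpenAI wire format, a Chat Completions call is just an HTTP POST. Below is a stdlib-only sketch; the model id matches the 20B serve command above, and `max_tokens=256` is an illustrative choice, not a vLLM default:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # where `vllm serve` listens by default

def build_chat_request(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Assemble a Chat Completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(model: str, prompt: str) -> str:
    """POST the request and return the assistant's reply text."""
    body = json.dumps(build_chat_request(model, prompt)).encode()
    req = urllib.request.Request(
        BASE_URL + "/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]

# With the server running:
# chat("openai/gpt-oss-20b", "Explain MXFP4 quantization in one sentence.")
```

The official OpenAI Python SDK works the same way: construct the client with `base_url="http://localhost:8000/v1"` and any placeholder API key.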
