Building Pinterest Canvas, a text-to-image foundation model

Eric Tzeng; ML Research Scientist, ATG | Raymond Shiau; ML Research Scientist, ATG


In this engineering note, we want to share some of our latest progress on Pinterest Canvas, a text-to-image foundation model for enhancing existing images and products on the platform. Building image foundation models has been a core part of Pinterest’s ML strategy for the past decade, but these efforts have focused on representation learning tasks (e.g., our Unified Visual Embedding v2, v3, etc.). More recently, we have begun to explore generative models, specifically those that can be conditioned on existing Pinterest images to create new backgrounds for products.


Pinterest Canvas is built as a text-to-image model that can support arbitrary conditioning information in the form of product masks and conditioning images for stylistic guidance. In this post, we will first discuss the training of the base text-to-image model, then the fine-tuning process to generate photorealistic backgrounds conditioned on masks, and finally an in-context learning process for conditioning on image styles.


Because we’re primarily interested in image generation as a way of visualizing existing products in new contexts, rather than generating completely new content from scratch, we don’t have a direct product use for the typical image generation model that takes a text caption and tries to generate an image based on that caption. Nevertheless, this text-to-image task ends up being a useful way to teach a model about the visual world, so that it can then learn how to generate cohesive and compelling objects and scenes.
