Google AI Unveils Muse, a New Text-To-Image Transformer Model

Google AI released a research paper about Muse, a text-to-image generation model based on masked generative transformers. According to the paper, Muse produces images of a quality comparable to those from rival models such as DALL-E 2 and Imagen, and does so far faster.

Muse is trained to predict randomly masked image tokens, conditioned on the text embedding from a pre-trained large language model. The training task is masked modeling in discrete token space. Rather than relying on pixel-space diffusion or autoregressive generation, Muse uses a 900 million parameter masked generative transformer to create images.
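To make the training objective more concrete, here is a minimal sketch of the random masking step in plain Python. It is not Google's implementation: the `MASK_ID` value, the codebook size of 8192, the 16-token sequence, and the 50% masking ratio are all assumptions chosen for illustration, and the transformer that would predict the hidden tokens from the text embedding is left out entirely.

```python
import random

MASK_ID = -1  # hypothetical id standing in for the special [MASK] token


def mask_tokens(tokens, mask_ratio, rng):
    """Hide a random subset of image tokens.

    Returns the masked input sequence and a dict mapping each masked
    position to the original token the model would be trained to predict.
    """
    n = len(tokens)
    num_masked = max(1, round(mask_ratio * n))
    masked_positions = set(rng.sample(range(n), num_masked))
    inputs = [MASK_ID if i in masked_positions else t
              for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in masked_positions}
    return inputs, targets


rng = random.Random(0)
# Toy "image": 16 discrete tokens drawn from an assumed codebook of 8192 entries.
image_tokens = [rng.randrange(8192) for _ in range(16)]
inputs, targets = mask_tokens(image_tokens, mask_ratio=0.5, rng=rng)
```

In a real training loop, the model would receive `inputs` together with the text embedding and be penalized for wrong predictions only at the positions listed in `targets`.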

Google claims that on a TPUv4 chip, Muse can create a 256 by 256 image in as little as 0.5 seconds, compared with 9.1 seconds for Imagen, the company's diffusion model, which Google says offers an "unprecedented degree of photorealism" and a "deep level of language understanding." TPUs, or Tensor Processing Units, are custom chips developed by Google as dedicated AI accelerators.

According to the research, Google AI has trained a series of Muse models with varying sizes, ranging from 632 million to 3 billion parameters, finding that conditioning on a pre-trained large language model is crucial for generating photorealistic, high-quality images.

Muse also outperforms Parti, a state-of-the-art autoregressive model, because it uses parallel decoding. In tests on equivalent hardware, Muse is more than 10 times faster at inference time than the Imagen-3B or Parti-3B models and three times faster than Stable Diffusion v1.4.
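The speed advantage of parallel decoding comes from filling in many tokens per model call instead of one. The sketch below illustrates the general idea with a toy stand-in for the transformer; it is an assumption-laden illustration, not Muse's actual sampler. The `toy_predict` function, the cosine unmasking schedule, the 256-token sequence, and the 8 decoding steps are all hypothetical choices.

```python
import math
import random

MASK_ID = -1  # hypothetical id standing in for the special [MASK] token


def toy_predict(tokens, rng, vocab_size=8192):
    """Stand-in for the transformer: propose a (token, confidence) pair
    for every position that is still masked."""
    return {i: (rng.randrange(vocab_size), rng.random())
            for i, t in enumerate(tokens) if t == MASK_ID}


def parallel_decode(num_tokens, num_steps, rng):
    """Start fully masked; at each step, commit the most confident
    proposals and leave the rest masked for later steps."""
    tokens = [MASK_ID] * num_tokens
    for step in range(1, num_steps + 1):
        proposals = toy_predict(tokens, rng)  # one "model call" per step
        if not proposals:
            break
        # Assumed cosine schedule: fraction of tokens left masked after this step.
        frac_masked = math.cos(math.pi / 2 * step / num_steps)
        keep_masked = int(frac_masked * num_tokens)
        num_to_fill = max(1, len(proposals) - keep_masked)
        most_confident = sorted(proposals, key=lambda i: proposals[i][1],
                                reverse=True)[:num_to_fill]
        for i in most_confident:
            tokens[i] = proposals[i][0]
    return tokens


decoded = parallel_decode(num_tokens=256, num_steps=8, rng=random.Random(0))
```

Here all 256 tokens are produced in 8 model calls, whereas an autoregressive decoder that emits one token per call would need 256.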

Muse creates visuals that correspond to the different parts of speech found in the input captions, such as nouns, verbs, and adjectives. It also shows an understanding of visual style and of multi-object properties such as compositionality and cardinality.

Generative image models have come a long way in recent years, thanks to novel training methods and improved deep learning architectures. These models have the ability to generate highly detailed and realistic images, and they're becoming increasingly powerful tools for a wide range of industries and applications.

About the Author

Daniel Dominguez

Daniel has worked in the software industry for over 14 years, gaining experience in product development for companies ranging from Silicon Valley startups to Fortune 500 firms. As a seasoned engineer, he is passionate about using cloud computing to deliver innovative software solutions. In addition to software product management, he is also involved in artificial intelligence and machine learning.
