GPT-3 already looks old compared with what AI has shown us this year. Since the transformer architecture came out in 2017, it has seen wild success in diverse tasks, from language to vision. GPT-3 revolutionized the field last year, and multiple breakthrough models have been presented since. Countries and companies are immersed in a race to build better and better models.
The premise is that bigger models, bigger datasets, and more computational power comprise the AI-dominance trinity. Even though this paradigm has important detractors, its success is simply undeniable.
In this article, I’ll review the 5 most important transformer-based models from 2021. I’ll open the list with GPT-3 because of its immense significance, and then continue in chronological order — the last one was published just two weeks ago!
GPT-3 — The AI rockstar
OpenAI presented GPT-3 in May 2020 in a paper titled Language Models are Few-Shot Learners. In July 2020, the company released a beta API for developers to play with, and the model became an AI rockstar overnight.
GPT-3 is the third version of a family of Generative Pre-trained Transformer language models. Its main features are its multitasking and meta-learning abilities. Trained in an unsupervised way on 570GB of Internet text data, it can learn tasks it hasn’t been trained on after seeing just a few examples (few-shot). It can also work in zero- and one-shot settings, but performance is usually worse.
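To make the few-shot setting concrete, here’s a minimal sketch of a GPT-3 call through the beta API as it worked at the time. The translation prompt is borrowed from the paper; the engine name and sampling parameters are illustrative choices, not the only ones:

```python
import openai  # the beta client: pip install openai

openai.api_key = "YOUR_API_KEY"  # placeholder: granted through the beta program

# Few-shot: the prompt shows the task with two examples, then GPT-3 continues.
prompt = (
    "Translate English to French.\n"
    "sea otter => loutre de mer\n"
    "cheese => fromage\n"
    "peppermint =>"
)

response = openai.Completion.create(
    engine="davinci",   # the largest GPT-3 engine available in the beta
    prompt=prompt,
    max_tokens=10,
    temperature=0,      # always pick the most likely tokens
    stop="\n",          # cut the completion at the end of the answer
)
print(response.choices[0].text.strip())  # expected: menthe poivrée
```

Note that the model was never fine-tuned on translation; the two examples in the prompt are all the “training” it gets.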
GPT-3 has demonstrated crazy language generation abilities. It can hold conversations (impersonating famous figures, living or dead), write poetry, songs, fiction, and essays. It can write code, music sheets, and LaTeX-formatted equations. It shows a modest level of reasoning, logic, and common sense. And it can ponder the future, the meaning of life, and itself.
Apart from this, GPT-3 showed great performance on standardized benchmarks, achieving SOTA on some of them. It shines the most on generative tasks, such as writing news articles. For this task, it reached human levels, fooling judges who tried to separate its articles from human-written ones.
Switch Transformer — The trillion-parameter pioneer
In January 2021, Google published the paper Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. They presented the Switch Transformer, a new neural net whose goal was to facilitate the creation of larger models without increasing computational costs.
The feature that distinguishes this model from previous ones is a simplification of the Mixture of Experts algorithm. Mixture of Experts (MoE) is a system in which each token (an elemental part of the input) entering the model is routed to be processed by a different part of the neural net (an expert). Thus, to process a given token, only a subsection of the model is active; we have a sparse model. This reduces computational costs, allowing models to reach the trillion-parameter mark.
With the original MoE, each token was sent to at least two experts so their outputs could be compared. With the Switch Transformer, Google simplified the routing process so that each token is sent to only one expert. This further reduces computational and communication costs. Google showed that a large Switch Transformer would outperform a large dense model (such as GPT-3, although they didn’t compare the two). This is a huge milestone in reducing the carbon footprint of large pre-trained models, which are state-of-the-art in language (and now also vision) tasks.
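To see what routing to a single expert looks like, here’s a minimal PyTorch sketch of a switch layer. All names are my own, and I omit two pieces the paper adds on top: expert capacity limits and a load-balancing auxiliary loss.

```python
import torch
import torch.nn as nn

class SwitchLayer(nn.Module):
    """Top-1 (switch) routing: each token is processed by exactly one expert."""

    def __init__(self, d_model: int, n_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)  # the router
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.ReLU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        ])

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (n_tokens, d_model)
        probs = self.gate(tokens).softmax(dim=-1)  # routing probabilities
        gate_val, expert_idx = probs.max(dim=-1)   # pick ONE expert per token
        out = torch.zeros_like(tokens)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                # scale by the gate probability so the router receives gradients
                out[mask] = gate_val[mask, None] * expert(tokens[mask])
        return out

layer = SwitchLayer(d_model=64, n_experts=4)
print(layer(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```

Multiplying each expert’s output by its gate probability is what keeps the routing decision differentiable, which is how the router itself gets trained.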
DALL·E — The creative artist
OpenAI presented DALL·E in February 2021 in a paper titled Zero-Shot Text-to-Image Generation. The system, named after Spanish painter Salvador Dalí and Pixar’s cute robot WALL·E, is a smaller version of GPT-3 (12 billion parameters), trained specifically on text-image pairs. In the words of OpenAI’s researchers: “Manipulating visual concepts through language is now within reach.”
DALL·E explores the possibilities of image generation using the “compositional structure of language.” It combines the meaning of a written sentence with the potential visual representations it may have. Still, like GPT-3, it’s highly dependent on the wording of the sentence to avoid mistakes in the images. Its strength lies in its zero-shot capabilities: it can perform generation tasks it hasn’t been trained on, without the need for examples.
Among other capabilities, it can generate images from scratch given a written prompt, regenerate hidden parts of images, control attributes of objects, or integrate several objects in a single image. Even more impressive, DALL·E can combine concepts at high levels of abstraction (when told “a snail made of harp,” it often draws the snail with a harp as its shell) and translate image-to-image (when told “the exact same cat on the top as a sketch on the bottom,” it draws a sketch similar to the cat in the original picture).
DALL·E shows a rudimentary form of artistry. From the loosely interpretable descriptions of written language, it creates a visual reality. We may be closer to an AI version of “a picture is worth a thousand words” than ever.
LaMDA — The next generation of chatbots
Google presented LaMDA at its annual I/O conference in May 2021. LaMDA is expected to revolutionize chatbot technology with its amazing conversational skills. There’s no paper or API yet, so we’ll have to wait to see results.
LaMDA, which stands for Language Model for Dialogue Applications, is the successor of Meena, another Google AI presented in 2020. LaMDA was trained on dialogue and optimized to minimize perplexity, a measure of how confident a model is in predicting the next token. Perplexity correlates highly with human evaluations of conversational skill.
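For intuition, perplexity is the exponential of the average negative log-likelihood the model assigns to the tokens it predicts: the lower the perplexity, the less “surprised” the model is by the conversation. A tiny Python sketch with made-up log-probabilities:

```python
import math

def perplexity(token_log_probs):
    """exp of the average negative log-likelihood per token (lower is better)."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# Hypothetical log-probabilities a model assigns to each token of a reply
log_probs = [-0.9, -2.3, -0.4, -1.2]
print(round(perplexity(log_probs), 2))  # 3.32: as if choosing among ~3 tokens
```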
LaMDA stands out as a sensible, specific, interesting, and factual chatbot. In contrast with previous chatbots, it can navigate the open-ended nature of conversations while keeping its responses sensible. It keeps them specific, avoiding always-valid responses such as “I don’t know.” It can produce “insightful and unexpected” responses that keep the conversation interesting. And it gives correct answers when factual knowledge is involved.
MUM — The brain of the search engine
Together with LaMDA, Google presented MUM, a system meant to revolutionize the search engine much as BERT did a couple of years ago, but with greater impact. As with LaMDA, there’s no further information apart from Google’s demo and blog post, so we’ll have to wait for more.
MUM stands for Multitask Unified Model. It’s a multitasking and multimodal language model 1,000x more powerful than BERT, its predecessor. It has been trained on 75 languages and many tasks, which gives it a better grasp of the world. However, its multimodal capabilities are what make MUM stronger than previous models. It can tackle text+image information and tasks, which gives it a versatility neither GPT-3 nor LaMDA has.
MUM is capable of tackling complex search queries such as “You’ve hiked Mt. Adams. Now you want to hike Mt. Fuji next fall, and you want to know what to do differently to prepare.” With today’s search engine, a precise and sensible answer would take a bunch of searches and a compilation of information. MUM can do it for you and give you a curated answer. Even more striking, because it’s multimodal, “eventually, you might be able to take a photo of your hiking boots and ask, ‘can I use these to hike Mt. Fuji?’”
Wu Dao 2.0 — The largest neural network
On the 1st of June, the Beijing Academy of Artificial Intelligence (BAAI) presented Wu Dao 2.0 (translated as Enlightenment) at its annual conference. This amazing AI now holds the title of largest neural network, which a year ago belonged to GPT-3. Wu Dao 2.0 has a striking 1.75 trillion parameters, 10x GPT-3’s 175 billion.
Wu Dao 2.0 was trained on 4.9TB of high-quality text and image data. In comparison, GPT-3 was trained on 570GB of text data, almost 10 times less. Wu Dao 2.0 follows the multimodal trend and is able to perform text+image tasks. To train it, researchers invented FastMoE, a successor of Google’s MoE, which is “simple to use, flexible, high-performance, and supports large-scale parallel training.” We’ll probably see other versions of MoE in future models.
Its multimodal nature allows Wu Dao 2.0 to manage a wide set of tasks. It’s able to process and generate text, recognize and generate images, and handle mixed tasks such as captioning images and creating images from textual descriptions. It can also predict the 3D structures of proteins, like DeepMind’s AlphaFold. It even powers a virtual student that can learn continuously; she can write poetry and draw pictures and will learn to code in the future.
Wu Dao 2.0 achieved SOTA levels on some standard language and vision benchmarks, such as LAMBADA, SuperGLUE, MS COCO, and Multi30K, surpassing GPT-3, DALL·E, CLIP, and CL². These amazing achievements make Wu Dao 2.0 the most powerful, versatile AI today. Yet, it’s only a matter of time before another, bigger AI appears on the horizon. Keep your eyes open!