Here’s how OpenAI’s magical DALL-E image generator works

It seems like every few months someone publishes a machine learning paper or demo that makes my jaw drop. This month, it’s OpenAI’s new image-generating model, DALL · E.

This gigantic neural network with 12 billion parameters takes a text caption (e.g., “an armchair in the shape of an avocado”) and generates images to match it:


I find the images pretty inspiring (I would buy one of those avocado chairs), but what’s even more impressive is DALL · E’s ability to understand and render concepts of space, time, and even logic (more on that in a second).

In this post, I’ll give you a quick rundown of what DALL · E can do, how it works, how it fits in with the latest trends in ML, and why it’s important. Let’s go!

What is DALL · E and what can it do?

In July, OpenAI, the company that created DALL · E, released a similarly huge model called GPT-3 that wowed the world with its ability to generate human-like text, including op-eds, poems, sonnets, and even computer code. DALL · E is a natural extension of GPT-3 that takes text prompts and responds with images rather than words. In one example from the OpenAI blog, the model renders images from the prompt “A living room with two white armchairs and a painting of the Coliseum. The painting is mounted over a modern fireplace”:

DALL · E-generated images.

Pretty smart, isn’t it? You can probably already see how useful this could be for designers. Note that DALL · E can generate a large number of images from a single prompt. The images are then ranked by a second OpenAI model called CLIP, which tries to determine which pictures match the prompt best.
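
To make the generate-then-rank idea concrete, here is a minimal sketch in Python. It is not CLIP: it assumes that some model has already turned the prompt and each candidate image into embedding vectors (the numbers below are made up for illustration), and it simply ranks the candidates by cosine similarity to the prompt. The function names `cosine_similarity` and `rerank` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rerank(text_embedding, image_embeddings, top_k=2):
    """Return (index, score) pairs for the top_k images closest to the text."""
    scored = [
        (index, cosine_similarity(text_embedding, embedding))
        for index, embedding in enumerate(image_embeddings)
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Made-up embeddings standing in for the output of a real text/image encoder.
text_embedding = [1.0, 0.0, 1.0]
candidate_embeddings = [
    [0.9, 0.1, 1.1],    # candidate 0: points almost the same way as the prompt
    [-1.0, 0.5, 0.0],   # candidate 1: points away from the prompt
    [1.0, 1.0, 1.0],    # candidate 2: partial match
]

best = rerank(text_embedding, candidate_embeddings, top_k=2)
best_indices = [index for index, _ in best]
print(best_indices)
```

The design point is that the generator and the ranker are separate: the generator can afford to produce many imperfect candidates because a second model filters them afterwards.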

How was DALL · E built?

Unfortunately, we don’t have many details on this yet, as OpenAI has not yet published a full paper. At its core, however, DALL · E uses the same new neural network architecture that has driven recent advances in ML: the Transformer. Transformers, introduced in 2017, are a type of neural network that is easy to parallelize and can be scaled up and trained on huge datasets. They have been particularly revolutionary in natural language processing (they are the basis for models such as BERT, T5, GPT-3, and others), improving the quality of Google Search results, translation, and even the prediction of protein structures.
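
For readers curious why Transformers parallelize so well, here is a rough sketch of scaled dot-product attention, the core operation of the 2017 Transformer design. It is a toy in plain Python, not DALL · E’s actual code: a real Transformer adds learned projection matrices, multiple attention heads, and many stacked layers. The function names and the example vectors are made up for illustration.

```python
import math

def softmax(scores):
    """Turn a list of scores into positive weights that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    outputs = []
    # Each query is processed independently of the others, which is why
    # this loop can be run in parallel (in practice, as one matrix multiply).
    for query in queries:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in keys
        ]
        weights = softmax(scores)
        # The output is a weighted average of the value vectors.
        output = [
            sum(w * value[j] for w, value in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(output)
    return outputs

# Three made-up 2-dimensional token vectors attending to each other.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = attention(tokens, tokens, tokens)
print(attended)
```

Because no output row depends on another output row, the whole computation can be spread across many processors at once, unlike older sequence models that had to read their input one step at a time.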


Most of these large language models are trained on huge text datasets (like all of Wikipedia or crawls of the web). What makes DALL · E unique, however, is that it was trained on sequences that were a combination of words and pixels. We don’t yet know what the dataset was (it probably contained pictures and captions), but I can guarantee you it was probably huge.
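
Here is one way to picture a “sequence of words and pixels”. This is only a toy sketch: how DALL · E actually encodes text and images is not described above, so the word-level vocabulary, the 4-level pixel quantization, and the function names below are all assumptions made for illustration.

```python
def build_vocab(captions):
    """Assign each distinct word an integer id, in order of first appearance."""
    vocab = {}
    for caption in captions:
        for word in caption.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode_example(caption, pixels, vocab, pixel_levels=4):
    """Encode a caption and a flattened grayscale image as one token sequence.

    Hypothetical scheme: word ids come first, then pixel ids. Pixel ids are
    shifted by len(vocab) so the two kinds of token never share an id.
    """
    text_tokens = [vocab[word] for word in caption.lower().split()]
    offset = len(vocab)
    # Quantize each 0-255 pixel value into one of pixel_levels buckets.
    image_tokens = [offset + (p * pixel_levels) // 256 for p in pixels]
    return text_tokens + image_tokens

captions = ["an avocado armchair"]
vocab = build_vocab(captions)
pixels = [0, 64, 128, 255]  # a made-up 2x2 grayscale image, flattened
sequence = encode_example(captions[0], pixels, vocab)
print(sequence)  # [0, 1, 2, 3, 4, 5, 6]
```

Once text and image live in a single sequence like this, a language-model-style network can be trained to predict the next token, whether that token happens to be a word or a piece of a picture.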

How “smart” is DALL · E?

While these results are impressive, whenever we train a model on a huge dataset, the skeptical machine learning engineer is right to ask whether the results are only high quality because they were copied or memorized from the source material.

To prove that DALL · E isn’t just regurgitating images, the OpenAI authors forced it to render some pretty unusual prompts:

“A professional high quality illustration of a giraffe turtle chimera.”


“A snail made from a harp.”


It’s hard to imagine that the model encountered many giraffe-turtle hybrids in its training dataset, which makes the results more impressive.

Additionally, these weird prompts hint at something even more intriguing about DALL · E: its ability to perform “zero-shot visual reasoning”.

Zero-Shot Visual Reasoning

Typically in machine learning, we train models by giving them thousands or millions of examples of tasks to perform and hoping they’ll pick up on the pattern.

For example, to train a model that identifies dog breeds, we might show a neural network thousands of images of dogs labeled by breed and then test its ability to label new images of dogs. It’s a limited-scope task that seems almost quaint compared to OpenAI’s latest feats.
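
The train-on-labeled-examples, test-on-new-ones loop can be sketched in a few lines. To keep it tiny, this sketch swaps the neural network for a nearest-centroid classifier and swaps images for two made-up measurements per dog (height in cm, weight in kg). The data, the breeds, and the function names are all invented for illustration.

```python
def train(examples):
    """Average the feature vectors of each label. Returns {label: centroid}."""
    sums, counts = {}, {}
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
            counts[label] = 0
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in sums[label]]
        for label in sums
    }

def predict(centroids, features):
    """Return the label whose centroid is closest to the given features."""
    def squared_distance(centroid):
        return sum((c - f) ** 2 for c, f in zip(centroid, features))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))

# Made-up labeled training data: ([height_cm, weight_kg], breed).
training = [
    ([25.0, 6.0], "pug"),
    ([30.0, 8.0], "pug"),
    ([60.0, 30.0], "labrador"),
    ([58.0, 32.0], "labrador"),
]

model = train(training)
prediction = predict(model, [28.0, 7.0])  # a dog the model has never seen
print(prediction)  # pug
```

The key limitation is visible in the code: the model can only ever answer with a label it saw during training, which is exactly the narrowness that the next idea moves beyond.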

Zero-shot learning, on the other hand, is the ability of models to perform tasks they weren’t specifically trained to do. For example, DALL · E was trained to generate images from captions. However, with the right prompt, it can also transform images into sketches:

Results of the prompt “The exact same cat on the top as a sketch on the bottom”.

DALL · E can also render custom text on street signs:

Results from the prompt “A shop front with the word ‘openai’ written on it”.

This allows DALL · E to behave almost like a Photoshop filter, although it is not specifically designed for that behavior.

The model even shows an “understanding” of visual concepts (e.g., “macroscopic” or “cross-section” images), places (e.g., “a photo of the food from China”), and time (“a photo of Alamo Square, San Francisco, from a street at night”; “a photo of a telephone from the 1920s”). For example, here is what it produced in response to the prompt “a photo of the food from China”:

“A photo of the food from China.”

In other words, DALL · E can do more than just paint a pretty picture for a caption. In a sense, it can also answer questions visually.

To test DALL · E’s visual reasoning ability, the authors had it take a visual IQ test. In the examples below, the model had to complete the lower right corner of the grid, following the test’s hidden pattern.

A screenshot of the visual IQ test OpenAI used to test DALL · E.

“DALL · E is often able to solve matrices that involve continuing simple patterns or basic geometric reasoning,” the authors write, but it did better on some problems than on others. When the colors of the puzzles were inverted, DALL · E did worse, “suggesting its capabilities may be brittle in unexpected ways.”

What does that mean?

What strikes me most about DALL · E is its ability to perform surprisingly well on so many different tasks, including ones the authors didn’t even anticipate:

“We find that DALL · E […] is able to perform several kinds of image-to-image translation tasks when prompted in the right way.

We did not anticipate that this capability would emerge, and made no modifications to the neural network or training procedure to encourage it.”

It’s amazing, but not entirely unexpected. DALL · E and GPT-3 are two examples of a larger theme in deep learning: exceptionally large neural networks trained on unlabeled internet data (an example of “self-supervised learning”) can be highly versatile, able to do many things they weren’t specifically designed for.

Of course, don’t confuse this with general intelligence. It’s not hard to make these types of models look pretty dumb. We will know more when they are openly available and we can start playing around with them. But that doesn’t mean I can’t get excited in the meantime.

This article was written by Dale Markowitz, an applied AI engineer at Google in Austin, Texas, where she is working on applying machine learning to new areas and industries. She also likes solving her own life problems with AI and talks about it on YouTube.

Published on January 10, 2021 – 11:00 UTC
