What is DALL-E? And What Does it Mean for AI?

Written by Andres Garzon
Technology

What happens when you cross surrealist Spanish master Salvador Dalí with Pixar’s lovable robot WALL-E?

You get OpenAI’s newest artificial intelligence model. At least, that’s where the name comes from: DALL-E. (Get it?)

After last year's much-hyped and deeply discussed release of GPT-3 -- OpenAI's neural-network-powered language model, built to find patterns in data and use those patterns to complete written prompts, drawing on 175 billion parameters -- this release perhaps should not come as so great a surprise. But it is pretty impressive.

So what makes it different? And why does it matter?

What does DALL-E do?

First, let’s start with the basics. Ask OpenAI, the research and development company co-founded by Elon Musk in 2015 (and now closely partnered with Microsoft), and they’ll tell you this:

“DALL-E is a 12-billion parameter version of GPT-3 trained to generate images from text descriptions, using a dataset of text-image pairs.”

This means it’s built on the GPT-3 engine -- which finds patterns in data and uses those patterns to complete text prompts -- but instead of producing (eerily human) written responses, DALL-E turns those prompts into auto-generated images that combine visual interpretations of each set of keywords.
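(For the curious: the research model described in this article was never released for direct use, but OpenAI later exposed the same text-to-image idea through its hosted Images API. Here's a minimal sketch of that prompt-in, image-out flow, assuming the official openai Python client and an API key in your environment -- a later, productized cousin of the model discussed here, not the original DALL-E itself.)

```python
# Minimal sketch: send a text prompt, get back a generated image.
# Assumes the `openai` Python client (v1+) and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.images.generate(
    model="dall-e-2",                              # hosted successor to the model discussed here
    prompt="an armchair in the shape of an avocado",
    n=1,                                           # number of candidate images to return
    size="512x512",
)

print(response.data[0].url)                        # URL of the generated image
```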

For example, if you ask DALL-E to create an image of “an armchair in the shape of an avocado,” it produces this:

[Image: DALL-E's renderings of "an armchair in the shape of an avocado"]

Pretty snazzy, and certainly interesting -- oddly accurate, even. But is it groundbreaking technology?

Let’s look deeper.

If you ask it for “a living room with two white armchairs and a painting of the colosseum” and even specify that “the painting is mounted above a modern fireplace,” you get this:

[Image: DALL-E's renderings of the living room prompt, painting mounted above a modern fireplace]

Pretty cool, right?

But besides being a cute parlor trick -- with clear implications for, say, helping designers of all kinds generate quicker, more accurate, and more dynamic first drafts of how they might lay out your new Riviera summer home -- what implications do DALL-E and the tech behind it have for the development of AI and machine learning in general?

Why does DALL-E matter?

Once again, with the debut of this machine learning wunderkind, there are those who fear such a development could mean the “end of creativity.” The worry usually goes like this: If a computer can create original images, then what use could humanity have for artists, graphic designers, illustrators, or the like?

While the basis (or baselessness) of this fear will only be borne out over time -- with the application of DALL-E in the world, and the myriad uses and advances we'll surely see over the next few years -- in its current state DALL-E is an enhancement of the way we understand machine learning, not a replacement for human creativity.

For example, DALL-E still requires very specific language to render complex images, and it still does so in a way that may not be the preferred outcome for every aesthetic, project, or need.

So while DALL-E does produce 512 images for each prompt, and a second model (also developed by OpenAI, called CLIP) automatically narrows these down to the 32 “best” results, the process that decides what “best” means still needs to be refined and developed. And even after it has been, the 32 “best” avocado sofas may just not be the avocado sofa you desire in your avocado-themed reading lounge.
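To make that "best of 512" idea concrete, here's a rough sketch of the generate-then-rerank pattern, using the open-source CLIP checkpoint published on Hugging Face to score candidates against the prompt. The `generate_candidates` function is a hypothetical stand-in, since the original DALL-E isn't publicly runnable; only the CLIP scoring step is real.

```python
# Sketch of CLIP-style reranking: score candidate images against the prompt
# and keep the top k, the way OpenAI narrows 512 samples down to the 32 "best".
# Assumes: pip install torch pillow transformers
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rerank(prompt: str, images: list, top_k: int = 32) -> list:
    """Return the top_k images (PIL.Image objects) whose CLIP embeddings best match the prompt."""
    inputs = processor(text=[prompt], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image.squeeze(1)  # one similarity score per image
    best = scores.topk(min(top_k, len(images))).indices
    return [images[i] for i in best]

# generate_candidates() is hypothetical -- DALL-E itself can't be run locally.
# candidates = generate_candidates("an armchair in the shape of an avocado", n=512)
# finalists = rerank("an armchair in the shape of an avocado", candidates, top_k=32)
```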

But in the short term, there are many applications to get excited about for DALL-E and its inevitable iterations. Product images on e-commerce sites like Amazon and eBay could become easier -- and cheaper -- to generate. Image prototyping could be simpler and more efficient (and less expensive), allowing designers to include dynamic imagery long before technical design. And real estate sites could generate models of unbuilt or redesigned homes, letting customers dynamically render images of their potential purchases tailored to their specifications. How fun!

Rather than creating a “creative armageddon,” these are serviceable implications that could be...very helpful. In an “incremental progress” sort of way. Not so bad, DALL-E. Certainly, many other applications are already being brainstormed, dreamed up, and created by those in a variety of visual and design fields.

Is there anything else going on with this “big news”?

What does DALL-E mean for machine learning?

When GPT-3 was rolled out last summer, the greatest excitement surrounding its breakthrough tech wasn’t that it could be used to write a poem or an op-ed -- though these uses did create a lot of buzz (and hand-wringing). The greatest excitement was what the raw technology implied for what could be built by machines that are now beginning to learn how to learn.

The advances represented by DALL-E, while perhaps less dramatic, are no less important.

As Carlos E. Perez writes in his Medium essay on DALL-E’s relationship to philosopher Ludwig Wittgenstein’s ‘picture theory of meaning’ (i.e., that verbal statements are meaningful only if they can be pictured in the real world): “by pairing an understanding of natural language with an ability to generate corresponding visual representations -- in other words, by being able to both ‘read’ and ‘see’ -- DALL-E is a powerful demonstration of multimodal AI’s potential.”

In other words, if DALL-E can take arbitrary language and render accurate visual representations from it, then it’s essentially bringing Wittgenstein’s theory to life -- and showing what it means for machine learning to “bridge the gap between language expression and pictures.”

As computers get smarter, and our capacity for building more uses out of their intelligence expands, the kind of application that will “change the world” is likely not yet imagined. We can’t know what we’ll need DALL-E’s progeny to do, because we have yet to build the world where this technology will be necessary.

Yes, for now, DALL-E is doing something interesting and powerful -- drawing on 12 billion parameters to show a rudimentary understanding of grammatical constructions and to translate textual information into graphic information. But don’t expect the world to crumble or change irrevocably with this new development. Yet, at least.

Unlike an AI-rendered avocado chair, the most profound implications will only become visible over time.

--

If you want to stay up to date with all the new content we publish on our blog, share your email and hit the subscribe button.

Also, feel free to browse through the other sections of the blog where you can find many other amazing articles on: Programming, IT, Outsourcing, and even Management.


Andres was born in Quito, Ecuador, where he was raised with an appreciation for cultural exchange. After graduating from Universidad San Francisco de Quito, he worked for a number of companies in the US, before earning his MBA from Fordham University in New York City. While a student, he noticed there was a shortage of good programmers in the United States and an abundance of talented programmers in South America. So he bet everything on South American talent and founded Jobsity -- an innovative company that helps US companies hire and retain Latin American programmers.