Doorways

There are infinite worlds to explore.

Afterlife

Heaven / Hell
Limbo / Purgatory
Bodhi / Samsara
Valhalla / Fólkvangr
Paradise / The Underworld
Fields of A’aru / Maw of Ammut
Remembered / Forgotten

Nature

Sun / Moon
Fall / Spring
Summer / Winter
Monsoon / Drought
Plants / Fungus
Flowers / Insects
Forest / Kelp Forest
Mountain / Canyon
Desert / Bayou
Glacier / River
Rainforest / Coral Reef
Plains / Tundra
Crystals / Fossils

Other Worlds

El Dorado / Atlantis
Fae / Dragons
Gnomes / Trolls
Giants / Elves
Ice Cream / Chocolate
Meat / Vegetables
Bread / Butter
Grimm’s Fairy Tales / 1001 Arabian Nights
Cyberpunk / Steampunk

Future

Mars / Luna
Artificial Intelligence / Genetic Engineering
Space / Virtual Worlds
Climate Apocalypse / An Uncertain But Hopeful Future

How these were made

All of these images were synthesized using two machine learning models, VQGAN and CLIP. The text prompts that I gave to CLIP all have “by James Gurney” (the patron saint of CLIP-driven art) appended to them to deliver the characteristic style of sumptuous oil paint, color, light, and detail. The code that I started from was written by RiversHaveWings, based on a method by advadnoun, with further modifications by Eleiber and Abulafia. I then added capabilities to that code to make compelling panoramas and to introduce concepts of symmetry. I’ll describe all of these methods below. (I’ll post the code here too once I get confirmation from my employer that they don’t have any copyright interest in it.) If the fact that these images were all generated by machines leaves you in awe the way it still does me, you can read more of my thoughts about that at the end of this post. For now, I’ve moved on from being overwhelmed by all this to trying to think about how to use it to make durable art.

VQGAN is a model designed to generate random realistic-looking images. CLIP is a model designed to match images against text. The strategy that this code uses for producing images that match text is to generate a noise image from VQGAN, send it to CLIP, and have CLIP tell VQGAN “tweak it this way, and it will be a better match” a few hundred to a few thousand times. VQGAN is capable of creating pretty large images (by the standards of AI art), but CLIP can only accept a low-resolution image as input, so to bridge the gap, the code cuts out patches of the image, distorts them, shrinks them to be small enough, and sends them to CLIP.
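To make that loop concrete, here’s a minimal sketch of the feedback cycle in PyTorch. The vqgan.decode and clip.encode_image calls are stand-ins for the real model interfaces, and the cutout logic is simplified, so treat this as an illustration rather than the actual notebook code.

```python
import torch
import torch.nn.functional as F

def random_cutouts(img, n_cuts, cut_size=224):
    """Cut random patches of varying size and resize them to CLIP's input size."""
    _, _, h, w = img.shape  # assumes the image is larger than cut_size on both axes
    cutouts = []
    for _ in range(n_cuts):
        size = int(torch.randint(cut_size, min(h, w) + 1, ()).item())
        top = int(torch.randint(0, h - size + 1, ()).item())
        left = int(torch.randint(0, w - size + 1, ()).item())
        patch = img[:, :, top:top + size, left:left + size]
        cutouts.append(F.interpolate(patch, size=(cut_size, cut_size),
                                     mode='bilinear', align_corners=False))
    return torch.cat(cutouts, dim=0)

def synthesize(vqgan, clip, text_embed, z, steps=500, lr=0.05, n_cuts=32):
    """One prompt, one image: CLIP repeatedly tells VQGAN how to tweak its latent."""
    z = z.clone().requires_grad_(True)           # the VQGAN latent being optimized
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        img = vqgan.decode(z)                    # latent -> full-size image
        patches = random_cutouts(img, n_cuts)    # bridge to CLIP's small input
        img_embeds = clip.encode_image(patches)  # one embedding per patch
        loss = (1 - F.cosine_similarity(img_embeds, text_embed)).mean()
        opt.zero_grad()
        loss.backward()                          # gradients flow back through VQGAN
        opt.step()
    return vqgan.decode(z).detach()
```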

The bulk of the difference between different strategies in this general framework comes down to how the patches are chosen and how the images are distorted before they’re scored against the text. Much like how digital artists flip their canvas to double-check their proportions, and artists in traditional media rotate around their canvas to view it from different angles as they work, giving CLIP randomly rotated, skewed, slightly blurred images produces much better results. (It also makes the optimization process robust to adversarial noise, in which the image is tweaked in a way that’s imperceptible to a human but overwhelms ML models, due to how dot products work in high dimensions.)
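As a rough illustration, an augmentation stack along those lines might look like the following. The specific transforms and parameters here are plausible choices, not the exact ones the notebook uses.

```python
import torchvision.transforms as T

# A plausible augmentation stack: each batch of cutouts gets randomly rotated,
# skewed, and slightly blurred before CLIP scores it. (For brevity this applies
# one random draw to the whole batch; per-cutout randomness works even better.)
augment = T.Compose([
    T.RandomAffine(degrees=15, translate=(0.05, 0.05), shear=10,
                   interpolation=T.InterpolationMode.BILINEAR),
    T.RandomPerspective(distortion_scale=0.2, p=0.7),
    T.GaussianBlur(kernel_size=3, sigma=(0.1, 1.0)),
])

def augmented_cutouts(img, n_cuts):
    patches = random_cutouts(img, n_cuts)   # from the sketch above
    return augment(patches)                 # these transforms stay differentiable on tensors
```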

Humans are better classifiers when they have augmentations too.

Panoramas

If you try to make a panoramic image using this method alone, the result isn’t very satisfying. CLIP can only see one square section of it at a time, so VQGAN finds the same optimum in each horizontal patch, and you end up with a repetitive image that just rearranges the same few elements.

A very repetitive panorama

If you want the model to produce something different, you have to tell it to do that. So the approach I took was to vary the prompt across the image. To do this, I keep track of the x-axis position of the center of each cutout I make, and then weight the loss against each prompt with an interpolation function of the x coordinate.
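Here’s a sketch of what that weighting can look like, assuming the cutout routine also records the x coordinate of each patch’s center. The names are illustrative, not taken from the actual code.

```python
import torch
import torch.nn.functional as F

def panorama_loss(img_embeds, centers_x, width, text_embed_left, text_embed_right):
    """Blend two prompts across a wide image.

    centers_x holds the x coordinate (in pixels) of the center of each cutout.
    Each cutout's loss is a mix of its distance to the left prompt and to the
    right prompt, with the weights interpolated from its horizontal position."""
    t = torch.as_tensor(centers_x, dtype=torch.float32,
                        device=img_embeds.device) / width  # 0 at the left edge, 1 at the right
    d_left = 1 - F.cosine_similarity(img_embeds, text_embed_left)
    d_right = 1 - F.cosine_similarity(img_embeds, text_embed_right)
    # simple linear interpolation; a smoother ramp (e.g. smoothstep) also works
    return ((1 - t) * d_left + t * d_right).mean()
```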

Interpolation Success!

The trick then becomes picking prompts that are consonant with each other. A good strategy is to pick your prompts following the pattern “place” / “object in that place.” Using this general pattern I was able to make a bunch of pretty compelling images. These were some of the more successful ones.

Though each prompt only directly influences one side of the image, they still affect each other both by sharing the latent representation in VQGAN and having some overlapping areas of influence. In the following set, all the images have the same prompt on the right hand side, and the prompt on the left varies. Notice how the choice of prompt on the left changes the time of day on the right.

Sunny Clearing in the Enchanted Garden / Enchanted Garden
Flowers in the Enchanted Garden / Enchanted Garden
Lanterns in the Enchanted Garden / Enchanted Garden
Fireflies in the Enchanted Garden / Enchanted Garden

Once I could make interesting consonant panoramas, I started thinking about making juxtapositions. What happens if you interpolate between two opposites? Then I started thinking about symmetry. Could I make a symmetric composition with two opposing concepts juxtaposed with each other?

My first attempt at this worked in image space with a simple image-salience filter. I ran a Sobel edge detector, blurred it, and then added a squared loss term comparing the result to itself flipped horizontally. This produced some symmetric results, but it was too easy to hack. Need more edges? Just add noise. Need fewer edges? Remove all the detail. There’s probably more I could have tried here with a wider-bandwidth Sobel kernel, but it was finicky enough that I moved on to other techniques.
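For reference, that salience loss can be sketched roughly like this, with a simple box blur standing in for whatever blur you prefer:

```python
import torch
import torch.nn.functional as F

def edge_symmetry_loss(img):
    """Squared loss between a blurred Sobel edge map and its horizontal mirror."""
    gray = img.mean(dim=1, keepdim=True)                      # [1, 1, H, W]
    sobel_x = torch.tensor([[-1., 0., 1.],
                            [-2., 0., 2.],
                            [-1., 0., 1.]], device=img.device).view(1, 1, 3, 3)
    sobel_y = sobel_x.transpose(2, 3)
    gx = F.conv2d(gray, sobel_x, padding=1)
    gy = F.conv2d(gray, sobel_y, padding=1)
    edges = torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)
    blur = torch.ones(1, 1, 7, 7, device=img.device) / 49.0   # box blur; a Gaussian works too
    edges = F.conv2d(edges, blur, padding=3)
    return F.mse_loss(edges, torch.flip(edges, dims=[3]))     # compare against the mirror image
```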

My test prompt for this method was Dystopia / Utopia (both “by James Gurney,” of course). Sobel filters are easy to hack. Either remove all the detail, because a blur always matches a blur…
…or just throw in noise everywhere because noise matches noise.
This is the best image I got from this technique, but I wasn’t satisfied with the fuzziness.

The next thing I tried was matching tone. I made a greyscale version of the image, blurred it, flipped it, and added an L1 loss against itself. (L1 seemed better than L2 at allowing it to make objects before it tried to match the tone. The failure case with all of these constraints is that it gives up on the primary objective, just satisfies the secondary one, and makes a muddy, disorganized mess.)
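A rough sketch of the tone constraint, in the same spirit:

```python
import torch
import torch.nn.functional as F

def tone_symmetry_loss(img):
    """L1 loss between a blurred greyscale version of the image and its mirror."""
    gray = img.mean(dim=1, keepdim=True)                        # [1, 1, H, W]
    blur = torch.ones(1, 1, 15, 15, device=img.device) / 225.0  # heavy blur: match tone, not detail
    tone = F.conv2d(gray, blur, padding=7)
    return F.l1_loss(tone, torch.flip(tone, dims=[3]))
```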

All of the interesting pairings of objects here are just an accident of providing a contrasting prompt (dystopian city/utopian city) and the tone constraint.

This started to produce some interesting and pleasing results, but I wanted to go further.

Semantic Symmetry

CLIP represents each image and each bit of text with an “embedding”: a point in high-dimensional space that encapsulates everything CLIP knows about that image or text. Scoring an image against a text is then just a matter of measuring distance in that space. The first deep learning models based on words ended up with embedding spaces that encoded a huge amount of meaning. In word2vec you could solve analogies just by doing vector math on the embeddings of the words; “king” – “man” + “woman” ≈ “queen” was the famous example. Is CLIP’s space similarly semantic? I decided to find out.
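As a toy illustration of that kind of embedding arithmetic, here’s how you might check an analogy with cosine similarity. The embed helper is hypothetical; with CLIP you would use its text encoder on tokenized prompts.

```python
import torch.nn.functional as F

def best_analogy(embed, a, b, c, candidates):
    """Is embed(a) - embed(b) + embed(c) closest to the right candidate?
    'embed' is a hypothetical text-embedding function (e.g. CLIP's text encoder)."""
    target = embed(a) - embed(b) + embed(c)
    scores = {w: F.cosine_similarity(target, embed(w), dim=-1).item() for w in candidates}
    return max(scores, key=scores.get)

# best_analogy(embed, "king", "man", "woman", ["queen", "prince", "castle"])  # ideally "queen"
```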

To find out, I introduced the idea of “semantic symmetry.” I started by modifying the cutout logic to select symmetrical patches from each side of the image. Then for each pair of patches, I added a new objective. Instead of just maximizing cosine(text, patch), I also maximized cosine(patch1 – text1, patch2 – text2). The goal here is to take objects from one context and translate them into the other. I gave this second objective a lower weight so that it doesn’t prevent the image from succeeding at the overall task of matching the image against the text. Otherwise it’s easy for it to solve this task with “failure matches failure.”
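In code, the idea looks roughly like this. The patch selection and weighting here are simplified for illustration rather than copied from the actual code.

```python
import torch
import torch.nn.functional as F

def semantic_symmetry_loss(clip, img, text_embed_left, text_embed_right,
                           n_pairs=8, cut_size=224, weight=0.2):
    """For mirrored pairs of patches, encourage (patch_left - text_left) and
    (patch_right - text_right) to point the same way: take objects from one
    context and translate them into the other."""
    _, _, h, w = img.shape  # assumes the image is at least 2*cut_size wide and cut_size tall
    losses = []
    for _ in range(n_pairs):
        size = int(torch.randint(cut_size, min(h, w // 2) + 1, ()).item())
        top = int(torch.randint(0, h - size + 1, ()).item())
        left = int(torch.randint(0, w // 2 - size + 1, ()).item())
        right = w - left - size                    # the mirrored x position
        pair = torch.cat([img[:, :, top:top + size, left:left + size],
                          img[:, :, top:top + size, right:right + size]], dim=0)
        e = clip.encode_image(F.interpolate(pair, size=(cut_size, cut_size),
                                            mode='bilinear', align_corners=False))
        sim = F.cosine_similarity(e[0:1] - text_embed_left, e[1:2] - text_embed_right)
        losses.append(1 - sim.mean())
    # a low weight, so this doesn't override matching the prompts themselves
    return weight * torch.stack(losses).mean()
```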

After many failed attempts and a lot of images that looked almost right but not quite, it was difficult to tell whether it was working. This was the first test that proved to me it was. Notice how identically styled fireflies are placed symmetrically.

With this tool in hand, I suddenly felt like I had awesome cosmic powers. Want to see how CLIP interprets the same objects in a cyberpunk dystopia vs a retrofuturist utopia? Let’s check.

Utopia or dystopia, the brands stay the same.
I love how it translates the futuristic police car into a pink-finned ’50s flying car. Also note the same street texture, but broken up in the dystopia, and the shanty translated into a swooping skywalk/monorail thing.

Or we can go for gentler scenes and juxtapose a fall garden with a spring garden.

Fall Garden / Spring Garden

When I showed people these images, they thought they were pretty and impressive, but no one immediately got what was going on. The two prompts blend together seamlessly enough that without context it just looks like an ordinary, somewhat symmetric composition. At worst, the objects match up so well that it starts to look like the boring repetitive panoramas we started with. I needed something that would really highlight the contrast, something that would make it obvious that there were two places, two choices. After many failed attempts to make a good-looking heaven/hell juxtaposition, I finally asked it to show me the “entrance,” and everything clicked into place.

If you look back now at some of the images with this in mind, you might notice details you didn’t before: the fairy wings paired with dragon wings in Fae / Dragons, the gold leaf paired with seaweed leaves in El Dorado / Atlantis.

Iteration Process

The first prompt I tried for each idea was “entrance to <place> by James Gurney” or, for concepts that aren’t themselves places, “entrance to the <subject> realm by James Gurney.” Usually this led to good results. Before settling on “realm” I tried “land” and “world”, but “land” tended to diminish the doorway, and “world” tended to hide little globes all over the image. After seeing what that prompt created, the first thing I would tweak is the description of the entrance. I would substitute “doorway,” “archway,” “gate,” or “gateway” depending on what the theme and the scale of the subject seemed most consistent with.

CLIP knew a lot about most of the subjects, and I didn’t have to provide any further description. For some of them, CLIP’s vision was different from mine, so I tweaked the prompt accordingly. (It saw El Dorado as being in a desert, and Atlantis as an ancient high-tech concept rather than classical ruins.) For the non-English words and places, it needed more help. It generally understood bodhi and samsara, but produced better results when I added “enlightenment” and “the cycle of death and rebirth.” For the Norse afterlife, it knew both places were Norse, but that’s about all it knew, and it would produce a generic craggy Norwegian landscape without more description. It knew nothing at all about the Egyptian words. I attempted to make the Aztec afterlife as well, but similar to the Norse words, it just produced images that looked like a museum diorama of Aztec life, and I ultimately couldn’t produce anything that didn’t look like a caricature.

CLIP’s understanding of the world reflects what people make images of, and so it is a mirror of what we can visualize. Its understanding of the future comes mostly from the history of science fiction art, and it has a deep well of styles from that tradition to draw from. But its vision of the future seemed much more limited than its visions of fantastical places and the past, and I had much less success there. It had no idea how to visualize a positive climate future whatsoever, probably because artists don’t either, and so it left me to describe what that vision might look like myself.

Because optimizing the image within these constraints is a more difficult task, I used much lower learning rates than people typically use, usually around 0.025 instead of 0.2. This meant that each image took 30-45 minutes to complete, and about 15 minutes to show a clear trajectory. Around half of the images you see above worked immediately. Another quarter or so took just a few re-rolls of the seed or tweaks of the prompt to get them to work. Some of them took many, many retries and rewordings. For some of the ideas I spent probably 30+ hours of GPU time trying different seeds, different prompts, and different initialization images, and could never get them to work well. Dear reader, you will not believe the amount of time I spent trying to make an entrance to an orbital city paired with an entrance to an underwater city. I’m convinced such an entrance does not exist, or if it does, it is a Lovecraftian fractal blob.

Orbital city / Underwater city. Beautiful places, but I’ve decided they have no entrance. You can’t go there no matter how hard you try.
This is the closest I got, after I don’t even know how many attempts.

An astute reader might notice that there’s nothing in the symmetry constraint that guarantees the image will contain two doors, rather than, say, three with one in the center that mixes the two prompts, or indeed any number, even or odd. This was the cause of most of the retries. For some of the themes where I felt like a middle door was appropriate, I kept that version, since this was the most common output other than two. I started to develop superstitions about particular random seeds. “This prompt feels like it would be good for a random seed of 3” was a real thought I started thinking, even though it’s crazy.

Over time, you start to develop an intuitive sense of which ideas are hopeless based on the first thing it generates, and which will work out perfectly with just a little reshuffling. I found that the more specific a vision I had in mind of what the output should look like, the more I was disappointed and frustrated. You have to embrace the serendipity. Making art with these tools is a lot like making practical effects for 20th-century movies: you have to know the limits of your medium and design your art around them. CLIP cannot tell whether a figure has the right number of limbs, or how many mouths it has, and it does not care, so if you care, you’d better figure out how to keep figures out of your images. Some strategies I’ve employed: adding “secluded, empty, silent, alone” to the prompt to indicate that nobody is there; adding a time of day when the place is likely to be unoccupied, like “in the early morning”; or, if none of those work, adding “enormous” to the subject, so that any figures become tiny pixelated blobs by comparison. This is the main reason I’ve focused so much on landscapes so far. Plants, mountains, and trails can have almost any arrangement, and our minds will still accept it as correct.

Final Thoughts

As I worked on this, I thought a lot about what it means to make art in this medium. If you can generate 20 or 100 images per day, why should anyone spend time looking at any one of them? Sometimes it can be motivated by a new technique, like the symmetry I’ve played with here, or by seeing what a new model can do, like the CLIP-guided diffusion experiments that RiversHaveWings is doing today. Many have suggested that the role of the artist will be in large part curation and discernment. Tweaking code, parameters, and prompts will be a skill, but a lot will come down to simply filtering the firehose for the best of the best, and I do think that’s part of it.

But ultimately, these images do not stand up to a careful viewing. The advantage this medium has is volume, and I think that’s what it should embrace. If you can make many good images in a day, from your phone or in the background while doing your day job, there’s no excuse not to exhaustively explore a theme. Explore every nook and cranny; find every detail of what the models know. Let the images wash over the viewer, leaving a vague impression in the volume that is their natural state.

I did that in my last piece, where I stumbled upon this strategy by accident out of my own curiosity, but I expect it may be a durable strategy going forward. Being an AI artist will be less like being a painter, laboring over a single image for weeks and rewarding your viewer with texture and detail that they come back to again and again, and more like being a poet, writing line after line of carefully chosen text, lines that, while they may have power individually, only achieve their impact as a whole.

Epilogue: Workflow Examples

Here I’ll document what the workflow looked like for one of the more successful but more difficult images. This one required an unusual number of attempts, and the way the images change with different approaches is illustrative.

“Entrance to the forest realm by James Gurney | Entrance to the kelp forest realm by James Gurney” This looks great, and clearly has potential. The way the kelp blends into the leaves of the forest is perfect. But there are obviously a few things to fix. There are too many doors. And it seems to think that the kelp forest is in an aquarium with people looking in at it, and the people look terrible. Is that a wookie on the right?
Let’s try adding the keyword “wild” to indicate that the kelp forest should not be in an aquarium, and see what we get. This looks good overall, but has way too many doors; let’s re-roll.
Does “Gate” work better? Overall pretty good. Still too many doors. Wish I didn’t have that shadowy figure there. He shows up a lot for some reason.
If we make the entrances “enormous” the people become nicely tiny and we get two doors, but the doors look bad.
If we can’t get rid of the middle door, let’s try to do something cool with it. There are more kinds of forest; let’s try deciduous / evergreen / kelp. This is interesting, but the doors are weird.
Trying without “wild” to see if I get a more organized composition. Gah! People again.
What if I go back to “wild” but reorder it with evergreen in the center? This one is maybe usable, no people, three doors with different themes, but there’s none of the awesome blending between forests that we had in the first example that was so compelling.
Does “Redwood” work better than conifer or evergreen? Nope. Also, too many doors.
What about “pine?” Nope. Also, too many doors.
Back to conifer. Too many doors.
Adding “solitary” seems to help with the too-many-doors problem, but I don’t like what this one is doing in the center.
Rerolling to get the center door back, but it still looks bad.
Does it work if I abandon the kelp forest and just do evergreen and deciduous? Not really.
What if I go back to the original setup with the blending I loved so much, and try “underwater” for the kelp forest side instead of “wild”? The left here is still “deciduous.”
Does that work with three prompts? Kinda but not really.
Back to basics. The left prompt is back to just “forest” instead of deciduous or evergreen forest, and the only difference now is that the right kelp forest entrance is “underwater” to ensure that it isn’t in an aquarium and there are no people in front of it. Success!

As you can see from the process here, as I gradually made things more complicated and then simplified them back down, it’s often tempting to try to specify exactly the image you are looking for, but I find that often makes the result worse. There’s a sort of sweet spot: if a prompt is too specific it fails, and if it’s too general it fails, so the trick is to find a pithy few words that negotiate the difference between what you are looking for and what it wants to make.

When I started working on this, I was a “purist” about only taking exactly what came out of the machine with only tweaking of the code, prompts, and parameters allowed. As it went on and I tired of seeing images with perfect parts that weren’t quite right overall, I started taking a more active role. If an image had elements I liked, but (for instance) it had too many doors, I’d download it, rearrange it, and then use that as the initialization for another round of optimization.

This was starting to turn out great, except for that door in the middle. One of the downsides of how the symmetry constraint is constructed is that it operates at the “patch” level. For this prompt there wasn’t much obvious subject matter to fill in the areas around the doors, so I started out putting most of the weight on very large patches to make the doors fill the area. Because of that, it’s happy to say “both of these big areas contain entrances, looks fine” and not bother to align them precisely.
So rather than spend a while rerolling, I deleted the extra door, smudged it out with the same colors, put the other door where I wanted it, and blurred the whole thing so it could again make its own decisions about where to put the details.
And success! When I start with an initialization like this from a previous run that already has the elements in the right places, I put most of the weight on small patches so that it just works on filling in detail.
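One way to express that kind of weighting is to bias the random distribution of cutout sizes depending on the stage. This is a simplified sketch of the idea, not an excerpt from the code:

```python
import torch

def sample_cut_size(min_size, max_size, favor_large=True):
    """Bias random cutout sizes toward large patches (to block in the composition
    from scratch) or small ones (to refine detail when starting from an edited
    initialization image)."""
    u = torch.rand(()).item()
    u = 1 - u ** 2 if favor_large else u ** 2   # skew the uniform draw toward one end
    return int(min_size + u * (max_size - min_size))
```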

Only three or four images got this “mid-optimization GIMP surgery,” and this was the most significant edit I did. The rest came straight from the code.

For the Mars / Luna image I used a seed image to induce it to construct the airlock frame with the door in the center. Images like that are difficult for this method to construct: it’s quite difficult to “grow” an airlock around the periphery of the image from a seed of it somewhere else. You need something to “globally coordinate” it, and that’s what the seed provides. I still don’t feel like I’ve quite got the hang of getting good results from seed images. Everything I tried tended to turn out worse, and you can see that the Mars / Luna image looks much less detailed than the rest.

The seed image for Mars / Luna. You probably can’t tell, but there are two slightly darker blotches in the center of both sides, and that points the optimization in the right direction.
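For anyone curious, starting from a seed image just means encoding it into the VQGAN latent and optimizing from there instead of from noise. A rough sketch, with vqgan.encode standing in for the real encoder interface (which returns more than just the latent):

```python
import torch
import torchvision.transforms.functional as TF
from PIL import Image

def latent_from_seed(vqgan, path, size):
    """Start the optimization from a seed image instead of noise, so large-scale
    structure (like an airlock frame spanning the panorama) is coordinated globally
    from step one. vqgan.encode is a stand-in for the real encoder interface."""
    img = Image.open(path).convert('RGB').resize(size)   # size is (width, height)
    x = TF.to_tensor(img).unsqueeze(0) * 2 - 1           # [1, 3, H, W] scaled to [-1, 1]
    with torch.no_grad():
        z = vqgan.encode(x)                              # image -> VQGAN latent
    return z.clone().requires_grad_(True)                # then optimize as usual
```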
