There are infinite worlds to explore.
How these were made
All of these images were synthesized using two machine learning models, VQGAN and CLIP. The text prompts that I gave to CLIP all have “by James Gurney” (the patron saint of CLIP-driven art) appended to them to deliver the characteristic style of sumptuous oil paint, color, light, and detail. The code that I started from was written by RiversHaveWings based on a method by advadnoun, with further modification by Eleiber and Abulafia. I then added capabilities to that code to make compelling panoramas and to introduce concepts of symmetry. I’ll describe all of these methods below. (I’ll post the code here too once I get confirmation from my employer that they don’t have any copyright interest in it.) If the fact that these images were all generated by machines leaves you in awe the way it still does me, you can read more of my thoughts about that at the end of this post. For now, I’ve moved on from being overwhelmed by all this to trying to think about how to use it to make durable art.
VQGAN is a model designed to generate random realistic looking images. CLIP is a model designed to match images against text. The strategy that this code uses for producing images that match text is to generate a noise image from VQGAN, send it to CLIP, and have CLIP tell VQGAN “tweak it this way, and it will be a better match” a few hundred to a few thousand times. VQGAN is capable of creating pretty large images (by the standards of AI art) but CLIP can only accept a low resolution image as input, so to bridge the gap, the code cuts out patches of the image, distorts them, shrinks them to be small enough, and sends them to CLIP.
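The cut-and-shrink step can be sketched roughly like this. This is a minimal version with hypothetical names, not the original code: a random tensor stands in for the VQGAN output, and the real pipeline samples patch sizes and positions with more care.

```python
import torch
import torch.nn.functional as F

CLIP_RES = 224  # CLIP's fixed input resolution

def make_cutouts(image, n_cuts=32):
    """Cut random square patches from a large image and shrink each
    one down to CLIP's input size."""
    _, _, h, w = image.shape
    cutouts = []
    for _ in range(n_cuts):
        # Patch side length between CLIP's resolution and the short edge.
        size = torch.randint(CLIP_RES, min(h, w) + 1, ()).item()
        y = torch.randint(0, h - size + 1, ()).item()
        x = torch.randint(0, w - size + 1, ()).item()
        patch = image[:, :, y:y + size, x:x + size]
        cutouts.append(F.interpolate(patch, size=(CLIP_RES, CLIP_RES),
                                     mode='bilinear', align_corners=False))
    return torch.cat(cutouts)

# Stand-in for a VQGAN output: one 3-channel 640x384 image.
img = torch.rand(1, 3, 384, 640)
batch = make_cutouts(img, n_cuts=8)  # shape (8, 3, 224, 224)
```

Each of these cutouts is then scored by CLIP, and the gradients flow back through the resize and crop into the VQGAN latent.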
The bulk of the difference between strategies in this general framework comes down to how the patches are chosen and how the images are distorted before they’re scored against the text. Much like how digital artists flip their canvas to double-check their proportions, and artists in traditional media rotate around their canvas to view it from different angles as they work, giving CLIP randomly rotated, skewed, slightly blurred images produces much better results. (It also makes the optimization process robust to adversarial noise, in which the image is tweaked in a way that’s imperceptible to a human but overwhelms ML models, due to how dot products work in high dimensions.)
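A distortion step along these lines might look like the following in plain PyTorch. The angle range, skew range, and blur kernel here are illustrative guesses, not the pipeline the original code uses.

```python
import math
import torch
import torch.nn.functional as F

def distort(batch, max_angle=15.0):
    """Apply a small random rotation plus horizontal skew to each
    image in the batch, then a light blur."""
    n = batch.shape[0]
    theta = torch.empty(n).uniform_(-max_angle, max_angle) * math.pi / 180
    skew = torch.empty(n).uniform_(-0.1, 0.1)
    cos, sin = torch.cos(theta), torch.sin(theta)
    # 2x3 affine matrices: rotation with a skew term mixed in.
    mats = torch.zeros(n, 2, 3)
    mats[:, 0, 0] = cos
    mats[:, 0, 1] = -sin + skew
    mats[:, 1, 0] = sin
    mats[:, 1, 1] = cos
    grid = F.affine_grid(mats, batch.shape, align_corners=False)
    out = F.grid_sample(batch, grid, align_corners=False)
    # Slight blur: 3x3 box filter applied per channel.
    kernel = torch.full((3, 1, 3, 3), 1 / 9.0)
    return F.conv2d(out, kernel, padding=1, groups=3)

batch = torch.rand(8, 3, 224, 224)
aug = distort(batch)
```

Because every augmentation here is differentiable, the gradient from CLIP's score still reaches the underlying image.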
If you try to make a panoramic image using this method alone, the result isn’t very satisfying. CLIP can only see one square section of it at a time, so VQGAN finds the same optimum in each horizontal patch, and you end up with a repetitive image that just rearranges the same few elements.
If you want the model to produce something different, you have to tell it to do that. So the approach I took was to vary the prompt across the image. To do this, I keep track of the x axis position of the center of each cutout I make, and then weight the loss against each prompt with an interpolation function of the x coordinate.
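A hedged sketch of that position-dependent weighting: each cutout’s loss against the two prompts is weighted by where its center sits on the x axis. The linear ramp and the function names here are illustrative; any interpolation function of x works.

```python
def prompt_weights(x_center, width):
    """Return (left_weight, right_weight) for a cutout centered at
    x_center in an image of the given width."""
    t = x_center / width  # 0.0 at the far left, 1.0 at the far right
    return 1.0 - t, t

def blended_loss(loss_left, loss_right, x_center, width):
    """Combine per-prompt losses for one cutout."""
    w_left, w_right = prompt_weights(x_center, width)
    return w_left * loss_left + w_right * loss_right

# A cutout centered three quarters of the way across is scored
# mostly against the right-hand prompt.
w_left, w_right = prompt_weights(480, 640)  # (0.25, 0.75)
```

Averaged over many random cutouts, this smoothly hands the image off from one prompt to the other as you scan across it.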
The trick then becomes picking prompts that are consonant with each other. A good strategy is to pick your prompts following the pattern “<place>,” “<object> in that place.” Using this general pattern I was able to make a bunch of pretty compelling images. These were some of the more successful ones.
Though each prompt only directly influences one side of the image, they still affect each other both by sharing the latent representation in VQGAN and having some overlapping areas of influence. In the following set, all the images have the same prompt on the right hand side, and the prompt on the left varies. Notice how the choice of prompt on the left changes the time of day on the right.
Once I could make interesting consonant panoramas, I started thinking about making juxtapositions. What happens if you interpolate between two opposites? Then I started thinking about symmetry. Could I make a symmetric composition with two opposing concepts juxtaposed with each other?
The first thing I tried to accomplish this was working in image space with a simple image salience filter. I ran a Sobel edge detector, blurred the result, and then added a squared loss term comparing it to itself flipped horizontally. This produced some symmetric results, but it was too easy to hack. Need more edges? Just add noise. Need fewer edges? Remove all the detail. There’s probably more I could have tried here with a different Sobel kernel with wider bandwidth, but it was finicky enough that I moved on to other techniques.
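The edge-symmetry loss described above can be sketched like this. The blur size and the exact normalization are illustrative choices, not the original parameters.

```python
import torch
import torch.nn.functional as F

SOBEL_X = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
SOBEL_Y = SOBEL_X.t()

def edge_symmetry_loss(image):
    """image: (1, 3, H, W). Returns a scalar penalizing edge
    salience that differs between the two halves."""
    grey = image.mean(dim=1, keepdim=True)
    gx = F.conv2d(grey, SOBEL_X.view(1, 1, 3, 3), padding=1)
    gy = F.conv2d(grey, SOBEL_Y.view(1, 1, 3, 3), padding=1)
    edges = torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)
    # Blur the edge map so the comparison is about regions, not pixels.
    box = torch.full((1, 1, 5, 5), 1 / 25.0)
    blurred = F.conv2d(edges, box, padding=2)
    flipped = torch.flip(blurred, dims=[-1])
    return ((blurred - flipped) ** 2).mean()

loss = edge_symmetry_loss(torch.rand(1, 3, 64, 64))
```

The hack the post describes is visible in the math: the loss only cares about total edge energy matching across the axis, so uniform noise or uniform emptiness both satisfy it.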
The next thing I tried was matching tone. I made a greyscale version of the image, blurred it, flipped it, and added an L1 loss against itself. (L1 seemed to be better than L2 at allowing it to make objects before it tried to match the tone. The failure case with all these constraints is that it gives up on the primary objective, just satisfies the secondary one, and makes a muddy, disorganized mess.)
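The tone-matching constraint is a small variation on the same idea. Here is a minimal sketch; the blur size is a guess.

```python
import torch
import torch.nn.functional as F

def tone_symmetry_loss(image, blur=9):
    """image: (1, 3, H, W). Penalize greyscale tone that differs
    between the left and right halves."""
    grey = image.mean(dim=1, keepdim=True)
    box = torch.full((1, 1, blur, blur), 1.0 / (blur * blur))
    soft = F.conv2d(grey, box, padding=blur // 2)
    flipped = torch.flip(soft, dims=[-1])
    # L1 rather than L2: large local mismatches are tolerated while
    # the composition forms, instead of being crushed into mud early.
    return (soft - flipped).abs().mean()

loss = tone_symmetry_loss(torch.rand(1, 3, 64, 64))
```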
This started to produce some interesting and pleasing results, but I wanted to go further.
CLIP represents each image and bit of text with an “embedding” which is a point in high dimensional space that encapsulates everything CLIP knows about an image or a bit of text. Scoring an image against a text is then just a matter of measuring distance in that high dimensional space. The first deep learning models based on words ended up with embedding spaces that encoded a huge amount of meaning. In word2vec you could solve analogies just by doing vector math on the embeddings of the words. “King” – “man” + “woman” ≈ “queen” was the famous example. Is CLIP’s space similarly semantic? I decided to find out.
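The word2vec analogy above can be shown with toy vectors. Real CLIP embeddings are high-dimensional and come from the model; these three-dimensional numbers are made up purely to show the arithmetic.

```python
import torch
import torch.nn.functional as F

# Made-up 3-d "embeddings," chosen so the analogy works out.
emb = {
    "king":  torch.tensor([0.9, 0.8, 0.1]),
    "man":   torch.tensor([0.1, 0.9, 0.0]),
    "woman": torch.tensor([0.1, 0.1, 0.9]),
    "queen": torch.tensor([0.9, 0.0, 1.0]),
}

# king - man + woman should land nearest queen.
analogy = emb["king"] - emb["man"] + emb["woman"]
scores = {w: F.cosine_similarity(analogy, v, dim=0).item()
          for w, v in emb.items()}
best = max(scores, key=scores.get)  # "queen"
```

Scoring an image against text in CLIP is the same operation: cosine similarity between the two embeddings.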
To do this I introduced the idea of “semantic symmetry.” I first started by modifying the cutout logic to select symmetrical patches from each side of the image. Then for each pair of patches, I added a new objective. Instead of just maximizing cosine(text, patch) I also maximized cosine(patch1 – text1, patch2 – text2). The goal here is to take objects from one context and translate them into the other. This second objective I give a lower weight so that it doesn’t prevent the image from succeeding at the overall task of matching the image against the text. Otherwise it’s easy for it to solve this task with “failure matches failure.”
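The combined objective can be sketched as follows. The function name and the 0.2 weight are illustrative; random vectors stand in for the CLIP embeddings of the patches and prompts.

```python
import torch
import torch.nn.functional as F

def semantic_symmetry_loss(patch1, text1, patch2, text2, weight=0.2):
    """All arguments are CLIP embedding vectors. patch1/patch2 are
    mirrored cutouts; text1/text2 are their respective prompts."""
    # Primary terms: each patch should match its own prompt.
    primary = -(F.cosine_similarity(patch1, text1, dim=-1)
                + F.cosine_similarity(patch2, text2, dim=-1))
    # Secondary term: the *differences* patch - text should line up,
    # translating objects from one context into the other.
    secondary = -F.cosine_similarity(patch1 - text1, patch2 - text2, dim=-1)
    # Lower weight on the secondary term, so the image can't satisfy
    # symmetry by failing both prompts identically.
    return primary + weight * secondary

p1, t1, p2, t2 = (torch.randn(512) for _ in range(4))
loss = semantic_symmetry_loss(p1, t1, p2, t2)
```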
With this tool in hand, I suddenly felt like I had awesome cosmic powers. Want to see how CLIP interprets the same objects in a cyberpunk dystopia vs a retrofuturist utopia? Let’s check.
Or we can go for gentler scenes and juxtapose a fall garden with a spring garden.
When I showed people these images, they thought they were pretty and impressive, but no one immediately got what was going on. The two prompts blend together seamlessly enough that without context it just looks like an ordinary somewhat symmetric composition. At worst, the objects match up so well that it starts to look like the boring repetitive panoramas we started with. I needed something that would really highlight the contrast, something that would make it obvious that there were two places, two choices. After many failed attempts to make a good looking heaven/hell juxtaposition I finally asked it to show me the “entrance” and everything clicked into place.
If you look back now at some of the images with this in mind, you might notice details you didn’t before, the fairy wings paired with dragon wings in Fae / Dragon, the gold leaf paired with seaweed leaves in El Dorado / Atlantis.
The first prompt I tried for each idea was “entrance to <place> by James Gurney” or, for concepts that aren’t themselves places, “entrance to the <subject> realm by James Gurney.” Usually this led to good results. Before settling on “realm” I tried “land” and “world”, but “land” tended to diminish the doorway, and “world” tended to hide little globes all over the image. After seeing what that prompt created, the first thing I would tweak is the description of the entrance. I would substitute “doorway,” “archway,” “gate,” or “gateway,” depending on what the theme and the scale of the subject seemed most consistent with.
CLIP knew a lot about most of the subjects, and I didn’t have to provide any further description. For some of them, CLIP’s vision was different from mine, so I tweaked the prompt accordingly. (It saw El Dorado as being in a desert, and Atlantis as an ancient high-tech concept rather than classical ruins.) For the non-English words and places, it needed more help. It generally understood bodhi and samsara, but produced better results when I added “enlightenment” and “the cycle of death and rebirth.” For the Norse afterlife, it knew both places were Norse, but that’s about all it knew, and it would produce a generic craggy Norwegian landscape without more description. It knew nothing at all about the Egyptian words. I attempted to make the Aztec afterlife as well, but as with the Norse words, it just produced images that looked like a museum diorama of Aztec life, and I ultimately couldn’t produce anything that didn’t look like a caricature.
CLIP’s understanding of the world reflects what people make images of, and so is a mirror to what we can visualize. Its understanding of the future comes mostly from the history of science fiction art, and it has a deep well of styles from that history to draw on. But its vision of the future seemed much more limited than its visions of fantastical places and the past, and I had much less success there. It had no idea how to visualize a positive climate future whatsoever, probably because artists don’t either, and so it left me to describe what that vision might look like myself.
Because optimizing the image within these constraints is a more difficult task, I used much lower learning rates than people typically use, usually around 0.025 instead of 0.2. This meant that each image took 30–45 minutes to complete, and 15 minutes to show a clear trajectory. Around half of the images you see above worked immediately. Another quarter or so took just a few rerolls of the seed or tweaks of the prompt to get them to work. Some of them took many, many retries and rewordings. For some of the ideas, I spent probably 30+ hours of GPU time trying different seeds, different prompts, and different initialization images, and could never get them to work well. Dear reader, you will not believe the amount of time I spent trying to make an entrance to an orbital city paired with an entrance to an underwater city. I’m convinced such an entrance does not exist, or if it does, it is a Lovecraftian fractal blob.
An astute reader might notice that there’s nothing in the symmetry constraint that guarantees the image will contain two doors rather than, say, three with one in the center that mixes the two prompts, or indeed any number, even or odd. This was the cause of most of the retries. For some of the themes where I felt a middle door was appropriate, I kept that version, since this was the most common output other than two. I started to develop superstitions about particular random seeds. “This prompt feels like it would be good for a random seed of 3,” was a real thought I started thinking, even though it’s crazy.
Over time, you start to develop an intuitive sense of which ideas are hopeless based on the first thing it generates, and which will work out perfectly with just a little reshuffling. I found that the more specific a vision I had of what the output should look like, the more I was disappointed and frustrated. You have to embrace the serendipity. Making art with these tools is a lot like doing practical effects in 20th-century movies: you have to know the limits of your medium and design your art around them. CLIP cannot tell whether a figure has the right number of limbs, or how many mouths it has, and it does not care, so if you care, you’d better figure out how to keep figures out of your images. Some strategies I’ve employed: adding “secluded, empty, silent, alone” to the prompt to indicate that nobody is there; adding a time of day when it’s likely to be unoccupied, like “in the early morning”; or, if neither of those works, adding “enormous” to the subject, so that any figures become tiny pixelated blobs by comparison. This is the main reason I’ve focused so much on landscapes so far. Plants, mountains, and trails can have almost any arrangement, and our minds will still accept it as correct.
As I worked on this, I thought a lot about what it means to make art in this medium. If you can generate 20 or 100 images per day, why should anyone spend time looking at any one of them? Sometimes it can be motivated by a new technique, like the symmetry I’ve played with here, or with seeing what a new model can do like in the CLIP guided diffusion experiments that RiversHaveWings is doing today. Many have suggested that the role of the artist will be in large part curation and discernment. Tweaking code, parameters, prompts will be a skill, but a lot will come down to simply filtering the firehose to the best of the best, and I do think that’s part of it.
But ultimately, these images do not stand up to a careful viewing. The advantage this medium has is volume, and I think that’s what it should embrace. If you can make many good images in a day, from your phone or in the background while doing your day job, there’s no excuse not to exhaustively explore a theme. Explore every nook and cranny, find every detail of what the models know. Let the images wash over the viewer leaving a vague impression in the volume that is their natural state.
I did that in my last piece where I stumbled upon this strategy by accident out of my own curiosity, but I expect this may be a durable strategy going forward. Being an AI artist will be less like being a painter, laboring over a single image for weeks and rewarding your viewer with texture and detail that they come back to again and again, and more like being a poet, writing line after line of carefully chosen text, that while they may have power individually, only achieve their impact as a whole.
Epilogue: Workflow Examples
Here I’ll document what the workflow looked like for one of the more successful but more difficult images. This one required an unusual number of attempts, and the way the images change with different approaches is illustrative.
As you can see from the process here, as I gradually made things more complicated and then simplified them back down, it’s often tempting to specify ever more exactly the image you are looking for, but I find that often makes the result worse. There’s a sweet spot: if a prompt is too specific it fails, and if it’s too general it fails, so the trick is to find a pithy few words that negotiate the difference between what you are looking for and what it wants to make.
When I started working on this, I was a “purist” about only taking exactly what came out of the machine with only tweaking of the code, prompts, and parameters allowed. As it went on and I tired of seeing images with perfect parts that weren’t quite right overall, I started taking a more active role. If an image had elements I liked, but (for instance) it had too many doors, I’d download it, rearrange it, and then use that as the initialization for another round of optimization.
Only three or four got this “mid optimization GIMP surgery” and this was the most significant edit I did. The rest were straight from the code.
For the Mars / Luna image I used a seed image to induce it to construct the airlock frame with the door in the center. Images like that are difficult for this method to construct: it’s quite hard to “grow” an airlock around the periphery of the image from a seed of it somewhere else. You need something to “globally coordinate” it, and that’s what the seed provides. I still don’t feel like I’ve quite gotten the hang of how to get good results from seed images. Everything I tried tended to turn out worse, and you can see that the Mars / Luna image looks much less detailed than the rest.