Ready-Made Day Dreams with Stable Diffusion

I am very lucky to be able to walk every day and day-dream! As I walk, I sometimes snap photos of things I might use in my art later (oh to be a 'real artist' and be able to sketch!). As I sat on a bench after a heavy rainstorm, I fancied that a tuft of grass in the path in front of me looked like a verdant island in a vast river estuary, so I snapped a picture. Some burnt logs, half-submerged in a fresh puddle, reminded me of crocodiles...

As well as text prompts, Stable Diffusion allows you to use an image as a prompt, and have Stable Diffusion 'evolve' the image towards a given text prompt (known as "img2img"). I've been experimenting with using my own pictures as img2img inputs to Stable Diffusion, but as I skirted around the crocodiles, it occurred to me that Stable Diffusion could be co-opted into turning my 'day dreams' into 'real' pictures: basically, give it the photo of the logs, but tell it they are crocodiles and let it do its thing!

That evening, I gave it a go, starting with the grass-island. When you specify an image as input, you are invited to give it an 'opacity', which actually tells Stable Diffusion how much noise to add to the image before starting to evolve it. After some trial-and-error, I settled on around 80% and gave it the (not very sophisticated) prompt: Island in the bend of a sandy river. I suspect you can get a lot cleverer with the prompt, but after browsing through a few possibilities and tweaking the prompt, I got something pretty cool:

I love the way it looks believably like an island with trees, and a forest behind, and yet it's clearly the same image as the clump of grass below - you can map the different features of each more-or-less 1:1; even the paler patches in the mud behind become clearings in the forest. So cool.
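Under the hood, that 'opacity' setting is just controlling how much Gaussian noise gets mixed into the input before the denoising starts. Here's a minimal numpy sketch of the idea - the linear-looking blend below is my own simplification of the variance-preserving mix real schedulers use, not the exact maths:

```python
import numpy as np

def noise_input_image(image, opacity, rng):
    """Blend an input image with Gaussian noise.

    opacity=1.0 keeps the image untouched; opacity=0.0 replaces it
    with pure noise. Real diffusion schedulers use a mix like
    sqrt(a)*image + sqrt(1-a)*noise, which is what we sketch here.
    """
    noise = rng.standard_normal(image.shape)
    return np.sqrt(opacity) * image + np.sqrt(1.0 - opacity) * noise

rng = np.random.default_rng(42)
image = rng.standard_normal((64, 64))             # stand-in for an encoded photo
mostly_image = noise_input_image(image, 0.8, rng)  # ~80% 'opacity' sweet spot
mostly_noise = noise_input_image(image, 0.1, rng)  # image almost drowned out
```

At 80% opacity the blended input is still strongly correlated with the original photo, which is why the grass-island's features survive 1:1 into the output.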

There are a bunch of things you can tweak with Stable Diffusion, but my general approach was quite basic:

The 'seed' is the random number seed used by Stable Diffusion, and it crudely dictates which random output image you get (I think it actually dictates the direction of each step on the 'path' taken through latent space, and hence where you end up). In this case, a lot of seeds put little tropical huts on the island, which I didn't want, so I picked one that didn't!
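The determinism is easy to see outside Stable Diffusion too: a seeded generator replays exactly the same sequence of 'random' draws, so the whole trajectory - and the end point - is reproduced. A toy illustration (plain numpy, not the actual sampler):

```python
import numpy as np

def random_walk(seed, steps=50, dim=4):
    """Toy 'path' through a latent space: the seed fixes every
    step's direction, and hence where you end up."""
    rng = np.random.default_rng(seed)
    position = np.zeros(dim)
    for _ in range(steps):
        position += rng.standard_normal(dim)  # each step is seeded
    return position

a = random_walk(seed=1234)
b = random_walk(seed=1234)  # same seed -> identical end point
c = random_walk(seed=9999)  # different seed -> different 'image'
```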


Encouraged, I had a go with the crocodiles, with rather more mixed results. After a lot of putzing with variations of the prompt and different levels of 'opacity', I settled on the image on the right.

Again an 'opacity' of around 80% gave the best results, and the final prompt ended up being A group of crocodiles in a shallow pond surrounded by yellow grass. Stable Diffusion definitely fought harder on this one! I eventually found a seed that grudgingly made just the logs into crocs, and stuck with that, but Stable Diffusion was much happier to put weird glitchy crocs elsewhere (see left).

I'm no Prompt Engineer, but putting 'group' seemed to help make it clear I was expecting several crocs, not one slightly malformed big one, and putting 'in a shallow pond' and 'surrounded by' seemed to keep the crocs in the water! Nevertheless, if you look closely at the final image above, you can see the path has partially turned into a crocodile's tail, and the grass in the foreground looks like it's made from crocodile tummy scales!

Lessons Learnt

While not really 'art', it was a lot of fun! Some lessons learnt for anyone who is interested to have a go:

Seeds before Prompts

Start broad with lots of seeds, find one that 'behaves' the way you want and then freeze it. Obviously you could do this several times if you have more patience and more money than I have ;) It seems like Stable Diffusion is a lot more sensitive to the seed than the exact prompt wording, so freezing the seed first makes sense.
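That workflow - sweep seeds with everything else held fixed, then freeze the winner while you refine the prompt - can be sketched as a plain loop. Note that `generate` and `looks_right` below are hypothetical stand-ins for an actual Stable Diffusion render and the human eyeball test:

```python
import numpy as np

def generate(seed):
    """Hypothetical stand-in for one Stable Diffusion render."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((8, 8))

def looks_right(image):
    """Stand-in for the human check ('no tropical huts, please')."""
    return image.mean() > 0

# Sweep quick renders across many seeds first...
chosen = next(s for s in range(100) if looks_right(generate(s)))
# ...then freeze that seed while you iterate on the prompt.
final = generate(chosen)
```

Because the seed is frozen, every subsequent prompt tweak changes only what you asked it to, rather than re-rolling the whole composition.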

Input Image 'Opacity' is straightforward

I expected the tricky piece to be picking an opacity level: too high and Stable Diffusion is rail-roaded into just using the image supplied; too low and it ignores the image and does its own thing. In practice, between 80% and 90% was always pretty good. Anything above 90% more or less gave me my image back unchanged, and anything below 80% almost always gave me weird artefacts everywhere! Interestingly, as the opacity drops there seems to be a sharp 'tipping point' at which weird stuff can creep in - something I've seen before - so keeping the opacity just above the tipping point was the easy way forward.
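One caveat if you try this in code: many libraries (Hugging Face diffusers, for instance) expose the inverse knob, usually called strength, where 1.0 means 'ignore the image entirely' - so the ~80-90% opacity sweet spot here corresponds to a strength of roughly 0.1-0.2. Strength also determines how many denoising steps actually run, which is one way to think about the tipping point. A hedged sketch of that bookkeeping (the formula mirrors what diffusers does for img2img, but check the docs for your version):

```python
def img2img_schedule(num_inference_steps, opacity):
    """Map this blog's 'opacity' onto a strength value and the
    number of denoising steps that would actually run."""
    strength = 1.0 - opacity  # assumed inverse relationship
    steps_run = min(int(num_inference_steps * strength), num_inference_steps)
    return strength, steps_run

# At 80% opacity roughly a fifth of the 50-step schedule runs,
# so the input image dominates the result.
strength, steps_run = img2img_schedule(50, 0.8)
```

Fewer denoising steps means less opportunity for the model to wander away from your photo - and, past the tipping point, more opportunity for glitches.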

Prompt Engineering

Actually way trickier was finding the right prompt. Although the seed had a bigger effect on the output, there's not much 'science' needed there; you just pick one you like. The prompt definitely did improve the image quality, and there's a lot more 'art' to that (and a bit of 'science'!). Looking at the output can give you clues as to where to go, e.g. I started by saying "tropical island" but that gave me lots of beach huts, and losing "tropical" seemed to help! Remember that the image is derived from the training set, so think how people might be labelling the images, and what sort of images might be present.

Stable Diffusion will 'glitch' under pressure

The model clearly has some sense of the complete image (or things like perspective in the island image wouldn't happen), but it's surprisingly easy to break if you put in enough constraints: instead, 'bits' of the thing described by the prompt will appear in the places that most resemble the expected pattern, e.g. the grass becomes crocodile skin, reflections on the water become the front legs of animals.

AI sees the world differently to me

Finally, and perhaps most interestingly, just because the logs looked like crocs to me didn't mean Stable Diffusion agreed! Unencumbered by social conditioning about logs being crocodile-like, or the need for strong focal points, Stable Diffusion more often saw the logs as patches of river bank, or rocks. It's well-documented that ML/AI sees the world very differently, but it's definitely an interesting question for artistic control and intent.