How hard does Art need to be?

In September this year, a new form of AI art popped up on Reddit and quickly flooded all social media platforms. This opened up an intense debate on whether AI-generated images qualify as art. Is it sufficient that some people can’t resist staring at them, or that they analyze every minute detail? Likely not: the same happens with landscapes all the time, and no one calls the landscape itself art. Art is never about a subject, be it a person or a landscape; art is about capturing beauty and throwing it back in someone's face.

So a true creative process seems necessary, but where do you draw the line? If the recipe were as simple as a list of keywords and a click, that would surely be disqualifying. The only way to find out is to try it ourselves, and that’s precisely what we did. The "trying" part, that is, not the "becoming-an-artist-in-a-click" part.

At the very beginning of this journey, the first thing you learn is that the tool that produces the geometric creations does not come off the shelf. Before even trying to use it, you need to assemble it. And just like an IKEA kitchen, it is never as easy as it looks the first time you do it.

Artist: Ugleh

There are two major building blocks, Stable Diffusion and ControlNet, and for each of them there are a number of spare parts to choose from.

Stable Diffusion

Stable Diffusion is a popular suite of models that generate images using a technique called diffusion. In extremely simplified terms, diffusion consists of starting from a noisy canvas and refining the details at each step. The technique also underpins OpenAI’s DALL-E models. But Stable Diffusion is a very different beast: it was developed jointly by the Ludwig Maximilian University of Munich and the startup Runway in 2022 and open-sourced immediately, whereas the OpenAI models are only accessible through an API. Open-sourcing means that many models are directly derived from Stable Diffusion, to the point where there can be confusion as to who should be credited and for what. Some of the most prominent generative AI services today are based on Stable Diffusion, including Midjourney, the poster child of AI art.
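For readers who prefer to see the idea rather than read it, here is a deliberately naive sketch of that denoising loop. The denoiser function and the update rule are placeholders; real samplers use far more careful math than this.

```python
import torch

def generate(denoiser, steps=50, shape=(3, 512, 512)):
    """Caricature of reverse diffusion: start from noise, remove a bit of it at each step."""
    x = torch.randn(shape)                # the "noisy canvas"
    for t in reversed(range(steps)):      # walk from very noisy to (almost) clean
        predicted_noise = denoiser(x, t)  # a trained network guesses the remaining noise
        x = x - predicted_noise / steps   # crude update; real schedulers are far smarter
    return x
```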

Midjourney is a service that we use extensively for PITTI - the cover pictures of most of our own articles and blog posts on this website were produced with Midjourney - but for this project, we had to go back to the basics. Admittedly not to the "basics-basics", since we used a user interface, but as basic as downloading raw Stable Diffusion models from HuggingFace and trying some prompts. All this can be done locally with a sufficiently powerful device (for us, a MacBook Pro with an M2 Pro chip and 16GB of memory), and we used the Automatic1111 user interface. Without a UI, you would need to learn and remember all the possible parameters, and manually write the values for those you want to use.
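To give a flavour of what working without a UI looks like, here is a minimal sketch using the diffusers library (not what we actually used; we went through Automatic1111). Every argument below corresponds to a field you would otherwise set in the interface, and the model id, device and prompt are assumptions.

```python
import torch
from diffusers import StableDiffusionPipeline

# Assumed checkpoint; any Stable Diffusion v1.5 model on HuggingFace works the same way.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = pipe.to("mps")  # "mps" on Apple Silicon, "cuda" on an NVIDIA GPU

image = pipe(
    "a misty forest at dawn, volumetric light, 35mm photograph",
    num_inference_steps=30,  # number of denoising steps
    guidance_scale=7.5,      # how strongly the model follows the prompt
).images[0]
image.save("forest.png")
```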

Also, prompting does not only involve describing what you want to represent (the positive prompt); you also have to specify what you don’t want to see (the negative prompt, including multipliers for what you really don’t want to see). This is necessary to prevent the model from going absolutely wild. And despite these precautions, results sometimes seem completely random.
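In diffusers terms, reusing the pipeline from the sketch above, the two prompts are simply two arguments. Note that the (term:1.4) emphasis syntax for multipliers belongs to the Automatic1111 UI, not to this library, so it only appears in a comment.

```python
image = pipe(
    prompt="portrait of a violinist on stage, dramatic rim lighting",
    # Everything the model should avoid. In the Automatic1111 UI you can also weight
    # individual terms, e.g. (extra fingers:1.4), to push harder against them.
    negative_prompt="blurry, deformed hands, extra fingers, watermark, text, low quality",
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
```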

Overall, the user experience is terrible, and our first takeaway was completely unrelated to the initial objective of the experiment: an enormous amount of respect is due to Midjourney, both for their finetuning and for the user experience. Despite some frustration due to moderation, their models are relatively reliable and much easier to use. Midjourney’s outputs are of incredible quality compared to similar prompts in raw Stable Diffusion, and generation is fast. Finally, one advantage of Discord, which Midjourney uses as its user interface, is the constant flow of prompts from random users along with the resulting images. This helps a lot to learn the tricks needed to obtain the result you want, or simply to find inspiration. Acquiring the same knowledge for Stable Diffusion requires spending hours on Reddit. And even then, you only ever see the successful prompts; you will never know whether it took 25 attempts to get something acceptable.

We’ll spare you all our fails; just take our word for it when we say that the learning curve is steep. As far as we can tell, you never truly master this process, you just get close to what you want more frequently. Once you get lucky often enough, you can consider that you have a working engine for your shape-blending AI tool. Then you can focus on the steering wheel: ControlNet.

ControlNet

ControlNet is another type of model, complementary to Stable Diffusion. It drives the diffusion model, forcing it into certain shapes. The shapes can be body poses, edges of an object or even depth maps. We are not going to pretend that we can explain why it works; we merely focussed on how to make it work, which is reasonably well explained here if you want to learn.
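For the curious, this is roughly what the plumbing looks like in diffusers, assuming the publicly available Canny-edge ControlNet checkpoint and an edge map you provide yourself (again, we went through the Automatic1111 extension rather than this code):

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline, UniPCMultistepScheduler
from diffusers.utils import load_image

# ControlNet checkpoint that conditions generation on Canny edges.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("mps")  # or "cuda"

edges = load_image("my_edge_map.png")  # a black-and-white edge image you provide
image = pipe(
    "an autumn forest, soft diffused light",
    image=edges,  # ControlNet forces the composition to follow these edges
    num_inference_steps=30,
).images[0]
image.save("controlled.png")
```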

As with Stable Diffusion, there are many ControlNet models. This ecosystem is super messy and, to be frank, quite dodgy at times: if you keep an eye on the terminal whilst running the model, you may realize that turning on a parameter triggers the loading of a model from a third-party GitHub repository during inference… We felt that it was important to share this warning as part of this article, even though it did not deter us from trying to get to the bottom of the mesmerizing spirals and tiles.

One particular ControlNet model is used to achieve this effect: QR Code Monster from Monster Labs. We first came across this model in May (we mentioned it here) when Redditors started to share "creative" QR codes. Since then, users have realized that it can do much more than QR codes.
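Mechanically, using it boils down to swapping the ControlNet checkpoint in the sketch above for the Monster Labs model (the HuggingFace repository id below is the one published at the time of writing) and feeding it any high-contrast shape instead of a QR code:

```python
controlnet = ControlNetModel.from_pretrained(
    "monster-labs/control_v1p_sd15_qrcode_monster",  # QR Code Monster checkpoint
    torch_dtype=torch.float16,
)
# The conditioning image no longer needs to be a QR code: any black-and-white
# logo, spiral or tiled pattern will be blended into the generated scene.
```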

QR code by Monster Labs on HuggingFace

Beyond the aesthetic dimension of geometric shapes, the blending effect seemed extremely promising for branding, marketing and communication. This intuition was quickly confirmed as we looked on Reddit for tips to better control Stable Diffusion and ControlNet.

Check out this selection of artworks by dobbieandrew on Reddit.


At this point, we had a decent understanding of how to maneuver the tool that could turn ideas into masterpieces. And we had a vision which was validated by people with actual talent. So close yet so far…

What we did

For our first experiments with Stable Diffusion, we used SDXL, the latest model released by Stability AI. However, ControlNet is not compatible with SDXL (at the time of writing), so we used ControlNet in combination with Stable Diffusion v1.5 and derivatives of that model. In each case, we used SD models in the safetensors format.
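As a side note, if you prefer diffusers to a UI, recent versions can load a single .safetensors checkpoint downloaded from HuggingFace directly; this is a sketch and the file path is obviously made up.

```python
from diffusers import StableDiffusionPipeline

# Loads a single .safetensors checkpoint file. safetensors is preferable to the older
# pickle-based .ckpt format because loading it cannot execute arbitrary code.
pipe = StableDiffusionPipeline.from_single_file("models/some_sd15_derivative.safetensors")
```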

What we learnt the hard way:

  • You don't need to learn photography, but if you want to get anywhere near a credible result, you need to learn the important technical terms so that you can describe a shot type, an angle, a depth of field and so on in your prompt.
  • Sometimes, the Stable Diffusion model needs help to generate images in line with the ControlNet guidance. When we used prompts to generate images on the theme of sports, the model easily identified that limbs could be used to represent the edges of our shapes. Likewise, in a forest, the model would appropriately use trees. But in less obvious cases, keywords like "diffused light", "shadows", "rain", "steam" or "reflections in windows" seemed to help.
  • The upscaler did not really help. We are not sure what we were doing wrong, but we struggled a lot until we disabled it: generation took materially longer and the output was worse.
  • Use batch size = 1 and batch count = 1 whilst you are experimenting with a prompt. Once you get to a satisfactory result and believe that it is just a matter of tweaking parameters, switch to batch count = 4. It will generate four images, and you may get lucky with one of them.
  • Your ControlNet image will have either a white background or a black background. You can “invert" this by selecting the appropriate ControlNet model preprocessor, either “none" or “invert (from white bg & black line)". Try to imagine the result you want to obtain to make your decision. For example, if you imagine your shape as a shadow (relatively darker than the rest of the image), you should select the preprocessor that makes the background white and the shape black.
  • The single most important driver is the control weight, between 0 and 2. If you choose 2, you’ll essentially get your shape back in a different color. Finding the right weight requires a lot of trial and error and, in our experience, the appropriate level is always between 0.6 and 1.4. If the shape you want is dark, you generally need a lower weight than if your shape is light.
  • We typically stopped the guidance before the end (between 0.5 and 0.7). We are not entirely sure why. A rough code equivalent of these settings is sketched right after this list.
  • If you want to modify the image using inpainting, do not do it directly from the output image. Go to the img2img tab and upload the output from there.
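To make the last few points concrete, here is a rough diffusers translation of the settings we converged on in the Automatic1111 UI. It assumes pipe is the ControlNet pipeline sketched earlier and shape_image is your black-and-white shape; it is an illustration, not the exact code we ran.

```python
import torch

prompt = "a football match at night, stadium floodlights, rain, reflections"
negative_prompt = "low quality, blurry, watermark, text"

for seed in range(4):  # "batch count = 4": four sequential attempts with different seeds
    generator = torch.Generator("cpu").manual_seed(seed)
    image = pipe(
        prompt=prompt,
        negative_prompt=negative_prompt,
        image=shape_image,                  # the black-and-white ControlNet shape
        controlnet_conditioning_scale=0.9,  # the "control weight"; 0.6-1.4 worked for us
        control_guidance_end=0.6,           # stop the ControlNet guidance at 60% of the steps
        num_inference_steps=30,
        generator=generator,
    ).images[0]
    image.save(f"candidate_{seed}.png")
```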

In the end, it was not as much fun as we expected. The process is extremely frustrating, as less than 10% of the generated images are even remotely acceptable. The output is often so far from the original idea that it is not a question of tweaking; you have to start again from scratch. And if you are working locally on a MacBook, each generation can take minutes. At no point during our experiments did we feel in control; we were mostly hoping for the best. We may not have used the right Stable Diffusion model in the first place, and we could probably improve our "hit rate" with a LoRA, but the overall conclusion is that there is nothing trivial about the creative process underpinning these creations. You are more likely to give up after 100 attempts than to achieve anything exceptional by chance.

It is clear that the amount of patience and passion required to deliver one viral AI creation is incredibly high. That may not be enough to convince everyone that it qualifies as art, but our experiments changed the way we look at the creations shared on social media: we used to look for the imperfections that betray an AI-made image; we now stand in awe of the perfect details.

Image gallery: Sports, Nebula, Jungle, Finance, Chip, Sports (control weight: 2)