Introduction to ControlNet.


This article explains what ControlNet is and how you can use it. It includes an example with simple instructions that you can run yourself right now, with no prerequisites; it should take about 10 minutes. Just read it (trust me bro). If you want to make pictures first, you can skip the general info below and look for the pictures.

Not many people on tensor.art use ControlNet. Pictures are generated in a gacha game style - one writes a prompt and hopes for the best. If the result is not satisfactory, it's either another try with the same parameters or a prompt tweak. There are also checkpoints, LoRAs and other generation parameters - a lot of knobs that affect the result in peculiar and largely unpredictable ways. There is some control over the process indeed, but it relies heavily on luck. There's a gacha-esque fun in this process. AI generation is like a box of chocolates.

Sometimes a picture turns out almost perfect. It can be awesome in every way except for having six fingers on someone's hand. There is no way to fix it with a prompt. Ars Technica reported that the latest OpenAI image generator allows prompt iteration on a picture, so such problems may eventually get resolved. For example, you may be able to generate something and then ask for corrections by saying something like "good, but let it be sunset and I want the second girl from the left to be blond, go". Eventually prompts may become the only tool an AI artist really needs to build a scene. For now, prompts are rather limited.

ControlNet doesn't fix it; it's another knob to use. But it allows you to control many aspects of image generation directly, spatially, like "the sword is right here, not somewhere in the picture". You can actually imagine a final picture in your head and work toward it. If you can make a rough sketch, you are halfway there. You can iterate, keep the parts you like and correct those you don't. It is still a gacha game, but your chances of getting an SSR are much higher.

It allows you to shoot for much more complex scenes. There is absolutely nothing wrong with generating hundreds of portraits of pretty women if it makes you happy. And I mean it; fun is precious, it is never a waste of time. But if you get bored with it, there are options.

ControlNet uses an image as an additional or, in a few cases, the only prompt - a sketch, a pose diagram, an edge/depth/normal map. "A picture is worth a thousand words". A simple doodle can convey the desired composition more efficiently than any prompt. Also, models don't follow prompts all that well, and even perfectly crafted prompts fail most of the time.

ControlNet works by attaching small, specialized neural networks called "adapters" to a pre-trained diffusion model. These adapters are trained to interpret specific types of visual input and influence the generation process accordingly, without retraining the whole model. This allows the base model to remain flexible and powerful, while giving users a way to “steer” the output using visual cues rather than just words.
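If you prefer code to UI, here is roughly the same idea as a minimal sketch using the open source diffusers library. The model names, file paths and prompt are illustrative assumptions on my part, not what tensor.art uses internally:

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Turn an ordinary picture into a Canny edge map - the "visual prompt".
source = np.array(Image.open("reference.png").convert("L"))   # placeholder path
edges = cv2.Canny(source, 100, 200)
control_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

# The adapter is a separate, much smaller model attached to the base checkpoint.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# The text prompt still matters; the edge map pins down the composition.
result = pipe("a woman walking in a flower meadow", image=control_image).images[0]
result.save("output.png")
```

The point is that the base model is untouched; the adapter only injects the spatial hints during generation.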

ControlNet is an open source project based on open research publications. The main contributors to both seem to be from China. Kudos to China. It was initially developed for Stable Diffusion 1.5, then adapted for SDXL, and it works with derived models and checkpoints. There is no ControlNet for SD 3.0 or FLUX.1 as far as I know.

Personally, I use mostly Pony derivatives and sometimes Illustrious checkpoints.

Using ControlNet requires persistence; iterations are the whole idea. Basic skills with a graphics editor are necessary to make changes to the control files ControlNet uses. Experience in image editing will help you a lot, but you don't have to be a classically trained artist. I have zero art education beyond secondary school lessons, and I was okay-ish at best. It helps if you find joy in image editing. The ability to use layers is a great bonus.

Personally I use GIMP, but there are lots of good editors, including free options. Krita seems to be very good. Paint.NET is simple yet capable.

Below I will use the Canny and Depth adapters because these are the two I find most useful and use most frequently. There will be a separate in-depth article about them later. I will also give a brief overview of the other adapters available on tensor.art in another article; there is a rather harsh article size limit here.

Remixing a picture using ControlNet.

Let's try using ControlNet. Here is what we will be working with:

Click this link and press "Remix". It will set the generation parameters. Run it and be amused by the utter failure. Or just skip it; here is what I got:

Not too bad. I like the perspective distortion. A couple of anatomical problems, very fixable. There is no bear though. A failure.

We got all the parameters right. The missing ingredients are the ControlNet control files. Let's add them.

Download the picture we are trying to remix and remember where you saved it. Click the "Add ControlNet" button in the "Model" section, choose "Canny" (the 3rd option), click the square area in the lower left corner of the new dialog window, and pick the picture you just downloaded. Here's how it should look:

Repeat the same actions one more time, but choose the "Depth" adapter this time (the 4th option).

Set the weight of both to 0.5. If you did everything right, it should look like this:
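For reference, this setup can be sketched in code as well: two adapters attached at once, each with a conditioning weight of 0.5. This is a hedged diffusers example; the model names, file paths and prompt are placeholders I made up, and tensor.art's backend may well differ.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Two adapters: one reads the edge map, the other reads the depth map.
canny_net = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
depth_net = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=[canny_net, depth_net],   # both adapters attached at once
    torch_dtype=torch.float16,
).to("cuda")

canny_image = load_image("canny_control.png")   # the edge map (placeholder path)
depth_image = load_image("depth_control.png")   # the depth map (placeholder path)

result = pipe(
    prompt="a woman and a bear in a flower meadow",   # illustrative prompt
    image=[canny_image, depth_image],
    controlnet_conditioning_scale=[0.5, 0.5],         # the two 0.5 weights
).images[0]
result.save("remix.png")
```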

Run it. Here is what I got this time:

The clothing colors are different, which is expected since the prompt doesn't specify them. It is a very good picture, on par with the original one.

We successfully remixed the picture without even touching the control files themselves. Let's look at them though.

Click the garbage bin icon to remove the Canny adapter, then add it again. Here is what it looks like before you confirm your choice:

Click the picture on the right, the black and white one. You will be presented with the control file created by the Canny preprocessor. You can save this picture:

Now you can edit it and use the edited version instead of the one created by the preprocessor. To do so, press the "Control Image" button in the dialog above; it will prompt you to upload your own control file.
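If you'd rather prepare the control file offline, here is a rough sketch of the same preprocessing with OpenCV, including the kind of region-blanking edit described just below. The thresholds, file paths and coordinates are guesses of mine; tensor.art's preprocessor settings aren't published.

```python
import cv2
import numpy as np
from PIL import Image

# Recreate a Canny edge map locally, roughly like the Canny preprocessor does.
source = np.array(Image.open("original.png").convert("L"))   # placeholder path
edges = cv2.Canny(source, 100, 200)

# Example edit: black out a rectangle (e.g. the area with the bear) so the
# generator is free to redraw that region. Coordinates are made up.
edges[150:400, 500:800] = 0

Image.fromarray(edges).save("canny_control_edited.png")
```

You would then upload canny_control_edited.png via the "Control Image" button instead of the auto-generated map.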

Let's say we don't like the bear. No wonder: I got it from a quick Bing image search, and it was a cartoonish sketch. The bear sucks. Let's paint that area black:

And here is what I got using the new version of the control file:

That's a much better-looking bear, more natural and fitting. Every time I run the generation with these parameters, I will get a new bear. The bear is drawn there because the prompt asks for it and the control file doesn't leave any other option for its location. Also, the Depth adapter still tells the generator that there is a large body in that spot:

Once I am happy with the bear, I can fix it in the control file and change other aspects of the generation. I can remove the flowers, add a cat, make the woman run toward the bear, make her wear jeans or nothing, make her a demoness, or make the bear run away from her. The sky is the limit now that you can work on specific aspects of the picture with intent.

As an unexpected bonus, the girl's skirt is see-through now and she seems to be going commando. That wasn't intended and can be inappropriate. Let's fix it. I add to the prompt: "elaborate blue dress, orange jacket". Here is what I got:

Nice jacket. The claw is bad and the fingers are wonky. Well, you know what to do. Pull the lever, let it roll. :)

Neither the Canny nor the Depth adapter has anything to do with color, just geometry, so your hands are free there. Also, you can now switch freely between checkpoints that support ControlNet; the scene will generally persist. There are multiple examples of that in my pictures.

That's it. ControlNet is that simple. People really should use it more.

A few clarifications. It might be obvious, but better safe than vague. When we supply the original image, the Canny preprocessor analyzes it and automatically creates a control file, an edge map (the black and white line drawing), which we can download and reuse or abuse. The weight controls how strongly the ControlNet influences the generation, same as for LoRAs: higher values stick closer to the control image; lower values give the AI more freedom. At high values (0.7 and above) undesirable effects are very likely.
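A quick way to build intuition for the weight is to render the same seed at a few conditioning scales and compare. A small sketch reusing the pipeline and control images from the earlier diffusers example; the values and seed are arbitrary:

```python
import torch

# Same prompt and seed at several weights; only the conditioning scale changes.
for scale in (0.3, 0.5, 0.7, 0.9):
    generator = torch.Generator("cuda").manual_seed(42)   # fixed seed for a fair comparison
    img = pipe(
        prompt="a woman and a bear in a flower meadow",
        image=[canny_image, depth_image],
        controlnet_conditioning_scale=[scale, scale],
        generator=generator,
    ).images[0]
    img.save(f"weight_{scale}.png")
```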

The method we used above would work for any picture on tensor.art, albeit with varying degrees of success. All you need is the prompt and the picture itself; you don't necessarily need the same tools and LoRAs as the original author. It works for arbitrary images too, like an anime screenshot - you just need to write a prompt adding the details the Canny and Depth adapters miss, like colors, lighting, and so on. That's what I do for almost every single picture I have published.

That's it for now. This was an introduction; I plan to publish a few more articles on this topic.

Questions and comments are welcome.

Related articles: ControlNet: Canny adapter, ControlNet: Depth adapter, ControlNet: Openpose adapter.
