Character Consistency in Stable Diffusion


UPDATED: 07/01
– Changed templates so it’s easier to scale to 512 or 768
– Changed ImageSplitter script to make it more user friendly and added a GitHub link to it
– Added section on facial expressions
– Added information on scaling in img2img and settings I’ve had success with
– Removed the hires fix approach; it was more of a quick-and-dirty example and led to some confusion.
– Made a series of clarifications

One of the big questions that comes up often regarding Stable Diffusion is how to create character consistency when we want to create more than a single image. Facial characteristics are the most important, followed by body shape, clothing style, setting, etc. In the AI image/video world this is the sought-after holy grail as of now (mid-’23).

One route to achieve this goal is through the use of a LoRA (Low-Rank Adaptation), a training method that inserts weights into an existing AI model’s layers to bias it toward a defined outcome. But most LoRAs are trained on real-life people (famous actors’ images, personal photographs) and styles, not on AI-generated personas. That raises the question: what if I want to create a character based on the output of the model itself, so that I have a 100% unique fabricated persona to develop a character around, using my own character description? How do I achieve that?

I see this breaking down into 3 stages:



The concept here is to iterate LoRAs a few times to arrive at the model we want. The first iteration will focus on refining facial features, and the second on body features, which then leads to a final ‘person’ LoRA of the humanoid we’ve designed.

In this post we’ll focus on facial features, the most ‘recognizable’ element of a person. To achieve this we need to create a character sheet: a set of 15 images that are close enough in likeness to each other that they represent a single character. A character sheet is the primary foundation for consistent character output. Getting the facial features right is step one; refining body features comes after.

Achieving this took some experimentation, and fundamentally it requires an iterative approach using a LoRA model. How to build and refine the LoRA, and include the rest of the character, will be covered in part II of the series, where I’ll use kohya_ss to train the LoRA for the outcome we need.

Now on to the character sheet development, which should be fairly straightforward.

SETUP:

In this initial phase we’ll focus on facial features. First, there are two assets I’ve created that Stable Diffusion needs in order to output a good-quality character sheet: (1) a ControlNet OpenPose template [providing 15 views of a face] and (2) a ControlNet Lineart template [to guide SD to keep renderings in a specific box/space]. To download these, just right-click and ‘save as…’

Note: Templates updated 7/1 and are now 1328×800


Once you’ve downloaded these, we want to bring them into Stable Diffusion txt2img using the following ControlNet settings.


The lineart provides better guidance to the AI on how we want the sheet segmented. I found that using yellow rather than 100% black gave better segmentation than pure black/white. For this line mask we’ll use the settings below:


The first image configures OpenPose with the facial directions we need for the character sheet. The second image, using Lineart, provides ‘guideline’ borders for Stable Diffusion to draw within. Remember we’re dealing with AI, so outputs are going to vary, but with the above we’re telling it what we want and giving it the best guardrails to get as close as possible to the outcome we desire.

The sizes of the OpenPose and mask images are set up specifically for the greatest rendering accuracy from Stable Diffusion. A not-so-well-known fact is that Stable Diffusion output dimensions HAVE to be divisible by 8. These sheets are laid out with 8-pixel separators and 256×256 images, which was the best compromise I could find between quality and size given VRAM demands.
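If you want to sanity-check the layout, or build your own template scaled to 512 or 768, the arithmetic is simple and the totals must stay divisible by 8. Here is a minimal sketch of that math; the tile size and separator width match the 1328×800 templates above, while the exact border arrangement (a separator on every edge) is my inference from those dimensions:

```python
# Sketch: verify that a character-sheet template's dimensions are divisible by 8.
# Values below match the 1328x800 templates (5x3 grid of 256px tiles, 8px separators).
TILE = 256            # size of each face rendering
SEP = 8               # separator width between tiles and around the outer edge
COLS, ROWS = 5, 3     # 15 views total

width = COLS * TILE + (COLS + 1) * SEP    # 5*256 + 6*8 = 1328
height = ROWS * TILE + (ROWS + 1) * SEP   # 3*256 + 4*8 = 800

assert width % 8 == 0 and height % 8 == 0, "SD output dimensions must be divisible by 8"
print(width, height)  # -> 1328 800
```

Swapping TILE for 512 or 768 gives the scaled-up template sizes, and the assert will tell you immediately if a custom layout breaks the divisible-by-8 rule.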

TXT2IMG SETTINGS:

Prompt
(a character sheet of a woman from different angles with a grey background:1.4) , auburn hair, eyes open, cinematic lighting, Hyperrealism, depth of field, photography, ultra highres, photorealistic, 8k, hyperrealism, studio lighting, photography

Negative Prompt
easynegative, canvasframe, canvas frame, eyes shut, wink, blurry, hands, closed eyes, (easynegative), ((((ugly)))), (((duplicate))), ((morbid)), ((mutilated)), out of frame, extra fingers, mutated hands, ((poorly drawn hands)), ((poorly drawn face)), ((bad art)), blurry, (((mutation))), (((deformed))), blurry, ((bad anatomy)), (((bad proportions))), ((extra limbs)), cloned face, (((disfigured))), gross proportions, (malformed limbs), ((missing arms)), ((missing legs)), ((floating limbs)), ((disconnected limbs)), ((malformed hands)), ((missing fingers)), worst quality, ((disappearing arms)), ((disappearing legs)), (((extra arms))), (((extra legs))), (fused fingers), (too many fingers), (((long neck))), canvas frame, ((worst quality)), ((low quality)), lowres, sig, signature, watermark, username, bad, immature, cartoon, anime, 3d, painting, b&w

Panel settings:


Scaling with img2img

When it’s done, click ‘Send to img2img’. For upscaling in img2img you can use any upscaler you’re comfortable with; I use Ultimate SD Upscale, which can be installed via the Extensions tab (ultimate-upscale-for-automatic1111). Here are my settings. For the denoising strength, I recommend something in the range of 0.4–0.6; you’ll need to experiment, but it helps remove deformations, especially on the side angles. A higher denoising strength also means your character will change somewhat.
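If you’d rather script the denoising experiments than click through the UI, the web UI exposes an img2img endpoint when launched with the --api flag. The sketch below is only an assumption about how you might sweep the 0.4–0.6 range; it does a plain img2img pass and leaves out the Ultimate SD Upscale and ControlNet settings shown in the panels below, since those are configured through extension-specific arguments:

```python
# Sketch: sweep img2img denoising strength via the AUTOMATIC1111 web UI API.
# Assumes the web UI is running locally with --api; file names and prompt are illustrative.
import base64
import requests  # pip install requests

URL = "http://127.0.0.1:7860"

with open("character_sheet.png", "rb") as f:
    sheet_b64 = base64.b64encode(f.read()).decode()

prompt = ("(a character sheet of a woman from different angles with a grey "
          "background:1.4), auburn hair, eyes open, cinematic lighting")

for denoise in (0.4, 0.5, 0.6):  # the range recommended above
    payload = {
        "init_images": [sheet_b64],
        "prompt": prompt,
        "negative_prompt": "easynegative, blurry, deformed",
        "denoising_strength": denoise,
        "steps": 30,
        "cfg_scale": 7,
        # same size as the sheet; the actual upscale factor comes from the
        # Ultimate SD Upscale script settings, which are omitted from this sketch
        "width": 1328,
        "height": 800,
    }
    r = requests.post(f"{URL}/sdapi/v1/img2img", json=payload, timeout=600)
    r.raise_for_status()
    with open(f"sheet_denoise_{denoise}.png", "wb") as out:
        out.write(base64.b64decode(r.json()["images"][0]))
```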

Here are the primary panel settings I use for img2img:

For ControlNet it’s pretty simple and straightforward:


Lastly, Ultimate SD Options

With this setup, we can click ‘Generate’ and wait… 🙂

The result:


A few other output examples using slightly different prompts and models:

Model: disneyPixarCartoon_v10
(a character sheet of a woman from different angles with a grey background:1.4), auburn hair, eyes open, cinematic lighting, Hyperrealism, depth of field, photography, ultra highres, photorealistic, 8k, hyperrealism, studio lighting, photography

Model: henmixReal_v40
(a character sheet of a beautiful woman from different angles with a grey background:1.4) , blonde hair, eyes open, cinematic lighting, Hyperrealism, depth of field, photography, ultra highres, photorealistic, 8k, hyperrealism, studio lighting, photography

Model: darkSushiMixMix_225D
(a character sheet of a beautiful woman from different angles with a grey background:1.4) , black hair, eyes open, cinematic lighting, Hyperrealism, depth of field, photography, ultra highres, 8k, studio lighting

Model: reliberate_v10
(a character sheet of a woman from different angles with a grey background:1.4) , black hair, eyes open, cinematic lighting, Hyperrealism, depth of field, photography, ultra highres, photorealistic, 8k, hyperrealism, studio lighting, photography


Expressions

Since we want to build on this work and make sure our character has flexibility of expression, I’d recommend creating a few alternate panels that define the expression in the prompt, for example smiling, sad, or angry (there are actually quite a few options). For instance, using the prompt:

(a character sheet of a woman smiling from different angles with a grey background:1.4) , auburn hair, eyes open, cinematic lighting, Hyperrealism, depth of field, photography, ultra highres, photorealistic, 8k, hyperrealism, studio lighting, photography

Leave all the other settings the same.
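If you plan to generate several of these expression sheets, a tiny helper that swaps the expression word into the base prompt keeps the rest of the wording identical. This is just an illustrative sketch; only “smiling” comes from the prompt above, and the other expressions are the examples mentioned earlier:

```python
# Sketch: build expression variants of the character-sheet prompt so only the
# expression word changes between panels.
BASE = ("(a character sheet of a woman {expression} from different angles with a "
        "grey background:1.4), auburn hair, eyes open, cinematic lighting, "
        "Hyperrealism, depth of field, photography, ultra highres, photorealistic, "
        "8k, hyperrealism, studio lighting, photography")

for expression in ("smiling", "sad", "angry"):
    print(BASE.format(expression=expression))
```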


SPLITTING THE IMAGES (UPDATED SCRIPT 7/1):

Once you have the character sheets and the output you want, the next part is splitting up each master image. To achieve this I have a simple Python script (v1.1); you can grab it via this GitHub link.

The script is very simple at this point; it requires Pillow (PIL) for image handling (pip install Pillow) and splits the master image up. Expect some frame edges to show on the images; since SD output is unpredictable it won’t be pixel perfect, but the split crops should be ‘close enough’ for the next phase, creating a LoRA from what we have.
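The actual script lives at the GitHub link above; for reference, a minimal sketch of the same idea looks roughly like this (it mirrors the approach, not the script itself, and the tile/separator sizes match the 1328×800 templates):

```python
# Sketch: split a 1328x800 character sheet into fifteen 256x256 face crops.
# If you upscaled the sheet first, scale TILE and SEP by the same factor.
from pathlib import Path
from PIL import Image  # pip install Pillow

TILE, SEP = 256, 8     # tile size and separator width used by the templates
COLS, ROWS = 5, 3

def split_sheet(sheet_path: str, out_dir: str = "splits") -> None:
    sheet = Image.open(sheet_path)
    Path(out_dir).mkdir(exist_ok=True)
    for row in range(ROWS):
        for col in range(COLS):
            # each tile sits one separator in from the edge, plus full tile+separator steps
            left = SEP + col * (TILE + SEP)
            top = SEP + row * (TILE + SEP)
            tile = sheet.crop((left, top, left + TILE, top + TILE))
            tile.save(Path(out_dir) / f"face_{row * COLS + col:02d}.png")

if __name__ == "__main__":
    split_sheet("character_sheet.png")
```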

Conclusion

That gets us through creating the baseline for a character sheet. From here we’ll select a subset of the split images. Up next: creating the LoRA so we can instantiate a consistent character in our Stable Diffusion outputs. Since I’m getting a lot of questions daily on this process, I’ll most likely make other addenda, and there’s a slight chance this becomes a three-part series so I can cover another area prior to training.


9 Responses

  1. spring says:

    What’s the purpose of generating on a grid instead of processing each pose individually within a batch?

    • Dave Packer says:

      From a Stable Diffusion perspective, creating a character sheet minimizes RNG differences in facial features. Getting the character sheet down, splitting the images, and then upscaling in batch provides more consistent results for the final output. Doing each pose separately leaves more up to RNG and creates more spurious outputs, providing inconsistent data to a LoRA for fine-tuning.

  2. 柳树 says:

    Thanks a lot, Dave Packer.

  3. Arno says:

    Thanks a lot, great write-up.

  4. Leyline says:

    Please keep up the good work. Consistency for generated characters is so hard to maintain.

  5. Nicolaas Grobler says:

    This guide is so well done!

  6. Richard Wall says:

    Not sure if this will help anyone else, but with the image dimensions being 1328×800 I found SD needed 16 GB of VRAM.
    Adjusting the prompt generation size to 1280×768 while leaving everything else the same got the requirement down to 12 GB, without any major loss in output quality on my 3060.

    Thanks for the guide, great method for keeping high quality and consistent character generation.

    • Dave Packer says:

      Thanks for the comment, Richard. I’ve been doing a ton of testing since I wrote this post to try to reduce the footprint without compromising the final output, which is why I’m delaying the LoRA part a bit; I want it to be achievable for most of the community, and I know larger images are VRAM taxing. My plan is to switch to a 3×3 grid, which would be 800×800 (768×768 of image area, with 32 pixels for borders). I can also clip the borders in half. I’ll be testing that today.