What exactly is the 'CFG (classifier-free guidance)' scale that determines how closely the image generation AI Stable Diffusion follows prompt instructions?
Stable Diffusion, which generates images based on input prompts, has attracted attention from people all over the world since its public release, and various tools and application methods have been announced. Among the settings for generating images with Stable Diffusion is a value called the 'CFG (classifier-free guidance) scale', which determines how closely prompts are followed during image generation.
CLASSIFIER-FREE DIFFUSION GUIDANCE.pdf
[Paper commentary] Understanding OpenAI 'GLIDE' | A fun introduction to AI and machine learning
The Road to Realistic Full-Body Deepfakes - Metaphysic.ai
When generating images or sounds with AI, one approach is to prepare a separate classification model (classifier) and use it at sampling time, rather than simply training on labeled data. CFG (classifier-free guidance) is an improvement on this method: instead of preparing a separate classifier, a single diffusion model is trained both with and without conditioning, so the classifier's role is absorbed into the diffusion model itself.
Introducing CFG improves sample quality in a diffusion model at the cost of reduced sample diversity. In the following images released by the Google researchers who proposed CFG, the left side shows images generated without CFG and the right side shows images generated with CFG. The images generated with CFG are very similar to one another in composition and subject, but are noticeably higher in quality.
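At sampling time, the core of CFG is a simple linear combination: the model predicts the noise once with the prompt and once with an empty prompt, and the guidance scale extrapolates from the unconditional prediction toward the conditional one. A minimal sketch in plain Python (toy lists stand in for the model's real image-shaped noise tensors; `guidance_scale` corresponds to the CFG scale setting):

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance at sampling time:
    extrapolate from the unconditional noise prediction
    toward the conditional (prompt-aware) one.
    guidance_scale = 1.0 reproduces the conditional prediction;
    larger values push the sample harder toward the prompt."""
    return [u + guidance_scale * (c - u)
            for u, c in zip(eps_uncond, eps_cond)]

# Toy noise predictions (real ones are image-shaped tensors).
eps_uncond = [0.2, -0.1, 0.0]
eps_cond   = [0.4,  0.1, 0.0]

print(cfg_combine(eps_uncond, eps_cond, 1.0))  # ≈ [0.4, 0.1, 0.0]
print(cfg_combine(eps_uncond, eps_cond, 7.5))  # amplified pull toward the prompt
```

With a scale of 1.0 the unconditional term cancels out entirely; scales above 1.0 overshoot the conditional prediction, which is exactly the quality-for-diversity trade described above.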
The CFG scale is also one of the main parameters in Stable Diffusion. The larger the CFG scale, the more closely the generated image follows the prompt or an image supplied via 'img2img', but the more distorted the result becomes. Conversely, the smaller the CFG scale, the more the output tends to drift from the prompt or input image, but the better the image quality.
Metaphysic, an AI development company, notes: 'The LAION dataset on which the model was trained is reliable, and even short and simple img2img instructions give valid results, so in many cases there is no need to increase the CFG scale.' However, if you want to generate something Stable Diffusion was not pre-trained on, something the AI has difficulty generating coherently, or a combination of multiple people or concepts, you may need to raise the CFG scale or the denoising strength.
One thing Stable Diffusion is not good at in img2img generation is changing the color of clothes. Below is a screen where Metaphysic input a picture of a woman in a red dress via img2img and tried to generate an image from it with the prompt 'A woman in a blue evening dress'. The CFG scale is set to 13.5, slightly higher than the generally recommended range (about 7 to 11), yet despite 'red' being specified as a prohibited word, the woman's evening dress remains red.
Also, below is a video in which Metaphysic used Stable Diffusion to change the color of the clothes worn by a performer from dark blue to red and transform her face into that of actress Salma Hayek.
Hayek CFG comparison - YouTube
The left side was generated with the standard parameters 'CFG: 12, denoising: 0.46', while the right side was generated with parameters near the maximum, 'CFG: 28, denoising: 0.94'. The movements of the female performer used as input are shown in the small frame at the lower left. The texture and pose clearly differ between the two sides: the left is far more accurate as an image, while only the dress color on the right better matches the prompt.
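The denoising strength works differently from the CFG scale: in a typical img2img implementation it decides at which intermediate timestep the sampler starts, so only a fraction of the full denoising schedule is actually run on top of the input image. A rough sketch of that relationship (the names `num_steps` and `strength` are illustrative, not any specific library's API):

```python
def img2img_steps(num_steps, strength):
    """Approximate number of denoising steps actually executed in
    img2img: the input image is noised up to a timestep proportional
    to `strength`, then denoised from there.
    strength = 0.0 keeps the input essentially unchanged;
    strength = 1.0 discards it and denoises from pure noise."""
    return min(round(num_steps * strength), num_steps)

# With a 50-step schedule, the two settings above differ sharply:
print(img2img_steps(50, 0.46))  # 23 steps: much of the input survives
print(img2img_steps(50, 0.94))  # 47 steps: the input is almost fully replaced
```

This is why the high-strength result above diverges so far from the input footage: at 0.94, almost the entire schedule is re-generated from noise, leaving little of the original frame to anchor the output.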
If the CFG scale is too large, quality suffers and the image changes completely from moment to moment. The instruction to 'change the color of the dress to red' was certainly followed faithfully, but other parts were heavily degraded.
As a way to mitigate these problems, Metaphysic suggests using input material that is somewhat closer to what you ultimately want to generate.