A method for generating ``○○ in the style of a specific image'' from just one image in just tens of seconds with the image generation AI ``Stable Diffusion'' has been announced

By compressing a specific image or style into a specific word and passing that word to an image generation AI such as Stable Diffusion, it is possible to 'optimize' the generated image so that it resembles any image you choose. A team led by Rinon Gal, a computer scientist at Tel Aviv University, has announced a method that achieves this image optimization with just one image and 5 to 15 steps of adjustment.

[2302.12228] Designing an Encoder for Fast Personalization of Text-to-Image Models


Encoder-based Domain Tuning for Fast Personalization of Text-to-Image Models

One of the techniques that enables image optimization in Stable Diffusion is Textual Inversion. Textual Inversion is a technique also known as 'Embeddings': by preparing data learned from an image separately from the Stable Diffusion model data, it becomes possible to generate an image that closely resembles that specific image. Textual Inversion only updates the 'weighting' used in keyword vectorization, so its advantage is that the memory required for learning is relatively small.
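The idea that only the keyword's vector is learned while the model itself stays untouched can be illustrated with a toy sketch. This is not the actual Textual Inversion training code: the small matrix `FROZEN_W` stands in for the frozen model, `TARGET` stands in for features of the reference image, and all names and values are assumptions made for illustration.

```python
# Toy illustration: learn ONE embedding vector while the "model" stays frozen.
# FROZEN_W, TRUE_V and TARGET are hypothetical stand-ins, not Stable Diffusion data.

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def loss_and_grad(W, v, target):
    """Squared error of the frozen model's output, and its gradient w.r.t. v only."""
    err = [o - t for o, t in zip(matvec(W, v), target)]
    loss = sum(e * e for e in err)
    # d(loss)/dv = 2 * W^T * err ; no gradient is ever applied to W.
    grad = [2 * sum(W[i][j] * err[i] for i in range(len(W))) for j in range(len(v))]
    return loss, grad

def invert(W, target, dim, steps=500, lr=0.05):
    """Gradient descent on the embedding v; W is read but never modified."""
    v = [0.0] * dim
    history = []
    for _ in range(steps):
        loss, grad = loss_and_grad(W, v, target)
        history.append(loss)
        v = [x - lr * g for x, g in zip(v, grad)]
    return v, history

FROZEN_W = [[1.0, 0.2, 0.0],
            [0.0, 1.0, 0.3],
            [0.1, 0.0, 1.0]]
TRUE_V = [0.5, -1.0, 2.0]             # the embedding we hope to recover
TARGET = matvec(FROZEN_W, TRUE_V)     # what the reference image "looks like"

if __name__ == "__main__":
    learned, history = invert(FROZEN_W, TARGET, dim=3)
    print("learned embedding:", [round(x, 3) for x in learned])
    print("first loss:", history[0], "last loss:", history[-1])
```

Because only `v` is updated, the data that has to be kept after training is a single small vector, which is consistent with the low memory requirement described above.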

Explain the merits and demerits of ``Textual Inversion'' that fine-tunes the image generation AI ``Stable Diffusion'' with several images with examples - GIGAZINE

In addition, 'Dream Booth' is an image optimization technology developed for Google's image generation AI 'Imagen'. Unlike Textual Inversion, Dream Booth performs additional training on the model itself and updates its parameters. A method has been developed to apply Dream Booth to Stable Diffusion, and anyone can easily run Dream Booth by using, for example, the following tool.

``Dream Booth Gui'' review that allows you to easily use ``Dream Booth'' that allows you to additionally learn patterns and styles from just a few illustrations to the image generation AI ``Stable Diffusion'' - GIGAZINE

However, Gal et al. pointed out that ``conventional image optimization approaches have problems such as long learning times and high storage requirements,'' and proposed an ``encoder-based domain tuning approach'' to solve these problems.
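To give a rough sense of the storage gap between keeping one learned embedding and keeping a fully retrained model, here is a back-of-the-envelope calculation. The 768-dimensional embedding size comes from the article; the figure of roughly 860 million U-Net parameters and 32-bit storage are assumptions based on commonly cited numbers for Stable Diffusion v1, not values given by Gal et al.

```python
# Rough storage comparison; UNET_PARAMS and float32 storage are assumptions.
BYTES_PER_FLOAT32 = 4

def size_bytes(num_parameters, bytes_per_parameter=BYTES_PER_FLOAT32):
    """Storage needed for a given number of parameters."""
    return num_parameters * bytes_per_parameter

EMBEDDING_PARAMS = 768            # one token embedding vector
UNET_PARAMS = 860_000_000         # assumed approximate U-Net size (SD v1)

embedding_bytes = size_bytes(EMBEDDING_PARAMS)
unet_bytes = size_bytes(UNET_PARAMS)
ratio = unet_bytes // embedding_bytes

if __name__ == "__main__":
    print(f"one embedding : {embedding_bytes / 1024:.1f} KiB")
    print(f"U-Net weights : {unet_bytes / 1e9:.2f} GB")
    print(f"ratio         : about {ratio:,}x")
```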

In Stable Diffusion, the 'text encoder' converts the input text into 768-dimensional token embedding vectors, the 'U-NET encoder' converts the token embedding vectors into noise image information in the latent space, and the 'decoder' then generates an image by converting that noise image information into a pixel image. The specific mechanism is summarized in the following article.
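The flow of data through these three stages can be traced as a sequence of array shapes. The sketch below does no real inference; it only does the bookkeeping. The 768-dimensional embedding is from the article, while the 77-token context length, the 4 latent channels, and the 8x downscaling factor are assumptions taken from the usual Stable Diffusion v1 configuration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SDShapes:
    """Assumed Stable Diffusion v1 sizes; only embed_dim is stated in the article."""
    max_tokens: int = 77        # assumption: text encoder context length
    embed_dim: int = 768        # from the article
    latent_channels: int = 4    # assumption
    vae_factor: int = 8         # assumption: pixels per latent cell along each axis

def text_embedding_shape(cfg):
    """Text encoder output: one embed_dim vector per token position."""
    return (cfg.max_tokens, cfg.embed_dim)

def latent_shape(cfg, height, width):
    """Latent-space tensor the U-Net denoises for a height x width image."""
    if height % cfg.vae_factor or width % cfg.vae_factor:
        raise ValueError("image size must be a multiple of vae_factor")
    return (cfg.latent_channels, height // cfg.vae_factor, width // cfg.vae_factor)

def decoded_shape(cfg, latent):
    """Decoder output: an RGB pixel image scaled back up from the latent."""
    _, h, w = latent
    return (3, h * cfg.vae_factor, w * cfg.vae_factor)

def trace(cfg, height, width):
    """Return the shape at each stage: text encoder, U-Net latent, decoder."""
    latent = latent_shape(cfg, height, width)
    return {
        "text_embeddings": text_embedding_shape(cfg),
        "latents": latent,
        "image": decoded_shape(cfg, latent),
    }

if __name__ == "__main__":
    for stage, shape in trace(SDShapes(), 512, 512).items():
        print(f"{stage:16s} {shape}")
```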

Detailed illustration of how the image generation AI 'Stable Diffusion' generates images from text - GIGAZINE

Gal et al.'s approach consists of two steps: first, a single input image and a combination of words representing that image are added to the text encoder; then the U-NET encoder is updated to change the weighting of the vectors.
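One way to see why a short adjustment of 5 to 15 steps can be enough is to compare tuning from scratch with tuning from a good initial guess. The toy sketch below is not Gal et al.'s method: `toy_encoder` is a hypothetical stand-in for a trained encoder, the "ideal" embedding is defined artificially, and only the embedding is tuned (the U-NET update is omitted for brevity).

```python
# Toy comparison of a cold start vs. an encoder-predicted start.
# All functions and numbers here are illustrative assumptions.

def ideal_embedding(image_features):
    """By construction, the embedding we are trying to reach in this toy."""
    return [2.0 * f for f in image_features]

def toy_encoder(image_features):
    """Hypothetical trained encoder: close to the ideal mapping, but imperfect."""
    return [1.8 * f for f in image_features]

def sq_loss(v, target):
    """Squared distance between the current and the ideal embedding."""
    return sum((a - b) ** 2 for a, b in zip(v, target))

def tune(v, target, steps, lr=0.1):
    """A few gradient-descent steps on the squared loss."""
    for _ in range(steps):
        v = [a - lr * 2.0 * (a - b) for a, b in zip(v, target)]
    return v

def compare(image_features, steps=15):
    """Loss after the same number of steps from each starting point."""
    target = ideal_embedding(image_features)
    cold = tune([0.0] * len(target), target, steps)
    warm = tune(toy_encoder(image_features), target, steps)
    return sq_loss(cold, target), sq_loss(warm, target)

if __name__ == "__main__":
    cold_loss, warm_loss = compare([0.5, -1.0, 2.0])
    print("after 15 steps, cold start loss:", cold_loss)
    print("after 15 steps, warm start loss:", warm_loss)
```

With the same step budget, the run that starts from the encoder's guess ends much closer to the target, which is the intuition behind pairing an encoder with a brief tuning phase.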

Below is a summary of the results of actually loading a researcher's face photo into Stable Diffusion and generating similar images. From the left, the image columns are 'loaded image', 'Stable Diffusion', 'Textual Inversion (1 loaded image)', 'Textual Inversion (5 loaded images)', 'Dream Booth (1 loaded image)', 'Dream Booth (5 loaded images)', 'Textual Inversion + Dream Booth', and 'Gal et al.'s approach'. It can be seen that the images generated by Gal et al.'s approach reproduce the face of the person in the loaded photo fairly faithfully.

In addition, it was confirmed that Gal et al.'s approach works not only on human faces: it can also be instructed to change the subject while keeping the style of the image, and conversely to change the painting style while keeping the subject as it is.

However, according to Gal et al., this encoder-based approach significantly increases the amount of VRAM required. In addition, because both the text encoder and the U-NET encoder need to be tuned at the same time, it requires a large amount of memory.

At the time of writing, Gal and his colleagues had not released the code for this approach, but they say it will be released on GitHub soon.

in Software, Posted by log1i_yk