How are image generation models trained? How do they understand prompt words and generate images from text?

5 Answers

Anonymous 0 Comments

Neural networks are basically pattern recognition engines. [This video](https://www.youtube.com/watch?v=HLRdruqQfRk), which uses a very silly use-case, explains it pretty well.

Basically, how they are trained is that you have some input and some output that exhibit some kind of pattern. For the example in the video, the input is every lower-case letter in a given font, and the output is every corresponding upper-case letter. (And also vice-versa, for the other model.) Then, you need to show it many, many examples of the input and output – like hundreds of thousands. In the case of the video, this was every font the author could find on the internet that actually uses letters.

If the pattern relationship between the input and the output examples were simple and direct, we could build a simple algorithm to capture it. But the relationship between ‘A’ and ‘a’ that also applies to the relationship between ‘B’ and ‘b’ isn’t simple, it’s kind of weird and complicated. So instead, we ask the model to come up with hidden layers – ways of transforming the data about the input that eventually get to the output. What does the model put into these hidden layers? We don’t know. We just tell the model at the end how successful it was, and ask it to improve these hidden layers over time to get more and more successful. Eventually, the neural net will have learned a number of hidden layers that transform the input into the output in a way that applies to all cases. In the font example, the end model can be given an image of a letter and pretty competently guess what the corresponding lower-case version of that letter would be. You can also use it to do silly things, like give it a lowercase letter and have it guess what an even-lower-case version of that letter would look like, or give it a random non-letter symbol and find out what the “lowercase” of that would be according to the model.
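
To make the “hidden layers” idea a bit more concrete, here is a minimal sketch in Python/NumPy (not the video’s model, just an illustration): a tiny two-layer network learns XOR, a mapping no single simple rule captures, purely from being told how wrong it was and nudging its hidden layer each time.

```python
import numpy as np

# Toy "hidden layer" demo: learn XOR, a mapping too weird for one linear rule,
# so the network has to invent an intermediate representation on its own.
rng = np.random.default_rng(0)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # inputs
y = np.array([[0], [1], [1], [0]], dtype=float)              # desired outputs

W1 = rng.normal(size=(2, 8))   # input -> hidden layer
b1 = np.zeros(8)
W2 = rng.normal(size=(8, 1))   # hidden layer -> output
b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 1.0
for step in range(5000):
    # Forward pass: transform the input through the hidden layer into a guess.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # "Tell the model how successful it was": error against the true outputs.
    grad_out = (out - y) * out * (1 - out)

    # Backward pass: nudge the weights so the hidden layer does better next time.
    grad_W2 = h.T @ grad_out
    grad_b2 = grad_out.sum(axis=0)
    grad_h = (grad_out @ W2.T) * h * (1 - h)
    grad_W1 = X.T @ grad_h
    grad_b1 = grad_h.sum(axis=0)

    W2 -= lr * grad_W2; b2 -= lr * grad_b2
    W1 -= lr * grad_W1; b1 -= lr * grad_b1

# Should approach [[0], [1], [1], [0]] after training.
print(np.round(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2), 2))
```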

This is how image generation models were trained. They were shown huge numbers of matched inputs and outputs – millions or more – like the word ‘dog’ and a picture of a dog, and asked to come up with hidden layers that can transform one into the other. A second algorithm that can judge how well the model did is then used to give the model feedback so it could improve. Eventually you end up with a model that can generate images based on all sorts of weird prompts. But we don’t actually know how the model works internally – what aspects of ‘dogness’ it uses to make a good picture of a dog.
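
As a rough sketch of that feedback loop (real systems use diffusion models or GANs with a learned critic, not a linear map like this): here the “prompt” is a one-hot word vector, the “image” is four pixels, and the “judge” is simply the pixel-wise error against the paired training image.

```python
import numpy as np

# Toy text-to-image loop: map a one-hot "prompt" to a flattened 2x2 "image".
# The "judge" here is just pixel-wise error against the paired training image.
rng = np.random.default_rng(1)

prompts = {"dog": np.array([1.0, 0.0]), "cat": np.array([0.0, 1.0])}
images = {"dog": np.array([0.9, 0.1, 0.8, 0.2]),   # flattened 2x2 target for "dog"
          "cat": np.array([0.1, 0.9, 0.2, 0.8])}   # flattened 2x2 target for "cat"

W = rng.normal(scale=0.1, size=(2, 4))  # the entire "model": prompt -> pixels

for step in range(2000):
    for word, target in images.items():
        pred = prompts[word] @ W                       # generate an image from the prompt
        feedback = pred - target                       # how wrong was it? (the "judge")
        W -= 0.1 * np.outer(prompts[word], feedback)   # use the feedback to improve

print(np.round(prompts["dog"] @ W, 2).reshape(2, 2))   # close to the "dog" target
```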

Anonymous 0 Comments

They understand prompts because they’re trained not only on images but also on descriptions of those images. If you just gave them images, they would be able to generate similar images, but not react to prompts.
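
A toy illustration of the difference (the file names and captions are made up):

```python
# Images alone teach the model "what images look like"; pairing each image with
# a caption is what lets a prompt steer generation toward specific content.
captioned_dataset = [
    {"image": "photos/0001.jpg", "caption": "a golden retriever playing in snow"},
    {"image": "photos/0002.jpg", "caption": "a red bicycle leaning against a brick wall"},
]

uncaptioned_dataset = ["photos/0001.jpg", "photos/0002.jpg"]  # no way to connect words to pixels
```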

Anonymous 0 Comments

TLDR: There are different ways to train an image generation model, such as using a transformer model or a generative adversarial network. These models use different techniques to generate new images, such as sampling pixel values or interpolating latent space vectors.

If you train an image generation model on a collection of paintings, it can generate new paintings that have the same style and colors as the original ones.

One way to train an image generation model is to use a **transformer** model, which is a type of neural network that can process sequences of data, such as words or pixels. A transformer model can learn how to generate coherent text or images by predicting the next element in a sequence based on the previous ones.

To train a transformer model on images, you need to **unroll** the images into long sequences of pixels, which are the tiny dots that make up an image. Each pixel has a value that represents its color and brightness. The transformer model can then learn the patterns and features of these pixel sequences, such as shapes, edges, textures, etc.
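
For example, a minimal sketch of that unrolling step with NumPy (the 4×4 image here is just a stand-in for real pixel data):

```python
import numpy as np

# "Unrolling" a tiny 4x4 grayscale image into a sequence of 16 pixel values,
# the way an autoregressive (transformer-style) image model would consume it.
image = np.arange(16, dtype=np.uint8).reshape(4, 4)  # stand-in for a real image
sequence = image.flatten()   # row by row: pixel 0, 1, 2, ..., 15
print(sequence)              # the model learns to predict element k+1 from elements 0..k
```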

To generate new images, the transformer model can use a technique called **sampling**, which means randomly choosing some pixel values based on the probabilities learned by the model. The model can then use these pixel values as inputs to generate the rest of the image sequence.
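
A toy sketch of that sampling loop (the probabilities here are made up; in a real model they would come from the network’s prediction for the next pixel):

```python
import numpy as np

# Toy "sampling": pick the next pixel value at random according to a probability
# distribution. In a real model, probs comes from the trained network's output.
rng = np.random.default_rng(0)

palette = np.array([0, 85, 170, 255])          # four possible pixel intensities
generated = []
for position in range(8):                      # generate an 8-pixel sequence
    probs = rng.dirichlet(np.ones(4))          # stand-in for the model's P(next pixel)
    next_pixel = rng.choice(palette, p=probs)  # sample one value from that distribution
    generated.append(int(next_pixel))          # it becomes context for the next step

print(generated)
```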

Another way to train an image generation model is to use a **generative adversarial network** (GAN), which is a type of neural network that consists of two parts: a generator and a discriminator. The generator tries to create fake images that look real, while the discriminator tries to tell apart real images from fake ones. The generator and the discriminator compete with each other and improve over time.

To train a GAN on images, you need to provide both real and fake images as inputs to the discriminator. The real images are from your dataset, while the fake images are generated by the generator. The discriminator outputs a score that indicates how likely an image is real or fake. The generator tries to fool the discriminator by making its fake images more realistic, while the discriminator tries to catch the generator by making its scores more accurate.
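
Here is a minimal sketch of one GAN training step in PyTorch, using random tensors as stand-ins for a real image batch; it illustrates the generator/discriminator game rather than any particular production model.

```python
import torch
from torch import nn

# Minimal GAN sketch on made-up 28x28 "images" (random tensors stand in for a
# real dataset). One training step each for the discriminator and the generator.
latent_dim = 64

generator = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                          nn.Linear(256, 28 * 28), nn.Tanh())
discriminator = nn.Sequential(nn.Linear(28 * 28, 256), nn.LeakyReLU(0.2),
                              nn.Linear(256, 1))  # raw score: higher = "looks real"

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
loss_fn = nn.BCEWithLogitsLoss()

real_images = torch.rand(32, 28 * 28) * 2 - 1   # placeholder for a real batch

# --- Discriminator step: score real images as 1, fakes as 0 ---
noise = torch.randn(32, latent_dim)
fake_images = generator(noise).detach()          # don't update the generator here
d_loss = (loss_fn(discriminator(real_images), torch.ones(32, 1)) +
          loss_fn(discriminator(fake_images), torch.zeros(32, 1)))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# --- Generator step: try to make the discriminator call its fakes "real" ---
noise = torch.randn(32, latent_dim)
g_loss = loss_fn(discriminator(generator(noise)), torch.ones(32, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()

print(f"d_loss={d_loss.item():.3f}  g_loss={g_loss.item():.3f}")
```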

To generate new images, the generator can use a technique called **latent space interpolation**, which means creating new images by combining features from different existing images. The generator takes its input from a so-called latent space, where each image is represented by a vector of numbers. The generator can then create new vectors by blending elements from different vectors, and use these vectors as inputs to generate new images.
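
A small sketch of latent space interpolation (the latent vectors are random here, and the `generator` call is left as a comment since no trained generator is defined in this snippet):

```python
import numpy as np

# Latent space interpolation: blend two latent vectors; with a trained GAN you
# would feed each blend to the generator to get images that morph from A to B.
rng = np.random.default_rng(0)
latent_dim = 64

z_a = rng.normal(size=latent_dim)   # latent vector for "image A"
z_b = rng.normal(size=latent_dim)   # latent vector for "image B"

for t in np.linspace(0.0, 1.0, 5):
    z_mix = (1 - t) * z_a + t * z_b     # linear blend of the two vectors
    # image = generator(z_mix)          # with a trained generator (not defined here)
    print(f"t={t:.2f}  first 3 dims: {np.round(z_mix[:3], 2)}")
```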

Anonymous 0 Comments

I’m going to avoid going into detail about Neural Networks (the innards, so to speak) as it has already been [covered below](https://www.reddit.com/r/explainlikeimfive/comments/1428yqy/comment/jn3nfg3/?utm_source=reddit&utm_medium=web2x&context=3).

There are multiple techniques, but all of them require a human to provide context and meaning first. The computer needs to be able to start relating shapes, shadows and details to the descriptions of the image. That’s a significant amount of data to obtain, because your sample size has to be huge to get passable results.

So how do we get those descriptions? Well, we can either farm them ourselves, buy them from a private data seller, hire a bunch of people to do the labeling in house, or just flat out use Google Images. Algorithms collect data from us all the time to improve their visual recognition.

You know those “prove you’re not a robot” tests where you have to select all squares with a bicycle or traffic light? Well, the image you got was probably low-quality and grainy to make automation harder, but it was selected because the algorithm has a degree of uncertainty about it and wants human input.

Google knows where it thinks the bike is, but it’s not 100% confident. You are confirming it by clicking on the correct squares. This feeds back into a large data set of other people’s answers; from there, the algorithm can either confirm or revise its own guesses and evolve as needed. Every time you fail this test, the algorithm still learns – you’ve just told it to be more uncertain about that image.

In practice though, this topic can be way more convoluted and complex.

Anonymous 0 Comments

Having trained a LoRA (Low-Rank Adaptation) for Stable Diffusion, I had to prepare two pieces of data to feed into the system.

I have a folder of images, and for each image a text file with the same name; the text files are simply comma-separated lists of keywords describing what is in the image.
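
As a rough sketch of loading that kind of dataset (the folder name `training_images/` and the file extensions are made up for illustration):

```python
from pathlib import Path

# Each image file has a same-named .txt file containing comma-separated
# caption keywords, as described above.
dataset = []
for image_path in sorted(Path("training_images").glob("*.png")):
    caption_path = image_path.with_suffix(".txt")
    keywords = [w.strip() for w in caption_path.read_text().split(",")]
    dataset.append({"image": image_path.name, "keywords": keywords})

for example in dataset:
    print(example["image"], "->", ", ".join(example["keywords"]))
```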

The words are associated with patterns in the images, and they’re used to guide the image generation algorithm towards a desired result. If you generate an image with no prompt at all, it will make a complete guess and you have no idea what will appear.