How are image generation models trained? How do they understand prompt words and generate images from text?




5 Answers

Anonymous

Neural networks are basically pattern recognition engines. [This video](https://www.youtube.com/watch?v=HLRdruqQfRk), which uses a very silly use case, explains it pretty well.

Basically, the way they are trained is that you have some input and some output that exhibit some kind of pattern. In the video's example, the input is every lower-case letter in a given font, and the output is every corresponding upper-case letter (and vice versa for the second model, which goes the other way). Then you need to show the model many, many examples of the input and output. Like hundreds of thousands. In the case of the video, this was every font that actually uses letters that the author could find on the internet.
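If it helps to see it concretely, here's a toy sketch of what those matched pairs could look like, using Pillow to rasterize glyphs. The font path is just a placeholder for whatever fonts you collected; the real project's rendering was surely more careful than this:

```python
# Toy sketch: build (input, output) training pairs by rendering the
# lowercase glyph as the input image and the uppercase glyph as the target.
from PIL import Image, ImageDraw, ImageFont
import string

def render_glyph(char: str, font: ImageFont.FreeTypeFont, size: int = 32) -> Image.Image:
    """Rasterize one character onto a white square grayscale canvas."""
    img = Image.new("L", (size, size), color=255)
    ImageDraw.Draw(img).text((4, 2), char, font=font, fill=0)
    return img

font_paths = ["fonts/ExampleSans.ttf"]  # placeholder: the fonts you scraped
training_pairs = []
for path in font_paths:
    font = ImageFont.truetype(path, 24)
    for lo, up in zip(string.ascii_lowercase, string.ascii_uppercase):
        # Input: the lowercase glyph image; target: the uppercase glyph image.
        training_pairs.append((render_glyph(lo, font), render_glyph(up, font)))
```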

If the pattern relating the input and output examples were simple and direct, we could write a simple algorithm to capture it. But the relationship between ‘A’ and ‘a’ that also applies to the relationship between ‘B’ and ‘b’ isn’t simple; it’s kind of weird and complicated. So instead, we ask the model to come up with hidden layers: ways of transforming the data about the input that eventually get to the output. What does the model put into these hidden layers? We don’t know. We just tell the model at the end how successful it was, and ask it to improve these hidden layers over time to get more and more successful. Eventually, the neural net will have learned a set of hidden layers that transform the input into the output in a way that applies to all cases.

In the font example, the finished model can be given an image of a letter and pretty competently guess what the corresponding lower-case version of that letter would be. You can also use it to do silly things, like give it a lowercase letter and have it guess what the even-lower-er-case version of that letter would be, or give it a random non-letter symbol and find out what the “lowercase” of that would be according to the model.
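For the curious, here's a minimal sketch of that "hidden layers plus feedback" loop in PyTorch (my choice of framework; the video builds its own network). The layer sizes are arbitrary, and the images are assumed to be flattened 32x32 glyphs like the pairs above:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(32 * 32, 256),  # hidden layer 1: some learned transformation
    nn.ReLU(),
    nn.Linear(256, 256),      # hidden layer 2: another one we never inspect
    nn.ReLU(),
    nn.Linear(256, 32 * 32),  # output: the predicted uppercase image
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def training_step(lower_imgs: torch.Tensor, upper_imgs: torch.Tensor) -> float:
    """One round of: guess, get told how wrong you were, adjust."""
    prediction = model(lower_imgs)          # guess the uppercase images
    loss = loss_fn(prediction, upper_imgs)  # "how successful was I?"
    optimizer.zero_grad()
    loss.backward()   # trace the blame back through the hidden layers
    optimizer.step()  # nudge the weights so the next guess is a bit better
    return loss.item()
```

Notice that nothing in the loop ever says what the hidden layers should contain; the feedback signal just keeps pushing them toward whatever happens to work.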

This is how image generation models were trained. They were shown thousands, maybe millions, of matched inputs and outputs, like the word ‘dog’ and a picture of a dog, and asked to come up with hidden layers that can transform one into the other. A second algorithm that can judge how well the model did is then used to give the model feedback so it can improve. Eventually you end up with a model that can generate images based on all sorts of weird prompts. But we don’t actually know how the model works internally, or what aspects of ‘dogness’ it uses to make a good picture of a dog.
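One concrete (and much simplified) version of that "second algorithm that judges" is a GAN discriminator. To be clear, this is not how any particular product is actually built (most modern image models are diffusion models), just a toy illustration of a generator learning from a judge; the names and sizes below are made up for illustration:

```python
import torch
import torch.nn as nn

TEXT_DIM, NOISE_DIM, IMG_PIXELS = 64, 32, 32 * 32

# Generator: text embedding + random noise in, a (flattened) image out.
generator = nn.Sequential(
    nn.Linear(TEXT_DIM + NOISE_DIM, 256), nn.ReLU(),
    nn.Linear(256, IMG_PIXELS), nn.Tanh(),
)
# The judge: given the prompt and an image, score how convincing the pair is.
discriminator = nn.Sequential(
    nn.Linear(TEXT_DIM + IMG_PIXELS, 256), nn.ReLU(),
    nn.Linear(256, 1),
)

def generator_step(text_emb: torch.Tensor) -> torch.Tensor:
    """Make an image for a prompt, then collect the judge's feedback."""
    noise = torch.randn(text_emb.shape[0], NOISE_DIM)
    fake = generator(torch.cat([text_emb, noise], dim=1))
    score = discriminator(torch.cat([text_emb, fake], dim=1))
    # The generator's goal: make the judge score the fake as real (label 1).
    return nn.functional.binary_cross_entropy_with_logits(
        score, torch.ones_like(score))
```

In real systems the text embedding would come from a separately trained language model, which is a big part of why prompts "work" at all.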
