How does latent diffusion text-to-image AI work?


The “latent space” is the name for the compressed internal map of images that a neural network builds during the learning phase of its training.

In extremely basic terms, the neural network has learned that a fingernail sits at the end of a finger, followed by a bit of finger, then a joint, then more finger, then another joint, then the hand. It has also learned that a finger sits next to 1 or 2 other fingers, and that the lengths of the segments come in a ratio of X.

That is the “latent space”: the abstracted definitions of how different parts of a picture relate to each other. That way, if it finds part of a hand, it knows what *else* is supposed to be there, and if the rest is there, it knows it’s a hand.
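If it helps to see the "compressed map" idea in code, here is a tiny sketch. Real latent diffusion systems learn their latent space with a neural autoencoder; this toy swaps in a plain linear projection (PCA via SVD) as a stand-in, and the fake "images", `encode`, and `decode` are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# 200 fake "images" of 16 pixels each that secretly vary in only 2 ways.
hidden = rng.standard_normal((200, 2))
mixing = rng.standard_normal((2, 16))
images = hidden @ mixing

# "Training": find the 2 directions that explain how the pixels move together.
mean = images.mean(axis=0)
_, _, vt = np.linalg.svd(images - mean, full_matrices=False)
basis = vt[:2]  # shape (2, 16)

def encode(x):
    """Squash 16 pixels down to 2 latent numbers."""
    return (x - mean) @ basis.T

def decode(z):
    """Rebuild 16 pixels from 2 latent numbers."""
    return z @ basis + mean

latents = encode(images)        # shape (200, 2)
reconstructed = decode(latents) # shape (200, 16)
```

Because the fake pixels only ever vary in 2 ways, 2 latent numbers are enough to rebuild all 16 pixels. Real photos are far messier, so a real latent space needs a learned, non-linear model and many more numbers.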


Diffusion models are the method for actually generating the image. You ever see the CSI “Enhance” memes where they take an ultra-low-resolution security camera still and “clean it up and enhance” it to the point where they can make a positive identification based on reflections that are all of 4 pixels in the original image?

That’s what diffusion models do. They’re trained by taking an image, adding noise, and having the AI try to undo the noise to get the original image back. Once it’s reliable at that, you add more noise, and more, and more, until you reach the end result: the ability to tell the computer what it’s supposed to find, hand it an image of completely random noise, and have it use that noise to “recreate” what it was told was there.
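Here is a rough sketch of the "add more and more noise" half of that training setup. It assumes the linear noise schedule commonly used in DDPM-style models; the function names, the step count of 1000, and the 8x8 random "image" are placeholders picked for illustration, and the denoising network itself is left out.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal that survives after each step."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(image, t, alpha_bar, rng):
    """Jump straight to noise level t: blend the image with Gaussian noise."""
    noise = rng.standard_normal(image.shape)
    noisy = np.sqrt(alpha_bar[t]) * image + np.sqrt(1.0 - alpha_bar[t]) * noise
    # During training, the network sees `noisy` and is asked to predict `noise`.
    return noisy, noise

rng = np.random.default_rng(0)
image = rng.uniform(-1.0, 1.0, size=(8, 8))  # stand-in for a real picture
alpha_bar = make_alpha_bar()

slightly_noisy, _ = add_noise(image, 10, alpha_bar, rng)  # still looks like the image
mostly_noise, _ = add_noise(image, 999, alpha_bar, rng)   # almost pure static
```

At the last step almost none of the original image is left, which is why generation can start from pure random noise and run the learned "undo" step over and over.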

And you use the text prompt to pick out the part of the latent space that tells the diffusion model what it’s supposed to find in the noise.
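One common way the prompt steers the denoiser is a trick called classifier-free guidance, which the explanation above doesn't go into. Treat this as one possible mechanism, not the whole story. The sketch assumes the denoiser has been run twice, once with the prompt and once without; the arrays below are invented stand-ins for those two outputs, and `guided_noise_estimate` is a hypothetical helper name.

```python
import numpy as np

def guided_noise_estimate(noise_without_prompt, noise_with_prompt, guidance_scale=7.5):
    """Push the noise estimate further in the direction the prompt suggests."""
    return noise_without_prompt + guidance_scale * (noise_with_prompt - noise_without_prompt)

# Invented stand-ins for two denoiser outputs on the same noisy latent.
rng = np.random.default_rng(0)
noise_without_prompt = rng.standard_normal((4, 4))
noise_with_prompt = noise_without_prompt + 0.1 * rng.standard_normal((4, 4))

guided = guided_noise_estimate(noise_without_prompt, noise_with_prompt)
```

With a scale of 1 you just get the with-prompt estimate. Larger scales exaggerate the difference the prompt makes, which tends to give images that follow the text more closely.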