I could explain this technology with a concise definition such as "Latent Diffusion Models (LDMs) use machine learning to iteratively remove deliberately added noise". Or I could try to further summarise the University of Nottingham's YouTube video (itself a summary of research papers). But let's try to add value by providing the information in an FAQ format.
As this deals with a family of related algorithms, we will focus on only one of the central party tricks of this relatively new technology: text-to-image. This turns a short text into an image that hopefully depicts that text in an aesthetic and arguably creative way.
- What software falls under this category?
Stable Diffusion, Midjourney, OpenAI's DALL·E 2 (OpenAI has links to Microsoft and Elon Musk), Google's Imagen, and their spin-offs.
- What are the key machine learning technologies called?
See the video for a real explanation. This is just a list of the three main ingredients that all variants seem to agree on, and that together have become "spectacularly" successful:
- The Latent Diffusion Model algorithm (or architecture) handles removing random noise from an image, using neural networks trained specifically for this task. The noise is mainly added artificially to introduce an element of surprise and to avoid giving the same result each time the same question is asked.
- Embedding of the textual input into the learning process, using for example the GPT-3 algorithm. This encourages the models to produce images matching the input text rather than just any image: after all, what would you expect if you tried to read data out of noise without any hint of what you would like to see?
- Classifier-free guidance more or less forces the models to produce images that match the input prompt. This is done by determining the effect of the text conditioning (by subtracting the unconditioned prediction from the conditioned one) and then amplifying that effect (by multiplying it with a guidance scale).
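The classifier-free guidance step can be sketched in a few lines. This is a toy illustration with made-up numbers, not actual model code: real systems apply this to large latent tensors inside a denoising loop, and the function name and the guidance scale of 7.5 are just illustrative choices.

```python
import numpy as np

def cfg_step(eps_cond, eps_uncond, guidance_scale):
    """One classifier-free guidance combination (toy sketch).

    eps_cond:   noise predicted *with* the text prompt
    eps_uncond: noise predicted with an empty prompt
    Subtracting isolates the prompt's effect; multiplying by a
    guidance_scale > 1 pushes the denoiser harder toward the prompt.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Tiny 2x2 "latent" noise predictions (real latents are e.g. 64x64x4).
eps_uncond = np.array([[0.0, 0.2], [0.4, 0.6]])
eps_cond   = np.array([[0.1, 0.2], [0.5, 0.5]])
guided = cfg_step(eps_cond, eps_uncond, guidance_scale=7.5)
print(guided)
```

Note that where the two predictions agree (the prompt made no difference), the guided value is unchanged; where they differ, the difference is exaggerated 7.5-fold.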
- Does all this use Google Search to find images?
No. If that were the case the result would change over time, because Google Search keeps finding new information on the internet on a daily basis. For people who develop and test algorithms, stable input data is actually a benefit.
And the technology would then "know" roughly everything that Google has run into. This is not the case: Google may index captioned images that an LDM's smaller data set doesn't know about. The data sets for LDMs are smaller largely for practical reasons (compute cost, network load, compute time, copyrights).
- Does LDM contain actual images (e.g. of a cat or Picasso painting) that get included in the output?
Hard to say. Stable Diffusion has been trained on about 600 million aesthetically pleasing image/text pairs. These will include cats and paintings by Picasso (provable, because both keywords give recognisable results). But for engineering reasons the algorithm doesn't run on the images themselves ("in pixel space") but in a heavily encoded space ("lower-dimensional" or "latent" space). Explaining how the processing in such algorithms works is nonintuitive (see link) because it works very differently from how people reason.
So you could argue that the model knows how to detect (or generate) cats, but it certainly does not generate cat images in a straightforward way from the cat images submitted during training. To generate "a blue cat", the model thus has a learned knowledge of blueness and catness and can easily combine both. In contrast, a human would typically first tackle catness and then blueness: make a suitable cat image and then turn it blue using some algorithm that largely changes the colour without impacting recognisability.
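The "combine blueness and catness" idea can be illustrated with a toy vector sketch. Everything here is made up for illustration: real models learn thousands of dimensions from data and nobody hand-labels a "catness" direction, but the principle of blending concept directions in one shared space is the same.

```python
import numpy as np

# Hypothetical 4-dimensional "concept" directions. In a real model these
# would be learned implicitly, not hand-crafted like this.
catness  = np.array([1.0, 0.0, 0.3, 0.0])
blueness = np.array([0.0, 1.0, 0.0, 0.2])

# "A blue cat" as a point that scores high on both directions at once:
# the model blends attributes, whereas a human paints a cat and then tints it.
blue_cat = catness + blueness

print(blue_cat @ catness)   # positive: still cat-like
print(blue_cat @ blueness)  # positive: also blue-like
```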
- If this software generates a picture of a comb, is that based on one single input image of a comb, an average of all available comb images, or something way more complex?
Assuming the model was trained on multiple comb images, it would be weird and risky to just pick one: that would waste information that could be obtained from the other examples of combs. But how do you generate a typical comb if you don't have a dedicated parameterised model of alternative combs that one could order? For the software, "comb" is just a word, and there is not even certainty about which part of the image corresponds to "comb". So you might picture the software as having (implicitly) discovered that in images labelled "comb", there is a high likelihood of seeing a series of thick parallel lines somewhere in the picture.
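A toy experiment can show how a shared statistic like "parallel lines" emerges from many noisy examples. This is not how an LDM is actually implemented; it merely illustrates that structure common to all "comb" images survives averaging while the per-image noise cancels out.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_comb_image():
    """An 8x8 'comb': teeth in every other column, drowned in noise."""
    img = np.zeros((8, 8))
    img[:, ::2] = 1.0                        # thick parallel lines
    return img + rng.normal(0, 0.8, (8, 8))  # noise hides any single example

# Any single noisy example looks like junk, but averaging many of them
# makes the shared "parallel lines" statistic stand out -- roughly what
# training does implicitly across all images captioned "comb".
avg = np.mean([toy_comb_image() for _ in range(500)], axis=0)
teeth_contrast = avg[:, ::2].mean() - avg[:, 1::2].mean()
print(teeth_contrast)  # close to the true line/gap contrast of 1.0
```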
- Does LDM contain codified knowledge like what human hands look like in 3D or a model of how reflective surfaces work?
No. The facts that a human hand has 5 fingers, that left and right hands are similar but different, and which hand goes where are not explicitly fed into the system. The system learns such things by being fed proper examples, and by being rewarded/punished based on what it comes up with.
- How does LDM get perspectives (more or less) right?
I suspect that there is no specific model of geometry or vanishing points. It just learns to get perspective right from correct examples, being rewarded during the learning process when it gets it more or less right.
- How does LDM get lighting (more or less) consistent across an image?
Same answer as question on perspective, I think.
- How much computation is done beforehand (“training”) and how much per task?
Training the model on 600 million captioned images takes a lot of computation. In one case it took 256 high-end graphics cards computing in parallel for almost a month to process all these images. But the resulting model is highly reusable. In contrast, converting a single text into an image using this big model takes, say, one minute per image – but that only gives one possible answer to one single question.
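The figures quoted above make for a simple back-of-envelope comparison. The "30 days" and "one GPU-minute per image" are the rough numbers from the text, not precise measurements:

```python
# Rough comparison of training vs. inference cost, using the figures above.
gpus, days = 256, 30
training_gpu_hours = gpus * days * 24   # 256 GPUs running for a month
inference_gpu_hours = 1 / 60            # one image, one GPU, one minute

print(training_gpu_hours)                          # 184320 GPU-hours
print(training_gpu_hours / inference_gpu_hours)    # ~11 million images' worth
```

In other words, one training run costs about as much compute as generating roughly eleven million images, which is why the trained model being reusable matters so much.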
- Does this require a graphics processor?
In practice, a GPU is needed for speed. In theory you can prove that any computer (with enough memory) will get the job done… eventually. Typical minimal requirements are an Nvidia RTX class graphics card with 8-16 GB of (video, VRAM) memory. Or an Apple M1 class computer with 8-16 GB of (unified) memory. Obviously all this depends on the size of the model that has been pre-trained, on the image size, number of denoising steps, and your patience.
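A rough estimate shows why 8–16 GB is the practical floor. The parameter count and overhead factor below are assumptions for illustration (Stable Diffusion v1 is on the order of a billion parameters across its sub-networks; the activation overhead is a guess):

```python
# Hedged back-of-envelope: memory needed to run a model of this class.
params = 1.0e9        # assumed model size: ~1 billion parameters
bytes_per_param = 2   # half-precision (fp16) weights
overhead_factor = 3   # rough allowance for activations, buffers, latents

needed_gb = params * bytes_per_param * overhead_factor / 1e9
print(needed_gb)      # ~6 GB: fits on an 8 GB card with some care
```

Halving precision further, shrinking the image, or reducing the batch size all lower this number, which is why the practical minimum varies between setups.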
- Does this require special machine learning cores (like Apple has)?
So far, I haven't seen anyone claim to use the ML cores to accelerate these algorithms. This could mean they are fundamentally unsuited for these specific algorithms, it could mean that nobody has bothered to port the algorithms to this Apple-specific technology, or it could mean that the benefits are not worth the effort.