You can weave together a ton of data before the prompt is entered and connect it all up. You can spend a long long time (computationally) beforehand analyzing the corpus of data to make your weights. The prompt still requires a lot of processing itself but it’s going off of predetermined mappings, if that makes any sense.
Think of your prompt tumbling down a Plinko board, taking twists and turns based on the input, but the pegs and chutes were all put in place beforehand.
The tricky part happens when the AI model is first created, or trained if you prefer. All that info you’re talking about is crunched down until it is basically reduced to a really complex math formula. The formula is large and complicated, but doing math is what computers are best at, so it can be run relatively easily.
In the case of generative text models, the result is basically a very advanced version of predictive text algorithms that run on your phone when you are texting. Those algorithms on your phone look at the words you’ve written and try to guess what you might write next based on what other people write. Generative AI models basically do the same thing, except they’re using a much larger dataset and they are able to generate much more than a single word at a time. So when you ask the model a question, it isn’t even trying to look for the correct answer, it’s finding an approximation of “what do people normally receive as an answer to a question like this?”
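To make the phone-keyboard comparison concrete, here is a minimal sketch of a "predict the next word" model that just counts which word tends to follow which. It is a toy under stated assumptions: the tiny corpus and the function names `train_bigram` and `predict_next` are made up for illustration, and real language models use neural networks over far more context than one previous word.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """'Training': count, for each word, which words follow it."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following

def predict_next(model, word):
    """'Inference': return the most frequent follower, or None if unseen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Made-up corpus, purely for illustration.
corpus = "the cat sat on the mat . the cat ate the fish . the cat slept"
model = train_bigram(corpus)
print(predict_next(model, "the"))
```

The counting happens once, up front. Answering "what comes after this word?" afterwards is just a lookup, which is the same split between slow training and fast use described in this thread.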
The LLM isn’t poring over that data every time you make a query. What happens is that all the data is analyzed during a training phase and that produces a model. The model is basically words and numbers describing the relationships between words. The model is what you interact with, and it contains a representation of the data that was analyzed during the training phase. Once you have the model, things are pretty quick, but training that model can take days, or weeks to months for the largest ones.
AI computation doesn’t work the way you’re probably thinking. It’s not searching through a database of information each time to find your answers*.
Machine learning is all about finding patterns. To give a simplified example, how do you recognize if something is an apple? Well, you could have a database full of thousands of images of apples, and go through one by one to see if your input matches any of them.
Or you could spend some time beforehand to look at all those pictures, and realize, “Oh, I see. Apples are red (sometimes shades of green and yellow), roundish, and shiny.” And now when you see a new picture of an apple, you can just go, “Is it apple colored? Is it the right shape? Is it the right texture?” which is not only much faster, it’s more generalizable than just trying to match it against a database of pictures.
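The "learn the pattern once, then check new things against it" idea can be sketched in a few lines. This is only a toy: the features (`roundness`, `shininess`, `diameter_cm`), their values, and the helper names are all invented for illustration, and real image models learn their own features from pixels rather than being handed three numbers.

```python
def learn_ranges(examples):
    """Summarize the examples as a (min, max) range for each feature."""
    features = examples[0].keys()
    return {
        f: (min(e[f] for e in examples), max(e[f] for e in examples))
        for f in features
    }

def looks_like(ranges, item):
    """Check a new item against the learned ranges, not the stored examples."""
    return all(lo <= item[f] <= hi for f, (lo, hi) in ranges.items())

# Hypothetical measurements of known apples.
apples = [
    {"roundness": 0.90, "shininess": 0.80, "diameter_cm": 7.0},
    {"roundness": 0.85, "shininess": 0.70, "diameter_cm": 8.0},
    {"roundness": 0.95, "shininess": 0.90, "diameter_cm": 6.5},
]
apple_rule = learn_ranges(apples)

new_apple = {"roundness": 0.88, "shininess": 0.75, "diameter_cm": 7.5}
banana = {"roundness": 0.30, "shininess": 0.20, "diameter_cm": 18.0}
print(looks_like(apple_rule, new_apple), looks_like(apple_rule, banana))
```

Once `apple_rule` exists, the original examples are no longer needed to classify something new. The check costs the same whether the rule was learned from three apples or three million.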
Modern language AIs rely on the fact that language, and the information it contains, is just patterns. The words “George Washington” are frequently near the words “president” and “United States.” That’s a pattern. And therefore an AI can learn that. It doesn’t need to search through a database to find the list of US presidents. It just learns a bunch of numbers that represent the relationships between words. Plus a bunch more math to further refine those relationships.
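As a rough illustration of "numbers that represent the relationships between words", the sketch below counts how often two words show up in the same sentence. The sentences and the helper names `cooccurrence` and `together` are assumptions made up for this example. Real models learn much richer numeric representations than raw counts, but the starting intuition is similar.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(sentences):
    """Count how many sentences each pair of words appears in together."""
    counts = Counter()
    for sentence in sentences:
        words = sorted(set(sentence.lower().split()))
        for a, b in combinations(words, 2):
            counts[(a, b)] += 1
    return counts

def together(counts, a, b):
    """Look up a pair regardless of the order the words are given in."""
    return counts[tuple(sorted((a, b)))]

# Made-up sentences, purely for illustration.
sentences = [
    "washington was the first president",
    "president washington led the united states",
    "the united states elected a president",
    "apples are red and shiny",
]
counts = cooccurrence(sentences)
print(together(counts, "washington", "president"))
print(together(counts, "washington", "apples"))
```

"washington" and "president" share sentences while "washington" and "apples" never do, so the counts alone already hint at which words are related.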
Another factor is hardware. Modern AI is designed to take advantage of the parallelization capabilities of hardware like GPUs, and companies are now producing special chips designed just for making AI computations fast.
* AI systems *can* be hooked up to databases or use web search, but those are more like add-ons. The fundamental way modern AI works doesn’t depend on them.
Patented AI engineer here.
AI is primarily a series of multiplications and additions across large tables (matrices). The largest models (like GPT-4) are reported to have on the order of 1 trillion parameters, though OpenAI has not published the exact figure. The parameters are the weights learned during the training phase. These trillion parameters compress the model’s knowledge about the world, which was obtained from training on vast datasets consisting of text and images from the internet. These datasets can run from terabytes to petabytes, so it seems impossible to rapidly access them.
Now, imagine the size of 1 trillion numbers. If each number is stored in FP16 precision (2 bytes), the total storage required is about 2 terabytes. This 2 terabytes represents the compressed form of the knowledge the model has learned.
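The storage arithmetic above is easy to check. A minimal sketch, where `model_size_bytes` is just an illustrative helper name and the FP32 and 8-bit rows are added for comparison rather than taken from the post:

```python
def model_size_bytes(num_parameters, bytes_per_parameter):
    """Raw storage for the weights alone, ignoring any file overhead."""
    return num_parameters * bytes_per_parameter

PARAMS = 10**12  # 1 trillion parameters, as in the estimate above
TB = 10**12      # decimal terabyte

for name, width in [("FP32", 4), ("FP16", 2), ("8-bit", 1)]:
    size_tb = model_size_bytes(PARAMS, width) / TB
    print(f"{name}: {size_tb:.1f} TB")
```

Halving the bytes per number halves the size of the model, which is one reason lower-precision formats are popular for running large models.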
To generate a response, the model needs to perform a large number of multiplications and additions for each word it generates. Modern compute power is incredibly advanced. For example, a single NVIDIA H200 GPU is rated at roughly 4 petaflops (quadrillion operations per second) of peak low-precision throughput (FP8, with sparsity). In a setup with 20 such GPUs, you get a combined peak of about 80 petaflops.
Considering that each word generation might involve a few trillion operations, this setup allows the model to generate words quickly. With 80 petaflops, you can theoretically perform 80 quadrillion operations per second. Dividing this by the trillions of operations needed per word, such a system could in theory generate tens of thousands of words per second. Real systems fall well short of that peak, mostly because reading the weights out of memory is slower than the math itself, but it shows the immense computational power available today.
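The back-of-the-envelope division can be written out as below. Everything here is an assumption for illustration: `peak_tokens_per_second` is a made-up helper, the 2-operations-per-parameter rule of thumb gives about 2 trillion operations per word for a 1-trillion-parameter model, and the per-GPU figures are a spread of plausible peak ratings rather than measured numbers.

```python
def peak_tokens_per_second(flops_per_gpu, num_gpus, ops_per_token):
    """Upper bound on words per second if every operation ran at peak speed."""
    return flops_per_gpu * num_gpus / ops_per_token

PETA = 10**15
OPS_PER_TOKEN = 2 * 10**12  # assumption: ~2 operations per parameter, 1e12 parameters
NUM_GPUS = 20

# Try a range of per-GPU peak ratings (in petaflops) to see how the bound moves.
for petaflops_per_gpu in (1, 4, 8):
    rate = peak_tokens_per_second(petaflops_per_gpu * PETA, NUM_GPUS, OPS_PER_TOKEN)
    print(f"{petaflops_per_gpu} PFLOPS per GPU -> {rate:,.0f} words/s at peak")
```

Whichever peak figure you plug in, the answer lands in the thousands to tens of thousands of words per second, so the overall point holds even if the exact hardware numbers are off by a few times. This is a ceiling, not what a real deployment achieves.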