ELI5: How does a program like ChatGPT actually “learn” something?


Is there a fundamental difference between how a program like ChatGPT learns and how a human learns?


At the basic level, the weights in its giant neural network are adjusted so that it produces the expected (or roughly the expected) output.
Simplified a bit, its “guessing algorithm” is retuned so that it gives a specific output for a given input.

Machine-learning systems like GPT “learn” through a number of techniques, none of which is anything like human learning. It’s more like breeding animals.

Imagine an AI as a grid of numbers.

Start with 100,000 of these AI grids and fill them with random numbers.

Ask these grids to do something useful, like answer a question that you already know the answer to.

“Kill” all the AIs that get the answer wrong.

Take the ones that got the answer right and let them “breed”, creating new AIs with mixes of the numbers that got that answer right, plus some randomness added in.

Repeat that thousands or millions of times, with millions of questions that you already know the answers to.

The “descendants” (keeping with the animal metaphor) that survive can reliably answer questions correctly.
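The breeding steps above can be sketched in a few lines of Python. This is only a toy illustration of the "kill and breed" idea, not how any real chatbot is trained: the population size, the mutation amount, the scoring rule, and the tiny task (guessing the hidden rule answer = 2*a + 3*b) are all assumptions made up for the example.

```python
import random

def fitness(weights, questions):
    """Higher is better: negative squared error on questions with known answers."""
    error = 0.0
    for inputs, answer in questions:
        guess = sum(w * x for w, x in zip(weights, inputs))
        error += (guess - answer) ** 2
    return -error

def breed(parent_a, parent_b, rng, mutation=0.1):
    """Mix the parents' numbers, then add a little randomness to each one."""
    return [rng.choice(pair) + rng.gauss(0, mutation)
            for pair in zip(parent_a, parent_b)]

def evolve(questions, n_weights, population_size=100, generations=200, seed=0):
    rng = random.Random(seed)
    # Start with grids full of random numbers.
    population = [[rng.uniform(-1, 1) for _ in range(n_weights)]
                  for _ in range(population_size)]
    for _ in range(generations):
        # Rank every grid by how well it answers the known questions.
        population.sort(key=lambda w: fitness(w, questions), reverse=True)
        survivors = population[: population_size // 5]  # "kill" the worst 80%
        children = [breed(rng.choice(survivors), rng.choice(survivors), rng)
                    for _ in range(population_size - len(survivors))]
        population = survivors + children
    return max(population, key=lambda w: fitness(w, questions))

# Hypothetical toy task: the hidden rule is answer = 2*a + 3*b.
QUESTIONS = [((1, 0), 2), ((0, 1), 3), ((1, 1), 5), ((2, 1), 7)]
best = evolve(QUESTIONS, n_weights=2)
print(best)  # the two numbers should end up close to 2 and 3
```

Nothing in the loop "understands" the rule. The numbers that happen to score well simply get copied forward, which is the point of the animal-breeding comparison.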

For ChatGPT, the “questions” are what various types of writing on a variety of topics look like, and the “answers” are text that looks like a valid sentence, short essay, white paper, etc. That text doesn’t have to be factual or correct, because correctness isn’t what the training is looking for.

It’s important to note that it’s dicey to talk about AI with the same words we apply to humans. “Learning” really isn’t the same thing for a computer program, but it’s analogous.

There are different kinds of AI, and ChatGPT is one type called a Large Language Model (LLM). Oversimplifying, it takes giant amounts of text, especially conversations, and builds a model of what an actual conversation looks like. Plus, it takes all that text as a database, kind of like Google does when it indexes to give you search results. Google doesn’t really “know” anything, it just has a good database and a good algorithm for returning search results.

It’s also important to note that these LLMs are a bit better at having a conversation that looks right than at providing accurate information. In fact, if the text they’re trained on is consistently wrong about something, they will be too. They also make up text that fits their model of what an answer should look like, and sometimes what they make up is nonsense.
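A very rough way to see why this happens is a toy "next word" predictor. The sketch below is a simple word-pair counter, which is far simpler than the neural network inside a real LLM and is only a stand-in for the idea. The function names and the tiny made-up corpus are assumptions for illustration.

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count, for every word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts

def continue_text(counts, start, length=5):
    """Repeatedly append the most common follower of the last word."""
    words = [start]
    for _ in range(length):
        options = counts.get(words[-1])
        if not options:
            break
        words.append(options.most_common(1)[0][0])
    return " ".join(words)

# Made-up training text that is wrong three times out of four.
CORPUS = "the moon is made of cheese . " * 3 + "the moon is made of rock ."
model = train_bigrams(CORPUS)
print(continue_text(model, "the"))  # the moon is made of cheese
```

The predictor has no notion of true or false. It repeats whatever its training text said most often, so a consistently wrong corpus produces a fluent but wrong continuation.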

It’s a cool advancement, and they’ll keep getting refined and improved, but there are for sure shortfalls.

ChatGPT has for reference the entire internet as it existed up to 2021. You can go search Google for the string “Is there a fundamental difference between…” Now, in several examples (all of them, actually), look at the text after that. Further, look at how that phrase ‘leads to’ the words ChatGPT and human. ChatGPT makes a summary of the text it finds related to the words you’re interested in, and returns that to you.

ChatGPT is Kim Peek meets Chauncey Gardiner: it knows everything, and can relate and discuss it, but doesn’t know what it knows, or why.

Others have given a good analogy for genetic algorithms (GA), but ChatGPT learns using gradient descent, not GA. For gradient descent, the usual analogy is this:

Imagine you are a blind person stranded on the side of a mountain. You need to find water, and know that rivers usually run through valleys. How do you find your way down the mountain? Well, you can take tiny little steps around you, feeling for the direction of the slope. Once you find the direction that points downhill, you walk in that direction for a little while, and repeat the process in your new location to see if the direction of ‘downhill’ has changed. You can continue this process until you reach the river.

In this analogy, your current location represents the weights of the neural network. The downhill direction is the (negative) gradient. How far you walk before checking the slope again corresponds to the learning rate. The topography of the mountainside is your cost function, which measures how good your model’s predictions are. When you reach the river, you’ve minimized the cost function, and the neural network is generally pretty good at producing the outputs you want it to.
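Here is a minimal sketch of the mountain walk in Python. The bowl-shaped `height` function, the starting point, and the step settings are all made up for the example. The slope is estimated by "feeling" with tiny steps (finite differences), which mirrors the analogy. Real neural-network training computes gradients exactly with backpropagation instead.

```python
def height(position):
    """Hypothetical mountainside: a bowl whose lowest point is at (3, -1)."""
    x, y = position
    return (x - 3.0) ** 2 + (y + 1.0) ** 2

def feel_slope(f, position, tiny_step=1e-6):
    """Estimate the gradient by taking tiny steps in each direction."""
    grad = []
    for i in range(len(position)):
        forward = list(position)
        backward = list(position)
        forward[i] += tiny_step
        backward[i] -= tiny_step
        grad.append((f(forward) - f(backward)) / (2 * tiny_step))
    return grad

def descend(f, start, learning_rate=0.1, steps=200):
    """Repeatedly feel the slope, then walk a little way downhill."""
    position = list(start)
    for _ in range(steps):
        grad = feel_slope(f, position)
        position = [p - learning_rate * g for p, g in zip(position, grad)]
    return position

river = descend(height, start=[-4.0, 5.0])
print(river)  # should be very close to [3.0, -1.0]
```

In this sketch `learning_rate` scales how far each step moves. Too small and the walk takes forever, too large and you overshoot the valley.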

How does the network know what you want it to output, though? In classic supervised learning, the input data has been manually annotated by humans. (For a language model like GPT, most of the training needs no manual labels: the “annotation” is simply the next word in the text.) The cost function calculates how far away the model’s predictions are from those annotations. When we look for the ‘downhill’ direction, we calculate the direction in which we can change the weights to reduce this cost. By reducing the cost, the neural network’s predicted values move closer to the annotations.
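To make the cost function concrete, here is a small sketch with a two-weight "network" (a straight line, prediction = w*x + b) and four annotated examples. The data, the mean-squared-error cost, and the settings are assumptions chosen for illustration. Real models have billions of weights, but the loop has the same shape.

```python
def cost(w, b, examples):
    """Mean squared distance between predictions and the annotations."""
    return sum((w * x + b - y) ** 2 for x, y in examples) / len(examples)

def train(examples, learning_rate=0.05, steps=2000):
    w, b = 0.0, 0.0  # arbitrary starting weights
    n = len(examples)
    for _ in range(steps):
        # Direction in which the cost increases, for each weight.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in examples) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in examples) / n
        # Step the other way: downhill.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Hypothetical annotated data following y = 2x + 1.
EXAMPLES = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = train(EXAMPLES)
print(w, b, cost(w, b, EXAMPLES))  # w near 2, b near 1, cost near 0
```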

But what if, on your way down the mountain, you get stuck in a hole and can’t get out of it? Then you’re stuck! Gradient descent is only guaranteed to find a local minimum; it does not guarantee that you find the global minimum of the entire ‘cost landscape’. There are some techniques to combat this, like periodically increasing the learning rate so you can occasionally take more steps to try to get out of any ‘holes’ you’ve found yourself in.
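A one-dimensional sketch shows the 'hole' problem. The cost curve below is made up for the example: it has a shallow dip near x ≈ 1.13 and a deeper one near x ≈ -1.30. Plain gradient descent ends up in whichever dip is downhill from where it starts.

```python
def cost(x):
    """Hypothetical cost curve with two dips of different depth."""
    return x**4 - 3 * x**2 + x

def slope(x):
    """Derivative of the cost curve."""
    return 4 * x**3 - 6 * x + 1

def descend(start, learning_rate=0.01, steps=1000):
    x = start
    for _ in range(steps):
        x -= learning_rate * slope(x)
    return x

stuck = descend(2.0)    # starts on the right, settles in the shallow dip
lucky = descend(-2.0)   # starts on the left, settles in the deeper dip
print(stuck, cost(stuck))
print(lucky, cost(lucky))
```

Both runs stop because the slope is zero where they end up, but only the second one found the lowest point of the whole curve.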

What if, on your way down, you find yourself on an extremely flat plateau, where there is no ‘downhill’ direction? Also stuck! This is called a ‘vanishing gradient’, a problem that really plagued early ML models. A lot of resources have been poured into making network architectures that are robust against this.
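One common source of such plateaus can be sketched with a chain of sigmoid units. The setup is an assumption made for illustration (one neuron per layer, every weight equal to 1). Each sigmoid layer multiplies the gradient by at most 0.25, so after many layers almost no 'slope' is left for the early weights to follow.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_through_layers(x, n_layers, weight=1.0):
    """d(output)/d(input) for a chain of single-neuron sigmoid layers."""
    grad = 1.0
    activation = x
    for _ in range(n_layers):
        activation = sigmoid(weight * activation)
        # Chain rule: the sigmoid's derivative is a * (1 - a), never above 0.25.
        grad *= weight * activation * (1.0 - activation)
    return grad

for depth in (1, 5, 10, 20):
    print(depth, gradient_through_layers(0.5, depth))
```

The printed gradient shrinks by roughly a factor of four or more with every extra layer, which is the flat plateau from the analogy.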