Because the specific task that these Large Language Models are trained to perform is not to “answer the question”, but to “guess the next word”. It just so happens that solving the problem of “guessing the next word” turns out to be a decent enough method for solving problems and generating answers to questions.
It can’t give you a paragraph instantly, because the paragraph is not instantly available.
It is not a rendering gimmick. It is not generating the block of text in one go and then dripping it out to the recipient purely for aesthetics. The stream is fundamentally how it works. It’s an iterative process, and you’re seeing each iteration in real time as each word is predicted. The models work by taking a body of text as a prompt and then predicting what word should come next*. Each time a new word is generated, that new word is added to the prompt, and then that whole new prompt is used in the next iteration. This is what allows successive iterations to remain “aware” of what has been generated thus far.
The UI could have been created so that this whole cycle is allowed to complete before printing the final result, but this would mean waiting for the last word to be generated, not getting the paragraph instantly. It may as well print each new word as and when it becomes available. When it gets stuck for a few seconds, it genuinely is waiting for that word to be generated.
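To make that loop concrete, here is a minimal sketch in Python. The predict_next_word function is a made-up stand-in for the real neural network, which is the part that actually takes time; everything else is just the “add the word, repeat, hand it out immediately” cycle described above.

```python
import random

def predict_next_word(text):
    # Stand-in for the real model: the real thing is a neural network that
    # scores every possible next token given everything written so far.
    return random.choice(["the", "sky", "scatters", "blue", "light", None])  # None = "I'm done"

def generate(prompt, max_words=50):
    # The iterative loop: predict a word, append it to the prompt,
    # repeat, and hand each word over the moment it exists.
    text = prompt
    for _ in range(max_words):
        word = predict_next_word(text)   # the slow, expensive step
        if word is None:                 # the model decided it is finished
            break
        text += " " + word
        yield word                       # stream it out immediately

for word in generate("Explain why the sky is blue."):
    print(word, end=" ", flush=True)
```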
*with some randomness to produce variety. Instead of always picking the single most likely word, it samples from the top candidates, with a setting called the temperature controlling how much randomness is allowed.
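Roughly what that sampling looks like, as a sketch: the real model produces a score for every token in its vocabulary, while the scores below are made up.

```python
import math
import random

def sample_next_word(scores, temperature=0.8):
    # Softmax with temperature: a low temperature almost always picks the top
    # candidate, a high temperature gives the long shots a real chance.
    weights = [math.exp(s / temperature) for s in scores.values()]
    return random.choices(list(scores.keys()), weights=weights, k=1)[0]

# Made-up scores for the word following "The cat sat on the"
scores = {"mat": 3.1, "sofa": 2.4, "roof": 1.9, "moon": 0.2}
print(sample_next_word(scores, temperature=0.8))
```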
It is actually doing it piece by piece, token by token (a token is a word or a fragment of a word), because it’s not thinking, just analyzing a huge body of text and running the statistics on what token comes next based on the prompt and the general chances of ANY token coming next. There is no intelligence in these algorithms, just messy statistics that aren’t actually the correct answer, because we aren’t looking for correctness, just plausibility.
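To make the “just statistics” idea concrete, here is a toy sketch that counts which word tends to follow which in a tiny body of text and samples from those counts. The real models use a neural network rather than raw counts and work on tokens rather than whole words, but the spirit of “predict what tends to come next” is the same.

```python
import random
from collections import Counter, defaultdict

corpus = "the cat sat on the mat and the cat slept on the sofa".split()

# Count which word tends to follow which word in the "training" text.
follows = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    follows[current][nxt] += 1

def next_word(word):
    # Pick a follower in proportion to how often it appeared after `word`.
    candidates = follows[word]
    return random.choices(list(candidates), weights=list(candidates.values()))[0]

print(next_word("the"))   # usually "cat", sometimes "mat" or "sofa"
```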
These top responses are not quite complete. Language models do not generate each word in isolation; if they did, they would show obvious signs of semantic error. The models take in the full context of the prompt and everything generated so far when deciding how to continue.
The reason you see ChatGPT display responses word by word is partly that the designers built the interface that way: they could have buffered the whole reply and shown it all at once, but they wanted you to “see” the text being generated. The streaming display is an interface decision, even though the underlying model does produce the text one token at a time.
The way the software is written, it comes up with a response one “word” at a time. I put word in quotes because sometimes the next “word” is not really a word that you see on the screen. For example, it could be a special marker that just means “this is the end of the message”.
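If you are curious what those “words” actually look like, OpenAI publishes its tokenizer as the tiktoken library, and you can peek at them yourself. A quick sketch (the encoding name below is the one used by GPT-3.5/GPT-4 era models and may change for newer ones):

```python
import tiktoken  # pip install tiktoken

# Encoding used by the GPT-3.5/GPT-4 family of models.
enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode("Unbelievably complicated words get split into pieces.")
print(tokens)                             # integer token IDs
print([enc.decode([t]) for t in tokens])  # fragments of words, punctuation, spaces
```

There are also special tokens, such as <|endoftext|>, whose only job is to mean “this is the end of the message”.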
Each word takes a lot of computation. That requires time, energy, computing resources such as CPUs and GPUs running on a server somewhere, and cooling. Compared to other things that computers do, computing the next word with a model like GPT-4 takes a huge amount of computation, and that is multiplied by however many people are using the service at the same time.
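For a rough sense of scale, a common rule of thumb is about 2 floating-point operations per model parameter for each generated token. The parameter count below is an assumption (GPT-3 scale), since the size of the current models has not been published:

```python
parameters = 175e9                 # assumed model size; the real figure is not public
flops_per_word = 2 * parameters    # rule of thumb: ~2 FLOPs per parameter per generated token
print(f"per word:        {flops_per_word:.1e} floating-point operations")        # ~3.5e+11
print(f"500-word answer: {flops_per_word * 500:.1e} floating-point operations")  # ~1.8e+14
```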
If it waited and sent the entire message at once, the reader would just be sitting there with nothing to read while it was generated. So it sends one word at a time, so you can start reading even while it’s still writing. Another benefit is that you can see it is successfully writing and not just stuck.
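That streaming reaches all the way to the public API as well. Here is a rough sketch of what reading a streamed reply looks like with the OpenAI Python client; the model name is just an example, and the exact interface may differ between client versions.

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

stream = client.chat.completions.create(
    model="gpt-4o-mini",  # example model name
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    stream=True,          # ask for the reply piece by piece instead of all at once
)

for chunk in stream:
    piece = chunk.choices[0].delta.content
    if piece:                          # some chunks carry no text (e.g. the final one)
        print(piece, end="", flush=True)
```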