It seems like they all happily make up a completely incorrect answer and never simply say “I don’t know”. Hallucinated answers seem to show up when there isn’t much training data on a topic. Why can’t the model recognize how little training data it has and attach a confidence score, so it’s clear when it’s making stuff up?
EDIT: Many people rightly point out that the LLMs themselves can’t “understand” their own responses and therefore cannot determine whether their answers are made up. But I guess the question includes the fact that chat services like ChatGPT already have support services like the Moderation API that evaluate the content of your query and its own responses for content moderation purposes, and intervene when the content violates their terms of use. So couldn’t you have another service that evaluates the LLM response for a confidence score to make this work? Perhaps I should have said “LLM chat services” instead of just LLM, but alas, I did not.
In order to generate a confidence score, it’d have to understand your question, understand its own generated answer, and understand how to calculate probability. (To be more precise, the probability that its answer is going to be factually true.)
That’s not what ChatGPT does. What it does is figure out which sentence a person is most likely to say in response to your question.
If you ask ChatGPT “How are you?” it replies “I’m doing great, thank you!” This doesn’t mean that ChatGPT is doing great. It’s a mindless machine and can’t be doing great or poorly. All that this answer means is that, according to ChatGPT’s data, a person who’s asked “How are you?” is likely to speak the words “I’m doing great, thank you!”
So if you ask ChatGPT “How many valence electrons does a carbon atom have?” and it replies “A carbon atom has four valence electrons,” then you gotta understand that ChatGPT isn’t saying a carbon atom has four valence electrons.
All it’s actually saying is that a person that you ask that question is likely to speak the words “A carbon atom has four valence electrons” in response. It’s not saying that these words are true or false. (Well, technically it’s stating that, but my point is you should interpret it as a statement of what people will say.)
**tl;dr: Whenever ChatGPT answers something you asked, you should imagine that its answer is followed by “…is what people are statistically likely to say if you ask them this.”**
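To make that concrete, here’s a minimal sketch using a small open model (GPT-2 via Hugging Face transformers) as a stand-in for ChatGPT, whose internals aren’t public. The prompt and model choice are just illustrative assumptions; the point is that the output is only “text that statistically tends to follow this text”, not a report about how the model feels:

```python
# Illustrative sketch only: a small open model standing in for ChatGPT.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("How are you? I'm", return_tensors="pt").input_ids
out = model.generate(
    ids,
    max_new_tokens=10,
    do_sample=False,                      # greedy: always pick the most likely next token
    pad_token_id=tokenizer.eos_token_id,  # silence the missing-pad-token warning
)
print(tokenizer.decode(out[0]))
# Whatever it prints is just what text tends to follow the prompt in its
# training data, not a statement about the model's "mood" or the facts.
```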
Hallucination isn’t a distinct process. The model works the same way in every situation; practically speaking, it’s always hallucinating.
We just don’t call the answers hallucinations when we like them. But the LLM didn’t do anything differently to get the wrong answer.
It doesn’t know it’s making the wrong shit up as it’s always just making shit up.
When assessing the reliability of a source there are three important questions to ask:
– Who wrote it and are they an expert in this area? If you can’t identify who wrote it and whether they have a relevant qualification, then it is unreliable.
– Who checked it, and are they also an expert in this area? Even experts can be wrong, so having someone check the information is important, and they should also be an expert.
– How old is the information? Older information tends to be more unreliable, while newer information tends to be more reliable.
Now it is a sad fact that most information on the internet doesn’t meet even the first of these standards. Posts tend to be anonymous, and even when someone does provide a name and a claimed qualification it is nearly impossible to verify. And most posts are upvoted or downvoted by people who really have no clue what they’re on about. In reality often the correct answer is downvoted into oblivion because it is complicated and comes across as talking down to people, or simply doesn’t contain enough jokes or sarcasm.
The best source for reliable information is academic articles, but the big problem with training AI models on these is that the AI is going to end up copying their style, and it is unreadable to most people. So instead they train AIs on garbage. And as the old computing truism goes, “garbage in, garbage out” – if you train the AI on unreliable information you get unreliable answers.
The sad fact here is that what people **want** is the illusion of being informed. They want a nice simple answer to a complicated question in the minimum time. Most people don’t know the difference between a true answer and a lie…. and don’t care – just look at the TIL (today I learned) subreddit, where almost every piece of “information” I’ve read that touches on my field of expertise is wrong in important ways.
And this is why AIs don’t actually use reliable information. Because people don’t actually want it.
They can, and some do. There are two main approaches: one focuses on model explainability, and the other on more classical confidence scoring of the kind standard classifiers have, usually via techniques such as reflection, where the model is asked to evaluate its own answer (rough sketch below).
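A hedged sketch of what a reflection-style check could look like, using the OpenAI Python client as an example. The prompt wording, model name, and 0–100 scale are illustrative assumptions, not any vendor’s actual setup:

```python
# Hedged sketch of reflection-based confidence: a second call grades the first
# answer. Prompt, model name, and scoring scale are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer_with_confidence(question: str, model: str = "gpt-4o-mini"):
    answer = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    ).choices[0].message.content

    judge_prompt = (
        "On a scale of 0-100, how likely is the following answer to be "
        f"factually correct? Reply with only a number.\n\nQ: {question}\nA: {answer}"
    )
    score = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": judge_prompt}],
    ).choices[0].message.content

    return answer, score

print(answer_with_confidence("How many valence electrons does a carbon atom have?"))
```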
These approaches usually live at the system level. You can also extract token probability distributions from most models, but you usually won’t be able to use them directly to produce an overall “confidence score”.
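For a sense of what that raw signal looks like, here is one way to pull per-token probabilities out of an open model with Hugging Face transformers and average them over an answer. The model choice and the averaging are assumptions for illustration; this is not a calibrated confidence score:

```python
# Sketch: score an answer by the probabilities the model assigned to its tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any open causal LM exposes the same interface
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Q: How many valence electrons does a carbon atom have?\nA:"
answer = " A carbon atom has four valence electrons."

prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
full_ids = tokenizer(prompt + answer, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(full_ids).logits  # shape: (1, seq_len, vocab_size)

# Log-probability the model assigned to each actual token, given the tokens before it.
log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
token_log_probs = log_probs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)

# Keep only the answer tokens and average their probabilities as a rough proxy.
answer_log_probs = token_log_probs[0, prompt_len - 1 :]
print("mean per-token probability:", answer_log_probs.exp().mean().item())
```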
That said, you usually shouldn’t expect to see any of those details if you only consume the model via an API. Providers don’t want to expose metrics at that level of detail, since they can be employed in certain attacks against models, including model extraction and dataset-inclusion (membership) disclosures.
As far as the “I don’t know” part, you can definitely fine-tune an LLM to do that; however, its usefulness in most settings would then drastically decrease.
Hallucinations are actually quite useful. It’s quite likely that our own cognitive processes do the same thing: we fill gaps and recall incorrect facts all the time.
Tuning hallucinations out seems to drastically reduce the performance of these models in zero-shot settings which are highly important for real world applications.
Because what they are doing is building sentences via a probability model of what words follow each other most often in a specific context. ChatGPT doesn’t know the meaning of any of the words it strings together and cannot ever know them. ChatGPT even cuts and pastes entire paragraphs from random sources that seem to mathematically match the prompt it was given.
They’re very good at building patterns of words that work together, but any human reading the output can spot the nonsense with even a little research.
Any ‘confidence’ score would rely on understanding meaning, and can therefore only be applied or estimated by humans.
All of the other answers are wrong. It has nothing to do with whether or not the model understands the question (in some philosophical sense). The model clearly can answer questions correctly much more often than chance — and the accuracy gets better as the model scales. This behavior *directly contradicts* the “it’s just constructing sentences with no interest in what’s true” conception of language models. If they truly were just babblers, then scaling the model would lead only to more grammatical babbling. This is not what we see. The larger models are, in fact, systematically more correct, which means that the model is (in some sense) optimizing for truth and correctness.
People are parroting back criticisms they heard from people who are angry about AI for economic/political reasons without any real understanding of the underlying reality of what these models are actually doing (the irony is not lost on me). These are not good answers to your specific question.
So, why does the model behave like this? The model is trained primarily on web documents, learning to predict the next word (technically, the next token). The problem is that during this phase (which is the vast majority of its training) it only sees *other people’s work*. Not its own. So the task it’s learning to do is “look at the document history, figure out what sort of writer I’m supposed to be modelling, and then guess what they’d say next.”
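In code terms, that pre-training objective looks roughly like the sketch below (a simplified illustration with GPT-2 standing in; real pipelines add batching, masking, and enormous scale). The only thing the model is ever scored on is how well it predicts the next token of someone else’s text:

```python
# Simplified sketch of next-token prediction: the loss only measures how well
# the model guesses token t+1 from tokens 0..t of existing text. The model and
# example sentence are illustrative stand-ins.
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("The capital of France is Paris.", return_tensors="pt").input_ids
logits = model(ids).logits  # (1, seq_len, vocab_size)

loss = F.cross_entropy(
    logits[:, :-1, :].reshape(-1, logits.size(-1)),  # predictions for positions 1..n-1
    ids[:, 1:].reshape(-1),                          # the tokens that actually came next
)
print(loss.item())  # lower loss = the model found this text more predictable
```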
Later training, via SFT and RLHF, attempts to bias the model to believe that it’s predicting an authoritative technical source like Wikipedia or a science communicator. This gives you high-quality factual answers to the best of the model’s ability. The “correct answer” on the prediction task is mostly to provide the actual factual truth as it would be stated in those sources.
The problem is that the model’s weights are finite in size (dozens to hundreds of GBs). There is no way to encode all the facts in the world into that amount of data, much less all the other stuff language models have to implicitly know to perform well. So the process is lossy. Which means that when dealing with niche questions that aren’t heavily represented in the training set, the model has high uncertainty.
In that situation, the pre-training objective becomes really important. The model hasn’t seen its own behavior during pre-training. It has no idea what it does and doesn’t know. The question it’s trying to answer is not “what should this model say given its knowledge”, it’s “what would the chat persona I’m pretending to be say”. So it’s going to answer based on its estimates of that persona’s knowledge base, not its own knowledge. If it thinks its authoritative persona would know something, but the underlying model actually doesn’t, it’ll fail by making educated guesses, like a student guessing on a multiple-choice test. This is the dominant strategy for the task it’s actually trained on. The model doesn’t actually build knowledge about its own knowledge, because the task does not incentivize it to do so.
The post-training stuff attempts to address this using RL, but there’s just not nearly enough feedback signal to build that capability into the model to a high standard given how it’s currently done. The long-term answer likely involves building some kind of adversarial self-play task that you can throw the model into to let it rigorously evaluate its own knowledge before deployment, on a scale similar to what it gets from pre-training, so it can be very fine-grained in its self-knowledge.
tl;dr: The problem is that the models are not very self aware about what they do and don’t know, because the training doesn’t require them to be.