Why can’t LLMs like ChatGPT calculate a confidence score when providing an answer to your question and simply reply “I don’t know” instead of hallucinating an answer?

It seems like they all happily make up a completely incorrect answer and never simply say “I don’t know”. It seems like hallucinated answers show up when there isn’t much information to train them on for a topic. Why can’t the model recognize the low amount of training data on that topic and produce a confidence score to flag when it’s probably making stuff up?

EDIT: Many people rightly point out that the LLMs themselves can’t “understand” their own responses and therefore cannot determine whether their answers are made up. But I guess the question includes the fact that chat services like ChatGPT already have support services, like the Moderation API, that evaluate the content of your query and the model’s own responses for content moderation purposes, and intervene when the content violates their terms of use. So couldn’t you have another service that evaluates the LLM’s response for a confidence score to make this work? Perhaps I should have said “LLM chat services” instead of just LLM, but alas, I did not.
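To make that concrete, here’s a rough sketch of the kind of external check I have in mind: sample the chat service several times and treat agreement between the answers as a crude confidence score. The `ask_model` function is a hypothetical stand-in for whatever chat API you’d actually call; this is just an illustration of the idea, not a real product.

```python
from collections import Counter

def confidence_by_agreement(ask_model, question, n_samples=5):
    """Crude confidence heuristic: ask the same question several times
    (with sampling enabled) and measure how often the answers agree."""
    answers = [ask_model(question) for _ in range(n_samples)]
    # Rough normalization so trivial wording differences don't count as disagreement.
    normalized = [a.strip().lower() for a in answers]
    most_common, count = Counter(normalized).most_common(1)[0]
    return most_common, count / n_samples

# Hypothetical usage:
# answer, conf = confidence_by_agreement(ask_model, "Who wrote Middlemarch?")
# if conf < 0.6:
#     answer = "I don't know."
```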

25 Answers

Anonymous

All of the other answers are wrong. It has nothing to do with whether or not the model understands the question (in some philosophical sense). The model clearly can answer questions correctly much more often than chance — and the accuracy gets better as the model scales. This behavior *directly contradicts* the “it’s just constructing sentences with no interest in what’s true” conception of language models. If they truly were just babblers, then scaling the model would lead only to more grammatical babbling. This is not what we see. The larger models are, in fact, systematically more correct, which means that the model is (in some sense) optimizing for truth and correctness.

People are parroting back criticisms they heard from people who are angry about AI for economic/political reasons without any real understanding of the underlying reality of what these models are actually doing (the irony is not lost on me). These are not good answers to your specific question.

So, why does the model behave like this? The model is trained primarily on web documents, learning to predict the next word (technically, the next token). The problem is that during this phase (which is the vast majority of its training) it only sees *other people’s work*. Not its own. So the task it’s learning to do is “look at the document history, figure out what sort of writer I’m supposed to be modelling, and then guess what they’d say next.”
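To make the pre-training objective concrete, here’s a minimal sketch of next-token prediction as a training loss, assuming PyTorch and a generic `model` that maps a token sequence to logits over the vocabulary. The names are illustrative; this isn’t any lab’s actual training code.

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    # token_ids: (batch, seq_len) integer token IDs from some tokenizer
    inputs = token_ids[:, :-1]    # everything except the last token
    targets = token_ids[:, 1:]    # the same sequence shifted left by one
    logits = model(inputs)        # (batch, seq_len - 1, vocab_size)
    # Cross-entropy between the model's predicted distribution and the token
    # that actually came next in the document. Nothing here ever asks the
    # model what *it* knows, only what the original author wrote next.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```

Note that the loss only ever compares the prediction against someone else’s text, which is the whole point: the objective never probes the model’s own knowledge.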

Later training, via SFT and RLHF, attempts to bias the model toward believing it’s predicting an authoritative technical source like Wikipedia or a science communicator. This gives you high-quality factual answers to the best of the model’s ability: the “correct answer” on the prediction task is mostly to state the actual factual truth as those sources would state it.

The problem is that the model’s weights are finite in size (dozens to hundreds of GBs). There is no way to encode every fact in the world into that amount of data, much less all the other stuff language models have to implicitly know to perform well, so the process is lossy. That means that on niche questions which aren’t heavily represented in the training set, the model has high uncertainty.

In that situation, the pre-training objective becomes really important. The model never saw its own behavior during pre-training, so it has no idea what it does and doesn’t know. The question it’s trying to answer is not “what should this model say given its knowledge”, it’s “what would the chat persona I’m pretending to be say”. So it answers based on its estimate of that persona’s knowledge, not its own. If it thinks its authoritative persona would know, but the underlying model actually doesn’t, it fails by making an educated guess, like a student guessing on a multiple-choice test. That is the dominant strategy for the task it’s actually trained on (see the toy calculation below). The model never builds knowledge about its own knowledge, because the task doesn’t incentivize it to.
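To see why guessing dominates under that objective, here’s a toy expected-score calculation (the numbers are made up purely for illustration):

```python
# Toy illustration: the prediction objective rewards matching the source text
# and gives nothing for "I don't know", so guessing always has a higher
# expected score than abstaining.
p_correct_guess = 0.25   # say the model can narrow it down to 4 plausible answers

score_if_guessing = p_correct_guess * 1.0 + (1 - p_correct_guess) * 0.0   # 0.25
score_if_abstaining = 0.0                                                 # always 0

print(score_if_guessing > score_if_abstaining)   # True: guessing wins
```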

The post-training stuff attempts to address this using RL, but as it’s currently done there’s nowhere near enough feedback signal to build that capability into the model to a high standard. The long-term answer likely involves some kind of adversarial self-play task you can throw the model into before deployment, letting it rigorously evaluate its own knowledge at a scale similar to pre-training, so that its self-knowledge becomes very fine-grained.
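As a rough picture of the missing incentive, here’s what a reward shape that actually makes “I don’t know” worthwhile might look like. This is a sketch of the incentive structure only, not how any lab actually implements its post-training.

```python
def reward(answer_correct: bool, abstained: bool) -> float:
    """Illustrative reward: penalize confident wrong answers harder than
    an honest abstention."""
    if abstained:
        return 0.0    # honest "I don't know" is neutral
    if answer_correct:
        return 1.0    # correct answer is rewarded
    return -2.0       # confident wrong answer is penalized

# Under this scoring, guessing only pays off when the model's chance of being
# right exceeds 2/3 (expected reward 3p - 2 > 0); below that, abstaining wins.
```

The hard part, as noted above, is generating enough reliable feedback of this kind to actually shape fine-grained self-knowledge.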

tl;dr: The problem is that the models are not very self-aware about what they do and don’t know, because the training doesn’t require them to be.
