eli5- How is chatGPT able to take tests like the bar exam, GRE and SAT?

I’ve seen a bunch of articles talking about how chatGPT has passed various standardized tests but I’ve never seen anyone talk about the process. Is there a time limit for chatGPT to take the test? Is there someone there entering the test questions manually into ChatGPT as questions come up?

17 Answers

Anonymous 0 Comments

The [GPT-4 paper](https://arxiv.org/pdf/2303.08774.pdf) talks a bit about this in Appendix A. As far as I understand, they did something like this:

1. Start a new chat for every question.
2. In that chat, start with several examples of correctly answered questions.
3. After those examples, write the “next” question and let the model continue writing this “list of exam problems with answers”.

Of course there are many possible ways to implement this, some more effective than others. You can try it yourself!
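Here’s a rough sketch of what that setup could look like in code, using OpenAI’s Python client. To be clear, this isn’t their actual evaluation harness; the questions, the model name, and the settings are placeholders I made up:

```python
# Rough sketch of the few-shot setup described above -- NOT OpenAI's actual
# evaluation harness. The questions, model name, and settings are placeholders.
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Step 2: a few correctly answered example questions ("shots") to set the format.
examples = """Question: Which amendment guarantees the right to a jury trial in criminal cases?
(A) First  (B) Fourth  (C) Sixth  (D) Eighth
Answer: C

Question: What is the standard of proof in a criminal trial?
(A) Preponderance of the evidence  (B) Clear and convincing evidence
(C) Beyond a reasonable doubt  (D) Probable cause
Answer: C
"""

# Step 3: append the "next" question and let the model continue the list.
new_question = """Question: Which clause lets Congress regulate interstate commerce?
(A) Supremacy Clause  (B) Commerce Clause  (C) Due Process Clause  (D) Equal Protection Clause
Answer:"""

# Step 1: each question gets its own fresh chat, so nothing carries over.
response = client.chat.completions.create(
    model="gpt-4",  # placeholder model name
    messages=[{"role": "user", "content": examples + "\n" + new_question}],
    max_tokens=5,
    temperature=0,
)
print(response.choices[0].message.content)  # e.g. "B"
```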

One common mistake is to ask it to write the answer first and the argumentation afterwards. Remember, it writes from left to right and has no internal “short-term memory”, so if you do that, it is forced to guess the answer first and then write a rationalization of why it is correct.
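For example (hypothetical wording, not from the paper), compare these two ways of phrasing the same prompt:

```python
# Two ways to phrase the same request (hypothetical wording, not from the paper).

# Risky: the answer letter gets generated before any reasoning exists, so the
# explanation that follows is just a rationalization of that guess.
answer_first = (
    "Question: {question}\n"
    "Give the answer letter, then explain why it is correct.\n"
    "Answer:"
)

# Better: the reasoning is written left to right first, and only then does the
# model commit to a letter, so the answer can actually depend on the reasoning.
reasoning_first = (
    "Question: {question}\n"
    "Work through the problem step by step, then give the answer letter on the last line.\n"
    "Reasoning:"
)
```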

Anonymous 0 Comments

It makes a ton of sense that they do well, too. ChatGPT taking the bar exam is like letting students flip through Black’s Law Dictionary for the whole exam.

Anonymous 0 Comments

This is an ELI5 that needs multiple parts.

Part 1 – How does ChatGPT know what words mean?
(Alternative title: King – Man + Woman = Queen)

ChatGPT is a computer, and a computer only understands numbers. So if we want it to also understand words, the computer has to convert the words to numbers. Which is weird when you think about it. How do you even do that?

Let’s think about what a number is. It’s something that tells you the difference between big and small. 3 is small. 7 is bigger. If you make a scale for some idea, you can place the words that mildly describe that idea at one end, and the words that describe it more intensely at the other.

For example, let’s make a scale for happiness. “Pleasant” is a bit happy, so let’s give it a 3/10. “Ecstatic” is very happy! Let’s give it a 9/10. You can survey people and ask them to rate some words on how happy they are to get these scores, and now you’ll have a place to start.

But what about the word “ice-cream”? It doesn’t mean happiness exactly, but everyone who eats ice cream is happy, right? So how do they get the computer to understand that without doing another survey?

What they do is show the computer lots of books and lots of websites. It looks for the word “ice-cream” in those texts, and then looks at the words nearby to see how happy they are. So if a book has the sentence “Eating ice-cream made him feel ecstatic!”, “ice-cream” is only four words away from “ecstatic” (which we know is a happy word), so “ice-cream” is probably a happy word too.

But notice that “feel” is even closer to “ecstatic”, being right next door in the sentence. “Feel” isn’t a particularly happy word, not like ice-cream, so how does the computer know that? Because out of all the times “ice-cream” appears across all the books, it sits near happy words far more often than “feel” does.

So the computer knows that ice-cream is a happy word, and can use it as a marker like ecstatic was.
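Here’s a toy version of that idea in Python. This is not how real embeddings are trained, and the corpus, the seed words, and the scores are all made up, but it shows the “score a word by its neighbours” trick:

```python
# Toy version of the idea above -- not how real embeddings are trained.
# Seed scores (pretend they came from a survey) and a tiny made-up "corpus".
happiness = {"pleasant": 3, "delighted": 8, "ecstatic": 9, "miserable": 1}

corpus = [
    "eating ice-cream made him feel ecstatic",
    "she was delighted by the ice-cream truck",
    "he felt miserable about the rainy weather",
]

WINDOW = 4  # how many words on either side count as "nearby"

def estimate_happiness(word: str) -> float:
    """Average the happiness of seed words that appear within WINDOW words of `word`."""
    scores = []
    for sentence in corpus:
        tokens = sentence.split()
        for i, tok in enumerate(tokens):
            if tok != word:
                continue
            for neighbour in tokens[max(0, i - WINDOW) : i + WINDOW + 1]:
                if neighbour in happiness:
                    scores.append(happiness[neighbour])
    return sum(scores) / len(scores) if scores else float("nan")

print(estimate_happiness("ice-cream"))  # 8.5 -- its scored neighbours are "ecstatic" (9) and "delighted" (8)
print(estimate_happiness("weather"))    # 1.0 -- its only scored neighbour is "miserable"
```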

Then they make other scales. They make a scale for sad words. They make a scale for words about particular topics, like cars. They make scales to classify words into nouns vs adjectives vs verbs. They make scales for how old things are.

And then they smoosh some scales together. If words on one scale get the same score as they do on another scale, then you don’t need both scales.

Eventually, you end up with a great big spreadsheet table. Each row is about a single word. Each column is one of the scales. And every cell in the spreadsheet tells you the score for that row’s word on that column’s scale. The ice-cream row has a “9” in the happiness column, a score meaning “noun” in the parts-of-speech column, and so on. This is called an embedding. A few years ago, the state-of-the-art embedding was called GloVe; it covered hundreds of thousands of English words, with 200 columns of scores for each word.
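If you want to poke at a real one, the published GloVe files are literally that spreadsheet: one word per line, followed by its scores. A small loading sketch, assuming you’ve downloaded `glove.6B.200d.txt` from the Stanford NLP site:

```python
# Loading the published GloVe table -- assumes glove.6B.200d.txt has been
# downloaded from the Stanford NLP site. Each line is a word followed by its 200 scores.
import numpy as np

def load_embeddings(path: str) -> dict[str, np.ndarray]:
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *scores = line.rstrip().split(" ")
            table[word] = np.array(scores, dtype=np.float32)
    return table

embeddings = load_embeddings("glove.6B.200d.txt")
print(embeddings["king"].shape)  # (200,) -- one number per column/scale
```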

Amazing things came out of that.

You could do maths on it! If you took the scores for the word “King”, subtracted the scores for the word “man”, and added the scores for the word “woman”, you’d get scores almost exactly matching the word “Queen”.
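Here’s roughly what that arithmetic looks like with the same GloVe table as in the sketch above. In practice you don’t land exactly on “Queen”; you just land closer to it than to any other word, so you pick the nearest one:

```python
# King - man + woman, using the same GloVe file as the previous sketch.
import numpy as np

def load_embeddings(path):
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *scores = line.rstrip().split(" ")
            table[word] = np.array(scores, dtype=np.float32)
    return table

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

embeddings = load_embeddings("glove.6B.200d.txt")
target = embeddings["king"] - embeddings["man"] + embeddings["woman"]

# Find the word whose scores are closest to the result, skipping the inputs.
best = max(
    (w for w in embeddings if w not in {"king", "man", "woman"}),
    key=lambda w: cosine(embeddings[w], target),
)
print(best)  # typically "queen"
```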

You could also translate with it! If you took the scores for the word “ice-cream” and looked for the word with the closest scores in a French version of the table, you’d find the French word for “ice-cream”. After all, a word is just a way of expressing an idea, and if you have enough scales, you can capture the essence of that idea.
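A toy illustration of the translation idea, with made-up scores rather than real embeddings (in real systems the two tables have to be trained or aligned so that their columns mean the same thing; here we just pretend they already do):

```python
# Toy illustration of the translation idea, with made-up scores (not real embeddings).
import numpy as np

english = {"ice-cream": np.array([9.0, 2.0, 7.0]), "rain": np.array([2.0, 8.0, 1.0])}
french  = {"glace":     np.array([8.8, 2.1, 7.2]), "pluie": np.array([2.1, 7.9, 0.9])}

def translate(word: str) -> str:
    """Return the French word whose scores are closest to the English word's scores."""
    target = english[word]
    return min(french, key=lambda w: float(np.linalg.norm(french[w] - target)))

print(translate("ice-cream"))  # glace
print(translate("rain"))       # pluie
```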

And that explains how words can be converted into numbers. The next step is to explain how to convert those numbers into sentences. But that’s for another ELI5.

Anonymous 0 Comments

Speaking for the medical licensing exam: I could get an outrageously good score too, just by googling the answers. I suspect someone with no medical knowledge could pass on that basis alone.

It’s mostly a memorization test of various medical terms, diseases, tests, diagnoses, etc.

Anonymous 0 Comments

We know OP would not pass these exams

Anonymous 0 Comments

This post is a great example of the total ignorance about what an LLM actually is and what it does.

Anonymous 0 Comments

How? Aren’t the test items proprietary? Maybe practice tests?