All you need to know is that “Tokens” are basically just “words”.
You can think of it like each word is just its own token.
Internally it's not quite like this; tokens are more nuanced. For example, a suffix like "-s" that makes a word plural could be its own token, there is a <new paragraph here> token, etc. But unless you are building your own, you can just think of tokens as "words".
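If you want to see this for yourself, here is a rough sketch using OpenAI's tiktoken library (assuming you have it installed; the exact way any given sentence gets split depends on the vocabulary, so treat the comments as illustrative, not guaranteed):

```python
# pip install tiktoken
import tiktoken

# Load the vocabulary used by newer GPT models (an assumption; other encodings exist)
enc = tiktoken.get_encoding("cl100k_base")

text = "The cats chased a butterfly"
token_ids = enc.encode(text)                    # text -> list of integer token IDs
pieces = [enc.decode([t]) for t in token_ids]   # decode each ID back to its text piece

print(token_ids)  # a handful of integers, one per token
print(pieces)     # common words tend to be one token each; rarer words get split into pieces
```

Running something like this shows that most everyday words come out as a single token, which is why "tokens are basically words" is a fine mental model.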
Basically, when you train a large language model you have to decide what kind of data to feed it. While individual letters might seem intuitive, it turns out that for various reasons it's better to group common letter combinations together. These can range from single letters, to pairs of letters, to entire words, or even spans bridging two words.
That's what a token is: a bit of text grouped together into a single object that ChatGPT can understand.
Usually a separate system (a tokenizer) is used to define the tokens, break text into tokens, and convert back the other way. The most common combinations usually get their own single token, while less frequent ones are split up into multiple tokens.
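Here is a small sketch of that "common gets one token, rare gets several" behaviour, again assuming the tiktoken library as the tokenizer (any real tokenizer would show the same pattern, though the exact counts will vary):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Frequent words usually map to a single token; long, rare words get chopped into pieces
for word in ["the", "hello", "antidisestablishmentarianism"]:
    ids = enc.encode(word)
    print(word, "->", len(ids), "token(s):", [enc.decode([i]) for i in ids])

# And the conversion works in both directions: decoding the IDs gives back the original text
assert enc.decode(enc.encode("hello world")) == "hello world"
```

The round trip at the end is the "and vice versa" part: the same system that breaks text into tokens can stitch the tokens back into text.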