Eli5: What exactly are text burstiness and text perplexity as they relate to large language models?


I know that GPTZero measures these properties somehow when trying to determine whether a text was generated by a human or an AI, but what are they and how are they measured?


Oh boy. This is tricky to explain like you’re 5. I’ll give some examples in my own way though! PM me if you need more of an explanation.

Text burstiness is like when you talk a lot all at once, and then don’t talk for a little while. Text perplexity is like trying to figure out what someone is trying to say when they talk like that. A large language model is like a really smart robot that can understand and talk using words, and we use text burstiness and perplexity to measure how well it’s doing its job.

Text burstiness (in the context of GPTZero; it’s slightly different depending on what you’re talking about) has to do with the fact that human beings tend to write in a bursty fashion while machines tend to write in a uniform fashion. So a human-written text would have short sentences intermixed with long ones, while a computer-generated text would have mostly sentences of the same length (this is primarily what GPTZero is detecting). In terms of measuring it, you’re normally looking at the probability distribution of something like sentence length. Nice, neat distributions (like normal distributions) suggest a computer. Messy, spiky distributions suggest a human.
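To make that concrete, here’s a toy sketch of the idea (GPTZero’s actual method isn’t public, so this is just a heuristic of my own): split a text into sentences and measure how much the sentence lengths vary. A uniform text scores low; a bursty one scores high.

```python
import re
import statistics

def burstiness_score(text: str) -> float:
    """Toy burstiness proxy: variation in sentence lengths.

    Splits on sentence-ending punctuation, then returns the
    coefficient of variation (stdev / mean) of per-sentence word
    counts. Higher = more "bursty", i.e. more human-like by this
    heuristic. Not how GPTZero actually works, just an illustration.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    return statistics.stdev(lengths) / statistics.mean(lengths)

uniform = "The cat sat down. The dog ran off. The bird flew up."
bursty = "Wait. The cat sat down on the old woolen rug by the fire. Quiet."

print(burstiness_score(uniform))  # 0.0 -- every sentence is 4 words
print(burstiness_score(bursty))   # > 1 -- lengths 1, 11, 1
```

The all-four-word-sentences text scores exactly zero, while mixing a one-word sentence with a long one pushes the score way up.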

Perplexity deals with the amount of information in the text. Natural language is highly redundant – it’s what enables us to use the same language to communicate in a crowded bar and at an intimate dinner. Even if you lose some of the text, you can still understand what it says.

High perplexity can also involve external factors the AI can’t predict. Consider the sentence: “Alice awoke after Adam”. This is a fairly simple sentence, but the alliteration adds an additional layer of meaning the AI probably can’t see.

AI is also built on probabilistic models, so it struggles with text where words/phrases vary widely in how specific they need to be. Legal documents tend to have high perplexity because they use key words/phrases that have to be *exactly that*. “Involuntary manslaughter” and “unintentional killing” mean the same thing in everyday language but not in legal language.

To some extent, generative AI for text works on the principle of asking “what’s the most likely word after this one?”. In a maximally perplexing text, the AI would always guess incorrectly.