Benford’s Law


Can someone explain Benford’s Law to me? I get that certain numbers show up more often in large data sets, but why?


8 Answers

Anonymous

There is currently only one answer that gives the actual reason instead of a nice-sounding but misleading one, so I want to try to add a second point of view:

Benford’s Law is typically stated for data that span multiple, ideally many, _orders of magnitude_; that is, some values are many times larger than others instead of all being close together (+). For example, take the lengths of rivers, the areas of countries, or the sizes of asteroids, planets and stars: they come in widely different sizes!
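A quick way to see the effect (my own illustration, not one of the data sets above): powers of 2 grow across hundreds of orders of magnitude, and their leading digits already follow Benford’s Law closely.

```python
from collections import Counter

# Leading (first) digits of 2^0 .. 2^999; these values span
# roughly 300 orders of magnitude.
leading = Counter(str(2 ** n)[0] for n in range(1000))

for digit in "123456789":
    print(digit, leading[digit] / 1000)
```

About 30% of these powers start with a 1, and the frequencies fall off towards 9, just as the law predicts.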

Next, we need at least a bit of what is called _scale-invariance_: unlike in many simpler probability problems, we don’t assume that all numbers are equally likely, but instead that larger ones are rarer. More precisely, we want roughly the same chance of landing between x and 2x as of landing between y and 2y, regardless of what x and y are. So for example, 1000 to 2000 is as likely as 2000 to 4000, and as 4000 to 8000 (and also 3000 to 6000).
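A small simulation of this property (a sketch assuming a log-uniform distribution on [1, 10^6], a textbook scale-invariant model; the range is my own choice): each doubling interval should capture about the same fraction of samples.

```python
import random

random.seed(0)
N = 200_000

# Log-uniform sample: the exponent is uniform, so the values are
# scale-invariant within [1, 10^6]. Each doubling interval [x, 2x]
# then has probability log10(2) / 6, about 5%.
data = [10 ** random.uniform(0, 6) for _ in range(N)]

def frac_between(lo, hi):
    return sum(lo <= x < hi for x in data) / N

print(frac_between(1000, 2000))  # each roughly 0.05
print(frac_between(2000, 4000))
print(frac_between(3000, 6000))
```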

If those two conditions are (almost) satisfied, then Benford’s Law states that the leading digit 1 is much more likely (~30%) than any other, and that larger digits are progressively less common (note that 0 can never be the _leading_ digit, except perhaps for the single number 0).
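The precise distribution behind this statement (a standard fact about Benford’s Law, not spelled out above) is P(d) = log10(1 + 1/d) for leading digit d:

```python
import math

# Benford's Law: probability that the leading digit equals d.
def benford(d):
    return math.log10(1 + 1 / d)

for d in range(1, 10):
    print(d, round(benford(d), 3))
```

This gives ~30.1% for the digit 1, falling to ~4.6% for the digit 9, and the nine probabilities sum to 1.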

There can, provably, be no _perfect_ example of scale-invariance in probability; especially, but not only, in reality, where things are bounded in size by the quantum microcosm below and by the size of the known universe above. But between those lie dozens of orders of magnitude, easily more than enough. And physics “in the middle”, e.g. at human scales, is itself pretty much scale-invariant, so whatever nature creates has a strong tendency to be scale-invariant as well.

**And for _why_ it happens:**

A number with leading digit 1 lies between [1 and 2], or [10 and 20], or [100 and 200], or … . But by scale-invariance, landing in those intervals is as likely as landing between [2 and 4], [20 and 40], [200 and 400], … ; those, however, are exactly the numbers that start with 2 or 3. So among the digits 1, 2 and 3, the digit 1 is clearly the most likely. And there is yet again the same chance of landing within [4 and 8], [40 and 80], [400 and 800], … ; the numbers with leading digits 4, 5, 6 or 7.

In short, each of the sets {1}, {2,3} and {4,5,6,7} of leading digits is equally likely; it stands to reason, and can indeed be checked by similar arguments, that the remaining {8,9} is at most half as likely as {4,5,6,7}. So the digit 1 has at least a 1 in 3.5 chance, which is already ~29% (the slight gap to the actual value is because {8,9} is even _less_ likely than half of {4,5,6,7}).
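The bucket argument can be checked against the exact log10(1 + 1/d) probabilities (a standard formula for Benford’s Law, not derived in this answer): {1}, {2,3} and {4,5,6,7} each carry exactly log10(2) of the probability, while {8,9} carries only log10(10/8).

```python
import math

# Exact Benford probabilities for each leading digit.
P = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

bucket_1    = P[1]                       # [1,2), [10,20), ...
bucket_23   = P[2] + P[3]                # [2,4), [20,40), ...  same mass
bucket_4567 = P[4] + P[5] + P[6] + P[7]  # [4,8), [40,80), ...  same mass
bucket_89   = P[8] + P[9]                # [8,10), ...  less than half a bucket

print(bucket_1, bucket_23, bucket_4567, bucket_89)
```

The first three buckets are all equal (each is log10(2) ≈ 0.301), while the last is ≈ 0.097, less than half of a full bucket; that is exactly why the 1-in-3.5 bound above slightly undershoots the true ~30.1%.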

(+): there was some debate about election tampering in the US based on this. Voting districts are by design of similar size, and hence the law does not apply to them; yet uninformed people claimed that something was “fishy” because the counts didn’t follow the law.
