# Why does Benford’s Law work? This law says numbers in a data set are more likely to start with low digits (1, 2) than high digits (8, 9), but there are exactly as many numbers in existence beginning with each digit.




That’s a property of sets that grow multiplicatively. Benford’s law doesn’t apply to all kinds of datasets.

Imagine you start with any number. And then you repeatedly increase your number by 10%.

So if I start from a number that starts with a small numeral, like 100, it will stay with a small numeral for a while as it grows: 110, 121, 133.1, …

But if you start with a large numeral, like 900, the multiplicative growth quickly rolls it over to the next order of magnitude, where it starts with a 1 again. Here it grows 990, 1089, …
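To see the 10% growth idea in numbers, here is a minimal sketch. The function names `leading_digit` and `growth_leading_digits` and the choice of 1000 steps are mine, not from the answer above; it just counts how often each leading digit shows up while a value keeps growing by 10%.

```python
from collections import Counter


def leading_digit(x: float) -> int:
    """Return the first significant digit of a positive number."""
    while x >= 10:
        x /= 10
    while x < 1:
        x *= 10
    return int(x)


def growth_leading_digits(start: float, rate: float, steps: int) -> Counter:
    """Count leading digits of a value that grows by `rate` each step."""
    counts = Counter()
    value = start
    for _ in range(steps):
        counts[leading_digit(value)] += 1
        value *= 1 + rate
    return counts


# Assumed setup: start at 100, grow 10% per step, 1000 steps.
counts = growth_leading_digits(start=100.0, rate=0.10, steps=1000)
total = sum(counts.values())
for d in range(1, 10):
    print(d, f"{counts[d] / total:.1%}")
```

Under these assumptions the value spends far more steps with a leading 1 than with a leading 9, because going from 1xx to 2xx means doubling, while going from 9xx to 1xxx only takes about 11% growth.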

So this effect happens for stuff like city population numbers, where the growth rate is scaled by the population size, but it doesn’t work for other data sets like population age distributions.

I heard it explained that the units we use for any set of data tend to group them that way.

Charts look weird with lots of decimals or zeroes, so the creators of the charts use some unit that shortens the displayed data.

And THAT last conversion ends up putting a lot of numbers in the situation where they start with “1” or “2”.

In short, it’s anthropogenic… caused by human intervention.

When your data spans multiple orders of magnitude (some values have 1 digit, some 2, some 3, etc.), it tends to follow Benford’s Law. It happens because every time a value gains an extra digit, there are 10 times more possible numbers. Building heights are a good example: if you had 5 buildings whose heights differed by 50 feet each, and the shortest one was 900 feet, they’d be 900, 950, 1000, 1050, and 1100. You have a lot more possibilities for those values to start with 1 than with other digits.


Say you have values ranging from 0 to _n_. How many of those values start with a 1?

_n_ | % starting with 1
-:|-:
1 | 50%
2 | 33%
… | …
9 | 10%
10 | 18%
… | …
19 | 55%
… | …
99 | 11%
… | …
199 | 55.5%

So, for most values of _n_, numbers from 0 to _n_ are significantly more likely to start with a 1 than with any other digit.

In general: there are just as many numbers from 0 to 99 (shaped like _xx_) as from 100 to 199 (shaped like _1xx_), so, as long as your values span multiple orders of magnitude, you’ll hit this effect.
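If you want to check the table yourself, a brute-force count is enough. This is just a sketch; the helper name `share_starting_with_1` is made up for illustration.

```python
def share_starting_with_1(n: int) -> float:
    """Fraction of the integers 0..n whose decimal form starts with '1'."""
    hits = sum(1 for k in range(n + 1) if str(k)[0] == "1")
    return hits / (n + 1)


# Same values of n as in the table above.
for n in (1, 2, 9, 10, 19, 99, 199):
    print(n, f"{share_starting_with_1(n):.1%}")
```

The share swings between roughly 11% (just before a new power of 10) and roughly 55% (just after all the 1xx numbers are in), so it sits above 1/9 for almost every _n_.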

Think about the counting numbers. Basically, counting numbers pile on leading 1’s first. Only when they’ve finished doing that do they start on the 2’s. Leading 3’s don’t get a look in until the 1’s and 2’s have finished. And so on. 9’s don’t catch up until the number is a string of 9’s and about to tick over to the next power of 10. At which point the whole cycle restarts; 1’s take over again, and start stretching their lead back out.

For the vast majority of the time, in other words, the predominant leading digits that the numbers have had to date are the lower ones, with the lowest ones most common. Many numbers in the real world tend to reflect counting to some degree, so they also tend to show the same skewed first digits. Invented ones often don’t, because we’re lousy at making up “realistic” data – which is why it’s often a useful test of the likelihood that a dataset is genuine, as opposed to invented.
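As a rough illustration of that kind of test, here is a sketch that compares observed first-digit shares with the usual statement of Benford’s Law, P(d) = log10(1 + 1/d). That formula is not quoted anywhere in this thread, and the powers of 2 are only a stand-in sample I picked because they are known to follow the law closely; a real check would use the actual dataset and a proper statistical test.

```python
import math
from collections import Counter


def benford_expected(d: int) -> float:
    """Expected share of leading digit d under Benford's Law."""
    return math.log10(1 + 1 / d)


def first_digit_shares(values):
    """Observed share of each leading digit 1-9 (positive integers only)."""
    counts = Counter(int(str(v)[0]) for v in values if v > 0)
    total = sum(counts.values())
    return {d: counts[d] / total for d in range(1, 10)}


# Stand-in data (an assumption for illustration): the first 1000 powers of 2.
sample = [2 ** k for k in range(1, 1001)]
observed = first_digit_shares(sample)
for d in range(1, 10):
    print(d, f"observed {observed[d]:.3f}", f"expected {benford_expected(d):.3f}")
```

A made-up dataset where every leading digit shows up about 11% of the time would stand out against the expected column, which is the basic idea behind using this as a sanity check.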

But it’s important to say that the “Law” is a statistical observation more than an immutable rule. It most definitely doesn’t apply to all data; it’s also important to look at how the data arose.

When you say “there are exactly as many numbers in existence” you’re looking at the wrong thing. Sure, if you weight all numbers equally, then each leading digit 1-9 is used about the same number of times.

But as you said, Benford’s law is about *data*, not numbers. Each number is not treated the same, but rather each data point is.

And data points are more likely to have low digits than high digits (it’s rare that a certain measurement is scaled exactly to a multiple of 10 to make all digits 1-9 evenly likely).