# Why does Benford’s Law work? This law says numbers in a data set are more likely to start with low digits (1, 2) than high digits (8, 9), but there are exactly as many numbers in existence beginning with each digit.


We used this as an auditing tool: many companies have financial policies requiring dual signatures on checks over a specific amount, say \$10,000. So we would use Benford's law to see if there was an abnormally high number of checks with 9 as their first digit. If so, it would indicate that employees were circumventing the policy to avoid having other people approve certain checks.
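As a rough illustration of that kind of audit, here is a sketch in Python (the check amounts are made up for the example) comparing observed first-digit frequencies against Benford's predicted frequency, log10(1 + 1/d):

```python
import math
from collections import Counter

def first_digit(x):
    """Leading digit of a positive integer."""
    while x >= 10:
        x //= 10
    return int(x)

def benford_expected(d):
    """Benford's predicted frequency for first digit d."""
    return math.log10(1 + 1 / d)

# Hypothetical check amounts; in practice these would come from the ledger.
amounts = [9800, 9950, 1200, 9875, 450, 9990, 3100, 9700, 880, 9650]

counts = Counter(first_digit(a) for a in amounts)
n = len(amounts)
for d in range(1, 10):
    observed = counts.get(d, 0) / n
    print(f"{d}: observed {observed:.2f}, Benford predicts {benford_expected(d):.3f}")
```

Here 6 of the 10 amounts start with 9, far above the roughly 4.6% Benford predicts, which is the kind of red flag the audit looks for.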

When you say "there are exactly as many numbers in existence" you're looking at the wrong thing. Sure, if you weight all numbers equally, then each of the digits 1-9 leads about the same share of numbers.

But as you said, Benford’s law is about *data*, not numbers. Each number is not treated the same, but rather each data point is.

And data points are more likely to have low leading digits than high ones (it's rare for a measurement's range to line up exactly with a power of 10 in a way that makes all the digits 1-9 equally likely).

It's because the values aren't the only thing that's random; the ranges are too. A random number from 1 to 99 has an equal chance of starting with any digit. A random number from 1 to 90 has a much lower chance of starting with 9. For a random number from 1 to 18, over half the possible values start with 1. Average this out over large datasets and the lower digits are much more common as the initial digit.
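This is easy to simulate: pick a random upper bound first, then a random value within that range, and tally the leading digits. A sketch (the bounds here are arbitrary choices for the example):

```python
import random

random.seed(0)

counts = [0] * 10
trials = 100_000
for _ in range(trials):
    # First pick a random upper bound, then a value in that range --
    # it's the randomness of the *range* that skews the leading digit.
    n = random.randint(1, 1000)
    value = random.randint(1, n)
    d = int(str(value)[0])
    counts[d] += 1

for d in range(1, 10):
    print(f"{d}: {counts[d] / trials:.3f}")
```

Running this, digit 1 comes out well ahead of digit 9 even though neither the bounds nor the values favor any digit directly.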

Benford's law doesn't apply in situations where values are not wide open in what they can be. Back in 2020, a lot of people were trying to make hay about vote totals by district in Illinois not following Benford's law, but the districts are all about the same size, so you wouldn't expect them to. The same would apply to things like people's heights, pressure levels in tires, average star ratings on IMDB, dice rolls, or anything else where the possible values are fairly constrained.

Say you have values ranging from 0 to _n_. How many of those values start with a 1?

_n_ | % starting with 1
-:|-:
1 | 50%
2 | 33%
… | …
9 | 10%
10 | 18%
… | …
19 | 55%
… | …
99 | 11%
… | …
199 | 55.5%

So, for most values of _n_, numbers from 0 to _n_ are significantly more likely to start with a 1 than they are any other digit.

In general: you have roughly as many numbers shaped like _xx_ as you do _1xx_, so as long as your values span multiple orders of magnitude, you'll hit this effect.
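The table above can be reproduced with a few lines (note that, matching the table, 0 is included in the count of values from 0 to _n_):

```python
def frac_leading_one(n):
    """Fraction of the integers 0..n whose decimal form starts with '1'."""
    hits = sum(1 for k in range(n + 1) if str(k)[0] == "1")
    return hits / (n + 1)

for n in [1, 2, 9, 10, 19, 99, 199]:
    print(f"{n}: {frac_leading_one(n):.1%}")
```

The printed fractions oscillate between roughly 11% and 55%+ as _n_ grows, but never dip below the 1/9 you'd expect if every leading digit were equally likely.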

Think about the counting numbers. Basically, counting numbers pile on leading 1’s first. Only when they’ve finished doing that do they start on the 2’s. Leading 3’s don’t get a look in until the 1’s and 2’s have finished. And so on. 9’s don’t catch up until the number is a string of 9’s and about to tick over to the next power of 10. At which point the whole cycle restarts; 1’s take over again, and start stretching their lead back out.

For the vast majority of the time, in other words, the predominant leading digits that the numbers have had to date are the lower ones, with the lowest ones most common. Many numbers in the real world tend to reflect counting to some degree, so they also tend to show the same skewed first digits. Invented ones often don’t, because we’re lousy at making up “realistic” data – which is why it’s often a useful test of the likelihood that a dataset is genuine, as opposed to invented.

But it’s important to say that the “Law” is a statistical observation more than an immutable rule. It most definitely doesn’t apply to all data; it’s also important to look at how the data arose.


I heard it explained that the units we use for any set of data tend to group them that way.

Charts look weird with lots of decimals or zeroes, so the creators of the charts use some unit that shortens the displayed data.

And THAT last conversion ends up putting a lot of numbers in the situation where they start with “1” or “2”.

In short, it’s anthropogenic… caused by human intervention.

When your data spans multiple orders of magnitude (some values are 1 digit, some are 2, some are 3, etc.) then the data tends to follow Benford's Law. It happens because every time something adds an extra digit, there are 10 times more possible numbers. Heights of buildings are a good example: if you had 5 buildings whose heights differed by 50 feet each, and the shortest one was 900 feet, they'd be 900, 950, 1000, 1050, and 1100. You have a lot more possibilities for those values to start with 1 than with other digits.
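One way to see the orders-of-magnitude point numerically: values drawn log-uniformly across six orders of magnitude (an assumed toy distribution, not real building data) land very close to Benford's predicted frequencies:

```python
import math
import random

random.seed(1)

counts = [0] * 10
trials = 200_000
for _ in range(trials):
    # Log-uniform values spanning 1 to 10^6 -- several orders of magnitude.
    value = 10 ** random.uniform(0, 6)
    d = int(str(value)[0])
    counts[d] += 1

for d in range(1, 10):
    benford = math.log10(1 + 1 / d)
    print(f"{d}: simulated {counts[d] / trials:.3f}, Benford {benford:.3f}")
```

The simulated share of leading 1s comes out near 30.1% (log10 2), while leading 9s sit near 4.6%, just as the law predicts.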