Benford’s Law

Can someone explain Benford’s Law to me? I get that certain leading digits show up more often in large data sets, but why?

In: Mathematics

8 Answers

Anonymous 0 Comments

Create a list of all of the different types of objects in your house, then go count them and write down the numbers. You probably have a lot of single items, like one popcorn maker or one can opener, and other items you might have a lot of, like socks. Your list will probably have a lot of ones on it because of the single items you own. Even for the items you do have a lot of, the count is more likely to start low: kind of like the sock example, you might have 16 socks, but it is unlikely that you have 99 socks. This increases the chances even more that the left-most digit has a high likelihood of being a one.

Anonymous 0 Comments

Benford’s law specifically says that the leading digit of the numbers in a dataset is more likely to be small.

Let’s say you have a random sample of numbers between 0 and 999. In that case the leading digits would be evenly spread between 1 and 9, which is what we would expect.

But what if it were a random sample between 0 and 1999? Then over half of the numbers in the sample would have a leading digit of 1, because 1, 10–19, 100–199 and 1000–1999 all start with a 1.

This is an example of why a dataset that spans multiple orders of magnitude can skew toward smaller leading digits.
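
If you want to check that for yourself, here is a small simulation of my own (not part of the answer above; it samples from 1 upward since 0 has no leading digit):

```python
import random
from collections import Counter

def leading_digit_counts(upper, n=100_000):
    """Tally the leading digits of n random integers drawn from 1..upper."""
    counts = Counter()
    for _ in range(n):
        counts[str(random.randint(1, upper))[0]] += 1
    return counts

for upper in (999, 1999):
    counts = leading_digit_counts(upper)
    share_of_ones = counts["1"] / sum(counts.values())
    print(f"1..{upper}: about {share_of_ones:.0%} of samples start with 1")

# Expected output (approximately):
#   1..999:  about 11% start with 1 (digits 1-9 are evenly represented)
#   1..1999: about 56% start with 1 (1, 10-19, 100-199 and 1000-1999)
```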

Anonymous 0 Comments

Benford’s law says that the first digit of a number picked from a large range of numbers tends to be a 1. Think of it like this: between 1 and 20, 11 of the 20 numbers start with a 1, more than half. Between 1 and 99, only 11 of the 99 numbers start with a 1. This repeats for 1–200, 1–999 and so on, always swinging between a maximum of “just over half” and a minimum of “about 11%”, so if you average that over all ranges, you get that about 30% of numbers in an unknown range start with 1.

It seems like this should apply to all digits equally, but take 9: between 1 and 89, only 1 number starts with a 9, roughly 1%. Going up to 99 brings us back to 11/99, but now 11% is the maximum and “almost 0” is the minimum, so again averaging over all ranges gives you more like 4–5% of leading digits.

You can then apply this to some fraud cases. If the numbers span multiple orders of magnitude AND should be roughly random AND there are a lot of them, you should expect them to match Benford’s law pretty well. If they don’t, one of the 3 requirements is probably false; if you know the first and last are true, you can say “these probably aren’t actually random”.
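
For reference, the expected shares that such checks compare against come from log10(1 + 1/d). A minimal sketch (my own addition, not part of the answer):

```python
import math

# Benford's expected share for each possible leading digit d = 1..9.
for d in range(1, 10):
    print(f"leading digit {d}: {math.log10(1 + 1 / d):.1%}")

# Prints roughly: 30.1%, 17.6%, 12.5%, 9.7%, 7.9%, 6.7%, 5.8%, 5.1%, 4.6%.
# A real check would tally the leading digits of the suspect figures and
# compare them against these shares (e.g. with a chi-squared test).
```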

Anonymous 0 Comments

Benford’s law says that if you gather up a bunch of random numbers that appear “in the wild” and tally up the first digits of each of these numbers, then you will tend to find lots of 1’s and not many 9’s (and in fact you expect more 1’s than 2’s, more 2’s than 3’s, etc.). This tends to happen when you have a range of numbers which is spread out over multiple orders of magnitude. The reason is that when numbers are spread out over many orders of magnitude, they tend to be roughly uniform when measured on a “log scale”. This just means that the number of data points between x and 2x will be about the same for different values of x. In other words, there should be about as many values between 100 and 200 as there are between 200 and 400 and between 400 and 800. If you just look at first digits, this means that you should expect as many 1’s as you have 2’s and 3’s combined, or even as many as you have 4’s, 5’s, 6’s, and 7’s combined.
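
If you want the exact percentages that fall out of that, here is a short sketch of the standard derivation (my addition, not part of the answer; it assumes the fractional part of log10 of the data is roughly uniform, which is just the “uniform on a log scale” condition above):

```latex
% If the fractional part of \log_{10} x is roughly uniform on [0,1), then x has
% leading digit d exactly when that fractional part lands in
% [\log_{10} d, \log_{10}(d+1)), so the probability of leading digit d is
\[
  P(d) = \log_{10}(d+1) - \log_{10} d = \log_{10}\!\left(1 + \frac{1}{d}\right),
\]
% which gives P(1) = \log_{10} 2 \approx 30.1\% down to
% P(9) = \log_{10}(10/9) \approx 4.6\%.
```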

Anonymous 0 Comments

There is currently only one answer that gives the actual reason instead of a nice-sounding but misleading one, so I want to try to add a second point of view:

Benford’s Law is typically stated about data that span multiple, ideally many, _orders of magnitude_; so we have values that are many times larger than others instead of them all being close together (+). For example, take the lengths of rivers, the areas of countries, the sizes of asteroids/planets/stars and such things. They come in widely different sizes!

Next is that we have at least a bit of what is called _scale-invariance_: unlike many simpler probability problems, we don’t assume that all numbers are equally likely, but instead that larger ones are rarer. More precisely, we want roughly the same chance of landing between x and 2x as of landing between y and 2y, regardless of what x and y are. So for example the range from 1000 to 2000 is as likely as 2000 to 4000, and as 4000 to 8000 (and also as 3000 to 6000).

If those two are (almost) satisfied, then Benford’s Law states that the leading digit 1 is much more likely (~30%) than any other, and the larger digits are the less common ones (note that 0 can never be the _leading_ digit, except maybe for the singular number 0).

There can, provably, be no _perfect_ example of scale-invariance in probability; especially, but not only, in reality, where things are bounded in size by the quantum microcosm below and the size of the known universe above. But between those lie dozens of orders of magnitude, easily more than enough within reason. And physics “in the middle”, e.g. at human scales, is itself pretty much scale-invariant, so whatever nature creates has a high tendency to be scale-invariant too.
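
As a quick numerical illustration (my own sketch, not part of this answer; “scale-invariant” is approximated here by a log-uniform distribution over six orders of magnitude):

```python
import math
import random
from collections import Counter

# Draw values whose base-10 logarithm is uniform on [0, 6), i.e. values spread
# evenly on a log scale from 1 up to 1,000,000.
counts = Counter()
for _ in range(200_000):
    x = 10 ** random.uniform(0, 6)
    counts[str(x)[0]] += 1

total = sum(counts.values())
for d in "123456789":
    observed = counts[d] / total
    benford = math.log10(1 + 1 / int(d))
    print(f"digit {d}: observed {observed:.1%}  vs  Benford {benford:.1%}")
```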

**And for _why_ it happens:**

A number with leading digit 1 is between [1 and 2], or [10 and 20], or [100 and 200], or … . But by scale-invariance, being in those intervals is as likely as being between [2 and 4], [20 and 40], [200 and 400], …; those, however, are the numbers that start with 2 or 3. So the digit 1 alone is as likely as 2 and 3 combined, making it clearly the most likely of the three. And yet again there is the same chance of being within [4 and 8], [40 and 80], [400 and 800], …; the numbers with leading digits 4, 5, 6 or 7.

In short, each of the sets {1}, {2,3} and {4,5,6,7} is equally likely; it stands to reason, and can indeed be checked by similar arguments, that the remaining {8,9} is at most half as likely as {4,5,6,7}. So the digit 1 has at least a 1 in 3.5 chance, which is already ~29% (the slight gap to the actual value is because {8,9} is even _less_ likely than half of {4,5,6,7}).
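
Putting exact numbers on that argument (my addition, using the actual Benford probabilities rather than the rough bound above):

```latex
% Under scale-invariance each doubling block of leading digits carries the
% same probability, and the leftover block {8,9} gets the rest:
\[
  P(\{1\}) = P(\{2,3\}) = P(\{4,5,6,7\}) = \log_{10} 2 \approx 30.1\%,
  \qquad
  P(\{8,9\}) = \log_{10}(10/8) \approx 9.7\%.
\]
% The four blocks cover all of the digits 1..9 and sum to 100%, which is why
% the digit 1 alone ends up with roughly a 30% share.
```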

(+): there was some debate about election tampering in the US based on this. Districts are by design of similar size, and hence the law does not apply to them, yet uninformed people claimed that something was “fishy” because the tallies don’t follow the law.

Anonymous 0 Comments

The simple explanation is that when looking at a randomly occurring count of things, the first digit is most likely to be small, with 1 being the most likely of all.

Now when counting an amount of things you can see a simple trend regardless of what you are counting: pushing up a small leading digit is harder than pushing up a big one. I am going to do some rounding for simplicity’s sake, but going from a leading digit of 1 to 2 requires a 100% increase in total things, 2 to 3 a 50% increase, 3 to 4 a 33% increase, 4 to 5 25%, 5 to 6 20%, 6 to 7 16%, 7 to 8 14%, 8 to 9 12.5%, and lastly 9 back to 1 only an 11% increase in total things.
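
To spell those percentages out (a tiny sketch of my own, not part of the answer):

```python
# Relative growth needed to roll a count's leading digit from d to the next one
# (9 rolls over to a leading 1 at the next power of ten).
for d in range(1, 10):
    increase = (d + 1) / d - 1              # e.g. 1xx... -> 2xx... means doubling
    next_leading = d + 1 if d < 9 else 1
    print(f"leading {d} -> leading {next_leading}: about {increase:.1%} more")
```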

As a real-world example, imagine you are a YouTuber and after years of effort you hit 100,000 subscribers. From now on, your subscriber count will read 1xx,xxx until you have become so successful that you literally double your audience and reach 200,000, which likely means repeating the same years-long effort just to change that leading digit. Meanwhile, say that after a lot of effort you reach 900,000 subscribers; your leading digit will be 9xx,xxx until you hit 1 million. However, going from 900,000 to 1,000,000 is, relatively speaking, very easy to do, as that is only an 11% increase in subscribers. Then once you reach 1 million and have 1 as your leading digit again, you will need to double your subscribers once more before 2 becomes the leading digit.

Another way to look at it: going from a leading 9 back to a leading 1 is only about an 11% increase, but getting from a leading 9 all the way to a leading 2 means more than doubling your total (roughly a 122% increase).

Anonymous 0 Comments

Other folks have answered pretty well, but an important thing to bring up is that Benford’s law cannot simply be applied to election data.

A big thing following the 2020 election was that people on the right tried to use the fact that several districts’ vote tallies did not follow the Benford curve. This caused a lot of people to claim election fraud.

Benford’s Law was misused here primarily in the following way:

These charts often focused on specific voting precincts. Benford’s law needs data that are both plentiful and spread over several orders of magnitude; precinct-level vote tallies are neither, since precincts are deliberately kept at similar sizes.

Anonymous 0 Comments

The Radiolab podcast has discussed Benford’s Law several times. I can’t go back and re-listen to all of them, but this might be the earliest and most detailed episode:

https://radiolab.org/podcast/91697-numbers

Radiolab is created for a general audience. You’ll find it easy to listen to.