How do they determine statistics like “8 million people in the US have __ disease and another 1 million are undiagnosed”?


I work in healthcare and there have been multiple times where I’ve seen disease prevalence statistics that include “undiagnosed cases”. If they have not been diagnosed then where do they get those numbers from?


By reliably measuring rate of infection in smaller populations then extrapolating the results to larger populations, researchers compare the extrapolation against reported cases. The difference between these quantities becomes the assessment of undiagnosed cases, which is always subject to some measure of error.

There’s a whole branch of math related to counting things. What they’ve learned is if you know how many things are in “the total set”, if you count all the things in “a smaller set” you can make estimates about what the rest of “the total set” looks like. You can also use other math and different methodologies to calculate the probability that your guess is correct.

So the scientists know how many US citizens there are, roughly. And they have data on how many doctors have diagnosed a condition. They *also* have data on how many people died and the condition was discovered at autopsy, or how many people go to the doctor for something else and this condition is discovered as part of diagnosis. Those people were undiagnosed.

So they use the fancy math to look at how many “undiagnosed but caught” people are in the smaller set of “people who have been diagnosed”. Then they use the math to extrapolate. “If diagnosed and not-diagnosed numbers are this much in the smaller set, we have this % of confidence there are this many diagnosed and not-diagnosed in the larger set.”
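To make the “this % of confidence” part concrete, here is a minimal sketch of one common textbook approach, a normal-approximation interval for a proportion. The sample numbers and the population figure are made up for illustration, and real studies use more careful methods (weighting, better interval formulas), so treat this as a toy.

```python
import math

def prevalence_interval(sample_cases, sample_size, population, z=1.96):
    """Scale a sample prevalence to a population, with a rough 95% interval.

    Uses the simple normal approximation for a proportion; z=1.96
    corresponds to roughly 95% confidence.
    """
    p = sample_cases / sample_size                 # prevalence in the sample
    se = math.sqrt(p * (1 - p) / sample_size)      # standard error of p
    low = max(0.0, p - z * se)
    high = min(1.0, p + z * se)
    return p * population, low * population, high * population

# Hypothetical numbers: 300 of 10,000 sampled people have the condition,
# and the population is assumed to be about 330 million.
estimate, low, high = prevalence_interval(300, 10_000, 330_000_000)
print(f"about {estimate:,.0f} people (rough 95% range: {low:,.0f} to {high:,.0f})")
```

The bigger the sample, the narrower that range gets, which is why the size and quality of the “smaller set” matter so much.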

It’s weird and hard for our brains to understand, but it works. It’s how insurance companies stay in business. But it’s also tough math, so every study that comes out has to be scrutinized to make sure their “smaller set” was measured properly and that there aren’t other explanations for their findings. It is NOT as simple as just grabbing a few hundred people and using the percentages directly.

Sometimes there are other factors, too. For example, “excess deaths” is a big one right now. People who work for insurance companies and other businesses that care about death are noticing very dramatic increases in “unexpected” deaths, especially among people who had COVID and recovered. One quote that goes around is an official pointing out their worst-case scenario was a 10% increase, but they saw a 40% increase. We can’t 100% say that those people died because COVID screwed them up in a way that caused them to have a heart attack much younger than expected. We can only point out this started exactly when COVID started, only seems to be affecting people who caught COVID, and is getting worse over time. We’ll have to give it 10 or 15 more years and investigate all possible options before we do anything drastic.


The statistical methods are really quite easy.

For some things, like diabetes, they can look at hospital records.

Let’s say they look at records for 1/10,000th of the population (that would be medical records for about 33k people). They can look at 10 different hospitals (3.3k records each) and see the actual number of people who have diabetes among those 33k people. From there they estimate the number in the whole population (multiply by 10,000).
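Here is that arithmetic as a small sketch. The per-hospital counts are invented and the population figure is a rough assumption; the only real point is that the multiplier is just the population divided by the number of records you looked at.

```python
# Hypothetical diabetes counts found in the records of 10 hospitals,
# 3,300 records each (made-up numbers for illustration only).
RECORDS_PER_HOSPITAL = 3_300
diabetes_counts = [310, 355, 298, 342, 330, 287, 361, 325, 349, 343]
US_POPULATION = 330_000_000  # rough assumption

sample_size = RECORDS_PER_HOSPITAL * len(diabetes_counts)  # total records reviewed
sample_cases = sum(diabetes_counts)                        # total cases found

# The multiplier is the population divided by the sample size.
scale_factor = US_POPULATION / sample_size
estimated_cases = sample_cases * scale_factor

# Hospitals differ from each other, which is one of the "data oddities"
# that more careful methods try to account for.
lowest_rate = min(diabetes_counts) / RECORDS_PER_HOSPITAL
highest_rate = max(diabetes_counts) / RECORDS_PER_HOSPITAL

print(f"sampled {sample_size:,} records, found {sample_cases:,} cases")
print(f"estimated cases nationwide: {estimated_cases:,.0f}")
print(f"per-hospital rates range from {lowest_rate:.1%} to {highest_rate:.1%}")
```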

There are things they do with the data to make it more accurate and account for data oddities, but those are more complex.

To estimate the number or percent undiagnosed, they make some assumptions, such as the age at which people were actually diagnosed and whether people under a certain age are likely to have the condition but not be diagnosed yet.

Let’s pretend there is a disease called Womblyblobitis that affects kindergartners. The schools in the country keep track of how many people have Womblyblobitis so we know that 8 million kids have been reported to have the disease.

BUT, we wonder how many people ACTUALLY have this since some people who have it haven’t been tested so haven’t been reported and counted.

SO – we go to a small number of kindergarten classes and test everybody in the class. We can’t go to every single kindergarten class and test everybody, so we have to do the test on just a few classes.

If we find that each class we test has about 2 kids with Womblyblobitis, then we only need to know how many kindergarten classes there are in the country to predict the total number of kids who have the disease, without testing everybody. Whatever that prediction comes to above the 8 million already reported is our estimate of the undiagnosed cases.
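In code, the Womblyblobitis estimate could look something like this. The class counts and the number of classes are hypothetical (picked so the numbers come out round); only the 8 million reported cases comes from the example.

```python
# Cases found in each kindergarten class we fully tested (made-up data).
cases_per_tested_class = [2, 3, 1, 2, 2, 1, 3, 2]

NUM_CLASSES_IN_COUNTRY = 4_500_000  # hypothetical, chosen for round numbers
REPORTED_CASES = 8_000_000          # what the schools already reported

# Average number of cases per tested class.
avg_cases_per_class = sum(cases_per_tested_class) / len(cases_per_tested_class)

# Extrapolate to every class in the country.
estimated_total = avg_cases_per_class * NUM_CLASSES_IN_COUNTRY

# Anything above the reported count is the estimated undiagnosed group.
estimated_undiagnosed = max(0, estimated_total - REPORTED_CASES)

print(f"average cases per tested class: {avg_cases_per_class}")
print(f"estimated total cases: {estimated_total:,.0f}")
print(f"estimated undiagnosed: {estimated_undiagnosed:,.0f}")
```

With these invented inputs you end up with a headline like “8 million diagnosed and another 1 million undiagnosed”.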


(yes – epidemiologists, don’t come after me with the details / power / bias / sample size etc – this is ELI5 after all!)

Was waiting for some gunner to show off their knowledge talking about odds ratio, relative risk blah blah lol