What is the actual use of the median and mode in statistics compared to the average (mean)?



The average is affected by extreme values. Like if in Fentucky everybody makes $1 a year except for one person who makes $1,000,000,000 a year, the average will not be useful for determining how a normal person lives. That’s an extreme example but for things where outlier values are an issue but can’t just be tossed out, median can be more useful.

Mode is usually used for categorical data. Like if Coke is comparing the sales of Coke Zero, Vault (RIP), and Mellow Yellow then they probably just care about how much of each one is sold instead of trying to construct an average.
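As a rough sketch of the categorical case, the snippet below counts a made-up sales log and picks the most frequent category. The data is invented for illustration and the variable names (`sales`, `counts`, `top_seller`) are hypothetical; only the standard library is used.

```python
from collections import Counter
from statistics import mode

# Hypothetical sales log: one entry per unit sold (invented data).
sales = ["Coke Zero", "Vault", "Coke Zero", "Mellow Yellow", "Coke Zero", "Vault"]

counts = Counter(sales)    # how many units of each product were sold
top_seller = mode(sales)   # the most frequent category

print(counts.most_common())
print(top_seller)
```

There is no sensible "mean soda" here, which is why a count per category and the mode are the natural summaries.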

I use median much more often than mean, just to say. If you don’t know the distribution, and most likely it’s going to be skewed in one direction, then the mean is not a good estimate of the mode (the peak where most values sit). Many processes in nature create log-normal distributions, and many distributions sit close to a hard limit, like 0. In these cases I prefer taking the median; it ignores outliers much more easily if you have enough samples.
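To see the skew effect on a log-normal distribution, here is a minimal simulation. The parameters (`mu=0`, `sigma=1`), the sample size, and the seed are arbitrary assumptions chosen only for illustration.

```python
import random
from statistics import mean, median

random.seed(0)  # fixed seed so the run is repeatable

# 10,000 draws from a log-normal distribution (mu=0, sigma=1 are arbitrary).
samples = [random.lognormvariate(0, 1) for _ in range(10_000)]

sample_mean = mean(samples)
sample_median = median(samples)

print(f"mean   = {sample_mean:.2f}")
print(f"median = {sample_median:.2f}")
```

The long right tail pulls the mean above the median, so the median lands closer to where the bulk of the samples sit.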

One example is when you have a wide range with a few outliers that would fuck up your summary number, for example average income or net worth. With a few billionaires tossed in, the average would be way higher, so you would go with one of the other two for a more accurate representation of the data set. There are also ways to calculate whether you should include an outlier or not.

Let’s say you have 99 randomly selected people in a room with Jeff Bezos. The average net worth of everybody in the room will be approximately $2 billion, because Jeff Bezos is worth $200 billion and there are 100 people in there, and the other people have effectively zero money (compared to $200 billion). But saying “the average net worth in the room is $2 billion” is utterly meaningless because it’s really just one person bringing up the average for everybody else. So a more meaningful picture might be the median or the mode… the median is the middle person’s net worth. The mode is the most common net worth. These give you a better picture of what the “average” person looks like in the room, than what the actual average tells you.
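Here is the room as a quick sketch. The $200 billion figure comes from the example above; the $50,000 net worth for each of the other 99 people is a made-up assumption.

```python
from statistics import mean, median

# Assumption: the 99 other people are each worth $50,000 (invented figure).
room = [50_000] * 99 + [200_000_000_000]

room_mean = mean(room)      # dominated by the single huge value
room_median = median(room)  # the middle of the sorted list

print(room_mean)
print(room_median)
```

The mean comes out at roughly $2 billion, while the median stays at the $50,000 that almost everyone in the room actually has.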

They each tell you different things about the data set, so the “actual use” would be dependent on what meaning you want to extract from the data set.

Let’s say you’re being graded on something. You do it four times perfectly, 10/10, but you are sick one day, or just have a run of bad luck, and mess it up, 1/10.

Your “mean” rating would be 8.2. But does that accurately reflect your abilities? Does a single fluke really mean you are about 20% less capable than all those other times?

Both the median and the mode would report a value of 10.
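The grading example can be checked with Python's standard `statistics` module; the variable name `scores` is just for illustration.

```python
from statistics import mean, median, mode

scores = [10, 10, 10, 10, 1]  # four perfect runs and one fluke

print(mean(scores))    # 8.2
print(median(scores))  # 10
print(mode(scores))    # 10
```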

The thing to consider and remember is that they are summarizing some aspect of the data, necessarily focusing on some property at the expense of others. With the mean, you sacrifice any information about the distribution of the data and are influenced by extreme outliers (such as that 1/10 fluke above).

With the mode, you are explicitly being told *something* about the distribution of the data (namely the most common element in the set), but you don’t really know how it compares to the remaining elements of the set.

And, finally, with the median you are learning what the “middle” of the data set looks like while losing information about any extreme outliers.

Median is a really great measurement, and frankly, should be used for a lot of things that people traditionally use average for.

Median takes the “middle” value of a series. So: 1, 5, 10, 11, 87 The median is the middle number, 10.

Why is this interesting? Median is good because it operates somewhat like taking an average, but it generally gets rid of outliers, such as the 87 in the series above. The average of the above is 22.8. But is that even interesting when all of your values except one outlier are way lower than that? Median helps deal with series that have outliers. It also tells you a middle point. For example, what if I add another number? Well, just for theory’s sake, you can say there’s about a 50% chance it’ll be below the median of 10.

Median is good for, say, how long it takes people to pay bills.

Suppose I said the average time it took to pay a bill is 22.8 days. Look above: that’s a meaningless value. Nothing took 22.8 days; in fact 4/5 of them took 11 days or less. But if you say the median is 10 days, you can say there’s probably about a 50% chance a bill gets paid in 10 days or less. That 87 out there is probably some special case, so we don’t want to use a method that adjusts for it.

Mode is just the most common number in a series. 1, 2, 2, 3, 4, 5, 6, 7

The mode is 2. It appears twice; all other numbers appear once. This is just telling me what value occurred the most times. There are lots of uses for this, especially in series with a lot of numbers.

The median can be more useful than the mean in situations where there are large extremes at either end of the data set. It’s often helpful in money – let’s say that we’re looking at how much money people have saved up, and we pick 10 people at random for a simplified example. 3 of them have no money in savings, 3 have around $20k saved up, two have $50k, one has $100k, and one has $1million saved. The mean of that set is $126k in savings, but that doesn’t really give us any useful information, because 9/10 of our group actually have less saved up than that, most of them *far* less.

The median of the set, on the other hand, is $20k. That doesn’t tell us anything about how far out the extreme ends are, but it does mean that if we pick a person at random, there is about a 50% chance they fall at or below that amount and about a 50% chance they fall at or above it. So it actually tells us a slightly clearer story than the mean: we don’t know how many people have no savings or how many are millionaires, but we do know where the dividing line is.

The mode can also be handy in sets like this – there are two modes in this set, $0, and $20k. The problem with modes in small sets is that they tend to swing frequently, but a mode like $0 definitely tells you a lot of the story in a data set, because it shows you that lots of those empty accounts exist, and that they’re dragging down your other measurements.
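The savings numbers above can be reproduced with a short sketch. It assumes the "around $20k" savers each have exactly $20,000 so the arithmetic is exact; `multimode` (Python 3.8+) is used because the set has more than one mode.

```python
from statistics import mean, median, multimode

# Ten savers from the example; "around $20k" is taken as exactly $20,000.
savings = [0, 0, 0, 20_000, 20_000, 20_000, 50_000, 50_000, 100_000, 1_000_000]

savings_mean = mean(savings)
savings_median = median(savings)
savings_modes = multimode(savings)  # every value tied for most frequent
below_mean = sum(1 for s in savings if s < savings_mean)

print(savings_mean)    # 126000
print(savings_median)  # 20000.0
print(savings_modes)   # [0, 20000]
print(below_mean)      # 9
```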

Median is a good, simple way to remove outliers from a data set. For example, I have a system that measures a distance. As with all real-world systems, it has a certain level of noise in the output, and averaging the output is a good way to reduce this noise. Most of the time the distance is about correct, so you could take the mean and get a very accurate result. But every now and then something goes wrong: it picks up a reflection or hits some other issue, and the result is completely wrong. E.g. it would measure 10 meters rather than 60.

If I took the mean and this were to happen, my result would be significantly wrong. But such large errors are rare, so if I use the median instead, these rare outliers are ignored and don’t impact the results.
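A minimal sketch of that idea, using invented readings around 60 meters with one bad 10 meter reading mixed in. The values and the name `readings` are assumptions for illustration, not output from a real sensor.

```python
from statistics import mean, median

# Invented distance readings in meters; 10.0 is the bad reflection.
readings = [60.1, 59.8, 60.3, 10.0, 60.0, 59.9, 60.2]

noisy_mean = mean(readings)      # dragged down by the single bad reading
robust_value = median(readings)  # the middle reading, unaffected by it

print(f"mean   = {noisy_mean:.1f} m")
print(f"median = {robust_value:.1f} m")
```

In a live system the same thing is usually done over a sliding window of the most recent readings, which is a design choice beyond what this snippet shows.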

As others have indicated, median is often used when looking at wealth-related things like income, because it avoids a small number of very wealthy people skewing the results. Any time you have a small number of significantly different outliers, median will often give a better indication of the typical value than mean.

Mode is less commonly used. I can’t think of a time I’ve seen it used in the real world.

I heard this ELI5-ish analogy a while ago and it was worth remembering:

There are 10 people in a room. 9 of the people have no apples at all. The 10th person has ten apples. The *average* person has 1 apple…which is true mathematically, but utterly false realistically. *Averages* can be dragged far to one side or the other.

The *median* person has no apples, which is true both mathematically and realistically. The *median* is much less vulnerable to distortion from a single data point far outside all the rest.

The better question to ask is….why do you even use mean in the first place? If your answer is “well….I need an aggregate quantity and I pick the mean because I was told to do so….”, that’s where the confusion comes from.

Mean has a specific usage; it’s not something that you just throw out there because you want a way to measure central tendency. In fact, it might seem like a no-brainer today, but the mean used to be quite controversial, until Gauss showed that it is actually useful. The more intuitive, default quantity to measure central tendency used to be the *median*.

The purpose of these numbers is simple: they are the minimizers of “surprise”. Let’s say you need to use a single number to represent the value of every element in the sample set. Then you are given a single element, and if that element deviates from your number, you get punished by an amount that depends on how much they differ. What you want is a number that minimizes the expected amount of punishment.

Which number you use depends on how the punishment scales with the deviation. The most intuitive choice is for the punishment to equal the size of the deviation: the minimizer is the median, which is what people used for centuries. At the extreme end, any deviation at all is equally bad, and the minimizer of this is the mode. And if the square of the deviation is the punishment, the minimizer is the mean.
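The three punishment rules can be tried out by brute force. This is a sketch on a small made-up sample with integer guesses only, not a proof; the names `total_penalty`, `penalties`, and `best` are hypothetical.

```python
from statistics import mean, median, mode

data = [1, 1, 2, 4, 12]    # made-up sample
candidates = range(0, 13)  # integer guesses to try

def total_penalty(guess, penalty):
    """Sum the penalty over every deviation between the data and the guess."""
    return sum(penalty(x - guess) for x in data)

penalties = {
    "absolute deviation": abs,
    "squared deviation": lambda d: d * d,
    "any miss counts as 1": lambda d: 0 if d == 0 else 1,
}

# For each rule, find the guess with the smallest total penalty.
best = {
    name: min(candidates, key=lambda g: total_penalty(g, fn))
    for name, fn in penalties.items()
}

print(best)
print(median(data), mean(data), mode(data))  # 2 4 1
```

On this sample the best guess under absolute deviation is 2 (the median), under squared deviation it is 4 (the mean), and under the all-or-nothing rule it is 1 (the mode).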

The reason the mean sees a lot of use in theoretical statistics and probability theory is that: (a) by framing everything in a game theory context, lots of questions can be converted into questions about the mean of *something* (not necessarily the original data, but something computed from it), so you don’t really lose generality by just studying the mean; (b) it has mathematically nice properties, such as linearity and the central limit theorem.

But that doesn’t mean you should blindly use the mean in every situation. Think about what your punishment is. How bad would it be for your number to deviate from each sample? This depends on the practical context. If you are reporting numbers in general, with no specific purpose, it might be useful to just report several different measures of central tendency so that people can pick the one that suits them.