What is Overfitting in machine learning and why is it bad?

5 Answers

Anonymous

Overfitting means the system hasn’t learned the pattern you want it to learn; instead it has just memorized its training data completely.

If you give it 100 pics with 50 cats and let it learn which ones are cats without any stopping criterion, it will overlearn that exactly those 50 pictures are cats, rather than learning what the cat pictures have in common. It will learn stuff like “oh yeah, the one with the dark blue background is a cat pic”.

To prevent that, you use some part of your data not for training but for quality control. You feed it only 80 pics to learn from, and keep the other 20 only to check whether they are also recognized, even though the system never saw them during training.
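
Here’s a minimal sketch of that hold-out idea in Python, assuming scikit-learn is available; the digits dataset and the decision tree are just stand-ins for the cat pictures and whatever model you’d actually train:

```python
# Hold back part of the data so the model is judged on examples it never saw.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)

# 80% for learning, 20% set aside purely for quality control.
X_train, X_check, y_train, y_check = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = DecisionTreeClassifier(random_state=0)  # left unconstrained, so it tends to memorize
model.fit(X_train, y_train)

print("score on the pictures it trained on:", model.score(X_train, y_train))
print("score on the pictures it never saw:", model.score(X_check, y_check))
```

A big gap between those two numbers is the classic symptom of overfitting: the first score can be perfect while the second is noticeably worse.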

Anonymous

It’s not just for machine learning; it’s a general problem with any model that tries to simplify anything. Overfitting is basically when you make the model so “big” (enough values that it can adjust) that it can perfectly fit *any* training data you feed it. So your model will look *amazing* in terms of performance, but it may totally fail when you finish training and try to do something useful with it, because it’s too hyperspecialized to the training data.

As a trivial/over-simplified example, suppose I want a machine learning widget to recognize pictures of traffic lights so I can automate those stupid captchas (yes, I know that’s not how they actually work). I get training data of 10,000 pictures of traffic lights and 10,000 pictures of non-traffic lights and use that to train the model. Except I give the model 10,000 different variables to work with (far too many). The model can “learn” to recognize each of the 10,000 pictures because it can use one variable to match each photo of a traffic light. The results on the training data will be perfect…it recognizes every one of my 10,000 traffic lights and ignores anything that isn’t those. 100% success!!! But now I feed it a new picture of a traffic light…and that doesn’t match any of the 10,000 I trained it on before. The model will say “not a traffic light” because it got too specific…I overfitted the model so much that it can *only* recognize the training data. It was never forced to figure out how to efficiently recognize traffic lights with a much smaller number of variables that would learn “traffic-light-ness” but still be general enough to recognize other traffic lights.

You can do the same trick in Excel with polynomial fits to data points: if you give the polynomial enough free variables, it can match basically anything to pretty high accuracy. That doesn’t mean you’ve discovered some amazing 70th-degree polynomial that magically predicts your data; you’ve just (grossly) overfitted the model.

Anonymous

Say you sample a sine function, but your data points have some noise. Now say you attempt to fit a polynomial of degree N to those data points, i.e. a + bx + cx^2 + … + zx^N, assigning values to the coefficients to minimize your error.

If you let N=1 then you can only make a line, so it’s not a good fit. Let N=2 and you can make a parabola, which is closer. If you keep increasing N you get a more and more complicated curve that gets closer and closer to every data point. Eventually N becomes large enough that your function exactly matches all the data points with zero error, but the problem is that you now have a crazy-looking squiggly line that no longer resembles the smooth sine function which generated the data. That’s because you gave your function so many degrees of freedom that it was able to exactly fit the noise, rather than averaging it out like it would have with fewer parameters to work with.
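
Here’s a rough NumPy version of that experiment; the number of samples, the noise level, and the degrees tried are made up purely for illustration:

```python
# Fit polynomials of increasing degree to noisy samples of a sine wave
# and compare the error on the samples with the error against the true sine.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 15)
y = np.sin(x) + rng.normal(scale=0.2, size=x.size)   # noisy data points

x_dense = np.linspace(0, 2 * np.pi, 200)             # where we judge the fit

for degree in (1, 3, 14):                            # 14 = one coefficient per data point
    coeffs = np.polyfit(x, y, degree)                # NumPy may warn that the high degree is ill-conditioned
    err_on_points = np.mean((np.polyval(coeffs, x) - y) ** 2)
    err_vs_sine = np.mean((np.polyval(coeffs, x_dense) - np.sin(x_dense)) ** 2)
    print(f"degree {degree:2d}: error on data points {err_on_points:.4f}, "
          f"error vs true sine {err_vs_sine:.4f}")
```

The highest-degree fit drives the error on the sampled points toward zero, but its error against the underlying sine gets worse, because it has fitted the noise instead of averaging it out.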

Anonymous

In very simple terms, overfitting is when a model appears to work very well on one dataset but completely breaks down on another. Usually this means it performs well on the training and validation sets used during development, but fails on the data it’s actually supposed to process in practice. That’s why it’s bad: it ends up being useless.

What overfitting actually is depends on the model, but in general it means that the model has learned to exploit some peculiarity of the training dataset that is not present “in the wild”. For example, if you were training a model to look at pictures of people and tell you whether they have blue eyes or not, and every single blue-eyed person in your training dataset had blonde hair, the model could learn to recognize blonde hair instead. Then if you gave it a picture of a brown-haired, blue-eyed person, it would tell you that they don’t have blue eyes.
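
Here’s a toy sketch of that shortcut in Python; the features, the numbers, and the tiny decision tree are entirely made up for illustration:

```python
# In this made-up training set, "hair_is_blonde" perfectly predicts blue eyes,
# while the (noisy) iris-color measurement does not, so a depth-1 decision
# tree latches onto the hair column instead of the eyes.
from sklearn.tree import DecisionTreeClassifier

# features per person: [hair_is_blonde, measured_iris_blueness (0-1, noisy)]
X_train = [
    [1, 0.9], [1, 0.6], [1, 0.8], [1, 0.4],   # blonde and blue-eyed
    [0, 0.2], [0, 0.5], [0, 0.1], [0, 0.7],   # dark-haired and brown-eyed
]
y_train = [1, 1, 1, 1, 0, 0, 0, 0]            # 1 = blue eyes

model = DecisionTreeClassifier(max_depth=1, random_state=0)
model.fit(X_train, y_train)

# A brown-haired, blue-eyed person: the hair shortcut says "no blue eyes".
print(model.predict([[0, 0.9]]))              # prints [0], i.e. "not blue eyes"
```

Because hair color is the only column that splits this training set cleanly, the model never learns to look at the eyes at all, which is exactly the failure described above.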

Anonymous

Imagine you do an experiment. You find somebody of every age from 1 to 100 and you measure their height. Then you plot these on a graph. Experience tells you that for the first 20 years or so (probably less, but let’s roll with it) you get taller and taller. This happens quickly at first, but slows down as you approach 20. Then your height stays flat for the next 50 years or so, until 70. Beyond that, you begin to lose a bit of height. The “line of best fit” of the data you’ve collected should fit that pattern. It should be a smooth curve that peaks around 20 and plateaus for a while, before gradually dropping at the end.

Say instead you treat your data like a dot-to-dot. You connect 1 to 2 to 3 to 4… with straight lines and sharp corners. That would be overfitting the line. Instead of seeing the whole smooth progression, your line might make you think our heights go up and down constantly. Maybe you happened to pick a tall, early-developing 12-year-old and then a short, late-developing 13-year-old. The line you’ve drawn makes it look like we peak at 12, then immediately shrink, before slowly growing again. Perhaps the 50s were all shorter women and the 60s taller men, or maybe they even alternated!

Point is, you’re giving too much weight to each individual datapoint rather than to the general trend.

This is similar to overfitting in machine learning. Every part of your training dataset has certain flukes and random features. If you train on too small a set for too long, you end up with a system that is very good at dealing with the training data but not so good at anything beyond it.
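
One common way to deal with that, touched on in the first answer (“without any stop criteria”), is to stop training before the model starts memorizing. As a rough sketch, scikit-learn’s MLPClassifier can hold out a slice of the training data on its own and stop once its score on that slice stops improving; the digits dataset here is just a placeholder:

```python
# Stop training when the score on an internal held-out slice stops improving,
# rather than fitting the training set's flukes for as long as possible.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = MLPClassifier(
    hidden_layer_sizes=(64,),
    max_iter=500,
    early_stopping=True,       # hold out 10% of the training data internally
    validation_fraction=0.1,
    n_iter_no_change=10,       # stop after 10 epochs with no improvement
    random_state=0,
)
model.fit(X_train, y_train)

print("stopped after", model.n_iter_, "epochs")
print("score on unseen test data:", model.score(X_test, y_test))
```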