Without going super deep into the math, you basically take two sets of data…known inputs with known outputs…and you apply a general mathematical filter (“the model”) to the inputs to calculate the model output. Then you compare that to the known output. Initially, the model output will be wildly wrong. Machine learning algorithms are clever ways of using that comparison to adjust the filter so the model output more closely matches the known output. You iterate that lots of times until the model stops getting any better, or until it’s as accurate as you want it to be. Now your model is “trained”…you know that, for the known inputs, it produces outputs that match the known outputs (“truth”) to some degree. You assume that if you feed it new input, it will give you correct output too.
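To make that loop concrete, here’s a minimal sketch in Python (my own toy example, not anything from a real system): the “filter” is just a line y = w*x + b with two adjustable values, the known inputs/outputs are made-up points near y = 3x + 1, and the “clever adjustment” is plain gradient descent. Real models work the same way, just with millions of adjustable values instead of two.

```python
# Toy version of the train-by-comparison loop: adjust the "filter",
# compare its output to the known output, repeat.
import random

# Known inputs with known outputs ("truth"): points near y = 3x + 1.
inputs = [random.uniform(-5, 5) for _ in range(100)]
truth = [3 * x + 1 + random.gauss(0, 0.1) for x in inputs]

# The "filter": two adjustable values, started at a wild guess.
w, b = 0.0, 0.0
learning_rate = 0.01

for step in range(1000):
    # Apply the filter to the known inputs to get the model output.
    model_output = [w * x + b for x in inputs]

    # Compare the model output to the known output (mean squared error).
    error = sum((m - t) ** 2 for m, t in zip(model_output, truth)) / len(inputs)

    # Adjust the filter in the direction that shrinks the error.
    grad_w = sum(2 * (m - t) * x for m, t, x in zip(model_output, truth, inputs)) / len(inputs)
    grad_b = sum(2 * (m - t) for m, t in zip(model_output, truth)) / len(inputs)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"trained filter: w={w:.2f}, b={b:.2f}, final error={error:.4f}")
```

After enough iterations, w and b land close to 3 and 1…the filter has “learned” the relationship hidden in the known data.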
An example…I want to train a machine learning model to pass those annoying captcha things that say “Click on all the pictures of traffic lights”. I collect a whole ton of pictures, some with traffic lights and some without, and I go through them myself and label each one “yes” or “no” based on whether it has a traffic light or not. The set of pictures is the known input, my yes/no list is the known output. Then I train a machine learning model on that training data, and the model “learns” how to recognize traffic lights by repeatedly adjusting a bunch of internal math values according to the algorithm until its yes/no answers mostly match mine. Now I feed it a new captcha that it’s never seen before…if my model is good, it will correctly detect traffic lights in the new photos too.
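Here’s roughly what that looks like in code, with big caveats: the “pictures” below are fake random pixel arrays (the “yes” ones get a brighter patch so there’s actually a pattern to learn), and scikit-learn’s LogisticRegression stands in for whatever model you’d really use for images. It’s a sketch of the workflow, not a real traffic-light detector.

```python
# Sketch of the captcha example: labeled pictures in, yes/no predictions out.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Known inputs: 1000 fake 8x8 "pictures", flattened to 64 pixel values each.
pictures = rng.uniform(0, 1, size=(1000, 64))
# Known outputs: my hand-made yes/no labels (1 = has a traffic light).
labels = rng.integers(0, 2, size=1000)
# Give the "yes" pictures a brighter patch so there is something to learn.
pictures[labels == 1, :8] += 0.5

# Hold some pictures back to play the role of captchas the model has never seen.
train_pics, new_pics, train_labels, new_labels = train_test_split(
    pictures, labels, test_size=0.2, random_state=0
)

# Training: the model adjusts its internal values to match my yes/no list.
model = LogisticRegression(max_iter=1000)
model.fit(train_pics, train_labels)

# Feed it pictures it has never seen and check how often it gets them right.
print("accuracy on new pictures:", model.score(new_pics, new_labels))
```

The important part is the split: the model only ever trains on the labeled pictures, and you judge it on the held-out ones it has never seen.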
This makes machine learning *really* dependent on having good training data. Get the training data wrong and your model can end up with all kinds of weird biases and errors that can be very difficult to detect.