Machine learning uses lots of data to learn from. For example, you might have lots of images of cats and dogs, and you want the AI to learn which is which, so that after learning, if you give it a new image, it can tell you whether it’s a cat or a dog. (This is useful if you need to categorize millions of images of cats and dogs: once the AI has learned how, it can go through a million images in a matter of minutes or hours, while it would take a human weeks, if not months.)
To train your dog/cat AI, you might show it a million images of cats and dogs that you’ve already labeled. Every time you show it one of these training images, the AI gives you a guess. If the guess is wrong, you use that information to tweak some of the settings of the AI, to make it more likely that the AI gets it right the next time. How this works is a whole explanation of its own that I won’t get into, but the one-word summary is: math. (We have ways to formulate the learning objective mathematically, so that we also get a formula for how to update the settings of the AI.)
Anyway, the point is that learning means going through all of these images, and updating the AI’s settings. Typically, we won’t go through all the images in the dataset just once. We revisit every image multiple times. Once we’ve gone through the whole dataset, we start over with the first image (sometimes we shuffle the order of the data before we go back through it). One single pass through the whole training dataset is called an **epoch**. So, in our example where we had a million training images of cats and dogs, the first epoch ends after we’ve shown all those 1 million images once, and then the second epoch starts up.
So that’s an epoch. But what is a batch? For various technical reasons, to do with the hardware that the AI runs on, it often makes sense to divide the dataset up into small chunks. For instance, we might split our 1 million dog/cat images up into chunks of 64 images. Each of those chunks, we call a “**batch**”. A batch is a chunk of data that the AI can process all at once. So at the same time that it’s processing image 341, and calling it “dog”, it’s also processing image 342, and calling it “cat”. The images in the batch are processed in parallel.
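If it helps to see this in code, here is a minimal Python sketch of cutting a dataset into batches. The function name `make_batches` and the list of 1,000 image IDs are just stand-ins I made up for illustration; real libraries have their own tools for this.

```python
def make_batches(items, batch_size):
    """Split a list into consecutive chunks of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Stand-in for a dataset: 1,000 image IDs instead of real images (an assumption).
image_ids = list(range(1000))
batches = make_batches(image_ids, batch_size=64)

print(len(batches))      # 16 batches in total
print(len(batches[0]))   # 64 images in a full batch
print(len(batches[-1]))  # 40 images in the last, smaller batch
```

Note that when the dataset size is not a multiple of the batch size, the last batch simply comes out smaller. (With 1 million images and batches of 64, it happens to divide evenly.)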
Batches also have to do with how often we change the settings of the AI to try and improve its performance. “Old-fashioned” AI/machine learning used to typically go through the entire dataset, accumulating information about mistakes, before doing an update of the AI’s settings. In other words, these updates would happen after every epoch. The advantage of this is that you get a lot of information before you change the settings, so the changes you make are more likely to be good changes that improve the AI. But in modern times, datasets got larger and larger. So large, that people started to think “Hey, can’t we do an update a bit sooner? Each update might not be as good, but if we can do like 100 mediocre updates in the same time that it would take to do 1 good one, we might still improve faster.” And that idea turned out to be right. So these days, for things like deep neural networks with large training datasets, it is common to do an update of your AI’s settings after every single batch (although you can also choose to do more batches before you do an update).
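To make the "many quick updates versus one careful update" idea a bit more concrete, here is a toy sketch in plain Python. Everything in it is my own assumption for illustration: the "AI" has a single setting `w`, the data is 100 noise-free points on the line y = 3x, and the learning rate and batch size are arbitrary. It is nothing like a real image classifier, but it shows the two update schedules side by side.

```python
import random


def gradient(w, examples):
    """Slope of the mean squared error of the guess w * x, averaged over the examples."""
    return sum(2 * (w * x - y) * x for x, y in examples) / len(examples)


def train(data, batch_size, num_epochs, lr=0.1, seed=0):
    """Run gradient descent with one update of w per batch. Returns (w, number of updates)."""
    rng = random.Random(seed)
    data = list(data)
    w = 0.0
    updates = 0
    for _ in range(num_epochs):
        rng.shuffle(data)  # reshuffle before each pass through the data
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            w -= lr * gradient(w, batch)  # tweak the setting after this batch
            updates += 1
    return w, updates


# Toy dataset (an assumption): points on the line y = 3x, so the ideal setting is w = 3.
data = [(i / 100, 3.0 * i / 100) for i in range(1, 101)]

# "Old-fashioned": the whole dataset is one batch, so one update per epoch.
w_full, n_full = train(data, batch_size=len(data), num_epochs=5)
# Mini-batches of 10: ten updates per epoch.
w_mini, n_mini = train(data, batch_size=10, num_epochs=5)

print(f"full dataset per update: {n_full} updates, w = {w_full:.3f}")
print(f"batches of 10:           {n_mini} updates, w = {w_mini:.3f}")
```

After the same five passes over the data, the mini-batch run has made ten times as many updates and ends up much closer to 3. Keep in mind this toy data has no noise, so it flatters the small batches; on real data each small-batch update is less reliable, which is exactly the trade-off described above.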
So, going back to our example, after every **batch** of 64 cat/dog images that the AI looked at, we would look at the mistakes the AI made and make an update to try to avoid those mistakes in the future. After 15,625 batches, we’ve completed one **epoch**, as we’ve gone through all 15,625 × 64 = 1 million images in our dataset. At this point, perhaps we shuffle the order of the data, and go back to showing the first batch of images again, and so forth.
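Here is a rough Python sketch of that bookkeeping: an outer loop over epochs, an inner loop over batches, and a counter for the updates. The function names are made up for this example, and the actual "run the AI and update its settings" step is left as a comment, since that part depends entirely on the model you use.

```python
import math
import random


def batches_per_epoch(num_examples, batch_size):
    """Number of batches needed to show every example once."""
    return math.ceil(num_examples / batch_size)


def run_training_schedule(num_examples, batch_size, num_epochs, seed=0):
    """Walk through the epoch/batch loops and count how many updates would happen."""
    rng = random.Random(seed)
    order = list(range(num_examples))  # indices of the training examples
    updates = 0
    for epoch in range(num_epochs):
        rng.shuffle(order)  # optional: reshuffle before each epoch
        for start in range(0, num_examples, batch_size):
            batch = order[start:start + batch_size]
            # A real training loop would run the AI on `batch` here,
            # look at its mistakes, and update its settings.
            updates += 1
    return updates


print(batches_per_epoch(1_000_000, 64))  # 15625
# A smaller dataset so the loop runs quickly: 6,400 examples, 3 epochs.
print(run_training_schedule(num_examples=6_400, batch_size=64, num_epochs=3))  # 300
```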
Finally, I’ll just mention that there are often things we do at the end of each epoch, and before the start of the next one. For instance, we might have set aside a portion of the data for “validation”. Validation means you check how well the AI is doing on images that you didn’t use to train it. Sometimes, AIs get “overfit” to their training data, meaning they essentially memorize the correct answers for those specific images, which does not help them to categorize new images at all (it’s a bit like memorizing all the sums in your math homework: this won’t help you do well on your math test, unless some of those exact same sums happen to be on the test). So, by periodically checking how well the AI is doing on images it wasn’t allowed to learn from, we can see whether the AI is overfitting or not. This validation is often done after every epoch (although you can also choose to do it twice per epoch, or once every two epochs, or whatever you think makes sense).
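As a last sketch, here is roughly what "set some data aside and check on it after every epoch" can look like in Python. Again, this is a toy under my own assumptions: the model has one setting `w`, the data are points on the line y = 3x, and the 20% validation share, batch size, and learning rate are arbitrary choices. A model this simple cannot really overfit, so the validation number just keeps improving; with a real network you would watch for it getting worse while the training results keep getting better.

```python
import random


def split_train_val(examples, val_fraction=0.2, seed=0):
    """Shuffle the examples and set aside a fraction of them for validation."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]  # (training set, validation set)


def mean_squared_error(w, examples):
    """Average squared mistake of the guess w * x on the given examples."""
    return sum((w * x - y) ** 2 for x, y in examples) / len(examples)


def train_with_validation(train_set, val_set, batch_size=10, num_epochs=5, lr=0.1):
    """Update w after every batch; measure the validation error after every epoch."""
    w = 0.0
    val_history = []
    for epoch in range(num_epochs):
        for start in range(0, len(train_set), batch_size):
            batch = train_set[start:start + batch_size]
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad
        # End of the epoch: check on data that was never used for updates.
        val_history.append(mean_squared_error(w, val_set))
    return w, val_history


# Toy dataset (an assumption): 100 points on the line y = 3x.
examples = [(i / 100, 3.0 * i / 100) for i in range(1, 101)]
train_set, val_set = split_train_val(examples)
w, val_history = train_with_validation(train_set, val_set)

for epoch, loss in enumerate(val_history, start=1):
    print(f"epoch {epoch}: validation error = {loss:.4f}")
```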