Training takes a lot of time, so you need a cutoff for the data, after which you focus on training and validating the model if you don’t want it to generate absolute gibberish.
But also, a lot of the stuff posted online after 2021 is itself computer generated, which tends to compound errors and is generally bad for model training. There can be legal issues too.
And I also think the uptick of misinformation during the pandemic was so bad that training a model on that data would get you an antivax, neo-Nazi, QAnon, wumao, Russian-troll bot.
Training an ML model takes a lot of time and resources. You supply it with your training data, and the model basically learns everything it’s capable of learning from that data, at which point the model can be used.
Because this is expensive, they train it once and use it for as long as the model continues to be useful. Training a new model is like building a new version of your product. They’re only going to do it once the old one stops being relevant.
There are ways to incrementally train models on just new data, but these don’t work as well as training from scratch. Because the training focuses only on the new data, the model has a tendency to forget things it learned previously (researchers call this "catastrophic forgetting"). It’s also inefficient.
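The forgetting dynamic can be shown with a deliberately tiny sketch: a one-parameter linear model trained with SGD on one task, then trained only on a second task. After the second phase, its error on the first task grows again. (Real language models are vastly more complex, and this toy model lacks the capacity to hold both tasks at once; it just illustrates how training only on new data overwrites old behavior.)

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_fit(w, xs, ys, lr=0.1, epochs=50):
    """Fit y ~ w*x by stochastic gradient descent on squared error."""
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            pred = w * x
            w -= lr * (pred - y) * x  # gradient step
    return w

x = rng.uniform(-1, 1, 100)

w = sgd_fit(0.0, x, 2 * x)       # phase 1: learn task A (y = 2x)
err_a_before = abs(w - 2)        # small: task A is learned

w = sgd_fit(w, x, -3 * x)        # phase 2: train ONLY on task B (y = -3x)
err_a_after = abs(w - 2)         # large: task A has been overwritten

print(err_a_before, err_a_after)
```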
So the most common approach right now to get updated information into these increasingly old models is to take your query or prompt, figure out what present-day information might be helpful context for answering it (which you can do with another ML model, or with a prompt to the same model!), collect that context, and then provide it to the model along with your question. The model can then combine what it already knows from when it was trained with this one-time context and give you a more up-to-date completion.
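In pseudocode-ish form, the retrieve-then-prompt approach looks something like the sketch below. The document store, the word-overlap relevance score, and the prompt template are all hypothetical stand-ins; a real system would use a search engine or embedding index for retrieval and send the assembled prompt to an actual LLM API.

```python
import re

# Hypothetical store of recent documents (real systems use a search backend).
documents = [
    "2023-10-01: Acme Corp announced a new CEO, Jane Doe.",
    "2021-05-12: Acme Corp released version 2.0 of its product.",
    "2023-09-15: Acme Corp stock rose 12% after earnings.",
]

def tokens(text):
    # Lowercase word tokens, punctuation stripped.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def score(query, doc):
    # Crude relevance: count overlapping words. Real systems use
    # embeddings or full-text search instead.
    return len(tokens(query) & tokens(doc))

def build_prompt(query, k=2):
    # Pick the k most relevant documents and paste them in as context.
    top = sorted(documents, key=lambda d: score(query, d), reverse=True)[:k]
    context = "\n".join(top)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

prompt = build_prompt("Who is the CEO of Acme Corp?")
print(prompt)
```

The assembled prompt would then be sent to the model, which can answer from the pasted-in context even though its training data predates the news.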
Both Google and Bing are using this approach, leveraging what they already do well, finding relevant search results, and feeding those into the model as context when you send your query.
But eventually it will be appropriate to train a new model, at which time it will probably incorporate more recent information.
Unfortunately, nowadays you have to compensate for the fact that a lot of the material you’ll find online was generated by machine learning models. You generally don’t want a model’s own output in its training data. Finding high-quality training data is going to be an increasingly hard ML problem.
ChatGPT gave this answer:
1. Time of Training: The training process for large language models like mine takes a significant amount of time and computational resources. By the time my training was completed in 2021, there might not have been sufficient time or resources available to train on newer data.
2. Data Collection and Processing: Gathering and processing large datasets is a complex and time-consuming task. The training data used in my development might have been collected and preprocessed up until 2021, and updating the dataset with newer information requires substantial effort.
3. Model Update Cycles: AI models like ChatGPT go through regular update cycles, and each update typically involves a retraining process on new data. The update cycles may not have coincided with data available after 2021.
4. Maintenance and Resources: Maintaining and updating AI models like ChatGPT is an ongoing effort that requires continuous resources and attention. Prioritizing which datasets to include in updates is a complex decision-making process.
edit: for readability