Why hasn’t ChatGPT been trained on data after 2021?



Training takes a lot of time, so you need a data cutoff: fix the dataset at some point, then focus on training and validating the model, if you don’t want it to generate absolute gibberish.

But also, a lot of stuff online after 2021 is already computer generated, which tends to compound errors and is generally bad for model training. There can also be legal issues.

But I also think that during the pandemic the uptick in misinformation was so bad that training a model on that dataset will get you an antivaxx, neonazi, QAnon, wumao, Russian troll bot.

First taste is free so they can hook you; then they want you to pay for the new version with internet access, plugins and GPT-4 (even though Bing Chat gives away GPT-4 with internet access and image creation through DALL-E 2 for free, just so you’ll stop using Google lol).

That’s the free version.

Bing Chat is GPT-4, connects to the internet for up-to-date answers, and literally makes me look at Edge in a different way.

Training an ML model takes a lot of time and resources. You supply it with your training data, and the model basically learns everything it’s capable of learning about the training data, at which point the model can be used.

Because this is expensive, they train it once and use it for as long as the model continues to be useful. Training a new model is like building a new version of your product. They’re only going to do it once the old one stops being relevant.
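The "train once on a fixed snapshot, then freeze and serve" idea above can be sketched with a toy one-parameter model. Everything here (the dataset, the learning rate, the function being fit) is invented purely for illustration:

```python
# Toy illustration of "train once, then freeze": a one-parameter model
# fit by gradient descent on a fixed snapshot of data (the "cutoff").

def train(data, epochs=200, lr=0.01):
    """Fit y = w * x by minimizing squared error on a fixed dataset."""
    w = 0.0
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x  # d/dw of (w*x - y)^2
            w -= lr * grad
    return w  # the frozen weight: this is the "deployed model"

# Data available before the cutoff: examples of y = 3x
snapshot = [(1, 3), (2, 6), (3, 9)]
w = train(snapshot)

# The deployed model answers only from what it learned at training time;
# anything that appeared after the cutoff never influenced w.
predict = lambda x: w * x
print(round(predict(4), 1))  # ~12.0
```

A real LLM is the same pattern at vastly larger scale: billions of weights instead of one, and weeks of GPU time instead of a loop, which is why retraining is treated like shipping a new product version.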

There are ways to incrementally train models on just new data, but these don’t work as well as training from scratch. Because the training focuses only on new data, the model tends to forget what it learned previously (so-called catastrophic forgetting). It’s also inefficient.
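One common mitigation for that forgetting is to mix a "replay" sample of old data into the new fine-tuning batches, so the model keeps seeing the old distribution. A minimal sketch of the batching side of that idea, with invented datasets and an assumed 30% replay ratio:

```python
import random

def make_batches(old_data, new_data, replay_ratio=0.3, batch_size=8, seed=0):
    """Build fine-tuning batches that are mostly new data, with a
    fraction of old examples replayed to reduce forgetting."""
    rng = random.Random(seed)
    n_old = int(batch_size * replay_ratio)  # replayed old examples per batch
    n_new = batch_size - n_old              # fresh examples per batch
    batches = []
    for start in range(0, len(new_data), n_new):
        batch = new_data[start:start + n_new] + rng.sample(old_data, n_old)
        rng.shuffle(batch)
        batches.append(batch)
    return batches

old = [f"old-{i}" for i in range(100)]
new = [f"new-{i}" for i in range(30)]
batches = make_batches(old, new)
print(len(batches))  # 5 batches, each 6 new + 2 replayed old examples
```

The actual gradient updates are unchanged; only the data mix differs. Even with replay, incremental updates generally lag behind a full retrain in quality.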

So the most common approach right now for getting updated information into these increasingly old models is: take your query, figure out what present-day information might be helpful context for answering it (which you can do with another ML model, or with a prompt to the same model!), collect that context, and then provide it to the model along with your question. The model can then combine what it already knows from training with this one-time context and give you a more up-to-date completion.
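The retrieve-then-prompt loop described above (often called retrieval-augmented generation) can be sketched in a few lines. The "search" here is a toy word-overlap match over a hard-coded document list; a real system would query a search engine or vector index and send the assembled prompt to an actual model API:

```python
# Invented example documents standing in for "present-day information".
DOCS = [
    "2023: Example Corp released version 2 of its product.",
    "2021: Example Corp was founded.",
    "2023: The product gained a plugin system.",
]

def retrieve(query, docs, k=2):
    """Score documents by word overlap with the query; keep the top k."""
    q_words = set(query.lower().split())
    scored = sorted(docs, key=lambda d: -len(q_words & set(d.lower().split())))
    return scored[:k]

def build_prompt(query, docs):
    """Prepend retrieved context so a frozen model can use fresh facts."""
    context = "\n".join(retrieve(query, docs))
    return f"Context:\n{context}\n\nQuestion: {query}"

prompt = build_prompt("what plugin system does the product have", DOCS)
print(prompt.startswith("Context:"))  # True
```

The model itself is unchanged; the fresh facts ride along in the prompt, which is why this works without retraining.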

Both Google and Bing are using this approach, leveraging what they already do well, finding relevant search results, and feeding those into the model as context when you send your query.

But eventually it will be appropriate to train a new model, at which time it will probably incorporate more recent information.

Unfortunately, nowadays you have to compensate for the fact that a lot of the material you’ll find online was itself generated by machine learning models. You generally don’t want a model’s own output in its training data. Finding high-quality training data is going to be an increasingly hard ML problem.

People replying to this say it’s because training a new model is very expensive and takes a lot of time.

Why? How? What is involved in this? Is it manual labor of some kind? Is it automated and just takes a lot of processing time? I’d like to know more about the process.