How did new large language models come out so quickly after ChatGPT was unveiled?

How does ‘cutting edge’ technology suddenly get implemented by other parties soon after it is unveiled? For example, ChatGPT came out as an amazing new technology that was like nothing we had ever seen, and then right after, we heard that other parties were coming up with their own versions of ChatGPT. If the technology was so secret, then how come everyone else just started coming up with their own version so soon?

TL;DR: It was never secret, just expensive; the “hype” pushed academia and business to pour money into this.

So, this turned out to not really be an ELI5, but more like an ELI-CS-Undergrad; here goes:

First, to get some definitions out of the way:

* Model architecture – the complete description of how a model is built (e.g. a picture of how the data flows through it, input to output). Having the architecture allows you to *train* the model.
* Model weights (aka model parameters) – the specific numbers that control how the model turns the input data into the output. Having the weights (and the architecture) allows you to *use* the model; the sketch after this list shows the difference.
* Training – setting the model parameters in such a way that the model fits the data you have well (for generative LLMs, this means predicting the next word in a text well).
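
To make the first two definitions concrete, here's a minimal sketch in PyTorch (the tiny model and its sizes are made up purely for illustration):

```python
import torch
import torch.nn as nn

# The *architecture* is the code: it describes how data flows, input -> output.
class TinyLM(nn.Module):
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)   # tokens -> vectors
        self.hidden = nn.Linear(dim, dim)            # a mixing layer
        self.out = nn.Linear(dim, vocab_size)        # vectors -> next-token scores

    def forward(self, tokens):
        x = torch.relu(self.hidden(self.embed(tokens)))
        return self.out(x)

model = TinyLM()

# The *weights* are the numbers inside that structure. Publishing the
# architecture alone lets others train their own copy; publishing the
# weights too lets them run yours directly.
torch.save(model.state_dict(), "weights.pt")      # share the weights
model.load_state_dict(torch.load("weights.pt"))   # use them (needs the architecture)
```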

Watch [this series](https://www.youtube.com/watch?v=aircAruvnKk) from 3b1b for a much better introduction than I could ever do. With that out of the way, back to LLMs.

The underlying technology behind ChatGPT, GPT in general, and other language models (including [BERT](https://en.wikipedia.org/wiki/BERT_(language_model)) and friends) has been public for years / decades. They are based on the transformer architecture ([2017](https://arxiv.org/abs/1706.03762)), which is based on the attention mechanism ([2015](https://arxiv.org/abs/1409.0473)), which itself was designed to alleviate some problems with training [LSTMs](https://en.wikipedia.org/wiki/Long_short-term_memory) ([1997](https://direct.mit.edu/neco/article-abstract/9/8/1735/6109/Long-Short-Term-Memory?redirectedFrom=fulltext)) and other such models (e.g. [GRUs](https://en.wikipedia.org/wiki/Gated_recurrent_unit)).
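
For the curious, the core of that attention mechanism fits in a few lines. Here's a minimal NumPy sketch of the scaled dot-product attention described in the 2017 paper (toy shapes, no masking or multi-head machinery):

```python
import numpy as np

def attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # how much each token attends to each other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the sequence
    return weights @ V                               # weighted mix of the values

# Toy example: 4 tokens with 8-dimensional queries/keys/values.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(attention(Q, K, V).shape)  # (4, 8)
```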

ChatGPT itself was not really secret technology either: it was (initially) based on GPT-3.5, which is essentially GPT-3 with some inference and tuning tricks built in. [GPT-1](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf), [GPT-2](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) and [GPT-3](https://arxiv.org/abs/2005.14165) all have their architectures public. The model weights (essentially, the stuff we need to actually run the models) are public for GPT-1 and GPT-2, but not for GPT-3 or GPT-4.

The GPT-4 architecture is not public either, and while that kinda does make it secret, there have been [leaks](https://twitter.com/swyx/status/1671272883379908608) saying that it's an [MoE model](https://machinelearningmastery.com/mixture-of-experts/) with GPT-3-like transformers as experts.
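
If the leak is accurate, the MoE idea is roughly this: a router picks which "expert" sub-network handles each token. A toy sketch (top-1 routing and linear experts are simplifications; real MoE layers use transformer blocks as experts and softer routing):

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, dim=64, n_experts=4):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)   # scores each expert per token
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_experts)])

    def forward(self, x):                          # x: (n_tokens, dim)
        choice = self.router(x).argmax(dim=-1)     # top-1: one expert per token
        out = torch.empty_like(x)
        for i, expert in enumerate(self.experts):
            mask = choice == i
            if mask.any():
                out[mask] = expert(x[mask])        # only run the chosen expert
        return out

moe = TinyMoE()
print(moe(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```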

Other open-source models, like [Falcon](https://arxiv.org/abs/2306.01116) (the paper mostly describes training, but publishing weights implies publishing the architecture) and [LLaMA](https://arxiv.org/abs/2302.13971), have both their architectures and weights public, and their performance is comparable to GPT-3.5.
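
Public weights are why "coming up with your own version" can start with simply downloading one. For example, with the Hugging Face `transformers` library (the model ID is Falcon's real public release; the download is large, so treat this as a sketch):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "tiiuae/falcon-7b"                        # public weights from the Falcon release
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tok("The transformer architecture was introduced in", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))
```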

Basically, the challenge here was not the technology, but amassing enough data and computational power to train these models. With compute becoming cheaper over time (especially after the chip shortage ended), and with people creating and sharing extremely large datasets ([here's](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) the dataset for Falcon-180B, for example), it became easier for anyone with enough money to train these models. The incentives are also quite clear: fame for academics, and the ability to control your own models for companies, so both are ready to pay for / work on them.
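
To put "expensive, not secret" in perspective, a common rule of thumb is that training a transformer takes roughly 6 × N × D floating-point operations for N parameters and D training tokens. A back-of-the-envelope sketch (the GPU throughput figure is an assumption for illustration):

```python
# GPT-3's figures (175B parameters, ~300B training tokens) are from its paper above.
n_params = 175e9
n_tokens = 300e9
flops = 6 * n_params * n_tokens                  # ~3.15e23 FLOPs

# Assume a GPU sustaining ~100 teraFLOP/s (rough; depends heavily on hardware):
gpu_flops_per_sec = 100e12
gpu_seconds = flops / gpu_flops_per_sec
print(f"{flops:.2e} FLOPs ~= {gpu_seconds / 86400 / 365:.0f} GPU-years at 100 TFLOP/s")
```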

Additionally, fine-tuning these models (basically tailoring them to a specific task) is possible on commodity hardware now, so more model variants will keep popping up all over the world.
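
A big reason fine-tuning fits on commodity hardware is parameter-efficient methods like [LoRA](https://arxiv.org/abs/2106.09685), which freeze the big model and train only small added matrices. A minimal sketch using the `peft` library (model choice and hyperparameters are illustrative, not recommendations):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b")

# Add small trainable low-rank adapters to Falcon's attention projections;
# the original weights stay frozen.
config = LoraConfig(r=8, lora_alpha=16, target_modules=["query_key_value"])
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only a tiny fraction of the weights get trained
```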
