How does ‘cutting edge’ technology suddenly get implemented by other parties soon after it is unveiled? For example, ChatGPT came out as an amazing new technology like nothing we had ever seen, and right after, we heard that other parties were coming up with their own versions of ChatGPT. If the technology was so secret, how come everyone else started coming up with their own version so soon?
OpenAI’s models are private, but the methods for creating a large language model are mostly public. The challenge isn’t inventing the method; the challenge is training the model and running it.
To do that, you need lots of high-powered computers, skilled programmers, and *tons and tons* of data. Those are all things that companies like Google, Facebook/Meta, Snapchat, Microsoft, etc. already have access to. Once it became apparent to them that demand for LLMs was high, it was just a matter of shuffling their resources around to make it happen.
Also, a lot of AI products, such as Bing Chat, are actually just GPT with a pre-prompt or minor tweaks.
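“Pre-prompt” here just means a hidden system message wrapped around whatever the user types. A minimal sketch of the idea using the OpenAI Python client (the persona text and function name are made up for illustration, not how Bing Chat actually works):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The "product" is often little more than a fixed system message
# (the pre-prompt) prepended to the user's input.
PRE_PROMPT = "You are a search assistant. Cite sources and be concise."

def branded_chatbot(user_message: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": PRE_PROMPT},
            {"role": "user", "content": user_message},
        ],
    )
    return response.choices[0].message.content

print(branded_chatbot("What is a transformer?"))
```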
History is full of people inventing the same things at the same time. Isaac Newton and Gottfried Leibniz invented (discovered?) calculus independently around the same time. It’s because the ideas, math, techniques and technology behind the invention are ready and sharp minds see the possibilities.
So, don’t hoard that idea you have. Someone else will get to it. Get started now even if you’re not ready.
Those technologies are never “new”, because R&D teams are always working on them. They’re just “ready” at a certain point in time. Also, even if everybody has their own product ready, it may be a good idea to let a competitor take the risk first, see how the public reacts, and then launch your own product taking the feedback into account. If you haven’t noticed, exactly the same thing happened with multi-blade razors. Once the first two-blade razor came out, everybody had their own two-blade version ready. Then three blades, four… (are we at five now?). You may think “surely, four blades is out-of-the-box thinking”, but I guarantee you these people have been experimenting with 3-10 blades from the moment two blades became popular, and they’re just waiting for the market to be ready.
TL;DR: It was never secret, just expensive; the “hype” pushed academia and business to pour money into it.
So, this turned out to be less of an ELI5 and more of an ELI-CS-Undergrad, but here goes:
First, to get some definitions out of the way:
* Model architecture – the complete description of how a model is built (e.g., a picture of how data flows through it from input to output). Having the architecture allows you to *train* the model.
* Model weights (aka model parameters) – the specific numbers that control how the model turns the input data into the output. Having the weights (and the architecture) allows you to *use* the model.
* Training – setting the model weights so that the model fits the data you have well (for generative LLMs, this means predicting the next word in the text well). A toy sketch of all three terms follows this list.
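To make those three terms concrete, here is a deliberately tiny sketch in PyTorch (the model is a made-up toy; real LLMs are transformers with billions of parameters):

```python
import torch
import torch.nn as nn

# "Architecture": the description of how data flows from input to output.
class TinyLM(nn.Module):
    def __init__(self, vocab_size=100, dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, token_ids):
        return self.out(self.embed(token_ids))

model = TinyLM()

# "Weights": the specific numbers that fill in that architecture.
weights = model.state_dict()  # this is the part that is (or isn't) published

# "Training": adjusting the weights so the model predicts the next token well.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
tokens = torch.tensor([1, 2, 3, 4])   # a stand-in for real text
logits = model(tokens[:-1])           # predict each following token
loss = nn.functional.cross_entropy(logits, tokens[1:])
optimizer.zero_grad()
loss.backward()
optimizer.step()                      # weights nudged toward the data
```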
Watch [this series](https://www.youtube.com/watch?v=aircAruvnKk) from 3b1b for a much better introduction than I could ever give. With that out of the way, back to LLMs.
The underlying technology behind ChatGPT, GPT in general, and other language models (including [BERT](https://en.wikipedia.org/wiki/BERT_(language_model)) and friends) has been public for years / decades. They are based on the transformer architecture ([2017](https://arxiv.org/abs/1706.03762)), which is based on the attention mechanism ([2015](https://arxiv.org/abs/1409.0473)), which itself was designed to alleviate some problems with training [LSTMs](https://en.wikipedia.org/wiki/Long_short-term_memory) ([1997](https://direct.mit.edu/neco/article-abstract/9/8/1735/6109/Long-Short-Term-Memory?redirectedFrom=fulltext)) and other such models (e.g. [GRUs](https://en.wikipedia.org/wiki/Gated_recurrent_unit)).
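The attention operation at the heart of all of these fits in a few lines. A minimal sketch of scaled dot-product attention, the core computation from the 2017 transformer paper:

```python
import torch

def scaled_dot_product_attention(q, k, v):
    # softmax(Q K^T / sqrt(d)) V -- each output is a weighted average
    # of the values, weighted by query-key similarity.
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    return torch.softmax(scores, dim=-1) @ v

# Example: 5 tokens, 8-dimensional representations.
q = k = v = torch.randn(5, 8)
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([5, 8])
```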
ChatGPT itself was not really secret technology either: it was (initially) based on GPT-3.5, which is essentially GPT-3 with some inference and tuning tricks built in. [GPT-1](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf), [GPT-2](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) and [GPT-3](https://arxiv.org/abs/2005.14165) all have public architectures. The model weights (essentially, the stuff you need to run the models) are public for GPT-1 and GPT-2, but not for GPT-3 or GPT-4.
The GPT-4 architecture is not public either, which does make it somewhat secret, but there have been [leaks](https://twitter.com/swyx/status/1671272883379908608) saying that it’s an [MoE model](https://machinelearningmastery.com/mixture-of-experts/) with GPT-3-like transformers as experts.
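For context, a mixture-of-experts (MoE) layer routes each input to a few “expert” sub-networks instead of running one giant network on everything. A toy sketch of the routing idea (the expert count and sizes are made up, and each expert here is just a linear layer; per the leak, GPT-4’s experts would be full GPT-3-scale transformers):

```python
import torch
import torch.nn as nn

class ToyMoE(nn.Module):
    def __init__(self, dim=16, n_experts=4, top_k=2):
        super().__init__()
        # Each "expert" is a linear layer purely for illustration.
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.gate = nn.Linear(dim, n_experts)  # decides which experts to use
        self.top_k = top_k

    def forward(self, x):  # x: (batch, dim)
        weights, idx = self.gate(x).topk(self.top_k, dim=-1)
        weights = torch.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        # Only the top-k chosen experts run for each input.
        for b in range(x.size(0)):
            for slot in range(self.top_k):
                expert = self.experts[idx[b, slot].item()]
                out[b] += weights[b, slot] * expert(x[b])
        return out

layer = ToyMoE()
print(layer(torch.randn(3, 16)).shape)  # torch.Size([3, 16])
```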
Other open source models, like [Falcon](https://arxiv.org/abs/2306.01116) (the paper mostly describes training, but publishing weights implies publishing architecture) and [LLaMA](https://arxiv.org/abs/2302.13971), also have both their architectures and weights public, and offer performance comparable to GPT-3.5.
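“Weights public” means anyone can download and run them, e.g. with the Hugging Face `transformers` library. A sketch under the assumption you have a machine with enough memory (the checkpoint name is just one open example; any public model on the Hub works):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "tiiuae/falcon-7b"  # one open checkpoint among many
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tokenizer("The method was never secret, just", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0]))
```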
Basically, the challenge here was not the technology, but amassing enough data and computational power to train these models. With compute becoming cheaper over time (especially after the chip shortage ended), and with people creating and sharing extremely large datasets ([here’s](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) the dataset for Falcon-180B, for example), it became easier for people with money to train these models. The incentives are also quite clear: fame for academics, and control over their own models for companies, so both are willing to pay for and work on them.
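Those shared datasets are a download away too. A sketch using the `datasets` library to stream RefinedWeb rather than fetch the whole corpus (the `content` field name is taken from the dataset card; treat it as an assumption):

```python
from datasets import load_dataset

# Streaming avoids downloading the full corpus (terabytes of text).
ds = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)
for example in ds:
    print(example["content"][:200])  # text field, per the dataset card
    break
```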
Additionally, fine-tuning these models (basically, tailoring them to a specific task) is now possible on commodity hardware, so more model variants will keep popping up all over the world.
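One reason fine-tuning fits on commodity hardware is parameter-efficient methods like LoRA, which freeze the big model and train only small added matrices. A sketch with the `peft` library (the base model and target module names are illustrative choices, not a recipe):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Freeze the base model; train only small low-rank "adapter" matrices.
base = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")  # illustrative
config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of all weights
```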