SPONSORED: It won’t come as any surprise that, like so many other organizations, Nimdzi Insights has been experimenting with ChatGPT since its public release in November 2022. While large language models (LLMs) have been around for a number of years, they mostly went unnoticed until ChatGPT broke the dam. We at Nimdzi are undoubtedly thankful and excited that language technology suddenly became the talk of the town, and we are closely watching and researching the propagation of LLMs in the translation and localization industry and beyond. As we’ve been monitoring the technology, we couldn’t help but notice a lot of confusion around the core fundamentals (which is understandable, as many are still getting up to speed on the topic and learning the proper lingo). We see GenAI, LLM, GPT, and other buzzwords and acronyms flying around, intermingled and misused. In this article, we sort out the AI alphabet so you don’t have to. Get ready for a lot of three-letter acronyms (TLAs)!
Language models: from n-grams to LLMs
The modeling of language has a long history. As with machine translation engines, it began with statistical models (n-grams) in the 1950s, with moderate success. The advent of neural networks (NNs) – whose principles are loosely inspired by how the brain works – brought about a revolution in AI and language modeling.
The underlying mechanism is still probabilistic, but NNs allow machines to learn more effectively than previous models – especially in the new age of deep neural networks (DNNs), where multiple layers of artificial neurons (“nodes”) make up the model. DNNs had already proved useful for various tasks, such as categorization and image processing, but were not very effective at handling language.
“Wait, DNNs are not effective language models?” you might ask with surprise. Well, to become effective at language, they needed a novel architecture. This is where transformers and vast amounts of text data came into play – along with upgraded computing power.
Transformers were introduced in Google’s – by now legendary – 2017 paper “Attention Is All You Need.” Models based on this specific DNN architecture signaled a quantum leap in AI and are applied for various purposes, including language modeling. By introducing attention and self-attention layers, and by allowing for efficient parallelization of computation, the transformer architecture opened the door to the exponential scaling of neural language models.
To further mark their importance, in 2021 Stanford researchers coined the term “foundation models” for large pre-trained models such as transformers, signaling a paradigm shift in the field of AI.
The main features of transformer models:
- Based on the concept of “self-attention,” they learn to predict the strength of dependencies between distant data elements – such as the long-range dependencies and relationships between words (“tokens”) in text.
- They can be pre-trained on massive amounts of (multilingual) text data with enormous potential to scale, and be fine-tuned with custom datasets.
- Based on representing text as numbers in vectors or tensors (“embeddings”), transformer models utilize specialized hardware such as graphical or tensor processing units (GPUs or TPUs) for parallelization and compute efficiency. Recent developments in these computing technologies have greatly contributed to the advances in language modeling.
- Their most well-known applications are the large language models, which achieve human-like performance when tasked with content generation, summarization, labeling tasks, and more.
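To make “self-attention” a little more concrete, here is a minimal sketch in Python/NumPy of scaled dot-product attention over a handful of toy token embeddings. The learned query/key/value projection matrices of a real transformer are omitted for clarity, and the embedding vectors are made up purely for illustration.

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention over a sequence of embeddings.

    X: (seq_len, d) matrix -- one embedding vector per token.
    A real transformer learns separate query/key/value projections;
    here the raw embeddings play all three roles.
    """
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)  # pairwise token-to-token affinities
    # softmax over each row: how strongly each token "attends" to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ X  # each output row is a weighted mix of all token embeddings

# Three toy "token embeddings" in a 4-dimensional space (invented values)
tokens = np.array([[1.0, 0.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0, 0.0]])
out = self_attention(tokens)
print(out.shape)  # (3, 4): one context-aware vector per token
```

Because every row of the attention matrix is computed independently, all tokens can be processed in parallel – exactly the property that makes GPUs and TPUs such a good fit.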
Generative pre-trained transformer
The first generative pre-trained transformer (or GPT), GPT-1 by OpenAI, appeared in 2018. It generated text, was pre-trained with billions of words, and was based on the transformer architecture. It was the first proof that DNNs based on scalable and efficient transformer architecture could be used for generative tasks.
GPT-1, however, was not the only transformer language model of its day — Google released BERT in 2018 as well. While BERT by now powers much of Google Search, it is not a generative model: it doesn’t actually create text and is more useful for natural language understanding (NLU) tasks. GPT-1 was created with that specific purpose: generating text by predicting the next word.
Large language models
So GPT-1 is a generative transformer DNN LM. So far so good, right? But it is NOT an LLM. Let’s unpack this.
Size turns out to be a critical characteristic for neural networks in two distinct ways: the number of parameters the model is made up of, and the size of the data set used to train the model.
GPT-1 had 117 million parameters and was trained on a few billion words. While these were big numbers in 2018, they are minuscule by today’s standards. Its successor, GPT-2, had ten times the number of parameters (1.5 billion), ten times the training data, and four times the neural layers. In 2020, GPT-3 upped the game again: 100 times as many parameters (175 billion), ten times the training data, and double the number of layers of GPT-2. Once parameter counts exceeded a few billion (an arbitrary cut-off), these models started to be dubbed “large.”
And so the large language model was formally born with GPT-3.
Other companies have also created LLMs in different shapes and sizes, such as LaMDA and PaLM (Google), the LLaMA family (Meta), Claude (Anthropic), Cohere’s models, BLOOM (led by Hugging Face), Luminous (Aleph Alpha), and Jurassic (AI21), with varying accessibility and usability features. OpenAI’s latest-generation LLM (GPT-4) arrived in March 2023, and we can be certain that GPT-5 is already in the works.
So what do they do? All an LLM is really doing is guessing the next word (or rather, “token”) that should follow the existing text. Every single token an LLM produces requires a complete pass through the huge model, and the generated token is then appended to the input for the next prediction.
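The loop below sketches this autoregressive process with a toy stand-in for the model: a tiny hand-written preference table instead of billions of parameters. The vocabulary and the “preferences” are invented purely for illustration; only the shape of the loop mirrors what real LLMs do.

```python
import random

def next_token(context):
    """Stand-in for an LLM forward pass: picks a plausible next token
    given the context so far. A real model runs billions of parameters
    here and returns a probability distribution over ~100k tokens."""
    vocab = ["the", "cat", "sat", "mat", "."]
    # Hypothetical fixed preferences, for illustration only
    prefs = {"the": ["cat", "mat"], "cat": ["sat"], "sat": ["the"], "mat": ["."]}
    choices = prefs.get(context[-1], vocab) if context else ["the"]
    return random.choice(choices)

def generate(prompt, n=5):
    tokens = prompt.split()
    for _ in range(n):
        # one complete "model" run per generated token,
        # with the new token appended to the input for the next run
        tokens.append(next_token(tokens))
    return " ".join(tokens)

print(generate("the"))
```

Note that generation cost grows with output length: each new token triggers another full pass, which is why long completions are slower and more expensive.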
What are these tokens? Why not just predict words? Great question. The simple answer is that languages are complex, and chunking up long(er) words into so-called tokens yields better results. And it also has additional benefits. For instance, programming languages – strictly speaking – are not made up of words, but LLMs can learn them just as well if not better than human languages.
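A toy greedy longest-match tokenizer illustrates the chunking idea. Real LLMs use learned subword schemes such as byte-pair encoding, and the vocabulary below is made up for the example.

```python
def tokenize(text, vocab):
    """Greedy longest-match subword tokenizer -- a toy stand-in for the
    BPE-style tokenization used by real LLMs (vocabulary is invented)."""
    tokens = []
    i = 0
    while i < len(text):
        # find the longest vocabulary entry that matches at position i
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character: fall back to a single char
            i += 1
    return tokens

# Hypothetical subword vocabulary
vocab = {"local", "ization", "loc", "al", "iz", "ation", " "}
print(tokenize("localization", vocab))  # ['local', 'ization']
```

Splitting a rare word like “localization” into two frequent subwords keeps the vocabulary small while still covering arbitrary text – the same trick works on source code, which has no natural “words” at all.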
So, here’s what we have learned so far: GPT-1 is a generative LM, and GPT-4 is a generative LLM. Oh, and BERT is a non-generative language model, but they are all transformers. How do these relate to generative AI (GenAI)?
All generative LLMs are GenAI, but generated output can be more than just text. Diffusion models (which are also DNNs with transformer-like features) such as Midjourney, Stable Diffusion, DALL-E, and Bing Image Creator do just that — they create images from text or even image input. There are also attempts at AI-driven voice, music, video, and even game creation, using architectures that can differ widely from transformers.
What about ChatGPT?
LLMs long stayed under the public radar because they needed an engineer to operate them. For all the rapid technological development, the breakthrough came with a product innovation: a public natural-language interface to a GPT-3.5 model, fine-tuned for conversational use. Suddenly, practically anyone – not just developers – could give this language technology a try.
Since then, GPT-4 has been integrated into Bing AI Chat and powers Microsoft 365 Copilot, and other LLMs have also received simple conversational interfaces: Google Bard, You.com, and Perplexity.ai are just a few examples.
And so, with the birth of ChatGPT in November 2022, all the terms explained above became part of public discourse as well as language industry events and publications.