AI literacy for staff members: What is (Gen)AI?

How are LLMs trained?

Chapter
by: Steven Trooster
3 min.

Cleaning data

To train a generative AI model, and specifically a large language model, a large number of sources is used. This can be millions of internet pages, books and images. Among all these sources there can be duplicates, but also unwanted content such as instructions for building bombs, explicit sexual content or instructions for committing suicide. The first step is therefore to clean the data sources before the training process starts. This can be done automatically, for instance when checking for duplicates, but a lot of data is also checked manually, for instance to prevent learning materials on sex education from being filtered out by mistake.
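The two cleaning steps described above can be sketched in a few lines of code. This is a minimal illustration, not a real cleaning pipeline: the documents, the blocklist and the `clean` function are invented for this example, and production systems use far more sophisticated filters alongside human review.

```python
# Minimal sketch of automated data cleaning: deduplication plus a crude
# keyword filter. All names and data here are invented for illustration.
import hashlib

def clean(documents, blocklist):
    seen_hashes = set()
    kept = []
    for doc in documents:
        # Deduplicate: skip documents whose content hash was seen before.
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        # Filter: skip documents containing blocked terms (a crude stand-in
        # for the automated and manual content review described above).
        if any(term in doc.lower() for term in blocklist):
            continue
        kept.append(doc)
    return kept

docs = ["How to bake bread.", "How to bake bread.", "how to build a bomb"]
print(clean(docs, blocklist=["bomb"]))  # ['How to bake bread.']
```

Note how blunt the keyword filter is: it would also discard a medical text that merely mentions the blocked word, which is exactly why human checking remains necessary.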

Pre-training

During the pre-training process the computer is fed an enormous amount of text (billions of pages), along with the instruction to learn how to create texts that mimic human language. In this phase the AI model develops an internal pattern of how human language is structured. The result is also referred to as a foundation model.
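At its core, pre-training teaches the model to predict what comes next in a text. A vastly simplified stand-in is a word-pair ("bigram") counter: it "trains" by counting which word tends to follow which, then continues a text using those statistics. Real LLMs learn far richer patterns with neural networks, but the idea of learning from raw text is the same. The tiny corpus below is invented for illustration.

```python
# Toy next-word predictor: count word-pair frequencies in a tiny corpus,
# then predict the most common follower of a given word.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the fish".split()

# 'Training': count how often each word follows each other word.
follows = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    follows[current][nxt] += 1

def predict_next(word):
    # Return the most frequent follower seen during 'training'.
    return follows[word].most_common(1)[0][0]

print(predict_next("the"))  # 'cat' (it followed 'the' twice in the corpus)
```

Scaling this idea from simple counts to a neural network with billions of parameters, trained on billions of pages, is what produces a foundation model.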

This pattern can be compared to a complex table or ‘language map’ with multiple dimensions (more than the three dimensions we can imagine). In this map, words, sentences and structures that are similar sit close together, while dissimilar ones are further apart. A neural network is trained based on this map, and that network can in turn effortlessly generate new texts.
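The 'language map' idea can be made concrete with a toy example: each word becomes a point in a vector space, and closeness is measured with cosine similarity. The three-dimensional vectors below are invented for illustration; real models use hundreds or thousands of dimensions learned from data.

```python
# Toy 'language map': words as points in a (here 3-dimensional) space.
# Similar words get similar vectors; the vectors below are made up.
import math

embeddings = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.88, 0.82, 0.15],
    "bread": [0.10, 0.20, 0.90],
}

def cosine_similarity(a, b):
    # 1.0 means 'pointing the same way' (very similar), 0.0 means unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # close to 1
print(cosine_similarity(embeddings["king"], embeddings["bread"]))  # much lower
```

In a trained model, these distances are not hand-picked but emerge from the pre-training process itself.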

After this phase the AI model can already produce human-like text, but it is not yet accurate or reliable enough to make it widely available.

Fine-tuning

During the fine-tuning phase the model is further optimized so that it becomes better at performing specific tasks or giving the desired response. This is done by having human assessors give feedback on the model’s output.

This is necessary because language has a lot of different functions. Think of holding conversations, summarizing, reporting, or even making jokes. Each of these texts has different requirements. On top of that, models could potentially also produce incorrect, hurtful or even dangerous statements depending on their training.
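One common way to record this human feedback is as preference pairs: assessors see two model answers to the same prompt and mark the better one. The sketch below only shows what such data looks like; the actual training step that uses it (for example reinforcement learning from human feedback) is omitted, and all prompts and answers are invented for illustration.

```python
# Sketch of human-feedback data for fine-tuning: for each prompt, an
# assessor chose one answer over another. A training pipeline would
# learn from the (chosen, rejected) contrast; here we only inspect it.

preference_data = [
    {
        "prompt": "Summarize this report in two sentences.",
        "chosen": "The report finds X and recommends Y.",      # assessor preferred
        "rejected": "Here is a ten-page rewrite of the report...",
    },
]

def preferred_answer(pair):
    # Return the answer the human assessor marked as better.
    return pair["chosen"]

for pair in preference_data:
    print(preferred_answer(pair))
```

Collecting many thousands of such judgements, for many different tasks, is what steers the model towards summarizing when asked to summarize, joking when asked to joke, and refusing harmful requests.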

How the model is steered in the right direction depends on its intended use and on the choices of the developers (more on this later). Without fine-tuning, models would be less reliable and unsuitable for widespread use.

Want to know more?

A detailed explanation of how LLMs are trained can be found here:

What Is ChatGPT Doing … and Why Does It Work?

Stephen Wolfram explores the broader picture of what is going on inside ChatGPT and why it produces meaningful text. He discusses models, training neural networks, embeddings, tokens, transformers and language syntax.