The P in GPT - a down-to-earth explainer of gradient descent
You've graciously put up with my endless ramblings about parameters and mixers.
And I know what you're thinking. "Enough is enough! No more analogies! I want to actually know how it works. If the secret to GPT's power lies in training, then what exactly is training?? Reveal the secret sauce!"
Give me 20 minutes of your time, and I will do just that!
Prepare for hardcore terminology that sounds so SciFi that it belongs in Star Trek:
- Cross-entropy loss
- Successive applications of the chain rule
- Stochastic gradient descent
And yet, we'll cover it in a completely down-to-earth way that doesn't bring back nightmares from high school calculus.
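To whet your appetite, here's the tiniest possible preview in Python. This is my own toy, not anything from GPT's actual training code: a made-up loss function with a single parameter, nudged downhill over and over. That's the heart of gradient descent, and everything we cover later is this idea scaled up to billions of parameters.

```python
# A toy preview of gradient descent - NOT GPT's training code, just the core idea in miniature.
# We want to find the value of w that makes the loss as small as possible.

def loss(w):
    # A made-up loss: how far w is from the "right answer" of 3, squared
    return (w - 3) ** 2

def gradient(w):
    # The slope of the loss with respect to w (calculus says it's 2 * (w - 3))
    return 2 * (w - 3)

w = 0.0                # start with a guess
learning_rate = 0.1    # how big a step to take each time

for step in range(25):
    # Step downhill: move w in the opposite direction of the slope
    w = w - learning_rate * gradient(w)

print(f"After training, w = {w:.4f}, loss = {loss(w):.6f}")
# w ends up very close to 3 - the value that minimizes the loss
```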
Alas - prepare for some disappointing news at the end: there will still be more pieces to the puzzle, even after this. But we will be most of the way there!
You may even start to taste the ketchup and mayo in the secret sauce.
___
If you'd like to learn how to build with Large Language Models, including fine-tuning your own and coding Agent solutions to solve commercial problems, please take a look at my intensive 8-week course on LLM Engineering:
https://www.udemy.com/course/llm-engineering-master-ai-and-large-language-models/?referralCode=35EB41EBB11DD247CF54
Connect with me on LinkedIn:
https://www.linkedin.com/in/eddonner/
Follow me on X:
https://x.com/edwarddonner