A month of AI

Notes from catching up with the wave

This February, I put Fiber in maintenance mode to catch up with the AI wave. This is a log of what I read and watched, with notes that might help other engineers doing the same.

I didn't set out with a particular curriculum. I knew I wanted to spend a month and cover the basics across the entire stack, from algorithms to implementation. In retrospect, I spent my time roughly like so:


Background


It might be relevant to mention what I knew about AI before this month started. I had a decent foundation in ML, for a software engineer. My first exposure was back in high school, when I went through the AIMA textbook with a group of friends. Then in college, I spent the summer after freshman year going through Andrew Ng's classes on Coursera and CS229. [1] I later also took an intro to AI course.

In late 2018, I learned about Kaggle and tried my hand at prediction modeling. (Ironically, I used to find NLP boring back then.)

Some other resources I remember using which might still be worth it:

Before this month, I was mostly "caught up" to the 2019 state of the art in AI. But a lot had happened since...


Week 1: Pre-training

I started week one watching Karpathy's "Deep Dive into LLMs like ChatGPT", which he had just released a couple days earlier. It's a newer, longer version of another introduction he did in late 2023, which I watched second.

When I'm diving into a new subject, I like to start with a view from the top. Knowing the landscape ahead of time helps me connect the dots later, when I'm going through the details. [2]  Karpathy's deep dives are perfect for this. In a couple hours, he walks you through the entire pipeline of designing, training and refining a conversational LLM.

Another fantastic introduction to LLMs is 3Blue1Brown's series on Deep Learning:

These videos give you the best intuition for how and why LLMs work. I had already watched a handful of them over the years but I could use a refresher, so I decided to rewatch the series after the deep dives.

One video stood out, Grant's explanation of the attention mechanism inside transformers. This was the first time that QKV clicked for me, and I kept coming back to these videos for the rest of the month.
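To make the QKV idea concrete for myself, here's a toy version of (causal) scaled dot-product attention — my own sketch in PyTorch, not code from the videos:

```python
import torch
import torch.nn.functional as F

# Toy single-head causal attention. Every token emits a query ("what am I
# looking for?"), a key ("what do I contain?") and a value ("what do I pass on?").
T, d_model, d_head = 8, 32, 16            # sequence length, embedding size, head size
x = torch.randn(T, d_model)               # token embeddings

Wq, Wk, Wv = (torch.randn(d_model, d_head) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv          # (T, d_head) each

scores = q @ k.T / d_head**0.5            # affinity between every pair of tokens
mask = torch.tril(torch.ones(T, T)).bool()
scores = scores.masked_fill(~mask, float('-inf'))  # causal mask: no peeking at future tokens
weights = F.softmax(scores, dim=-1)       # rows sum to 1
out = weights @ v                         # each token gets a weighted mix of values
print(out.shape)                          # torch.Size([8, 16])
```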

After 3Blue1Brown and the deep dives, I moved on to Zero To Hero.

Zero to Hero

This is the course by Karpathy that takes you through building and training a model like GPT-2 from scratch.

Each video in the series is 2-3 hours long, but some took me a whole day of work, between lesson and homework. I watched most videos twice: the first time to follow along carefully, and the second time to cement the content and catch anything I might've missed.

The best way to learn is to code along with Karpathy and do the exercises as they come. I used a Google Colab notebook and forced myself to type every bit of code, instead of copying from the GitHub. [3]

In terms of prerequisites, you'll want a basic grasp of linear algebra and neural networks before diving in. Karpathy does not cover linear algebra, not even a bit. And even for backprop, which is the topic of the first video, you're better off having seen it implemented before.

The gist of part 3 is that neural networks don't typically just learn by themselves. Even with good data and a good architecture, a lot can happen during training that causes them to get stuck in local minima, or to stop learning entirely. Researchers have developed many solutions to this over the years, including batch norm, ELU, Kaiming initialization, etc., which are the subject of this lesson.
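To make that concrete, here's a tiny PyTorch sketch of two of those fixes — Kaiming initialization and batch norm — in a small MLP. This is my own illustration, not code from the lesson (which builds everything by hand):

```python
import torch
import torch.nn as nn

# Kaiming init scales the weights so activations don't shrink or blow up with
# depth; batch norm keeps pre-activations well-behaved during training.
hidden = nn.Linear(784, 200)
nn.init.kaiming_normal_(hidden.weight, nonlinearity='relu')

model = nn.Sequential(
    hidden,
    nn.BatchNorm1d(200),   # normalize each unit's pre-activations over the batch
    nn.ReLU(),
    nn.Linear(200, 10),
)

x = torch.randn(32, 784)   # a batch of 32 fake examples
print(model(x).shape)      # torch.Size([32, 10])
```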

Part 4 returns to the topic of backprop and asks us to do the backward pass by hand. Most of the lesson is exercises, which took hours to do.

I found Karpathy's explanation of the backprop code a bit "hand-wavy" and I wanted to derive the math myself. Instead of consulting the internet, I spent hours trying to retrofit what I remembered from matrix calculus to more rigorously explain what Karpathy was doing. But I couldn't make it work.

Much of the difficulty of part 4 stems from "broadcasting", the set of rules for how operations work between tensors of different shapes. I eventually learned, with the help of ChatGPT, that broadcasting is not covered by matrix calculus. It's just an implementation detail.

This was a big unlock in my understanding. It means that most of the operations we do with high-dimensional tensors are still element-wise or matrix-wise operations. The extra dimensions are there only to help us train on multiple nodes and examples at the same time.
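Here's the kind of thing I mean, with a toy example of my own. Broadcasting lines shapes up from the trailing dimensions and "stretches" size-1 dimensions, which is convenient but easy to get silently wrong:

```python
import torch

counts = torch.rand(27, 27)                        # e.g. a table of bigram counts
probs = counts / counts.sum(dim=1, keepdim=True)   # (27, 27) / (27, 1): each row sums to 1

# Dropping keepdim gives a (27,) divisor, which broadcasts as a *row* vector,
# so you end up dividing column j by the sum of row j -- a silent bug.
wrong = counts / counts.sum(dim=1)
print(probs.sum(dim=1))   # all ones
print(wrong.sum(dim=1))   # not ones
```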

Finally, the last video of the series is Let's reproduce GPT-2. It's a lot like the GPT one, but Karpathy uses GPUs and leaves the model training for several hours. I wasn't in the mood to wait that long, so I ended up just watching along. The video is four hours long and has some implementation details that I haven't seen documented anywhere else on the internet.

Attention + GPT papers

After Zero to Hero, I was curious to see how far I could get into the GPT papers. First I read "Attention Is All You Need", the 2017 paper that introduced the Transformer architecture:

Then I read the first GPT paper:

I had to use YouTube to supplement my understanding of residual networks. I knew what they were because of Karpathy, but I didn't have an intuition for why they worked. The best video I found was Professor Bryce's "Residual Networks and Skip Connections (DL 15)".
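For my own notes, the idea in code (my sketch, not from the video): the block learns a correction on top of the identity, and the `+ x` path gives gradients a direct route through the network:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = x + F(x): the block only has to learn the residual F."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        return x + self.net(x)   # the skip connection

x = torch.randn(4, 64)
print(ResidualBlock(64)(x).shape)   # torch.Size([4, 64])
```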

Then I read GPT-2:

Then GPT-3, which is essentially a scaled-up version of GPT-2:

What's special about GPT-3 is that it was the first LLM trained on a general-purpose dataset that could perform a wide range of tasks better than models fine-tuned specifically for those tasks.

End of week one

And this was it for week one. In retrospect, I covered most of what I wanted to learn about pre-training. The "week" actually lasted 9 days, ending on a Sunday. On the last day I was able to go back to Colab and rewrite the entire GPT-1 from scratch from memory, which felt great. ✌️

Week 2: Post-training

Next, I wanted to understand the post-training phase, which isn't covered in the Zero to Hero series. I felt a bit cocky after making it through the GPT papers, so I next tried to read Fine-Tuning Language Models from Human Preferences. This was the first work released by OpenAI that applied Reinforcement Learning from Human Feedback (RLHF) to improving a language model.

The paper stumped me. I tried hard but didn't have the foundation to understand it.

Being very confused by RLHF

So I backtracked to an earlier work, which applied RLHF to Atari games, and to the PPO paper from 2017, which describes the RL algorithm most used by OpenAI. I spent the next few days bouncing between these three papers, still confused.

Learning PPO

Proximal Policy Optimization (PPO) is an RL algorithm published in 2017 which became the industry standard for fine-tuning LLMs (at least until DeepSeek's GRPO came along). The best resource I found on PPO was Spinning Up, an old website put together by Josh Achiam at OpenAI, which covers all things RL. Part 3, in particular, was very important for the math.
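For reference, the equation that part of Spinning Up builds toward is PPO's clipped surrogate objective (my transcription):

$$
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\Big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\Big)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}
$$

The clip keeps the updated policy from straying too far from the policy that collected the data, which is a big part of why PPO trains stably.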

Other than Spinning Up, I did most of my learning on YouTube. It's impressive how much good content you can find there. The best explanation I found of PPO was probably from Umar Jamil. He goes over many of the equations from Spinning Up.

Wrestling with code

Even after learning PPO, I still couldn't make sense of the 2019 fine-tuning paper. I felt the paper glossed over all the important details, so I turned to the code for help.

The code for the paper was available on OpenAI's GitHub, so I decided to use it to fill in the gaps. I spent a whole day just trying to get it running on Google Colab, with little luck. To start, the data used in the study had vanished from the web sometime since 2019. Worse, the code was written in TensorFlow 1.0, which isn't compatible with the recent versions of Python supported by Colab.

I tried painstakingly updating the code to run with newer versions of TensorFlow, to no avail. Compared to PyTorch, TF 1.0 is hard to write, hard to debug, and nearly impossible to read. The worst part about 1.0 (and perhaps the reason why everybody hates it) is that functions exist only to build computational graphs. That means you can't debug using print statements or step through code with a debugger, because the real computation (e.g. computing an activation layer) doesn't happen when the function is executed. So nothing means what it looks like it means. [4]
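A small example of what I mean, in my own words rather than the paper's code. In TF 1.x, the Python code only describes a graph; numbers only show up once you run a session:

```python
import tensorflow as tf  # TensorFlow 1.x style

x = tf.placeholder(tf.float32, shape=[None, 4])
h = tf.layers.dense(x, 8, activation=tf.nn.relu)

print(h)  # prints a symbolic Tensor, not activations -- nothing has been computed yet

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # the actual computation happens here, far from where the layer was "called"
    out = sess.run(h, feed_dict={x: [[1.0, 2.0, 3.0, 4.0]]})
    print(out.shape)  # (1, 8)
```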

What finally helped me understand the paper was this article by the Hugging Face team.

I kicked myself for not Googling more and finding this sooner. It would've saved me a couple days of work.

While I couldn't get far on lm-human-preferences, I was able to learn a lot about PPO by reading the TRL implementation of PPOTrainer:

I wish I had found this straight away.


Some other resources that were helpful to learn RLHF (out of many dozens I tried):

At one point I was also very confused about why, in PPO, the policy and value models tend to be two heads over the same network. I couldn't find an answer anywhere except this particular post on Reddit.
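The setup itself is simple once you see it. Here's my own minimal sketch (not TRL's code): the policy and the value function share a backbone — in RLHF, the LLM itself — with two separate output heads:

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, d_model=768, vocab_size=50257):
        super().__init__()
        # stand-in for the shared transformer backbone
        self.backbone = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU())
        self.policy_head = nn.Linear(d_model, vocab_size)  # logits over next tokens
        self.value_head = nn.Linear(d_model, 1)            # scalar value per position

    def forward(self, hidden_states):
        h = self.backbone(hidden_states)
        return self.policy_head(h), self.value_head(h)

logits, values = ActorCritic()(torch.randn(2, 16, 768))   # (batch, seq, d_model)
print(logits.shape, values.shape)
```

Sharing the trunk means the value estimates reuse the same representations the policy is already computing, instead of keeping a second full copy of the model around.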

Fine-tuning

Understanding RLHF via PPO unlocked other important papers on fine-tuning LLMs, including InstructGPT (a fine-tuned version of GPT-3):

Then I read Constitutional AI, perhaps my favorite paper of the entire month:

End of week two

Finally, I decided to learn about Direct Preference Optimization (DPO). I had read somewhere that DPO was quickly gaining ground as an alternative to PPO that was easier to implement and more stable to train.

So I read some of the paper:

Note I found on the DPO PDF
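For reference, the loss at the heart of the paper as I transcribed it, where $y_w$ and $y_l$ are the preferred and rejected completions and $\pi_\text{ref}$ is the frozen reference model:

$$
\mathcal{L}_\text{DPO}(\pi_\theta;\pi_\text{ref}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_\text{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_\text{ref}(y_l \mid x)}\right)\right]
$$

No separate reward model, no rollouts: it's a plain supervised loss over preference pairs, which is why people call it easier to implement.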

And again Umar Jamil's channel was very helpful:

I ended the week with this CS229 guest lecture by Yann Dubois, which has some good explanations on PPO and DPO.

There's a moment at 1:27:30 that validated me: a student asks why they didn't just start with DPO. The systems section of the lecture was also a good foreshadowing of what was to come in week 3...

Week 3: Scaling

  • Halfway mark
  • Needed to keep pushing myself closer to the state-of-the-art models
  • Tried reading the GPT-4 paper but saw after a skim that it was heavy on the benchmarks but light on the implementation details
  • Decided to read open-source papers instead, specifically on Meta's Llama and Mistral
  • All released in 2023 and quite transparent on the details
  • Llama 2 invented a technique called Ghost Attention (GAtt), which they used during RLHF to make the model adhere to instructions even in long dialogue
  • Their explanation of GAtt is short and left me with lots of questions
  • I found virtually zero explainers online, across YouTube or technical forums
  • The only resource I found was this: https://cameronrwolfe.substack.com/p/llama-2-from-the-ground-up

...

  • Then I read the Mistral papers, which introduced me to concepts like Mixture of Experts and context extension, both of which I touch on below

Components

  • GPT-1 through GPT-3 were mostly the same architecture trained at different scales
  • Back then researchers were concerned with showing that models could become "intelligent"
  • In 2023 we started to see mass adoption of AI, following the release of ChatGPT in November 2022
  • Labs shifted their focus to training and deploying models at scale
  • They started tweaking individual components of the transformer, trying to squeeze out more performance and efficiency

...

  • For example, Llama 2 replaces the autoregressive Multi-Head Attention (MHA) of GPT with Grouped-Query Attention (GQA), which reduces memory requirements, making it easier to train larger models. In GQA, each head generates a unique query vector, but heads share their key and value vectors with one or more other heads. (I was surprised at first to find that this works well, though I guess it makes sense that models would learn to adapt to it.) There's a sketch of the idea after this list.
  • There is also MQA, a more extreme version of GQA in which the keys and values are shared across all heads; only the queries are unique per head.
  • MQA was formulated by Noam Shazeer in 2019.
  • Another concern was how to offer large context windows for inference without deteriorating performance (needle-in-a-haystack tests, etc.)
  • One solution to context extension is Rotary Position Embeddings (RoPE), used by Llama 1 and 2.
  • RoPE helps tokens understand their relative distance to one another, instead of relying on absolute positions.
  • I gave the RoPE paper a shot but realized the math was going to take a while, so I fell back on YouTube instead. I was mostly interested in the general intuition.
  • These are the YouTube explainers I liked:
  • I did read an earlier paper on relative position representations, though I don't remember why. :)
  • Llama and Mistral make several other tweaks to GPT but I didn't go deep into them.
  • Umar Jamil has a great explainer on Llama, which covers many of these tweaks and that I highly recommend:
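Here's the GQA sketch I mentioned above — my own toy code, not Llama's — showing how groups of query heads share key/value heads (MQA is the special case with a single K/V head):

```python
import torch

B, T, n_q_heads, n_kv_heads, head_dim = 2, 8, 8, 2, 16
group_size = n_q_heads // n_kv_heads    # 4 query heads share each K/V head

q = torch.randn(B, n_q_heads, T, head_dim)
k = torch.randn(B, n_kv_heads, T, head_dim)   # far fewer K/V heads to keep in the cache
v = torch.randn(B, n_kv_heads, T, head_dim)

# Repeat K and V so each group of query heads attends over the same K/V head.
k = k.repeat_interleave(group_size, dim=1)    # (B, n_q_heads, T, head_dim)
v = v.repeat_interleave(group_size, dim=1)

attn = torch.softmax(q @ k.transpose(-2, -1) / head_dim**0.5, dim=-1)
out = attn @ v
print(out.shape)   # torch.Size([2, 8, 8, 16])
```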

Scaling laws

  • Another concern of researchers by 2023 was how big to make the models and how long to train them for.

and Chinchilla:

I also found the Wikipedia page on neural scaling laws quite thorough:
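The back-of-the-envelope relations I took away from these (my summary, not a quote): training compute is roughly $C \approx 6ND$ for a model with $N$ parameters trained on $D$ tokens, and Chinchilla's compute-optimal recipe works out to roughly twenty tokens per parameter:

$$
C \approx 6\,N\,D, \qquad D_\text{opt} \approx 20\,N
$$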

Llama 3

  • Llama 3 was released in July 2024
  • Clear signs of professionalization. It lists a whopping 235 core contributors, and many hundreds more get partial credit. You can feel the weight of $1T Facebook behind it.

DeepSeek r1 and v3

  • Somewhere in the first couple of weeks I had set myself the goal of being able to read and understand the DeepSeek r1 paper. So I wanted to do that.
  • DeepSeek made noise with r1, but in a way the innovation is secondary compared to the base model on which r1 is built, called DeepSeek v3.

DeepSeek pioneered GRPO, which r1 is trained with; it's like PPO but simpler to implement.
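The core simplification, as I understand it: GRPO drops PPO's learned value model and instead scores each sampled response against the other responses for the same prompt, using the group statistics as the baseline:

$$
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \ldots, r_G)}{\operatorname{std}(r_1, \ldots, r_G)}
$$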

Wrapping up

Compared to the first couple weeks, this was a big mess. I'm not sure what I read and in what order.

Week 4: The kitchen sink

With the month coming to an end, I started rushing to cover as much ground as I could. My first goal was to understand DeepSeek r1 and why it was a big deal.

Going low

With this knowledge, I was able to follow along with the Ultrascale Playbook, which had recently come out:

Application

Finally, I wanted to learn a bit of the application side as well. It was becoming clear that my zone of genius was going to be in how to apply these LLMs, as opposed to how to make them smarter or more efficient.

I ended up reading 30 papers in 30 days.

My plan for week 4 was to learn "AI engineering". That didn't happen, not only because the subject is too broad (and, most importantly, right within my zone of interest), but because there were things I still wanted to learn before moving on to the top of the stack.


Epilogue

I had been planning to take time to study AI for a while. I'm very happy with how this month turned out. I only wish I had done it sooner.

Moving back to San Francisco in January increased my sense of urgency to catch up. AI is the only thing we talk about at work and at home. On February 5th, Karpathy dropped his "Deep Dive" video, which quickly went viral on X and HackerNews. I originally decided to take a Friday off to watch it, but it quickly became a month-long effort.

Going forward, I want to dedicate some time to AI engineering. It's a part of the puzzle that I'm still missing. Some content I'm consuming next:

https://news.ycombinator.com/item?id=43323946


Thanks Abhi and James for the feedback.

Notes

  1. I recently rediscovered a barebones C++ implementation of backprop, which I wrote that summer: https://github.com/felipap/nndl-cpp. I think it works but I'm not sure.

  2. Another way to put it: consume the "easy" content first.

    It took me many years to learn this lesson. In my teens, my strategy for learning a subject was to find the most complete resource on it — typically a thick textbook — and begin reading it from page one. When I got stuck, I'd just go back and reread the chapter(s). I thought I was being rigorous with my education but this is by far the slowest way to learn.

    Today I start with the ELI5, then the "ELI10", and so on. I look for high-level explanations then I go progressively deeper from there. When I get stuck, I look for a different resource.

    For example, if I wanted to learn Scala today, I'd start by watching Fireship's Scala in 100 Seconds. Then I'd look for a good 15-30 minute introduction to the language. Then maybe a tutorial that I could follow along with. Meanwhile, younger me would've jumped straight into an 11-hour course and felt bad about abandoning it.

  3. I'd go further and say that I wouldn't bother watching the series without doing the work and writing your own notebooks. The difference in how much you absorb is huge. And I found myself many times going back and consulting previous code and notes I had written.

  4. If you're able to define all computation symbolically, TF can automatically optimize and parallelize the graph before running it.

  5. There is a famous list floating around which I considered using. It's a list of papers that Ilya Sutskever supposedly recommended to John Carmack over a dinner: https://www.lesswrong.com/posts/t4ZBjAjXk2NqqAqJ7/the-27-papers There are a couple different versions of it online. As far as I could tell, these are speculation; we don't actually know which papers he recommended.



