Build AI from Zero • EduArtha

🧠 Build Your Own AI — From Zero to ChatBot

A hands-on journey from counting patterns to fine-tuning your own chatbot. Build every piece yourself!

⏱ 5–6 hours | 7 Chapters | 35+ Exercises | 25+ Code Files | Indian Context

🚀 How to Run the Project

Read the lesson.md in each level first, then run the scripts in order.

⚙️ Step 0 — Install (one time)

cd build-ai-from-zero
pip install -r requirements.txt

🟢 Level 1 — Prediction (Pure Python)

cd level_1_prediction
python step1_bigram.py        # Count character patterns
python step2_generate.py      # Generate text from patterns

🟡 Level 2 — Neural Network (NumPy)

cd level_2_neural_network
python step1_neuron.py        # Build a single neuron
python step2_network.py       # Build a multi-layer network
python step3_train.py         # Train with backprop from scratch
python step4_visualize.py     # Plot loss curves

🟠 Level 3 — Transformer (PyTorch)

cd level_3_transformer
python step1_embedding.py          # Token + positional embeddings
python step2_attention.py          # Self-attention mechanism
python step3_transformer_block.py  # Full transformer block
python step4_put_it_together.py    # Complete Mini-Transformer

🔴 Level 4 — Train Your Mini-GPT ⭐

cd level_4_mini_gpt
python train.py               # Train on Indian stories (~5-10 min)
python generate.py            # Chat with YOUR model!

🟣 Level 5 — Fine-Tune a Real Model

cd level_5_real_finetune
python prepare_data.py        # Prepare education Q&A data
python finetune.py            # Fine-tune DistilGPT-2 with LoRA
python chat.py                # Chat with your fine-tuned bot!

📖 Tip: Each level has a lesson.md — read it first to understand the concepts before running the code. Every script prints colorful output explaining what's happening! 🎨

Part I

Foundations

Understanding AI and why it matters

Chapter 1

What is Artificial Intelligence?

"The science of today is the technology of tomorrow." — Edward Teller

Learning Objectives

Define Artificial Intelligence in simple, intuitive terms
Trace the key milestones in AI history — from Turing to ChatGPT
Distinguish between Narrow AI, General AI, and Super AI
Explain, at a high level, how language models like ChatGPT work
Understand the roadmap of what you'll build in this book
Feel confident that you can learn AI — no PhD required!

1.1 Welcome to the AI Revolution 🇮🇳

Open your phone right now. Chances are, you've already used AI today — maybe without even realising it.

UPI & fraud detection: Every time you send money through PhonePe, Google Pay, or Paytm, AI is silently scanning the transaction in milliseconds — checking if it looks suspicious, if your location makes sense, if the amount is unusual. Billions of UPI transactions happen every month in India, and AI keeps them safe.

Aadhaar biometrics: India's Aadhaar system is one of the largest biometric databases in the world — over 1.3 billion people. When you press your thumb at a ration shop or bank, AI-powered pattern recognition verifies your identity in seconds.

Google Translate for Hindi: Try typing a sentence in Hindi and translating it to English. A few years ago, the translations were laughably bad. Today? They're remarkably good. That improvement is AI — specifically, neural networks learning the patterns of language.

IRCTC and recommendations: Ever noticed how shopping apps seem to know what you want? Or how YouTube suggests that next video you can't resist? That's AI predicting your behaviour based on patterns.

Voice assistants: "OK Google, aaj mausam kaisa hai?" — when you speak to your phone in Hindi, and it understands you, that's AI converting sound waves into text, understanding the meaning, and generating a response.

AI isn't some distant, futuristic technology. It's here, it's now, and it's deeply woven into India's digital fabric.

Important

You don't need to be a consumer of AI. This book will make you a creator of AI. By the end, you'll understand exactly how these systems work — and you'll build one yourself.

1.2 What is Artificial Intelligence?

Let's start with the simplest possible definition:

Artificial Intelligence is the science of making computers do things that normally require human intelligence.

That's it. No jargon. No mystification.

But what does "human intelligence" mean? Think about what you do every day:

You recognise faces (your friend in a crowd)
You understand language ("Pass the chai" means something different from "Chai pass ho gaya")
You predict outcomes (dark clouds = rain coming)
You learn from experience (you touched a hot tawa once — never again!)
You make decisions (should I take the metro or an auto?)

Natural intelligence — the kind you have — was shaped by millions of years of evolution. Your brain has roughly 86 billion neurons connected in impossibly complex ways.

Artificial intelligence tries to replicate some of these abilities in a computer. Not by copying the brain exactly, but by using mathematics and data to achieve similar results.

Tip

The School Analogy 🏫

Think of AI like teaching a new student:

Data = the textbooks and examples the student studies
Algorithm = the method of studying (rote learning, understanding concepts, practice problems)
Model = the knowledge the student builds in their brain
Prediction = when the student answers a question they've never seen before

A well-trained student (model) who studied good textbooks (data) using effective methods (algorithms) can answer new questions (predictions) accurately. That's exactly how AI works!

The key difference? A human student might need 10 examples to understand a concept. A computer might need 10 million. But once it learns, it can apply that knowledge millions of times per second, without getting tired, without forgetting, without asking for chai breaks. ☕

1.3 A Brief History of AI

AI didn't appear overnight. It's been a journey of over 70 years — full of breakthroughs, disappointments (called "AI winters"), and spectacular comebacks.

[Diagram: see interactive version]

Let's walk through each milestone:

🏛️ 1950 — The Turing Test

Alan Turing, a British mathematician, asked a simple but profound question: "Can machines think?" He proposed a test: if a human talks to a machine and can't tell it's not human, the machine is "intelligent." This question launched the entire field of AI.

🧠 1957 — The Perceptron

Frank Rosenblatt introduced the Perceptron, one of the first artificial neural networks: a simple device that could learn to recognise patterns. It was inspired by how biological neurons work. The media went wild: "A machine that thinks!" But the excitement faded when researchers discovered its limitations. (We'll build our own version in Chapter 3!)

♟️ 1997 — Deep Blue Beats Kasparov

IBM's chess computer defeated the world champion Garry Kasparov. This was huge — chess was considered the ultimate test of intelligence. But Deep Blue worked by brute force (calculating millions of moves), not by "understanding" chess the way a human does.

🖼️ 2012 — AlexNet and the Deep Learning Revolution

A neural network called AlexNet crushed the competition in an image recognition contest, reducing error rates dramatically. The secret? Deep learning — neural networks with many layers, trained on massive datasets using powerful GPUs. This was the moment everything changed.

🔮 2017 — The Transformer

A team at Google published a paper called "Attention Is All You Need." It introduced the Transformer architecture — a new way for AI to process language. This single paper is the foundation of GPT, BERT, Gemini, and almost every modern language model. (We'll build a mini Transformer in this book!)

💬 2022 — ChatGPT Changes Everything

OpenAI released ChatGPT, and within 5 days, it had 1 million users. For the first time, ordinary people — teachers, students, shopkeepers, everyone — could talk to AI and get useful, coherent responses. AI was no longer just for researchers.

🌟 2024 — The AI Explosion

Google's Gemini, Anthropic's Claude, Meta's LLaMA, and dozens of open-source models made AI accessible to everyone. Indian startups began building AI for Indian languages. Students started learning to build their own models. That's exactly what you're about to do.

1.4 Types of AI

Not all AI is created equal. Scientists categorise AI into three types:

Type	Description	Examples	Status
Narrow AI (ANI)	AI that does ONE thing well	Google Translate, Siri, chess engines, spam filters	✅ Exists today
General AI (AGI)	AI that can do ANY intellectual task a human can	A machine that can cook, write poetry, do surgery, AND play cricket	❌ Doesn't exist yet
Super AI (ASI)	AI that surpasses human intelligence in every way	Science fiction (for now)	❌ Theoretical only

Note

🤔 Think About It

ChatGPT can write essays, code, poetry, and answer questions. Does that make it "General AI"? Not quite! It can't drive a car, cook dinner, or even see the physical world. It's an incredibly capable Narrow AI — extraordinary at language tasks, but only language tasks. We're still far from true AGI.

Where are we today? We live in the age of very powerful Narrow AI. These systems can beat humans at specific tasks (chess, Go, image recognition, language) but can't generalise across domains the way a 5-year-old child can.

1.5 How Do Language Models Work? (The 30,000-Foot View) ✈️

Here's the single most important idea in this book. Ready?

Language models work by predicting the next word (or token).

That's it. That's the secret. Let me show you what I mean.

Complete this sentence:

"मेरा नाम ___ है"

Your brain instantly filled in a name — maybe your own name, maybe "Rahul" or "Priya." How did you do it? You've seen thousands of sentences with this pattern. Your brain predicts what comes next based on what it has seen before.

Now try this one in English:

"The capital of India is ___"

You thought "New Delhi" — not because you reasoned through geography, but because you've seen this pattern so many times that the prediction is automatic.

ChatGPT works exactly the same way, just at an incredible scale:

It was trained on billions of sentences from the internet
It learned the patterns: what words typically follow what other words
When you ask it a question, it predicts the most likely next word, then the next, then the next…
It does this with such sophistication that the output looks like "thinking"
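
The predict-then-append loop in the steps above can be sketched in a few lines. This is only a toy: the `NEXT_WORD` table is hand-written for illustration (a real model learns its patterns from data), and the function name `complete` is a hypothetical choice, not something from the course files.

```python
# Toy illustration only: a hand-written table stands in for a trained model.
NEXT_WORD = {
    "the": "capital",
    "capital": "of",
    "of": "India",
    "India": "is",
    "is": "New",
    "New": "Delhi",
}


def complete(prompt, max_new_words=10):
    """Repeatedly predict the next word and append it to the text."""
    words = prompt.split()
    for _ in range(max_new_words):
        last_word = words[-1]
        if last_word not in NEXT_WORD:   # nothing known for this word: stop
            break
        words.append(NEXT_WORD[last_word])
    return " ".join(words)


print(complete("the capital"))  # the capital of India is New Delhi
```

Each new word is chosen by looking only at what came before, then it becomes part of the input for the next prediction. Real language models replace the lookup table with a neural network, but the loop has the same shape.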

Tip

Key Insight 💡

The difference between your bigram model (Chapter 2) and ChatGPT isn't the core idea, since both predict what comes next. The difference is scale:

Your model looks at 1 character of context
ChatGPT can look at up to 128,000 tokens of context, depending on the model version
Your model has a few hundred parameters
GPT-4 is widely reported to have over 1 trillion parameters (OpenAI has not published the figure)

Same idea. Different scale. And you're about to understand both!

1.6 What You'll Build in This Book 🏗️

This book takes you from zero knowledge to building your own chatbot, step by step. No magic, no black boxes — you'll understand every line of code.

Here's the roadmap:

[Diagram: see interactive version]

Each level builds on the previous one. By the end:

Level 1: You'll have a model that generates text by counting character patterns
Level 2: You'll build a neural network from scratch — no TensorFlow, no PyTorch — just NumPy and math
Level 3: You'll understand how computers represent words as numbers (embeddings) and learn the "attention" mechanism
Level 4: You'll build a mini Transformer — the same architecture that powers ChatGPT and Gemini
Level 5: You'll fine-tune a real model and build a working chatbot interface

Important

No shortcuts. Levels 1 and 2 use no deep-learning frameworks at all (pure Python, then NumPy), PyTorch only appears at Level 3, and a pre-trained model only at Level 5. You'll build everything from scratch so you truly understand it. This is like learning to drive with a manual car before switching to automatic: harder at first, but you'll be a much better driver.

1.7 Prerequisites 📋

Here's what you need to start:

✅ What You Need

Basic Python knowledge — variables, loops, functions, lists, dictionaries. If you can write a function that takes a list and returns the largest number, you're ready.
A computer — any laptop or desktop. The code in Levels 1-3 runs on even the most basic hardware.
Curiosity — the most important ingredient!

❌ What You DON'T Need

❌ A PhD in mathematics
❌ A powerful GPU (until Level 4-5)
❌ Prior knowledge of machine learning
❌ Experience with TensorFlow or PyTorch

🆓 Free Resources

Google Colab (colab.research.google.com) — free cloud computing with Python, NumPy, and even GPUs. You can run all the code in this book for free!
Python (python.org) — if you prefer running code locally

Tip

If you're new to Python, spend a weekend going through a basic tutorial. Focus on: variables, if-else statements, for loops, functions, lists, and dictionaries. That's all you need!

1.8 Discussion Questions 💭

Take a moment to think about (or discuss with a friend):

AI in Indian classrooms: How could AI change the way students learn in government schools? Could an AI tutor help bridge the gap between urban and rural education?

Language diversity: India has 22 official languages and hundreds of dialects. Why is building AI for Indian languages harder than building it for English? What challenges exist?

Ethical concerns: If an AI system trained on internet data learns biases (gender, caste, region), who is responsible? The programmer? The company? The data?

Jobs and AI: Some people say "AI will take all our jobs." Others say "AI will create new jobs we can't imagine." What do you think will happen in India over the next 10 years?

Creativity and AI: If an AI writes a poem in Hindi, is it "creative"? Can a machine that predicts the next word ever truly be creative, or is it just very sophisticated pattern matching?

📝 Chapter Summary

AI is the science of making computers do things that require human intelligence — recognising images, understanding language, making decisions, and learning from experience.
AI has a rich history spanning over 70 years, from Turing's question in 1950 to ChatGPT taking the world by storm in 2022.
There are three types of AI: Narrow AI (what we have today), General AI (still theoretical), and Super AI (science fiction for now).
Language models work by predicting the next word/token. This simple idea, scaled up with massive data and clever architectures, produces the remarkable behaviour we see in ChatGPT, Gemini, and other models.
This book will take you from zero to building your own chatbot, through 5 levels of increasing complexity — and you'll understand every step along the way.
You don't need a PhD. Just Python basics and curiosity. Let's go! 🚀

⏭️ What's Next?

In Chapter 2: Prediction with Patterns, you'll write your very first AI model — in pure Python, with zero libraries. You'll teach a computer to learn patterns from text and generate new text, character by character.

It starts with the simplest possible question: "Given this character, what character is most likely to come next?"

Simple question. Surprisingly powerful answer. Let's build it!

"A journey of a thousand miles begins with a single step." — Lao Tzu

Your step starts in Chapter 2. Turn the page. 📖

Part II

Building Blocks

From counting patterns to learning neural networks

Chapter 2

Prediction with Patterns — Your First AI Model

"All models are wrong, but some are useful." — George Box

Learning Objectives

Explain what a bigram is and why it's useful for text prediction
Build a bigram model from scratch in pure Python — no libraries!
Understand probability distributions and how they drive text generation
Generate new text character by character using your model
Recognise the limitations of bigram models and why more context matters
Connect this simple model to how ChatGPT works at a fundamental level

2.1 The Simplest AI: Counting Patterns 🔢

Before we write any code, let's do an experiment with your own brain.

Complete these sentences:

"मेरा नाम ___ है"
"The capital of India is ___"
"Virat Kohli is a great ___"

How did you do that? You didn't "reason" through the answer — your brain instantly predicted the most likely next word based on patterns you've seen thousands of times before. You've heard "मेरा नाम" followed by a name so often that the prediction is automatic.

Now here's the profound insight: This is exactly how AI language models work.

They look at the text so far, and predict what comes next. The only difference is scale:

You've read maybe a few thousand books in your life
ChatGPT was trained on billions of pages of text

But the core idea? Prediction based on patterns. And we're about to build the simplest version of this idea.

Note

🤔 Think About It

When you text a friend on WhatsApp and your keyboard suggests the next word — that's a prediction model! It learned from YOUR texting patterns. Our bigram model works on the same principle, just at the character level.

2.2 What is a Bigram?

A bigram is simply a pair of two consecutive characters (or words).

Let's look at the word "namaste":


n-a  a-m  m-a  a-s  s-t  t-e

These are the bigrams: na, am, ma, as, st, te.
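
As a quick check, the same list can be produced in Python. This is a minimal sketch; the helper name `char_bigrams` is made up for illustration and is not part of the course files.

```python
def char_bigrams(word):
    """Return every pair of consecutive characters in `word`."""
    # zip pairs each character with the one right after it
    return [a + b for a, b in zip(word, word[1:])]


print(char_bigrams("namaste"))  # ['na', 'am', 'ma', 'as', 'st', 'te']
```

A word with n characters always yields n - 1 bigrams, which is why the 7-letter "namaste" gives 6 pairs.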

Now think about English. If I give you the letter 't', what letter do you think comes next?

th — very common! ("the", "that", "this", "three")
to — common ("to", "top", "today")
tr — fairly common ("tree", "train", "true")
tz — very rare in English!

You intuitively know that 'h' is much more likely to follow 't' than 'z' is. You know this because you've seen millions of English words. A bigram model learns the same thing — by counting!

Tip

The Chai Stall Analogy ☕

Imagine you sit at a chai stall and count how many people order what after chai:

80 people ordered chai with samosa
15 people ordered chai with biscuit
5 people ordered chai with pakora

Now if someone orders chai, you'd predict: "They'll probably want a samosa!" That's exactly what a bigram model does — count what follows what, then predict.
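
The chai stall tally translates directly into code. The sketch below uses `collections.Counter` from the standard library for brevity; the variable names are illustrative only.

```python
from collections import Counter

# Tally of what people ordered with chai (numbers from the analogy)
orders_after_chai = Counter({"samosa": 80, "biscuit": 15, "pakora": 5})

total = sum(orders_after_chai.values())                   # 100 customers
prediction, count = orders_after_chai.most_common(1)[0]   # biggest tally wins

print(f"Prediction: {prediction} ({count / total:.0%} of past orders)")
# Prediction: samosa (80% of past orders)
```

Swap "chai" for a character and "samosa" for the character that follows it, and you have the bigram model built in the next section.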

2.3 Building Your First Model 🏗️

Now let's look at actual code! Our first script, step1_bigram.py, takes a piece of text and counts every pair of consecutive characters.

The Training Text

The code uses a paragraph about Indian scientific achievement as its training data:

Python
SAMPLE_TEXT = """India has a rich history of scientific achievement that spans thousands of years.
Ancient Indian mathematicians gave the world the concept of zero, which transformed
mathematics forever. Aryabhata, born in 476 CE, calculated the value of pi to four
decimal places and proposed that the earth rotates on its axis. Charaka and Sushruta
were pioneers of medicine and surgery in ancient India. The decimal number system that
the whole world uses today originated in India. Ramanujan, one of the greatest
mathematical geniuses, made extraordinary contributions to number theory and infinite
series. Chandrasekhar won the Nobel Prize for his work on the structure and evolution
of stars. ISRO, the Indian Space Research Organisation, successfully launched missions
to the Moon and Mars, making India one of the few nations to achieve this feat. From
the ancient universities of Nalanda and Takshashila to modern research institutions,
India has always been a land of knowledge and discovery."""

The Core Function: Counting Bigrams

This is the heart of our "AI model" — and it's just counting! The function reads through the text and tallies how many times each character follows each other character:

Python
def build_bigram_counts(text):
    """
    Count how often each character follows each other character.

    This is the CORE of our "AI model"!

    Think of it like this:
    - We read the text one character at a time
    - For each pair of consecutive characters, we make a tally mark
    - At the end, we know exactly which characters tend to follow which
    """

    # A nested dictionary: counts[char_a][char_b] = number of times
    # char_b appeared right after char_a
    # Example: counts['t']['h'] → 42 means 'h' followed 't' 42 times
    counts = {}

    # We look at pairs: text[i] and text[i+1]
    # If text has 100 characters (indices 0-99), the last valid pair is (98, 99)
    # So we go from 0 to 98, which is range(99) = range(len(text) - 1)
    for i in range(len(text) - 1):
        current_char = text[i]      # The character we're looking at now
        next_char = text[i + 1]     # The character that comes right after it

        # If we haven't seen current_char before, create an empty dict for it
        if current_char not in counts:
            counts[current_char] = {}

        # If we haven't seen this specific pair before, start its count at 0
        if next_char not in counts[current_char]:
            counts[current_char][next_char] = 0

        # Add one to the count! 📊
        # This is literally the entire "learning" process of our AI model.
        # That's it. Just counting.
        counts[current_char][next_char] += 1

    return counts

Let's walk through this with a tiny example. Suppose our text is "chai":

Step	`i`	`current_char`	`next_char`	What happens
1	0	`'c'`	`'h'`	`counts['c']['h'] = 1`
2	1	`'h'`	`'a'`	`counts['h']['a'] = 1`
3	2	`'a'`	`'i'`	`counts['a']['i'] = 1`

After processing, our model "knows" that after 'c', 'h' appeared once; after 'h', 'a' appeared once, etc. With a longer text, these counts build up into meaningful patterns.
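
To see the walkthrough run, here is a compact variant of the counting step. It is a sketch, not the course's `build_bigram_counts`: it swaps the manual dictionary checks for `defaultdict` and `Counter` from the standard library, and the name `count_bigrams` is hypothetical.

```python
from collections import Counter, defaultdict


def count_bigrams(text):
    """Compact variant of the counting step using the standard library."""
    counts = defaultdict(Counter)
    # Pair every character with the character that follows it
    for current_char, next_char in zip(text, text[1:]):
        counts[current_char][next_char] += 1
    return counts


counts = count_bigrams("chai")
print({char: dict(followers) for char, followers in counts.items()})
# {'c': {'h': 1}, 'h': {'a': 1}, 'a': {'i': 1}}

# With more text the tallies start to build up
print(count_bigrams("chai chaat")["c"]["h"])  # 2
```

The explicit version in the chapter does the same work step by step, which makes it easier to see that nothing more than counting is going on.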

Important

Key Insight: The entire "learning" process of this AI model is on one line: counts[current_char][next_char] += 1. That's it! Just counting. Yet this simple idea — learning patterns from data — is the foundation of ALL language models, including ChatGPT.

Converting Counts to Probabilities

Raw counts aren't enough. We need probabilities — "After seeing 't', what percentage of the time does 'h' come next?"

Python
def build_bigram_probabilities(counts):
    """
    Convert raw counts into probabilities.

    WHY probabilities instead of counts?
    Because we need to know: "After seeing 't', what PERCENTAGE of the time
    does 'h' come next?" — not just "how many times."

    Example:
        If 't' is followed by 'h' 40 times, 'e' 10 times, and 'o' 5 times:
        Total = 55
        P('h' | 't') = 40/55 = 0.727 (72.7% of the time!)
        P('e' | 't') = 10/55 = 0.182 (18.2%)
        P('o' | 't') = 5/55  = 0.091 (9.1%)
    """
    probabilities = {}

    for char, next_chars in counts.items():
        # Total count of all characters that followed this character
        total = sum(next_chars.values())

        probabilities[char] = {}
        for next_char, count in next_chars.items():
            # P(next | current) = count(current, next) / count(current)
            probabilities[char][next_char] = count / total

    return probabilities
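
The arithmetic in the docstring can be checked with a few lines. This is a minimal sketch for a single character's followers; `normalise` is an illustrative helper name, and the counts are the ones from the docstring example.

```python
def normalise(next_char_counts):
    """Turn one character's follower counts into probabilities."""
    total = sum(next_char_counts.values())
    return {char: count / total for char, count in next_char_counts.items()}


after_t = {"h": 40, "e": 10, "o": 5}   # counts from the docstring example
probs = normalise(after_t)

for char, p in probs.items():
    print(f"P({char!r} | 't') = {p:.3f}")

# The probabilities for one character should always add up to 1
print(abs(sum(probs.values()) - 1.0) < 1e-9)  # True
```

Dividing by the total is all that "converting to probabilities" means here: the ranking of the followers stays the same, but the numbers now sum to 1.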

2.4 Understanding Probability Distributions 🎲

Let's pause and understand a crucial concept: probability distributions.

The Dice Analogy

When you roll a fair die, each face has an equal probability:

P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = \frac{1}{6} \approx 0.167

This is a uniform distribution — all outcomes are equally likely.

The Loaded Dice

Now imagine a loaded die where 6 comes up more often:

P(1) = 0.1, \quad P(2) = 0.1, \quad P(3) = 0.1, \quad P(4) = 0.1, \quad P(5) = 0.1, \quad P(6) = 0.5

This is a non-uniform distribution. The probabilities still sum to 1, but some outcomes are more likely than others.
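
You can roll both dice in code to see the difference. The sketch below uses `random.choices` from the standard library, with a fixed seed so reruns give the same rolls; the number of rolls (10,000) is an arbitrary choice.

```python
import random
from collections import Counter

faces = [1, 2, 3, 4, 5, 6]
fair_weights = [1 / 6] * 6
loaded_weights = [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]

rng = random.Random(0)   # fixed seed so reruns give the same rolls
rolls = 10_000

fair = Counter(rng.choices(faces, weights=fair_weights, k=rolls))
loaded = Counter(rng.choices(faces, weights=loaded_weights, k=rolls))

# Compare how often each face came up with each die
for face in faces:
    print(face, f"fair={fair[face] / rolls:.2f}", f"loaded={loaded[face] / rolls:.2f}")
```

With enough rolls, the observed frequencies settle close to the probabilities: about 0.17 for every face of the fair die, and about 0.5 for the 6 on the loaded one.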

Characters as Loaded Dice

Our bigram model creates a loaded die for each character. For the character 't':

P(\text{next} \mid \text{current} = \text{'t'}) = \begin{cases} 0.73 & \text{if next = 'h'} \\ 0.18 & \text{if next = 'e'} \\ 0.09 & \text{if next = 'o'} \\ \ldots & \end{cases}

The formula is beautifully simple:

P(\text{next\_char} \mid \text{current\_char}) = \frac{\text{count(current\_char, next\_char)}}{\sum_{c} \text{count(current\_char, c)}}

Note

🤔 Think About It

Why do we need probabilities and not just pick the most common character every time? Because that would give us the SAME output every time! "th" → "the" → "the" → forever repeating. Probabilities + randomness = variety. Just like how you don't say the exact same sentences every day, even though you use the same language.
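
The repetition problem is easy to reproduce. The sketch below uses a small hand-made probability table (the numbers are invented for illustration, not measured from the training text) and compares always picking the top character with weighted sampling. The function names `greedy` and `sample` are hypothetical.

```python
import random

# Hand-made table for illustration; a real one comes from counting text
probs = {
    "t": {"h": 0.7, "o": 0.3},
    "h": {"e": 0.6, "a": 0.4},
    "e": {" ": 0.6, "a": 0.4},
    " ": {"t": 0.5, "a": 0.3, "o": 0.2},
    "a": {"t": 0.5, " ": 0.5},
    "o": {" ": 0.7, "t": 0.3},
}


def greedy(start, length):
    """Always pick the single most likely next character."""
    out = start
    for _ in range(length - 1):
        options = probs[out[-1]]
        out += max(options, key=options.get)
    return out


def sample(start, length, rng):
    """Pick the next character at random, weighted by probability."""
    out = start
    for _ in range(length - 1):
        options = probs[out[-1]]
        out += rng.choices(list(options), weights=list(options.values()))[0]
    return out


print(repr(greedy("t", 12)))                    # 'the the the '
print(repr(sample("t", 12, random.Random(1))))  # varies with the seed
```

The greedy version gets stuck in a loop because the same character always leads to the same successor. Sampling breaks the loop by occasionally taking a less likely path.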

2.5 Generating Text ✨

Now for the exciting part — making the model generate new text! The code in step2_generate.py uses our probability table to create text character by character.

Weighted Random Choice

First, we need a function that picks a random character based on probabilities. This is like throwing a dart at a ruler where each character occupies space proportional to its probability:

Python
import random


def weighted_random_choice(probability_dict):
    """
    Choose a random character based on probability weights.

    Imagine a ruler from 0 to 1:
    |---'h'(0.45)---|--'e'(0.20)--|--' '(0.15)--|--'a'(0.10)--|-others-|
    0              0.45          0.65           0.80          0.90     1.0

    We throw a dart at a random point on this ruler.
    Characters with bigger sections are more likely to be hit! 🎯
    """
    # random.random() gives us a random float between 0 and 1
    # This is our "dart throw" on the probability ruler
    r = random.random()

    # Walk along the ruler, accumulating probabilities
    cumulative = 0.0
    for char, prob in probability_dict.items():
        cumulative += prob
        # If our random number falls within this character's section → pick it!
        if r <= cumulative:
            return char

    # Fallback for floating-point precision edge cases
    return list(probability_dict.keys())[-1]

Example walkthrough: Suppose after 't', the probabilities are:

'h': 0.45 (occupies 0.00 to 0.45 on the ruler)
'e': 0.20 (occupies 0.45 to 0.65)
' ': 0.15 (occupies 0.65 to 0.80)
'a': 0.10 (occupies 0.80 to 0.90)
others: 0.10 (occupies 0.90 to 1.00)

If random.random() returns 0.37, we hit the 'h' section. If it returns 0.72, we hit the ' ' (space) section. Characters with higher probabilities have bigger sections, so they get picked more often!
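
To check the walkthrough by hand, it helps to pass the "dart throw" in as an argument instead of generating it. The sketch below is a hypothetical helper (`pick`) that mirrors the cumulative walk in `weighted_random_choice`; the `"others"` key is a placeholder standing in for all remaining characters.

```python
def pick(probability_dict, r):
    """Return the character whose section of the 0-1 ruler contains r."""
    cumulative = 0.0
    for char, prob in probability_dict.items():
        cumulative += prob
        if r <= cumulative:
            return char
    return list(probability_dict)[-1]   # guard against rounding error


# "others" is a placeholder for all remaining characters
after_t = {"h": 0.45, "e": 0.20, " ": 0.15, "a": 0.10, "others": 0.10}

print(repr(pick(after_t, 0.37)))   # 'h'
print(repr(pick(after_t, 0.72)))   # ' '
```

Python's standard library offers the same behaviour through `random.choices(population, weights=...)`. Writing the loop by hand, as the chapter does, simply makes the mechanism visible.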

The Generation Function

Now we chain these choices together to build text character by character:

Python
def generate_text(probabilities, start_char, length):
    """
    Generate text character by character using the bigram model.

    Algorithm:
    1. Start with a character
    2. Look up: "what characters can follow this one, and with what probability?"
    3. Randomly pick the next character (weighted by probability)
    4. Use THAT character as the new current character
    5. Repeat!
    """
    # Start with our seed character
    result = start_char
    current_char = start_char

    for _ in range(length - 1):
        # Check if we have data for this character
        if current_char not in probabilities:
            # Pick a random known character to continue
            current_char = random.choice(list(probabilities.keys()))

        # Use our weighted random choice to pick the next character
        next_char = weighted_random_choice(probabilities[current_char])

        # Add it to our result
        result += next_char

        # The next character becomes the current character
        # This is the "bigram" part — we only look at the LAST character!
        current_char = next_char

    return result

Step-by-Step Generation Example

Let's trace through generating text starting with 't':

Step	Current Char	Options (top 3)	Random Choice	Text So Far
1	`'t'`	`'h'`=73%, `'e'`=18%, `'o'`=9%	`'h'`	`"th"`
2	`'h'`	`'e'`=45%, `'a'`=25%, `'i'`=15%	`'e'`	`"the"`
3	`'e'`	`' '`=40%, `'r'`=15%, `'n'`=12%	`' '`	`"the "`
4	`' '`	`'a'`=12%, `'t'`=10%, `'o'`=9%	`'i'`	`"the i"`
5	`'i'`	`'n'`=35%, `'s'`=15%, `'t'`=12%	`'n'`	`"the in"`

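Notice how the model captured real English patterns: "the" is a real word! Also notice step 4: the sampled 'i' is not among the top three options shown, because weighted sampling sometimes picks a less likely character. After a few characters, the output drifts into gibberish because the model can only see ONE character of context.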
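
Here is a compact end-to-end sketch you can run to watch this happen. It is not the course's `step2_generate.py`: the tiny training sentence is made up for illustration, raw counts are used directly as sampling weights, and the seed is fixed so the output is repeatable.

```python
import random
from collections import Counter, defaultdict

# Made-up training sentence, for illustration only
text = "the cat sat on the mat and then the rat sat on the hat "

# "Training": count which character follows which
counts = defaultdict(Counter)
for a, b in zip(text, text[1:]):
    counts[a][b] += 1


def generate(start_char, length, rng):
    """Sample `length` characters, conditioning only on the last one."""
    out = start_char
    for _ in range(length - 1):
        followers = counts[out[-1]]
        chars = list(followers)
        weights = list(followers.values())   # raw counts work as weights
        out += rng.choices(chars, weights=weights)[0]
    return out


sample = generate("t", 40, random.Random(42))
print(sample)

# Every adjacent pair in the output was seen in the training text...
pairs = [a + b for a, b in zip(sample, sample[1:])]
print(all(pair in text for pair in pairs))  # True
# ...yet the output as a whole usually isn't a sensible sentence.
```

Each individual pair looks familiar, but nothing ties the pairs together into words, which is the one-character context limit in action.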

Tip

Key Insight 💡

This is the fundamental trade-off in language models: more context = better predictions. Our bigram model sees 1 character. A trigram sees 2. GPT-4 sees up to 128,000 tokens. That's why GPT-4 can write coherent essays and our bigram model cannot — same idea, vastly different context windows.

2.6 Discussion: Why Does This Work? 💭

Our bigram model generates text that has some English-like qualities:

Common pairs like "th", "he", "in", "an" appear frequently
Spaces appear at reasonable intervals (because they follow common letters)
Some short words occasionally form by chance

But it also produces mostly gibberish. Why?

Because the model has zero memory beyond the last character. When it sees a space, it doesn't know if the previous word was "the" or "elephant" — it treats them identically. It can't form words consistently, let alone sentences or paragraphs.

Important

The context window is everything. A bigram model has a context window of 1 character. Every improvement in language models from here to ChatGPT is essentially about making the context window larger and using it more effectively.

2.7 From Bigrams to N-grams

What if instead of looking at just the last character, we looked at the last TWO characters? That's a trigram model.

Model	Context	Looks At	Example
Bigram	1 char	`'t'` → predict next	Knows `'h'` often follows `'t'`
Trigram	2 chars	`'th'` → predict next	Knows `'e'` often follows `'th'`
4-gram	3 chars	`'the'` → predict next	Knows `' '` often follows `'the'`
5-gram	4 chars	`'the '` → predict next	Knows common words starting after `'the '`

More context = better predictions. But there's a problem: the number of possible combinations explodes. With 50 unique characters:

Bigrams: 50 × 50 = 2,500 possible pairs
Trigrams: 50³ = 125,000 possible triples
5-grams: 50⁵ = 312,500,000 possible combos!

You'd need enormous amounts of text to see enough of each combination. This is called the curse of dimensionality, and it's one reason why simple counting doesn't scale — and why we need neural networks (Chapter 3!).
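
The explosion is easy to verify. The sketch below assumes the same 50-character vocabulary as the example; `possible_ngrams` is an illustrative helper name.

```python
def possible_ngrams(vocab_size, n):
    """Number of distinct n-character sequences over a given vocabulary."""
    return vocab_size ** n


vocab_size = 50   # unique characters, as in the example

for n in (2, 3, 4, 5):
    print(f"{n}-grams: {possible_ngrams(vocab_size, n):>13,} possible combinations")
```

Compare this with the data available: a text of N characters contains at most N - n + 1 n-grams, so a paragraph of about a thousand characters can show the model only a tiny fraction of the 312,500,000 possible 5-grams.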

2.8 Key Concepts Summary

Concept	Definition
Bigram	A pair of two consecutive characters (or words). The building block of our first model.
Probability Distribution	A set of probabilities that tells us how likely each outcome is. All probabilities sum to 1.
Sampling	Randomly choosing an outcome based on a probability distribution. This is how we generate text.
Tokenization	Breaking text into units (characters, words, or subwords). Our model uses character-level tokenization.
Context Window	How much previous text the model can "see" when making a prediction. Bigram = 1 character.

2.9 Exercises 📝

Trigram Model: Modify build_bigram_counts to look at the last TWO characters instead of one. How does the generated text quality change?

Train on Hindi: Find a Hindi poem (try a Kabir doha or a Premchand excerpt) and train the bigram model on it. What patterns does it learn? Does it generate recognisable Hindi syllables?

Visualise Frequencies: Use matplotlib to create a bar chart of the 20 most common bigrams in the training text. Which patterns dominate?

Compare Models: Train two separate bigram models — one on the Indian science text, one on a cricket commentary. Compare the generated text. How are they different?

Tiny Data Experiment: What happens if you train the model on just 10 characters of text? What about 50? At what point does the model start generating somewhat reasonable output?

2.10 Discussion Questions 💭

Why can't bigrams write essays? What fundamental capability is missing? (Hint: think about what "understanding" means.)

Counting vs learning: Our model doesn't "learn" anything — it just counts. Is counting a form of learning? Where does counting end and real learning begin?

The vocabulary problem: Our model works on individual characters. What would change if we used whole words instead? What are the advantages and disadvantages?

Indian languages: Hindi uses Devanagari script, which has different character patterns than English. How would a bigram model trained on Hindi differ from one trained on English? What about Tamil or Bengali?

Scale and quality: If you had a bigram model trained on ALL the text on the internet, would it generate good text? Why or why not?

📝 Chapter Summary

You learned that prediction based on patterns is the fundamental idea behind all language models — from our simple bigram to ChatGPT.
You built a bigram model that counts character pairs and converts them into probabilities. The entire "learning" is just counting: counts[current][next] += 1.
You understood probability distributions — how counts become probabilities, and how we sample from them to generate text.
You generated new text character by character, watching the model pick each character based on what it saw before.
You recognised the limitations: a context window of 1 character means the model can capture some local patterns (like "th") but can't form coherent words or sentences.
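You saw the path forward: more context = better predictions, but counting doesn't scale. With a vocabulary of V characters, a table that conditions on n previous characters can need up to V^n rows, and the training text never fills most of them. We need something smarter.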
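
The two middle points of this summary, turning counts into probabilities and then sampling from them, fit in a few lines. The sketch below reuses the illustrative tallies from the build_bigram_probabilities docstring (40, 10 and 5 are made-up numbers, not measured from the sample text). It uses random.choices from the standard library in place of the chapter's hand-written weighted_random_choice, and the fixed seed is only there so that reruns give the same draws.

```python
import random

# Made-up tallies for what followed 't' (illustrative numbers only).
followers_of_t = {"h": 40, "e": 10, "o": 5}

# Counts become probabilities by dividing each count by the row total.
total = sum(followers_of_t.values())
probs = {ch: n / total for ch, n in followers_of_t.items()}
for ch, p in probs.items():
    print(f"P({ch!r} | 't') = {p:.3f}")

# Sampling: draw the next character in proportion to its probability.
rng = random.Random(0)  # fixed seed so reruns give the same draws
draws = rng.choices(list(probs), weights=list(probs.values()), k=1000)
share_h = draws.count("h") / len(draws)
print(f"'h' was drawn in {share_h:.0%} of {len(draws)} samples")
```

The share of 'h' in the draws should land close to its probability of about 73%, which is exactly what "weighted random choice" means.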

⏭️ What's Next?

Our bigram model has a fundamental problem: it can only count what it has explicitly seen. It can't generalise. It can't figure out that "th" and "sh" have something in common (both are usually followed by a vowel).
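
A tiny experiment makes this concrete. The sketch below assumes a three-word training text chosen for illustration, and bigram_probability is a hypothetical helper written for this example, not a function from the chapter's scripts. It estimates P(next | current) by counting, just as the chapter's model does.

```python
def bigram_probability(text, current, nxt):
    """Estimate P(nxt | current) by counting consecutive pairs in `text`.

    Hypothetical helper for this illustration only.
    """
    pair_count = 0
    current_count = 0
    for a, b in zip(text, text[1:]):   # every consecutive character pair
        if a == current:
            current_count += 1
            if b == nxt:
                pair_count += 1
    # A character that never appeared gives us nothing to divide by.
    return pair_count / current_count if current_count else 0.0


training_text = "the thin thing"
p_seen = bigram_probability(training_text, "t", "h")
p_unseen = bigram_probability(training_text, "s", "h")
print(p_seen, p_unseen)
```

The model has seen "th" three times, yet it assigns "sh" a probability of zero because the letter 's' never appeared. Nothing it counted about 't' carries over to 's'.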

What if, instead of counting, the computer could learn these patterns by itself? What if it could discover relationships we never told it about?

That's exactly what neural networks do. In Chapter 3: Neural Networks from Scratch, you'll build a single neuron, then connect many neurons into a network, and teach it to learn patterns through backpropagation, the algorithm used to train nearly every modern neural network.

Get ready — things are about to get really interesting! 🧠

"I hear and I forget. I see and I remember. I do and I understand." — Confucius

You just DID it. You built an AI model. Now you understand. 🙌

Complete Source Code - Chapter 2

Below are the complete, runnable source files for this chapter. Every line is included.

Complete Code: step1_bigram.py

Python
#!/usr/bin/env python3
"""
╔══════════════════════════════════════════════════════════════╗
║        LEVEL 1 — STEP 1: BUILDING A BIGRAM MODEL            ║
║                                                              ║
║   What we're doing:                                          ║
║   1. Take a piece of text                                    ║
║   2. Count which characters appear after which               ║
║   3. Build a probability table (our first "AI model"!)       ║
║                                                              ║
║   NO LIBRARIES NEEDED — Pure Python only!                    ║
╚══════════════════════════════════════════════════════════════╝
"""

# ============================================================================
# WHY: We don't use ANY external libraries in Level 1.
# This proves that AI concepts are simple enough to build from scratch.
# Python's built-in features are all we need!
# ============================================================================

import os
import sys

# ============================================================================
# ANSI COLOR CODES
# ============================================================================
# WHY: We use ANSI escape codes to make terminal output colorful and readable.
# These are special character sequences that terminals interpret as formatting.
# The format is: \033[CODEm  where CODE controls the color/style.
# \033 is the "escape" character — it tells the terminal "what follows is a command"
# ============================================================================

class Colors:
    """ANSI color codes for beautiful terminal output."""
    # Text colors
    RED = '\033[91m'       # Bright red — for errors or important warnings
    GREEN = '\033[92m'     # Bright green — for success messages
    YELLOW = '\033[93m'    # Bright yellow — for highlights and emphasis
    BLUE = '\033[94m'      # Bright blue — for headers and section titles
    MAGENTA = '\033[95m'   # Bright magenta — for special information
    CYAN = '\033[96m'      # Bright cyan — for data and values
    WHITE = '\033[97m'     # Bright white — for regular text

    # Text styles
    BOLD = '\033[1m'       # Bold text — makes text thicker/heavier
    DIM = '\033[2m'        # Dim text — makes text lighter/faded
    UNDERLINE = '\033[4m'  # Underlined text

    # Reset — IMPORTANT: always reset after coloring, or the whole terminal stays colored!
    RESET = '\033[0m'

    # Background colors
    BG_GREEN = '\033[42m'  # Green background
    BG_BLUE = '\033[44m'   # Blue background
    BG_MAGENTA = '\033[45m'  # Magenta background


def print_header():
    """Print a beautiful header for this script."""
    print(f"\n{Colors.BOLD}{Colors.CYAN}{'═' * 64}{Colors.RESET}")
    print(f"{Colors.BOLD}{Colors.YELLOW}  🧠 LEVEL 1 — STEP 1: BUILDING A BIGRAM MODEL{Colors.RESET}")
    print(f"{Colors.DIM}{Colors.WHITE}  Your first step into the world of AI!{Colors.RESET}")
    print(f"{Colors.BOLD}{Colors.CYAN}{'═' * 64}{Colors.RESET}\n")


def print_section(emoji, title, description=""):
    """Print a formatted section header."""
    print(f"\n{Colors.BOLD}{Colors.BLUE}{'─' * 60}{Colors.RESET}")
    print(f"{Colors.BOLD}{Colors.GREEN}  {emoji} {title}{Colors.RESET}")
    if description:
        print(f"{Colors.DIM}{Colors.WHITE}  {description}{Colors.RESET}")
    print(f"{Colors.BOLD}{Colors.BLUE}{'─' * 60}{Colors.RESET}\n")


def print_footer():
    """Print a beautiful footer for this script."""
    print(f"\n{Colors.BOLD}{Colors.CYAN}{'═' * 64}{Colors.RESET}")
    print(f"{Colors.BOLD}{Colors.GREEN}  ✅ Step 1 Complete! Your bigram model is ready.{Colors.RESET}")
    print(f"{Colors.DIM}{Colors.WHITE}  Next: Run step2_generate.py to generate text!{Colors.RESET}")
    print(f"{Colors.BOLD}{Colors.CYAN}{'═' * 64}{Colors.RESET}\n")


# ============================================================================
# THE SAMPLE TEXT
# ============================================================================
# WHY: We need a piece of text to analyze. We embed it directly in the code
# so this script runs without needing any external files.
# We chose a paragraph about Indian science & history — something meaningful
# and interesting to read while learning!
# ============================================================================

SAMPLE_TEXT = """India has a rich history of scientific achievement that spans thousands of years.
Ancient Indian mathematicians gave the world the concept of zero, which transformed
mathematics forever. Aryabhata, born in 476 CE, calculated the value of pi to four
decimal places and proposed that the earth rotates on its axis. Charaka and Sushruta
were pioneers of medicine and surgery in ancient India. The decimal number system that
the whole world uses today originated in India. Ramanujan, one of the greatest
mathematical geniuses, made extraordinary contributions to number theory and infinite
series. Chandrasekhar won the Nobel Prize for his work on the structure and evolution
of stars. ISRO, the Indian Space Research Organisation, successfully launched missions
to the Moon and Mars, making India one of the few nations to achieve this feat. From
the ancient universities of Nalanda and Takshashila to modern research institutions,
India has always been a land of knowledge and discovery."""


def build_bigram_counts(text):
    """
    Count how often each character follows each other character.

    This is the CORE of our "AI model"!

    Think of it like this:
    - We read the text one character at a time
    - For each pair of consecutive characters, we make a tally mark
    - At the end, we know exactly which characters tend to follow which

    Args:
        text (str): The input text to analyze

    Returns:
        dict: A nested dictionary where counts[char_a][char_b] = number of times
              char_b appeared right after char_a
    """

    # WHY a nested dictionary?
    # We want to look up: "Given character A, how many times did character B follow?"
    # A nested dict lets us do: counts['t']['h'] → 42 (meaning 'h' followed 't' 42 times)
    counts = {}

    # WHY range(len(text) - 1)?
    # Because we look at pairs: text[i] and text[i+1]
    # If text has 100 characters (indices 0-99), the last valid pair is (98, 99)
    # So we go from 0 to 98, which is range(99) = range(len(text) - 1)
    for i in range(len(text) - 1):
        current_char = text[i]      # The character we're looking at now
        next_char = text[i + 1]     # The character that comes right after it

        # If we haven't seen current_char before, create an empty dict for it
        # WHY: We can't add to a dict that doesn't exist yet
        if current_char not in counts:
            counts[current_char] = {}

        # If we haven't seen this specific pair before, start its count at 0
        if next_char not in counts[current_char]:
            counts[current_char][next_char] = 0

        # Add one to the count! 📊
        # This is literally the entire "learning" process of our AI model.
        # That's it. Just counting.
        counts[current_char][next_char] += 1

    return counts


def build_bigram_probabilities(counts):
    """
    Convert raw counts into probabilities.

    WHY probabilities instead of counts?
    Because we need to know: "After seeing 't', what PERCENTAGE of the time
    does 'h' come next?" — not just "how many times."

    Example:
        If 't' is followed by 'h' 40 times, 'e' 10 times, and 'o' 5 times:
        Total = 55
        P('h' | 't') = 40/55 = 0.727 (72.7% of the time!)
        P('e' | 't') = 10/55 = 0.182 (18.2%)
        P('o' | 't') = 5/55  = 0.091 (9.1%)

    Args:
        counts (dict): The bigram counts from build_bigram_counts()

    Returns:
        dict: A nested dictionary where probs[char_a][char_b] = probability
              that char_b follows char_a
    """
    probabilities = {}

    for char, next_chars in counts.items():
        # WHY sum all counts?
        # To calculate probability, we need: (count of this pair) / (total of all pairs starting with this char)
        total = sum(next_chars.values())

        probabilities[char] = {}
        for next_char, count in next_chars.items():
            # This is conditional probability estimated from counts (not Bayes' theorem):
            # P(next | current) = count(current, next) / count(current)
            probabilities[char][next_char] = count / total

    return probabilities


def display_text_info(text):
    """Display information about the sample text."""
    # Count unique characters
    unique_chars = sorted(set(text))
    num_unique = len(unique_chars)
    total_chars = len(text)
    total_bigrams = total_chars - 1  # WHY -1? A text of n characters contains n - 1 consecutive pairs

    print(f"  {Colors.CYAN}📄 Text length:{Colors.RESET}          {Colors.BOLD}{total_chars}{Colors.RESET} characters")
    print(f"  {Colors.CYAN}🔤 Unique characters:{Colors.RESET}    {Colors.BOLD}{num_unique}{Colors.RESET}")
    print(f"  {Colors.CYAN}🔗 Total bigrams:{Colors.RESET}        {Colors.BOLD}{total_bigrams}{Colors.RESET}")
    print()

    # Show the unique characters in a nice format
    print(f"  {Colors.YELLOW}Characters found:{Colors.RESET}")
    # WHY do we display characters? So the student can see exactly what the model will learn from
    line = "  "
    for ch in unique_chars:
        # Replace invisible characters with readable names
        if ch == ' ':
            display = '␣'  # Space symbol
        elif ch == '\n':
            display = '↵'  # Newline symbol
        elif ch == '\t':
            display = '→'  # Tab symbol
        else:
            display = ch
        line += f" {Colors.GREEN}[{display}]{Colors.RESET}"
        if len(line) > 200:  # Prevent super-long lines in terminal
            print(line)
            line = "  "
    if line.strip():
        print(line)


def display_top_bigrams(counts, top_n=25):
    """
    Display the most frequent bigrams in a beautiful table.

    WHY show this?
    This helps students SEE the patterns that the model is learning.
    When they see that 'th' appears 30 times, they'll understand why
    the model generates 'th' so often!
    """
    # Flatten the nested dict into a list of (char_a, char_b, count) tuples
    # WHY flatten? So we can sort ALL bigrams together by count
    all_bigrams = []
    for char_a, next_chars in counts.items():
        for char_b, count in next_chars.items():
            all_bigrams.append((char_a, char_b, count))

    # Sort by count, highest first
    # WHY key=lambda x: x[2]? Because x[2] is the count (third element of tuple)
    # WHY reverse=True? We want the MOST common bigrams first
    all_bigrams.sort(key=lambda x: x[2], reverse=True)

    # Print table header
    print(f"  {Colors.BOLD}{Colors.WHITE}{'Rank':<6} {'Bigram':<12} {'Visual':<16} {'Count':<8} {'Bar'}{Colors.RESET}")
    print(f"  {Colors.DIM}{'─' * 56}{Colors.RESET}")

    # WHY do we only show top_n? Because there could be hundreds of bigrams,
    # and showing all of them would be overwhelming. The top ones tell the story.
    max_count = all_bigrams[0][2] if all_bigrams else 1

    for rank, (char_a, char_b, count) in enumerate(all_bigrams[:top_n], 1):
        # Make invisible characters readable
        display_a = '␣' if char_a == ' ' else ('↵' if char_a == '\n' else char_a)
        display_b = '␣' if char_b == ' ' else ('↵' if char_b == '\n' else char_b)

        # Create a visual bar — length proportional to count
        # WHY? Visual bars make it MUCH easier to compare frequencies at a glance
        bar_length = int((count / max_count) * 20)
        bar = '█' * bar_length

        # Color the top 5 differently to highlight them
        if rank <= 5:
            color = Colors.YELLOW
        elif rank <= 10:
            color = Colors.CYAN
        else:
            color = Colors.WHITE

        bigram_str = f"'{display_a}{display_b}'"
        arrow_str = f"'{display_a}' → '{display_b}'"

        print(f"  {color}{rank:<6} {bigram_str:<12} {arrow_str:<16} {count:<8} {Colors.GREEN}{bar}{Colors.RESET}")

    print(f"\n  {Colors.DIM}(Showing top {top_n} of {len(all_bigrams)} unique bigrams){Colors.RESET}")


def display_character_analysis(counts):
    """
    For a few interesting characters, show what typically follows them.

    WHY?
    This helps students build intuition about language patterns.
    They can see, for example, that 'h' is a frequent follower of 't', just like in real English!
    """
    # Pick some interesting characters to analyze
    interesting = ['t', 'a', 'e', ' ', 'i', 's']

    for char in interesting:
        if char not in counts:
            continue

        next_chars = counts[char]
        total = sum(next_chars.values())

        # Sort followers by count
        sorted_followers = sorted(next_chars.items(), key=lambda x: x[1], reverse=True)

        display_char = '␣ (space)' if char == ' ' else f"'{char}'"
        print(f"  {Colors.BOLD}{Colors.YELLOW}After {display_char}:{Colors.RESET}  ", end="")

        # Show top 5 followers
        parts = []
        for next_char, count in sorted_followers[:5]:
            pct = (count / total) * 100
            display_next = '␣' if next_char == ' ' else ('↵' if next_char == '\n' else next_char)
            parts.append(f"{Colors.CYAN}'{display_next}'{Colors.RESET}={Colors.GREEN}{pct:.0f}%{Colors.RESET}")

        print("  ".join(parts))


def display_probability_matrix(counts, top_chars=10):
    """
    Display a small probability matrix — a grid showing character relationships.

    WHY a matrix?
    This is how we VISUALIZE the model's "knowledge." Each cell shows how likely
    character B is to follow character A. This is literally the model's brain!
    """
    # Find the most common characters overall
    # WHY? We can't show ALL characters (too many), so we pick the most frequent ones
    char_frequency = {}
    for char_a, next_chars in counts.items():
        for char_b, count in next_chars.items():
            char_frequency[char_a] = char_frequency.get(char_a, 0) + count
            char_frequency[char_b] = char_frequency.get(char_b, 0) + count

    # Get top characters by frequency
    top = sorted(char_frequency.items(), key=lambda x: x[1], reverse=True)[:top_chars]
    top_char_list = [ch for ch, _ in top]

    # Print header row
    header = f"  {Colors.BOLD}{Colors.WHITE}{'':>7}"
    for ch in top_char_list:
        display = '␣' if ch == ' ' else ('↵' if ch == '\n' else ch)
        header += f" {Colors.CYAN}{display:>5}{Colors.RESET}"
    print(header)
    print(f"  {'':>6}{Colors.DIM}{'─' * (6 * len(top_char_list))}{Colors.RESET}")

    # Print each row
    for ch_a in top_char_list:
        display_a = '␣' if ch_a == ' ' else ('↵' if ch_a == '\n' else ch_a)
        row = f"  {Colors.CYAN}{display_a:>5}{Colors.RESET} │"

        for ch_b in top_char_list:
            if ch_a in counts and ch_b in counts[ch_a]:
                total = sum(counts[ch_a].values())
                prob = counts[ch_a][ch_b] / total
                # Color-code by probability
                if prob > 0.3:
                    color = Colors.GREEN
                elif prob > 0.1:
                    color = Colors.YELLOW
                else:
                    color = Colors.DIM
                row += f" {color}{prob:>5.0%}{Colors.RESET}"
            else:
                row += f" {Colors.DIM}{'  · ':>5}{Colors.RESET}"

        print(row)

    print(f"\n  {Colors.DIM}(Cells show: probability that column character follows row character){Colors.RESET}")


# ============================================================================
# MAIN EXECUTION
# ============================================================================
# WHY if __name__ == '__main__'?
# This is a Python convention that means: "Only run this code if this file
# is being executed directly, NOT if it's being imported by another file."
# This keeps the functions above importable from other scripts without re-running the demo.
# ============================================================================

if __name__ == '__main__':
    # ── Print the header ──
    print_header()

    # ── Step 1: Show the text we'll analyze ──
    print_section("🔍", "Step 1: Reading the text...",
                  "This is the text our AI will learn from")

    # Show a preview of the text (first 200 chars)
    preview = SAMPLE_TEXT[:200].replace('\n', ' ')
    print(f"  {Colors.WHITE}\"{preview}...\"{Colors.RESET}")
    print()

    # Show text statistics
    display_text_info(SAMPLE_TEXT)

    # ── Step 2: Count bigram frequencies ──
    print_section("📊", "Step 2: Counting bigram patterns...",
                  "For every character, we count what comes after it")

    # WHY lowercase? To treat 'T' and 't' as the same character.
    # Otherwise 'The' and 'the' would create different patterns, splitting our data.
    text_lower = SAMPLE_TEXT.lower()

    # This is where the magic happens! 🪄
    print(f"  {Colors.YELLOW}⏳ Counting patterns...{Colors.RESET}", end=" ", flush=True)
    bigram_counts = build_bigram_counts(text_lower)
    print(f"{Colors.GREEN}Done!{Colors.RESET}")

    # How many unique patterns did we find?
    total_unique = sum(len(v) for v in bigram_counts.values())
    print(f"  {Colors.CYAN}Found {Colors.BOLD}{total_unique}{Colors.RESET}{Colors.CYAN} unique bigram patterns!{Colors.RESET}")

    # ── Step 3: Display the results ──
    print_section("🏆", "Step 3: Top Bigram Patterns",
                  "The most common character pairs in our text")

    display_top_bigrams(bigram_counts, top_n=25)

    # ── Step 4: Character analysis ──
    print_section("🔬", "Step 4: Character Analysis",
                  "What typically follows each character?")

    display_character_analysis(bigram_counts)

    # ── Step 5: Probability matrix ──
    print_section("🧮", "Step 5: The Probability Matrix (Our AI's Brain!)",
                  "Each cell = probability that column follows row")

    display_probability_matrix(bigram_counts, top_chars=10)

    # ── Step 6: Build probabilities ──
    print_section("📐", "Step 6: Converting Counts to Probabilities",
                  "Probabilities let us make weighted random choices")

    probs = build_bigram_probabilities(bigram_counts)

    # Show a couple of examples
    example_chars = ['t', 'a', 'i']
    for ch in example_chars:
        if ch in probs:
            sorted_probs = sorted(probs[ch].items(), key=lambda x: x[1], reverse=True)
            total = sum(p for _, p in sorted_probs)
            print(f"  {Colors.BOLD}After '{ch}':{Colors.RESET}")
            for next_ch, prob in sorted_probs[:5]:
                display_next = '␣' if next_ch == ' ' else ('↵' if next_ch == '\n' else next_ch)
                bar = '▓' * int(prob * 30)
                print(f"    '{display_next}': {Colors.CYAN}{prob:.1%}{Colors.RESET}  {Colors.GREEN}{bar}{Colors.RESET}")
            remaining = len(sorted_probs) - 5
            if remaining > 0:
                print(f"    {Colors.DIM}... and {remaining} more possible characters{Colors.RESET}")
            print()

    # ── Summary ──
    print_section("🎓", "What You Just Built!",
                  "Let's recap what happened")

    print(f"""  {Colors.WHITE}1. You took a piece of text ({len(SAMPLE_TEXT)} characters){Colors.RESET}
  {Colors.WHITE}2. You counted every pair of consecutive characters{Colors.RESET}
  {Colors.WHITE}3. You converted those counts into probabilities{Colors.RESET}
  {Colors.WHITE}4. You now have a PROBABILITY TABLE — this IS the model!{Colors.RESET}

  {Colors.BOLD}{Colors.YELLOW}🧠 This probability table is your AI's "brain."{Colors.RESET}
  {Colors.WHITE}It "knows" that after 'e', a space is most common.{Colors.RESET}
  {Colors.WHITE}It "knows" that after 't', 'h' appears often.{Colors.RESET}
  {Colors.WHITE}It "knows" the character patterns of this English text!{Colors.RESET}

  {Colors.MAGENTA}➡️  Now run step2_generate.py to generate text using this model!{Colors.RESET}""")

    # Print footer
    print_footer()
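
Once the explicit version above feels familiar, it can be instructive to compare it with a shorter form. The sketch below is an alternative written for comparison, not part of step1_bigram.py: build_bigram_counts_compact is a hypothetical name, and it relies on collections.Counter and collections.defaultdict from the standard library, so it is still pure Python.

```python
from collections import Counter, defaultdict


def build_bigram_counts_compact(text):
    """Same tallies as build_bigram_counts, written with Counter and zip.

    Hypothetical name; this function is not part of step1_bigram.py.
    """
    counts = defaultdict(Counter)
    # zip(text, text[1:]) pairs every character with the one after it.
    for current_char, next_char in zip(text, text[1:]):
        counts[current_char][next_char] += 1
    return counts


compact_counts = build_bigram_counts_compact("that hat")
print(compact_counts["t"].most_common())
print(compact_counts["a"]["t"])
```

One design difference to keep in mind: looking up a character that never occurred in a defaultdict silently creates an empty entry for it, whereas the plain dictionary in the original raises a KeyError. The explicit loop is longer, but it hides nothing, which is why it suits a first encounter with the idea.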

Complete Code: step2_generate.py

Python
#!/usr/bin/env python3
"""
╔══════════════════════════════════════════════════════════════╗
║        LEVEL 1 — STEP 2: GENERATING TEXT WITH BIGRAMS       ║
║                                                              ║
║   What we're doing:                                          ║
║   1. Rebuild our bigram model (so this script runs alone)    ║
║   2. Use probability distributions to pick next characters   ║
║   3. Generate brand new text, character by character!        ║
║                                                              ║
║   NO LIBRARIES NEEDED — Pure Python only!                    ║
╚══════════════════════════════════════════════════════════════╝
"""

# ============================================================================
# WHY: We do NOT import from step1. Everything this script needs is defined here,
# so it runs on its own even if step1_bigram.py is missing.
# The bigram model is rebuilt from scratch below.
# ============================================================================

import os
import sys
import random  # WHY: We need random number generation to sample from probabilities
import time    # WHY: We use time.sleep() to create a dramatic text generation effect

# ============================================================================
# ANSI COLOR CODES (same as step1 — duplicated so this script is self-contained)
# ============================================================================

class Colors:
    """ANSI color codes for beautiful terminal output."""
    RED = '\033[91m'
    GREEN = '\033[92m'
    YELLOW = '\033[93m'
    BLUE = '\033[94m'
    MAGENTA = '\033[95m'
    CYAN = '\033[96m'
    WHITE = '\033[97m'
    BOLD = '\033[1m'
    DIM = '\033[2m'
    UNDERLINE = '\033[4m'
    RESET = '\033[0m'
    BG_GREEN = '\033[42m'
    BG_BLUE = '\033[44m'
    BG_YELLOW = '\033[43m'


def print_header():
    """Print a beautiful header for this script."""
    print(f"\n{Colors.BOLD}{Colors.MAGENTA}{'═' * 64}{Colors.RESET}")
    print(f"{Colors.BOLD}{Colors.YELLOW}  🚀 LEVEL 1 — STEP 2: GENERATING TEXT WITH BIGRAMS{Colors.RESET}")
    print(f"{Colors.DIM}{Colors.WHITE}  Watch your AI create text character by character!{Colors.RESET}")
    print(f"{Colors.BOLD}{Colors.MAGENTA}{'═' * 64}{Colors.RESET}\n")


def print_section(emoji, title, description=""):
    """Print a formatted section header."""
    print(f"\n{Colors.BOLD}{Colors.BLUE}{'─' * 60}{Colors.RESET}")
    print(f"{Colors.BOLD}{Colors.GREEN}  {emoji} {title}{Colors.RESET}")
    if description:
        print(f"{Colors.DIM}{Colors.WHITE}  {description}{Colors.RESET}")
    print(f"{Colors.BOLD}{Colors.BLUE}{'─' * 60}{Colors.RESET}\n")


def print_footer():
    """Print a beautiful footer for this script."""
    print(f"\n{Colors.BOLD}{Colors.MAGENTA}{'═' * 64}{Colors.RESET}")
    print(f"{Colors.BOLD}{Colors.GREEN}  🎉 Congratulations! You just built your first AI model!{Colors.RESET}")
    print(f"{Colors.DIM}{Colors.WHITE}  It's simple, but the core idea is the same as ChatGPT.{Colors.RESET}")
    print(f"{Colors.DIM}{Colors.WHITE}  Next: Level 2 — Neural Networks!{Colors.RESET}")
    print(f"{Colors.BOLD}{Colors.MAGENTA}{'═' * 64}{Colors.RESET}\n")


# ============================================================================
# THE SAME SAMPLE TEXT FROM STEP 1
# ============================================================================
# WHY duplicate the text? So this script is 100% self-contained and runnable
# even if step1_bigram.py isn't in the same directory.
# ============================================================================

SAMPLE_TEXT = """India has a rich history of scientific achievement that spans thousands of years.
Ancient Indian mathematicians gave the world the concept of zero, which transformed
mathematics forever. Aryabhata, born in 476 CE, calculated the value of pi to four
decimal places and proposed that the earth rotates on its axis. Charaka and Sushruta
were pioneers of medicine and surgery in ancient India. The decimal number system that
the whole world uses today originated in India. Ramanujan, one of the greatest
mathematical geniuses, made extraordinary contributions to number theory and infinite
series. Chandrasekhar won the Nobel Prize for his work on the structure and evolution
of stars. ISRO, the Indian Space Research Organisation, successfully launched missions
to the Moon and Mars, making India one of the few nations to achieve this feat. From
the ancient universities of Nalanda and Takshashila to modern research institutions,
India has always been a land of knowledge and discovery."""


def build_bigram_model(text):
    """
    Build a complete bigram model: counts + probabilities.

    WHY rebuild instead of importing?
    While we COULD import from step1, rebuilding here makes this script
    completely self-contained. A student can run this file alone without
    worrying about imports or file paths.

    Args:
        text (str): Input text to learn from

    Returns:
        tuple: (counts dict, probabilities dict)
    """
    # Step A: Count bigrams (same logic as step1)
    counts = {}
    for i in range(len(text) - 1):
        current = text[i]
        next_ch = text[i + 1]

        if current not in counts:
            counts[current] = {}
        if next_ch not in counts[current]:
            counts[current][next_ch] = 0
        counts[current][next_ch] += 1

    # Step B: Convert to probabilities
    probabilities = {}
    for char, next_chars in counts.items():
        total = sum(next_chars.values())
        probabilities[char] = {}
        for next_char, count in next_chars.items():
            probabilities[char][next_char] = count / total

    return counts, probabilities


def weighted_random_choice(probability_dict):
    """
    Choose a random character based on probability weights.

    WHY not just random.choice()?
    random.choice() picks uniformly — every option is equally likely.
    But we want WEIGHTED randomness: 'h' after 't' should be picked more often
    than 'z' after 't', because 'th' is way more common than 'tz'!

    HOW IT WORKS:
    Imagine a ruler from 0 to 1:
    |---'h'(0.45)---|--'e'(0.20)--|--' '(0.15)--|--'a'(0.10)--|-others-|
    0              0.45          0.65           0.80          0.90     1.0

    We throw a dart at a random point on this ruler.
    Characters with bigger sections are more likely to be hit! 🎯

    Args:
        probability_dict (dict): {character: probability}

    Returns:
        str: The randomly chosen character
    """
    # WHY random.random()? It gives us a random float in [0.0, 1.0): 0 is possible, 1 is not.
    # This is our "dart throw" on the probability ruler.
    r = random.random()

    # Walk along the ruler, accumulating probabilities
    cumulative = 0.0
    for char, prob in probability_dict.items():
        cumulative += prob
        # If our random number falls within this character's section → pick it!
        if r <= cumulative:
            return char

    # WHY this fallback? Due to floating-point precision, cumulative might not
    # reach exactly 1.0. If we somehow get past all entries, return the last one.
    return list(probability_dict.keys())[-1]


def generate_text(probabilities, start_char, length):
    """
    Generate text character by character using the bigram model.

    This is the moment of truth! 🎬

    Algorithm:
    1. Start with a character
    2. Look up: "what characters can follow this one, and with what probability?"
    3. Randomly pick the next character (weighted by probability)
    4. Use THAT character as the new current character
    5. Repeat!

    Args:
        probabilities (dict): Our bigram probability model
        start_char (str): The first character to start with
        length (int): How many characters to generate

    Returns:
        str: The generated text
    """
    # Start with our seed character
    result = start_char
    current_char = start_char

    for _ in range(length - 1):
        # Check if we have data for this character
        # WHY might we not? If a character only appears at the END of the text,
        # we never saw what follows it, so it's not in our probability table
        if current_char not in probabilities:
            # Pick a random known character to continue
            current_char = random.choice(list(probabilities.keys()))

        # Use our weighted random choice to pick the next character
        next_char = weighted_random_choice(probabilities[current_char])

        # Add it to our result
        result += next_char

        # The next character becomes the current character for the next iteration
        # This is the "bigram" part — we only look at the LAST character!
        current_char = next_char

    return result


def generate_text_animated(probabilities, start_char, length, delay=0.03):
    """
    Generate text with an animated display — watch it appear character by character!

    WHY animation?
    It helps students FEEL how the model works: each character is chosen
    one at a time, based only on the previous character. The slight delay
    makes each decision visible and tangible.

    Args:
        probabilities (dict): Our bigram probability model
        start_char (str): The first character to start with
        length (int): How many characters to generate
        delay (float): Seconds between characters (for dramatic effect!)

    Returns:
        str: The generated text
    """
    result = start_char
    current_char = start_char

    # Print the first character
    sys.stdout.write(f"  {Colors.GREEN}")
    sys.stdout.write(start_char)
    sys.stdout.flush()

    for i in range(length - 1):
        if current_char not in probabilities:
            current_char = random.choice(list(probabilities.keys()))

        next_char = weighted_random_choice(probabilities[current_char])
        result += next_char

        # Print each character with a tiny delay for dramatic effect ✨
        sys.stdout.write(next_char)
        sys.stdout.flush()
        time.sleep(delay)

        current_char = next_char

    sys.stdout.write(f"{Colors.RESET}\n")
    return result


def show_generation_process(probabilities, start_char='t', steps=10):
    """
    Show the step-by-step decision process of text generation.

    WHY show this?
    This is the most important educational part! Students can see EXACTLY
    how the model "thinks" — what options it considers and why it picks each one.
    """
    current = start_char
    generated = start_char

    print(f"  {Colors.BOLD}Starting character: '{start_char}'{Colors.RESET}")
    print()

    for step in range(steps):
        if current not in probabilities:
            current = random.choice(list(probabilities.keys()))

        # Get the top options
        options = sorted(probabilities[current].items(), key=lambda x: x[1], reverse=True)

        # Show the decision process
        display_current = '␣' if current == ' ' else ('↵' if current == '\n' else current)
        print(f"  {Colors.YELLOW}Step {step + 1}:{Colors.RESET} Current char = '{Colors.CYAN}{display_current}{Colors.RESET}'")

        # Show the top 4 options
        print(f"    {Colors.DIM}Options:", end="")
        for ch, p in options[:4]:
            display = '␣' if ch == ' ' else ('↵' if ch == '\n' else ch)
            print(f" '{display}'={p:.0%}", end="")
        if len(options) > 4:
            print(f" ...+{len(options)-4} more", end="")
        print(f"{Colors.RESET}")

        # Make the choice
        next_char = weighted_random_choice(probabilities[current])
        display_next = '␣' if next_char == ' ' else ('↵' if next_char == '\n' else next_char)
        print(f"    {Colors.GREEN}→ Chose: '{display_next}'{Colors.RESET}")

        generated += next_char
        current = next_char

        # Show the text so far
        print(f"    {Colors.DIM}Text so far: \"{generated}\"{Colors.RESET}")
        print()

    return generated


# ============================================================================
# MAIN EXECUTION
# ============================================================================

if __name__ == '__main__':
    # Optional: set a random seed for reproducibility.
    # WHY a seed? So students get consistent results when first running the code.
    # It is disabled by default, so every run produces different text.
    # random.seed(42)  # Uncomment this line for reproducible results

    # ── Print the header ──
    print_header()

    # ── Step 1: Rebuild the model ──
    print_section("🔧", "Step 1: Rebuilding the bigram model...",
                  "Training our AI on the sample text")

    text_lower = SAMPLE_TEXT.lower()
    counts, probs = build_bigram_model(text_lower)

    total_unique = sum(len(v) for v in counts.values())
    print(f"  {Colors.GREEN}✓{Colors.RESET} Model built! Learned {Colors.BOLD}{total_unique}{Colors.RESET} patterns")
    print(f"  {Colors.GREEN}✓{Colors.RESET} Vocabulary: {Colors.BOLD}{len(counts)}{Colors.RESET} unique characters")

    # ── Step 2: Show the generation process ──
    print_section("🔬", "Step 2: The Generation Process (Step by Step)",
                  "Watch how the AI 'decides' each character")

    print(f"  {Colors.YELLOW}Let's see exactly how text generation works:{Colors.RESET}")
    print(f"  {Colors.DIM}The model looks at the current character, checks its{Colors.RESET}")
    print(f"  {Colors.DIM}probability table, and randomly picks the next one.{Colors.RESET}\n")

    show_generation_process(probs, start_char='t', steps=10)

    # ── Step 3: Generate multiple samples ──
    print_section("📝", "Step 3: Generating Text Samples",
                  "Let's generate text of different lengths")

    samples = [
        ('t', 30, "Short (30 chars, starting with 't')"),
        ('i', 60, "Medium (60 chars, starting with 'i')"),
        ('a', 100, "Longer (100 chars, starting with 'a')"),
        ('t', 200, "Full paragraph (200 chars, starting with 't')"),
    ]

    for start, length, description in samples:
        print(f"  {Colors.BOLD}{Colors.YELLOW}📌 {description}:{Colors.RESET}")
        generated = generate_text(probs, start, length)
        # Clean up for display (replace newlines with spaces)
        display_text = generated.replace('\n', ' ')
        print(f"  {Colors.CYAN}\"{display_text}\"{Colors.RESET}")
        print()

    # ── Step 4: Animated generation ──
    print_section("🎬", "Step 4: Live Text Generation (Animated!)",
                  "Watch text appear character by character...")

    print(f"  {Colors.YELLOW}Generating 150 characters starting with 'i'...{Colors.RESET}")
    print(f"  {Colors.DIM}(Each character is chosen one at a time based on the previous one){Colors.RESET}\n")

    animated_text = generate_text_animated(probs, 'i', 150, delay=0.02)
    print()

    # ── Step 5: Compare real vs generated ──
    print_section("⚖️", "Step 5: Real Text vs Generated Text",
                  "Can you spot the difference?")

    # Get a chunk of real text
    real_chunk = text_lower[50:200].replace('\n', ' ')
    generated_chunk = generate_text(probs, 't', 150).replace('\n', ' ')

    print(f"  {Colors.BOLD}{Colors.GREEN}📗 REAL TEXT:{Colors.RESET}")
    print(f"  {Colors.WHITE}\"{real_chunk}\"{Colors.RESET}")
    print()
    print(f"  {Colors.BOLD}{Colors.MAGENTA}🤖 GENERATED TEXT:{Colors.RESET}")
    print(f"  {Colors.WHITE}\"{generated_chunk}\"{Colors.RESET}")
    print()

    print(f"  {Colors.YELLOW}📊 Analysis:{Colors.RESET}")
    print(f"  {Colors.WHITE}• The real text makes perfect sense — coherent sentences{Colors.RESET}")
    print(f"  {Colors.WHITE}• The generated text has SOME English patterns (common pairs like{Colors.RESET}")
    print(f"  {Colors.WHITE}  'th', 'he', 'in') but is mostly gibberish{Colors.RESET}")
    print(f"  {Colors.WHITE}• WHY? Our model only looks at ONE previous character!{Colors.RESET}")
    print(f"  {Colors.WHITE}  It has no idea about words, grammar, or meaning.{Colors.RESET}")
    print()

    # ── Step 6: Multiple runs show randomness ──
    print_section("🎲", "Step 6: Randomness in Action",
                  "Same starting character, different results each time!")

    print(f"  {Colors.YELLOW}Three different generations, all starting with 't':{Colors.RESET}\n")
    for run in range(1, 4):
        gen = generate_text(probs, 't', 80).replace('\n', ' ')
        print(f"  {Colors.CYAN}Run {run}:{Colors.RESET} \"{gen}\"")
    print()
    print(f"  {Colors.DIM}Each run is different because we use weighted RANDOM choices!{Colors.RESET}")
    print(f"  {Colors.DIM}The probabilities are the same, but the random dice rolls differ.{Colors.RESET}")

    # ── Key Insights ──
    print_section("💡", "Key Insights — What Did We Learn?")

    print(f"""  {Colors.WHITE}1. {Colors.BOLD}A bigram model is just a LOOKUP TABLE{Colors.RESET}
     {Colors.DIM}For each character, it stores what might come next{Colors.RESET}

  {Colors.WHITE}2. {Colors.BOLD}Generation = repeated random sampling{Colors.RESET}
     {Colors.DIM}Pick a char → look up options → randomly choose → repeat{Colors.RESET}

  {Colors.WHITE}3. {Colors.BOLD}Context window = 1 character{Colors.RESET}
     {Colors.DIM}Our model only looks at the LAST character, which is why{Colors.RESET}
     {Colors.DIM}it can't form real words or sentences{Colors.RESET}

  {Colors.WHITE}4. {Colors.BOLD}More context = better predictions{Colors.RESET}
     {Colors.DIM}GPT-4 Turbo can look at up to 128,000 tokens of context, one reason it's so capable!{Colors.RESET}

  {Colors.WHITE}5. {Colors.BOLD}The CORE IDEA is the same across all language models:{Colors.RESET}
     {Colors.YELLOW}  → Learn patterns from data{Colors.RESET}
     {Colors.YELLOW}  → Use those patterns to predict what comes next{Colors.RESET}

  {Colors.BOLD}{Colors.GREEN}╔══════════════════════════════════════════════════════════╗
  ║  You've gone from ZERO knowledge to building a model  ║
  ║  that generates text! That's incredible! 🌟             ║
  ╚══════════════════════════════════════════════════════════╝{Colors.RESET}

  {Colors.MAGENTA}⏭️  Ready for the next level? In Level 2, we'll build a{Colors.RESET}
  {Colors.MAGENTA}   NEURAL NETWORK that actually learns and improves!{Colors.RESET}""")

    # ── Print footer with celebration ──
    print_footer()

    # Final celebration! 🎉
    print(f"  {Colors.BOLD}{Colors.YELLOW}🎉 CONGRATULATIONS! You just built your first AI model! 🎉{Colors.RESET}")
    print(f"  {Colors.WHITE}You now understand:{Colors.RESET}")
    print(f"  {Colors.GREEN}  ✓ What prediction means{Colors.RESET}")
    print(f"  {Colors.GREEN}  ✓ How bigrams capture patterns{Colors.RESET}")
    print(f"  {Colors.GREEN}  ✓ How counting = learning{Colors.RESET}")
    print(f"  {Colors.GREEN}  ✓ How sampling = generation{Colors.RESET}")
    print(f"  {Colors.WHITE}These are the building blocks of EVERY language model.{Colors.RESET}\n")

Chapter 3

Neural Networks from Scratch

"The brain is a world consisting of a number of unexplored continents and great stretches of unknown territory." — Santiago Ramón y Cajal, Nobel Prize-winning neuroscientist

Learning Objectives

Explain why counting patterns (bigrams) has fundamental limitations
Describe how a biological neuron works and how we model it mathematically
Build a single artificial neuron from scratch and train it to learn logic gates
Understand activation functions (sigmoid, ReLU) and why non-linearity matters
Build a multi-layer neural network with forward pass computation
Explain backpropagation — the algorithm that makes neural networks learn
Train a character-level neural network to generate text
Interpret training loss curves and understand what they tell us about learning

3.1 From Counting to Learning 🔄

In Chapter 2, you built something remarkable — a model that generates text by counting character patterns. But you also saw its limitations. The bigram model:

Can only look at one character of context
Has no ability to generalise — if it hasn't seen a pattern, it can't predict it
Treats every character as completely independent from every other (it doesn't know 'a' and 'e' are both vowels)
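Scales badly with more context: a table keyed on the previous n characters needs an entry for every possible combination, and that number grows exponentially with n (the curse of dimensionality)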

The bigram model does exactly what we tell it: count. But what if the computer could learn patterns by itself? What if, instead of us defining the rules, the machine could discover them?

That's exactly what neural networks do. And the beautiful thing is — the core idea is surprisingly simple.

Important

The Fundamental Shift: A bigram model is programmed — we tell it exactly what to count. A neural network is trained — we show it data and it figures out the patterns on its own. This shift from programming to training is the most important idea in modern AI.

3.2 What is a Neuron? 🧠

The Biological Inspiration

Your brain contains roughly 86 billion neurons. Each neuron:

Receives signals from other neurons through dendrites
Processes these signals in the cell body
Sends output through the axon if the total signal is strong enough
Connects to other neurons at synapses, with varying connection strengths

The key insight: some connections are stronger than others. When you learn something new, the connections between specific neurons get stronger or weaker. That's learning!

The Mathematical Model

We simplify this into a mathematical model:


inputs × weights → sum → activation → output

Let's use an analogy every Indian student will understand:

Tip

The Cricket Decision Analogy 🏏

Imagine a student deciding whether to go play cricket. Three factors matter:

| Factor | Value | Weight (Importance) |
|--------|-------|---------------------|
| Is homework done? | 1 (yes) or 0 (no) | 0.8 (very important!) |
| Is weather good? | 1 (yes) or 0 (no) | 0.3 (nice but not critical) |
| Are friends going? | 1 (yes) or 0 (no) | 0.6 (good motivation) |

The student multiplies each factor by its importance, adds them up, and makes a decision:

Score = (homework × 0.8) + (weather × 0.3) + (friends × 0.6) + bias

If the score is high → "Let's go play!" If low → "Better stay home."

That's exactly how an artificial neuron works!
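
To make the analogy concrete, here is a minimal sketch that scores a few situations. The three weights come from the table; the bias of -1.0 and the rule "go if the score is positive" are assumptions made for this illustration, since the text does not fix them.

```python
# Weights from the cricket table; BIAS and the "score > 0" rule are
# illustrative assumptions, not values given in the lesson.
WEIGHTS = {"homework": 0.8, "weather": 0.3, "friends": 0.6}
BIAS = -1.0  # hypothetical: the student needs a fairly strong reason to go


def cricket_score(homework, weather, friends):
    """Weighted sum of the three yes/no factors plus the bias."""
    return (homework * WEIGHTS["homework"]
            + weather * WEIGHTS["weather"]
            + friends * WEIGHTS["friends"]
            + BIAS)


def decide(homework, weather, friends):
    """Return True ("Let's go play!") when the score is positive."""
    return cricket_score(homework, weather, friends) > 0


situations = [
    (1, 1, 1),  # everything lines up
    (1, 0, 1),  # homework done, friends going, bad weather
    (0, 1, 1),  # homework NOT done
    (1, 1, 0),  # friends are not going
]
for hw, wx, fr in situations:
    verdict = "go play" if decide(hw, wx, fr) else "stay home"
    print(f"homework={hw} weather={wx} friends={fr} -> "
          f"score={cricket_score(hw, wx, fr):+.1f} -> {verdict}")
```

With these assumed numbers, unfinished homework is the one factor that keeps the student home even when the weather is good and friends are going, because it carries the largest weight.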

3.3 Building a Single Neuron 🔬

Let's look at the actual Neuron class from our code. This is a complete artificial neuron built from scratch with NumPy:

Initialization — Starting with Random Guesses

Python
class Neuron:
    """
    A single artificial neuron — the fundamental building block of neural networks.

    Mathematical formula:
        output = sigmoid(w1*x1 + w2*x2 + ... + bias)
    """

    def __init__(self, num_inputs, learning_rate=0.5):
        """
        Initialize the neuron with random weights and zero bias.
        """
        # Initialize weights randomly between -1 and 1
        # Each weight represents how much the neuron "trusts" each input
        self.weights = np.random.uniform(-1, 1, num_inputs)

        # The bias is like the neuron's default tendency
        # Positive bias = tends to fire even without input
        # Negative bias = needs strong input to fire
        self.bias = 0.0

        # Controls how big each adjustment step is during learning
        self.learning_rate = learning_rate

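Why random weights? Random weights give the neuron a starting point to adjust from. Think of it as a student making initial guesses before learning the correct answers. A single neuron could in fact still learn from all-zero weights, but the habit becomes essential once we build networks with many neurons: if every neuron in a layer started with identical weights, they would all receive identical updates and end up learning the same thing.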

The Sigmoid Activation Function

The sigmoid function is the neuron's "decision maker." It squashes any number into the range (0, 1):

\sigma(x) = \frac{1}{1 + e^{-x}}

Python
    def sigmoid(self, x):
        """
        Sigmoid activation: σ(x) = 1 / (1 + e^(-x))

        - Negative x → output close to 0 ("no")
        - Positive x → output close to 1 ("yes")
        - x = 0 → output = 0.5 ("uncertain")
        """
        x = np.clip(x, -500, 500)  # Prevent overflow
        return 1 / (1 + np.exp(-x))

    def sigmoid_derivative(self, sigmoid_output):
        """
        Derivative of sigmoid: σ'(x) = σ(x) * (1 - σ(x))

        The derivative tells us the SLOPE — how sensitive the output
        is to changes in input. Essential for backpropagation!
        """
        return sigmoid_output * (1 - sigmoid_output)

Here's the sigmoid curve visualised:


    1.0 ┤                    ●●●●●●●
        │                 ●●●
        │               ●●
    0.5 ┤             ●●          ← "uncertain"
        │           ●●
        │        ●●●
    0.0 ┤  ●●●●●●●
        └──┬───┬───┬───┬───┬───┬──
          -5   -3   -1   +1   +3   +5
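
The shape of the curve can also be checked numerically. The sketch below uses standalone copies of the two methods (plain functions rather than methods of Neuron, purely for convenience) and prints the output and the slope at a few inputs.

```python
import numpy as np


def sigmoid(x):
    """Standalone copy of the sigmoid used by the Neuron class."""
    x = np.clip(x, -500, 500)  # Prevent overflow
    return 1 / (1 + np.exp(-x))


def sigmoid_derivative(sigmoid_output):
    """Slope of the sigmoid, written in terms of the sigmoid's own output."""
    return sigmoid_output * (1 - sigmoid_output)


for x in [-5.0, -1.0, 0.0, 1.0, 5.0]:
    s = sigmoid(x)
    print(f"x = {x:+.1f}  sigmoid = {s:.4f}  slope = {sigmoid_derivative(s):.4f}")
```

The slope is largest at x = 0 (exactly 0.25) and shrinks towards zero at both ends, which matches the flat tails of the curve.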

Note

🤔 Think About It

Why do we need the sigmoid function? Why not just use the raw weighted sum? Because without it, a neuron would just be a linear function — and stacking linear functions gives you... another linear function. Sigmoid introduces non-linearity, which allows neural networks to learn complex, curved patterns. Without activation functions, neural networks would be no more powerful than simple linear regression!

The Forward Pass — Making a Prediction

Python
    def forward(self, inputs, verbose=False):
        """
        Forward pass: compute the neuron's output.

        Steps:
        1. Compute weighted sum: Σ(wi * xi) + bias
        2. Apply sigmoid activation
        3. Return the output
        """
        # Step 1: Weighted sum — combines all inputs into a single number
        # np.dot computes: w1*x1 + w2*x2 + ... + wn*xn
        weighted_sum = np.dot(inputs, self.weights) + self.bias

        # Step 2: Sigmoid squashes it to (0, 1)
        output = self.sigmoid(weighted_sum)

        return output, weighted_sum

The Training Step — Learning from Mistakes

This is where the magic happens. The neuron learns by adjusting its weights after each mistake:

Python
    def train_step(self, inputs, target):
        """
        One step of training: forward → compute error → update weights.

        Like a teacher correcting a student:
        1. Student answers (forward pass)
        2. Teacher checks (compute error)
        3. Teacher gives feedback (compute gradient)
        4. Student adjusts (update weights)
        """
        # Step 1: Forward pass — make a prediction
        output, weighted_sum = self.forward(inputs)

        # Step 2: Compute error — how wrong are we?
        error = target - output

        # Step 3: Compute gradient — which direction to adjust
        sigmoid_deriv = self.sigmoid_derivative(output)
        gradient = error * sigmoid_deriv

        # Step 4: Update weights — nudge them to reduce error
        # weight_new = weight_old + learning_rate × gradient × input
        self.weights += self.learning_rate * gradient * inputs
        self.bias += self.learning_rate * gradient

        return error

Demo: Teaching a Neuron Logic Gates

Let's see this in action! We can teach a single neuron to learn the AND gate — a simple logic operation where the output is 1 only when BOTH inputs are 1:

Input 1	Input 2	AND Output
0	0	0
0	1	0
1	0	0
1	1	1

Python
# Create a neuron with 2 inputs
neuron = Neuron(num_inputs=2, learning_rate=0.5)

# AND gate training data
inputs_data = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
targets = np.array([0, 0, 0, 1], dtype=float)

# Train for 5000 epochs (showing the data 5000 times)
for epoch in range(5000):
    for i in range(len(inputs_data)):
        neuron.train_step(inputs_data[i], targets[i])

After 5000 epochs of training, the neuron learns to correctly implement the AND gate! The weights converge to values that make the neuron output ≈0 for all inputs except [1, 1], where it outputs ≈1.
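
To try this without the full script, the following is a condensed, self-contained sketch of the same experiment. The class is a shortened copy of the Neuron shown earlier (the verbose flag and the second return value of forward are dropped), and the fixed random seed is an assumption added so the run is repeatable.

```python
import numpy as np

np.random.seed(0)  # assumption: fixed seed so every run starts from the same weights


class Neuron:
    """Condensed copy of the Neuron class from this section."""

    def __init__(self, num_inputs, learning_rate=0.5):
        self.weights = np.random.uniform(-1, 1, num_inputs)
        self.bias = 0.0
        self.learning_rate = learning_rate

    def forward(self, inputs):
        weighted_sum = np.dot(inputs, self.weights) + self.bias
        return 1 / (1 + np.exp(-weighted_sum))

    def train_step(self, inputs, target):
        output = self.forward(inputs)
        error = target - output
        gradient = error * output * (1 - output)  # error times sigmoid slope
        self.weights += self.learning_rate * gradient * inputs
        self.bias += self.learning_rate * gradient
        return error


# AND gate training data
inputs_data = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
targets = np.array([0, 0, 0, 1], dtype=float)

neuron = Neuron(num_inputs=2, learning_rate=0.5)
for epoch in range(5000):
    for x, t in zip(inputs_data, targets):
        neuron.train_step(x, t)

# Check what the trained neuron predicts for each input pair
predictions = [float(neuron.forward(x)) for x in inputs_data]
for x, p in zip(inputs_data, predictions):
    print(f"{x} -> {p:.3f} (rounded: {int(round(p))})")
```

Rounding each output to 0 or 1 should reproduce the AND column of the truth table.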

Tip

Key Insight: A single neuron can learn any linearly separable problem. AND and OR are linearly separable. XOR is NOT — that's why we need networks of neurons (coming up next!).
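
One way to get a feel for "linearly separable" is a brute-force search. A sigmoid neuron outputs more than 0.5 exactly when its weighted sum is positive, so the sketch below tries a coarse grid of weights and biases and asks whether any of them fires on exactly the right inputs. The grid range and step are arbitrary choices for this illustration; finding no solution on a grid is only a hint, although for XOR it can be proved that no solution exists at all.

```python
import itertools
import numpy as np

INPUTS = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
GATES = {
    "AND": np.array([0, 0, 0, 1]),
    "OR": np.array([0, 1, 1, 1]),
    "XOR": np.array([0, 1, 1, 0]),
}


def find_separating_line(targets, grid=np.linspace(-2, 2, 17)):
    """Try every (w1, w2, bias) on the grid; return the first that fits, else None."""
    for w1, w2, b in itertools.product(grid, repeat=3):
        # The neuron "fires" (output > 0.5) when the weighted sum is positive
        fires = (INPUTS @ np.array([w1, w2]) + b > 0).astype(int)
        if np.array_equal(fires, targets):
            return (float(w1), float(w2), float(b))
    return None


for name, targets in GATES.items():
    solution = find_separating_line(targets)
    if solution is None:
        print(f"{name}: no single neuron on this grid reproduces it")
    else:
        print(f"{name}: works with (w1, w2, bias) = {solution}")
```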

3.4 Activation Functions — The Non-Linearity Secret 🔑

Why Non-Linearity Matters

Without activation functions, a neural network is just a fancy way of doing linear algebra. No matter how many layers you stack, the result is always a linear function:

f(x) = W_2 \cdot (W_1 \cdot x) = (W_2 \cdot W_1) \cdot x = W_{combined} \cdot x

Multiple linear layers collapse into a single linear layer! That's useless — we can't learn curves, boundaries, or complex patterns.

Activation functions like sigmoid break this linearity, allowing networks to learn curved decision boundaries and other non-linear patterns.
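
The collapse is easy to verify numerically. This sketch uses small random matrices (the sizes and the seed are arbitrary assumptions) and compares two stacked linear layers with the single combined layer, then repeats the comparison with a sigmoid placed between the layers.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # first "layer": 3 inputs -> 4 units
W2 = rng.normal(size=(2, 4))   # second "layer": 4 units -> 2 outputs
x = rng.normal(size=3)

# Two linear layers in a row versus one combined linear layer
two_layers = W2 @ (W1 @ x)
one_layer = (W2 @ W1) @ x
print("Linear layers collapse:", np.allclose(two_layers, one_layer))


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


# With a non-linearity in between, the combined matrix no longer reproduces the result
with_activation = W2 @ sigmoid(W1 @ x)
print("Still equal with sigmoid in between:", np.allclose(with_activation, one_layer))
```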

Two Key Activation Functions

Sigmoid — used in our code:

\sigma(x) = \frac{1}{1 + e^{-x}} \qquad \text{Output range: (0, 1)}

Sigmoid derivative (crucial for backpropagation):

\sigma'(x) = \sigma(x) \cdot (1 - \sigma(x))

ReLU (Rectified Linear Unit) — used in modern deep learning:

f(x) = \max(0, x) \qquad \text{Output range: [0, ∞)}

Note

🤔 Think About It

Modern networks almost always use ReLU instead of sigmoid in their hidden layers. Why? Sigmoid has a problem called vanishing gradients: for inputs that are far from zero in either direction (large positive or large negative), the derivative is nearly zero, so learning slows to a crawl. ReLU largely avoids this, because its derivative is exactly 1 for every positive input (and 0 for negative ones). But for our educational examples, sigmoid works perfectly and is easier to understand!
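
A few numbers make the vanishing-gradient point visible. The helper names below are hypothetical; they simply evaluate the two derivatives at increasingly large inputs.

```python
import numpy as np


def sigmoid_slope(x):
    """Derivative of the sigmoid at input x."""
    s = 1 / (1 + np.exp(-x))
    return s * (1 - s)


def relu_slope(x):
    """Derivative of ReLU at input x (taken as 0 at x = 0)."""
    return 1.0 if x > 0 else 0.0


for x in [0.5, 2.0, 5.0, 10.0]:
    print(f"x = {x:4.1f}  sigmoid slope = {sigmoid_slope(x):.6f}  "
          f"ReLU slope = {relu_slope(x):.0f}")
```

The sigmoid slope never exceeds 0.25 and drops by orders of magnitude as the input grows, while the ReLU slope stays at 1 for every positive input.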

3.5 Building a Neural Network 🏗️

A single neuron can only learn simple patterns. To learn complex patterns, we connect many neurons in layers. Let's look at the NeuralNetwork class from step2_network.py:

Network Architecture

[Diagram: see interactive version]

The Network Class

Python
class NeuralNetwork:
    """
    A simple feedforward neural network with one hidden layer.

    Architecture:
        Input (n) → Hidden (16, sigmoid) → Output (m, softmax)

    WHY non-linearity matters:
    - Without activation functions, stacking layers is pointless
    - Multiple linear layers collapse into a single linear layer
    - Non-linearity allows the network to learn CURVES, not just lines
    """

    def __init__(self, input_size, hidden_size=16, output_size=4):
        """
        Initialize with Xavier-scaled random weights.

        WHY Xavier initialization?
        - Random weights too large → outputs explode
        - Random weights too small → outputs vanish
        - Xavier scales by sqrt(2 / (n_in + n_out)) to keep values reasonable
        """
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size

        # Weights: Input → Hidden (every input connects to every hidden neuron)
        scale_1 = np.sqrt(2.0 / (input_size + hidden_size))
        self.weights_input_hidden = np.random.randn(input_size, hidden_size) * scale_1
        self.biases_hidden = np.zeros(hidden_size)

        # Weights: Hidden → Output
        scale_2 = np.sqrt(2.0 / (hidden_size + output_size))
        self.weights_hidden_output = np.random.randn(hidden_size, output_size) * scale_2
        self.biases_output = np.zeros(output_size)

The Forward Pass

Python
    def forward(self, inputs, verbose=False):
        """
        Forward pass: push input through all layers to get output.

        Flow:
            Input → (weights × input + bias) → sigmoid → Hidden
            Hidden → (weights × hidden + bias) → softmax → Output
        """
        # LAYER 1: Input → Hidden
        # Matrix multiplication computes ALL weighted sums at once!
        hidden_raw = np.dot(inputs, self.weights_input_hidden) + self.biases_hidden
        hidden_activated = self.sigmoid(hidden_raw)  # Apply non-linearity

        # LAYER 2: Hidden → Output
        output_raw = np.dot(hidden_activated, self.weights_hidden_output) + self.biases_output
        output_probs = self.softmax(output_raw)  # Convert to probabilities

        return output_probs, hidden_raw, hidden_activated, output_raw

Softmax — Converting Scores to Probabilities

Python
    def softmax(self, x):
        """
        Softmax: converts raw scores to probabilities that sum to 1.

        softmax(xi) = e^(xi) / Σ(e^(xj))

        Example: Raw scores [2.0, 1.0, 0.5, 0.1]
                 → Probabilities [0.57, 0.21, 0.13, 0.09] (sum = 1.0)
        """
        x_shifted = x - np.max(x)  # Prevent overflow (np.exp(1000) would overflow to inf)
        exp_x = np.exp(x_shifted)
        return exp_x / np.sum(exp_x)
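
The max-subtraction trick can be seen at work in a short sketch. It uses a standalone copy of the method and two made-up score vectors: one small, and one shifted up by about 1000, which would overflow without the shift.

```python
import numpy as np


def softmax(x):
    """Standalone copy of the softmax method above."""
    x_shifted = x - np.max(x)  # largest score becomes 0, so exp() stays small
    exp_x = np.exp(x_shifted)
    return exp_x / np.sum(exp_x)


small = softmax(np.array([1.0, 2.0, 3.0]))
huge = softmax(np.array([1000.0, 1001.0, 1002.0]))

print("small scores:", np.round(small, 3), "sum =", small.sum())
print("huge scores: ", np.round(huge, 3), "sum =", huge.sum())
```

Both calls give the same probabilities, because softmax depends only on the differences between scores, not on their absolute size.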

Tip

The Layers Analogy 🏫

Think of the layers like a school processing system:

Input layer = raw information (exam answers on paper)
Hidden layer = teachers detecting patterns ("this student knows algebra but struggles with geometry")
Output layer = final decision (grade: A, B, C, or D)

The first layer sees raw data. Middle layers find patterns. The output layer makes the final call.

3.6 The Magic of Backpropagation ✨

This is THE most important section in this chapter. Backpropagation is the algorithm that makes neural networks learn. Without it, neural networks would be useless.

The Big Picture

[Diagram: see interactive version]

Tip

The Teacher Grading Papers Analogy 📝

Imagine a teacher (the loss function) grading a student's (the network's) exam:

1. Forward Pass: The student answers the questions
2. Compute Loss: The teacher marks the answers: "You got 40% wrong"
3. Backward Pass (Chain Rule): The teacher traces back: "You got question 5 wrong BECAUSE you don't understand fractions, which is BECAUSE you didn't learn multiplication tables"
4. Update Weights: The teacher tells the student: "Practice multiplication tables more" (strengthen those connections)

Repeat this process thousands of times, and the student (network) masters the subject!

The Loss Function — How Wrong Are We?

We use cross-entropy loss, which is perfect for classification tasks:

L = -\sum_{i} y_i \cdot \log(\hat{y}_i)

Where y_i is the true label and \hat{y}_i is the predicted probability.

Python
    def compute_loss(self, predicted, target):
        """
        Cross-entropy loss: heavily penalizes confident WRONG predictions.

        - Predict right character with high probability → low loss
        - Predict wrong character → high loss
        """
        predicted_clipped = np.clip(predicted, 1e-15, 1.0)  # Prevent log(0)
        loss = -np.sum(target * np.log(predicted_clipped))
        return loss
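
A small worked example shows how sharply the loss reacts to confidence. The three prediction vectors are invented for illustration; with a one-hot target, the loss reduces to minus the log of the probability given to the correct class.

```python
import numpy as np


def cross_entropy(predicted, target):
    """Standalone copy of compute_loss above."""
    predicted_clipped = np.clip(predicted, 1e-15, 1.0)  # Prevent log(0)
    return -np.sum(target * np.log(predicted_clipped))


target = np.array([0.0, 1.0, 0.0])  # the true class is index 1

confident_right = np.array([0.05, 0.90, 0.05])
unsure = np.array([0.30, 0.40, 0.30])
confident_wrong = np.array([0.90, 0.05, 0.05])

for name, predicted in [("confident and right", confident_right),
                        ("unsure", unsure),
                        ("confident and wrong", confident_wrong)]:
    print(f"{name:20s} loss = {cross_entropy(predicted, target):.3f}")
```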

The Backward Pass — Tracing the Error

This is the backward method from CharLevelNetwork in step3_train.py. Let's go through it line by line:

Python
    def backward(self, target):
        """
        Backward pass: compute gradients using the chain rule.

        THIS IS BACKPROPAGATION — the core of neural network learning!

        For output layer:
            ∂Loss/∂W2 = a1ᵀ · (predicted - target)
            ∂Loss/∂b2 = predicted - target

        For hidden layer:
            ∂Loss/∂W1 = xᵀ · (δ_hidden)
            ∂Loss/∂b1 = δ_hidden
            where δ_hidden = (predicted - target) · W2ᵀ × sigmoid'(a1)
        """
        # OUTPUT LAYER GRADIENT
        # The derivative of softmax + cross-entropy simplifies beautifully!
        delta_output = self.a2 - target  # Shape: (vocab_size,)

        # Gradient for W2: how much each hidden→output weight contributed
        dW2 = np.outer(self.a1, delta_output)  # Shape: (hidden, vocab)
        db2 = delta_output

        # HIDDEN LAYER GRADIENT
        # Step 1: Propagate error back through W2
        delta_hidden = np.dot(delta_output, self.W2.T)  # Shape: (hidden,)

        # Step 2: Multiply by sigmoid derivative (chain rule!)
        delta_hidden *= self.sigmoid_derivative(self.a1)

        # Gradient for W1
        dW1 = np.outer(self.x, delta_hidden)  # Shape: (vocab, hidden)
        db1 = delta_hidden

        # UPDATE WEIGHTS (Gradient Descent)
        # We move OPPOSITE to the gradient (gradient points toward MORE error)
        self.W2 -= self.learning_rate * dW2
        self.b2 -= self.learning_rate * db2
        self.W1 -= self.learning_rate * dW1
        self.b1 -= self.learning_rate * db1

Let's break down the math:

Step 1: The output error is simply predicted minus target:

\delta_{\text{output}} = \hat{y} - y

Step 2: The gradient for W₂ tells us how each weight contributed:

\frac{\partial L}{\partial W_2} = a_1^T \cdot \delta_{\text{output}}

Step 3: We propagate the error backwards through the weights:

\delta_{\text{hidden}} = (\delta_{\text{output}} \cdot W_2^T) \odot \sigma'(a_1)

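The \odot symbol means element-wise multiplication. We multiply by the sigmoid derivative because the sigmoid "squashed" values during the forward pass, and we need to account for that squashing. Here a_1 is the hidden layer's output after the sigmoid, so the derivative is computed as a_1 \odot (1 - a_1), which is exactly what sigmoid_derivative(self.a1) does in the code.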

Step 4: Update every weight by moving opposite to the gradient:

W_{\text{new}} = W_{\text{old}} - \eta \cdot \frac{\partial L}{\partial W}

Where \eta is the learning rate.
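
A standard way to gain confidence in these formulas is a gradient check: compute the gradients with the formulas from Steps 1 to 3, then compare them with slopes measured directly by nudging each weight a tiny amount. The sketch below does this on a tiny network; the layer sizes, the seed, and the helper names loss_fn and numeric_grad are assumptions for illustration and are not taken from step3_train.py.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, hidden = 5, 4                      # tiny, hypothetical sizes
W1 = rng.normal(scale=0.5, size=(vocab, hidden))
b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.5, size=(hidden, vocab))
b2 = np.zeros(vocab)
x = np.eye(vocab)[2]                      # one-hot input character
y = np.eye(vocab)[4]                      # one-hot target character


def loss_fn():
    """Forward pass (sigmoid hidden layer, softmax output) plus cross-entropy."""
    a1 = 1 / (1 + np.exp(-(x @ W1 + b1)))
    z2 = a1 @ W2 + b2
    a2 = np.exp(z2 - z2.max())
    a2 /= a2.sum()
    return -np.sum(y * np.log(a2)), a1, a2


# Analytic gradients from the backpropagation formulas
_, a1, a2 = loss_fn()
delta_output = a2 - y
dW2 = np.outer(a1, delta_output)
delta_hidden = (delta_output @ W2.T) * a1 * (1 - a1)
dW1 = np.outer(x, delta_hidden)


def numeric_grad(W, eps=1e-6):
    """Central finite differences: nudge one weight at a time and re-measure the loss."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        original = W[idx]
        W[idx] = original + eps
        plus = loss_fn()[0]
        W[idx] = original - eps
        minus = loss_fn()[0]
        W[idx] = original             # restore the weight
        grad[idx] = (plus - minus) / (2 * eps)
    return grad


num_dW1 = numeric_grad(W1)
num_dW2 = numeric_grad(W2)
print("max |dW1 difference|:", np.abs(dW1 - num_dW1).max())
print("max |dW2 difference|:", np.abs(dW2 - num_dW2).max())
```

If the two sets of gradients agree to many decimal places, the chain-rule derivation and its implementation are consistent with each other.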

Warning

Common Pitfall: Learning Rate

Too high (e.g., 5.0): The network overshoots, bouncing wildly. Like a student who overcorrects every mistake.
Too low (e.g., 0.001): The network learns agonizingly slowly. Like a student who barely adjusts after feedback.
Just right (e.g., 0.5): Steady improvement. The sweet spot!
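
The three regimes can be reproduced on a toy problem. The sketch below runs gradient descent on the one-parameter loss L(w) = 0.5 · w², chosen only because its gradient is simply w; the function name descend and the number of steps are assumptions. The exact thresholds differ from problem to problem, so treat the specific rates as illustrative.

```python
def descend(learning_rate, steps=20, start=1.0):
    """Gradient descent on the toy loss L(w) = 0.5 * w**2, whose gradient is w."""
    w = start
    for _ in range(steps):
        w -= learning_rate * w   # W_new = W_old - learning_rate * gradient
    return w


for lr in [5.0, 0.001, 0.5]:
    print(f"learning rate {lr:>5}: w after 20 steps = {descend(lr):.6g}")
```

The best value is w = 0. With the high rate, w flips sign and grows at every step; with the tiny rate it has barely moved after 20 steps; with the moderate rate it is already very close to zero.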

3.7 Training Loop Explained 🔄

The training loop in step3_train.py ties everything together. Here's the core of the train_network function:

Python
for epoch in range(epochs):  # epochs = 1500
    total_loss = 0.0

    # Shuffle training data each epoch
    # WHY: Prevents learning the ORDER of examples instead of patterns
    shuffle_idx = np.random.permutation(num_samples)

    for i in shuffle_idx:
        # 1. Forward pass: predict next character
        predicted = net.forward(inputs[i])

        # 2. Compute loss: how wrong is the prediction?
        loss = net.compute_loss(predicted, targets[i])
        total_loss += loss

        # 3. Backward pass: compute gradients and update weights
        net.backward(targets[i])

    avg_loss = total_loss / num_samples

Key Concepts in the Training Loop:

Epoch: One complete pass through ALL training examples. If you have 300 training pairs and train for 1500 epochs, the network sees each example 1500 times!

Shuffling: We randomise the order each epoch so the network doesn't memorise the sequence. Just like how a good teacher mixes up practice problems.

One-Hot Encoding: Each character is represented as a binary vector. If our vocabulary is ['a', 'b', 'c', 'd'], then:

'a' = [1, 0, 0, 0]
'b' = [0, 1, 0, 0]
'c' = [0, 0, 1, 0]
'd' = [0, 0, 0, 1]
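
As a minimal sketch, one-hot encoding and the construction of (current character, next character) training pairs might look like this. The helper name one_hot and the sample string are hypothetical; the real script may organise this differently.

```python
import numpy as np

vocabulary = ['a', 'b', 'c', 'd']
char_to_index = {ch: i for i, ch in enumerate(vocabulary)}


def one_hot(ch):
    """Return a vector of zeros with a single 1 at the character's index."""
    vector = np.zeros(len(vocabulary))
    vector[char_to_index[ch]] = 1.0
    return vector


text = "badcab"  # hypothetical training text
# Each character is an input; the character that follows it is the target
inputs = np.array([one_hot(c) for c in text[:-1]])
targets = np.array([one_hot(c) for c in text[1:]])

print("'c' ->", one_hot('c'))
print("inputs shape: ", inputs.shape)
print("targets shape:", targets.shape)
```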

The Loss Going Down: When the network starts, its weights are random, so the loss is high (around 3.0). As it trains, the loss steadily decreases — the network is learning!


Epoch     0 │ Loss:   3.2814  │ Still learning...
Epoch   100 │ Loss:   2.7651  │ Still learning...
Epoch   500 │ Loss:   1.8203  │ Getting better...
Epoch  1000 │ Loss:   1.2145  │ Almost there!
Epoch  1499 │ Loss:   0.8932  │ 🎉 Mastered!
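
Why does the loss start near 3? A network that has learned nothing spreads its probability roughly evenly across all characters, and for a perfectly even guess over V characters the cross-entropy is ln(V). The vocabulary sizes below are hypothetical; the actual size depends on the training text.

```python
import math


def uniform_guess_loss(vocab_size):
    """Cross-entropy when every character gets probability 1 / vocab_size."""
    return -math.log(1.0 / vocab_size)


for vocab_size in [4, 20, 27, 40]:
    print(f"vocabulary of {vocab_size:2d} characters -> "
          f"starting loss of about {uniform_guess_loss(vocab_size):.2f}")
```

A starting loss a little above 3 is therefore what we would expect for a vocabulary of a few dozen characters, and anything well below that means the network is doing better than blind guessing.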

3.8 Seeing the Learning 📊

The step4_visualize.py script creates visualisations of the training process. The most important is the loss curve:


Loss
3.5 ┤██
    │███
    │████
2.5 ┤█████
    │██████
    │████████
1.5 ┤██████████
    │█████████████
    │████████████████
0.5 ┤█████████████████████
    └─────────────────────
     0    500   1000  1500
           Training Epoch

What does this tell us?

High loss at the start → the network is making random guesses
Rapidly decreasing loss → the network is learning the easiest patterns first
Slowly decreasing loss → the network is fine-tuning, learning subtler patterns
Flat loss at the end → the network has learned as much as it can from this data

Before vs After Comparison


🔴 BEFORE training (random weights):
   "xkq.pzmw bfvnj tglydc"  ← complete gibberish!

🟢 AFTER training (1500 epochs):
   "the network learns from data. the brain has"  ← recognisable English!

The network went from outputting random characters to generating text that resembles the training data. It learned which characters follow which — but using a neural network instead of simple counting!

Important

The Key Difference from Chapter 2: Our bigram model counted exact patterns: "after 't', 'h' appeared 42 times." The neural network learns distributed patterns: it represents characters as numbers in a hidden layer and discovers abstract relationships between them. This is why neural networks can generalise better.

3.9 Discussion: What Can Neural Networks Learn? 🌟

The Universal Approximation Theorem

Here's one of the most beautiful results in mathematics:

With enough hidden neurons, a neural network with a single hidden layer can approximate ANY continuous function to arbitrary accuracy.

In plain language: for any continuous pattern in the data, some network with enough neurons can represent it as closely as we like. The theorem guarantees that such a network exists; it does not promise that training will find it, or say how many neurons and how much data that takes.

Deep vs Shallow Networks

Our network has one hidden layer (a "shallow" network). Modern language models stack many layers (a "deep" network); GPT-3, for example, has 96. Why does depth matter?

Think of it like this:

Layer 1 might learn: "these character pairs are common"
Layer 2 might learn: "these sequences of common pairs form syllables"
Layer 3 might learn: "these syllables form words"
Layer 4 might learn: "these words form phrases"

Each layer builds more abstract representations on top of the previous layer. Depth allows the network to learn hierarchical patterns, from simple to complex.

Note

🤔 Think About It

Is this how the human brain works too? In some ways, yes! The visual cortex processes information in layers: the first layer detects edges, the next detects shapes, then objects, then faces. Each layer builds on the one before it. The analogy isn't perfect, but the principle of hierarchical feature extraction is shared.

3.10 Key Concepts Summary

| Concept | Definition |
|---|---|
| Neuron | The fundamental unit of a neural network. Takes inputs, multiplies by weights, adds a bias, and applies an activation function. |
| Weight | A number that controls how much influence an input has on the neuron's output. Adjusted during training. |
| Bias | A number added to the weighted sum before activation. Shifts the activation threshold. |
| Activation Function | A non-linear function (like sigmoid or ReLU) that allows the network to learn complex patterns. |
| Forward Pass | The process of feeding input through the network to get a prediction. |
| Backpropagation | The algorithm that computes how much each weight contributed to the error, enabling learning. |
| Loss Function | Measures how wrong the network's predictions are. We try to minimise this. |
| Gradient | The direction and magnitude of the steepest increase in loss. We move opposite to it. |
| Learning Rate | Controls the step size of weight updates. Too high = unstable, too low = slow. |
| Epoch | One complete pass through all training data. |
| One-Hot Encoding | Representing categories (characters) as binary vectors. `'a'` = `[1, 0, 0, ...]` |
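
As a quick illustration of the last entry, here is a minimal one-hot encoding sketch. The sample word and the helper name `one_hot` are made up for this example and are not taken from the chapter's scripts.

```python
import numpy as np

text = "namaste"                      # hypothetical sample text
vocab = sorted(set(text))             # unique characters, in a fixed order
char_to_index = {ch: i for i, ch in enumerate(vocab)}


def one_hot(ch):
    """Return a vector of zeros with a single 1 at the character's index."""
    vector = np.zeros(len(vocab))
    vector[char_to_index[ch]] = 1.0
    return vector


print(vocab)
print("'a' ->", one_hot("a"))
print("'m' ->", one_hot("m"))
```

Each vector is as long as the vocabulary and contains exactly one 1, so the network sees every character as an equally distinct category.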

📝 3.11 Exercises 📝

Hidden neurons experiment: Change the number of hidden neurons in CharLevelNetwork from 64 to 16, then to 128. How does it affect:

- Training speed?

- Final loss value?

- Quality of generated text?

Learning rate experiment: Try these learning rates: 0.01, 0.5, and 5.0. What happens with each?

- 0.01: Does it converge? How long does it take?

- 0.5: This is the default. Does it work well?

- 5.0: What goes wrong? (Hint: look at the loss curve)

Different training text: Replace SAMPLE_TEXT with a Hindi paragraph or a Bollywood dialogue. Run training and generate text. Does the network capture Hindi character patterns?

ReLU activation: Replace the sigmoid function with ReLU:

Python
   def relu(self, x):
       return np.maximum(0, x)

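Does it train faster? Does the generated text quality change? If your backward pass multiplies by the sigmoid derivative, replace that too with ReLU's derivative (1 where the hidden value is positive, 0 otherwise); otherwise the gradients will not match the new activation.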

Overfitting experiment: Train for 50,000 epochs instead of 1,500. Does the loss keep going down? At some point, does the network start memorising the training text perfectly? (This is called overfitting — the network learns the data by heart instead of learning general patterns.)
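
Before running the learning rate experiment above, it may help to watch the same three rates on a toy problem where the maths is easy to follow. The sketch below runs gradient descent on the one-parameter loss `w**2`. This stand-in loss and the helper name `descend` are assumptions for illustration; the real network's loss surface is far more complicated, so the exact rates that work or fail there will differ.

```python
def descend(learning_rate, steps=20, start=1.0):
    """Run gradient descent on loss(w) = w**2 and return the loss after each step."""
    w = start
    losses = []
    for _ in range(steps):
        gradient = 2 * w               # d(w**2)/dw
        w -= learning_rate * gradient  # step opposite to the gradient
        losses.append(w ** 2)
    return losses


for lr in (0.01, 0.5, 5.0):
    losses = descend(lr)
    print(f"lr={lr:<4} first loss={losses[0]:.4g}  last loss={losses[-1]:.4g}")
```

On this toy loss the small rate creeps downhill, the middle rate lands on the minimum, and the large rate overshoots further on every step so the loss grows instead of shrinking. Compare that picture with the loss curves you get from the real network.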

💭 3.12 Discussion Questions 💭

Brain vs network: Our neural network has ~5,000 parameters. The human brain has ~86 billion neurons with ~100 trillion connections. What can a brain do that our tiny network can't? What fundamental capabilities are we missing?

Can networks be creative? When our trained network generates text, is it being "creative"? It's producing sequences it has never seen before, but based entirely on patterns from training data. Is human creativity any different?

The role of data: Our network was trained on a few hundred characters. ChatGPT was trained on billions of pages. How does the quantity and quality of training data affect what a network can learn?

Ethical questions: If a neural network learns to write like a famous Indian poet by training on their work, who owns the generated text? The programmer? The network? The poet?

The future of education: Could neural networks one day replace teachers? What can a human teacher do that an AI tutor cannot (or should not)?

📝 Chapter Summary

From counting to learning: Bigram models count explicitly. Neural networks learn patterns through training — a fundamental paradigm shift.
Neurons: The building block of neural networks. Each neuron computes a weighted sum of inputs, adds a bias, and applies an activation function.
Activation functions: Sigmoid squashes values to (0, 1). ReLU keeps positive values and zeros out negatives. Both introduce the non-linearity that makes neural networks powerful.
Neural networks: Multiple neurons connected in layers. Input layer → hidden layer(s) → output layer. Each layer transforms the data into more abstract representations.
Backpropagation: THE algorithm of deep learning. It uses the chain rule to trace errors backward through the network, computing how much each weight contributed to the error, then adjusting weights to reduce it.
Training: We showed the network thousands of examples, and it gradually reduced its loss from ~3.0 (random guessing) to ~0.9 (meaningful predictions).
The result: A network that started generating gibberish and learned to produce recognisable English text — all from scratch, with no frameworks!
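
Why does random guessing give a loss of about 3.0? Assuming the loss is cross-entropy measured with the natural logarithm (the usual partner of a softmax output), a network that spreads its probability evenly over `V` characters scores `ln(V)`. The vocabulary sizes below are hypothetical; check the size of your own character set.

```python
import math


def uniform_cross_entropy(vocab_size):
    """Cross-entropy loss when every character gets probability 1/vocab_size."""
    probability_of_correct_char = 1.0 / vocab_size
    return -math.log(probability_of_correct_char)


for vocab_size in (2, 20, 27):
    loss = uniform_cross_entropy(vocab_size)
    print(f"{vocab_size:>2} characters -> loss {loss:.3f}")
```

A vocabulary of around 20 characters gives a starting loss close to 3.0. Any loss below `ln(V)` means the network is doing better than a blind guess.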

⏭️ What's Next?

Our neural network works, but it has a major limitation: it looks at only one character at a time (just like the bigram model!). It doesn't understand sequences. It can't grasp that "the" is a word, or that "machine learning" is a phrase.

In Chapter 4: Embeddings and Attention, you'll learn:

How to represent characters and words as vectors (embeddings) in a continuous space
The revolutionary attention mechanism — "which parts of the input should I focus on?"
How attention allows models to understand relationships between distant words

This is where we start building toward the Transformer — the architecture behind ChatGPT and Gemini. The foundation you built in this chapter is exactly what you need. Let's go! 🚀

"It always seems impossible until it's done." — Nelson Mandela

You just built a neural network from scratch. Nothing is impossible now. 💪

Complete Source Code - Chapter 3

Below are the complete, runnable source files for this chapter. Every line is included.

Complete Code: step1_neuron.py

Python
"""
================================================================================
🧠 LEVEL 2 — STEP 1: A SINGLE NEURON
================================================================================
Build a single neuron from scratch using only Python + NumPy.
We'll see how a neuron learns to be an AND gate and an OR gate!

KEY CONCEPTS:
- A neuron takes inputs, multiplies by weights, adds a bias, then activates
- Sigmoid squashes any number into range (0, 1)
- Learning = adjusting weights to reduce error

NO DEEP LEARNING FRAMEWORKS — just NumPy and math!
================================================================================
"""

# ============================================================================
# IMPORTS
# ============================================================================
import numpy as np  # NumPy gives us fast math operations on arrays
import os           # For file path operations
import sys          # For system-level operations

# ============================================================================
# ANSI COLOR CODES
# ============================================================================
# WHY: We use ANSI escape codes to make terminal output colorful and readable.
# These codes tell the terminal to change text color/style.
# Format: \033[<code>m  where <code> is the color number.

class Colors:
    """ANSI color codes for beautiful terminal output."""
    RESET   = "\033[0m"      # Reset to default color
    BOLD    = "\033[1m"      # Bold text
    DIM     = "\033[2m"      # Dimmed text
    
    # Regular colors
    RED     = "\033[31m"     # For errors or wrong outputs
    GREEN   = "\033[32m"     # For correct outputs / success
    YELLOW  = "\033[33m"     # For warnings / highlights
    BLUE    = "\033[34m"     # For information
    MAGENTA = "\033[35m"     # For special highlights
    CYAN    = "\033[36m"     # For data values
    WHITE   = "\033[37m"     # For regular text
    
    # Bright colors
    BRIGHT_RED     = "\033[91m"
    BRIGHT_GREEN   = "\033[92m"
    BRIGHT_YELLOW  = "\033[93m"
    BRIGHT_BLUE    = "\033[94m"
    BRIGHT_MAGENTA = "\033[95m"
    BRIGHT_CYAN    = "\033[96m"


# ============================================================================
# HELPER FUNCTIONS
# ============================================================================

def print_header():
    """Print a beautiful header for this script."""
    print(f"\n{Colors.BRIGHT_CYAN}{'='*70}")
    print(f"  🧠  LEVEL 2 — STEP 1: A SINGLE NEURON FROM SCRATCH")
    print(f"{'='*70}{Colors.RESET}")
    print(f"{Colors.DIM}  Building the smallest unit of a neural network...{Colors.RESET}")
    print(f"{Colors.DIM}  Using only Python + NumPy. No frameworks!{Colors.RESET}\n")


def print_footer():
    """Print a beautiful footer for this script."""
    print(f"\n{Colors.BRIGHT_CYAN}{'='*70}")
    print(f"  ✅  STEP 1 COMPLETE! You now understand a single neuron!")
    print(f"  📝  Next: step2_network.py — Build a full network!")
    print(f"{'='*70}{Colors.RESET}\n")


def print_section(title, emoji="📌"):
    """Print a section header."""
    print(f"\n{Colors.BRIGHT_YELLOW}{'─'*70}")
    print(f"  {emoji}  {title}")
    print(f"{'─'*70}{Colors.RESET}\n")


# ============================================================================
# THE NEURON CLASS
# ============================================================================

class Neuron:
    """
    A single artificial neuron — the fundamental building block of neural networks.
    
    Think of it like a student making a decision:
    - It receives multiple INPUTS (pieces of information)
    - Each input has a WEIGHT (how much the student trusts that info)
    - It adds everything up (WEIGHTED SUM)
    - It passes through an ACTIVATION function (the decision threshold)
    - It produces an OUTPUT (the decision)
    
    Mathematical formula:
        output = sigmoid(w1*x1 + w2*x2 + ... + bias)
    """
    
    def __init__(self, num_inputs, learning_rate=0.5):
        """
        Initialize the neuron with random weights and zero bias.
        
        WHY random weights?
        - If all weights start at zero, the neuron has no starting "opinion"
        - Random weights give it a random starting point to learn from
        - Think of it as the student having some initial guesses
        
        WHY learning_rate?
        - Controls how big each adjustment step is
        - Too high = overshoots (student changes mind too drastically)
        - Too low = learns too slowly (student barely adjusts)
        - 0.5 is a reasonable starting point for simple problems
        """
        # Initialize weights randomly between -1 and 1
        # WHY: Each weight represents how much the neuron "trusts" each input
        # We use small random values so the neuron starts without strong opinions
        self.weights = np.random.uniform(-1, 1, num_inputs)
        
        # Initialize bias to zero
        # WHY: The bias is like the neuron's default tendency
        # A positive bias means the neuron tends to fire even without input
        # A negative bias means the neuron needs strong input to fire
        self.bias = 0.0
        
        # Store learning rate for weight updates
        # WHY: This controls the "step size" when adjusting weights
        self.learning_rate = learning_rate
    
    def sigmoid(self, x):
        """
        The Sigmoid activation function: σ(x) = 1 / (1 + e^(-x))
        
        WHY sigmoid?
        - Squashes ANY number into the range (0, 1)
        - This is perfect for "yes/no" decisions (probability)
        - It's smooth and differentiable (we can calculate gradients for learning)
        - Negative x → output close to 0 ("no")
        - Positive x → output close to 1 ("yes")
        - x = 0 → output = 0.5 ("uncertain")
        
        WHY clip x?
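        - Very large NEGATIVE values of x make e^(-x) overflow
          (float64 overflows once the exponent exceeds roughly 709)
        - Clipping to [-500, 500] prevents the overflow; sigmoid is already
          indistinguishable from 0 or 1 that far out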
        """
        x = np.clip(x, -500, 500)  # Prevent overflow in exponential
        return 1 / (1 + np.exp(-x))
    
    def sigmoid_derivative(self, sigmoid_output):
        """
        Derivative of sigmoid: σ'(x) = σ(x) * (1 - σ(x))
        
        WHY do we need the derivative?
        - The derivative tells us the SLOPE (rate of change)
        - During learning, we need to know: "If I change the input slightly,
          how much does the output change?"
        - This is essential for backpropagation (learning from mistakes)
        
        WHY use sigmoid_output directly?
        - Beautiful math trick: sigmoid's derivative can be computed from
          the sigmoid value itself, saving computation!
        """
        return sigmoid_output * (1 - sigmoid_output)
    
    def forward(self, inputs, verbose=False):
        """
        Forward pass: compute the neuron's output for given inputs.
        
        Steps:
        1. Compute weighted sum: Σ(wi * xi) + bias
        2. Apply sigmoid activation
        3. Return the output together with the weighted sum
        
        This is like the student:
        1. Gathering all information and weighing it
        2. Making a decision based on the total
        """
        # Step 1: Weighted sum
        # WHY: Each input is multiplied by its weight, then all are summed
        # This combines all inputs into a single number
        # Think of it as: "How strong is the total evidence?"
        weighted_sum = np.dot(inputs, self.weights) + self.bias
        
        # Step 2: Activation (sigmoid)
        # WHY: The raw sum could be any number (-inf to +inf)
        # Sigmoid converts it to a probability between 0 and 1
        output = self.sigmoid(weighted_sum)
        
        # Verbose printing for educational purposes
        if verbose:
            print(f"  {Colors.CYAN}Inputs:        {inputs}{Colors.RESET}")
            print(f"  {Colors.MAGENTA}Weights:       {np.round(self.weights, 4)}{Colors.RESET}")
            print(f"  {Colors.BLUE}Bias:          {self.bias:.4f}{Colors.RESET}")
            print(f"  {Colors.YELLOW}Weighted Sum:  {weighted_sum:.4f}{Colors.RESET}")
            print(f"  {Colors.BRIGHT_GREEN}Sigmoid Output: {output:.4f}{Colors.RESET}")
            print()
        
        return output, weighted_sum
    
    def train_step(self, inputs, target):
        """
        One step of training: forward → compute error → update weights.
        
        This is the neuron LEARNING from one example.
        
        Like a teacher correcting a student:
        1. Student answers (forward pass)
        2. Teacher checks (compute error)
        3. Teacher gives feedback (compute gradient)
        4. Student adjusts (update weights)
        
        Parameters:
            inputs: the input values (what the neuron sees)
            target: the correct answer (what we WANT the neuron to output)
        
        Returns:
            error: how wrong the neuron was
        """
        # Step 1: Forward pass — make a prediction
        output, weighted_sum = self.forward(inputs)
        
        # Step 2: Compute error — how wrong are we?
        # WHY simple subtraction: For a single neuron, this works fine
        # For networks, we'd use more sophisticated loss functions
        error = target - output
        
        # Step 3: Compute the update signal (stored in `gradient`)
        # WHY: It tells us "which direction to adjust, and by how much"
        # gradient = error × sigmoid_derivative
        # - error: how wrong we are (magnitude and direction)
        # - sigmoid_derivative: how sensitive the output is to changes
        # NOTE: gradient × input is the NEGATIVE gradient of the loss
        # 0.5 × (target - output)², which is why we ADD it below
        sigmoid_deriv = self.sigmoid_derivative(output)
        gradient = error * sigmoid_deriv
        
        # Step 4: Update weights and bias
        # WHY: We move each weight in the direction that reduces the error
        # weight_new = weight_old + learning_rate × gradient × input
        # - input: only inputs that were active (non-zero) get adjusted
        # The learning_rate controls how big each step is
        self.weights += self.learning_rate * gradient * inputs
        self.bias += self.learning_rate * gradient
        
        return error


# ============================================================================
# DEMONSTRATION FUNCTIONS
# ============================================================================

def demo_single_neuron():
    """
    Demonstrate how a single neuron processes inputs step by step.
    """
    print_section("DEMO 1: How a Single Neuron Works", "🔬")
    
    print(f"  {Colors.WHITE}A neuron is like a student deciding whether to go to cricket:{Colors.RESET}")
    print(f"  {Colors.DIM}  Input 1: Is homework done?    (1 = yes, 0 = no){Colors.RESET}")
    print(f"  {Colors.DIM}  Input 2: Is weather good?     (1 = yes, 0 = no){Colors.RESET}")
    print(f"  {Colors.DIM}  Input 3: Are friends going?   (1 = yes, 0 = no){Colors.RESET}\n")
    
    # Create a neuron with 3 inputs
    # WHY 3: We have 3 pieces of information to consider
    neuron = Neuron(num_inputs=3)
    
    # Set meaningful weights manually for demonstration
    # WHY these values: They represent how much the student cares about each factor
    neuron.weights = np.array([0.8, 0.3, 0.6])  # Homework most important!
    neuron.bias = -0.5  # Slight tendency to NOT go (responsible student!)
    
    print(f"  {Colors.BRIGHT_MAGENTA}Neuron Configuration:{Colors.RESET}")
    print(f"  {Colors.MAGENTA}  Weight for homework:  0.8 (most important!){Colors.RESET}")
    print(f"  {Colors.MAGENTA}  Weight for weather:   0.3 (nice but not critical){Colors.RESET}")
    print(f"  {Colors.MAGENTA}  Weight for friends:   0.6 (important motivation){Colors.RESET}")
    print(f"  {Colors.BLUE}  Bias:                -0.5 (tends to stay home){Colors.RESET}\n")
    
    # Test different scenarios
    scenarios = [
        ([1, 1, 1], "Homework ✓, Weather ✓, Friends ✓"),
        ([1, 0, 1], "Homework ✓, Weather ✗, Friends ✓"),
        ([0, 1, 1], "Homework ✗, Weather ✓, Friends ✓"),
        ([0, 0, 0], "Homework ✗, Weather ✗, Friends ✗"),
    ]
    
    for inputs_list, description in scenarios:
        inputs = np.array(inputs_list, dtype=float)
        print(f"  {Colors.BRIGHT_YELLOW}Scenario: {description}{Colors.RESET}")
        output, _ = neuron.forward(inputs, verbose=True)
        
        # Interpret the decision
        if output > 0.5:
            print(f"  {Colors.BRIGHT_GREEN}  → Decision: GO to cricket! "
                  f"(confidence: {output:.1%}){Colors.RESET}\n")
        else:
            print(f"  {Colors.BRIGHT_RED}  → Decision: STAY home. "
                  f"(confidence: {1-output:.1%}){Colors.RESET}\n")


def demo_and_gate():
    """
    Test a neuron on the AND gate truth table.
    AND gate: output is 1 ONLY when BOTH inputs are 1.
    """
    print_section("DEMO 2: Neuron as AND Gate (Before Training)", "🔌")
    
    print(f"  {Colors.WHITE}AND Gate Truth Table:{Colors.RESET}")
    print(f"  {Colors.DIM}  0 AND 0 = 0")
    print(f"  0 AND 1 = 0")
    print(f"  1 AND 0 = 0")
    print(f"  1 AND 1 = 1{Colors.RESET}\n")
    
    # Create neuron with random weights
    np.random.seed(42)  # WHY: Makes results reproducible for teaching
    neuron = Neuron(num_inputs=2, learning_rate=0.5)
    
    # AND gate data
    # WHY binary inputs: Logic gates take 0/1 values; one row per input combination
    inputs_data = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    targets = np.array([0, 0, 0, 1], dtype=float)
    
    print(f"  {Colors.BRIGHT_MAGENTA}Initial Random Weights: "
          f"{np.round(neuron.weights, 4)}{Colors.RESET}")
    print(f"  {Colors.BLUE}Initial Bias: {neuron.bias:.4f}{Colors.RESET}\n")
    
    # Test BEFORE training
    print(f"  {Colors.BRIGHT_RED}Before Training (random guesses):{Colors.RESET}")
    for i in range(len(inputs_data)):
        output, _ = neuron.forward(inputs_data[i])
        expected = targets[i]
        correct = "✓" if round(output) == expected else "✗"
        color = Colors.GREEN if correct == "✓" else Colors.RED
        print(f"    {Colors.CYAN}{inputs_data[i]}{Colors.RESET} → "
              f"{Colors.YELLOW}{output:.4f}{Colors.RESET} "
              f"(expected: {expected:.0f}) {color}{correct}{Colors.RESET}")
    
    return neuron, inputs_data, targets


def demo_or_gate():
    """
    Test a neuron on the OR gate truth table.
    OR gate: output is 1 when AT LEAST ONE input is 1.
    """
    print_section("DEMO 3: Neuron as OR Gate (Before Training)", "🔌")
    
    print(f"  {Colors.WHITE}OR Gate Truth Table:{Colors.RESET}")
    print(f"  {Colors.DIM}  0 OR 0 = 0")
    print(f"  0 OR 1 = 1")
    print(f"  1 OR 0 = 1")
    print(f"  1 OR 1 = 1{Colors.RESET}\n")
    
    np.random.seed(123)
    neuron = Neuron(num_inputs=2, learning_rate=0.5)
    
    inputs_data = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    targets = np.array([0, 1, 1, 1], dtype=float)
    
    print(f"  {Colors.BRIGHT_MAGENTA}Initial Random Weights: "
          f"{np.round(neuron.weights, 4)}{Colors.RESET}")
    print(f"  {Colors.BLUE}Initial Bias: {neuron.bias:.4f}{Colors.RESET}\n")
    
    print(f"  {Colors.BRIGHT_RED}Before Training (random guesses):{Colors.RESET}")
    for i in range(len(inputs_data)):
        output, _ = neuron.forward(inputs_data[i])
        expected = targets[i]
        correct = "✓" if round(output) == expected else "✗"
        color = Colors.GREEN if correct == "✓" else Colors.RED
        print(f"    {Colors.CYAN}{inputs_data[i]}{Colors.RESET} → "
              f"{Colors.YELLOW}{output:.4f}{Colors.RESET} "
              f"(expected: {expected:.0f}) {color}{correct}{Colors.RESET}")
    
    return neuron, inputs_data, targets


def train_neuron(neuron, inputs_data, targets, gate_name, epochs=5000):
    """
    Train a neuron to learn a logic gate.
    
    This is the LEARNING process:
    - We show the neuron each example many times (epochs)
    - Each time, it adjusts its weights to reduce the error
    - Over time, it learns the correct behavior
    
    Like a student practicing math problems:
    - First attempts are mostly wrong
    - With practice, accuracy improves
    - Eventually, the student masters it!
    """
    print_section(f"TRAINING: Neuron Learning {gate_name} Gate", "🎓")
    
    print(f"  {Colors.WHITE}Training for {epochs} epochs...{Colors.RESET}")
    print(f"  {Colors.DIM}(Each epoch = showing ALL examples once){Colors.RESET}\n")
    
    # Track errors for display
    # WHY: We want to see the neuron improving over time
    milestone_epochs = [0, 10, 50, 100, 500, 1000, 2000, epochs-1]
    
    for epoch in range(epochs):
        total_error = 0
        
        # Train on each example
        # WHY shuffle? In real training, shuffling prevents the network
        # from learning the ORDER of examples instead of the patterns.
        # For this simple demo, we keep it ordered for clarity.
        for i in range(len(inputs_data)):
            error = neuron.train_step(inputs_data[i], targets[i])
            total_error += abs(error)
        
        # Print progress at milestones
        if epoch in milestone_epochs:
            avg_error = total_error / len(inputs_data)
            
            # Color based on error level
            if avg_error < 0.05:
                color = Colors.BRIGHT_GREEN
                bar = "█" * 20
                status = "🎉 Mastered!"
            elif avg_error < 0.1:
                color = Colors.GREEN
                bar_len = int(20 * (1 - avg_error))
                bar = "█" * bar_len + "░" * (20 - bar_len)
                status = "Almost there!"
            elif avg_error < 0.2:
                color = Colors.YELLOW
                bar_len = int(20 * (1 - avg_error))
                bar = "█" * bar_len + "░" * (20 - bar_len)
                status = "Getting better..."
            else:
                color = Colors.RED
                bar_len = int(20 * (1 - min(avg_error, 1.0)))
                bar = "█" * bar_len + "░" * (20 - bar_len)
                status = "Still learning..."
            
            print(f"  {Colors.DIM}Epoch {epoch:>5}{Colors.RESET} │ "
                  f"{color}Error: {avg_error:.4f} │ [{bar}] │ {status}{Colors.RESET}")
    
    # Show final results
    print(f"\n  {Colors.BRIGHT_GREEN}{'─'*50}")
    print(f"  ✅ Training Complete!{Colors.RESET}\n")
    
    print(f"  {Colors.BRIGHT_MAGENTA}Final Weights: "
          f"{np.round(neuron.weights, 4)}{Colors.RESET}")
    print(f"  {Colors.BLUE}Final Bias: {neuron.bias:.4f}{Colors.RESET}\n")
    
    print(f"  {Colors.BRIGHT_GREEN}After Training:{Colors.RESET}")
    all_correct = True
    for i in range(len(inputs_data)):
        output, _ = neuron.forward(inputs_data[i])
        expected = targets[i]
        correct = "✓" if round(output) == expected else "✗"
        color = Colors.GREEN if correct == "✓" else Colors.RED
        if correct == "✗":
            all_correct = False
        print(f"    {Colors.CYAN}{inputs_data[i]}{Colors.RESET} → "
              f"{Colors.YELLOW}{output:.4f}{Colors.RESET} "
              f"(expected: {expected:.0f}) → rounded: {round(output)} "
              f"{color}{correct}{Colors.RESET}")
    
    if all_correct:
        print(f"\n  {Colors.BRIGHT_GREEN}🎉 The neuron learned the {gate_name} gate "
              f"PERFECTLY!{Colors.RESET}")
    else:
        print(f"\n  {Colors.YELLOW}⚠️  The neuron is still learning. "
              f"Try more epochs!{Colors.RESET}")


def explain_learning():
    """
    Print a visual explanation of how the neuron learns.
    """
    print_section("HOW DOES THE NEURON LEARN?", "💡")
    
    print(f"""  {Colors.WHITE}The neuron learns through a simple 4-step process:{Colors.RESET}

  {Colors.BRIGHT_CYAN}┌──────────────────────────────────────────────────────┐
  │                                                      │
  │  Step 1: FORWARD PASS                                │
  │  ─────────────────                                   │
  │  Feed inputs through the neuron to get a prediction  │
  │  output = sigmoid(weights · inputs + bias)           │
  │                                                      │
  │  Step 2: COMPUTE ERROR                               │
  │  ────────────────                                    │
  │  error = expected_output - actual_output             │
  │  "How wrong was the neuron?"                         │
  │                                                      │
  │  Step 3: COMPUTE GRADIENT                            │
  │  ───────────────────                                 │
  │  gradient = error × sigmoid_derivative(output)       │
  │  "Which direction should we adjust?"                 │
  │                                                      │
  │  Step 4: UPDATE WEIGHTS                              │
  │  ─────────────────                                   │
  │  weight += learning_rate × gradient × input          │
  │  "Nudge weights to reduce the error"                 │
  │                                                      │
  └──────────────────────────────────────────────────────┘{Colors.RESET}

  {Colors.BRIGHT_YELLOW}KEY INSIGHT:{Colors.RESET}
  {Colors.YELLOW}This is like a student learning from a teacher:
    1. Student answers a question (forward pass)
    2. Teacher says "you're wrong by X" (error)
    3. Teacher says "adjust your thinking THIS way" (gradient)
    4. Student adjusts their understanding (weight update)
  Repeat thousands of times → student masters the subject!{Colors.RESET}
""")


# ============================================================================
# MAIN EXECUTION
# ============================================================================

if __name__ == "__main__":
    """
    Main execution block — runs when you execute this script directly.
    
    WHY __name__ == '__main__'?
    - This is a Python convention
    - Code inside this block ONLY runs when you run this file directly
    - If someone imports this file, this code won't execute
    - This lets us use the Neuron class in other files without running demos
    """
    
    # Print the beautiful header
    print_header()
    
    # Demo 1: Show how a single neuron works
    demo_single_neuron()
    
    # Demo 2: AND gate (before training)
    and_neuron, and_inputs, and_targets = demo_and_gate()
    
    # Demo 3: OR gate (before training)
    or_neuron, or_inputs, or_targets = demo_or_gate()
    
    # Explain how learning works
    explain_learning()
    
    # Train the AND gate neuron
    train_neuron(and_neuron, and_inputs, and_targets, "AND", epochs=5000)
    
    # Train the OR gate neuron
    # (demo_or_gate() only ran forward passes, so it still has its seeded initial weights)
    train_neuron(or_neuron, or_inputs, or_targets, "OR", epochs=5000)
    
    # Print the footer
    print_footer()
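
A useful sanity check for the update rule in `train_step` is to compare it with a numerical gradient. The sketch below is a standalone check, not part of `step1_neuron.py`: it assumes the loss being minimised is half the squared error, and the weights, inputs and target are arbitrary values picked for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def loss(weights, bias, inputs, target):
    """Half squared error of a single sigmoid neuron (assumed loss)."""
    output = sigmoid(np.dot(inputs, weights) + bias)
    return 0.5 * (target - output) ** 2


weights = np.array([0.4, -0.7])   # arbitrary illustrative values
bias = 0.1
inputs = np.array([1.0, 1.0])
target = 1.0

# Analytic step direction, written the same way as in train_step.
output = sigmoid(np.dot(inputs, weights) + bias)
delta = (target - output) * output * (1 - output)
analytic_step = delta * inputs

# Numerical gradient of the loss via central differences.
eps = 1e-6
numeric_grad = np.zeros_like(weights)
for i in range(len(weights)):
    bump = np.zeros_like(weights)
    bump[i] = eps
    numeric_grad[i] = (loss(weights + bump, bias, inputs, target)
                       - loss(weights - bump, bias, inputs, target)) / (2 * eps)

print("analytic step:     ", analytic_step)
print("minus numeric grad:", -numeric_grad)
```

The two printed vectors should agree to several decimal places. This shows that adding `learning_rate × gradient × input` to the weights is the same as stepping opposite to the true gradient of the loss.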

Complete Code: step2_network.py

Python
"""
================================================================================
🧠 LEVEL 2 — STEP 2: BUILDING A NEURAL NETWORK
================================================================================
Build a full neural network from scratch with:
  - Input layer → Hidden layer (16 neurons) → Output layer
  - Sigmoid activation for hidden layer
  - Softmax activation for output layer
  - Forward pass that shows data flowing through each layer

NO DEEP LEARNING FRAMEWORKS — just NumPy and math!
================================================================================
"""

# ============================================================================
# IMPORTS
# ============================================================================
import numpy as np  # NumPy for fast array operations
import os           # For file path operations


# ============================================================================
# ANSI COLOR CODES
# ============================================================================
# WHY: Colors make terminal output easier to read and more engaging
# Each color code starts with \033[ (escape sequence) and ends with m

class Colors:
    """ANSI color codes for beautiful terminal output."""
    RESET   = "\033[0m"
    BOLD    = "\033[1m"
    DIM     = "\033[2m"
    
    RED     = "\033[31m"
    GREEN   = "\033[32m"
    YELLOW  = "\033[33m"
    BLUE    = "\033[34m"
    MAGENTA = "\033[35m"
    CYAN    = "\033[36m"
    WHITE   = "\033[37m"
    
    BRIGHT_RED     = "\033[91m"
    BRIGHT_GREEN   = "\033[92m"
    BRIGHT_YELLOW  = "\033[93m"
    BRIGHT_BLUE    = "\033[94m"
    BRIGHT_MAGENTA = "\033[95m"
    BRIGHT_CYAN    = "\033[96m"

    # Background colors for extra flair
    BG_BLUE   = "\033[44m"
    BG_GREEN  = "\033[42m"
    BG_YELLOW = "\033[43m"


# ============================================================================
# HELPER FUNCTIONS
# ============================================================================

def print_header():
    """Print a beautiful header for this script."""
    print(f"\n{Colors.BRIGHT_MAGENTA}{'='*70}")
    print(f"  🧠  LEVEL 2 — STEP 2: BUILDING A NEURAL NETWORK")
    print(f"{'='*70}{Colors.RESET}")
    print(f"{Colors.DIM}  A multi-layer network with forward pass visualization!{Colors.RESET}")
    print(f"{Colors.DIM}  Input → Hidden (16 neurons, sigmoid) → Output (softmax){Colors.RESET}\n")


def print_footer():
    """Print a beautiful footer for this script."""
    print(f"\n{Colors.BRIGHT_MAGENTA}{'='*70}")
    print(f"  ✅  STEP 2 COMPLETE! You built a full neural network!")
    print(f"  📝  Next: step3_train.py — Train it to generate text!")
    print(f"{'='*70}{Colors.RESET}\n")


def print_section(title, emoji="📌"):
    """Print a section header with color."""
    print(f"\n{Colors.BRIGHT_YELLOW}{'─'*70}")
    print(f"  {emoji}  {title}")
    print(f"{'─'*70}{Colors.RESET}\n")


# ============================================================================
# NEURAL NETWORK CLASS
# ============================================================================

class NeuralNetwork:
    """
    A simple feedforward neural network with one hidden layer.
    
    Architecture:
        Input (n features) → Hidden (16 neurons, sigmoid) → Output (m classes, softmax)
    
    WHY this architecture?
    - One hidden layer is enough to learn many patterns (Universal Approximation Theorem)
    - 16 hidden neurons is enough for simple tasks and small enough to show the concept clearly
    - Sigmoid in hidden layer: squashes values to (0, 1), introduces non-linearity
    - Softmax in output layer: converts raw scores into probabilities that sum to 1
    
    WHY non-linearity matters:
    - Without activation functions, stacking layers would be pointless
    - Multiple linear layers collapse into a single linear layer
    - Non-linearity (sigmoid) allows the network to learn CURVES, not just lines
    """
    
    def __init__(self, input_size, hidden_size=16, output_size=4):
        """
        Initialize the network with random weights.
        
        Parameters:
            input_size:  Number of input features (e.g., 4 for 4 inputs)
            hidden_size: Number of neurons in hidden layer (default: 16)
            output_size: Number of output classes (default: 4)
        
        WHY Xavier initialization?
        - Random weights that are too large → weighted sums grow huge and sigmoid saturates near 0 or 1
        - Random weights that are too small → signals shrink toward zero from layer to layer
        - Xavier initialization scales weights by sqrt(2 / (fan_in + fan_out)) to keep values reasonable
        - Named after Xavier Glorot, who proposed it with Yoshua Bengio in 2010
        """
        # Store architecture info for display
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        
        # ── Weights connecting Input → Hidden ──
        # Shape: (input_size, hidden_size)
        # WHY this shape: Each input connects to EVERY hidden neuron
        # So we need input_size × hidden_size connections
        # Xavier initialization: scale by sqrt(2 / (fan_in + fan_out))
        scale_1 = np.sqrt(2.0 / (input_size + hidden_size))
        self.weights_input_hidden = np.random.randn(input_size, hidden_size) * scale_1
        
        # ── Biases for Hidden Layer ──
        # Shape: (hidden_size,) — one bias per hidden neuron
        # WHY zeros: Biases start at zero; they'll be learned during training
        self.biases_hidden = np.zeros(hidden_size)
        
        # ── Weights connecting Hidden → Output ──
        # Shape: (hidden_size, output_size)
        # WHY: Each hidden neuron connects to EVERY output neuron
        scale_2 = np.sqrt(2.0 / (hidden_size + output_size))
        self.weights_hidden_output = np.random.randn(hidden_size, output_size) * scale_2
        
        # ── Biases for Output Layer ──
        # Shape: (output_size,) — one bias per output neuron
        self.biases_output = np.zeros(output_size)
    
    def sigmoid(self, x):
        """
        Sigmoid activation: σ(x) = 1 / (1 + e^(-x))
        
        WHY sigmoid for hidden layer?
        - Squashes values into (0, 1) range
        - Smooth and differentiable (needed for backpropagation)
        - Easy to understand conceptually: "how active is this neuron?"
        - A neuron with output close to 1 is "strongly activated"
        - A neuron with output close to 0 is "barely activated"
        """
        x = np.clip(x, -500, 500)  # Prevent overflow
        return 1 / (1 + np.exp(-x))
    
    def softmax(self, x):
        """
        Softmax activation: converts raw scores to probabilities.
        
        Formula: softmax(xi) = e^(xi) / Σ(e^(xj))
        
        WHY softmax for output layer?
        - We want PROBABILITIES (they must sum to 1.0)
        - If we're predicting "which class?", we want: P(class1) + P(class2) + ... = 1
        - Softmax naturally does this!
        
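        WHY subtract max(x)?
        - Numerical stability! e^(large number) overflows to infinity, and inf / inf gives NaN
        - Subtracting max makes the largest value 0, preventing overflow
        - Math still works: softmax(x) = softmax(x - max(x))
          (the common factor e^(-max) cancels between numerator and denominator)
        """
        # Subtract max for numerical stability (prevents e^1000 = infinity)
        # Note: np.max over the whole array assumes x is a single 1-D score vector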
        x_shifted = x - np.max(x)
        exp_x = np.exp(x_shifted)
        return exp_x / np.sum(exp_x)
    
    def forward(self, inputs, verbose=False):
        """
        Forward pass: push input through all layers to get output.
        
        Flow:
            Input → (weights × input + bias) → sigmoid → Hidden
            Hidden → (weights × hidden + bias) → softmax → Output
        
        This is like a relay race:
        - Input layer PASSES information to hidden layer
        - Hidden layer PROCESSES it (sigmoid squashes it)
        - Hidden layer PASSES processed info to output layer
        - Output layer CONVERTS it to probabilities (softmax)
        
        Returns:
            output_probs: probability distribution over output classes
            hidden_raw:   raw weighted sums before sigmoid (for visualization)
            hidden_activated: hidden values after sigmoid (for visualization)
            output_raw:   raw weighted sums before softmax (for visualization)
        """
        # ── LAYER 1: Input → Hidden ──
        # Matrix multiplication: each hidden neuron computes its weighted sum
        # WHY np.dot: This efficiently computes all weighted sums at once
        # Instead of looping over each neuron, matrix math does it in one shot!
        hidden_raw = np.dot(inputs, self.weights_input_hidden) + self.biases_hidden
        # Shape: (hidden_size,) — one value per hidden neuron
        
        # Apply sigmoid activation to hidden layer
        # WHY: Without this, the network is just a linear function
        # Sigmoid introduces the non-linearity that makes neural networks powerful
        hidden_activated = self.sigmoid(hidden_raw)
        
        # ── LAYER 2: Hidden → Output ──
        # Same process: matrix multiply hidden activations by output weights
        output_raw = np.dot(hidden_activated, self.weights_hidden_output) + self.biases_output
        
        # Apply softmax to get probabilities
        # WHY softmax here: We want the output to be a probability distribution
        # Each output value represents: "how likely is this class?"
        output_probs = self.softmax(output_raw)
        
        # Verbose output for educational purposes
        if verbose:
            self._print_forward_pass(inputs, hidden_raw, hidden_activated,
                                     output_raw, output_probs)
        
        return output_probs, hidden_raw, hidden_activated, output_raw
    
    def _print_forward_pass(self, inputs, hidden_raw, hidden_activated,
                            output_raw, output_probs):
        """
        Print the complete forward pass with beautiful formatting.
        Shows exactly what happens at each stage of the computation.
        """
        print(f"  {Colors.BRIGHT_CYAN}╔══════════════════════════════════════════════════════╗")
        print(f"  ║            FORWARD PASS VISUALIZATION               ║")
        print(f"  ╚══════════════════════════════════════════════════════╝{Colors.RESET}\n")
        
        # ── Input Layer ──
        print(f"  {Colors.BRIGHT_GREEN}▸ INPUT LAYER{Colors.RESET} "
              f"{Colors.DIM}({self.input_size} values){Colors.RESET}")
        print(f"    {Colors.CYAN}", end="")
        for i, val in enumerate(inputs):
            print(f"x{i}={val:.2f}  ", end="")
        print(f"{Colors.RESET}\n")
        
        print(f"  {Colors.DIM}    │  Matrix multiply by weights "
              f"({self.input_size}×{self.hidden_size}){Colors.RESET}")
        print(f"  {Colors.DIM}    │  Add biases ({self.hidden_size}){Colors.RESET}")
        print(f"  {Colors.DIM}    ▼{Colors.RESET}\n")
        
        # ── Hidden Layer (raw) ──
        print(f"  {Colors.BRIGHT_YELLOW}▸ HIDDEN LAYER — Raw Weighted Sums{Colors.RESET} "
              f"{Colors.DIM}(before activation){Colors.RESET}")
        self._print_neuron_values(hidden_raw, "h", Colors.YELLOW)
        
        print(f"\n  {Colors.DIM}    │  Apply sigmoid: σ(x) = 1/(1+e^(-x)){Colors.RESET}")
        print(f"  {Colors.DIM}    │  Squash each value to (0, 1){Colors.RESET}")
        print(f"  {Colors.DIM}    ▼{Colors.RESET}\n")
        
        # ── Hidden Layer (activated) ──
        print(f"  {Colors.BRIGHT_GREEN}▸ HIDDEN LAYER — After Sigmoid{Colors.RESET} "
              f"{Colors.DIM}(activated values){Colors.RESET}")
        self._print_neuron_values(hidden_activated, "a", Colors.GREEN, show_bar=True)
        
        print(f"\n  {Colors.DIM}    │  Matrix multiply by weights "
              f"({self.hidden_size}×{self.output_size}){Colors.RESET}")
        print(f"  {Colors.DIM}    │  Add biases ({self.output_size}){Colors.RESET}")
        print(f"  {Colors.DIM}    ▼{Colors.RESET}\n")
        
        # ── Output Layer (raw) ──
        print(f"  {Colors.BRIGHT_MAGENTA}▸ OUTPUT LAYER — Raw Scores{Colors.RESET} "
              f"{Colors.DIM}(before softmax){Colors.RESET}")
        self._print_neuron_values(output_raw, "o", Colors.MAGENTA)
        
        print(f"\n  {Colors.DIM}    │  Apply softmax: convert to probabilities{Colors.RESET}")
        print(f"  {Colors.DIM}    │  All values sum to 1.0{Colors.RESET}")
        print(f"  {Colors.DIM}    ▼{Colors.RESET}\n")
        
        # ── Output Layer (probabilities) ──
        print(f"  {Colors.BRIGHT_CYAN}▸ OUTPUT LAYER — Probabilities{Colors.RESET} "
              f"{Colors.DIM}(after softmax){Colors.RESET}")
        self._print_probability_bars(output_probs)
        
        # Sum check
        print(f"\n    {Colors.DIM}Sum of probabilities: {Colors.BRIGHT_GREEN}"
              f"{np.sum(output_probs):.6f}{Colors.RESET} "
              f"{Colors.DIM}(should be 1.000000) ✓{Colors.RESET}")
        
        # Prediction
        predicted_class = np.argmax(output_probs)
        print(f"\n  {Colors.BRIGHT_GREEN}  🎯 Predicted Class: {predicted_class} "
              f"(probability: {output_probs[predicted_class]:.4f}){Colors.RESET}")
    
    def _print_neuron_values(self, values, prefix, color, show_bar=False):
        """Print neuron values in a formatted grid."""
        # Print 4 values per row for readability
        for i in range(0, len(values), 4):
            row_vals = values[i:i+4]
            row_str = "    "
            for j, val in enumerate(row_vals):
                idx = i + j
                if show_bar:
                    # Show a mini bar chart for activated values (0-1 range)
                    bar_len = int(val * 10)
                    bar = "█" * bar_len + "░" * (10 - bar_len)
                    row_str += f"{color}{prefix}{idx:>2}={val:>6.3f} [{bar}]  {Colors.RESET}"
                else:
                    row_str += f"{color}{prefix}{idx:>2}={val:>8.4f}  {Colors.RESET}"
            print(row_str)
    
    def _print_probability_bars(self, probs):
        """Print probability values as horizontal bar charts."""
        max_idx = np.argmax(probs)
        for i, prob in enumerate(probs):
            bar_len = int(prob * 40)  # Scale to 40 characters wide
            bar = "█" * bar_len + "░" * (40 - bar_len)
            
            # Highlight the highest probability
            if i == max_idx:
                print(f"    {Colors.BRIGHT_GREEN}Class {i}: {prob:.4f} [{bar}] ◄ WINNER{Colors.RESET}")
            else:
                print(f"    {Colors.CYAN}Class {i}: {prob:.4f} [{bar}]{Colors.RESET}")


# ============================================================================
# VISUALIZATION FUNCTIONS
# ============================================================================

def print_network_architecture(net):
    """
    Print a visual ASCII art representation of the network architecture.
    
    WHY visualize?
    - Seeing the structure helps understand what's happening
    - You can see how many connections there are
    - It makes the concept tangible
    """
    print_section("NETWORK ARCHITECTURE", "🏗️")
    
    # Count total parameters
    total_params = (net.input_size * net.hidden_size +   # Input→Hidden weights
                   net.hidden_size +                      # Hidden biases
                   net.hidden_size * net.output_size +    # Hidden→Output weights
                   net.output_size)                       # Output biases
    
    total_weights = (net.input_size * net.hidden_size + 
                    net.hidden_size * net.output_size)
    total_biases = net.hidden_size + net.output_size
    
    print(f"  {Colors.BRIGHT_CYAN}Network Configuration:{Colors.RESET}")
    print(f"  {Colors.CYAN}  • Input size:    {net.input_size} neurons{Colors.RESET}")
    print(f"  {Colors.CYAN}  • Hidden size:   {net.hidden_size} neurons (sigmoid){Colors.RESET}")
    print(f"  {Colors.CYAN}  • Output size:   {net.output_size} neurons (softmax){Colors.RESET}")
    print(f"  {Colors.YELLOW}  • Total weights: {total_weights}{Colors.RESET}")
    print(f"  {Colors.YELLOW}  • Total biases:  {total_biases}{Colors.RESET}")
    print(f"  {Colors.BRIGHT_GREEN}  • Total params:  {total_params}{Colors.RESET}\n")
    
    # ASCII art network diagram
    print(f"  {Colors.BRIGHT_MAGENTA}Visual Architecture:{Colors.RESET}\n")
    
    # Determine how many neurons to show (max display for readability)
    in_show = min(net.input_size, 5)
    hid_show = min(net.hidden_size, 6)
    out_show = min(net.output_size, 5)
    
    # Build the visual layer by layer
    in_labels = [f"x{i}" for i in range(in_show)]
    if net.input_size > in_show:
        in_labels.append("...")
        in_labels.append(f"x{net.input_size-1}")
    
    hid_labels = [f"h{i}" for i in range(hid_show)]
    if net.hidden_size > hid_show:
        hid_labels.append("...")
        hid_labels.append(f"h{net.hidden_size-1}")
    
    out_labels = [f"y{i}" for i in range(out_show)]
    if net.output_size > out_show:
        out_labels.append("...")
        out_labels.append(f"y{net.output_size-1}")
    
    # Calculate layout
    max_rows = max(len(in_labels), len(hid_labels), len(out_labels))
    
    # Pad lists to same length
    def pad_list(lst, target_len):
        while len(lst) < target_len:
            lst.append("")
        return lst
    
    in_labels = pad_list(in_labels, max_rows)
    hid_labels = pad_list(hid_labels, max_rows)
    out_labels = pad_list(out_labels, max_rows)
    
    # Print header
    print(f"    {Colors.BRIGHT_GREEN}  INPUT       {Colors.BRIGHT_YELLOW}  HIDDEN        {Colors.BRIGHT_CYAN}  OUTPUT{Colors.RESET}")
    print(f"    {Colors.BRIGHT_GREEN}  LAYER       {Colors.BRIGHT_YELLOW}  LAYER         {Colors.BRIGHT_CYAN}  LAYER{Colors.RESET}")
    print(f"    {Colors.BRIGHT_GREEN}  ({net.input_size})       "
          f"{Colors.BRIGHT_YELLOW}  ({net.hidden_size}, sigmoid) "
          f"{Colors.BRIGHT_CYAN}  ({net.output_size}, softmax){Colors.RESET}")
    print()
    
    for i in range(max_rows):
        in_val = in_labels[i]
        hid_val = hid_labels[i]
        out_val = out_labels[i]
        
        # Input neuron
        if in_val and in_val != "...":
            in_part = f"{Colors.BRIGHT_GREEN}   ( {in_val:>3} ){Colors.RESET}"
        elif in_val == "...":
            in_part = f"{Colors.DIM}     ...   {Colors.RESET}"
        else:
            in_part = "           "
        
        # Connection lines
        if in_val and in_val != "..." and hid_val and hid_val != "...":
            conn1 = f"{Colors.DIM}─────►{Colors.RESET}"
        elif in_val and in_val != "...":
            conn1 = f"{Colors.DIM}──┐   {Colors.RESET}"
        elif hid_val and hid_val != "...":
            conn1 = f"{Colors.DIM}  └──►{Colors.RESET}"
        else:
            conn1 = "      "
        
        # Hidden neuron
        if hid_val and hid_val != "...":
            hid_part = f"{Colors.BRIGHT_YELLOW}( {hid_val:>3} ){Colors.RESET}"
        elif hid_val == "...":
            hid_part = f"{Colors.DIM}  ...  {Colors.RESET}"
        else:
            hid_part = "       "
        
        # Connection lines 2
        if hid_val and hid_val != "..." and out_val and out_val != "...":
            conn2 = f"{Colors.DIM}─────►{Colors.RESET}"
        elif hid_val and hid_val != "...":
            conn2 = f"{Colors.DIM}──┐   {Colors.RESET}"
        elif out_val and out_val != "...":
            conn2 = f"{Colors.DIM}  └──►{Colors.RESET}"
        else:
            conn2 = "      "
        
        # Output neuron
        if out_val and out_val != "...":
            out_part = f"{Colors.BRIGHT_CYAN}( {out_val:>3} ){Colors.RESET}"
        elif out_val == "...":
            out_part = f"{Colors.DIM}  ...  {Colors.RESET}"
        else:
            out_part = "       "
        
        print(f"  {in_part}{conn1}{hid_part}{conn2}{out_part}")
    
    print(f"\n  {Colors.DIM}  Note: In reality, EVERY input neuron connects to EVERY hidden neuron,")
    print(f"  and EVERY hidden neuron connects to EVERY output neuron.{Colors.RESET}")
    print(f"  {Colors.DIM}  That's {net.input_size}×{net.hidden_size} + {net.hidden_size}×{net.output_size} = {total_weights} connection weights!{Colors.RESET}")


def print_weight_statistics(net):
    """
    Print statistics about the network's weights.
    
    WHY: Understanding weight distributions helps diagnose network health.
    - If weights are too large: weighted sums blow up and sigmoid saturates
    - If weights are too small: signals and gradients shrink toward zero
    - Well-initialized weights should have mean ≈ 0 and small std
    """
    print_section("WEIGHT STATISTICS", "📊")
    
    layers = [
        ("Input → Hidden", net.weights_input_hidden),
        ("Hidden Biases", net.biases_hidden),
        ("Hidden → Output", net.weights_hidden_output),
        ("Output Biases", net.biases_output),
    ]
    
    print(f"  {Colors.WHITE}{'Layer':<20} {'Shape':<15} {'Mean':>8} {'Std':>8} "
          f"{'Min':>8} {'Max':>8}{Colors.RESET}")
    print(f"  {Colors.DIM}{'─'*20} {'─'*15} {'─'*8} {'─'*8} {'─'*8} {'─'*8}{Colors.RESET}")
    
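    all_healthy = True  # Becomes False if any layer misses the "healthy" range
    for name, weights in layers:
        mean = np.mean(weights)
        std = np.std(weights)
        wmin = np.min(weights)
        wmax = np.max(weights)
        shape = str(weights.shape)
        
        # Color code based on health
        if abs(mean) < 0.1 and std < 1.0:
            color = Colors.GREEN  # Healthy
        elif abs(mean) < 0.5:
            color = Colors.YELLOW  # Okay
            all_healthy = False
        else:
            color = Colors.RED  # Concerning
            all_healthy = False
        
        print(f"  {color}{name:<20} {shape:<15} {mean:>8.4f} {std:>8.4f} "
              f"{wmin:>8.4f} {wmax:>8.4f}{Colors.RESET}")
    
    # Only report success when every layer actually passed the check above
    if all_healthy:
        print(f"\n  {Colors.BRIGHT_GREEN}✓ All weights look healthy! "
              f"(Xavier initialization working well){Colors.RESET}")
    else:
        print(f"\n  {Colors.BRIGHT_YELLOW}⚠ Some layers fall outside the healthy range "
              f"(see the colors above){Colors.RESET}")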


def demo_forward_pass(net):
    """
    Run a sample forward pass and display everything.
    """
    print_section("FORWARD PASS DEMO", "🚀")
    
    # Create a sample input
    # WHY this input: We use values between 0 and 1 to simulate typical features
    sample_input = np.random.rand(net.input_size)
    
    print(f"  {Colors.WHITE}Feeding a sample input through the network...{Colors.RESET}")
    print(f"  {Colors.DIM}(Using random input values between 0 and 1){Colors.RESET}\n")
    
    # Run forward pass with verbose output
    output_probs, hidden_raw, hidden_activated, output_raw = net.forward(
        sample_input, verbose=True
    )


def demo_multiple_inputs(net):
    """
    Show how the network responds to different inputs.
    """
    print_section("MULTIPLE INPUTS — SAME NETWORK", "🔄")
    
    print(f"  {Colors.WHITE}Let's see how the same network responds to different inputs:{Colors.RESET}\n")
    
    # Generate several different inputs
    test_inputs = [
        ("All zeros", np.zeros(net.input_size)),
        ("All ones", np.ones(net.input_size)),
        ("Random #1", np.random.rand(net.input_size)),
        ("Random #2", np.random.rand(net.input_size)),
        ("Alternating", np.array([1 if i % 2 == 0 else 0 for i in range(net.input_size)], dtype=float)),
    ]
    
    print(f"  {Colors.WHITE}{'Input Type':<15} ", end="")
    for i in range(net.output_size):
        print(f"{'Class '+str(i):>10} ", end="")
    print(f"{'Prediction':>12}{Colors.RESET}")
    
    print(f"  {Colors.DIM}{'─'*15} ", end="")
    for i in range(net.output_size):
        print(f"{'─'*10} ", end="")
    print(f"{'─'*12}{Colors.RESET}")
    
    for name, inp in test_inputs:
        output_probs, _, _, _ = net.forward(inp)
        predicted = np.argmax(output_probs)
        
        print(f"  {Colors.CYAN}{name:<15} ", end="")
        for prob in output_probs:
            # Color intensity based on probability
            if prob > 0.5:
                color = Colors.BRIGHT_GREEN
            elif prob > 0.25:
                color = Colors.YELLOW
            else:
                color = Colors.DIM
            print(f"{color}{prob:>10.4f} ", end="")
        print(f"{Colors.BRIGHT_GREEN}→ Class {predicted}{Colors.RESET}")
    
    print(f"\n  {Colors.BRIGHT_YELLOW}💡 Notice: Without training, the network gives "
          f"random-looking predictions!{Colors.RESET}")
    print(f"  {Colors.YELLOW}   This is because the weights are random. "
          f"Training will fix this!{Colors.RESET}")


def explain_concepts():
    """
    Print educational explanation of key concepts.
    """
    print_section("KEY CONCEPTS EXPLAINED", "💡")
    
    print(f"""  {Colors.BRIGHT_CYAN}┌──────────────────────────────────────────────────────────┐
  │                   WHAT JUST HAPPENED?                   │
  ├──────────────────────────────────────────────────────────┤
  │                                                          │
  │  1. INPUT LAYER receives raw data                        │
  │     → Just passes values through, no processing          │
  │                                                          │
  │  2. HIDDEN LAYER does the heavy lifting                  │
  │     → Multiplies inputs by weights (matrix multiplication)│
  │     → Adds biases (shifts the activation)                │
  │     → Applies sigmoid (introduces non-linearity)         │
  │     → Each neuron detects a different PATTERN            │
  │                                                          │
  │  3. OUTPUT LAYER makes the final decision                │
  │     → Combines hidden neurons' opinions                  │
  │     → Applies softmax (converts to probabilities)        │
  │     → Highest probability = the prediction               │
  │                                                          │
  ├──────────────────────────────────────────────────────────┤
  │                                                          │
  │  🔑 KEY INSIGHT:                                         │
  │  Right now the network is UNTRAINED — its weights are    │
  │  random, so predictions are random too!                  │
  │  In step3_train.py, we'll train it to learn patterns.    │
  │                                                          │
  └──────────────────────────────────────────────────────────┘{Colors.RESET}
""")

    # Activation function comparison
    print(f"  {Colors.BRIGHT_YELLOW}Activation Functions Used:{Colors.RESET}\n")
    
    print(f"  {Colors.GREEN}  SIGMOID (Hidden Layer):{Colors.RESET}")
    print(f"  {Colors.DIM}  σ(x) = 1 / (1 + e^(-x)){Colors.RESET}")
    print(f"  {Colors.DIM}  Output range: (0, 1){Colors.RESET}")
    print(f"  {Colors.DIM}  Used for: internal feature detection{Colors.RESET}\n")
    
    # ASCII sigmoid curve
    print(f"    {Colors.GREEN}1.0 ┤                    ●●●●●●●{Colors.RESET}")
    print(f"    {Colors.GREEN}    │                 ●●●{Colors.RESET}")
    print(f"    {Colors.GREEN}    │               ●●{Colors.RESET}")
    print(f"    {Colors.GREEN}0.5 ┤             ●●{Colors.RESET}")
    print(f"    {Colors.GREEN}    │           ●●{Colors.RESET}")
    print(f"    {Colors.GREEN}    │        ●●●{Colors.RESET}")
    print(f"    {Colors.GREEN}0.0 ┤  ●●●●●●●{Colors.RESET}")
    print(f"    {Colors.DIM}    └──┬───┬───┬───┬───┬───┬──{Colors.RESET}")
    print(f"    {Colors.DIM}      -5   -3   -1   +1   +3   +5{Colors.RESET}\n")
    
    print(f"  {Colors.MAGENTA}  SOFTMAX (Output Layer):{Colors.RESET}")
    print(f"  {Colors.DIM}  softmax(xi) = e^xi / Σ(e^xj){Colors.RESET}")
    print(f"  {Colors.DIM}  Output range: (0, 1) per class, sum = 1.0{Colors.RESET}")
    print(f"  {Colors.DIM}  Used for: final classification / probability distribution{Colors.RESET}\n")
    
    print(f"    {Colors.MAGENTA}  Raw scores:    [2.0,  1.0,  0.5,  0.1]{Colors.RESET}")
    print(f"    {Colors.DIM}       ↓ softmax{Colors.RESET}")
    print(f"    {Colors.MAGENTA}  Probabilities: [0.57, 0.21, 0.13, 0.09]  "
          f"{Colors.DIM}(sum ≈ 1.0){Colors.RESET}\n")


# ============================================================================
# MAIN EXECUTION
# ============================================================================

if __name__ == "__main__":
    # Main execution block.
    #
    # WHY: This pattern ensures the demo code only runs when this file
    # is executed directly, not when it's imported by another file.
    
    # Set random seed for reproducibility
    # WHY: Same seed = same random numbers = same output every time
    # This is important for teaching — students get the same results
    np.random.seed(42)
    
    # Print header
    print_header()
    
    # Create the neural network
    # WHY these sizes:
    # - 8 inputs: a reasonable small feature vector
    # - 16 hidden: enough neurons to learn patterns, few enough to display
    # - 4 outputs: simulates a 4-class classification problem
    input_size = 8
    hidden_size = 16
    output_size = 4
    
    print(f"  {Colors.WHITE}Creating a neural network:{Colors.RESET}")
    print(f"  {Colors.CYAN}  Input:  {input_size} neurons  (raw features){Colors.RESET}")
    print(f"  {Colors.CYAN}  Hidden: {hidden_size} neurons (sigmoid activation){Colors.RESET}")
    print(f"  {Colors.CYAN}  Output: {output_size} neurons  (softmax → probabilities){Colors.RESET}\n")
    
    net = NeuralNetwork(input_size, hidden_size, output_size)
    
    # Show network architecture
    print_network_architecture(net)
    
    # Show weight statistics
    print_weight_statistics(net)
    
    # Run a forward pass demo
    demo_forward_pass(net)
    
    # Show multiple inputs
    demo_multiple_inputs(net)
    
    # Explain concepts
    explain_concepts()
    
    # Print footer
    print_footer()
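
The softmax docstring in step2_network.py states that subtracting the maximum leaves the result unchanged. Here is a minimal sketch that checks this numerically, reusing the raw scores [2.0, 1.0, 0.5, 0.1] from the script's illustration. The standalone function `stable_softmax` and its `axis=-1` handling of batched rows are assumptions added for illustration; they are not part of the original class.

```python
import numpy as np


def stable_softmax(scores):
    """Softmax over the last axis (hypothetical helper, not from the script)."""
    scores = np.asarray(scores, dtype=float)
    # Shift so the largest score in each row becomes 0; exp() can no longer overflow
    shifted = scores - np.max(scores, axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)


raw_scores = np.array([2.0, 1.0, 0.5, 0.1])
probs = stable_softmax(raw_scores)
print(np.round(probs, 2))                  # [0.57 0.21 0.13 0.09]

# Adding the same constant to every score should not change the probabilities
shifted_probs = stable_softmax(raw_scores + 1000.0)
print(np.allclose(probs, shifted_probs))   # True

# The same function also normalizes each row of a batch independently
batch = np.array([[2.0, 1.0, 0.5, 0.1],
                  [0.0, 0.0, 0.0, 0.0]])
batch_probs = stable_softmax(batch)
print(batch_probs.sum(axis=1))             # each row sums to 1
```

The `keepdims=True` arguments are what let one function handle both a single score vector and a batch of rows. The class method takes the maximum and the sum over the whole array, so it is only meant for one input at a time.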

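The NeuralNetwork docstring notes that stacked linear layers collapse into a single linear layer. The following is a small numerical check of that claim, offered as a sketch. The layer sizes mirror the demo (8, 16, 4), while the variable names and the random seed are chosen here purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed picked arbitrarily for this sketch

x = rng.random(8)                      # one input vector with 8 features
W1 = rng.standard_normal((8, 16))      # input -> hidden weights
b1 = rng.standard_normal(16)           # hidden biases
W2 = rng.standard_normal((16, 4))      # hidden -> output weights
b2 = rng.standard_normal(4)            # output biases

# Two layers with NO activation function in between
two_layer = (x @ W1 + b1) @ W2 + b2

# The same computation folded into a single linear layer
W_combined = W1 @ W2                   # shape (8, 4)
b_combined = b1 @ W2 + b2              # shape (4,)
one_layer = x @ W_combined + b_combined

print(np.allclose(two_layer, one_layer))   # True


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# With sigmoid between the layers, the folded single layer no longer
# reproduces the output in general
with_sigmoid = sigmoid(x @ W1 + b1) @ W2 + b2
print(np.allclose(with_sigmoid, one_layer))
```

The folding works because `(x @ W1 + b1) @ W2 + b2` expands to `x @ (W1 @ W2) + (b1 @ W2 + b2)`. Once a sigmoid sits between the two multiplications, that expansion is no longer valid, which is why the hidden layer needs an activation.
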
Complete Code: step3_train.py

Python
"""
================================================================================
🧠 LEVEL 2 — STEP 3: TRAINING A NEURAL NETWORK
================================================================================
Train a character-level neural network to predict the next character!

This script:
  1. Takes a sample text and creates training data (character pairs)
  2. One-hot encodes characters (binary representation)
  3. Builds a neural network with backpropagation FROM SCRATCH
  4. Trains the network to predict: given a character, what comes next?
  5. Generates text using the trained network

NO DEEP LEARNING FRAMEWORKS — everything from scratch with NumPy!
Training completes in under 60 seconds.
================================================================================
"""

# ============================================================================
# IMPORTS
# ============================================================================
import numpy as np   # For fast math operations on arrays
import os            # For file path operations (saving training history)
import json          # For saving/loading training history
import time          # For measuring training duration
import sys           # For system operations


# ============================================================================
# ANSI COLOR CODES
# ============================================================================
class Colors:
    """ANSI color codes for beautiful terminal output."""
    RESET   = "\033[0m"
    BOLD    = "\033[1m"
    DIM     = "\033[2m"
    
    RED     = "\033[31m"
    GREEN   = "\033[32m"
    YELLOW  = "\033[33m"
    BLUE    = "\033[34m"
    MAGENTA = "\033[35m"
    CYAN    = "\033[36m"
    WHITE   = "\033[37m"
    
    BRIGHT_RED     = "\033[91m"
    BRIGHT_GREEN   = "\033[92m"
    BRIGHT_YELLOW  = "\033[93m"
    BRIGHT_BLUE    = "\033[94m"
    BRIGHT_MAGENTA = "\033[95m"
    BRIGHT_CYAN    = "\033[96m"


# ============================================================================
# HELPER FUNCTIONS
# ============================================================================

def print_header():
    """Print a beautiful header for this script."""
    print(f"\n{Colors.BRIGHT_BLUE}{'='*70}")
    print(f"  🧠  LEVEL 2 — STEP 3: TRAINING A NEURAL NETWORK")
    print(f"{'='*70}{Colors.RESET}")
    print(f"{Colors.DIM}  Character-level text generation with backpropagation!{Colors.RESET}")
    print(f"{Colors.DIM}  Everything from scratch — no frameworks!{Colors.RESET}\n")


def print_footer():
    """Print a beautiful footer."""
    print(f"\n{Colors.BRIGHT_BLUE}{'='*70}")
    print(f"  ✅  STEP 3 COMPLETE! The network learned to generate text!")
    print(f"  📝  Next: step4_visualize.py — Visualize the training!")
    print(f"{'='*70}{Colors.RESET}\n")


def print_section(title, emoji="📌"):
    """Print a section header."""
    print(f"\n{Colors.BRIGHT_YELLOW}{'─'*70}")
    print(f"  {emoji}  {title}")
    print(f"{'─'*70}{Colors.RESET}\n")


# ============================================================================
# TRAINING DATA PREPARATION
# ============================================================================

# WHY this text: A short, meaningful paragraph with enough variety of characters
# to learn patterns. We keep it short so training finishes quickly.
# The text has repeated patterns (common English letter sequences) that the
# network can learn to reproduce.
SAMPLE_TEXT = (
    "the quick brown fox jumps over the lazy dog. "
    "a neural network learns from data. "
    "the brain has billions of neurons. "
    "each neuron connects to many others. "
    "learning happens when connections change. "
    "the network adjusts weights to reduce error. "
    "practice makes perfect in learning. "
    "data is the fuel for machine learning. "
)


def prepare_data(text):
    """
    Prepare character-level training data from text.
    
    Steps:
    1. Find all unique characters in the text (vocabulary)
    2. Create mappings: character → index and index → character
    3. Create training pairs: (current_char, next_char)
    4. One-hot encode everything
    
    WHY character-level?
    - Simplest form of text generation
    - No need for tokenizers or word dictionaries
    - Shows the core concept: predict what comes next
    
    WHY one-hot encoding?
    - Neural networks work with NUMBERS, not characters
    - One-hot = a vector of 0s with a single 1 at the character's index
    - Example: if vocab = ['a', 'b', 'c'], then 'b' = [0, 1, 0]
    - This treats each character as equally different from every other
    """
    # Step 1: Get unique characters and sort them
    # WHY sort: Ensures consistent ordering across runs
    chars = sorted(set(text))
    vocab_size = len(chars)
    
    # Step 2: Create mappings
    # WHY two mappings: We need to go both ways
    # char_to_idx: 'a' → 0, 'b' → 1, etc. (for encoding input)
    # idx_to_char: 0 → 'a', 1 → 'b', etc. (for decoding output)
    char_to_idx = {ch: i for i, ch in enumerate(chars)}
    idx_to_char = {i: ch for i, ch in enumerate(chars)}
    
    # Step 3: Create training pairs
    # WHY pairs: We train the network to predict: "given THIS character, 
    # what character comes NEXT?"
    # Example: "hello" → [('h','e'), ('e','l'), ('l','l'), ('l','o')]
    input_indices = []
    target_indices = []
    
    for i in range(len(text) - 1):
        input_indices.append(char_to_idx[text[i]])
        target_indices.append(char_to_idx[text[i + 1]])
    
    # Step 4: One-hot encode
    # WHY one-hot: Each character becomes a binary vector
    # This lets the network treat each character as a separate "category"
    num_samples = len(input_indices)
    inputs_onehot = np.zeros((num_samples, vocab_size))
    targets_onehot = np.zeros((num_samples, vocab_size))
    
    for i in range(num_samples):
        inputs_onehot[i, input_indices[i]] = 1.0
        targets_onehot[i, target_indices[i]] = 1.0
    
    return (inputs_onehot, targets_onehot, input_indices, target_indices,
            chars, char_to_idx, idx_to_char, vocab_size)


# ============================================================================
# NEURAL NETWORK CLASS (WITH BACKPROPAGATION)
# ============================================================================

class CharLevelNetwork:
    """
    A neural network that learns to predict the next character.
    
    Architecture:
        Input (vocab_size) → Hidden (64 neurons, sigmoid) → Output (vocab_size, softmax)
    
    This class implements:
    - Forward pass (making predictions)
    - Backward pass (learning from mistakes) — BACKPROPAGATION!
    - Weight updates (gradient descent)
    - Text generation (using learned patterns)
    
    WHY 64 hidden neurons?
    - Enough to learn character-level patterns in our small text
    - Not so many that training is slow
    - A good balance for educational purposes
    """
    
    def __init__(self, vocab_size, hidden_size=64, learning_rate=0.5):
        """
        Initialize the network with random weights.
        
        WHY Xavier initialization?
        - Prevents gradients from exploding or vanishing
        - Keeps values in a reasonable range during forward pass
        """
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.learning_rate = learning_rate
        
        # ── Weights: Input → Hidden ──
        # Shape: (vocab_size, hidden_size)
        # Each input character connects to every hidden neuron
        scale_1 = np.sqrt(2.0 / (vocab_size + hidden_size))
        self.W1 = np.random.randn(vocab_size, hidden_size) * scale_1
        self.b1 = np.zeros(hidden_size)
        
        # ── Weights: Hidden → Output ──
        # Shape: (hidden_size, vocab_size)
        # Each hidden neuron connects to every output character
        scale_2 = np.sqrt(2.0 / (hidden_size + vocab_size))
        self.W2 = np.random.randn(hidden_size, vocab_size) * scale_2
        self.b2 = np.zeros(vocab_size)
    
    def sigmoid(self, x):
        """
        Sigmoid activation: σ(x) = 1 / (1 + e^(-x))
        Squashes values to (0, 1) range.
        """
        x = np.clip(x, -500, 500)
        return 1.0 / (1.0 + np.exp(-x))
    
    def sigmoid_derivative(self, sigmoid_output):
        """
        Derivative of sigmoid: σ'(x) = σ(x) × (1 - σ(x))
        
        WHY we need this:
        - Backpropagation requires the derivative of each function
        - The derivative tells us "how sensitive is the output to input changes"
        - This is used in the chain rule to compute gradients
        """
        return sigmoid_output * (1.0 - sigmoid_output)
    
    def softmax(self, x):
        """
        Softmax: converts raw scores to probabilities that sum to 1.
        """
        x_shifted = x - np.max(x)  # Numerical stability
        exp_x = np.exp(x_shifted)
        return exp_x / np.sum(exp_x)
    
    def forward(self, x):
        """
        Forward pass: Input → Hidden → Output.
        
        Saves intermediate values for backpropagation.
        
        WHY save intermediate values?
        - During backward pass, we need to know what happened at each step
        - The chain rule requires values from the forward pass
        - Think of it as keeping your rough work so the teacher can check it
        """
        # ── Layer 1: Input → Hidden ──
        # z1 = x · W1 + b1 (weighted sum)
        self.x = x  # Save input for backward pass
        self.z1 = np.dot(x, self.W1) + self.b1  # Raw hidden values
        self.a1 = self.sigmoid(self.z1)  # Activated hidden values
        
        # ── Layer 2: Hidden → Output ──
        # z2 = a1 · W2 + b2 (weighted sum)
        self.z2 = np.dot(self.a1, self.W2) + self.b2  # Raw output values
        self.a2 = self.softmax(self.z2)  # Output probabilities
        
        return self.a2  # Return predicted probabilities
    
    def compute_loss(self, predicted, target):
        """
        Cross-entropy loss: measures how wrong our prediction is.
        
        Formula: Loss = -Σ target_i × log(predicted_i)
        
        WHY cross-entropy?
        - Perfect for classification (predicting categories)
        - Heavily penalizes confident WRONG predictions
        - If we predict the right character with high probability → low loss
        - If we predict the wrong character → high loss
        
        WHY clip predicted values?
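        - log(0) = -infinity, so the loss becomes inf or NaN (NumPy emits a
          RuntimeWarning rather than crashing) and training stops being useful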
        - Clipping to [1e-15, 1] prevents this numerical issue
        """
        # Clip to prevent log(0) which is undefined
        predicted_clipped = np.clip(predicted, 1e-15, 1.0)
        
        # Cross-entropy loss
        loss = -np.sum(target * np.log(predicted_clipped))
        return loss
    
    def backward(self, target):
        """
        Backward pass: compute gradients using the chain rule, then update the weights.
        
        THIS IS BACKPROPAGATION — the core of neural network learning!
        
        The chain rule in action:
        
        For output layer (Layer 2):
            ∂Loss/∂W2 = a1ᵀ · (predicted - target)
            ∂Loss/∂b2 = predicted - target
        
        For hidden layer (Layer 1):
            ∂Loss/∂W1 = xᵀ · (δ_hidden)
            ∂Loss/∂b1 = δ_hidden
            where δ_hidden = ((predicted - target) · W2ᵀ) × σ'(z1)
            and σ'(z1) = a1 × (1 - a1), computed from the saved activation a1
        
        WHY this math?
        - We want to know: "How much did each weight contribute to the error?"
        - The chain rule lets us trace the error backwards through the network
        - Each weight gets a gradient: "move this direction to reduce error"
        
        Analogy:
        - A teacher marking an exam traces back through each step
        - "You got the final answer wrong BECAUSE you made an error in step 3"
        - The gradient tells each weight: "you were responsible for THIS much error"
        """
        # ── Output Layer Gradient ──
        # The derivative of softmax + cross-entropy simplifies beautifully!
        # δ_output = predicted - target
        # WHY so simple: This is one of the beautiful mathematical properties
        # of combining softmax with cross-entropy loss
        delta_output = self.a2 - target  # Shape: (vocab_size,)
        
        # Gradient for W2: how much each hidden→output weight contributed to error
        # WHY outer product: We need gradient for EVERY weight in the matrix
        # self.a1.reshape(-1, 1) × delta_output.reshape(1, -1) gives us the matrix
        dW2 = np.outer(self.a1, delta_output)  # Shape: (hidden_size, vocab_size)
        db2 = delta_output  # Shape: (vocab_size,)
        
        # ── Hidden Layer Gradient ──
        # Step 1: Propagate error back through W2
        # WHY dot with W2.T: We're tracing the error back through the connections
        # Each hidden neuron receives error proportional to its connection weight
        delta_hidden = np.dot(delta_output, self.W2.T)  # Shape: (hidden_size,)
        
        # Step 2: Multiply by sigmoid derivative (chain rule!)
        # WHY: The sigmoid "squashed" the values during forward pass
        # We need to account for this squashing when computing gradients
        delta_hidden *= self.sigmoid_derivative(self.a1)
        
        # Gradient for W1
        dW1 = np.outer(self.x, delta_hidden)  # Shape: (vocab_size, hidden_size)
        db1 = delta_hidden  # Shape: (hidden_size,)
        
        # ── Update Weights (Gradient Descent) ──
        # WHY subtract: We move OPPOSITE to the gradient direction
        # Gradient points toward INCREASING loss
        # We want to DECREASE loss, so we go in the opposite direction
        # learning_rate controls how big each step is
        self.W2 -= self.learning_rate * dW2
        self.b2 -= self.learning_rate * db2
        self.W1 -= self.learning_rate * dW1
        self.b1 -= self.learning_rate * db1
    
    def generate(self, start_char_idx, length, idx_to_char, temperature=1.0):
        """
        Generate text character by character.
        
        Process:
        1. Start with a character
        2. Feed it through the network → get probabilities for next character
        3. Sample from those probabilities → get next character
        4. Feed THAT character back in → repeat
        
        WHY temperature?
        - Controls how "creative" vs "predictable" the output is
        - temperature = 1.0: normal sampling
        - temperature < 1.0: more conservative (picks most likely characters)
        - temperature > 1.0: more random/creative
        - Think of it as: low temperature = a cautious student, 
                          high temperature = a creative student
        """
        generated = []
        current_idx = start_char_idx
        
        for _ in range(length):
            # Create one-hot input for current character
            x = np.zeros(self.vocab_size)
            x[current_idx] = 1.0
            
            # Forward pass
            probs = self.forward(x)
            
            # Apply temperature scaling
            # WHY: Adjusts the "sharpness" of the probability distribution
            if temperature != 1.0:
                log_probs = np.log(np.clip(probs, 1e-15, 1.0)) / temperature
                log_probs -= np.max(log_probs)
                probs = np.exp(log_probs)
                probs = probs / np.sum(probs)
            
            # Sample from probability distribution
            # WHY sample instead of argmax: Adds variety to generated text
            # If we always pick the most likely character, output is repetitive
            current_idx = np.random.choice(len(probs), p=probs)
            generated.append(idx_to_char[current_idx])
        
        return ''.join(generated)


# ============================================================================
# TRAINING FUNCTION
# ============================================================================

def train_network(text, epochs=1500, print_every=100, sample_every=500):
    """
    Train the character-level network on the given text.
    
    Parameters:
        text:         The training text
        epochs:       Number of complete passes through the data
        print_every:  Print loss every N epochs
        sample_every: Generate sample text every N epochs
    
    Returns:
        net:          The trained network
        history:      Training history (losses, samples, etc.)
    """
    print_section("PREPARING TRAINING DATA", "📦")
    
    # Prepare data
    (inputs, targets, input_indices, target_indices,
     chars, char_to_idx, idx_to_char, vocab_size) = prepare_data(text)
    
    num_samples = len(input_indices)
    
    print(f"  {Colors.WHITE}Training Text:{Colors.RESET}")
    # Print text with word wrapping
    for i in range(0, len(text), 60):
        print(f"  {Colors.CYAN}  \"{text[i:i+60]}\"{Colors.RESET}")
    
    print(f"\n  {Colors.BRIGHT_GREEN}Data Statistics:{Colors.RESET}")
    print(f"  {Colors.GREEN}  • Text length:     {len(text)} characters{Colors.RESET}")
    print(f"  {Colors.GREEN}  • Vocabulary size:  {vocab_size} unique characters{Colors.RESET}")
    print(f"  {Colors.GREEN}  • Training pairs:   {num_samples}{Colors.RESET}")
    print(f"  {Colors.GREEN}  • Characters:       {repr(''.join(chars))}{Colors.RESET}")
    
    print(f"\n  {Colors.BRIGHT_YELLOW}One-Hot Encoding Example:{Colors.RESET}")
    example_char = 'a'
    if example_char in char_to_idx:
        idx = char_to_idx[example_char]
        onehot = np.zeros(vocab_size)
        onehot[idx] = 1.0
        # Show just first 15 values to keep it readable
        display_len = min(15, vocab_size)
        print(f"  {Colors.YELLOW}  '{example_char}' → index {idx} → "
              f"[{', '.join(f'{int(v)}' for v in onehot[:display_len])}{'...' if vocab_size > display_len else ''}]{Colors.RESET}")
        print(f"  {Colors.DIM}  (A vector of {vocab_size} numbers — all zeros except "
              f"position {idx} which is 1){Colors.RESET}")
    
    # ── Create Network ──
    print_section("CREATING NETWORK", "🏗️")
    
    # WHY learning_rate=0.5: A reasonable starting value for this problem
    # Too high → training is unstable (overshoots)
    # Too low → training is too slow (doesn't converge in time)
    # WHY a hidden_size variable: the printout and parameter count below
    # stay correct if you experiment with a different hidden layer size
    hidden_size = 64
    net = CharLevelNetwork(vocab_size=vocab_size, hidden_size=hidden_size, learning_rate=0.5)
    
    print(f"  {Colors.BRIGHT_CYAN}Network Architecture:{Colors.RESET}")
    print(f"  {Colors.CYAN}  Input:  {vocab_size} neurons (one per character){Colors.RESET}")
    print(f"  {Colors.CYAN}  Hidden: {hidden_size} neurons (sigmoid activation){Colors.RESET}")
    print(f"  {Colors.CYAN}  Output: {vocab_size} neurons (softmax → probabilities){Colors.RESET}")
    
    # Parameters = W1 + b1 + W2 + b2
    total_params = (vocab_size * hidden_size + hidden_size
                    + hidden_size * vocab_size + vocab_size)
    print(f"  {Colors.YELLOW}  Total parameters: {total_params}{Colors.RESET}")
    
    # ── Generate text BEFORE training ──
    print_section("BEFORE TRAINING — Random Output", "🎲")
    
    start_idx = char_to_idx.get('t', 0)  # Start with 't'
    before_text = net.generate(start_idx, 100, idx_to_char, temperature=0.8)
    print(f"  {Colors.RED}Generated text (untrained network):{Colors.RESET}")
    print(f"  {Colors.BRIGHT_RED}  \"{before_text}\"{Colors.RESET}")
    print(f"\n  {Colors.DIM}  ^ This is gibberish because the weights are random!{Colors.RESET}")
    
    # ── Training Loop ──
    print_section("TRAINING", "🎓")
    
    print(f"  {Colors.WHITE}Training for {epochs} epochs...{Colors.RESET}")
    print(f"  {Colors.DIM}  (Each epoch = one pass through ALL training pairs){Colors.RESET}\n")
    
    # Track training history
    history = {
        "losses": [],
        "steps": [],
        "samples": [],
        "before_training": before_text,
        "after_training": ""  # Will be filled after training
    }
    
    start_time = time.time()
    
    # Header for training progress table
    print(f"  {Colors.WHITE}{'Epoch':>7} │ {'Loss':>10} │ {'Progress Bar':^25} │ {'Time':>6}{Colors.RESET}")
    print(f"  {Colors.DIM}{'─'*7}─┼─{'─'*10}─┼─{'─'*25}─┼─{'─'*6}{Colors.RESET}")
    
    for epoch in range(epochs):
        total_loss = 0.0
        
        # Shuffle training data each epoch
        # WHY shuffle: Prevents the network from learning the ORDER of examples
        # instead of the actual patterns. Randomizing improves generalization.
        shuffle_idx = np.random.permutation(num_samples)
        
        # ── Mini training loop ──
        # Process each training pair
        for i in shuffle_idx:
            # Forward pass: predict next character
            predicted = net.forward(inputs[i])
            
            # Compute loss: how wrong is the prediction?
            loss = net.compute_loss(predicted, targets[i])
            total_loss += loss
            
            # Backward pass: compute gradients and update weights
            net.backward(targets[i])
        
        # Average loss for this epoch
        avg_loss = total_loss / num_samples
        
        # Record history
        if epoch % print_every == 0 or epoch == epochs - 1:
            elapsed = time.time() - start_time
            history["losses"].append(float(avg_loss))
            history["steps"].append(epoch)
            
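            # Create progress bar
            # WHY epoch + 1: epoch is 0-based, so epoch + 1 epochs are done.
            # This lets the bar fill completely on the final epoch.
            progress = (epoch + 1) / epochs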
            bar_len = 20
            filled = int(bar_len * progress)
            bar = "█" * filled + "░" * (bar_len - filled)
            
            # Color based on loss level
            if avg_loss < 1.0:
                loss_color = Colors.BRIGHT_GREEN
            elif avg_loss < 2.0:
                loss_color = Colors.GREEN
            elif avg_loss < 3.0:
                loss_color = Colors.YELLOW
            else:
                loss_color = Colors.RED
            
            print(f"  {Colors.CYAN}{epoch:>7}{Colors.RESET} │ "
                  f"{loss_color}{avg_loss:>10.4f}{Colors.RESET} │ "
                  f"{Colors.BLUE}[{bar}]{Colors.RESET} {progress:>4.0%} │ "
                  f"{Colors.DIM}{elapsed:>5.1f}s{Colors.RESET}")
        
        # Generate sample text periodically
        if epoch % sample_every == 0 and epoch > 0:
            sample_text = net.generate(start_idx, 60, idx_to_char, temperature=0.8)
            history["samples"].append({"step": epoch, "text": sample_text})
            print(f"  {Colors.BRIGHT_MAGENTA}  ↳ Sample: \"{sample_text[:50]}...\"{Colors.RESET}")
    
    elapsed_total = time.time() - start_time
    
    # ── Generate text AFTER training ──
    print_section("AFTER TRAINING — Learned Output", "✨")
    
    after_text = net.generate(start_idx, 150, idx_to_char, temperature=0.8)
    history["after_training"] = after_text
    
    print(f"  {Colors.BRIGHT_GREEN}Generated text (trained network):{Colors.RESET}")
    for i in range(0, len(after_text), 60):
        print(f"  {Colors.GREEN}  \"{after_text[i:i+60]}\"{Colors.RESET}")
    
    # ── Comparison ──
    print_section("COMPARISON: Before vs After", "🔄")
    
    print(f"  {Colors.BRIGHT_RED}BEFORE (random weights — gibberish):{Colors.RESET}")
    print(f"  {Colors.RED}  \"{before_text[:80]}\"{Colors.RESET}\n")
    
    print(f"  {Colors.BRIGHT_GREEN}AFTER ({epochs} epochs of training):{Colors.RESET}")
    print(f"  {Colors.GREEN}  \"{after_text[:80]}\"{Colors.RESET}\n")
    
    print(f"  {Colors.BRIGHT_YELLOW}📊 Training Statistics:{Colors.RESET}")
    print(f"  {Colors.YELLOW}  • Total time:       {elapsed_total:.1f} seconds{Colors.RESET}")
    print(f"  {Colors.YELLOW}  • Final loss:       {history['losses'][-1]:.4f}{Colors.RESET}")
    print(f"  {Colors.YELLOW}  • Starting loss:    {history['losses'][0]:.4f}{Colors.RESET}")
    improvement = ((history['losses'][0] - history['losses'][-1]) / 
                   history['losses'][0] * 100)
    print(f"  {Colors.YELLOW}  • Loss reduction:   {improvement:.1f}%{Colors.RESET}")
    
    # ── Save Training History ──
    # WHY save to JSON: So step4_visualize.py can load it and create plots
    # JSON is human-readable and easy to parse
    script_dir = os.path.dirname(os.path.abspath(__file__))
    history_path = os.path.join(script_dir, "training_history.json")
    
    with open(history_path, 'w') as f:
        json.dump(history, f, indent=2)
    
    print(f"\n  {Colors.BRIGHT_CYAN}💾 Training history saved to:{Colors.RESET}")
    print(f"  {Colors.CYAN}  {history_path}{Colors.RESET}")
    
    return net, history


# ============================================================================
# MAIN EXECUTION
# ============================================================================

if __name__ == "__main__":
    # Main execution: train the character-level network and generate text.
    #
    # WHY __name__ == '__main__':
    # - Only runs when this file is executed directly
    # - Allows importing the CharLevelNetwork class without running training
    
    # Print header
    print_header()
    
    # Set random seed for reproducibility
    # WHY: Same seed = same results = students can verify their output matches
    np.random.seed(42)
    
    # Train the network!
    # epochs=1500 keeps training under 60 seconds on most machines
    net, history = train_network(
        text=SAMPLE_TEXT,
        epochs=1500,
        print_every=100,
        sample_every=500
    )
    
    # Generate a few more samples to show variety
    print_section("BONUS: Multiple Generated Samples", "🎲")
    
    # Rebuild data to get char mappings
    chars = sorted(list(set(SAMPLE_TEXT)))
    char_to_idx = {ch: i for i, ch in enumerate(chars)}
    idx_to_char = {i: ch for i, ch in enumerate(chars)}
    
    # Generate with different starting characters
    start_chars = ['t', 'a', 'n', 'l', 'd']
    for start_char in start_chars:
        if start_char in char_to_idx:
            start_idx = char_to_idx[start_char]
            generated = net.generate(start_idx, 80, idx_to_char, temperature=0.7)
            print(f"  {Colors.CYAN}Starting with '{start_char}':{Colors.RESET}")
            print(f"  {Colors.GREEN}  \"{generated}\"{Colors.RESET}\n")
    
    # Print footer
    print_footer()
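
The `backward` method above relies on the claim that the gradient of softmax followed by cross-entropy simplifies to `predicted - target`. The sketch below is not part of step3_train.py; it is a standalone check that compares that formula against a finite-difference estimate. The helper names (`softmax`, `cross_entropy`, `numerical_gradient`) and the example scores are assumptions made for illustration, and it uses only the standard library so it runs without NumPy.

```python
import math

def softmax(scores):
    """Convert raw scores to probabilities (shifted for numerical stability)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, target_idx):
    """Loss for a one-hot target: -log(probability of the correct class)."""
    return -math.log(softmax(scores)[target_idx])

def numerical_gradient(scores, target_idx, eps=1e-6):
    """Central-difference estimate of dLoss/dscore for each raw score."""
    grad = []
    for k in range(len(scores)):
        up = list(scores)
        up[k] += eps
        down = list(scores)
        down[k] -= eps
        diff = cross_entropy(up, target_idx) - cross_entropy(down, target_idx)
        grad.append(diff / (2 * eps))
    return grad

# Arbitrary raw outputs (the role z2 plays in the listing) and a target index
scores = [0.5, -1.2, 2.0, 0.1]
target_idx = 2

# Analytic gradient: predicted - target, with target as a one-hot vector
probs = softmax(scores)
analytic = [p - (1.0 if k == target_idx else 0.0) for k, p in enumerate(probs)]

# Numerical gradient: nudge each score and measure how the loss changes
numeric = numerical_gradient(scores, target_idx)

max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
print(f"largest difference between the two gradients: {max_diff:.2e}")
```

If the two gradients agree to many decimal places, the shortcut used for `delta_output` is doing what the docstring says. The same nudge-and-measure idea can be applied to any weight in the network when debugging a hand-written backward pass.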

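The temperature step inside `generate` is easier to see on a tiny distribution. The sketch below mirrors that step (divide the log-probabilities by the temperature, then renormalize) in plain Python. The function name `apply_temperature` and the three-value distribution are assumptions chosen for illustration, not part of the project files.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a distribution: divide log-probabilities by T, then renormalize."""
    logs = [math.log(max(p, 1e-15)) / temperature for p in probs]
    peak = max(logs)                      # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logs]
    total = sum(exps)
    return [e / total for e in exps]

probs = [0.6, 0.3, 0.1]                   # hypothetical next-character probabilities

cool = apply_temperature(probs, 0.5)      # T < 1: sharper, favors the top choice
same = apply_temperature(probs, 1.0)      # T = 1: distribution is unchanged
warm = apply_temperature(probs, 2.0)      # T > 1: flatter, more random

for name, dist in [("T=0.5", cool), ("T=1.0", same), ("T=2.0", warm)]:
    print(name, [round(p, 3) for p in dist])
```

Dividing log-probabilities by T is the same as raising each probability to the power 1/T before renormalizing, which is why a low temperature exaggerates the gap between likely and unlikely characters and a high temperature shrinks it.
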
Complete Code: step4_visualize.py

Python
"""
================================================================================
🧠 LEVEL 2 — STEP 4: VISUALIZING TRAINING
================================================================================
Visualize the training results from step3_train.py!

This script:
  1. Loads training_history.json (saved by step3_train.py)
  2. Plots the training loss curve with matplotlib
  3. Shows sample generated text at different training stages
  4. Saves the plot as training_results.png
  5. Also prints an ASCII loss curve for terminals without display
  6. Compares Before vs After training quality

Requires: numpy (matplotlib is optional; without it only the ASCII plot is shown)
Run step3_train.py FIRST to generate the training history!
================================================================================
"""

# ============================================================================
# IMPORTS
# ============================================================================
import numpy as np      # For numerical operations
import os               # For file path operations
import json             # For loading training history
import sys              # For system operations

# We import matplotlib in a try/except block because it may not be
# installed; in that case we fall back to the ASCII plot only.
# WHY Agg backend: It renders to files without needing a display
# (useful on remote servers with no screen attached).
try:
    import matplotlib
    matplotlib.use('Agg')  # Use non-interactive backend (saves to file)
    import matplotlib.pyplot as plt
    from matplotlib.gridspec import GridSpec
    HAS_MATPLOTLIB = True
except ImportError:
    HAS_MATPLOTLIB = False
    print("⚠️  matplotlib not installed. Only ASCII visualization available.")


# ============================================================================
# ANSI COLOR CODES
# ============================================================================
class Colors:
    """ANSI color codes for beautiful terminal output."""
    RESET   = "\033[0m"
    BOLD    = "\033[1m"
    DIM     = "\033[2m"
    
    RED     = "\033[31m"
    GREEN   = "\033[32m"
    YELLOW  = "\033[33m"
    BLUE    = "\033[34m"
    MAGENTA = "\033[35m"
    CYAN    = "\033[36m"
    WHITE   = "\033[37m"
    
    BRIGHT_RED     = "\033[91m"
    BRIGHT_GREEN   = "\033[92m"
    BRIGHT_YELLOW  = "\033[93m"
    BRIGHT_BLUE    = "\033[94m"
    BRIGHT_MAGENTA = "\033[95m"
    BRIGHT_CYAN    = "\033[96m"


# ============================================================================
# HELPER FUNCTIONS
# ============================================================================

def print_header():
    """Print a beautiful header for this script."""
    print(f"\n{Colors.BRIGHT_GREEN}{'='*70}")
    print(f"  📊  LEVEL 2 — STEP 4: VISUALIZING TRAINING RESULTS")
    print(f"{'='*70}{Colors.RESET}")
    print(f"{Colors.DIM}  Plotting loss curves and comparing output quality!{Colors.RESET}\n")


def print_footer():
    """Print a beautiful footer."""
    print(f"\n{Colors.BRIGHT_GREEN}{'='*70}")
    print(f"  ✅  STEP 4 COMPLETE! Training visualization done!")
    print(f"  🎉  Level 2 is complete — you built a neural network from scratch!")
    print(f"{'='*70}{Colors.RESET}\n")


def print_section(title, emoji="📌"):
    """Print a section header."""
    print(f"\n{Colors.BRIGHT_YELLOW}{'─'*70}")
    print(f"  {emoji}  {title}")
    print(f"{'─'*70}{Colors.RESET}\n")


# ============================================================================
# LOAD TRAINING HISTORY
# ============================================================================

def load_training_history():
    """
    Load the training history saved by step3_train.py.
    
    WHY JSON?
    - Human-readable format (you can open it in any text editor)
    - Easy to parse in any programming language
    - Standard format for data exchange
    
    Returns:
        dict with keys: losses, steps, samples, before_training, after_training
    """
    # Get the directory where THIS script is located
    # WHY: We want to find training_history.json in the SAME directory
    # This works regardless of where the script is run FROM
    script_dir = os.path.dirname(os.path.abspath(__file__))
    history_path = os.path.join(script_dir, "training_history.json")
    
    if not os.path.exists(history_path):
        print(f"  {Colors.BRIGHT_RED}❌ Error: training_history.json not found!{Colors.RESET}")
        print(f"  {Colors.RED}  Please run step3_train.py first.{Colors.RESET}")
        print(f"  {Colors.DIM}  Expected path: {history_path}{Colors.RESET}")
        sys.exit(1)
    
    with open(history_path, 'r') as f:
        history = json.load(f)
    
    print(f"  {Colors.BRIGHT_GREEN}✓ Loaded training history from:{Colors.RESET}")
    print(f"  {Colors.CYAN}  {history_path}{Colors.RESET}\n")
    print(f"  {Colors.WHITE}History contents:{Colors.RESET}")
    print(f"  {Colors.CYAN}  • {len(history['losses'])} loss data points{Colors.RESET}")
    print(f"  {Colors.CYAN}  • {len(history['steps'])} step markers{Colors.RESET}")
    print(f"  {Colors.CYAN}  • {len(history['samples'])} sample generations{Colors.RESET}")
    
    return history


# ============================================================================
# MATPLOTLIB VISUALIZATION
# ============================================================================

def create_matplotlib_plot(history):
    """
    Create a beautiful matplotlib plot of the training results.
    
    Plot contains:
    1. Top panel: Training loss curve (the main visualization)
    2. Bottom panel: Sample generated text at different stages
    
    WHY matplotlib?
    - Industry standard for scientific plotting in Python
    - Produces publication-quality figures
    - Highly customizable
    """
    if not HAS_MATPLOTLIB:
        print(f"  {Colors.YELLOW}⚠️  matplotlib not available. "
              f"Skipping graphical plot.{Colors.RESET}")
        return
    
    print_section("MATPLOTLIB VISUALIZATION", "📈")
    
    # ── Set up the figure ──
    # WHY GridSpec: Gives us precise control over subplot layout
    # figsize=(12, 8): 12 inches wide, 8 inches tall
    fig = plt.figure(figsize=(12, 8))
    
    # Use a dark background for modern look
    # WHY dark: Looks professional and is easier on the eyes
    fig.patch.set_facecolor('#1a1a2e')
    
    # Create grid: top panel (loss curve) takes 60%, bottom (text samples) takes 40%
    gs = GridSpec(2, 1, height_ratios=[3, 2], hspace=0.35)
    
    # ══════════════════════════════════════════════════════════════════
    # TOP PANEL: Training Loss Curve
    # ══════════════════════════════════════════════════════════════════
    
    ax1 = fig.add_subplot(gs[0])
    ax1.set_facecolor('#16213e')
    
    steps = history['steps']
    losses = history['losses']
    
    # Plot the loss curve with a gradient-like effect
    # WHY multiple visual elements: Makes the plot more informative and beautiful
    
    # Fill area under curve (semi-transparent)
    # WHY fill: Gives a sense of the "volume" of loss reduction
    ax1.fill_between(steps, losses, alpha=0.3, color='#e94560')
    
    # Main line
    ax1.plot(steps, losses, color='#e94560', linewidth=2.5, label='Training Loss',
             marker='o', markersize=4, markerfacecolor='white', markeredgecolor='#e94560')
    
    # Add annotation for first and last loss
    # WHY annotations: Help the viewer immediately understand the improvement
    ax1.annotate(f'Start: {losses[0]:.2f}',
                xy=(steps[0], losses[0]),
                xytext=(steps[0] + (steps[-1]-steps[0])*0.1, losses[0]*0.95),
                fontsize=10, color='#ff6b6b',
                arrowprops=dict(arrowstyle='->', color='#ff6b6b', lw=1.5),
                fontweight='bold')
    
    ax1.annotate(f'End: {losses[-1]:.2f}',
                xy=(steps[-1], losses[-1]),
                xytext=(steps[-1] * 0.75, losses[-1] + (losses[0]-losses[-1])*0.15),
                fontsize=10, color='#51cf66',
                arrowprops=dict(arrowstyle='->', color='#51cf66', lw=1.5),
                fontweight='bold')
    
    # Mark sample generation points with vertical lines
    # WHY only the step: a vertical line needs just the x position,
    # so there is no need to look up the loss at that epoch
    for sample in history.get('samples', []):
        ax1.axvline(x=sample['step'], color='#ffd93d', linestyle='--',
                    alpha=0.4, linewidth=1)
    
    # Styling
    ax1.set_title('🧠 Neural Network Training — Loss Over Time',
                  fontsize=16, color='white', fontweight='bold', pad=15)
    ax1.set_xlabel('Training Epoch', fontsize=12, color='#a0a0a0')
    ax1.set_ylabel('Cross-Entropy Loss', fontsize=12, color='#a0a0a0')
    ax1.tick_params(colors='#a0a0a0')
    ax1.grid(True, alpha=0.15, color='white')
    ax1.legend(fontsize=11, loc='upper right', facecolor='#16213e',
              edgecolor='#444', labelcolor='white')
    
    # Set spine colors
    for spine in ax1.spines.values():
        spine.set_color('#444')
    
    # ══════════════════════════════════════════════════════════════════
    # BOTTOM PANEL: Sample Generated Text
    # ══════════════════════════════════════════════════════════════════
    
    ax2 = fig.add_subplot(gs[1])
    ax2.set_facecolor('#16213e')
    ax2.set_xlim(0, 1)
    ax2.set_ylim(0, 1)
    ax2.axis('off')
    
    ax2.set_title('📝 Generated Text at Different Training Stages',
                  fontsize=14, color='white', fontweight='bold', pad=10)
    
    # Add Before Training text
    y_pos = 0.9
    ax2.text(0.02, y_pos, '🔴 Before Training:',
            fontsize=11, color='#ff6b6b', fontweight='bold',
            transform=ax2.transAxes, fontfamily='monospace')
    
    before_text = history.get('before_training', 'N/A')[:70]
    ax2.text(0.02, y_pos - 0.1, f'"{before_text}"',
            fontsize=9, color='#ff9999',
            transform=ax2.transAxes, fontfamily='monospace',
            style='italic')
    
    # Add sample texts during training
    y_pos -= 0.25
    samples = history.get('samples', [])
    for i, sample in enumerate(samples[:3]):  # Show max 3 samples
        step = sample['step']
        text = sample['text'][:60]
        
        color_val = 0.4 + (i + 1) * 0.2  # Gradually greener
        text_color = (1.0 - color_val, 0.5 + color_val * 0.5, color_val * 0.5)
        
        ax2.text(0.02, y_pos, f'🟡 Epoch {step}:',
                fontsize=10, color='#ffd93d', fontweight='bold',
                transform=ax2.transAxes, fontfamily='monospace')
        ax2.text(0.02, y_pos - 0.08, f'"{text}"',
                fontsize=9, color=text_color,
                transform=ax2.transAxes, fontfamily='monospace',
                style='italic')
        y_pos -= 0.2
    
    # Add After Training text
    after_text = history.get('after_training', 'N/A')[:70]
    ax2.text(0.02, y_pos, '🟢 After Training:',
            fontsize=11, color='#51cf66', fontweight='bold',
            transform=ax2.transAxes, fontfamily='monospace')
    ax2.text(0.02, y_pos - 0.1, f'"{after_text}"',
            fontsize=9, color='#8ce99a',
            transform=ax2.transAxes, fontfamily='monospace',
            style='italic')
    
    # ── Save the plot ──
    # WHY save to same directory: Keeps all Level 2 files together
    script_dir = os.path.dirname(os.path.abspath(__file__))
    plot_path = os.path.join(script_dir, "training_results.png")
    
    # WHY dpi=150: Good balance between file size and quality
    # WHY bbox_inches='tight': Removes excess whitespace
    plt.savefig(plot_path, dpi=150, bbox_inches='tight',
               facecolor=fig.get_facecolor(), edgecolor='none')
    plt.close()
    
    print(f"  {Colors.BRIGHT_GREEN}✓ Plot saved to:{Colors.RESET}")
    print(f"  {Colors.CYAN}  {plot_path}{Colors.RESET}")
    
    return plot_path


# ============================================================================
# ASCII LOSS CURVE
# ============================================================================

def print_ascii_loss_curve(history):
    """
    Print an ASCII art loss curve for terminals without graphical display.
    
    WHY ASCII plot?
    - Works in ANY terminal (no GUI needed)
    - Great for remote servers / SSH sessions
    - Shows the same information as the matplotlib plot
    - Fun and educational!
    
    The plot uses Unicode block characters to draw bars.
    """
    print_section("ASCII LOSS CURVE", "📉")
    
    losses = history['losses']
    steps = history['steps']
    
    if not losses:
        print(f"  {Colors.RED}No loss data available!{Colors.RESET}")
        return
    
    # ── Calculate plot dimensions ──
    max_loss = max(losses)
    min_loss = min(losses)
    plot_height = 15   # Number of rows in the plot
    plot_width = min(50, len(losses))  # Number of columns
    
    # Resample losses if we have more data points than columns
    # WHY resample: We might have 100+ data points but only 50 columns
    if len(losses) > plot_width:
        indices = np.linspace(0, len(losses) - 1, plot_width, dtype=int)
        sampled_losses = [losses[i] for i in indices]
        sampled_steps = [steps[i] for i in indices]
    else:
        sampled_losses = losses
        sampled_steps = steps
    
    # ── Draw the plot ──
    print(f"  {Colors.BRIGHT_CYAN}  Training Loss Over Time{Colors.RESET}")
    print(f"  {Colors.DIM}  (Each column = one recorded data point, resampled if needed){Colors.RESET}\n")
    
    # Y-axis labels and grid
    for row in range(plot_height, -1, -1):
        # Calculate the loss value for this row
        if max_loss == min_loss:
            loss_at_row = max_loss
        else:
            loss_at_row = min_loss + (max_loss - min_loss) * (row / plot_height)
        
        # Y-axis label (show every 3rd row)
        if row % 3 == 0 or row == plot_height:
            y_label = f"{loss_at_row:>6.2f}"
        else:
            y_label = "      "
        
        # Draw the row
        row_str = f"  {Colors.DIM}{y_label} ┤{Colors.RESET}"
        
        for col in range(len(sampled_losses)):
            loss_val = sampled_losses[col]
            
            # Normalize loss to plot height
            if max_loss == min_loss:
                bar_height = plot_height // 2
            else:
                bar_height = int((loss_val - min_loss) / (max_loss - min_loss) * plot_height)
            
            # Draw bar character based on whether this row is filled
            if bar_height >= row:
                # Color gradient from red (high loss) to green (low loss)
                if col < len(sampled_losses) * 0.3:
                    color = Colors.BRIGHT_RED
                elif col < len(sampled_losses) * 0.6:
                    color = Colors.BRIGHT_YELLOW
                else:
                    color = Colors.BRIGHT_GREEN
                row_str += f"{color}█{Colors.RESET}"
            else:
                row_str += " "
        
        print(row_str)
    
    # X-axis
    print(f"  {Colors.DIM}       └{'─' * len(sampled_losses)}{Colors.RESET}")
    
    # X-axis labels (first, middle, last)
    if sampled_steps:
        first = sampled_steps[0]
        last = sampled_steps[-1]
        mid = sampled_steps[len(sampled_steps) // 2]
        label_line = f"        {first:<{len(sampled_losses)//2}}"
        label_line += f"{mid}"
        remaining = len(sampled_losses) - len(label_line) + 8
        if remaining > 0:
            label_line += " " * remaining + f"{last}"
        print(f"  {Colors.DIM}{label_line}{Colors.RESET}")
    
    print(f"  {Colors.DIM}        {'Training Epoch':^{len(sampled_losses)}}{Colors.RESET}")
    
    # ── Print statistics ──
    print(f"\n  {Colors.BRIGHT_YELLOW}Statistics:{Colors.RESET}")
    print(f"  {Colors.YELLOW}  • Starting loss: {losses[0]:.4f}{Colors.RESET}")
    print(f"  {Colors.YELLOW}  • Final loss:    {losses[-1]:.4f}{Colors.RESET}")
    print(f"  {Colors.YELLOW}  • Best loss:     {min(losses):.4f} "
          f"(at epoch {steps[losses.index(min(losses))]}){Colors.RESET}")
    
    # Guard against a starting loss of exactly 0 to avoid ZeroDivisionError
    improvement = ((losses[0] - losses[-1]) / losses[0] * 100) if losses[0] else 0.0
    print(f"  {Colors.BRIGHT_GREEN}  • Improvement:   {improvement:.1f}% reduction{Colors.RESET}")


# ============================================================================
# BEFORE vs AFTER COMPARISON
# ============================================================================

def print_comparison(history):
    """
    Print a detailed comparison of before vs after training.
    
    WHY this comparison?
    - This is the most dramatic demonstration of learning
    - Students can SEE that the network improved
    - It connects the abstract loss curve to concrete output quality
    """
    print_section("BEFORE vs AFTER COMPARISON", "🔄")
    
    before = history.get('before_training', 'N/A')
    after = history.get('after_training', 'N/A')
    
    # ── Before Training Box ──
    print(f"  {Colors.BRIGHT_RED}┌──────────────────────────────────────────────────────────┐{Colors.RESET}")
    print(f"  {Colors.BRIGHT_RED}│  🔴 BEFORE TRAINING (Random Weights)                    │{Colors.RESET}")
    print(f"  {Colors.BRIGHT_RED}├──────────────────────────────────────────────────────────┤{Colors.RESET}")
    
    # Word-wrap the before text
    for i in range(0, min(len(before), 120), 56):
        line = before[i:i+56]
        padding = 56 - len(line)
        print(f"  {Colors.RED}│  \"{line}\"{' ' * padding}│{Colors.RESET}")
    
    print(f"  {Colors.BRIGHT_RED}├──────────────────────────────────────────────────────────┤{Colors.RESET}")
    print(f"  {Colors.RED}│  Quality: Random gibberish — no patterns learned       │{Colors.RESET}")
    print(f"  {Colors.RED}│  Loss:    ~{history['losses'][0]:.2f} (high = very wrong)"
          f"{'':20}│{Colors.RESET}")
    print(f"  {Colors.BRIGHT_RED}└──────────────────────────────────────────────────────────┘{Colors.RESET}")
    
    print()
    
    # ── After Training Box ──
    print(f"  {Colors.BRIGHT_GREEN}┌──────────────────────────────────────────────────────────┐{Colors.RESET}")
    print(f"  {Colors.BRIGHT_GREEN}│  🟢 AFTER TRAINING (Learned Weights)                    │{Colors.RESET}")
    print(f"  {Colors.BRIGHT_GREEN}├──────────────────────────────────────────────────────────┤{Colors.RESET}")
    
    for i in range(0, min(len(after), 120), 56):
        line = after[i:i+56]
        padding = 56 - len(line)
        print(f"  {Colors.GREEN}│  \"{line}\"{' ' * padding}│{Colors.RESET}")
    
    print(f"  {Colors.BRIGHT_GREEN}├──────────────────────────────────────────────────────────┤{Colors.RESET}")
    print(f"  {Colors.GREEN}│  Quality: Recognizable English words and patterns!     │{Colors.RESET}")
    print(f"  {Colors.GREEN}│  Loss:    ~{history['losses'][-1]:.2f} (low = getting it right!)"
          f"{'':17}│{Colors.RESET}")
    print(f"  {Colors.BRIGHT_GREEN}└──────────────────────────────────────────────────────────┘{Colors.RESET}")
    
    print(f"\n  {Colors.BRIGHT_YELLOW}💡 Key Insight:{Colors.RESET}")
    print(f"  {Colors.YELLOW}  The network learned to associate characters with what typically{Colors.RESET}")
    print(f"  {Colors.YELLOW}  follows them in English text. It's not perfect (it's a tiny network{Colors.RESET}")
    print(f"  {Colors.YELLOW}  with a tiny dataset), but it shows the PRINCIPLE of how language{Colors.RESET}")
    print(f"  {Colors.YELLOW}  models work: predict the next token based on patterns in data!{Colors.RESET}")
    
    # ── Training Progress Through Samples ──
    samples = history.get('samples', [])
    if samples:
        print(f"\n  {Colors.BRIGHT_MAGENTA}Training Progress (Generated Text at Each Stage):{Colors.RESET}\n")
        
        for i, sample in enumerate(samples):
            step = sample['step']
            text = sample['text'][:60]
            
            # Progress bar
            if history['steps']:
                max_step = history['steps'][-1]
                progress = step / max_step if max_step > 0 else 0
            else:
                progress = 0
            
            bar_len = 15
            filled = int(bar_len * progress)
            bar = "█" * filled + "░" * (bar_len - filled)
            
            # Color transitions from red to green
            if progress < 0.33:
                color = Colors.RED
            elif progress < 0.66:
                color = Colors.YELLOW
            else:
                color = Colors.GREEN
            
            print(f"  {Colors.DIM}  Epoch {step:>5}{Colors.RESET} "
                  f"{Colors.BLUE}[{bar}]{Colors.RESET} "
                  f"{color}\"{text}\"{Colors.RESET}")
        
        print(f"\n  {Colors.DIM}  Notice how the text gradually improves from random to "
              f"recognizable!{Colors.RESET}")


# ============================================================================
# LEARNING SUMMARY
# ============================================================================

def print_learning_summary():
    """
    Print a summary of what was learned in this level.
    """
    print_section("🎓 LEVEL 2 COMPLETE — WHAT YOU LEARNED", "🏆")
    
    print(f"""  {Colors.BRIGHT_CYAN}┌──────────────────────────────────────────────────────────┐
  │             🎉 CONGRATULATIONS! 🎉                     │
  │         You built a neural network from scratch!        │
  ├──────────────────────────────────────────────────────────┤
  │                                                          │
  │  Step 1: Single Neuron                                   │
  │    ✓ Weighted sum + sigmoid activation                   │
  │    ✓ Learning AND/OR gates                               │
  │                                                          │
  │  Step 2: Neural Network                                  │
  │    ✓ Multiple layers with matrix multiplication          │
  │    ✓ Forward pass visualization                          │
  │    ✓ Sigmoid + Softmax activations                       │
  │                                                          │
  │  Step 3: Training                                        │
  │    ✓ Character-level data preparation                    │
  │    ✓ One-hot encoding                                    │
  │    ✓ Backpropagation FROM SCRATCH                        │
  │    ✓ Cross-entropy loss                                  │
  │    ✓ Text generation!                                    │
  │                                                          │
  │  Step 4: Visualization                                   │
  │    ✓ Loss curve plotting                                 │
  │    ✓ Before vs After comparison                          │
  │    ✓ Training progress analysis                          │
  │                                                          │
  ├──────────────────────────────────────────────────────────┤
  │                                                          │
  │  🔮 Next Level:                                          │
  │  Level 3 will introduce more advanced concepts!          │
  │                                                          │
  └──────────────────────────────────────────────────────────┘{Colors.RESET}
""")


# ============================================================================
# MAIN EXECUTION
# ============================================================================

if __name__ == "__main__":
    # Main execution block.
    #
    # This script REQUIRES step3_train.py to have been run first,
    # since it loads the training_history.json file that step3 creates.
    
    # Print header
    print_header()
    
    # Load training history
    print_section("LOADING TRAINING HISTORY", "📂")
    history = load_training_history()
    
    # Create matplotlib plot (saved as image)
    plot_path = create_matplotlib_plot(history)
    
    # Print ASCII loss curve (works everywhere)
    print_ascii_loss_curve(history)
    
    # Print before vs after comparison
    print_comparison(history)
    
    # Print learning summary
    print_learning_summary()
    
    # Print footer
    print_footer()

Part III

The Transformer Revolution

The architecture that changed everything

Chapter 4

Transformers and the Magic of Attention

Learning Objectives

Explain what the Transformer architecture is and why it revolutionised AI
Convert text into numerical representations using embeddings
Understand why positional encoding is needed and how sinusoidal waves solve it
Implement self-attention from scratch and explain every term in the attention formula
Describe how multi-head attention lets a model see text from multiple perspectives
Build a complete Transformer block with residual connections, layer norm, and feed-forward networks
Assemble a Mini-Transformer — a working language model in under 200 lines of PyTorch

4.1 The Breakthrough That Changed Everything

In June 2017, a team of eight researchers at Google published a paper with a deceptively simple title: "Attention Is All You Need." That paper didn't just introduce a new model — it rewrote the rules of artificial intelligence.

Before the Transformer, the dominant architectures for language tasks were Recurrent Neural Networks (RNNs) and their more sophisticated cousins, LSTMs (Long Short-Term Memory networks). These models processed text one word at a time, like a student reading a textbook left-to-right, never skipping ahead, never glancing back without effort. They worked, but they were painfully slow to train — because every word had to wait for the previous one to be processed — and they struggled to remember things said far earlier in a paragraph.

The Transformer threw away the conveyor belt. Instead of processing words sequentially, it processes all words at once, in parallel, and uses a mechanism called attention to figure out which words are relevant to each other. Imagine an entire classroom of students working on a problem simultaneously, each student free to glance at any other student's notes. That is the Transformer.

The impact was staggering. Within about two years, Transformer-based models such as BERT, GPT-2, and T5 were setting new records on most major natural language processing benchmarks. Today, nearly every large language model you've heard of (GPT-4, Claude, Gemini, LLaMA) is built on this architecture. When you chat with ChatGPT or use Google Translate, there is a Transformer under the hood.

Important

The Transformer is not just one breakthrough — it is the foundation of modern AI. Understanding it deeply is the single most important step in your AI journey.

Let's build one from scratch.

4.2 From Words to Numbers: Embeddings

Here is a fundamental truth: computers don't understand words. They understand numbers. So the very first step in any language model is to convert text into numbers.

The Naive Approach: One-Hot Encoding

The simplest idea is one-hot encoding. If your vocabulary has, say, 26 characters, represent each character as a vector of length 26 with a single 1 and the rest 0s. So a = [1, 0, 0, ..., 0], b = [0, 1, 0, ..., 0], and so on.

This works technically, but it has two fatal flaws:

The vectors are enormous. If your vocabulary has 50,000 words, each vector has 50,000 dimensions. That is extremely wasteful.
Every word is equally different from every other word. The distance between "king" and "queen" is the same as the distance between "king" and "mango." The representation carries zero information about meaning.
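The second flaw is easy to check numerically. The sketch below assumes a toy three-word vocabulary; the words and the helper names `one_hot` and `euclidean` are illustrative and not part of the course code. It shows that any two distinct one-hot vectors sit at exactly the same distance from each other.

```python
import math

def one_hot(index, size):
    """Return a list of length `size` with a single 1.0 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

vocab = ["king", "queen", "mango"]          # toy vocabulary (assumed)
vectors = {word: one_hot(i, len(vocab)) for i, word in enumerate(vocab)}

d_king_queen = euclidean(vectors["king"], vectors["queen"])
d_king_mango = euclidean(vectors["king"], vectors["mango"])

# Both distances equal sqrt(2): the encoding says nothing about meaning
print(d_king_queen, d_king_mango)
```

However large the vocabulary grows, the distance between two different one-hot vectors stays at the square root of 2, which is why a learned, dense representation is needed.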

Dense Embeddings: Rich Descriptions

A far better idea is to represent each word (or character) as a short, dense vector — say, 64 or 128 numbers — where similar words end up with similar vectors.

Think of it like this:

A student's roll number tells you nothing about them. Roll number 42 could be anyone. But a description — [tall, curious, loves science, good at cricket, from Lucknow] — tells you a lot. You can immediately see that this student is more similar to another science-loving cricketer than to a quiet artist.

An embedding is that description. It is a learned vector of numbers that captures the meaning and relationships of a token.

Building a Vocabulary

Let's start at the very beginning: mapping characters to numbers. This is the simplest form of tokenization (real models like GPT use subword tokenization, but character-level is easiest to learn with).

Python
def build_vocabulary(text):
    """
    Build a character-level vocabulary from text.

    Returns:
        char_to_idx: Dictionary mapping character → number
        idx_to_char: Dictionary mapping number → character
        vocab_size:  Total number of unique characters
    """
    # Get all unique characters and sort them
    chars = sorted(list(set(text)))

    # Create the two-way mapping
    char_to_idx = {ch: i for i, ch in enumerate(chars)}
    idx_to_char = {i: ch for i, ch in enumerate(chars)}

    return char_to_idx, idx_to_char, len(chars)


def encode(text, char_to_idx):
    """Convert a string into a list of numbers using our vocabulary."""
    return [char_to_idx[ch] for ch in text]


def decode(indices, idx_to_char):
    """Convert a list of numbers back into a string."""
    return ''.join([idx_to_char[i] for i in indices])

Feed in the sentence "The sun rises in the east and sets in the west. India is a beautiful country." and you get a mapping like ' '→0, '.'→1, 'I'→2, 'T'→3, 'a'→4, 'b'→5, .... Each unique character gets an index.
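As a quick check of that mapping, here is a minimal round trip on the same sentence. The vocabulary is rebuilt inline (same logic as build_vocabulary) only so that the snippet runs on its own; the word "India" is an arbitrary choice of test string.

```python
text = ("The sun rises in the east and sets in the west. "
        "India is a beautiful country.")

# Same construction as build_vocabulary, written inline
chars = sorted(set(text))
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for i, ch in enumerate(chars)}

encoded = [char_to_idx[ch] for ch in "India"]
decoded = "".join(idx_to_char[i] for i in encoded)

print(len(chars))   # 21 unique characters
print(encoded)      # [2, 13, 7, 11, 4]
print(decoded)      # India

# A character that never appeared in the text has no index,
# so encode() as written would raise KeyError for it.
print("z" in char_to_idx)   # False
```

The last line points at a practical limitation: a character-level vocabulary built from a small text cannot encode characters it has never seen, so the training text should cover every character you expect at generation time.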

The Token Embedding Layer

Now we need to turn each index into a vector. In PyTorch, nn.Embedding does exactly this — it is essentially a lookup table. You give it an index, and it returns the corresponding row from a learnable matrix:

Python
import torch
import torch.nn as nn

class TokenEmbedding(nn.Module):
    """
    Converts token indices into dense vectors.

    Args:
        vocab_size: Number of unique tokens
        embed_dim:  Size of each embedding vector
    """
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        # nn.Embedding is like a lookup table:
        # It stores a matrix of shape (vocab_size × embed_dim)
        # When you give it index 3, it returns row 3 of the matrix
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.embed_dim = embed_dim

    def forward(self, x):
        # x shape: (batch_size, sequence_length) — indices
        # output shape: (batch_size, sequence_length, embed_dim) — vectors
        return self.embedding(x)

If our embedding dimension is 8 (tiny, for illustration — real models use 768 or more), then each character becomes a vector of 8 numbers. The character 'T' might become [+0.312, -0.821, +0.047, ...]. These numbers are random at first and are learned during training — the model discovers what representation works best.

Tip

Think of nn.Embedding as a dictionary where the keys are token indices and the values are vectors. But unlike a regular dictionary, these vectors are trainable parameters — they get updated every time the model learns from data.
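The lookup-table behaviour can be verified directly. This is a small sketch using nn.Embedding on its own rather than the TokenEmbedding wrapper; the sizes (21 tokens, 8 dimensions) and the index values are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)                      # reproducible random initial vectors

embedding = nn.Embedding(num_embeddings=21, embedding_dim=8)

indices = torch.tensor([[3, 10, 8]])      # (batch_size=1, sequence_length=3)
vectors = embedding(indices)              # (1, 3, 8)
print(vectors.shape)                      # torch.Size([1, 3, 8])

# Looking up index 3 returns exactly row 3 of the weight matrix
same_as_row = torch.equal(vectors[0, 0], embedding.weight[3])
print(same_as_row)                        # True

# The table is a trainable parameter, so an optimizer can update it
print(embedding.weight.requires_grad)     # True
```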

4.3 Positional Encoding: Telling the Model About Order

Here is a problem you might not have noticed: the Transformer processes all tokens at the same time. There is no concept of "first" or "second" or "last." But order matters — a lot.

Consider these two sentences:

"कुत्ता आदमी को काटता है" (Dog bites man)
"आदमी कुत्ते को काटता है" (Man bites dog)

Same words, completely different meanings. If the model can't tell which word came first, it cannot distinguish between these two sentences.

The solution from the original paper is elegant: add a unique positional signal to each token's embedding. The Transformer uses sinusoidal (wave-based) encoding — sine and cosine functions at different frequencies.

Why waves? Think of it like tuning into a radio station. Each station (position) has a unique combination of frequencies. Even though two stations might share one frequency, the full combination is always unique. Similarly, each position gets a unique fingerprint of sine and cosine values.

The Formulas

PE_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)

PE_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)

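Here, pos is the position in the sequence and i indexes the sine/cosine pair (i = 0, 1, ..., d_{model}/2 - 1), so each pair of dimensions shares one frequency. Even dimensions (2i) get \sin, odd dimensions (2i+1) get \cos.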
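To see the formulas produce concrete numbers, the sketch below evaluates them one dimension at a time in plain Python. The function name `positional_encoding` and the tiny d_model = 4 are assumptions made for readability; this is a direct transcription of the two formulas, not the vectorised version used in the course code.

```python
import math

def positional_encoding(pos, d_model):
    """Evaluate PE(pos, k) for k = 0 .. d_model - 1 using the two formulas."""
    pe = []
    for k in range(d_model):
        i = k // 2                                   # index of the sin/cos pair
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return pe

pe0 = positional_encoding(0, 4)
pe1 = positional_encoding(1, 4)
print(pe0)   # [0.0, 1.0, 0.0, 1.0]
print(pe1)   # the first pair changes quickly, the second pair slowly

# Every position in a short sequence gets a distinct fingerprint
rows = {tuple(positional_encoding(p, 4)) for p in range(50)}
print(len(rows))   # 50
```

The first pair of dimensions uses the angle pos itself, while the second pair uses pos / 100, so it changes far more slowly. That mix of fast and slow waves is what makes each position's combination distinct.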

Python
import math

class PositionalEncoding(nn.Module):
    """
    Adds positional information to embeddings using sinusoidal patterns.
    """
    def __init__(self, embed_dim, max_seq_len=512):
        super().__init__()

        # Create a matrix to store all positional encodings
        pe = torch.zeros(max_seq_len, embed_dim)

        # Position indices: [0, 1, 2, ..., max_seq_len-1]
        position = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)

        # Division term: creates different frequencies for each dimension
        div_term = torch.exp(
            torch.arange(0, embed_dim, 2).float() * (-math.log(10000.0) / embed_dim)
        )

        # Apply sin to even indices (0, 2, 4, ...)
        pe[:, 0::2] = torch.sin(position * div_term)

        # Apply cos to odd indices (1, 3, 5, ...)
        pe[:, 1::2] = torch.cos(position * div_term)

        # Add batch dimension: (1, max_seq_len, embed_dim)
        pe = pe.unsqueeze(0)

        # Register as buffer (saved with model but not trained)
        self.register_buffer('pe', pe)

    def forward(self, x):
        """
        Args:
            x: Tensor of shape (batch_size, seq_len, embed_dim)
        Returns:
            x + positional encoding
        """
        seq_len = x.size(1)
        # Add positional encoding to the input
        return x + self.pe[:, :seq_len, :]

Notice that the positional encoding is added to the embedding, not concatenated. After this step, the same character at different positions in a sentence will have different vector representations. The model now knows what a token is (from the embedding) and where it is (from the positional encoding).

Note

The positional encoding is registered as a buffer, not a parameter. This means it is saved with the model but is not updated during training — it is a fixed mathematical pattern. Some modern models (including our final Mini-Transformer) use learned positional embeddings instead, which are trained just like token embeddings.
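For comparison, a learned positional embedding can be sketched in a few lines. The class name `LearnedPositionalEmbedding` and its sizes are hypothetical, and the Mini-Transformer's actual code may differ in detail; the point is only that positions are looked up in a trainable table, exactly as tokens are.

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    """Hypothetical learned alternative to the sinusoidal encoding."""

    def __init__(self, embed_dim, max_seq_len=512):
        super().__init__()
        # One trainable vector per position, looked up like a token embedding
        self.pos_embedding = nn.Embedding(max_seq_len, embed_dim)

    def forward(self, x):
        # x: (batch_size, seq_len, embed_dim)
        positions = torch.arange(x.size(1), device=x.device)   # (seq_len,)
        # (seq_len, embed_dim) broadcasts across the batch dimension
        return x + self.pos_embedding(positions)

layer = LearnedPositionalEmbedding(embed_dim=8, max_seq_len=16)
x = torch.zeros(2, 5, 8)
out = layer(x)

n_trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(out.shape)      # torch.Size([2, 5, 8])
print(n_trainable)    # 128 (16 positions x 8 dimensions)
```

Unlike the sinusoidal buffer, these 128 numbers appear in `layer.parameters()` and are updated by the optimizer. The trade-off is that a learned table has no entry for positions beyond `max_seq_len`.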

4.4 The Heart of the Transformer: Self-Attention

This is it. If you understand this one section deeply, you understand the engine that powers all of modern AI.

The Classroom Analogy

Imagine a classroom with 30 students. The teacher asks a question: "What is the role of the monsoon in Indian agriculture?"

Now, every student in the class could potentially answer. But some students are more relevant than others. The student who studied geography has the most useful answer. The student who studied mathematics? Probably less relevant — but maybe she can add something about rainfall statistics. The student who plays cricket all day? Perhaps not useful at all for this question.

Self-attention is exactly this. Each word in a sentence "asks a question" and then looks at every word in the sentence, itself included, to decide: "How relevant are you to me?" The word then collects a weighted combination of information from all of them, paying more attention to the relevant ones.

Query, Key, and Value: Three Ways to Look at Every Token

Every token in the sequence gets transformed into three vectors: a Query (Q), a Key (K), and a Value (V). Let's understand these through three analogies:

Analogy 1 — The Library:

Query = the search term you type into the library catalog ("monsoon agriculture India")
Key = the title/tags of each book on the shelf ("Indian Climate Patterns," "History of Cricket," "Monsoon and Farming")
Value = the actual content of each book

You match your query against all keys to find relevant books, then read the content (values) of those books.

Analogy 2 — The Classroom:

Query = the question being asked ("What causes monsoons?")
Key = each student's expertise tag ("geography expert," "maths nerd," "cricket captain")
Value = the actual answer each student would give

You compare your question against everyone's expertise, then listen most carefully to the most relevant students.

Analogy 3 — Google Search:

Query = what you type in the search bar
Key = the title/description of each web page
Value = the actual content of each web page

Google ranks pages by matching your query to their keys, then shows you the values of the best matches.
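The three analogies describe a "soft lookup": match the query against every key, turn the match scores into weights, and blend the values. Here is a toy version in plain Python using the library analogy. All of the vectors are invented for illustration (axis 0 loosely means "about monsoons and farming", axis 1 "about sport"); in a real model they come from learned weight matrices.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)                                  # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

books = ["Indian Climate Patterns", "History of Cricket", "Monsoon and Farming"]
query = [2.0, 0.0]                                   # "monsoon agriculture India"
keys = [[1.0, 0.0], [0.0, 1.5], [1.5, 0.0]]          # one key per book (assumed)
values = [[10.0, 0.0], [0.0, 10.0], [20.0, 0.0]]     # one value per book (assumed)

scores = [dot(query, k) for k in keys]               # query-key similarity
weights = softmax(scores)                            # how much to read each book
output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(2)]

for title, w in zip(books, weights):
    print(f"{title:<25} weight = {w:.2f}")
print("blended value:", [round(o, 2) for o in output])
```

No book is ignored completely. The cricket book still receives a small weight, which is what distinguishes a soft lookup from an exact dictionary match.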

The Implementation

Python
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """
    Single-head self-attention mechanism.
    This is the core building block of the Transformer.
    """

    def __init__(self, embed_dim):
        super().__init__()
        self.embed_dim = embed_dim

        # Three weight matrices — these are LEARNED during training!
        # W_q: transforms input into "what am I looking for?"
        # W_k: transforms input into "what do I contain?"
        # W_v: transforms input into "what information do I give?"
        self.W_q = nn.Linear(embed_dim, embed_dim, bias=False)
        self.W_k = nn.Linear(embed_dim, embed_dim, bias=False)
        self.W_v = nn.Linear(embed_dim, embed_dim, bias=False)

        # Scaling factor
        self.scale = math.sqrt(embed_dim)

    def forward(self, x, mask=None):
        """
        Args:
            x: Input tensor of shape (batch, seq_len, embed_dim)
            mask: Optional causal mask

        Returns:
            output: Attention output, same shape as input
            attention_weights: The attention matrix
        """
        batch_size, seq_len, _ = x.shape

        # STEP 1: Create Q, K, V
        Q = self.W_q(x)  # (batch, seq_len, embed_dim)
        K = self.W_k(x)  # (batch, seq_len, embed_dim)
        V = self.W_v(x)  # (batch, seq_len, embed_dim)

        # STEP 2: Compute attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1))  # (batch, seq_len, seq_len)

        # STEP 3: Scale
        scores = scores / self.scale

        # STEP 4: Apply causal mask (optional)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        # STEP 5: Softmax → probabilities
        attention_weights = F.softmax(scores, dim=-1)

        # STEP 6: Weighted sum of values
        output = torch.matmul(attention_weights, V)

        return output, attention_weights

Step-by-Step Numerical Walkthrough

Let's trace through the computation with a tiny example. Suppose we have 3 tokens and an embedding dimension of 4.

Input matrix X (3 tokens × 4 dimensions):

X = \begin{bmatrix} 1.0 & 0.5 & -0.3 & 0.8 \\ 0.2 & -0.7 & 1.1 & 0.4 \\ -0.5 & 0.9 & 0.6 & -0.2 \end{bmatrix}

Step 1: Multiply by weight matrices W_Q, W_K, W_V to get Q, K, V. (In practice these are learned; let's say the multiplication yields Q, K, V each of shape 3×4.)

Step 2: Compute raw scores: \text{scores} = Q \cdot K^T. This is a 3×3 matrix — each entry (i, j) tells us how much token i is interested in token j.

\text{scores} = \begin{bmatrix} 2.1 & 0.8 & -0.3 \\ 0.8 & 1.9 & 1.2 \\ -0.3 & 1.2 & 2.5 \end{bmatrix}

Step 3: Scale by \frac{1}{\sqrt{d_k}} = \frac{1}{\sqrt{4}} = \frac{1}{2}:

\text{scaled} = \begin{bmatrix} 1.05 & 0.40 & -0.15 \\ 0.40 & 0.95 & 0.60 \\ -0.15 & 0.60 & 1.25 \end{bmatrix}

Step 4: Apply softmax to each row (so each row sums to 1):

\text{weights} = \begin{bmatrix} 0.55 & 0.29 & 0.17 \\ 0.25 & 0.44 & 0.31 \\ 0.14 & 0.30 & 0.57 \end{bmatrix}

Step 5: Multiply weights by V. Each token's output is a weighted combination of all value vectors.

The result: token 0 pays most attention to itself (0.55), token 1 pays most attention to itself (0.44) but also notices token 2 (0.31), and so on. (Entries are rounded to two decimals, so a row may add up to 1.01 rather than exactly 1.) In a trained model, these weights are how the model focuses on what matters.
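Steps 3 to 5 can be replayed in plain Python, which is a handy way to check the arithmetic by hand. The scores matrix is the one from Step 2. The value matrix V is a made-up placeholder (the walkthrough does not specify one), chosen so that each output row simply exposes its attention weights.

```python
import math

scores = [[2.1, 0.8, -0.3],
          [0.8, 1.9, 1.2],
          [-0.3, 1.2, 2.5]]
d_k = 4

def softmax(row):
    """Row-wise softmax with the usual max-subtraction for stability."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

# Step 3: scale by 1 / sqrt(d_k)
scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]

# Step 4: softmax over each row
weights = [softmax(row) for row in scaled]
for row in weights:
    print([round(w, 2) for w in row], "row sum =", round(sum(row), 6))

# Step 5: weighted sum of value vectors (V is hypothetical: 3 tokens x 4 dims)
V = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]
output = [[sum(w * V[j][d] for j, w in enumerate(row)) for d in range(4)]
          for row in weights]
print([round(o, 2) for o in output[0]])
```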

The Attention Formula

\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^T}{\sqrt{d_k}}\right) \cdot V

This single equation is the beating heart of every modern language model. Let's understand every part:

Component	What it does
`Q K^T`	Dot product between queries and keys → raw similarity scores
`\sqrt{d_k}`	Scaling factor that keeps scores near unit variance so softmax does not saturate and gradients keep flowing
`\text{softmax}`	Converts raw scores into probabilities (0 to 1, summing to 1)
`\cdot V`	Weighted sum of values using those probabilities

Why Scale by `\sqrt{d_k}`?

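This is a subtle but critical detail. When you compute the dot product of two random vectors of dimension d_k (with independent components of mean 0 and variance 1), the result has a variance of approximately d_k, that is, a standard deviation of \sqrt{d_k}. If d_k = 64, the standard deviation is 8, so dot products of ±8 are routine and values of ±16 or more are not unusual.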

What happens when you feed very large numbers into softmax? The output becomes extremely "peaked" — one element gets a probability near 1.0, and everything else gets near 0.0. This is essentially a hard argmax, and the gradients become vanishingly small. The model stops learning.

By dividing by \sqrt{d_k}, we bring the variance back to approximately 1, keeping softmax in a range where gradients flow well.
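A small simulation makes the argument concrete. The sketch below assumes query and key components are independent standard normal values, which is an idealisation of what a freshly initialised model produces; the trial count and seed are arbitrary.

```python
import math
import random

random.seed(0)
d_k = 64
n_trials = 2000

def random_vector(dim):
    """Vector of independent N(0, 1) components (an idealised assumption)."""
    return [random.gauss(0.0, 1.0) for _ in range(dim)]

def std(xs):
    mean = sum(xs) / len(xs)
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

dots = []
for _ in range(n_trials):
    q, k = random_vector(d_k), random_vector(d_k)
    dots.append(sum(a * b for a, b in zip(q, k)))

raw_std = std(dots)                                   # close to sqrt(64) = 8
scaled_std = std([x / math.sqrt(d_k) for x in dots])  # close to 1
print(f"raw std: {raw_std:.2f}, scaled std: {scaled_std:.2f}")

# Scores two standard deviations apart, before and after scaling
peaked = softmax([16.0, 0.0, -16.0])   # unscaled: almost a hard argmax
gentle = softmax([2.0, 0.0, -2.0])     # scaled: the other tokens still count
print([round(p, 4) for p in peaked])
print([round(p, 4) for p in gentle])
```

With unscaled scores the winning token takes essentially all of the weight, which is the saturated regime where gradients vanish. After dividing by \sqrt{d_k} the distribution stays soft enough for the model to keep adjusting it.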

Warning

Without the \sqrt{d_k} scaling, training tends to become unstable as d_k grows: softmax saturates, gradients shrink, and the model learns slowly or settles on poor solutions. This seemingly small detail makes a huge difference in practice.

Causal Masking: Why GPT Can't Look at the Future

When you're generating text one token at a time ("The capital of India is ___"), the model must predict the next word using only the words that came before it. It cannot peek at the answer.

This is enforced with a causal mask — a lower-triangular matrix of 1s and 0s:

Python
def create_causal_mask(seq_len):
    """
    Create a lower-triangular mask for autoregressive models.

    [[1, 0, 0, 0],     ← token 0 can only see token 0
     [1, 1, 0, 0],     ← token 1 can see tokens 0, 1
     [1, 1, 1, 0],     ← token 2 can see tokens 0, 1, 2
     [1, 1, 1, 1]]     ← token 3 can see all tokens
    """
    mask = torch.tril(torch.ones(seq_len, seq_len))
    return mask.unsqueeze(0)  # Add batch dimension

Where the mask is 0, we set the attention score to -\infty. After softmax, e^{-\infty} = 0 — those positions get zero attention weight. The model is effectively blind to future tokens.
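The effect of the mask is easiest to see when every raw score is identical, because any difference in the output must then come from the mask alone. This sketch builds the mask inline (same construction as create_causal_mask) so that it runs independently; the all-zero scores are an assumption made purely for clarity.

```python
import torch
import torch.nn.functional as F

seq_len = 4
scores = torch.zeros(1, seq_len, seq_len)             # all-equal raw scores

# Lower-triangular mask with a leading batch dimension
mask = torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0)

masked = scores.masked_fill(mask == 0, float('-inf'))
weights = F.softmax(masked, dim=-1)
print(weights[0])
# tensor([[1.0000, 0.0000, 0.0000, 0.0000],
#         [0.5000, 0.5000, 0.0000, 0.0000],
#         [0.3333, 0.3333, 0.3333, 0.0000],
#         [0.2500, 0.2500, 0.2500, 0.2500]])
```

Each row still sums to 1, but the weight is shared only among the current token and the tokens before it. Every position above the diagonal is exactly zero.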

Note

BERT-style models (used for understanding, not generation) do not use causal masking — they attend in both directions. GPT-style models (used for generation) always use it. This is the fundamental difference between "encoder" and "decoder" Transformers.

4.5 Multi-Head Attention: Multiple Perspectives

Single-head attention is powerful, but it has a limitation: one attention pattern per token. In reality, a word can relate to other words in many different ways simultaneously.

Consider the sentence: "The student from Delhi who loves physics scored the highest marks."

The word "scored" needs to attend to:

"student" — to know who scored
"highest" — to know how much was scored
"marks" — to know what was scored

One attention head would struggle to capture all three relationships at once. The solution? Use multiple heads, each learning a different type of relationship.

Think of it like reading a poem. One head reads for meaning, another for rhyme scheme, a third for emotional tone, and a fourth for grammatical structure. Each perspective is partial, but together they form a rich understanding.

Mechanically, multi-head attention splits the embedding dimension among heads. If embed_dim = 128 and num_heads = 4, each head works with 32 dimensions. After each head computes attention independently, the results are concatenated and projected back:

Python
class MultiHeadAttention(nn.Module):
    """
    Multi-Head Self-Attention.
    Splits the input into multiple "heads", runs attention on each,
    then combines the results.
    """

    def __init__(self, embed_dim, num_heads):
        super().__init__()
        assert embed_dim % num_heads == 0, \
            f"embed_dim ({embed_dim}) must be divisible by num_heads ({num_heads})"

        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads

        # One big linear layer for Q, K, V (more efficient than 3 separate ones)
        self.W_qkv = nn.Linear(embed_dim, 3 * embed_dim, bias=False)

        # Output projection: combines all heads back together
        self.W_out = nn.Linear(embed_dim, embed_dim, bias=False)

        self.scale = math.sqrt(self.head_dim)

    def forward(self, x, mask=None):
        batch_size, seq_len, _ = x.shape

        # Step 1: Compute Q, K, V all at once
        qkv = self.W_qkv(x)  # (batch, seq_len, 3 * embed_dim)

        # Step 2: Split into Q, K, V and reshape for multi-head
        qkv = qkv.reshape(batch_size, seq_len, 3, self.num_heads, self.head_dim)
        qkv = qkv.permute(2, 0, 3, 1, 4)  # (3, batch, heads, seq_len, head_dim)
        Q, K, V = qkv[0], qkv[1], qkv[2]

        # Step 3: Compute attention scores for ALL heads at once
        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale

        # Step 4: Apply causal mask
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        # Step 5: Softmax
        attention_weights = F.softmax(scores, dim=-1)

        # Step 6: Weighted sum of values
        output = torch.matmul(attention_weights, V)

        # Step 7: Combine heads back together
        output = output.transpose(1, 2).reshape(batch_size, seq_len, self.embed_dim)

        # Step 8: Final projection
        output = self.W_out(output)

        return output
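
The reshape and permute calls in forward() can hide how simple the split really is. The sketch below uses the numbers from the text (embed_dim = 128, num_heads = 4) and plain Python lists; the `query` vector is a stand-in whose values are just indices, chosen so the slices are easy to read.

```python
embed_dim, num_heads = 128, 4
head_dim = embed_dim // num_heads  # 32 dimensions per head

# Stand-in for one token's projected query vector (values are just indices).
query = [float(i) for i in range(embed_dim)]

# Split: head h receives the contiguous slice [h*head_dim, (h+1)*head_dim).
heads = [query[h * head_dim:(h + 1) * head_dim] for h in range(num_heads)]
for h, chunk in enumerate(heads):
    print(f"head {h}: dims {int(chunk[0])}..{int(chunk[-1])} ({len(chunk)} values)")

# Combine: concatenating the heads restores the original embed_dim layout.
merged = [value for chunk in heads for value in chunk]
print(len(merged), merged == query)
```

Each head sees only its own 32-dimensional slice, and concatenation (Step 7 in the class above) puts the slices back side by side before the output projection mixes them.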

Tip

Notice the efficiency trick: instead of three separate linear layers for Q, K, and V, we use one large linear layer (W_qkv) that produces all three at once. This is mathematically equivalent to stacking the three weight matrices, but it typically runs faster on GPUs because it is one large matrix multiplication instead of three smaller ones.
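
The equivalence is easy to check by hand. The following is a small sketch with made-up 2×2 matrices in plain Python; the helper `matvec` and all the numbers are assumptions for illustration. Weights are stored with one row per output feature, which is also how nn.Linear stores them.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

# Hypothetical 2-dimensional example: each projection is a 2x2 matrix.
W_q = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[0.5, -1.0], [2.0, 0.0]]
W_v = [[0.0, 3.0], [1.0, 1.0]]
x = [2.0, -1.0]

# Three separate projections.
q, k, v = matvec(W_q, x), matvec(W_k, x), matvec(W_v, x)

# One fused projection: stack the three matrices row-wise (6x2),
# multiply once, then slice the 6-element result into three parts.
W_qkv = W_q + W_k + W_v
qkv = matvec(W_qkv, x)
d = len(x)
q_fused, k_fused, v_fused = qkv[:d], qkv[d:2 * d], qkv[2 * d:]

print(q_fused == q, k_fused == k, v_fused == v)
```

The fused layer computes the same numbers; the only difference is that the hardware sees one multiplication rather than three.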

4.6 The Transformer Block

Attention tells the model what to look at. But it also needs to process that information, stabilise its numbers, and preserve what it already knew. That is the job of the full Transformer block, which wraps attention with three additional components.

Residual Connections: "You Still Have Your Own Notes"

Imagine a student attends a lecture. Even if the lecture was confusing and the student didn't grasp much, she still has her own notes from before the lecture. A residual connection does the same thing — it adds the original input back to the output:

\text{output} = x + \text{Attention}(x)

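This means the model never loses the original information. In the worst case (the attention layer contributes nothing useful and its output stays near zero), the input passes through essentially unchanged. In practice, residual connections make deep networks much easier to train, because gradients can flow straight through the addition to earlier layers.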

Layer Normalization: "Grading on a Curve"

After many matrix multiplications, the numbers in our vectors can drift, with some becoming very large and others very small. Layer normalization is like grading on a curve: it rescales each token's vector so that its features have mean 0 and standard deviation 1, then applies a learned scale and shift.

Without normalization, training deep Transformers becomes unstable. The numbers "explode" or "vanish," and the model fails to learn.
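
Here is a minimal sketch of the normalisation step on a single token vector, in plain Python. The input values are invented, and the learned scale and shift that nn.LayerNorm also applies are deliberately left out; `eps` is the small constant that guards against division by zero.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalise one token's vector to mean 0 and (almost) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

# Hypothetical activations that have drifted to very different scales.
token = [250.0, -3.0, 0.02, 41.0]
normed = layer_norm(token)

new_mean = sum(normed) / len(normed)
new_std = math.sqrt(sum((v - new_mean) ** 2 for v in normed) / len(normed))
print([round(v, 3) for v in normed])
print(round(new_mean, 6), round(new_std, 6))
```

Whatever the scale of the input, the output lands in the same comfortable range, which is what keeps the later layers stable.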

Feed-Forward Network: "Processing What You Gathered"

After attention has gathered relevant information from across the sequence, the FFN processes that information. It is a simple two-layer network applied independently to each token:

\text{FFN}(x) = \text{ReLU}(x \cdot W_1 + b_1) \cdot W_2 + b_2

The hidden layer is typically 4× wider than the embedding dimension. This "expand then shrink" pattern gives the model a larger computational space to work in — like having scratch paper to work out a problem — before compressing the result back down.

Python
class FeedForward(nn.Module):
    """
    Position-wise Feed-Forward Network.
    Each position (token) is processed INDEPENDENTLY through the same network.
    """

    def __init__(self, embed_dim, ff_dim=None):
        super().__init__()
        ff_dim = ff_dim or 4 * embed_dim

        self.net = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),    # Expand
            nn.ReLU(),                        # Non-linearity
            nn.Linear(ff_dim, embed_dim),     # Shrink back
        )

    def forward(self, x):
        return self.net(x)
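
To see what "applied independently to each token" means, the sketch below runs a tiny FFN over a three-token sequence in plain Python. The weights are arbitrary fixed numbers chosen only to make the example deterministic, and the helper names `linear` and `ffn` are assumptions for illustration.

```python
def linear(x, W, b):
    """Compute x·W + b for one vector x; W has one row per input feature."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = ReLU(x·W1 + b1)·W2 + b2 for a single token vector."""
    hidden = [max(0.0, h) for h in linear(x, W1, b1)]  # expand, then ReLU
    return linear(hidden, W2, b2)                      # shrink back

embed_dim, ff_dim = 2, 8  # hidden layer is 4x wider, as in the text

# Hypothetical fixed weights, chosen only so the example is deterministic.
W1 = [[0.1 * (i + j) - 0.3 for j in range(ff_dim)] for i in range(embed_dim)]
b1 = [0.0] * ff_dim
W2 = [[0.05 * (i - j) for j in range(embed_dim)] for i in range(ff_dim)]
b2 = [0.1] * embed_dim

# A 3-token sequence in which the first and last tokens are identical.
sequence = [[1.0, 2.0], [-0.5, 0.3], [1.0, 2.0]]
outputs = [ffn(tok, W1, b1, W2, b2) for tok in sequence]

for tok, out in zip(sequence, outputs):
    print(tok, "->", [round(o, 4) for o in out])
```

The first and last tokens produce identical outputs because the FFN never looks at neighbouring positions. Mixing information between positions is the job of attention alone.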

The Complete Transformer Block

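Now let's put attention, FFN, residual connections, and layer norm together. Layer norm is applied before each sub-layer (the "pre-norm" arrangement used in GPT-2), so each residual update has the form x + Sublayer(LayerNorm(x)):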

Python
class TransformerBlock(nn.Module):
    """
    A single Transformer block.
    This is the fundamental repeating unit in models like GPT.
    """

    def __init__(self, embed_dim, num_heads, ff_dim=None, dropout=0.1):
        super().__init__()

        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.attention = MultiHeadAttention(embed_dim, num_heads)
        self.ffn = FeedForward(embed_dim, ff_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Sub-layer 1: Attention + Residual
        normed = self.norm1(x)
        attended = self.attention(normed, mask=mask)
        attended = self.dropout(attended)
        x = x + attended            # ← Residual connection

        # Sub-layer 2: FFN + Residual
        normed = self.norm2(x)
        fed_forward = self.ffn(normed)
        fed_forward = self.dropout(fed_forward)
        x = x + fed_forward         # ← Residual connection

        return x

The beauty of this design: input and output have the same shape. This means you can stack as many blocks as you want. The output of Block 1 feeds directly into Block 2, Block 2 into Block 3, and so on. As a rough intuition (the actual division of labour varies from model to model), deeper stacks can learn more complex patterns:

Block 1: Basic patterns (which characters tend to appear together)
Blocks 2–3: Higher-level patterns (word structure, common phrases)
Blocks 4+: Complex patterns (meaning, grammar, context, reasoning)

4.7 Putting It All Together: The Mini-Transformer

Now let's assemble every piece into a complete language model. This is a miniature version of GPT — same architecture, just smaller:

Python
class MiniTransformer(nn.Module):
    """
    A complete mini-Transformer language model.
    Takes character indices as input, predicts the next character.
    """

    def __init__(self, vocab_size, embed_dim=64, num_heads=4,
                 num_blocks=4, max_seq_len=256, dropout=0.1):
        super().__init__()

        self.vocab_size = vocab_size
        self.embed_dim = embed_dim
        self.max_seq_len = max_seq_len

        # Token embedding: character index → vector
        self.token_embedding = nn.Embedding(vocab_size, embed_dim)

        # Positional embedding: position → vector (learned, not sinusoidal)
        self.position_embedding = nn.Embedding(max_seq_len, embed_dim)

        # Dropout after embeddings
        self.dropout = nn.Dropout(dropout)

        # Stack of Transformer blocks
        self.blocks = nn.ModuleList([
            TransformerBlock(embed_dim, num_heads, dropout=dropout)
            for _ in range(num_blocks)
        ])

        # Final layer normalization
        self.final_norm = nn.LayerNorm(embed_dim)

        # Output projection: vector → vocabulary scores (logits)
        self.output_head = nn.Linear(embed_dim, vocab_size, bias=False)

        # Weight tying: share weights between input embedding and output head
        self.output_head.weight = self.token_embedding.weight

    def forward(self, idx, targets=None):
        batch_size, seq_len = idx.shape

        # Step 1: Token embeddings
        tok_emb = self.token_embedding(idx)

        # Step 2: Positional embeddings
        positions = torch.arange(seq_len, device=idx.device)
        pos_emb = self.position_embedding(positions)

        # Step 3: Combine and apply dropout
        x = self.dropout(tok_emb + pos_emb)

        # Step 4: Causal mask
        mask = torch.tril(torch.ones(seq_len, seq_len, device=idx.device))
        mask = mask.unsqueeze(0)

        # Step 5: Pass through transformer blocks
        for block in self.blocks:
            x = block(x, mask=mask)

        # Step 6: Final normalization
        x = self.final_norm(x)

        # Step 7: Project to vocabulary size
        logits = self.output_head(x)

        # Compute loss if targets are provided
        loss = None
        if targets is not None:
            loss = F.cross_entropy(
                logits.view(-1, self.vocab_size),
                targets.view(-1)
            )

        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        """Generate text autoregressively."""
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -self.max_seq_len:]
            logits, _ = self(idx_cond)
            logits = logits[:, -1, :] / temperature

            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = float('-inf')

            probs = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            idx = torch.cat([idx, next_token], dim=1)

        return idx
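
The two knobs in generate(), temperature and top_k, are easier to understand on a tiny example. The sketch below reproduces the same arithmetic in plain Python for a made-up four-character vocabulary; `next_token_probs` and the logits are hypothetical, and the final sampling step is left out.

```python
import math

def next_token_probs(logits, temperature=1.0, top_k=None):
    """Turn raw logits into sampling probabilities."""
    scaled = [logit / temperature for logit in logits]
    if top_k is not None:
        # Keep the k largest logits; everything below the k-th is removed.
        cutoff = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
        scaled = [s if s >= cutoff else float("-inf") for s in scaled]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 4-character vocabulary.
logits = [2.0, 1.0, 0.5, -1.0]

for label, kwargs in [
    ("temperature=1.0", {}),
    ("temperature=0.5", {"temperature": 0.5}),
    ("temperature=2.0", {"temperature": 2.0}),
    ("top_k=2", {"top_k": 2}),
]:
    probs = next_token_probs(logits, **kwargs)
    print(f"{label:16s}", [round(p, 3) for p in probs])
```

A temperature below 1 sharpens the distribution towards the most likely character, a temperature above 1 flattens it, and top_k sets every probability outside the k best candidates to zero before sampling.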

Architecture Diagram


          Input: "The cat sat on the mat"
                      │
                      ▼
          ┌───────────────────────┐
          │   Token Embedding     │  char index → vector (64d)
          └───────────┬───────────┘
                      │
                      + ◄──── Position Embedding (learned, 64d)
                      │
                      ▼
                   Dropout
                      │
          ┌───────────┴───────────┐
          │                       │
          │   ┌───────────────┐   │
          │   │  LayerNorm    │   │
          │   │  Multi-Head   │   │ ── Transformer
          │   │  Attention    │   │    Block 1
          │   │  + Residual   │   │
          │   ├───────────────┤   │
          │   │  LayerNorm    │   │
          │   │  FFN          │   │
          │   │  + Residual   │   │
          │   └───────────────┘   │
          │         ...           │  ×4 blocks
          │   ┌───────────────┐   │
          │   │  Block 4      │   │
          │   └───────────────┘   │
          │                       │
          └───────────┬───────────┘
                      │
                      ▼
          ┌───────────────────────┐
          │   Final LayerNorm     │
          └───────────┬───────────┘
                      │
                      ▼
          ┌───────────────────────┐
          │   Output Head         │  vector → vocab scores
          │   (weight-tied)       │
          └───────────┬───────────┘
                      │
                      ▼
            Logits (65 scores per position)
            → softmax → probabilities → sample → next token

Parameter Count Breakdown

With vocab_size=65, embed_dim=64, num_heads=4, num_blocks=4, max_seq_len=128:

Component	Parameters
Token Embedding (65 × 64)	4,160
Position Embedding (128 × 64)	8,192
Transformer Blocks (4 × 49,728)	198,912
Final LayerNorm	128
Output Head	(shared with token embedding)
Total	211,392

For comparison:

Our Mini-Transformer: ~211,000 parameters
GPT-2 (small): 124,000,000 parameters
GPT-3: 175,000,000,000 parameters
GPT-4 (estimated): ~1,700,000,000,000 parameters

Same architecture. Different scale.
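
A parameter breakdown like the one above can be derived directly from the layer shapes in the class definitions in this chapter (bias-free attention projections, FFN layers with biases, and a weight-tied output head). The following is a small sketch of that bookkeeping; `count_parameters` and `ff_mult` are names introduced here for illustration. Note that num_heads does not appear: splitting into heads reshapes the same matrices without adding weights.

```python
def count_parameters(vocab_size, embed_dim, num_blocks, max_seq_len, ff_mult=4):
    """Count trainable parameters, following the class definitions above."""
    ff_dim = ff_mult * embed_dim

    # W_qkv (embed_dim -> 3*embed_dim) plus W_out (embed_dim -> embed_dim), no biases.
    attention = embed_dim * 3 * embed_dim + embed_dim * embed_dim
    # Two linear layers, each with a weight matrix and a bias vector.
    ffn = (embed_dim * ff_dim + ff_dim) + (ff_dim * embed_dim + embed_dim)
    # A LayerNorm has one gain and one bias per feature.
    layer_norm = 2 * embed_dim
    block = attention + ffn + 2 * layer_norm

    return {
        "token_embedding": vocab_size * embed_dim,
        "position_embedding": max_seq_len * embed_dim,
        "blocks": num_blocks * block,
        "final_norm": layer_norm,
        "output_head": 0,  # weight-tied to the token embedding
    }

counts = count_parameters(vocab_size=65, embed_dim=64, num_blocks=4, max_seq_len=128)
for name, n in counts.items():
    print(f"{name:20s}{n:>10,}")
print(f"{'total':20s}{sum(counts.values()):>10,}")
```

On a real model the same total can be cross-checked with `sum(p.numel() for p in model.parameters())`, which counts the shared embedding matrix only once.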

💭 4.8 Discussion: Why Transformers Beat Everything

Before 2017, RNNs and LSTMs ruled natural language processing. Why did Transformers replace them so completely?

1. Parallelisation

RNNs process tokens one at a time: word 5 must wait for word 4 to finish, which must wait for word 3, and so on. Within a sequence this is inherently sequential, so extra hardware cannot shorten the chain of time steps.

Transformers process all tokens of a sequence simultaneously during training. The attention computation is a matrix multiplication, exactly the kind of operation that GPUs are designed to do fast. As a result, Transformers keep large GPU clusters busy far more effectively than RNNs, whose step-by-step recurrence leaves much of that hardware waiting.

2. Long-Range Dependencies

In an RNN, information from word 1 has to pass through every subsequent word to reach word 100. By that point, the signal has degraded — the vanishing gradient problem. LSTMs improved this but didn't eliminate it.

In a Transformer, word 100 can attend directly to word 1 in a single step. The attention mechanism doesn't care about distance. A word on page 1 of a document can directly influence a word on page 50, as long as both fit inside the model's context window (max_seq_len in our code).

3. Scaling Laws

Perhaps the most important advantage: Transformers scale predictably. Scaling-law studies have found that as you increase model size, data size, and compute together, a Transformer's loss falls along a smooth, predictable curve. This led to the modern paradigm: make the model bigger, give it more data, and it gets better. This insight fuelled the race from GPT-2 (1.5B parameters in its largest version) to GPT-4 (unofficially estimated at around 1.7T parameters; OpenAI has not published the figure).

Important

The Transformer's key advantage is not intelligence but scalability. The same architecture works for our tiny Mini-Transformer and for a trillion-parameter frontier model. This degree of universality is rare in AI history.

4.9 Key Concepts Summary

Concept	What It Does	Analogy
Embedding	Converts tokens into dense vectors	Roll number → student description
Positional Encoding	Adds position information to embeddings	Radio frequencies giving each station a unique ID
Self-Attention	Lets each token attend to all other tokens	Students in a class looking at each other's notes
Query, Key, Value	Three projections of each token	Search term, book title, book content
Scaling (√d_k)	Prevents softmax from saturating	Keeping exam scores in a reasonable range
Causal Mask	Prevents seeing future tokens	No peeking at the answer key
Multi-Head Attention	Multiple attention patterns in parallel	Reading a poem for meaning, rhyme, and emotion simultaneously
FFN	Processes gathered information	Doing homework after collecting notes from classmates
Residual Connection	Preserves original input	Keeping your own notes even after a confusing lecture
Layer Norm	Stabilises numbers	Grading on a curve
Transformer Block	Attention + FFN + residual + norm	One complete round of classroom discussion and homework

📝 4.10 Exercises

Exercise 1: Embedding Exploration

Run step1_embedding.py and observe the embedding vectors for the characters 'a' and 'b'. Are they similar? Why or why not? (Hint: the model is untrained.) Modify embed_dim from 8 to 32 and observe how the vectors change.

Exercise 2: Attention Matrix Interpretation

Run step2_attention.py and study the attention weight matrix printed for "The cat". Which character pays the most attention to which other character? Now change the sentence to "aaaaaa" (all same characters). What do you expect the attention matrix to look like, and why?

Exercise 3: Masking Experiment

In the SelfAttention class, remove the causal mask (set mask=None always). Run the model and compare the attention weights with and without the mask. Write a paragraph explaining why GPT-style models need the mask but BERT-style models don't.

Exercise 4: Multi-Head Intuition

In step3_transformer_block.py, change num_heads from 4 to 1 (keeping embed_dim the same). How does the parameter count change? Why might using more heads be better even though the total parameters stay the same?

Exercise 5: Scale the Model

Modify step4_put_it_together.py to create a larger model with embed_dim=128, num_heads=8, and num_blocks=6. Calculate the expected parameter count by hand, then verify it against the code's output. How does it compare to GPT-2?

💭 4.11 Discussion Questions

The Attention Bottleneck: Self-attention computes a score between every pair of tokens, making its computational cost O(n^2) where n is the sequence length. If you double the sequence length, the cost quadruples. Why is this a problem for processing very long documents (like an entire novel)? What approaches might help?

Learned vs. Fixed Positional Encoding: Our final Mini-Transformer uses learned positional embeddings, while the original 2017 paper used sinusoidal (fixed) encodings. What are the trade-offs? Can a model with learned embeddings generalise to sequences longer than it was trained on?

Weight Tying: In our MiniTransformer, the token embedding matrix and the output head share the same weights (self.output_head.weight = self.token_embedding.weight). Why does this make intuitive sense? (Hint: think about what both layers represent.)

Why ReLU in the FFN? The feed-forward network uses ReLU (Rectified Linear Unit) as its non-linearity. What would happen if we removed the non-linearity entirely? Would the FFN still be useful? (Hint: think about what two consecutive linear layers reduce to.)

The Scaling Revolution: The same Transformer architecture powers models ranging from our toy Mini-Transformer to GPT-4 (unofficially estimated at 1.7 trillion parameters). What does this tell us about the relationship between architecture and scale in modern AI? Is architecture or data more important?
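
For the attention bottleneck question above, a few lines of arithmetic make the quadratic growth tangible. This sketch counts query-key scores only and ignores all other work; the function name and the defaults of 4 heads and 4 blocks are assumptions that mirror our Mini-Transformer.

```python
def attention_score_count(seq_len, num_heads=4, num_blocks=4):
    """Number of query-key scores computed in one forward pass."""
    return seq_len * seq_len * num_heads * num_blocks

for n in [128, 256, 512, 1024]:
    print(f"seq_len={n:5d} -> {attention_score_count(n):>12,} scores")
```

Each doubling of seq_len multiplies the count by four, which is the pattern to keep in mind when thinking about novel-length inputs.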

Tip

What's Next? In Chapter 5, you will take this Mini-Transformer and train it on real text. You'll watch it go from producing random garbage to generating coherent English — character by character. The architecture is ready. Now it's time to teach it to think.

Complete Source Code - Chapter 4

Below are the complete, runnable source files for this chapter. Every line is included.

Complete Code: step1_embedding.py

Python
"""
🟠 Level 3, Step 1: Embeddings — Turning Words into Numbers
=============================================================

Before a Transformer can process text, it needs to convert characters (or words)
into NUMBERS. This is called "embedding".

But there's a catch — the model also needs to know the ORDER of the characters.
"cat" and "tac" have the same characters but different meanings!

That's why we add "positional encoding" — a special pattern that tells the model
WHERE each character is in the sequence.

This script shows you:
  1. How to build a vocabulary (character → number)
  2. How token embedding works (number → vector)
  3. How positional encoding works (adding position information)
"""

import torch
import torch.nn as nn
import math

# ============================================================================
# 🎨 ANSI Colors for beautiful terminal output
# ============================================================================
class Colors:
    HEADER = '\033[95m'
    BLUE = '\033[94m'
    CYAN = '\033[96m'
    GREEN = '\033[92m'
    YELLOW = '\033[93m'
    RED = '\033[91m'
    BOLD = '\033[1m'
    DIM = '\033[2m'
    RESET = '\033[0m'

def print_header(text):
    print(f"\n{Colors.BOLD}{Colors.HEADER}{'='*60}")
    print(f"  {text}")
    print(f"{'='*60}{Colors.RESET}\n")

def print_step(num, text):
    print(f"{Colors.BOLD}{Colors.CYAN}📌 Step {num}: {text}{Colors.RESET}")

def print_info(text):
    print(f"  {Colors.DIM}{text}{Colors.RESET}")

def print_success(text):
    print(f"  {Colors.GREEN}✓ {text}{Colors.RESET}")

# ============================================================================
# 📖 STEP 1: Build a Vocabulary
# ============================================================================
# A vocabulary maps each unique character to a number (index).
# For example: 'a' → 0, 'b' → 1, 'c' → 2, ...
# This is the simplest form of "tokenization".
# Real models like GPT use "subword" tokenization (BPE), which breaks words
# into pieces like "play" + "ing". But character-level is easier to understand!
# ============================================================================

def build_vocabulary(text):
    """
    Build a character-level vocabulary from text.
    
    Returns:
        char_to_idx: Dictionary mapping character → number
        idx_to_char: Dictionary mapping number → character
        vocab_size:  Total number of unique characters
    """
    # Get all unique characters and sort them
    # sorted() ensures the mapping is consistent every time
    chars = sorted(list(set(text)))
    
    # Create the two-way mapping
    char_to_idx = {ch: i for i, ch in enumerate(chars)}
    idx_to_char = {i: ch for i, ch in enumerate(chars)}
    
    return char_to_idx, idx_to_char, len(chars)


def encode(text, char_to_idx):
    """Convert a string into a list of numbers using our vocabulary."""
    return [char_to_idx[ch] for ch in text]


def decode(indices, idx_to_char):
    """Convert a list of numbers back into a string."""
    return ''.join([idx_to_char[i] for i in indices])


# ============================================================================
# 📖 STEP 2: Token Embedding
# ============================================================================
# An embedding turns each character INDEX into a VECTOR (list of numbers).
# 
# Why? Because a single number (like 5) doesn't carry much meaning.
# But a vector (like [0.2, -0.5, 0.8, 0.1]) can represent complex relationships:
#   - Similar characters will have similar vectors
#   - The model LEARNS these vectors during training!
#
# Think of it like this:
#   Index 5 → just a label, like a student's roll number
#   Vector [0.2, -0.5, 0.8] → the student's actual abilities/personality
# ============================================================================

class TokenEmbedding(nn.Module):
    """
    Converts token indices into dense vectors.
    
    Args:
        vocab_size: Number of unique tokens
        embed_dim:  Size of each embedding vector
    """
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        # nn.Embedding is like a lookup table:
        # It stores a matrix of shape (vocab_size × embed_dim)
        # When you give it index 3, it returns row 3 of the matrix
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.embed_dim = embed_dim
    
    def forward(self, x):
        # x shape: (batch_size, sequence_length) — indices
        # output shape: (batch_size, sequence_length, embed_dim) — vectors
        return self.embedding(x)


# ============================================================================
# 📖 STEP 3: Positional Encoding
# ============================================================================
# The Transformer processes ALL tokens at once (not one-by-one like RNNs).
# This means it has NO idea about word order!
# "The cat sat on the mat" and "mat the on sat cat the" look the same to it.
#
# Positional encoding ADDS a unique pattern to each position (shown for embed_dim=8):
#   Position 0: add pattern [sin(0), cos(0), sin(0), cos(0), ...]
#   Position 1: add pattern [sin(1), cos(1), sin(0.1), cos(0.1), ...]
#   Position 2: add pattern [sin(2), cos(2), sin(0.2), cos(0.2), ...]
#
# The patterns use sin/cos waves at different frequencies so each position
# gets a UNIQUE fingerprint. The model can then learn to use this info!
# ============================================================================

class PositionalEncoding(nn.Module):
    """
    Adds positional information to embeddings using sinusoidal patterns.
    
    The famous formula from "Attention Is All You Need":
        PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
        PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    """
    def __init__(self, embed_dim, max_seq_len=512):
        super().__init__()
        
        # Create a matrix to store all positional encodings
        pe = torch.zeros(max_seq_len, embed_dim)
        
        # Position indices: [0, 1, 2, ..., max_seq_len-1]
        position = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)
        
        # Division term: creates different frequencies for each dimension
        # Even dimensions use sin, odd dimensions use cos
        div_term = torch.exp(
            torch.arange(0, embed_dim, 2).float() * (-math.log(10000.0) / embed_dim)
        )
        
        # Apply sin to even indices (0, 2, 4, ...)
        pe[:, 0::2] = torch.sin(position * div_term)
        
        # Apply cos to odd indices (1, 3, 5, ...)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        # Add batch dimension: (1, max_seq_len, embed_dim)
        pe = pe.unsqueeze(0)
        
        # Register as buffer (saved with model but not trained)
        self.register_buffer('pe', pe)
    
    def forward(self, x):
        """
        Args:
            x: Tensor of shape (batch_size, seq_len, embed_dim)
        Returns:
            x + positional encoding
        """
        seq_len = x.size(1)
        # Add positional encoding to the input
        # The position info is ADDED to the embedding, not concatenated
        return x + self.pe[:, :seq_len, :]


# ============================================================================
# 🚀 MAIN: Let's see it all in action!
# ============================================================================

if __name__ == '__main__':
    print_header("🟠 Level 3, Step 1: Embeddings & Positional Encoding")
    
    # --- Step 1: Build Vocabulary ---
    print_step(1, "Building the Vocabulary")
    
    sample_text = "The sun rises in the east and sets in the west. India is a beautiful country."
    print_info(f'Sample text: "{sample_text}"')
    
    char_to_idx, idx_to_char, vocab_size = build_vocabulary(sample_text)
    
    print(f"\n  {Colors.YELLOW}Vocabulary ({vocab_size} unique characters):{Colors.RESET}")
    for ch, idx in sorted(char_to_idx.items(), key=lambda x: x[1]):
        display_ch = repr(ch) if ch == ' ' else f"'{ch}'"
        print(f"    {display_ch:6s} → {idx}")
    
    print_success(f"Built vocabulary with {vocab_size} characters")
    
    # --- Step 2: Encode Text ---
    print_step(2, "Encoding Text into Numbers")
    
    test_text = "The sun"
    encoded = encode(test_text, char_to_idx)
    print(f"\n  Text:    \"{test_text}\"")
    print(f"  Encoded: {encoded}")
    print(f"  Decoded: \"{decode(encoded, idx_to_char)}\"")
    print_success("Text successfully converted to numbers!")
    
    # --- Step 3: Token Embedding ---
    print_step(3, "Converting Numbers to Vectors (Token Embedding)")
    
    embed_dim = 8  # Small for visualization (real models use 768+)
    token_emb = TokenEmbedding(vocab_size, embed_dim)
    
    # Convert our encoded text to a tensor
    input_tensor = torch.tensor([encoded])  # Shape: (1, 7) — batch=1, seq=7
    print(f"\n  Input shape:  {list(input_tensor.shape)} (batch_size=1, seq_len={len(encoded)})")
    
    # Get embeddings
    embedded = token_emb(input_tensor)
    print(f"  Output shape: {list(embedded.shape)} (batch_size=1, seq_len={len(encoded)}, embed_dim={embed_dim})")
    
    print(f"\n  {Colors.YELLOW}Embedding vectors for each character:{Colors.RESET}")
    for i, ch in enumerate(test_text):
        vec = embedded[0, i].detach().numpy()
        vec_str = ', '.join([f'{v:+.3f}' for v in vec])
        display_ch = 'SPC' if ch == ' ' else ch
        print(f"    '{display_ch}' → [{vec_str}]")
    
    print_success("Each character is now a rich vector of numbers!")
    
    # --- Step 4: Positional Encoding ---
    print_step(4, "Adding Positional Information")
    
    pos_enc = PositionalEncoding(embed_dim, max_seq_len=100)
    
    # Show the positional encoding patterns
    print(f"\n  {Colors.YELLOW}Positional encoding patterns:{Colors.RESET}")
    for pos in range(min(5, len(test_text))):
        pe_vals = pos_enc.pe[0, pos].numpy()
        pe_str = ', '.join([f'{v:+.3f}' for v in pe_vals])
        print(f"    Position {pos}: [{pe_str}]")
    
    # Apply positional encoding
    embedded_with_pos = pos_enc(embedded)
    
    print(f"\n  {Colors.YELLOW}Before vs After positional encoding:{Colors.RESET}")
    for i, ch in enumerate(test_text[:4]):
        before = embedded[0, i].detach().numpy()
        after = embedded_with_pos[0, i].detach().numpy()
        display_ch = 'SPC' if ch == ' ' else ch
        before_str = ', '.join([f'{v:+.3f}' for v in before[:4]])
        after_str = ', '.join([f'{v:+.3f}' for v in after[:4]])
        print(f"    '{display_ch}' before: [{before_str}, ...]")
        print(f"    '{display_ch}' after:  [{after_str}, ...]")
        print()
    
    print_success("Position information added! Same character at different positions now has different vectors.")
    
    # --- Summary ---
    print_header("📝 Summary")
    print(f"""  The embedding pipeline:
  
    Text: "The sun"
       │
       ▼
    {Colors.CYAN}Tokenize{Colors.RESET}: Convert characters to indices
       │  'T'→{char_to_idx.get('T', '?')}, 'h'→{char_to_idx.get('h', '?')}, 'e'→{char_to_idx.get('e', '?')}, ...
       │
       ▼
    {Colors.CYAN}Embed{Colors.RESET}: Look up vector for each index
       │  Index {char_to_idx.get('T', '?')} → [{', '.join([f'{v:.2f}' for v in embedded[0, 0].detach().numpy()[:3]])}, ...]
       │
       ▼
    {Colors.CYAN}Add Position{Colors.RESET}: Add sinusoidal position pattern
       │  Vector + Position Pattern = Final Embedding
       │
       ▼
    Ready for Attention! → Go to step2_attention.py
""")
    
    print(f"  {Colors.BOLD}{Colors.GREEN}✅ Step 1 Complete! Next: python step2_attention.py{Colors.RESET}\n")
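
As a quick cross-check on the sinusoidal formula used in PositionalEncoding above, the values can also be computed one position at a time in plain Python. This is a sketch for hand verification only; the function name `positional_encoding` is introduced here and is not part of step1_embedding.py.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding for one position: sin on even dims, cos on odd dims."""
    pe = []
    for even_dim in range(0, d_model, 2):
        # even_dim plays the role of "2i" in the formula.
        angle = pos / (10000 ** (even_dim / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:d_model]

for pos in range(3):
    print(pos, [round(v, 3) for v in positional_encoding(pos, d_model=8)])
```

Position 0 always comes out as alternating 0 and 1 (sin 0 and cos 0), and the later dimensions change more and more slowly as the position increases, which is what gives each position its own fingerprint.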

Complete Code: step2_attention.py

Python
"""
🟠 Level 3, Step 2: Self-Attention — The Core of Transformers
===============================================================

This is THE most important mechanism in modern AI.

Self-attention allows each token in a sequence to "look at" every other token
and decide how much to pay attention to it.

We'll build it from scratch:
  1. Create Query (Q), Key (K), Value (V) matrices
  2. Compute attention scores
  3. Apply causal mask (no peeking at the future!)
  4. Softmax to get probabilities
  5. Weighted sum of values

By the end, you'll understand the formula:
    Attention(Q, K, V) = softmax(Q·K^T / √d_k) · V
"""

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

# ============================================================================
# 🎨 Colors
# ============================================================================
class Colors:
    HEADER = '\033[95m'
    BLUE = '\033[94m'
    CYAN = '\033[96m'
    GREEN = '\033[92m'
    YELLOW = '\033[93m'
    RED = '\033[91m'
    BOLD = '\033[1m'
    DIM = '\033[2m'
    RESET = '\033[0m'

def print_header(text):
    print(f"\n{Colors.BOLD}{Colors.HEADER}{'='*60}")
    print(f"  {text}")
    print(f"{'='*60}{Colors.RESET}\n")

def print_step(num, text):
    print(f"{Colors.BOLD}{Colors.CYAN}📌 Step {num}: {text}{Colors.RESET}")

def print_info(text):
    print(f"  {Colors.DIM}{text}{Colors.RESET}")

def print_success(text):
    print(f"  {Colors.GREEN}✓ {text}{Colors.RESET}")

def print_matrix(name, matrix, row_labels=None, col_labels=None):
    """Pretty-print a 2D matrix with labels."""
    print(f"\n  {Colors.YELLOW}{name}:{Colors.RESET}")
    rows, cols = matrix.shape
    
    # Column headers
    if col_labels:
        header = "         " + "  ".join([f"{l:>7s}" for l in col_labels])
        print(f"  {header}")
        print(f"  {'─' * len(header)}")
    
    for i in range(rows):
        label = f"  {row_labels[i]:>6s} │ " if row_labels else f"  Row {i}: "
        vals = "  ".join([f"{matrix[i,j]:>7.3f}" for j in range(cols)])
        print(f"{label}{vals}")


# ============================================================================
# 🧠 SELF-ATTENTION FROM SCRATCH
# ============================================================================

class SelfAttention(nn.Module):
    """
    Single-head self-attention mechanism.
    
    This is the core building block of the Transformer.
    
    How it works:
        1. Take input X (sequence of vectors)
        2. Create three versions: Q (Query), K (Key), V (Value)
        3. Compute attention = softmax(Q·K^T / √d) · V
        4. Return attention output
    """
    
    def __init__(self, embed_dim):
        super().__init__()
        self.embed_dim = embed_dim
        
        # Three weight matrices — these are LEARNED during training!
        # W_q: transforms input into "what am I looking for?"
        # W_k: transforms input into "what do I contain?"  
        # W_v: transforms input into "what information do I give?"
        self.W_q = nn.Linear(embed_dim, embed_dim, bias=False)
        self.W_k = nn.Linear(embed_dim, embed_dim, bias=False)
        self.W_v = nn.Linear(embed_dim, embed_dim, bias=False)
        
        # Scaling factor to prevent dot products from getting too large
        self.scale = math.sqrt(embed_dim)
    
    def forward(self, x, mask=None, verbose=False):
        """
        Args:
            x: Input tensor of shape (batch, seq_len, embed_dim)
            mask: Optional causal mask
            verbose: If True, print intermediate values
        
        Returns:
            output: Attention output, same shape as input
            attention_weights: The attention matrix (for visualization)
        """
        batch_size, seq_len, _ = x.shape
        
        # ===== STEP 1: Create Q, K, V =====
        # Each is a different "view" of the same input
        Q = self.W_q(x)  # (batch, seq_len, embed_dim)
        K = self.W_k(x)  # (batch, seq_len, embed_dim)
        V = self.W_v(x)  # (batch, seq_len, embed_dim)
        
        if verbose:
            print_step("A", "Computed Q (Query), K (Key), V (Value)")
            print_info(f"Q shape: {list(Q.shape)}")
            print_info(f"K shape: {list(K.shape)}")
            print_info(f"V shape: {list(V.shape)}")
        
        # ===== STEP 2: Compute Attention Scores =====
        # Score = Q · K^T (dot product between queries and keys)
        # High score = this query is very interested in this key
        scores = torch.matmul(Q, K.transpose(-2, -1))  # (batch, seq_len, seq_len)
        
        if verbose:
            print_step("B", "Computed raw attention scores (Q · K^T)")
            print_info(f"Scores shape: {list(scores.shape)}, so each token has a score for every token (itself included)")
        
        # ===== STEP 3: Scale =====
        # Divide by √d_k to prevent scores from getting too large
        # Large scores → softmax becomes too "peaked" (one token gets all attention)
        # Scaled scores → softer distribution → better learning
        scores = scores / self.scale
        
        if verbose:
            print_step("C", f"Scaled scores by 1/√{self.embed_dim} = 1/{self.scale:.2f}")
        
        # ===== STEP 4: Apply Causal Mask (Optional) =====
        # In GPT-style models, each token can only attend to itself and tokens BEFORE it
        # We set future positions to -infinity so softmax turns them to 0
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
            
            if verbose:
                print_step("D", "Applied causal mask (future tokens set to -∞)")
                print_info("This prevents the model from 'cheating' by looking ahead!")
        
        # ===== STEP 5: Softmax =====
        # Convert scores to probabilities (0 to 1, summing to 1)
        attention_weights = F.softmax(scores, dim=-1)
        
        if verbose:
            print_step("E", "Applied softmax → attention weights (probabilities)")
            print_info("Each row sums to 1.0 — it's a probability distribution!")
        
        # ===== STEP 6: Weighted Sum of Values =====
        # Multiply attention weights by V to get the final output
        # Each token's output is a weighted combination of ALL values
        output = torch.matmul(attention_weights, V)  # (batch, seq_len, embed_dim)
        
        if verbose:
            print_step("F", "Computed output = attention_weights × V")
            print_info(f"Output shape: {list(output.shape)} — same as input!")
            print_success("Each token now contains information from tokens it attended to!")
        
        return output, attention_weights


# ============================================================================
# 🎭 Create Causal Mask
# ============================================================================

def create_causal_mask(seq_len):
    """
    Create a lower-triangular mask for autoregressive (GPT-style) models.
    
    The mask looks like:
        [[1, 0, 0, 0],     ← token 0 can only see token 0
         [1, 1, 0, 0],     ← token 1 can see tokens 0, 1
         [1, 1, 1, 0],     ← token 2 can see tokens 0, 1, 2
         [1, 1, 1, 1]]     ← token 3 can see all tokens
    """
    mask = torch.tril(torch.ones(seq_len, seq_len))
    return mask.unsqueeze(0)  # Add batch dimension


# ============================================================================
# 🚀 MAIN: Interactive Demo
# ============================================================================

if __name__ == '__main__':
    print_header("🟠 Level 3, Step 2: Self-Attention from Scratch")
    
    # --- Setup ---
    torch.manual_seed(42)  # For reproducible results
    
    # Our example sentence (character-level)
    sentence = "The cat"
    tokens = list(sentence)
    seq_len = len(tokens)
    embed_dim = 8  # Small for visualization
    
    print(f"  {Colors.BOLD}Example sentence: \"{sentence}\"{Colors.RESET}")
    print(f"  Tokens: {tokens}")
    print(f"  Sequence length: {seq_len}")
    print(f"  Embedding dimension: {embed_dim}")
    
    # Create random embeddings (in a real model, these come from Step 1)
    x = torch.randn(1, seq_len, embed_dim)
    
    # --- Build Attention ---
    print_header("🧠 Building Self-Attention")
    
    attention = SelfAttention(embed_dim)
    
    # --- Without Mask (bidirectional) ---
    print_header("📊 Attention WITHOUT Causal Mask (Bidirectional)")
    print_info("Every token can see every other token")
    
    output_bi, weights_bi = attention(x, mask=None, verbose=True)
    
    # Show attention matrix
    print_matrix(
        "Attention Weights (who pays attention to whom?)",
        weights_bi[0].detach(),
        row_labels=tokens,
        col_labels=tokens
    )
    
    # Visual attention grid
    print(f"\n  {Colors.YELLOW}Visual Attention Grid:{Colors.RESET}")
    print("  (█ = high attention, ▓ = medium, ░ = low attention)\n")
    
    # Single-space join gives a 4-character stride, matching the grid cells below
    header = "         " + " ".join([f"{t:>3s}" for t in tokens])
    print(f"  {header}")
    for i, token in enumerate(tokens):
        row = f"  {token:>6s} │ "
        for j in range(seq_len):
            w = weights_bi[0, i, j].item()
            if w > 0.3:
                row += f" {Colors.GREEN}██{Colors.RESET} "
            elif w > 0.15:
                row += f" {Colors.YELLOW}▓▓{Colors.RESET} "
            else:
                row += f" {Colors.DIM}░░{Colors.RESET} "
        print(row)
    
    # --- With Causal Mask ---
    print_header("🎭 Attention WITH Causal Mask (Autoregressive / GPT-style)")
    print_info("Each token can only see itself and tokens BEFORE it")
    
    causal_mask = create_causal_mask(seq_len)
    
    print(f"\n  {Colors.YELLOW}Causal Mask:{Colors.RESET}")
    for i, token in enumerate(tokens):
        row = f"  {token:>6s} │ "
        for j in range(seq_len):
            if causal_mask[0, i, j] == 1:
                row += f" {Colors.GREEN}✓{Colors.RESET}  "
            else:
                row += f" {Colors.RED}✗{Colors.RESET}  "
        print(row)
    
    output_causal, weights_causal = attention(x, mask=causal_mask, verbose=True)
    
    print_matrix(
        "Causal Attention Weights",
        weights_causal[0].detach(),
        row_labels=tokens,
        col_labels=tokens
    )
    
    # Visual grid for causal attention
    print(f"\n  {Colors.YELLOW}Visual Causal Attention Grid:{Colors.RESET}")
    print("  (█ = high attention, ▓ = medium, ░ = low, ✗ = masked)\n")
    
    # Single-space join gives a 4-character stride, matching the grid cells below
    header = "         " + " ".join([f"{t:>3s}" for t in tokens])
    print(f"  {header}")
    for i, token in enumerate(tokens):
        row = f"  {token:>6s} │ "
        for j in range(seq_len):
            if causal_mask[0, i, j] == 0:
                row += f" {Colors.RED}✗✗{Colors.RESET} "
            else:
                w = weights_causal[0, i, j].item()
                if w > 0.3:
                    row += f" {Colors.GREEN}██{Colors.RESET} "
                elif w > 0.15:
                    row += f" {Colors.YELLOW}▓▓{Colors.RESET} "
                else:
                    row += f" {Colors.DIM}░░{Colors.RESET} "
        print(row)
    
    # --- Summary ---
    print_header("📝 Summary")
    print(f"""  Self-Attention in 6 steps:
  
    1. {Colors.CYAN}Create Q, K, V{Colors.RESET} from input using learned weight matrices
    2. {Colors.CYAN}Score{Colors.RESET} = Q · K^T  (how much does each token care about others?)
    3. {Colors.CYAN}Scale{Colors.RESET} by 1/√d_k (keep numbers reasonable)
    4. {Colors.CYAN}Mask{Colors.RESET} future tokens (for GPT-style models)
    5. {Colors.CYAN}Softmax{Colors.RESET} to get probabilities
    6. {Colors.CYAN}Output{Colors.RESET} = attention_weights × V (weighted combination)

  {Colors.BOLD}The Formula:{Colors.RESET}
    Attention(Q, K, V) = softmax(Q·K^T / √d_k) · V

  {Colors.BOLD}{Colors.GREEN}✅ Step 2 Complete! Next: python step3_transformer_block.py{Colors.RESET}
""")

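The six steps in the summary above can be traced by hand on a tiny input. The sketch below is not part of step2_attention.py. It uses plain Python lists instead of tensors and skips the learned W_q, W_k, W_v projections: Q, K and V are given directly, an assumption made only to keep the numbers readable. The helper names `softmax` and `tiny_attention` are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list; -inf entries get weight 0."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def tiny_attention(Q, K, V, causal=False):
    """Scaled dot-product attention on nested lists of shape (seq_len, d_k)."""
    seq_len, d_k = len(Q), len(Q[0])
    scale = math.sqrt(d_k)
    weights, output = [], []
    for i in range(seq_len):
        # Steps 2-3: score query i against every key, then divide by sqrt(d_k)
        scores = [sum(q * k for q, k in zip(Q[i], K[j])) / scale
                  for j in range(seq_len)]
        # Step 4: hide future positions (j > i) from token i
        if causal:
            scores = [s if j <= i else float("-inf")
                      for j, s in enumerate(scores)]
        # Step 5: turn the scores into a probability distribution
        w = softmax(scores)
        # Step 6: weighted sum of the value vectors
        out = [sum(w[j] * V[j][c] for j in range(seq_len))
               for c in range(len(V[0]))]
        weights.append(w)
        output.append(out)
    return output, weights

# Illustrative values, chosen by hand (not taken from the tutorial scripts)
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

out, w = tiny_attention(Q, K, V, causal=True)
for row in w:
    print([round(p, 3) for p in row])
```

Under the causal mask the first row of weights is [1.0, 0.0, 0.0], so the first token's output is exactly its own value vector. Calling `tiny_attention(Q, K, V)` without the mask gives every position a non-zero weight.
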
Complete Code: step3_transformer_block.py

Python
"""
🟠 Level 3, Step 3: The Transformer Block
============================================

A Transformer Block combines several components into one powerful unit:

    1. Multi-Head Self-Attention — look at the sequence from multiple perspectives
    2. Feed-Forward Network — process the information  
    3. Layer Normalization — keep numbers stable
    4. Residual Connections — preserve original information

This is the building block that gets stacked to make GPT, Claude, etc.
"""

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

# ============================================================================
# 🎨 Colors
# ============================================================================
class Colors:
    HEADER = '\033[95m'
    BLUE = '\033[94m'
    CYAN = '\033[96m'
    GREEN = '\033[92m'
    YELLOW = '\033[93m'
    RED = '\033[91m'
    BOLD = '\033[1m'
    DIM = '\033[2m'
    RESET = '\033[0m'

def print_header(text):
    print(f"\n{Colors.BOLD}{Colors.HEADER}{'='*60}")
    print(f"  {text}")
    print(f"{'='*60}{Colors.RESET}\n")

def print_step(num, text):
    print(f"{Colors.BOLD}{Colors.CYAN}📌 Step {num}: {text}{Colors.RESET}")

def print_info(text):
    print(f"  {Colors.DIM}{text}{Colors.RESET}")

def print_success(text):
    print(f"  {Colors.GREEN}✓ {text}{Colors.RESET}")

def count_parameters(module, name=""):
    """Count trainable parameters in a module; print the count if a name is given."""
    trainable = sum(p.numel() for p in module.parameters() if p.requires_grad)
    if name:
        print(f"    {name:30s}: {trainable:>8,} parameters")
    return trainable


# ============================================================================
# 🔀 MULTI-HEAD ATTENTION
# ============================================================================
# Instead of one attention head, we use MULTIPLE heads.
# Each head learns to focus on different types of relationships:
#   Head 1 might focus on: "what word comes before me?"
#   Head 2 might focus on: "what is the subject of this sentence?"
#   Head 3 might focus on: "is there a negation word nearby?"
#
# We split the embedding dimension among heads:
#   embed_dim=128, num_heads=4 → each head works with 32 dimensions
# ============================================================================

class MultiHeadAttention(nn.Module):
    """
    Multi-Head Self-Attention.
    
    Splits the input into multiple "heads", runs attention on each,
    then combines the results.
    """
    
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        
        assert embed_dim % num_heads == 0, \
            f"embed_dim ({embed_dim}) must be divisible by num_heads ({num_heads})"
        
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads  # Dimension per head
        
        # One big linear layer for Q, K, V (more efficient than 3 separate ones)
        self.W_qkv = nn.Linear(embed_dim, 3 * embed_dim, bias=False)
        
        # Output projection: combines all heads back together
        self.W_out = nn.Linear(embed_dim, embed_dim, bias=False)
        
        self.scale = math.sqrt(self.head_dim)
    
    def forward(self, x, mask=None):
        """
        Args:
            x: (batch, seq_len, embed_dim)
            mask: Optional causal mask
        Returns:
            output: (batch, seq_len, embed_dim)
        """
        batch_size, seq_len, _ = x.shape
        
        # Step 1: Compute Q, K, V all at once
        qkv = self.W_qkv(x)  # (batch, seq_len, 3 * embed_dim)
        
        # Step 2: Split into Q, K, V
        qkv = qkv.reshape(batch_size, seq_len, 3, self.num_heads, self.head_dim)
        qkv = qkv.permute(2, 0, 3, 1, 4)  # (3, batch, heads, seq_len, head_dim)
        Q, K, V = qkv[0], qkv[1], qkv[2]
        
        # Step 3: Compute attention scores for ALL heads at once
        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale
        # scores shape: (batch, heads, seq_len, seq_len)
        
        # Step 4: Apply causal mask
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        
        # Step 5: Softmax
        attention_weights = F.softmax(scores, dim=-1)
        
        # Step 6: Weighted sum of values
        output = torch.matmul(attention_weights, V)
        # output shape: (batch, heads, seq_len, head_dim)
        
        # Step 7: Combine heads back together
        output = output.transpose(1, 2)  # (batch, seq_len, heads, head_dim)
        output = output.reshape(batch_size, seq_len, self.embed_dim)
        
        # Step 8: Final projection
        output = self.W_out(output)
        
        return output


# ============================================================================
# 🔧 FEED-FORWARD NETWORK
# ============================================================================
# After attention figures out RELATIONSHIPS between tokens,
# the FFN PROCESSES that information.
# 
# It's a simple 2-layer network:
#   Input → Expand (4x bigger) → ReLU → Shrink (back to original) → Output
#
# The "expand then shrink" pattern gives the model a larger space to
# compute in, then compresses the result back down.
# ============================================================================

class FeedForward(nn.Module):
    """
    Position-wise Feed-Forward Network.
    
    Each position (token) is processed INDEPENDENTLY through the same network.
    It's like giving each student the same worksheet to fill out.
    """
    
    def __init__(self, embed_dim, ff_dim=None):
        super().__init__()
        
        # Default: expand to 4x the embedding dimension
        if ff_dim is None:
            ff_dim = 4 * embed_dim
        
        self.net = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),    # Expand
            nn.ReLU(),                        # Non-linearity (the "thinking" part)
            nn.Linear(ff_dim, embed_dim),     # Shrink back
        )
    
    def forward(self, x):
        return self.net(x)


# ============================================================================
# 🧱 THE COMPLETE TRANSFORMER BLOCK
# ============================================================================
# This combines everything:
#
#   Input
#     │
#     ├────────────────────┐  (Residual)
#     ▼                    │
#   LayerNorm              │
#     ▼                    │
#   Multi-Head Attention   │
#     ▼                    │
#   ADD ◄──────────────────┘
#     │
#     ├────────────────────┐  (Residual)
#     ▼                    │
#   LayerNorm              │
#     ▼                    │
#   Feed-Forward           │
#     ▼                    │
#   ADD ◄──────────────────┘
#     │
#   Output
# ============================================================================

class TransformerBlock(nn.Module):
    """
    A single Transformer block.
    
    This is the fundamental repeating unit in models like GPT.
    Stack many of these together to get a full Transformer model.
    """
    
    def __init__(self, embed_dim, num_heads, ff_dim=None, dropout=0.1):
        super().__init__()
        
        # Layer Normalization: keeps values in a reasonable range
        # Think of it as "grading on a curve" — normalizes each student's scores
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        
        # Multi-Head Self-Attention
        self.attention = MultiHeadAttention(embed_dim, num_heads)
        
        # Feed-Forward Network
        self.ffn = FeedForward(embed_dim, ff_dim)
        
        # Dropout: randomly "turns off" some neurons during training
        # Helps prevent the model from memorizing (overfitting) the training data
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x, mask=None):
        """
        Args:
            x: (batch, seq_len, embed_dim)
            mask: Optional causal mask
        Returns:
            output: (batch, seq_len, embed_dim) — same shape as input!
        """
        # === Sub-layer 1: Attention with Residual Connection ===
        # 1. Normalize
        normed = self.norm1(x)
        # 2. Apply attention
        attended = self.attention(normed, mask=mask)
        # 3. Dropout (only during training)
        attended = self.dropout(attended)
        # 4. Residual connection: ADD original input back
        #    This ensures information isn't lost through the attention layer
        x = x + attended
        
        # === Sub-layer 2: FFN with Residual Connection ===
        # 1. Normalize
        normed = self.norm2(x)
        # 2. Apply feed-forward
        fed_forward = self.ffn(normed)
        # 3. Dropout
        fed_forward = self.dropout(fed_forward)
        # 4. Residual connection
        x = x + fed_forward
        
        return x


# ============================================================================
# 🚀 MAIN: See it in action
# ============================================================================

if __name__ == '__main__':
    print_header("🟠 Level 3, Step 3: The Transformer Block")
    
    torch.manual_seed(42)
    
    # Configuration
    embed_dim = 32    # Embedding dimension
    num_heads = 4     # Number of attention heads
    seq_len = 8       # Sequence length
    batch_size = 1
    
    print(f"  {Colors.BOLD}Configuration:{Colors.RESET}")
    print(f"    Embedding dimension: {embed_dim}")
    print(f"    Number of heads:     {num_heads}")
    print(f"    Head dimension:      {embed_dim // num_heads}")
    print(f"    Sequence length:     {seq_len}")
    print(f"    Feed-forward dim:    {4 * embed_dim}")
    
    # --- Build Components ---
    print_header("🔧 Building Components")
    
    print_step(1, "Multi-Head Attention")
    mha = MultiHeadAttention(embed_dim, num_heads)
    count_parameters(mha, "Multi-Head Attention")
    print_info(f"  → {num_heads} heads, each with dim={embed_dim // num_heads}")
    
    print()
    print_step(2, "Feed-Forward Network")
    ffn = FeedForward(embed_dim)
    count_parameters(ffn, "Feed-Forward Network")
    print_info(f"  → Expand: {embed_dim} → {4*embed_dim} → {embed_dim}")
    
    print()
    print_step(3, "Layer Normalization")
    ln = nn.LayerNorm(embed_dim)
    count_parameters(ln, "Layer Norm (each of 2)")
    print_info("  → Normalizes each token's vector to mean=0, std=1, then applies a learned scale and shift")
    
    # --- Build Full Transformer Block ---
    print_header("🧱 Complete Transformer Block")
    
    block = TransformerBlock(embed_dim, num_heads)
    total_params = count_parameters(block, "Total Transformer Block")
    
    print(f"\n  {Colors.YELLOW}Architecture:{Colors.RESET}")
    print(f"""
    ┌────────────────────────────────────┐
    │           INPUT ({embed_dim}d)              │
    └────────────────┬───────────────────┘
                     │
                     ├──────────────┐ (residual)
                     ▼              │
              ┌─────────────┐      │
              │  LayerNorm  │      │
              └──────┬──────┘      │
                     ▼              │
              ┌─────────────┐      │
              │  Multi-Head │      │
              │  Attention  │      │
              │  ({num_heads} heads)  │      │
              └──────┬──────┘      │
                     ▼              │
                   ADD ◄───────────┘
                     │
                     ├──────────────┐ (residual)
                     ▼              │
              ┌─────────────┐      │
              │  LayerNorm  │      │
              └──────┬──────┘      │
                     ▼              │
              ┌─────────────┐      │
              │ Feed-Forward│      │
              │ {embed_dim}→{4*embed_dim}→{embed_dim}  │      │
              └──────┬──────┘      │
                     ▼              │
                   ADD ◄───────────┘
                     │
    ┌────────────────┴───────────────────┐
    │          OUTPUT ({embed_dim}d)              │
    └────────────────────────────────────┘
    """)
    
    # --- Forward Pass ---
    print_header("🔄 Running a Forward Pass")
    
    # Create causal mask
    mask = torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0)
    
    # Random input (simulating embedded tokens)
    x = torch.randn(batch_size, seq_len, embed_dim)
    
    print_step(1, "Input")
    print_info(f"Shape: {list(x.shape)} (batch={batch_size}, seq={seq_len}, dim={embed_dim})")
    print_info(f"First token vector (first 8 dims): [{', '.join(f'{v:.3f}' for v in x[0,0,:8].tolist())}]")
    
    print()
    print_step(2, "Processing through Transformer Block...")
    output = block(x, mask=mask)
    
    print()
    print_step(3, "Output")
    print_info(f"Shape: {list(output.shape)} — Same as input! ✓")
    print_info(f"First token vector (first 8 dims): [{', '.join(f'{v:.3f}' for v in output[0,0,:8].tolist())}]")
    
    print(f"\n  {Colors.YELLOW}Notice:{Colors.RESET}")
    print(f"  → Input shape  = {list(x.shape)}")
    print(f"  → Output shape = {list(output.shape)}")
    print(f"  → {Colors.GREEN}Shapes are identical!{Colors.RESET} This means we can STACK blocks.")
    print(f"    The output of Block 1 becomes the input to Block 2!")
    
    # --- Stacking Demo ---
    print_header("📚 Stacking Multiple Blocks")
    
    num_blocks = 4
    blocks = nn.ModuleList([
        TransformerBlock(embed_dim, num_heads) for _ in range(num_blocks)
    ])
    
    # Pass through all blocks
    current = x
    for i, b in enumerate(blocks):
        current = b(current, mask=mask)
        print(f"  Block {i+1}: {list(current.shape)} ✓")
    
    stack_params = sum(count_parameters(b) for b in blocks)
    print(f"\n  Total parameters in {num_blocks}-block stack: {Colors.BOLD}{stack_params:,}{Colors.RESET}")
    
    # --- Summary ---
    print_header("📝 Summary")
    print(f"""  A Transformer Block contains:
  
    1. {Colors.CYAN}Multi-Head Attention{Colors.RESET} — learns relationships between tokens
    2. {Colors.CYAN}Feed-Forward Network{Colors.RESET} — processes the information
    3. {Colors.CYAN}Layer Normalization{Colors.RESET}  — keeps numbers stable
    4. {Colors.CYAN}Residual Connections{Colors.RESET} — preserves original information
    
  Key insight: {Colors.BOLD}Input and output shapes are the same!{Colors.RESET}
  This means we can stack as many blocks as we want.
  
  More blocks = room for more abstract patterns (a rough intuition, not a guarantee):
    Block 1: Basic patterns (which characters go together)
    Block 2-3: Higher-level patterns (word structure)  
    Block 4+: Complex patterns (meaning, context)

  {Colors.BOLD}{Colors.GREEN}✅ Step 3 Complete! Next: python step4_put_it_together.py{Colors.RESET}
""")

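The parameter counts that step3_transformer_block.py prints can be checked by hand. The sketch below assumes the layer shapes defined above: bias-free attention projections, a feed-forward network with a 4x expansion and biases, and two LayerNorms that each hold a scale and a shift vector. The function name `block_param_count` is hypothetical.

```python
def block_param_count(embed_dim, ff_dim=None):
    """Count the parameters of one Transformer block from its layer shapes."""
    ff_dim = ff_dim or 4 * embed_dim
    # W_qkv maps embed_dim -> 3*embed_dim, W_out maps embed_dim -> embed_dim (no biases)
    attention = embed_dim * 3 * embed_dim + embed_dim * embed_dim
    # Two Linear layers, each with a weight matrix and a bias vector
    ffn = (embed_dim * ff_dim + ff_dim) + (ff_dim * embed_dim + embed_dim)
    # Two LayerNorms, each with a scale (weight) and a shift (bias) vector
    layer_norms = 2 * (2 * embed_dim)
    return {
        "attention": attention,
        "ffn": ffn,
        "layer_norms": layer_norms,
        "total": attention + ffn + layer_norms,
    }

counts = block_param_count(32)
for name, n in counts.items():
    print(f"{name:12s}: {n:>7,}")
print(f"4-block stack: {4 * counts['total']:,}")
```

For embed_dim=32 this gives 4,096 + 8,352 + 128 = 12,576 parameters per block. The number of heads does not appear in the formula because the heads only split the existing embed_dim, and dropout has no parameters.
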
Complete Code: step4_put_it_together.py

Python
"""
🟠 Level 3, Step 4: Putting It All Together — A Complete Mini-Transformer
==========================================================================

Now we assemble all the pieces from Steps 1-3 into a COMPLETE model:

    Text → Tokenize → Embed → Position → [Transformer Blocks] → Output Logits

This is essentially a tiny version of GPT!
"""

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

# ============================================================================
# 🎨 Colors
# ============================================================================
class Colors:
    HEADER = '\033[95m'
    BLUE = '\033[94m'
    CYAN = '\033[96m'
    GREEN = '\033[92m'
    YELLOW = '\033[93m'
    RED = '\033[91m'
    BOLD = '\033[1m'
    DIM = '\033[2m'
    RESET = '\033[0m'

def print_header(text):
    print(f"\n{Colors.BOLD}{Colors.HEADER}{'='*60}")
    print(f"  {text}")
    print(f"{'='*60}{Colors.RESET}\n")

def print_step(num, text):
    print(f"{Colors.BOLD}{Colors.CYAN}📌 Step {num}: {text}{Colors.RESET}")

def print_info(text):
    print(f"  {Colors.DIM}{text}{Colors.RESET}")

def print_success(text):
    print(f"  {Colors.GREEN}✓ {text}{Colors.RESET}")


# ============================================================================
# 🔀 Multi-Head Attention (from Step 2 & 3)
# ============================================================================
class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.W_qkv = nn.Linear(embed_dim, 3 * embed_dim, bias=False)
        self.W_out = nn.Linear(embed_dim, embed_dim, bias=False)
        self.scale = math.sqrt(self.head_dim)
    
    def forward(self, x, mask=None):
        B, T, C = x.shape
        qkv = self.W_qkv(x).reshape(B, T, 3, self.num_heads, self.head_dim)
        qkv = qkv.permute(2, 0, 3, 1, 4)
        Q, K, V = qkv[0], qkv[1], qkv[2]
        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        weights = F.softmax(scores, dim=-1)
        out = torch.matmul(weights, V)
        out = out.transpose(1, 2).reshape(B, T, C)
        return self.W_out(out)


# ============================================================================
# 🔧 Feed-Forward Network (from Step 3)
# ============================================================================
class FeedForward(nn.Module):
    def __init__(self, embed_dim, ff_dim=None):
        super().__init__()
        ff_dim = ff_dim or 4 * embed_dim
        self.net = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, embed_dim),
        )
    
    def forward(self, x):
        return self.net(x)


# ============================================================================
# 🧱 Transformer Block (from Step 3)
# ============================================================================
class TransformerBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.attention = MultiHeadAttention(embed_dim, num_heads)
        self.ffn = FeedForward(embed_dim)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x, mask=None):
        x = x + self.dropout(self.attention(self.norm1(x), mask))
        x = x + self.dropout(self.ffn(self.norm2(x)))
        return x


# ============================================================================
# 🏗️ THE COMPLETE MINI-TRANSFORMER MODEL
# ============================================================================
# This is it! The full model that can:
#   1. Take in a sequence of character indices
#   2. Process them through embeddings + transformer blocks
#   3. Output probabilities for the NEXT character
#
# Architecture:
#   Input indices → Token Embedding → + Positional Embedding
#                     → TransformerBlock × N
#                     → LayerNorm → Linear → Logits
# ============================================================================

class MiniTransformer(nn.Module):
    """
    A complete mini-Transformer language model.
    
    This is a simplified version of GPT:
    - Takes character indices as input
    - Predicts the next character
    - Can generate text autoregressively
    """
    
    def __init__(self, vocab_size, embed_dim=64, num_heads=4, 
                 num_blocks=4, max_seq_len=256, dropout=0.1):
        super().__init__()
        
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim
        self.max_seq_len = max_seq_len
        
        # Token embedding: character index → vector
        self.token_embedding = nn.Embedding(vocab_size, embed_dim)
        
        # Positional embedding: position → vector
        # (Using learned positional embeddings instead of sinusoidal — simpler!)
        self.position_embedding = nn.Embedding(max_seq_len, embed_dim)
        
        # Dropout after embeddings
        self.dropout = nn.Dropout(dropout)
        
        # Stack of Transformer blocks — this is the "brain" of the model
        self.blocks = nn.ModuleList([
            TransformerBlock(embed_dim, num_heads, dropout)
            for _ in range(num_blocks)
        ])
        
        # Final layer normalization
        self.final_norm = nn.LayerNorm(embed_dim)
        
        # Output projection: vector → vocabulary scores (logits)
        # This tells us: for each position, how likely is each character?
        self.output_head = nn.Linear(embed_dim, vocab_size, bias=False)
        
        # Weight tying: share weights between input embedding and output head
        # This saves vocab_size × embed_dim parameters and often improves language-model quality
        self.output_head.weight = self.token_embedding.weight
    
    def forward(self, idx, targets=None):
        """
        Args:
            idx: Input token indices, shape (batch, seq_len)
            targets: Optional target indices for computing loss
        
        Returns:
            logits: Prediction scores, shape (batch, seq_len, vocab_size)
            loss: Cross-entropy loss (if targets provided)
        """
        batch_size, seq_len = idx.shape
        
        assert seq_len <= self.max_seq_len, \
            f"Sequence length {seq_len} exceeds max {self.max_seq_len}"
        
        # Step 1: Get token embeddings
        tok_emb = self.token_embedding(idx)  # (batch, seq_len, embed_dim)
        
        # Step 2: Get positional embeddings
        positions = torch.arange(seq_len, device=idx.device)
        pos_emb = self.position_embedding(positions)  # (seq_len, embed_dim)
        
        # Step 3: Combine token + position embeddings
        x = self.dropout(tok_emb + pos_emb)
        
        # Step 4: Create causal mask
        mask = torch.tril(torch.ones(seq_len, seq_len, device=idx.device))
        mask = mask.unsqueeze(0)  # (1, seq_len, seq_len)
        
        # Step 5: Pass through transformer blocks
        for block in self.blocks:
            x = block(x, mask=mask)
        
        # Step 6: Final normalization
        x = self.final_norm(x)
        
        # Step 7: Project to vocabulary size
        logits = self.output_head(x)  # (batch, seq_len, vocab_size)
        
        # Compute loss if targets are provided
        loss = None
        if targets is not None:
            # Reshape for cross-entropy: (batch*seq_len, vocab_size) and (batch*seq_len,)
            loss = F.cross_entropy(
                logits.view(-1, self.vocab_size),
                targets.view(-1)
            )
        
        return logits, loss
    
    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        """
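        Generate text autoregressively.
        
        Note: call model.eval() before generating. torch.no_grad() only turns off
        gradient tracking; dropout stays active while the model is in training mode.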
        
        Args:
            idx: Starting token indices, shape (1, seq_len)
            max_new_tokens: How many tokens to generate
            temperature: Controls randomness (must be > 0; lower = more deterministic)
            top_k: Only consider top-k most likely tokens
        """
        for _ in range(max_new_tokens):
            # Crop to max sequence length
            idx_cond = idx[:, -self.max_seq_len:]
            
            # Get predictions
            logits, _ = self(idx_cond)
            
            # Take only the last position's predictions
            logits = logits[:, -1, :] / temperature
            
            # Optional: top-k filtering
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = float('-inf')
            
            # Convert to probabilities
            probs = F.softmax(logits, dim=-1)
            
            # Sample from the distribution
            next_token = torch.multinomial(probs, num_samples=1)
            
            # Append to sequence
            idx = torch.cat([idx, next_token], dim=1)
        
        return idx


# ============================================================================
# 🚀 MAIN
# ============================================================================

if __name__ == '__main__':
    print_header("🟠 Level 3, Step 4: Complete Mini-Transformer")
    
    torch.manual_seed(42)
    
    # --- Configuration ---
    vocab_size = 65     # 65 tokens; this demo maps them to ASCII codes 32-96
    embed_dim = 64      # Embedding dimension
    num_heads = 4       # Attention heads
    num_blocks = 4      # Transformer blocks (layers)
    max_seq_len = 128   # Maximum sequence length
    
    print(f"  {Colors.BOLD}Model Configuration:{Colors.RESET}")
    print(f"    Vocabulary size:     {vocab_size} characters")
    print(f"    Embedding dimension: {embed_dim}")
    print(f"    Attention heads:     {num_heads}")
    print(f"    Transformer blocks:  {num_blocks}")
    print(f"    Max sequence length: {max_seq_len}")
    
    # --- Build Model ---
    print_header("🏗️ Building the Model")
    
    model = MiniTransformer(
        vocab_size=vocab_size,
        embed_dim=embed_dim,
        num_heads=num_heads,
        num_blocks=num_blocks,
        max_seq_len=max_seq_len
    )
    
    # Count parameters by component
    print(f"  {Colors.YELLOW}Parameter count by component:{Colors.RESET}")
    
    tok_params = sum(p.numel() for p in model.token_embedding.parameters())
    pos_params = sum(p.numel() for p in model.position_embedding.parameters())
    block_params = sum(p.numel() for p in model.blocks.parameters())
    norm_params = sum(p.numel() for p in model.final_norm.parameters())
    
    print(f"    {'Token Embedding':30s}: {tok_params:>8,}")
    print(f"    {'Position Embedding':30s}: {pos_params:>8,}")
    print(f"    {'Transformer Blocks (×'+str(num_blocks)+')':30s}: {block_params:>8,}")
    print(f"    {'Final LayerNorm':30s}: {norm_params:>8,}")
    print(f"    {'Output Head (tied)':30s}: {'(shared)'}")
    
    total_params = sum(p.numel() for p in model.parameters())
    print(f"\n    {Colors.BOLD}{'TOTAL':30s}: {total_params:>8,} parameters{Colors.RESET}")
    
    # Compare with real models
    print(f"\n  {Colors.YELLOW}For comparison:{Colors.RESET}")
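    # Fixed-width columns keep the numbers right-aligned.
    # GPT-4's size has not been officially published; the figure is an unofficial estimate.
    print(f"    {'Your Mini-Transformer:':<22s}{total_params:>18,} parameters")
    print(f"    {'GPT-2 (small):':<22s}{124_000_000:>18,} parameters")
    print(f"    {'GPT-3:':<22s}{175_000_000_000:>18,} parameters")
    print(f"    {'GPT-4 (estimated):':<22s}{1_700_000_000_000:>18,} parameters")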
    
    # --- Print Architecture ---
    print_header("📐 Model Architecture")
    print(model)
    
    # --- Forward Pass ---
    print_header("🔄 Forward Pass Demo")
    
    # Create dummy input (batch of 2, seq length 10)
    batch_size = 2
    seq_len = 10
    dummy_input = torch.randint(0, vocab_size, (batch_size, seq_len))
    dummy_targets = torch.randint(0, vocab_size, (batch_size, seq_len))
    
    print_step(1, f"Input shape: {list(dummy_input.shape)}")
    print_info(f"(batch_size={batch_size}, seq_len={seq_len})")
    print_info(f"Sample input: {dummy_input[0].tolist()}")
    
    # Forward pass
    logits, loss = model(dummy_input, targets=dummy_targets)
    
    print()
    print_step(2, f"Output logits shape: {list(logits.shape)}")
    print_info(f"(batch_size={batch_size}, seq_len={seq_len}, vocab_size={vocab_size})")
    print_info("Each position outputs a score for every possible next character!")
    
    print()
    print_step(3, f"Loss: {loss.item():.4f}")
    print_info(f"Expected random loss: -ln(1/{vocab_size}) = {-math.log(1/vocab_size):.4f}")
    print_info("(Our untrained model is close to random — that's expected!)")
    
    # --- Probability Distribution ---
    print_header("📊 Output Probability Distribution")
    
    # Show probabilities for the last position
    probs = F.softmax(logits[0, -1, :], dim=-1)
    top_probs, top_indices = torch.topk(probs, 10)
    
    print(f"  Top 10 predicted next characters (for position {seq_len}):")
    print(f"  {'Character':>10s}  {'Probability':>12s}  {'Bar'}")
    print(f"  {'─'*10}  {'─'*12}  {'─'*30}")
    
    for prob, idx in zip(top_probs, top_indices):
        char = chr(idx.item() + 32) if 32 <= idx.item() + 32 <= 126 else '?'
        bar_len = int(prob.item() * 200)
        bar = '█' * bar_len
        # Multiply by 100 so the value matches the '%' sign printed after it
        print(f"  {repr(char):>10s}  {prob.item() * 100:>11.2f}%  {Colors.GREEN}{bar}{Colors.RESET}")
    
    print_info("(Probabilities are roughly equal — model is untrained)")
    
    # --- Generation Demo ---
    print_header("✨ Text Generation (Untrained)")
    
    # Generate from a simple start
    start = torch.zeros((1, 1), dtype=torch.long)  # Start with token 0
    generated = model.generate(start, max_new_tokens=50, temperature=1.0)
    
    # Convert to "characters" (just ASCII mapping for demo)
    gen_chars = ''.join([chr(min(t.item() + 32, 126)) for t in generated[0]])
    print(f"  Generated text (random, untrained):")
    print(f"  {Colors.DIM}\"{gen_chars}\"{Colors.RESET}")
    print()
    print_info("This is garbage because the model hasn't been trained yet!")
    print_info("In Level 4, you'll train this model and watch it learn to write! 🚀")
    
    # --- Summary ---
    print_header("📝 Summary — What You've Built!")
    print(f"""  You now have a complete {Colors.BOLD}Mini-Transformer{Colors.RESET} with:
    
    ✅ Token Embedding      — turns characters into vectors
    ✅ Position Embedding    — adds position information
    ✅ {num_blocks} Transformer Blocks  — the "brain" (attention + FFN)
    ✅ Output Head           — predicts the next character
    ✅ Generate method       — creates new text!
    
  Total: {Colors.BOLD}{total_params:,} parameters{Colors.RESET}
    
  This is the SAME architecture as GPT, just much smaller.
  The only difference? GPT has more blocks, bigger embeddings,
  and was trained on MUCH more data.

  {Colors.BOLD}🎯 You understand how AI language models work from the ground up!{Colors.RESET}

  {Colors.BOLD}{Colors.GREEN}✅ Level 3 Complete! Next: Level 4 — Train your own Mini-GPT!{Colors.RESET}
  {Colors.DIM}Run: python ../level_4_mini_gpt/train.py{Colors.RESET}
""")

Part IV

Creating Your Own AI

Building and training your own language models

Chapter 5

Building Your Own GPT

Learning Objectives

Explain what GPT actually does (spoiler: it predicts the next token)
Understand the full training loop — from raw text to a learning model
Read and modify a complete GPT model architecture in PyTorch
Train your own Mini-GPT on a text dataset
Generate text from your trained model using different sampling strategies
Explain temperature, top-k, and how they shape the model's "creativity"
Critically discuss what a language model learns — and what it does not

5.1 GPT: Just a Transformer with a Job

Let's clear up something that confuses a lot of people. GPT — Generative Pre-trained Transformer — is not magic. It's not a thinking machine. It's not even, fundamentally, a new invention. GPT is simply the Transformer architecture we built in Chapter 4… but given a very specific job:

Predict the next token.

That's it. That is the core training objective behind every GPT model, from your tiny Mini-GPT to OpenAI's GPT-4, whose parameter count has never been officially published.

Think of it like this. Suppose you're reading a Hindi sentence:

"आज मौसम बहुत ___"

What comes next? Your brain immediately suggests candidates: अच्छा, गर्म, ठंडा, खराब. You're doing next-word prediction! GPT does the same thing — but with mathematics.

Given a sequence of tokens [t_1, t_2, \ldots, t_{n}], GPT learns the conditional probability:

P(t_{n+1} \mid t_1, t_2, \ldots, t_n)

It doesn't predict just one word. It produces a probability distribution over the entire vocabulary. For every possible next token, it says: "This is how likely I think this token comes next."
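To make "a probability distribution over the vocabulary" concrete, here is a minimal sketch in plain Python. The four-word vocabulary and the logit values are made up for illustration; a real model produces one logit per token in its full vocabulary.

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical candidates and logits for "आज मौसम बहुत ___"
vocab = ["अच्छा", "गर्म", "ठंडा", "खराब"]
logits = [2.0, 1.0, 0.5, 0.1]

probs = softmax(logits)
for word, p in zip(vocab, probs):
    print(f"{word}: {p:.3f}")
```

Every candidate gets a non-zero probability; the highest logit simply gets the largest share.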

Note

GPT is an autoregressive model. It generates text one token at a time, feeding each generated token back as input to predict the next one. It's like a cricket commentator — each sentence builds on what was said before.

What Makes GPT Different from BERT?

If you've heard of BERT, here's the key difference. BERT is bidirectional — it looks at context from both the left and the right. GPT is unidirectional — it can only look at what came before. This is enforced by the causal mask we'll see in the code.

Why the restriction? Because GPT's job is generation. When you're writing the next word, you can't peek at words that haven't been written yet. The causal mask ensures the model plays fair — it only uses past context to predict the future.

5.2 The Training Process

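Training a GPT model involves four steps. Data preparation happens once up front; the other three (forward pass, loss computation, and backward pass with a weight update) repeat thousands of times. Let's walk through each one carefully.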

Step 1: Data Preparation

Our model works at the character level — each character is a token. This is simpler than word-level or subword tokenization (like BPE used in production GPTs), but the principles are identical.

We take a text file — say, a collection of short stories — and do the following:

Build a vocabulary: Find all unique characters in the text
Create mappings: char_to_idx (character → number) and idx_to_char (number → character)
Encode the text: Convert the entire text into a sequence of integers
Create training pairs: For every sequence of characters, the target is the same sequence shifted by one position

For example, if our text is "namaste":

Input	n	a	m	a	s	t
Target	a	m	a	s	t	e

Every character learns to predict the character that follows it.
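The four data-preparation steps can be sketched in a few lines of plain Python, using the same "namaste" example. This is only an illustration of the idea, not the project's dataset class.

```python
text = "namaste"

# 1. Build the vocabulary: every unique character, in sorted order
chars = sorted(set(text))

# 2. Create the two mappings
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for ch, i in char_to_idx.items()}

# 3. Encode the text as a list of integers
encoded = [char_to_idx[ch] for ch in text]

# 4. Training pairs: the target is the input shifted by one position
inputs, targets = encoded[:-1], encoded[1:]

for x, y in zip(inputs, targets):
    print(f"{idx_to_char[x]!r} -> {idx_to_char[y]!r}")
```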

Step 2: Forward Pass

A batch of input sequences goes through the model:

Token Embedding: Each character index becomes a 128-dimensional vector
Position Embedding: Position information is added (so the model knows the order of the characters)
Transformer Blocks: The combined embeddings pass through 4 transformer blocks, each with multi-head attention and a feed-forward network
Output Projection: The final layer produces logits — raw scores for every character in the vocabulary

The output shape is (batch_size, sequence_length, vocab_size). For each position, we get a score for every possible next character.

Step 3: Cross-Entropy Loss

We need to measure how wrong the model is. For this, we use cross-entropy loss:

\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \log P(t_i^{\text{target}} \mid t_1, \ldots, t_{i-1})

In plain language: for each position, we look at the probability the model assigned to the correct next character. If the model was confident and correct, the loss is low. If it was confident and wrong, the loss is high.

Tip

At the start of training, the model assigns roughly equal probability to all characters. With a vocabulary of 65 characters, the initial loss should be around -\log(1/65) \approx 4.17. If you see this value at step 0, your model is initialized correctly!
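A small numeric sketch of the loss, in plain Python. The probability lists are made-up values standing in for "the probability the model gave to the correct next character" at four positions.

```python
import math

def cross_entropy(prob_of_correct):
    """Average negative log-probability assigned to the correct next tokens."""
    return -sum(math.log(p) for p in prob_of_correct) / len(prob_of_correct)

vocab_size = 65

# Untrained model: every character gets probability 1/65
uniform = cross_entropy([1 / vocab_size] * 4)

# Hypothetical trained model that is confident and correct
confident_right = cross_entropy([0.9, 0.95, 0.9, 0.99])

# Hypothetical model that put almost no probability on the correct character
confident_wrong = cross_entropy([0.001, 0.002, 0.001, 0.005])

print(f"uniform:         {uniform:.2f}")
print(f"confident right: {confident_right:.2f}")
print(f"confident wrong: {confident_wrong:.2f}")
```

The uniform case reproduces the expected starting loss of about 4.17, and the other two cases fall on either side of it.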

Step 4: Backward Pass

This is where the learning happens. PyTorch computes the gradient of the loss with respect to every parameter in the model using backpropagation. Then the optimizer updates each parameter in the direction that reduces the loss. In its simplest form (plain gradient descent), the update is:

\theta_{\text{new}} = \theta_{\text{old}} - \eta \cdot \nabla_\theta \mathcal{L}

where \eta is the learning rate. Our optimizer, AdamW, builds on this rule: it keeps running averages of each parameter's gradients to adapt the step size per parameter, and it adds weight decay. We also apply gradient clipping: if the overall gradient norm becomes too large (which can happen with transformer models), we scale the gradients down. This prevents "exploding gradients" from destabilising training.

This cycle (sample a batch, forward pass, compute loss, backward pass, update weights) repeats 3,000 times, the max_steps value in our configuration. Each repetition is one training step.
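The update rule and gradient clipping can be sketched without PyTorch. This example uses plain gradient descent rather than AdamW, and the helper names clip_by_global_norm and sgd_step are hypothetical; PyTorch's own clipping function follows the same idea but operates on tensors.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients down together if their combined L2 norm exceeds max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

def sgd_step(params, grads, lr):
    """Plain gradient descent: theta_new = theta_old - lr * gradient."""
    return [p - lr * g for p, g in zip(params, grads)]

params = [0.5, -1.0, 2.0]
grads = [3.0, 4.0, 0.0]          # made-up gradients with norm 5.0

clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
new_params = sgd_step(params, clipped, lr=0.1)

print(f"norm before clipping: {norm_before}")
print(f"clipped gradients:    {clipped}")
print(f"updated parameters:   {new_params}")
```

Clipping changes only the length of the gradient vector, not its direction, so the update still points the same way but takes a smaller step.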

5.3 The Model Architecture

Now let's look at the actual code. Our Mini-GPT lives in model.py and consists of a configuration dataclass plus four module classes stacked together. Think of it as building a temple: you lay the foundation first, then add pillars, then the dome.

The Configuration: `GPTConfig`

Every model begins with its hyperparameters:

Python
import math
from dataclasses import dataclass

import torch
import torch.nn as nn
import torch.nn.functional as F

@dataclass
class GPTConfig:
    """All hyperparameters in one place — easy to experiment!"""
    
    # Model architecture
    vocab_size: int = 65          # Will be set based on training data
    embed_dim: int = 128          # Size of token vectors
    num_heads: int = 4            # Number of attention heads
    num_blocks: int = 4           # Number of transformer blocks
    max_seq_len: int = 256        # Maximum context length
    dropout: float = 0.1          # Dropout rate for regularization
    
    # Training
    batch_size: int = 32          # Samples per training step
    learning_rate: float = 3e-4   # How fast the model learns
    max_steps: int = 3000         # Total training steps
    eval_interval: int = 100      # Evaluate every N steps
    eval_steps: int = 20          # Steps per evaluation
    sample_interval: int = 500    # Generate sample every N steps

Using a dataclass keeps everything tidy. Want to experiment with 8 attention heads? Just change num_heads = 8. Want a deeper model? Change num_blocks = 6. This is how real ML research works — you tweak hyperparameters and observe what happens.

Important

The embed_dim must be divisible by num_heads. Each attention head works on a slice of the embedding: head_dim = embed_dim // num_heads. With 128 dimensions and 4 heads, each head operates on 32 dimensions.
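One way to catch a bad combination early is to validate it when the configuration is created. The sketch below uses a hypothetical TinyConfig class as a stand-in for GPTConfig; the __post_init__ check and the head_dim property are illustrative additions, not part of the project's code.

```python
from dataclasses import dataclass

@dataclass
class TinyConfig:
    """Hypothetical stand-in for GPTConfig with only the two relevant fields."""
    embed_dim: int = 128
    num_heads: int = 4

    def __post_init__(self):
        # Each head needs an equal-sized slice of the embedding
        if self.embed_dim % self.num_heads != 0:
            raise ValueError(
                f"embed_dim ({self.embed_dim}) must be divisible by "
                f"num_heads ({self.num_heads})"
            )

    @property
    def head_dim(self):
        return self.embed_dim // self.num_heads

print(TinyConfig().head_dim)             # 128 / 4
print(TinyConfig(num_heads=8).head_dim)  # 128 / 8
```

Trying TinyConfig(num_heads=6) raises a ValueError, because 128 cannot be split evenly into 6 slices.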

Multi-Head Self-Attention

This is the heart of the Transformer. We covered the theory in Chapter 4 — now see it in code:

Python
class MultiHeadAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.num_heads = config.num_heads
        self.head_dim = config.embed_dim // config.num_heads
        self.embed_dim = config.embed_dim
        
        # Single linear layer projects to Q, K, V simultaneously
        self.W_qkv = nn.Linear(config.embed_dim, 3 * config.embed_dim, bias=False)
        self.W_out = nn.Linear(config.embed_dim, config.embed_dim, bias=False)
        self.attn_dropout = nn.Dropout(config.dropout)
        self.resid_dropout = nn.Dropout(config.dropout)
        self.scale = math.sqrt(self.head_dim)
    
    def forward(self, x, mask=None):
        B, T, C = x.shape  # Batch, Time (sequence length), Channels (embed_dim)
        
        # Project to Q, K, V in one shot, then split
        qkv = self.W_qkv(x).reshape(B, T, 3, self.num_heads, self.head_dim)
        qkv = qkv.permute(2, 0, 3, 1, 4)  # (3, B, heads, T, head_dim)
        Q, K, V = qkv[0], qkv[1], qkv[2]
        
        # Scaled dot-product attention
        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        
        weights = F.softmax(scores, dim=-1)
        weights = self.attn_dropout(weights)
        
        # Weighted sum of values
        out = torch.matmul(weights, V)
        out = out.transpose(1, 2).reshape(B, T, C)  # Recombine heads
        return self.resid_dropout(self.W_out(out))

Let's trace through this carefully:

W_qkv: Instead of three separate linear layers for Q, K, and V, we use one big layer that produces all three at once. This is a common efficiency trick — one matrix multiplication instead of three.

Reshape and Permute: We split the output into num_heads separate attention heads. Each head gets its own slice of the embedding to work with independently.

Scaled Dot-Product Attention: \text{scores} = \frac{QK^T}{\sqrt{d_k}}. The scaling by \sqrt{d_k} prevents the dot products from becoming too large, which would push softmax into regions with tiny gradients.

Causal Mask: The mask argument is a lower-triangular matrix. Position i can only attend to positions \leq i. This is what makes GPT autoregressive.

Output Projection: After attention, all heads are concatenated and projected back to the embedding dimension through W_out.
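To see scaled dot-product attention and the causal mask at work on numbers you can check by hand, here is a single-head sketch in plain Python. It leaves out the learned projections, dropout, and multiple heads, and the Q, K, V values are made up.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Single-head attention. Q, K, V are lists of T vectors."""
    T, d_k, d_v = len(Q), len(Q[0]), len(V[0])
    scale = math.sqrt(d_k)
    outputs, all_weights = [], []
    for i in range(T):
        scores = []
        for j in range(T):
            if j > i:
                scores.append(float("-inf"))   # causal mask: hide future positions
            else:
                dot = sum(q * k for q, k in zip(Q[i], K[j]))
                scores.append(dot / scale)
        row = softmax(scores)
        all_weights.append(row)
        # Weighted sum of the value vectors
        outputs.append([sum(w * V[j][c] for j, w in enumerate(row)) for c in range(d_v)])
    return outputs, all_weights

Q = K = V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = causal_attention(Q, K, V)

for i, row in enumerate(weights):
    print(i, [round(w, 3) for w in row])
```

The printed weight matrix is lower-triangular: position 0 attends only to itself, position 1 to positions 0 and 1, and so on.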

Feed-Forward Network

After attention, each position's representation is processed independently:

Python
class FeedForward(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(config.embed_dim, 4 * config.embed_dim),
            nn.GELU(),
            nn.Linear(4 * config.embed_dim, config.embed_dim),
            nn.Dropout(config.dropout),
        )
    
    def forward(self, x):
        return self.net(x)

The feed-forward network expands the dimension by 4×, applies a non-linearity (GELU), then projects it back down. Think of this as the "thinking" step — attention gathers information from other positions, and the FFN processes that gathered information.

Note

Why GELU instead of ReLU? GELU (Gaussian Error Linear Unit) is smoother than ReLU — it doesn't have a hard cutoff at zero. Most modern transformer models (GPT-2, BERT, etc.) use GELU because it tends to train better.
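The difference is easy to see numerically. This sketch computes the exact (erf-based) GELU in plain Python and prints it next to ReLU for a few inputs.

```python
import math

def relu(x):
    return max(0.0, x)

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (-3.0, -1.0, -0.1, 0.0, 0.1, 1.0, 3.0):
    print(f"x={x:>5.1f}  relu={relu(x):>7.4f}  gelu={gelu(x):>7.4f}")
```

ReLU outputs exactly zero for every negative input, while GELU lets small negative values through slightly and approaches ReLU for large positive inputs.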

Transformer Block

A transformer block combines attention and feed-forward with residual connections and layer normalization:

Python
class TransformerBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.norm1 = nn.LayerNorm(config.embed_dim)
        self.norm2 = nn.LayerNorm(config.embed_dim)
        self.attention = MultiHeadAttention(config)
        self.ffn = FeedForward(config)
    
    def forward(self, x, mask=None):
        x = x + self.attention(self.norm1(x), mask)  # Residual + Attention
        x = x + self.ffn(self.norm2(x))               # Residual + FFN
        return x

Notice the Pre-Norm design: we apply LayerNorm before attention and FFN, not after. This was found to train more stably than the original "Post-Norm" design from the 2017 Transformer paper. The residual connection (x + ...) ensures that gradients can flow directly through the network without degradation — like adding a shortcut in a highway.
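The two orderings can be compared with a toy example in plain Python. The layer_norm here has no learned scale or shift, and the "sublayer" is just a function that doubles its input, standing in for attention or the FFN; both are simplifications for illustration.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def pre_norm(x, sublayer):
    """Pre-Norm: normalize the sublayer's input; the residual path carries x unchanged."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def post_norm(x, sublayer):
    """Post-Norm (original 2017 design): normalize after adding the residual."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def double(v):
    return [2.0 * a for a in v]      # toy stand-in for attention or the FFN

x = [1.0, 2.0, 3.0, 4.0]
pre = pre_norm(x, double)
post = post_norm(x, double)

print("pre-norm: ", [round(v, 3) for v in pre])
print("post-norm:", [round(v, 3) for v in post])
```

In the Pre-Norm output the original x is still present as an untouched additive term, which is the "shortcut" that lets gradients pass straight through. In Post-Norm the normalization sits on the shortcut itself.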

The Complete MiniGPT Model

Now we assemble everything:

Python
class MiniGPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        
        # Token and position embeddings
        self.token_emb = nn.Embedding(config.vocab_size, config.embed_dim)
        self.pos_emb = nn.Embedding(config.max_seq_len, config.embed_dim)
        self.dropout = nn.Dropout(config.dropout)
        
        # Stack of transformer blocks
        self.blocks = nn.ModuleList([
            TransformerBlock(config) for _ in range(config.num_blocks)
        ])
        
        # Final layer norm and output projection
        self.final_norm = nn.LayerNorm(config.embed_dim)
        self.output_head = nn.Linear(config.embed_dim, config.vocab_size, bias=False)
        
        # Weight tying — reuse token embedding weights for output
        self.output_head.weight = self.token_emb.weight
        
        # Initialize weights
        self.apply(self._init_weights)
    
    def _init_weights(self, module):
        """Initialize weights for better training."""
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
    
    def forward(self, idx, targets=None):
        B, T = idx.shape
        
        # Embeddings
        tok = self.token_emb(idx)                              # (B, T, embed_dim)
        pos = self.pos_emb(torch.arange(T, device=idx.device)) # (T, embed_dim)
        x = self.dropout(tok + pos)
        
        # Causal mask
        mask = torch.tril(torch.ones(T, T, device=idx.device)).unsqueeze(0)
        
        # Pass through transformer blocks
        for block in self.blocks:
            x = block(x, mask)
        
        # Project to vocabulary
        x = self.final_norm(x)
        logits = self.output_head(x)  # (B, T, vocab_size)
        
        # Compute loss if targets provided
        loss = None
        if targets is not None:
            loss = F.cross_entropy(
                logits.view(-1, self.config.vocab_size),
                targets.view(-1)
            )
        
        return logits, loss

Let's highlight three important design choices:

Weight Tying: self.output_head.weight = self.token_emb.weight makes the input embedding and output projection share the same weight matrix. The intuition: the embedding maps characters into the vector space, and the output head maps vectors back to characters. These should be roughly inverse operations, so sharing weights makes sense. It also saves vocab_size \times embed_dim parameters: only about 8,000 for our 65-character vocabulary, but tens of millions for a model with a 50,000-token vocabulary.

Causal Mask: torch.tril(torch.ones(T, T)) creates a lower-triangular matrix of ones. When applied in attention, position i can only attend to positions 0, 1, \ldots, i. This ensures the model can't "cheat" by looking at future tokens.

Weight Initialization: Weights are drawn from a normal distribution with mean 0 and standard deviation 0.02, i.e. \mathcal{N}(0, 0.02^2). This standard deviation follows the original GPT paper. Too large and training is unstable; too small and the model learns too slowly.

Tip

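With the default configuration shown above, our Mini-GPT has roughly 0.8 million parameters (the exact number depends on the vocabulary size of your training text). For perspective, GPT-2 Small has 124 million, and GPT-3 has 175 billion. Despite being tiny, our model can still learn interesting patterns from text!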
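A parameter count can be sanity-checked by hand from the configuration. The sketch below tallies the layers exactly as they are defined in the classes above, assuming the GPTConfig defaults (vocab_size 65, embed_dim 128, 4 blocks, max_seq_len 256); a different vocabulary or embedding size changes the total.

```python
vocab_size, embed_dim, num_blocks, max_seq_len = 65, 128, 4, 256

# Embeddings (the output head reuses token_emb because of weight tying)
token_emb = vocab_size * embed_dim
pos_emb = max_seq_len * embed_dim

# One transformer block
attention = embed_dim * 3 * embed_dim + embed_dim * embed_dim   # W_qkv + W_out, no biases
ffn_up = embed_dim * 4 * embed_dim + 4 * embed_dim              # weight + bias
ffn_down = 4 * embed_dim * embed_dim + embed_dim                # weight + bias
layer_norms = 2 * (2 * embed_dim)                               # two LayerNorms, weight + bias each
per_block = attention + ffn_up + ffn_down + layer_norms

final_norm = 2 * embed_dim
total = token_emb + pos_emb + num_blocks * per_block + final_norm

print(f"per block: {per_block:,}")
print(f"total:     {total:,}")
```

In PyTorch, sum(p.numel() for p in model.parameters()) gives the same kind of total directly; tied weights are counted once.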

5.4 Training Your Mini-GPT

Let's look at the training script (train.py). We'll break it into logical stages.

Stage 1: Data Loading and Tokenization

Python
class CharDataset:
    """Character-level dataset for language modeling."""
    
    def __init__(self, text, config):
        self.config = config
        
        # Build vocabulary from the text
        chars = sorted(list(set(text)))
        self.char_to_idx = {ch: i for i, ch in enumerate(chars)}
        self.idx_to_char = {i: ch for i, ch in enumerate(chars)}
        self.vocab_size = len(chars)
        
        # Encode entire text
        self.data = torch.tensor(
            [self.char_to_idx[ch] for ch in text], dtype=torch.long
        )
        
        # Train/validation split (90/10)
        n = int(0.9 * len(self.data))
        self.train_data = self.data[:n]
        self.val_data = self.data[n:]
    
    def encode(self, text):
        return [self.char_to_idx.get(ch, 0) for ch in text]
    
    def decode(self, indices):
        return ''.join([self.idx_to_char.get(i, '?') for i in indices])
    
    def get_batch(self, split='train'):
        """Get a random batch of training data."""
        data = self.train_data if split == 'train' else self.val_data
        seq_len = self.config.max_seq_len
        batch_size = self.config.batch_size
        
        # Random starting positions
        ix = torch.randint(len(data) - seq_len - 1, (batch_size,))
        
        # Input and target sequences (target is shifted by 1)
        x = torch.stack([data[i:i+seq_len] for i in ix])
        y = torch.stack([data[i+1:i+seq_len+1] for i in ix])
        
        return x, y

The get_batch method is where the training data comes from. Each call:

Picks batch_size random starting positions in the text
Extracts sequences of length max_seq_len starting from each position
Creates target sequences that are shifted by one character

Because the starting positions are random, two batches are almost never identical, even though individual windows overlap and the model revisits the same text many times. This random windowing gives the model many different views of the same data.

Stage 2: The Evaluation Function

Python
@torch.no_grad()
def estimate_loss(model, dataset, config):
    """Estimate average loss on train and validation sets."""
    model.eval()
    losses = {}
    
    for split in ['train', 'val']:
        total_loss = 0.0
        for _ in range(config.eval_steps):
            x, y = dataset.get_batch(split)
            _, loss = model(x, targets=y)
            total_loss += loss.item()
        losses[split] = total_loss / config.eval_steps
    
    model.train()
    return losses

We evaluate on both training and validation data. If the training loss keeps going down but the validation loss starts going up, that's overfitting — the model is memorising the training data instead of learning general patterns. Think of a student who memorises answers without understanding concepts — they do great on practice papers but fail on unseen questions.
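The overfitting pattern described above can be turned into a simple check on the recorded losses. This is a rough heuristic sketch: the function name is_overfitting, the patience parameter, and the loss values are all made up for illustration and are not part of train.py.

```python
def is_overfitting(train_losses, val_losses, patience=3):
    """True if val loss has not beaten its earlier best for `patience`
    evaluations while train loss kept falling over the same period."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    val_stalled = all(v > best_before for v in val_losses[-patience:])
    train_improving = train_losses[-1] < train_losses[-patience - 1]
    return val_stalled and train_improving

# Hypothetical loss histories, one entry per evaluation
healthy_train = [4.1, 3.0, 2.2, 1.8]
healthy_val = [4.1, 3.1, 2.5, 2.4]

overfit_train = [4.1, 3.0, 2.2, 1.8, 1.5, 1.2, 1.0]
overfit_val = [4.1, 3.1, 2.5, 2.4, 2.5, 2.6, 2.8]

print(is_overfitting(healthy_train, healthy_val))
print(is_overfitting(overfit_train, overfit_val))
```

In the second history the validation loss bottoms out at 2.4 and then climbs while the training loss keeps dropping, which is the signature to watch for.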

Warning

The @torch.no_grad() decorator matters during evaluation. Without it, PyTorch would build the computation graph needed for gradients on every evaluation step, wasting memory and slowing things down. Note that model.eval() does a different job: it switches layers such as dropout into inference mode but does not disable gradient tracking. Use both when you're not training.

Stage 3: The Training Loop

Here's the core of train.py — the actual training loop:

Python
# Setup
config = GPTConfig()
dataset = CharDataset(text, config)
config.vocab_size = dataset.vocab_size

model = MiniGPT(config)
optimizer = torch.optim.AdamW(model.parameters(), lr=config.learning_rate)

model.train()
best_val_loss = float('inf')

for step in range(config.max_steps):
    # Get a random batch
    x, y = dataset.get_batch('train')
    
    # Forward pass — compute predictions and loss
    logits, loss = model(x, targets=y)
    
    # Backward pass — compute gradients
    optimizer.zero_grad()
    loss.backward()
    
    # Gradient clipping — prevent exploding gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    
    # Update weights
    optimizer.step()
    
    # Evaluate periodically
    if (step + 1) % config.eval_interval == 0:
        losses = estimate_loss(model, dataset, config)
        
        # Save best model
        if losses['val'] < best_val_loss:
            best_val_loss = losses['val']
            torch.save({
                'model_state_dict': model.state_dict(),
                'config': config,
                'char_to_idx': dataset.char_to_idx,
                'idx_to_char': dataset.idx_to_char,
                'vocab_size': dataset.vocab_size,
                'step': step + 1,
                'val_loss': best_val_loss,
            }, 'mini_gpt_model.pt')
    
    # Generate sample text periodically
    if (step + 1) % config.sample_interval == 0:
        model.eval()
        start_tokens = torch.zeros((1, 1), dtype=torch.long)
        generated = model.generate(start_tokens, max_new_tokens=150, temperature=0.8)
        gen_text = dataset.decode(generated[0].tolist())
        print(f"[step {step + 1}] sample: {gen_text}")
        model.train()

Let's unpack the key choices:

AdamW Optimizer: Adam with weight decay. It's the go-to optimizer for transformers — it adapts the learning rate for each parameter individually, which works much better than plain SGD for these architectures.
Gradient Clipping (clip_grad_norm_ with max norm 1.0): Transformers can occasionally produce very large gradients. Clipping prevents these from causing catastrophic weight updates.
Model Checkpointing: We save the model whenever the validation loss improves. This means even if training gets worse later (overfitting), we keep the best version. It's like taking a photo of the scoreboard when your team is winning — just in case!

5.5 Watching It Learn

One of the most magical moments in AI is watching your model go from producing complete garbage to generating coherent text. Here's the kind of output you can expect at different stages (your exact text will differ):

Step 0 (Before Training)


"xK&mQ!zP;yWjR#3nL@fT$8vUoC*1bHi^9dAe"

The model knows nothing. It assigns roughly equal probability to every character, so the output is essentially random noise, like a monkey typing on a keyboard.

Step 500 (Early Training)


"the the the and the was a the of the"

The model has learned the most basic pattern: common words exist. It produces recognisable English words, but just repeats them with no structure. It's like a toddler who knows a few words but can't form sentences.

Step 1500 (Mid Training)


"The king was a great and the people of the village were happy."

Now we see grammar emerging! The model has learned that sentences start with capital letters, contain subjects and verbs, and end with periods. The sentences make superficial sense, even if the overall narrative is disjointed.

Step 3000 (Final)


"The old woman lived in a small village near the river. She would
walk every morning to collect water and bring it back to her home."

At this stage, the model produces text that reads like coherent prose. It maintains a topic across multiple sentences, uses proper punctuation, and even shows a sense of narrative flow.

Note

The quality of generated text depends heavily on your training data. If you train on stories, the model writes stories. If you train on code, it writes code. If you train on Bollywood song lyrics, it will write lyrics! The model mirrors whatever patterns exist in its training data.

5.6 Chatting with Your Model

Once training is complete, you can have an interactive conversation with your model using generate.py:

Loading the Trained Model

Python
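import os

import torch

from model import GPTConfig, MiniGPT  # assumes these classes live in model.py, as described in 5.3


def load_model():
    """Load the trained model from checkpoint."""
    model_path = os.path.join(
        os.path.dirname(os.path.abspath(__file__)), 'mini_gpt_model.pt'
    )
    
    # weights_only=False is needed because the checkpoint pickles the GPTConfig
    # object alongside the weights. Only load checkpoint files you trust.
    checkpoint = torch.load(model_path, map_location='cpu', weights_only=False)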
    
    config = checkpoint['config']
    char_to_idx = checkpoint['char_to_idx']
    idx_to_char = checkpoint['idx_to_char']
    
    model = MiniGPT(config)
    model.load_state_dict(checkpoint['model_state_dict'])
    model.eval()
    
    return model, char_to_idx, idx_to_char

The checkpoint file contains everything needed to reconstruct the model: the architecture configuration, the trained weights, and the character vocabulary mappings. This is why we saved all of these during training — without the char_to_idx mapping, we wouldn't know which number corresponds to which character.

Generating Text

Python
def generate_text(model, prompt, char_to_idx, idx_to_char, 
                  max_tokens=200, temperature=0.8, top_k=20):
    """Generate text from a prompt."""
    # Encode prompt characters to indices
    encoded = [char_to_idx.get(ch, 0) for ch in prompt]
    input_ids = torch.tensor([encoded], dtype=torch.long)
    
    # Generate autoregressively
    with torch.no_grad():
        output = model.generate(
            input_ids, max_new_tokens=max_tokens, 
            temperature=temperature, top_k=top_k
        )
    
    # Decode indices back to characters
    generated = ''.join([idx_to_char.get(i, '?') for i in output[0].tolist()])
    return generated

And here's the autoregressive generation loop inside the model:

Python
@torch.no_grad()
def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
    self.eval()
    
    for _ in range(max_new_tokens):
        # Crop to max context length
        idx_crop = idx[:, -self.config.max_seq_len:]
        
        # Forward pass — get logits for all positions
        logits, _ = self(idx_crop)
        logits = logits[:, -1, :] / temperature  # Only the last position matters
        
        # Top-k filtering
        if top_k is not None:
            v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < v[:, [-1]]] = float('-inf')
        
        # Sample from the distribution
        probs = F.softmax(logits, dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_tok], dim=1)
    
    return idx

Notice the generation loop: at each step, we feed the entire sequence so far into the model, but we only care about the logits at the last position (logits[:, -1, :]). That last position's output contains the model's prediction for what comes next. We sample from this distribution, append the sampled token, and repeat.

Tip

The idx_crop line is important. Our model has a maximum context length of 256 characters. If the generated sequence grows beyond this, we crop it to the last 256 characters. This means the model "forgets" the very beginning of long sequences — a fundamental limitation of fixed-context-length models.
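The effect of cropping is easy to see with a plain list. This sketch uses a tiny context length of 8 instead of 256 and appends placeholder integers instead of sampled tokens.

```python
max_seq_len = 8                 # small stand-in for the model's 256
tokens = list(range(5))         # pretend prompt: tokens 0..4

for step in range(6):
    # Only the last max_seq_len tokens are fed to the model
    context = tokens[-max_seq_len:]
    print(f"step {step}: model sees {context}")

    next_token = len(tokens)    # placeholder for a sampled token
    tokens.append(next_token)

print(f"full sequence: {tokens}")
```

While the sequence is shorter than the limit, the slice returns everything. Once it grows past the limit, the earliest tokens drop out of the model's view even though they remain in the output.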

The Interactive Loop

The interactive chat allows you to type prompts and adjust settings on the fly:

Python
# Settings
temperature = 0.8
top_k = 20
max_tokens = 200

while True:
    prompt = input("You > ")
    
    if prompt.strip().lower() == 'quit':
        break
    
    # Handle setting commands
    if prompt.startswith('temp '):
        temperature = float(prompt.split()[1])
        continue
    if prompt.startswith('topk '):
        top_k = int(prompt.split()[1])
        continue
    
    # Generate and display
    generated = generate_text(
        model, prompt, char_to_idx, idx_to_char,
        max_tokens=max_tokens, temperature=temperature, top_k=top_k
    )
    print(generated)

Type any text and the model continues it. Type temp 0.3 for more conservative output, or temp 1.5 for wilder creativity. But what do these settings actually mean? Let's find out.
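
The loop above assumes well-formed commands; something like `temp abc` would raise a `ValueError`. One possible way to make it more forgiving is sketched below. `parse_setting` is a hypothetical helper (it is not in generate.py), and rejecting non-positive values is an assumption based on the fact that `generate` divides by the temperature.

```python
def parse_setting(prompt, temperature, top_k):
    """Apply a 'temp X' or 'topk N' command.

    Returns (temperature, top_k, handled). handled is True when the prompt
    looked like a command, even if its value was rejected.
    """
    parts = prompt.strip().split()
    if len(parts) == 2 and parts[0] in ('temp', 'topk'):
        try:
            if parts[0] == 'temp':
                value = float(parts[1])
                if value > 0:          # generate() divides by temperature
                    temperature = value
            else:
                value = int(parts[1])
                if value > 0:
                    top_k = value
        except ValueError:
            pass                       # malformed number: keep old settings
        return temperature, top_k, True
    return temperature, top_k, False

print(parse_setting("temp 0.3", 0.8, 20))    # (0.3, 20, True)
print(parse_setting("temp abc", 0.8, 20))    # (0.8, 20, True)
print(parse_setting("Once upon", 0.8, 20))   # (0.8, 20, False)
```

Inside the chat loop, a `handled` value of `True` would correspond to the `continue` branches shown earlier.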

5.7 Temperature, Top-k, and Sampling Strategies

When your model produces logits for the next character, how do we choose which character to actually use? This is where sampling strategies come in, and they have a dramatic effect on the output.

Temperature

Temperature controls the "sharpness" of the probability distribution. The logits are divided by the temperature before applying softmax:

P(t_i) = \frac{e^{z_i / \tau}}{\sum_j e^{z_j / \tau}}

where z_i are the raw logits and \tau is the temperature.

Consider a model predicting the next character after "Ind". Suppose its logits yield the following illustrative probabilities at four different temperatures:

Character	`\tau = 0.3` (Low)	`\tau = 0.8` (Default)	`\tau = 1.0` (Normal)	`\tau = 1.5` (High)
i	0.85	0.45	0.35	0.24
u	0.10	0.20	0.20	0.18
e	0.04	0.15	0.18	0.17
o	0.01	0.10	0.12	0.14
a	~0.00	0.05	0.08	0.12
others	~0.00	0.05	0.07	0.15

At low temperature (\tau = 0.3), the model is very confident — it almost always picks "i" (making "Indi" → "India"). The output is predictable, repetitive, but safe.

At high temperature (\tau = 1.5), probabilities are spread out. The model might pick "u" (making "Indu" → "Industry") or even "e" (making "Inde" → "Indeed"). The output is more creative but also more prone to nonsense.
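
To see the sharpening and flattening numerically, the formula can be applied to a small list of logits in plain Python. This is a minimal sketch: the logit values are made up for illustration and are not the ones behind the table.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                       # subtract the max before exponentiating
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, 0.0]             # made-up logits for four characters
for tau in (0.3, 1.0, 1.5):
    probs = softmax_with_temperature(logits, tau)
    print(tau, [round(p, 3) for p in probs])
```

The ranking of the characters never changes; only the gap between the favourite and the rest grows (low temperature) or shrinks (high temperature).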

Important

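Temperature = 0 is a special case called greedy decoding: always pick the most likely token. (Because our generate method divides the logits by the temperature, a literal 0 has to be handled as an argmax rather than passed through.) Greedy decoding produces the most predictable output but often leads to repetitive, boring text ("the the the the…"). In practice, a temperature between 0.7 and 0.9 usually works best.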
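
Greedy decoding can be sketched on a plain list of logits by treating a temperature of 0 as a request for the argmax and sampling otherwise. `pick_next` is a hypothetical helper, not part of model.py, and the logits are made up.

```python
import math
import random

def pick_next(logits, temperature, rng=random):
    """Return the index of the next token.

    temperature == 0 means greedy decoding (argmax); any positive value
    samples from softmax(logits / temperature).
    """
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]   # unnormalised is fine for choices()
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [0.1, 2.0, 0.3]
print(pick_next(logits, 0))                       # 1 (always the largest logit)
print(pick_next(logits, 1.0, random.Random(0)))   # a sampled index: 0, 1 or 2
```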

Top-k Sampling

Top-k restricts the model to only consider the k most likely characters, setting all other probabilities to zero. This prevents the model from ever choosing very unlikely characters (which tend to be nonsensical).

Continuing our example with "Ind" and top-k = 3:

Character	Original Probability	After Top-3 Filtering	After Renormalisation
i	0.35	0.35	0.48
u	0.20	0.20	0.27
e	0.18	0.18	0.25
o	0.12	0.00	0.00
a	0.08	0.00	0.00
others	0.07	0.00	0.00

Only "i", "u", and "e" survive. The probabilities are renormalised to sum to 1, and we sample from this filtered distribution.
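
The filtering and renormalisation in the table can be reproduced in a few lines. This is a sketch on plain probabilities with a hypothetical helper `top_k_filter`; the model's generate method instead sets the discarded logits to negative infinity before the softmax, which gives the same result.

```python
def top_k_filter(probs, k):
    """Zero out everything except the k largest probabilities, then renormalise."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept_mass = sum(probs[i] for i in keep)
    return [probs[i] / kept_mass if i in keep else 0.0 for i in range(len(probs))]

probs = [0.35, 0.20, 0.18, 0.12, 0.08, 0.07]   # i, u, e, o, a, others
filtered = top_k_filter(probs, k=3)
print([round(p, 2) for p in filtered])          # [0.48, 0.27, 0.25, 0.0, 0.0, 0.0]
```

The surviving mass is 0.35 + 0.20 + 0.18 = 0.73, so each kept probability is divided by 0.73.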

Combining Temperature and Top-k

In practice, we use both together. Temperature controls the shape of the distribution, and top-k provides a safety net against low-probability nonsense. Our default settings (temperature=0.8, top_k=20) give a good balance between creativity and coherence.

A practical guide:

Use Case	Temperature	Top-k	Result
Factual / predictable	0.3	5	Very conservative output
Story continuation	0.7–0.8	20	Balanced and coherent
Creative brainstorming	1.0	40	Diverse and surprising
Experimental / chaotic	1.5	None	Wild, often nonsensical
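
The two controls compose in a fixed order: scale by temperature, filter to the top k, then sample. The sketch below mirrors that order on a plain list of logits. `sample_next` is a hypothetical helper and the logits are made up; it is meant as an illustration of the order of operations, not a replacement for the model's generate method.

```python
import math
import random

def sample_next(logits, temperature=0.8, top_k=20, rng=random):
    """Sample one index: temperature scaling, top-k filtering, softmax, sampling."""
    scaled = [z / temperature for z in logits]
    if top_k is not None:
        k = min(top_k, len(scaled))
        cutoff = sorted(scaled, reverse=True)[k - 1]        # k-th largest value
        scaled = [s if s >= cutoff else float('-inf') for s in scaled]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]             # exp(-inf) is 0.0
    return rng.choices(range(len(scaled)), weights=weights, k=1)[0]

rng = random.Random(0)
logits = [3.0, 2.0, 1.0, -1.0, -2.0]
samples = [sample_next(logits, temperature=0.8, top_k=2, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))   # only indices 0 and 1 can ever appear
```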

Tip

When experimenting, change one parameter at a time. Set temp 0.3 and generate. Then set temp 1.5 and generate the same prompt. Compare the results. This builds intuition much faster than reading about it!

💭 5.8 Discussion: What Did Your Model Actually Learn?

After training, your Mini-GPT can produce text that looks surprisingly coherent. But let's be honest with ourselves: what has it actually learned?

What It HAS Learned

Character Frequencies: The model knows that 'e' is the most common letter in English, that spaces appear between words, and that 'q' is almost always followed by 'u'.

Word Structure: It has internalised the spelling of common words — "the", "and", "was", "village", "morning". It rarely produces non-words after sufficient training.

Grammar Patterns: Subject-verb-object ordering, article-noun pairs ("the king", "a village"), and verb tenses are all captured. It doesn't know grammar rules — it has learned statistical patterns that happen to align with grammar.

Punctuation and Formatting: Sentences start with capitals and end with periods. Dialogue uses quotation marks. Paragraphs have line breaks.

Thematic Coherence: Within a short span, the model can maintain a topic. If it starts writing about a king, the next few sentences will likely continue about the king.

What It Has NOT Learned

True Understanding: The model doesn't know what a "king" is. It doesn't know that kings rule kingdoms, wear crowns, or exist in the physical world. It only knows that the character sequence "king" tends to appear near sequences like "queen", "throne", "kingdom".

Logic and Reasoning: Ask your model to solve "2 + 3" and it might output "5" — not because it understands arithmetic, but because it has seen "2 + 3 = 5" in text. Ask "2847 + 9283" and it will likely fail.

Factual Knowledge: Your model might write "Delhi is the capital of India" — but only if something similar appeared in the training data. It doesn't know facts; it reproduces patterns.

Long-Range Coherence: Our model's context is 256 characters — roughly 40-50 words. It cannot maintain a plot across paragraphs or remember a character introduced 1,000 tokens ago. Larger models with longer contexts do better, but even they struggle with book-length coherence.

Think of it like a very talented mimic. A mimic can perfectly reproduce the accent, rhythm, and vocabulary of a native Hindi speaker without understanding a single word of Hindi. Your GPT is doing the same thing with text — reproducing the form of language without grasping its meaning.

Note

This is one of the deepest debates in AI today. Some researchers argue that sufficiently large language models do develop a form of understanding. Others insist they remain "stochastic parrots": impressive pattern matchers, nothing more. Where you stand on this question will shape how you think about AI's future.

Key Concepts Summary

Concept	Definition
GPT	A Transformer model trained to predict the next token in a sequence
Autoregressive	Generating one token at a time, feeding each output back as input
Causal Mask	Lower-triangular matrix that prevents attending to future positions
Cross-Entropy Loss	Measures how well the predicted probability matches the true next token
Weight Tying	Sharing weights between the input embedding and output projection
Pre-Norm	Applying LayerNorm before (not after) attention and FFN sublayers
AdamW	Adam optimizer with decoupled weight decay — standard for transformers
Gradient Clipping	Capping gradient magnitudes to prevent training instability
Temperature	Controls the sharpness of the sampling distribution (`\tau < 1` = conservative, `\tau > 1` = creative)
Top-k Sampling	Restricting sampling to the `k` most probable tokens
Overfitting	When train loss decreases but validation loss increases — model is memorising, not learning
Checkpoint	A saved snapshot of the model's weights and configuration
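
Of the concepts in the table, the causal mask is the easiest to picture with a tiny example. The sketch below builds it as nested Python lists; `causal_mask` is an illustrative helper, whereas the model builds the same pattern with `torch.tril`.

```python
def causal_mask(T):
    """Lower-triangular T x T mask: position i may attend to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(T)] for i in range(T)]

for row in causal_mask(4):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

Each row is one query position; the zeros to the right of the diagonal are the "future" positions that attention is not allowed to see.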

📝 5.10 Exercises

Exercise 1: Trace the Dimensions (Pen and Paper)

Take an input batch of shape (batch_size=2, seq_len=8) with vocab_size=65 and embed_dim=128. Trace the shape of the tensor through every layer of the model: token embedding → position embedding → transformer block → final norm → output logits. Write down the shape at each stage.

Exercise 2: Experiment with Hyperparameters

Modify GPTConfig and retrain the model. Try each of these independently and record the best validation loss:

embed_dim = 64 (smaller model)

embed_dim = 256 (larger model)

num_blocks = 2 (shallower)

num_blocks = 8 (deeper)

learning_rate = 1e-3 (faster learning)

learning_rate = 1e-4 (slower learning)

Which change helps the most? Which hurts? Why do you think that is?

Exercise 3: Temperature Explorer

Write a script that generates text from the same prompt ("Once upon a time") at temperatures 0.1, 0.5, 0.8, 1.0, 1.5, and 2.0. Print all outputs side by side. At what temperature does the output become unreadable?

Exercise 4: Train on Your Own Data

Find a text file of your choice — it could be a collection of Panchatantra stories in English, Bollywood movie dialogues, or even your own writing. Train the model on it. How does the domain of training data affect what the model generates?

Exercise 5: Implement Top-p (Nucleus) Sampling

Top-k has a limitation: sometimes the top-5 tokens capture 99% of the probability, and sometimes they capture only 40%. Top-p sampling (also called nucleus sampling) is an alternative: instead of keeping the top-k tokens, keep the smallest set of tokens whose cumulative probability exceeds p (e.g., p = 0.9).

Implement top-p sampling in the generate method. Compare its output with top-k. Which do you prefer?
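
As a starting point rather than a full solution, the nucleus selection itself can be tried on a plain probability list. `nucleus` is a hypothetical helper and the two distributions are made up; integrating the idea into generate with tensors is still the exercise.

```python
def nucleus(probs, p=0.9):
    """Indices of the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in order:
        chosen.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return chosen

peaked = [0.92, 0.04, 0.02, 0.01, 0.01]   # one token dominates
flat = [0.25, 0.22, 0.20, 0.18, 0.15]     # probability spread out
print(len(nucleus(peaked)), len(nucleus(flat)))   # 1 5
```

Unlike top-k, the number of surviving tokens adapts to how confident the model is.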

Exercise 6: The Perplexity Metric

Perplexity is a common metric for language models, defined as:

\text{PPL} = e^{\mathcal{L}}

where \mathcal{L} is the cross-entropy loss. A perplexity of 10 means the model is, on average, as confused as if it were choosing between 10 equally likely options. Write a function that computes your model's perplexity on the validation set. What value do you get? How does it change with different hyperparameters?
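
The formula itself is a one-liner, shown here on hand-picked loss values so the numbers are easy to check. `perplexity` is a hypothetical helper; it assumes the losses are in nats (as returned by `F.cross_entropy`) and come from equally sized batches. Running it over the real validation set is the part left to you.

```python
import math

def perplexity(losses):
    """Perplexity from a list of per-batch cross-entropy losses (in nats)."""
    mean_loss = sum(losses) / len(losses)
    return math.exp(mean_loss)

# A loss of ln(10) corresponds to choosing among 10 equally likely options
print(round(perplexity([math.log(10)]), 6))   # 10.0

# Uniform guessing over a 65-character vocabulary: loss ln(65), about 4.17
print(round(perplexity([math.log(65)]), 6))   # 65.0
```

The second case is a useful baseline: an untrained model with a 65-character vocabulary should start near a perplexity of 65.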

Exercise 7: Attention Visualisation

Modify the MultiHeadAttention class to return the attention weights along with the output. Write a script that:

Feeds a short sentence into the model

Extracts attention weights from each head and each layer

Plots a heatmap showing which characters attend to which other characters

What patterns do you observe? Do different heads learn different patterns?

Important

What's Next? You've now built a complete language model — from architecture to training to generation. But our model is tiny and trains on a small dataset. In the next chapter, we'll explore how to scale up: larger models, better data, and the techniques that make billion-parameter models possible. We'll also discuss the ethical implications of large language models — a topic that every AI practitioner in India and globally must grapple with.

"The measure of intelligence is the ability to change." — Albert Einstein

Your Mini-GPT changes its weights 3,000 times during training. Whether that constitutes intelligence is a question we'll keep exploring.

Complete Source Code - Chapter 5

Below are the complete, runnable source files for this chapter. Every line is included.

Complete Code: model.py

Python
"""
🔴 Level 4: Mini-GPT Model Definition
========================================

A complete, self-contained GPT model for character-level language modeling.
This file defines the model architecture and can be imported by train.py and generate.py.

Architecture:
    - Character-level tokenization
    - Learned positional embeddings
    - 4 Transformer blocks with 4 attention heads
    - 128-dimensional embeddings
    - ~0.8M parameters (with the default vocab_size of 65)
"""

import torch
import torch.nn as nn
import torch.nn.functional as F
import math
from dataclasses import dataclass


# ============================================================================
# ⚙️ CONFIGURATION
# ============================================================================

@dataclass
class GPTConfig:
    """All hyperparameters in one place — easy to experiment!"""
    
    # Model architecture
    vocab_size: int = 65          # Will be set based on training data
    embed_dim: int = 128          # Size of token vectors
    num_heads: int = 4            # Number of attention heads
    num_blocks: int = 4           # Number of transformer blocks
    max_seq_len: int = 256        # Maximum context length
    dropout: float = 0.1         # Dropout rate for regularization
    
    # Training
    batch_size: int = 32          # Samples per training step
    learning_rate: float = 3e-4   # How fast the model learns
    max_steps: int = 3000         # Total training steps
    eval_interval: int = 100      # Evaluate every N steps
    eval_steps: int = 20          # Steps per evaluation
    sample_interval: int = 500    # Generate sample every N steps
    
    def __str__(self):
        lines = [f"  {k:20s}: {v}" for k, v in self.__dict__.items()]
        return "\n".join(lines)


# ============================================================================
# 🔀 Multi-Head Self-Attention
# ============================================================================

class MultiHeadAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.num_heads = config.num_heads
        self.head_dim = config.embed_dim // config.num_heads
        self.embed_dim = config.embed_dim
        
        self.W_qkv = nn.Linear(config.embed_dim, 3 * config.embed_dim, bias=False)
        self.W_out = nn.Linear(config.embed_dim, config.embed_dim, bias=False)
        self.attn_dropout = nn.Dropout(config.dropout)
        self.resid_dropout = nn.Dropout(config.dropout)
        self.scale = math.sqrt(self.head_dim)
    
    def forward(self, x, mask=None):
        B, T, C = x.shape
        
        qkv = self.W_qkv(x).reshape(B, T, 3, self.num_heads, self.head_dim)
        qkv = qkv.permute(2, 0, 3, 1, 4)
        Q, K, V = qkv[0], qkv[1], qkv[2]
        
        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        
        weights = F.softmax(scores, dim=-1)
        weights = self.attn_dropout(weights)
        
        out = torch.matmul(weights, V)
        out = out.transpose(1, 2).reshape(B, T, C)
        return self.resid_dropout(self.W_out(out))


# ============================================================================
# 🔧 Feed-Forward Network
# ============================================================================

class FeedForward(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(config.embed_dim, 4 * config.embed_dim),
            nn.GELU(),  # GELU is smoother than ReLU — used in modern models
            nn.Linear(4 * config.embed_dim, config.embed_dim),
            nn.Dropout(config.dropout),
        )
    
    def forward(self, x):
        return self.net(x)


# ============================================================================
# 🧱 Transformer Block
# ============================================================================

class TransformerBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.norm1 = nn.LayerNorm(config.embed_dim)
        self.norm2 = nn.LayerNorm(config.embed_dim)
        self.attention = MultiHeadAttention(config)
        self.ffn = FeedForward(config)
    
    def forward(self, x, mask=None):
        x = x + self.attention(self.norm1(x), mask)
        x = x + self.ffn(self.norm2(x))
        return x


# ============================================================================
# 🏗️ MINI-GPT MODEL
# ============================================================================

class MiniGPT(nn.Module):
    """
    A complete GPT-style language model for character-level text generation.
    
    This model:
    - Takes a sequence of character indices
    - Processes them through embedding + transformer blocks
    - Predicts the probability of the next character
    - Can generate new text autoregressively
    """
    
    def __init__(self, config):
        super().__init__()
        self.config = config
        
        # Token and position embeddings
        self.token_emb = nn.Embedding(config.vocab_size, config.embed_dim)
        self.pos_emb = nn.Embedding(config.max_seq_len, config.embed_dim)
        self.dropout = nn.Dropout(config.dropout)
        
        # Stack of transformer blocks
        self.blocks = nn.ModuleList([
            TransformerBlock(config) for _ in range(config.num_blocks)
        ])
        
        # Final layer norm and output projection
        self.final_norm = nn.LayerNorm(config.embed_dim)
        self.output_head = nn.Linear(config.embed_dim, config.vocab_size, bias=False)
        
        # Weight tying (improves performance)
        self.output_head.weight = self.token_emb.weight
        
        # Initialize weights
        self.apply(self._init_weights)
        
        # Cache the parameter count (tied weights are counted once)
        n_params = sum(p.numel() for p in self.parameters())
        self._param_count = n_params
    
    def _init_weights(self, module):
        """Initialize weights for better training."""
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
    
    def forward(self, idx, targets=None):
        """
        Forward pass.
        
        Args:
            idx: (batch, seq_len) — input token indices
            targets: (batch, seq_len) — target token indices (optional)
        
        Returns:
            logits: (batch, seq_len, vocab_size)
            loss: scalar (if targets provided)
        """
        B, T = idx.shape
        
        # Embeddings
        tok = self.token_emb(idx)
        pos = self.pos_emb(torch.arange(T, device=idx.device))
        x = self.dropout(tok + pos)
        
        # Causal mask
        mask = torch.tril(torch.ones(T, T, device=idx.device)).unsqueeze(0)
        
        # Transformer blocks
        for block in self.blocks:
            x = block(x, mask)
        
        # Output
        x = self.final_norm(x)
        logits = self.output_head(x)
        
        # Loss
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, self.config.vocab_size), targets.view(-1))
        
        return logits, loss
    
    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        """
        Generate new tokens autoregressively.
        
        Args:
            idx: (1, seq_len) — starting tokens
            max_new_tokens: how many tokens to generate
            temperature: creativity control, must be > 0 (0.1=safe, 1.0=normal, 1.5=creative)
            top_k: only consider top-k most likely tokens
        
        Returns:
            idx: (1, seq_len + max_new_tokens)
        """
        self.eval()
        
        for _ in range(max_new_tokens):
            # Crop to max context
            idx_crop = idx[:, -self.config.max_seq_len:]
            
            # Forward pass
            logits, _ = self(idx_crop)
            logits = logits[:, -1, :] / temperature
            
            # Top-k filtering
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = float('-inf')
            
            # Sample
            probs = F.softmax(logits, dim=-1)
            next_tok = torch.multinomial(probs, num_samples=1)
            idx = torch.cat([idx, next_tok], dim=1)
        
        return idx
    
    def count_parameters(self):
        """Return total number of parameters."""
        return self._param_count


if __name__ == '__main__':
    # Quick test
    print("\n\033[1m\033[95m" + "="*50)
    print("  🔴 Mini-GPT Model Test")
    print("="*50 + "\033[0m\n")
    
    config = GPTConfig(vocab_size=65)
    model = MiniGPT(config)
    
    print(f"  \033[1mConfiguration:\033[0m")
    print(config)
    
    n_params = model.count_parameters()
    print(f"\n  \033[1m\033[93mTotal parameters: {n_params:,}\033[0m")
    print(f"  That's {n_params/1e6:.2f}M parameters — tiny compared to GPT-2 (124M)!\n")
    
    # Test forward pass
    x = torch.randint(0, 65, (2, 32))
    logits, loss = model(x, targets=x)
    print(f"  Forward pass test:")
    print(f"    Input:  {list(x.shape)}")
    print(f"    Output: {list(logits.shape)}")
    print(f"    Loss:   {loss.item():.4f}")
    
    # Test generation
    start = torch.zeros((1, 1), dtype=torch.long)
    gen = model.generate(start, max_new_tokens=20)
    print(f"    Generated tokens: {gen[0].tolist()}")
    
    print(f"\n  \033[1m\033[92m✅ Model works! Ready for training.\033[0m\n")
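
As a cross-check on the total that the quick test prints, the parameter count can be worked out by hand from the layer shapes in model.py. The sketch below uses a hypothetical helper, `count_mini_gpt_params`, and assumes the tied embedding and output weights are counted once, which is how PyTorch's `parameters()` treats a shared tensor.

```python
def count_mini_gpt_params(vocab_size=65, embed_dim=128, num_blocks=4, max_seq_len=256):
    """Hand count of trainable parameters for the architecture in model.py."""
    token_emb = vocab_size * embed_dim          # shared with output_head (weight tying)
    pos_emb = max_seq_len * embed_dim
    layer_norm = 2 * embed_dim                  # weight + bias
    # W_qkv and W_out are created with bias=False
    attention = embed_dim * 3 * embed_dim + embed_dim * embed_dim
    # Two linear layers with biases: d -> 4d and 4d -> d
    ffn = (embed_dim * 4 * embed_dim + 4 * embed_dim) + (4 * embed_dim * embed_dim + embed_dim)
    block = 2 * layer_norm + attention + ffn
    return token_emb + pos_emb + num_blocks * block + layer_norm   # last term: final_norm

print(f"{count_mini_gpt_params():,}")   # 832,384
```

Almost all of the parameters sit in the four transformer blocks (197,760 each); the embeddings contribute only about 41,000.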

Complete Code: train.py

Python
"""
🔴 Level 4: Train Your Mini-GPT!
====================================

This script trains your Mini-GPT model on the stories dataset.
Watch it go from random gibberish to coherent text!

Usage:
    python train.py

The training takes ~5-10 minutes on CPU.
"""

import os
import sys
import time
import math
import torch

# Add this script's directory to the path so we can import model
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
from model import MiniGPT, GPTConfig

# ============================================================================
# 🎨 Colors
# ============================================================================
class Colors:
    HEADER = '\033[95m'
    BLUE = '\033[94m'
    CYAN = '\033[96m'
    GREEN = '\033[92m'
    YELLOW = '\033[93m'
    RED = '\033[91m'
    BOLD = '\033[1m'
    DIM = '\033[2m'
    RESET = '\033[0m'

def print_header(text):
    print(f"\n{Colors.BOLD}{Colors.HEADER}{'='*60}")
    print(f"  {text}")
    print(f"{'='*60}{Colors.RESET}\n")

def print_step(num, text):
    print(f"{Colors.BOLD}{Colors.CYAN}📌 Step {num}: {text}{Colors.RESET}")

def print_info(text):
    print(f"  {Colors.DIM}{text}{Colors.RESET}")

def print_success(text):
    print(f"  {Colors.GREEN}✓ {text}{Colors.RESET}")


# ============================================================================
# 📦 Data Loading & Tokenization
# ============================================================================

class CharDataset:
    """Character-level dataset for language modeling."""
    
    def __init__(self, text, config):
        self.config = config
        
        # Build vocabulary from the text
        chars = sorted(list(set(text)))
        self.char_to_idx = {ch: i for i, ch in enumerate(chars)}
        self.idx_to_char = {i: ch for i, ch in enumerate(chars)}
        self.vocab_size = len(chars)
        
        # Encode entire text
        self.data = torch.tensor([self.char_to_idx[ch] for ch in text], dtype=torch.long)
        
        # Train/validation split (90/10)
        n = int(0.9 * len(self.data))
        self.train_data = self.data[:n]
        self.val_data = self.data[n:]
    
    def encode(self, text):
        return [self.char_to_idx.get(ch, 0) for ch in text]
    
    def decode(self, indices):
        return ''.join([self.idx_to_char.get(i, '?') for i in indices])
    
    def get_batch(self, split='train'):
        """Get a random batch of training data."""
        data = self.train_data if split == 'train' else self.val_data
        seq_len = self.config.max_seq_len
        batch_size = self.config.batch_size
        
        # Random starting positions
        ix = torch.randint(len(data) - seq_len - 1, (batch_size,))
        
        # Input and target sequences
        x = torch.stack([data[i:i+seq_len] for i in ix])
        y = torch.stack([data[i+1:i+seq_len+1] for i in ix])
        
        return x, y


# ============================================================================
# 📊 Evaluation
# ============================================================================

@torch.no_grad()
def estimate_loss(model, dataset, config):
    """Estimate average loss on train and validation sets."""
    model.eval()
    losses = {}
    
    for split in ['train', 'val']:
        total_loss = 0.0
        for _ in range(config.eval_steps):
            x, y = dataset.get_batch(split)
            _, loss = model(x, targets=y)
            total_loss += loss.item()
        losses[split] = total_loss / config.eval_steps
    
    model.train()
    return losses


# ============================================================================
# 🎯 Progress Bar
# ============================================================================

def progress_bar(current, total, width=40, loss=None, extra=""):
    """Simple progress bar without tqdm dependency."""
    filled = int(width * current / total)
    bar = '█' * filled + '░' * (width - filled)
    percent = 100 * current / total
    
    loss_str = f" loss={loss:.4f}" if loss is not None else ""
    print(f"\r  [{bar}] {percent:5.1f}% ({current}/{total}){loss_str} {extra}", end='', flush=True)


# ============================================================================
# 🚀 MAIN TRAINING LOOP
# ============================================================================

if __name__ == '__main__':
    print_header("🔴 Level 4: Training Your Mini-GPT!")
    
    # --- Step 1: Load Data ---
    print_step(1, "Loading training data")
    
    data_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'data', 'stories.txt')
    
    if not os.path.exists(data_path):
        print(f"  {Colors.RED}Error: {data_path} not found!{Colors.RESET}")
        print(f"  Make sure stories.txt is in the data/ folder.")
        sys.exit(1)
    
    with open(data_path, 'r', encoding='utf-8') as f:
        text = f.read()
    
    print_info(f"Loaded {len(text):,} characters")
    print_info(f"First 100 chars: \"{text[:100]}...\"")
    
    # --- Step 2: Create Dataset ---
    print_step(2, "Creating character-level dataset")
    
    config = GPTConfig()
    dataset = CharDataset(text, config)
    
    # Update config with actual vocab size
    config.vocab_size = dataset.vocab_size
    
    print_info(f"Vocabulary size: {dataset.vocab_size} unique characters")
    print_info(f"Training data: {len(dataset.train_data):,} characters")
    print_info(f"Validation data: {len(dataset.val_data):,} characters")
    
    print(f"\n  {Colors.YELLOW}Character vocabulary:{Colors.RESET}")
    chars_display = ''.join([dataset.idx_to_char[i] for i in range(min(dataset.vocab_size, 50))])
    print(f"  {repr(chars_display)}")
    
    # --- Step 3: Create Model ---
    print_step(3, "Building Mini-GPT model")
    
    model = MiniGPT(config)
    n_params = model.count_parameters()
    
    print_info(f"Model parameters: {n_params:,} ({n_params/1e6:.2f}M)")
    print(f"\n  {Colors.YELLOW}Configuration:{Colors.RESET}")
    print(config)
    
    # --- Step 4: Setup Optimizer ---
    print_step(4, "Setting up optimizer")
    
    optimizer = torch.optim.AdamW(model.parameters(), lr=config.learning_rate)
    print_info(f"Optimizer: AdamW, lr={config.learning_rate}")
    
    # --- Step 5: TRAINING! ---
    print_header("🏋️ Training Loop Starting!")
    print(f"  Training for {config.max_steps} steps...")
    print(f"  Evaluating every {config.eval_interval} steps")
    print(f"  Generating sample every {config.sample_interval} steps\n")
    
    # Generate BEFORE training to show how bad it is
    print(f"  {Colors.YELLOW}📝 Sample BEFORE training:{Colors.RESET}")
    start_tokens = torch.zeros((1, 1), dtype=torch.long)
    generated = model.generate(start_tokens, max_new_tokens=100, temperature=1.0)
    gen_text = dataset.decode(generated[0].tolist())
    print(f"  {Colors.DIM}\"{gen_text[:100]}\"{Colors.RESET}")
    print(f"  {Colors.RED}^ Complete garbage! The model knows nothing yet.{Colors.RESET}\n")
    
    model.train()
    start_time = time.time()
    best_val_loss = float('inf')
    train_losses = []
    
    for step in range(config.max_steps):
        # Get batch
        x, y = dataset.get_batch('train')
        
        # Forward pass
        logits, loss = model(x, targets=y)
        
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        
        # Gradient clipping (prevents exploding gradients)
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        
        # Update weights
        optimizer.step()
        
        train_losses.append(loss.item())
        
        # Progress bar
        progress_bar(step + 1, config.max_steps, loss=loss.item())
        
        # Evaluate periodically
        if (step + 1) % config.eval_interval == 0:
            losses = estimate_loss(model, dataset, config)
            elapsed = time.time() - start_time
            
            print()  # New line after progress bar
            print(f"\n  {Colors.BOLD}Step {step+1}/{config.max_steps}{Colors.RESET}")
            print(f"    Train loss: {Colors.CYAN}{losses['train']:.4f}{Colors.RESET}")
            print(f"    Val loss:   {Colors.CYAN}{losses['val']:.4f}{Colors.RESET}")
            print(f"    Time:       {elapsed:.1f}s")
            
            # Save best model
            if losses['val'] < best_val_loss:
                best_val_loss = losses['val']
                save_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'mini_gpt_model.pt')
                torch.save({
                    'model_state_dict': model.state_dict(),
                    'config': config,
                    'char_to_idx': dataset.char_to_idx,
                    'idx_to_char': dataset.idx_to_char,
                    'vocab_size': dataset.vocab_size,
                    'step': step + 1,
                    'val_loss': best_val_loss,
                }, save_path)
                print(f"    {Colors.GREEN}💾 Best model saved! (val_loss={best_val_loss:.4f}){Colors.RESET}")
            
            print()
        
        # Generate sample periodically
        if (step + 1) % config.sample_interval == 0:
            model.eval()
            start_tokens = torch.zeros((1, 1), dtype=torch.long)
            generated = model.generate(start_tokens, max_new_tokens=150, temperature=0.8)
            gen_text = dataset.decode(generated[0].tolist())
            model.train()
            
            print(f"  {Colors.YELLOW}📝 Sample at step {step+1}:{Colors.RESET}")
            print(f"  {Colors.GREEN}\"{gen_text[:150]}\"{Colors.RESET}\n")
    
    # --- Training Complete ---
    total_time = time.time() - start_time
    
    print_header("🎉 Training Complete!")
    
    print(f"  Total training time: {Colors.BOLD}{total_time:.1f} seconds{Colors.RESET}")
    print(f"  ({total_time/60:.1f} minutes)")
    print(f"  Best validation loss: {Colors.BOLD}{best_val_loss:.4f}{Colors.RESET}")
    
    # Final generation
    print(f"\n  {Colors.YELLOW}📝 Final generated text:{Colors.RESET}")
    model.eval()
    
    prompts = ["The ", "A ", "Once "]
    for prompt in prompts:
        encoded = dataset.encode(prompt)
        start_tokens = torch.tensor([encoded], dtype=torch.long)
        generated = model.generate(start_tokens, max_new_tokens=200, temperature=0.7, top_k=20)
        gen_text = dataset.decode(generated[0].tolist())
        print(f"\n  {Colors.CYAN}Prompt: \"{prompt}\"{Colors.RESET}")
        print(f"  {Colors.GREEN}{gen_text[:200]}{Colors.RESET}")
    
    print(f"\n\n  {Colors.BOLD}{Colors.GREEN}✅ Training complete!{Colors.RESET}")
    print(f"  {Colors.DIM}Model saved to: mini_gpt_model.pt{Colors.RESET}")
    print(f"  {Colors.DIM}Run 'python generate.py' to chat with your model!{Colors.RESET}\n")
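
The input/target construction in `get_batch` is easiest to see on a short string. This is a sketch with a hypothetical helper, `make_pair`, that applies the same two slices to text instead of a tensor of indices.

```python
def make_pair(data, start, seq_len):
    """Input is seq_len characters; target is the same window shifted one step right."""
    x = data[start:start + seq_len]
    y = data[start + 1:start + seq_len + 1]
    return x, y

x, y = make_pair("Once upon a time", start=0, seq_len=8)
print(repr(x))   # 'Once upo'
print(repr(y))   # 'nce upon'
```

At every position the target is simply the next character of the input, so one window of length `seq_len` yields `seq_len` next-character predictions at once.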

Complete Code: generate.py

Python
"""
🔴 Level 4: Chat with Your Mini-GPT!
========================================

Interactive text generation with your trained model.

Usage:
    python generate.py

Commands:
    Type any text → model generates continuation
    temp 0.5     → change temperature (creativity level)
    topk 20      → change top-k sampling
    quit         → exit
"""

import os
import sys
import torch

sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
from model import MiniGPT, GPTConfig

# ============================================================================
# 🎨 Colors
# ============================================================================
class Colors:
    HEADER = '\033[95m'
    BLUE = '\033[94m'
    CYAN = '\033[96m'
    GREEN = '\033[92m'
    YELLOW = '\033[93m'
    RED = '\033[91m'
    BOLD = '\033[1m'
    DIM = '\033[2m'
    RESET = '\033[0m'


def load_model():
    """Load the trained model from checkpoint."""
    model_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'mini_gpt_model.pt')
    
    if not os.path.exists(model_path):
        print(f"\n  {Colors.RED}❌ Model not found!{Colors.RESET}")
        print(f"  The file '{model_path}' does not exist.")
        print(f"\n  You need to train the model first:")
        print(f"  {Colors.CYAN}python train.py{Colors.RESET}")
        print(f"\n  Training takes about 5-10 minutes on CPU.")
        return None, None, None
    
    print(f"  {Colors.DIM}Loading model from {model_path}...{Colors.RESET}")
    
    checkpoint = torch.load(model_path, map_location='cpu', weights_only=False)
    
    config = checkpoint['config']
    char_to_idx = checkpoint['char_to_idx']
    idx_to_char = checkpoint['idx_to_char']
    
    model = MiniGPT(config)
    model.load_state_dict(checkpoint['model_state_dict'])
    model.eval()
    
    print(f"  {Colors.GREEN}✓ Model loaded! ({model.count_parameters():,} parameters){Colors.RESET}")
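    val_loss = checkpoint.get('val_loss')
    # Format the loss only when it exists; applying ':.4f' to the '?' fallback would raise ValueError
    val_loss_text = f"{float(val_loss):.4f}" if val_loss is not None else "?"
    print(f"  {Colors.DIM}Trained to step {checkpoint.get('step', '?')}, val_loss={val_loss_text}{Colors.RESET}")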
    
    return model, char_to_idx, idx_to_char


def generate_text(model, prompt, char_to_idx, idx_to_char, 
                  max_tokens=200, temperature=0.8, top_k=20):
    """Generate text from a prompt."""
    # Encode prompt
    encoded = [char_to_idx.get(ch, 0) for ch in prompt]
    input_ids = torch.tensor([encoded], dtype=torch.long)
    
    # Generate
    with torch.no_grad():
        output = model.generate(input_ids, max_new_tokens=max_tokens, 
                               temperature=temperature, top_k=top_k)
    
    # Decode
    generated = ''.join([idx_to_char.get(i, '?') for i in output[0].tolist()])
    return generated


if __name__ == '__main__':
    # Banner
    print(f"""
{Colors.BOLD}{Colors.HEADER}╔══════════════════════════════════════════════╗
║                                              ║
║   🔴 Mini-GPT Interactive Text Generator     ║
║                                              ║
║   Your very own language model!               ║
║                                              ║
╚══════════════════════════════════════════════╝{Colors.RESET}
""")
    
    # Load model
    model, char_to_idx, idx_to_char = load_model()
    
    if model is None:
        sys.exit(1)
    
    # Settings
    temperature = 0.8
    top_k = 20
    max_tokens = 200
    
    print(f"""
  {Colors.YELLOW}Settings:{Colors.RESET}
    Temperature: {temperature} (creativity level)
    Top-k:       {top_k} (diversity control)
    Max tokens:  {max_tokens}
  
  {Colors.YELLOW}Commands:{Colors.RESET}
    Type any text → model generates continuation
    {Colors.CYAN}temp 0.5{Colors.RESET}     → change temperature
    {Colors.CYAN}topk 10{Colors.RESET}      → change top-k
    {Colors.CYAN}tokens 300{Colors.RESET}   → change max tokens
    {Colors.CYAN}quit{Colors.RESET}         → exit
""")
    
    # Interactive loop
    while True:
        try:
            prompt = input(f"  {Colors.BOLD}{Colors.CYAN}You > {Colors.RESET}")
        except (EOFError, KeyboardInterrupt):
            print(f"\n\n  {Colors.DIM}Goodbye! 👋{Colors.RESET}\n")
            break
        
        if not prompt.strip():
            continue
        
        # Handle commands
        if prompt.strip().lower() == 'quit':
            print(f"\n  {Colors.DIM}Goodbye! 👋{Colors.RESET}\n")
            break
        
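        if prompt.strip().lower().startswith('temp '):
            try:
                new_temperature = float(prompt.strip().split()[1])
                # Sampling typically divides the logits by the temperature, so reject zero and negatives
                if new_temperature <= 0:
                    raise ValueError("temperature must be positive")
                temperature = new_temperature
                print(f"  {Colors.GREEN}✓ Temperature set to {temperature}{Colors.RESET}\n")
            except (ValueError, IndexError):
                print(f"  {Colors.RED}Usage: temp 0.5 (must be greater than 0){Colors.RESET}\n")
            continue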
        
        if prompt.strip().lower().startswith('topk '):
            try:
                top_k = int(prompt.strip().split()[1])
                print(f"  {Colors.GREEN}✓ Top-k set to {top_k}{Colors.RESET}\n")
            except (ValueError, IndexError):
                print(f"  {Colors.RED}Usage: topk 20{Colors.RESET}\n")
            continue
        
        if prompt.strip().lower().startswith('tokens '):
            try:
                max_tokens = int(prompt.strip().split()[1])
                print(f"  {Colors.GREEN}✓ Max tokens set to {max_tokens}{Colors.RESET}\n")
            except (ValueError, IndexError):
                print(f"  {Colors.RED}Usage: tokens 300{Colors.RESET}\n")
            continue
        
        # Generate text
        try:
            generated = generate_text(
                model, prompt, char_to_idx, idx_to_char,
                max_tokens=max_tokens, temperature=temperature, top_k=top_k
            )
            
            # Display: prompt in cyan, generated part in green
            prompt_len = len(prompt)
            print(f"\n  {Colors.CYAN}{generated[:prompt_len]}{Colors.GREEN}{generated[prompt_len:]}{Colors.RESET}")
            print(f"\n  {Colors.DIM}[temp={temperature}, top_k={top_k}, tokens={len(generated)-prompt_len}]{Colors.RESET}\n")
            
        except Exception as e:
            print(f"  {Colors.RED}Error: {e}{Colors.RESET}\n")
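
The temperature and top-k settings above are passed straight to model.generate, whose internals are not shown in this listing. As a rough, dependency-free sketch of what such a sampling step usually does (the function name sample_next and the toy scores are illustrative assumptions, not the course's actual implementation):

```python
import math
import random

def sample_next(logits, temperature=0.8, top_k=20, rng=random):
    """Pick one token index from raw scores using temperature and top-k."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Top-k: keep only the k highest-scoring candidates.
    k = max(1, min(top_k, len(logits)))
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature: values below 1 sharpen the distribution, values above 1 flatten it.
    scaled = [logits[i] / temperature for i in top]
    # Numerically stable softmax weights over the surviving candidates.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]          # toy scores for a 4-token vocabulary
rng = random.Random(0)
picks = [sample_next(logits, temperature=0.8, top_k=2, rng=rng) for _ in range(1000)]
print(sorted(set(picks)))                # only the two best tokens can ever appear
greedy = sample_next(logits, temperature=0.8, top_k=1)
print(greedy)                            # top_k=1 always returns the best token
```

With top_k=1 the choice is deterministic; raising top_k or the temperature lets less likely tokens through, which is why the script calls them "diversity control" and "creativity level".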

The Training Data: stories.txt

Text
Once upon a time, in a small village near the river, there lived a wise old farmer. He worked hard every day in his fields. The farmer grew rice, wheat, and vegetables. He shared his food with everyone in the village. People loved him because he was kind and generous.

The sun rises in the east and sets in the west. Every morning, the birds sing beautiful songs. The flowers open their petals to welcome the sunlight. The trees provide shade and fresh air. Nature is beautiful and full of wonders.

A clever fox lived in a forest near a village. One hot summer day, the fox was very thirsty. He searched for water everywhere but could not find any. Then he saw a pot with some water at the bottom. The fox put small stones into the pot one by one. Slowly the water came up to the top. The fox drank the water happily. This story teaches us that intelligence solves problems.

The river Ganga flows from the Himalayas to the Bay of Bengal. It is one of the longest rivers in India. Many cities and towns are built along its banks. People use the river water for drinking and farming. The Ganga is very important for the people of India.

A kind king ruled a beautiful kingdom. His people were happy and peaceful. The king built schools for children and hospitals for the sick. He made sure everyone had food to eat and a place to live. The kingdom prospered under his wise rule.

The moon shines brightly in the night sky. Stars twinkle like tiny diamonds above us. The sky changes color from blue to orange during sunset. Clouds float gently across the sky like cotton balls. Looking at the sky fills our hearts with wonder.

A small boy named Arjun loved to read books. He would sit under the banyan tree and read for hours. His favorite books were about science and adventure. One day he read about the solar system and the planets. He dreamed of becoming a scientist when he grew up.

Water is essential for all living things. Plants need water to grow and make food. Animals drink water to stay alive and healthy. The water cycle keeps water moving around the earth. Rain fills the rivers and lakes with fresh water.

A poor woodcutter lived at the edge of a forest. Every day he would cut wood and sell it in the market. One day his axe fell into the river. He sat by the river and cried because he was very poor. A kind spirit appeared and asked him what happened. The spirit dove into the water and brought up a golden axe. The woodcutter said that was not his axe. The spirit brought up a silver axe. Again the woodcutter said it was not his. Finally the spirit brought up his old iron axe. The woodcutter was happy and said yes that is mine. The spirit was pleased with his honesty and gave him all three axes.

The earth goes around the sun in one year. The moon goes around the earth in about one month. The earth spins on its axis once every day. This spinning gives us day and night. When our part of the earth faces the sun it is daytime. When it faces away from the sun it is nighttime.

A beautiful peacock lived in a garden near the palace. It had colorful feathers of blue and green. When it danced in the rain everyone would stop and watch. The peacock was proud of its beautiful feathers. It spread its tail like a magnificent fan.

Trees are very important for our planet. They give us oxygen to breathe and clean the air. Trees provide fruits and nuts for us to eat. Birds build their nests in the branches of trees. We should plant more trees and take care of them.

A young girl named Priya wanted to learn music. She practiced singing every day after school. Her teacher said she had a beautiful voice. Priya worked very hard and never missed a practice session. After many months she sang in a concert and everyone clapped.

The heart pumps blood through our body. Blood carries oxygen and food to every part of the body. The brain controls all our movements and thoughts. Our bones give shape to our body and protect our organs. The human body is an amazing machine.

An old tortoise and a young rabbit decided to have a race. The rabbit ran very fast and went far ahead. He thought he had plenty of time so he took a nap. The tortoise kept walking slowly but steadily. When the rabbit woke up the tortoise had already crossed the finish line. Slow and steady wins the race.

India has many beautiful festivals throughout the year. Diwali is the festival of lights celebrated with joy and happiness. Holi is the festival of colors where people play with colored powder. Eid brings people together for prayers and feasts. Christmas is celebrated with decorations and gifts.

A magnet has two poles called north and south. Like poles repel each other and unlike poles attract. Magnets can attract things made of iron and steel. The earth itself is like a giant magnet. A compass needle points north because of the earth magnetic field.

There was a merchant who traveled from town to town selling goods. He carried silk cloths and precious spices on his camel. One day he got lost in the desert during a sandstorm. He prayed for help and soon the storm passed away. He followed the stars in the night sky and found his way home.

Light travels in straight lines very fast. When light passes through a prism it splits into seven colors. These colors are violet indigo blue green yellow orange and red. We can see a rainbow after rain because water drops act like tiny prisms. Light is a form of energy that helps us see the world.

A mother bird built a nest in a tall tree. She laid three small eggs in the nest. She sat on the eggs to keep them warm for many days. Soon the eggs cracked and three baby birds came out. The mother bird brought food for her babies every day until they learned to fly.

Plants make their own food through photosynthesis. They use sunlight water and carbon dioxide for this process. The green color in leaves comes from a substance called chlorophyll. Chlorophyll captures sunlight to make food for the plant. Plants give out oxygen during photosynthesis which we breathe.

A brave soldier named Ravi protected his village from danger. He stood guard at the border day and night without complaint. The villagers respected him and treated him like a hero. Ravi taught the young boys how to be brave and strong. He said courage means doing the right thing even when you are afraid.

The seasons change throughout the year in India. Summer is hot and dry with temperatures rising very high. The monsoon brings heavy rains and cools the land. Winter is cold and pleasant in most parts of the country. Spring brings new flowers and green leaves on the trees.

A fisherman went to the sea every morning in his small boat. He would throw his net into the water and wait patiently. Sometimes he caught many fish and sometimes very few. One day he caught a beautiful golden fish. The golden fish spoke and asked to be set free. The kind fisherman released it back into the sea.

Electricity flows through wires like water flows through pipes. We use electricity to power lights fans and computers. A battery stores electrical energy for later use. Switches control the flow of electricity in a circuit. We should use electricity wisely and not waste it.

Two friends were walking through a forest one day. Suddenly they saw a large bear coming toward them. One friend quickly climbed a tree to save himself. The other friend lay down on the ground and pretended to be dead. The bear came close and smelled him then walked away. When the bear left the friend in the tree came down. He asked what the bear whispered in his ear. The friend on the ground said the bear told me not to trust a friend who runs away in danger.

Mountains are the tallest landforms on the earth. The Himalayas are the highest mountains in the world. Mount Everest is the tallest peak standing at eight thousand meters. Many rivers begin from the glaciers in the mountains. Mountains affect the weather and rainfall in nearby areas.

A little ant worked hard all summer long. It collected food and stored it carefully in its home. A grasshopper spent the whole summer singing and dancing. When winter came the ant had plenty of food to eat. The grasshopper had nothing and was cold and hungry. The ant shared some food with the grasshopper and said it is wise to prepare for the future.

Sound is a form of energy that travels in waves. We hear sounds when these waves reach our ears. Sound travels faster through water than through air. It travels fastest through solid objects like metal. Very loud sounds can damage our hearing so we should protect our ears.

A teacher loved her students very much. She came to school early every day to prepare her lessons. She explained difficult topics in simple and easy ways. Her students always performed well in their examinations. She believed that every child can learn if given the right guidance.

Chapter 6

Fine-Tuning — From Generic AI to Your Personal ChatBot

Learning Objectives

Explain why fine-tuning is more practical than training from scratch — and how it saves crores of rupees.
Distinguish between pre-training and fine-tuning with clear mental models.
Describe LoRA (Low-Rank Adaptation), including the math behind it and why it's revolutionary.
Trace the full RLHF pipeline used to build ChatGPT, Claude, and Gemini.
Prepare a custom Q&A dataset for fine-tuning using Python.
Fine-tune DistilGPT-2 with LoRA on your own data — on a regular laptop.
Chat with your fine-tuned model and compare it against the base model.
Critically evaluate the ethical dimensions of fine-tuning, especially in the Indian context.

6.1 Standing on the Shoulders of Giants

Imagine you want to start a chai stall. Would you plant tea bushes, wait three years for them to grow, build a factory to process leaves, and then start making chai? Of course not! You buy ready-made tea powder from Tata or Brooke Bond and focus on what makes your chai special — the perfect ratio of adrak, elaichi, and sugar.

Fine-tuning works exactly the same way.

Training a large language model from scratch (what we call pre-training) is staggeringly expensive. GPT-3 cost an estimated $4.6 million (roughly ₹38 crore) in compute alone. GPT-4's training cost is rumoured to exceed $100 million (₹830+ crore). These models are trained on hundreds of billions to trillions of tokens of internet text for weeks on clusters of thousands of GPUs.

Now compare that to fine-tuning. When you fine-tune, you take a pre-trained model that already understands language (grammar, facts, reasoning patterns) and teach it your specific task. A fine-tuning run on a small model like DistilGPT-2 can cost anywhere from $0 (free, on your laptop's CPU) to $10 (a few hours on a cloud GPU). Even fine-tuning a 7-billion-parameter model with LoRA on a single A100 GPU can cost under $50.

Approach	Cost	Time	Data Needed	Hardware
Pre-training from scratch	₹38–830 crore	Weeks–months	Trillions of tokens	Thousands of GPUs
Fine-tuning (full)	₹400–₹40,000	Hours–days	Thousands of examples	1–8 GPUs
Fine-tuning (LoRA)	₹0–₹4,000	Minutes–hours	Hundreds of examples	1 GPU or CPU

Important

Fine-tuning is why AI is democratized today. You don't need the budget of Google or OpenAI. A student at IIT Bombay or a teacher in Jaipur can build a specialized AI chatbot with a laptop and a few hundred well-crafted examples.

The giants — Google, Meta, OpenAI, Mistral — have done the expensive work of pre-training. We stand on their shoulders and specialize their models for our needs.

6.2 Pre-training vs Fine-tuning

Here's the analogy that makes this click:

Pre-training is like going to school for 12 years. From Class 1 to Class 12, you learn Hindi, English, Maths, Science, Social Studies, Art — everything. By the time you finish school, you're a well-rounded person who knows a little about a lot.

Fine-tuning is like taking admission in a B.Sc. or B.Tech programme. You pick one subject — say, Computer Science — and spend 3–4 years going deep into it. You don't forget what you learned in school; you build on it.

Now consider what happens at each stage for a language model:

Pre-training

During pre-training, the model reads massive amounts of text from the internet — Wikipedia articles, books, news, code, forums — and learns to predict the next word. Through billions of these next-word predictions, it picks up:

Grammar and syntax — how sentences are structured
Facts and knowledge — who was the first Prime Minister of India, what is photosynthesis
Reasoning patterns — if X then Y, cause and effect
Multiple languages — Hindi, English, Tamil, and hundreds more

The pre-trained model is a generalist. Ask it anything and it will produce grammatically correct, somewhat relevant text. But it won't follow instructions well, it won't stay on topic, and it won't have the personality or expertise you want.

Fine-tuning

During fine-tuning, you take this generalist model and train it further on a small, curated dataset specific to your task. For our chatbot, we'll use education Q&A pairs — questions about science, maths, Indian history, and study tips, paired with clear, helpful answers.

After fine-tuning, the model:

Follows the Q&A format — it knows when to stop answering and doesn't ramble
Stays on topic — it gives education-relevant responses
Matches the tone of your training data — helpful, clear, encouraging

Tip

Think of it this way: pre-training gives the model its IQ (general intelligence). Fine-tuning gives it its specialization and personality (like a teacher who explains things simply, or a doctor who speaks with compassion).

6.3 What is LoRA?

Here's the problem with full fine-tuning: even though the dataset is small, you still need to update all of the model's weights. A model like LLaMA-2 7B has 7 billion parameters. Storing a full copy of those 7 billion updated parameters requires ~28 GB of memory (in FP32). Training them requires even more, because gradients and optimizer state must be kept in memory as well. For most of us, that's impossible.
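
The memory figures can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch that assumes plain Adam in FP32, which keeps a weight, a gradient, and two moment estimates per parameter; it ignores activations, so real training needs more.

```python
PARAMS = 7_000_000_000      # 7 billion parameters
BYTES_FP32 = 4              # one 32-bit float

def gigabytes(n_bytes):
    """Convert bytes to decimal gigabytes."""
    return n_bytes / 1e9

# Just holding the weights.
weights_gb = gigabytes(PARAMS * BYTES_FP32)

# Assumption: Adam in FP32 stores weight + gradient + 2 moment estimates per parameter.
training_gb = gigabytes(PARAMS * BYTES_FP32 * 4)

print(f"Weights only:       {weights_gb:.0f} GB")
print(f"Training with Adam: {training_gb:.0f} GB (before activations)")
```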

LoRA — Low-Rank Adaptation — solves this brilliantly.

The Sticky Note Analogy

Imagine you have a massive NCERT textbook — say, the Class 12 Physics textbook. It has 500 pages of printed text that you cannot change (those are the frozen pre-trained weights). But you can add small sticky notes (Post-it notes) on specific pages with your own handwritten additions, corrections, or summaries.

That's LoRA. Instead of rewriting the entire textbook (full fine-tuning), you add tiny, targeted modifications. The original textbook stays intact. Your additions are small — maybe 50 sticky notes total — but they dramatically change how you use the book.

The Math Behind LoRA

Let's go deeper. In a Transformer, the key computation happens in weight matrices — large matrices like W that transform input vectors into output vectors:

h = W \cdot x

where W is, say, a 768 \times 768 matrix (589,824 parameters) in DistilGPT-2.

In standard fine-tuning, you'd update every single one of those 589,824 values. LoRA instead says: "The change \Delta W that fine-tuning makes is probably low-rank." That is, the update matrix doesn't need all 589,824 degrees of freedom; it can be approximated by the product of two much smaller matrices.

LoRA decomposes \Delta W into two small matrices:

\Delta W = B \times A

where:

A is a matrix of shape r \times 768 (low-rank "down-projection")
B is a matrix of shape 768 \times r (low-rank "up-projection")
r is the rank — a tiny number like 4, 8, or 16

So instead of storing 768 \times 768 = 589,824 values for \Delta W, you store:

r \times 768 + 768 \times r = 2 \times r \times 768

With r = 8:

2 \times 8 \times 768 = 12,288 \text{ parameters}

That's a 98% reduction in trainable parameters! The forward pass becomes:

h = (W + \frac{\alpha}{r} \cdot B \times A) \cdot x = W \cdot x + \frac{\alpha}{r} \cdot B \times A \cdot x
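
To see the formula in action, here is a minimal sketch in plain Python lists. The sizes are toy assumptions (d = 6 and r = 2 instead of 768 and 8), and matvec and lora_forward are illustrative names. Following the LoRA paper's initialization, B starts at zero and A starts random, so the adapter contributes nothing until training moves B.

```python
import random

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """h = W x + (alpha / r) * B (A x); W stays frozen, only A and B would be trained."""
    base = matvec(W, x)               # frozen path
    low = matvec(B, matvec(A, x))     # adapter path: down-project to r, then up-project
    scale = alpha / r
    return [b + scale * l for b, l in zip(base, low)]

rng = random.Random(0)
d, r, alpha = 6, 2, 4
W = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]      # d x d, frozen
A = [[rng.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]   # r x d, random init
B = [[0.0] * r for _ in range(d)]                                # d x r, zero init
x = [rng.gauss(0, 1) for _ in range(d)]

h_start = lora_forward(W, A, B, x, alpha, r)
print(h_start == matvec(W, x))    # True: with B = 0 the model behaves like the base model

B[0][0] = 1.0                     # pretend training changed one adapter weight
h_after = lora_forward(W, A, B, x, alpha, r)
print(h_after[0] != h_start[0])   # the first output coordinate now differs
```

Note that the adapter path never forms the full d × d matrix B × A; it multiplies by A first, which is what keeps the extra compute small.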

Understanding Rank and Alpha

Rank (r): This controls the "capacity" of your adaptation. Think of it as how many sticky notes you're allowed to add. A rank of 4 means very focused changes; a rank of 64 gives more expressive power but requires more memory. For most tasks, r = 8 or r = 16 works beautifully.

Alpha (\alpha): This is a scaling factor that controls how much the LoRA adaptation influences the output. The effective scaling is \frac{\alpha}{r}. A common practice is to set \alpha = 2 \times r (e.g., r = 8, \alpha = 16), so the scaling factor is 2.
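
A quick way to build intuition for rank is to tabulate the adapter size for a few values of r. The sketch below uses the 768 × 768 matrix from the running example; lora_param_count is an illustrative helper, and the \alpha = 2r convention is the common practice mentioned above, not a requirement.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters in A (r x d_in) plus B (d_out x r)."""
    return r * d_in + d_out * r

full = 768 * 768
for r in (4, 8, 16, 64):
    n = lora_param_count(768, 768, r)
    alpha = 2 * r                          # common convention, so alpha / r stays at 2
    print(f"r={r:>2}: {n:>6,} trainable values "
          f"({100 * n / full:.2f}% of {full:,}), scaling alpha/r = {alpha / r:.0f}")
```

Even at r = 64 the adapter holds about a sixth of the values in the full matrix.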

Note

The beauty of LoRA is that at inference time, you can merge \Delta W into W to get W' = W + \frac{\alpha}{r} B A, so there's zero additional latency. The sticky notes get permanently written into the textbook, and the book is the same size as before.
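
The merge can be verified on a tiny hand-checkable example. This is a sketch with made-up 2 × 2 numbers; matmul, matvec, and merge_lora are illustrative helpers, not functions from a library.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def merge_lora(W, A, B, alpha, r):
    """Return W' = W + (alpha / r) * B A, a matrix with the same shape as W."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]

W = [[1, 0], [0, 1]]          # frozen 2 x 2 weight
A = [[1, 2]]                  # r x d with r = 1
B = [[0.5], [-1.0]]           # d x r
alpha, r = 2, 1
x = [3, 1]

# Two-path forward pass: W x + (alpha / r) * B (A x)
two_path = [b + (alpha / r) * l
            for b, l in zip(matvec(W, x), matvec(B, matvec(A, x)))]

# Single matrix after merging
merged = matvec(merge_lora(W, A, B, alpha, r), x)

print(two_path, merged)       # both are [8.0, -9.0]
```

After merging, inference uses one matrix of the original size, which is why the adapter adds no latency.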

Why LoRA is Revolutionary

Feature	Full Fine-tuning	LoRA
Parameters updated	All (100%)	0.1–2%
Memory required	Very high	Very low
Training speed	Slow	Fast
Storage per task	Full model copy	Tiny adapter (~MB)
Switch between tasks	Load entire model	Swap adapter file

This last point is especially powerful. Imagine you fine-tune the same base model for three different subjects — Physics, History, and Mathematics. With LoRA, you store one base model and three tiny adapter files. Swapping subjects is as fast as loading a small file.
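
The storage saving is easy to estimate. The sketch below assumes FP32 weights, a base model of roughly 82 million parameters (DistilGPT-2), and rank-8 adapters on one 768 → 2,304 attention projection in each of 6 layers; real adapter files carry a little metadata, so treat the numbers as approximate.

```python
BYTES_FP32 = 4
BASE_PARAMS = 82_000_000                   # DistilGPT-2, rounded
# Assumption: r = 8 adapters on one 768 -> 2304 projection in each of 6 layers.
ADAPTER_PARAMS = 6 * 8 * (768 + 2304)

def megabytes(n_params):
    """Approximate FP32 storage in decimal megabytes."""
    return n_params * BYTES_FP32 / 1e6

subjects = ["Physics", "History", "Mathematics"]
full_copies_mb = len(subjects) * megabytes(BASE_PARAMS)
lora_mb = megabytes(BASE_PARAMS) + len(subjects) * megabytes(ADAPTER_PARAMS)

print(f"One adapter:             {megabytes(ADAPTER_PARAMS):.2f} MB")
print(f"Three full model copies: {full_copies_mb:.0f} MB")
print(f"One base + 3 adapters:   {lora_mb:.1f} MB")
```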

6.4 The RLHF Pipeline — How ChatGPT, Claude, and Gemini Are Trained

You might wonder: if fine-tuning is so simple, why did it take until 2022 for ChatGPT to feel truly magical? The answer lies in a multi-stage pipeline that goes far beyond basic fine-tuning.

Here's the full journey:

Stage 1: Pre-training

The model (GPT-4, Gemini, Claude, LLaMA) is trained on trillions of tokens of internet text. This produces a base model — a powerful text predictor that can complete any sentence but doesn't follow instructions or behave helpfully.

Stage 2: Supervised Fine-Tuning (SFT)

Human annotators write thousands of high-quality (instruction, response) pairs. The model is fine-tuned on these examples to learn the format of being helpful. After SFT, the model can follow instructions, but its responses vary in quality.

Stage 3: RLHF (Reinforcement Learning from Human Feedback)

This is the magic sauce. Here's how it works:

Generate: The SFT model generates multiple responses to the same prompt.
Rank: Human raters rank these responses from best to worst. ("Response A is more helpful and accurate than Response B.")
Train a Reward Model: A separate neural network learns to predict which responses humans will prefer. It assigns a reward score to any (prompt, response) pair.
Optimize with RL: Using Proximal Policy Optimization (PPO), the language model is trained to generate responses that maximize the reward model's score — while staying close to the SFT model (to prevent it from "hacking" the reward).
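
The reward-model step is commonly trained with a pairwise loss: the score of the response humans preferred should exceed the score of the one they rejected. The sketch below shows that loss on plain numbers; the function name is illustrative, and in a real system the scores would come from a neural network.

```python
import math

def pairwise_reward_loss(score_chosen, score_rejected):
    """-log(sigmoid(score_chosen - score_rejected)), written as log(1 + exp(-margin))."""
    margin = score_chosen - score_rejected
    return math.log1p(math.exp(-margin))

agrees = pairwise_reward_loss(2.0, 0.0)      # reward model agrees with the human ranking
undecided = pairwise_reward_loss(1.0, 1.0)   # no preference: loss equals ln 2
disagrees = pairwise_reward_loss(0.0, 2.0)   # reward model prefers the rejected answer

print(f"agrees:    {agrees:.4f}")
print(f"undecided: {undecided:.4f}")
print(f"disagrees: {disagrees:.4f}")
```

Minimizing this loss pushes the preferred response's score above the rejected one, which is all the reward model needs in order to rank new responses.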

This is analogous to a teacher training a student:

SFT = giving the student worked example answers to copy
RLHF = having the student write their own answers, then giving grades and feedback

Stage 4: Constitutional AI and DPO

Anthropic's Constitutional AI (used in Claude) adds another layer: instead of relying solely on human raters, the model critiques its own responses against a set of principles (a "constitution") and revises them. This is like a student doing self-correction based on a rubric.

DPO (Direct Preference Optimization) is a newer, simpler alternative to RLHF. Instead of training a separate reward model and using PPO, DPO directly optimizes the language model on preference pairs with a simple classification-style loss. It targets the same KL-regularized reward-maximization objective as RLHF, but is much simpler to implement.
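
For a single preference pair, the DPO loss can be written in a few lines. This sketch takes the log-probabilities as given numbers; in practice they are the summed token log-probabilities of each response under the model being trained (the policy) and under a frozen reference copy. The function name and the default beta = 0.1 are illustrative assumptions.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log(sigmoid(beta * (gain on chosen - gain on rejected)))."""
    chosen_gain = policy_chosen_logp - ref_chosen_logp        # how much more the policy likes the winner
    rejected_gain = policy_rejected_logp - ref_rejected_logp  # how much more it likes the loser
    logits = beta * (chosen_gain - rejected_gain)
    return math.log1p(math.exp(-logits))

# Policy identical to the reference: the loss starts at ln 2.
start = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Policy has shifted probability toward the chosen response: the loss drops.
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)

print(f"start:    {start:.4f}")
print(f"improved: {improved:.4f}")
```

No reward model appears anywhere; the reference model's log-probabilities play the role of keeping the policy close to where it started.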


Pre-training       →  SFT             →  RLHF/DPO         →  Deployed Model
(Trillions of        (Thousands of       (Thousands of        (ChatGPT,
 tokens, months)      examples, hours)    comparisons, days)   Claude, Gemini)

Tip

What we're doing in this chapter — fine-tuning DistilGPT-2 on Q&A pairs — is equivalent to Stage 2 (SFT). In a production system, you'd add RLHF or DPO on top. But even SFT alone produces a dramatic improvement!

6.5 Preparing Your Data

Before fine-tuning, we need to transform raw question-answer pairs into tokenized, model-ready training data. Our data lives in a JSONL file (education_qa.jsonl) where each line is a JSON object:

JSON
{"instruction": "What is photosynthesis?", "response": "Photosynthesis is the process by which green plants convert sunlight into food..."}
{"instruction": "Who wrote the Indian national anthem?", "response": "The Indian national anthem 'Jana Gana Mana' was written by Rabindranath Tagore..."}

Our data preparation pipeline has five clear steps. Let's walk through each one.

Step 1: Load Raw Data from JSONL

Python
import os
import json
import random
from pathlib import Path

SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
DATA_DIR = os.path.join(SCRIPT_DIR, 'data')
PROCESSED_DIR = os.path.join(DATA_DIR, 'processed')
DATASET_FILE = os.path.join(DATA_DIR, 'education_qa.jsonl')

def load_jsonl(filepath):
    """Load data from a JSONL file."""
    if not os.path.exists(filepath):
        print(f"Error: Dataset file not found: {filepath}")
        return None

    data = []
    with open(filepath, 'r', encoding='utf-8') as f:
        for line_num, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            try:
                entry = json.loads(line)
                data.append(entry)
            except json.JSONDecodeError as e:
                print(f"  Skipping malformed line {line_num}: {e}")

    print(f"Loaded {len(data)} examples")
    return data

We use JSONL (JSON Lines) instead of a single JSON file because JSONL is streaming-friendly — you can process one line at a time without loading the entire file into memory. This matters when your dataset grows to millions of examples.
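
Note that load_jsonl above still collects every record into a list. To actually take advantage of the line-by-line format, a generator can yield one record at a time. The sketch below is one way to do that; iter_jsonl is an illustrative name, and the in-memory io.StringIO stands in for an open file.

```python
import io
import json

def iter_jsonl(lines):
    """Yield one parsed record per non-empty line, skipping malformed lines."""
    for line_num, line in enumerate(lines, 1):
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError as e:
            print(f"  Skipping malformed line {line_num}: {e}")

# Stand-in for open(DATASET_FILE, encoding='utf-8'); any iterable of lines works.
sample = io.StringIO(
    '{"instruction": "What is 2 + 2?", "response": "4"}\n'
    '\n'
    'not json\n'
    '{"instruction": "What is the capital of India?", "response": "New Delhi"}\n'
)

questions = [record["instruction"] for record in iter_jsonl(sample)]
print(questions)
```

Only one line is held in memory at a time while iterating, so the same function works for a file with millions of lines.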

Step 2: Format into Instruction-Response Pairs

Python
def format_examples(data):
    """Format each example into instruction-response format."""
    formatted = []
    for entry in data:
        instruction = entry.get('instruction', entry.get('question', ''))
        response = entry.get('response', entry.get('answer', entry.get('output', '')))

        if not instruction or not response:
            continue

        text = f"### Question:\n{instruction}\n\n### Answer:\n{response}"
        formatted.append(text)

    print(f"Formatted {len(formatted)} examples")
    return formatted

Notice the format: ### Question:\n...\n\n### Answer:\n.... This is a prompt template — a consistent structure that teaches the model to recognize where questions end and answers begin. The ### markers act as clear delimiters.
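
The same template has to be used at inference time: the prompt stops right after the answer marker so the model continues with the answer. Here is a small sketch of that idea; build_prompt and extract_answer are hypothetical helpers, not necessarily what chat.py does.

```python
ANSWER_MARKER = "### Answer:\n"
QUESTION_MARKER = "### Question:"

def build_prompt(question):
    """Format a question exactly like the training examples, leaving the answer open."""
    return f"{QUESTION_MARKER}\n{question}\n\n{ANSWER_MARKER}"

def extract_answer(generated_text):
    """Keep the text after the first answer marker, cutting off any follow-up question."""
    answer = generated_text.split(ANSWER_MARKER, 1)[-1]
    return answer.split(QUESTION_MARKER, 1)[0].strip()

prompt = build_prompt("What is LoRA?")
# Pretend the model produced this continuation (made-up text for illustration).
fake_output = prompt + "A low-rank adapter.\n\n### Question:\nWhat is next?"
answer = extract_answer(fake_output)
print(answer)
```

Cutting at the next question marker is a simple guard for models that keep generating new Q&A pairs after the answer.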

Note

The .get() calls with fallbacks ('question', 'answer', 'output') make the function robust to different dataset formats. Whether your data uses instruction/response or question/answer keys, it works.

Step 3: Tokenize

Python
def tokenize_texts(texts):
    """Tokenize all formatted texts using DistilGPT-2 tokenizer."""
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained('distilgpt2')

    # GPT-2 doesn't have a pad token — use eos_token instead
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    tokenized_data = []
    for text in texts:
        encoded = tokenizer(
            text,
            padding='max_length',
            truncation=True,
            max_length=256,
            return_tensors=None
        )
        tokenized_data.append({
            'input_ids': encoded['input_ids'],
            'attention_mask': encoded['attention_mask'],
            'text': text
        })

    print(f"Tokenized {len(tokenized_data)} examples (max_length=256)")
    return tokenized_data, tokenizer

Key choices here:

max_length=256: We cap each example at 256 tokens. This is enough for most education Q&A pairs while keeping memory usage manageable. Longer texts get truncated; shorter texts get padded.
padding='max_length': All sequences are padded to exactly 256 tokens so they can be batched together.
attention_mask: A binary mask that tells the model which tokens are real (1) and which are padding (0). The model ignores padding tokens.
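
What the tokenizer does with padding, truncation, and the mask can be mimicked on a plain list of token IDs. This is a toy sketch with max_length=5 instead of 256; pad_or_truncate is an illustrative helper, padding goes on the right, and 50256 is GPT-2's end-of-text token ID, reused as the pad token.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]            # truncation: drop tokens past the limit
    mask = [1] * len(ids)                   # 1 = real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad   # 0 = padding

PAD_ID = 50256  # GPT-2 end-of-text ID, used here as the pad token

short_ids, short_mask = pad_or_truncate([11, 22, 33], max_length=5, pad_id=PAD_ID)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5, pad_id=PAD_ID)

print(short_ids, short_mask)   # padded on the right, mask ends in zeros
print(long_ids, long_mask)     # truncated to 5 tokens, mask is all ones
```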

Step 4: Split into Train and Validation Sets

Python
def split_data(tokenized_data, train_ratio=0.9, seed=42):
    """Split data into train and validation sets."""
    random.seed(seed)
    indices = list(range(len(tokenized_data)))
    random.shuffle(indices)

    split_idx = int(len(indices) * train_ratio)
    train_indices = indices[:split_idx]
    val_indices = indices[split_idx:]

    train_data = [tokenized_data[i] for i in train_indices]
    val_data = [tokenized_data[i] for i in val_indices]

    print(f"Train: {len(train_data)} examples")
    print(f"Val:   {len(val_data)} examples")
    return train_data, val_data

We use a 90/10 split — 90% for training, 10% for validation. The validation set is crucial: it tells us whether the model is actually learning or just memorizing the training data (overfitting). Setting seed=42 ensures reproducibility — you get the same split every time you run the script.

Step 5: Save Processed Data

Python
def save_processed_data(train_data, val_data):
    """Save processed data to JSON files."""
    os.makedirs(PROCESSED_DIR, exist_ok=True)

    train_path = os.path.join(PROCESSED_DIR, 'train.json')
    val_path = os.path.join(PROCESSED_DIR, 'val.json')

    with open(train_path, 'w', encoding='utf-8') as f:
        json.dump(train_data, f, indent=2)

    with open(val_path, 'w', encoding='utf-8') as f:
        json.dump(val_data, f, indent=2)

    print(f"Saved train data: {train_path}")
    print(f"Saved val data:   {val_path}")

Tip

Run this script with: python prepare_data.py. It will create a data/processed/ directory with train.json and val.json, ready for the fine-tuning step.

6.6 Fine-Tuning DistilGPT-2 with LoRA

Now the exciting part — we train the model. This script loads DistilGPT-2, wraps it with LoRA adapters, and trains it on our education data using Hugging Face's Trainer API.

Step 1: Load the Base Model

Python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
)
from peft import LoraConfig, get_peft_model, TaskType
from datasets import Dataset

MODEL_NAME = "distilgpt2"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# GPT-2 doesn't have a pad token by default
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = tokenizer.eos_token_id

AutoModelForCausalLM loads the model in causal language modelling mode — the model predicts the next token given all previous tokens. This is the standard mode for GPT-style text generation.

DistilGPT-2 has about 82 million parameters, 6 Transformer layers, 12 attention heads, and an embedding dimension of 768. It's a distilled (compressed) version of GPT-2 — smaller and faster, perfect for learning.
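The "about 82 million" figure can be reproduced by hand from the architecture. The sketch below assumes the standard GPT-2 configuration values for vocabulary size (50,257) and maximum positions (1,024), and assumes the output head shares its weights with the token embedding; the function name is hypothetical.

```python
def gpt2_parameter_count(n_layer=6, n_embd=768, vocab_size=50257, n_positions=1024):
    """Count GPT-2 style parameters; the LM head is tied to the token embedding."""
    embeddings = vocab_size * n_embd + n_positions * n_embd
    layer_norm = 2 * n_embd                                # weight + bias
    attention = n_embd * 3 * n_embd + 3 * n_embd           # c_attn: fused Q, K, V
    attention += n_embd * n_embd + n_embd                  # attention output projection
    mlp = n_embd * 4 * n_embd + 4 * n_embd                 # expand to 4x width
    mlp += 4 * n_embd * n_embd + n_embd                    # project back down
    per_layer = 2 * layer_norm + attention + mlp
    return embeddings + n_layer * per_layer + layer_norm   # plus the final layer norm

total = gpt2_parameter_count()
print(f"{total:,}")   # 81,912,576
```

Almost half of the total sits in the token embedding table, which is why halving the number of layers relative to GPT-2 does not halve the parameter count.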

Step 2: Apply LoRA

Python
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                        # LoRA rank
    lora_alpha=16,              # Scaling factor
    lora_dropout=0.05,          # Small dropout for regularization
    target_modules=["c_attn"],  # Target attention layers in GPT-2
    bias="none",
)

model = get_peft_model(model, lora_config)

# Check: how many parameters are trainable?
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
all_params = sum(p.numel() for p in model.parameters())
trainable_pct = 100 * trainable_params / all_params

print(f"Trainable parameters: {trainable_params:,} ({trainable_pct:.2f}% of {all_params:,})")

Let's unpack each parameter:

task_type=TaskType.CAUSAL_LM — We're doing causal (autoregressive) language modelling, not classification or sequence-to-sequence.
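r=8 — Rank 8. For c_attn, which maps the 768-dimensional hidden state to a fused 2304-dimensional query/key/value vector, the A and B matrices are 8 \times 768 and 2304 \times 8 respectively. This gives us enough capacity to adapt the model while keeping the adapter tiny.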
lora_alpha=16 — Scaling factor. The LoRA update is scaled by \frac{\alpha}{r} = \frac{16}{8} = 2.
target_modules=["c_attn"] — We only add LoRA to the attention projection layers in GPT-2 (called c_attn). This is where the query, key, and value projections live — the most impactful place to adapt.
lora_dropout=0.05 — A tiny 5% dropout on the LoRA layers to prevent overfitting.
bias="none" — We don't train bias terms, keeping the adapter even smaller.

After applying LoRA, you'll see that only about 0.18% of the model's parameters are trainable (roughly 147,000 out of 82 million). The rest are frozen. That's the power of LoRA.
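As a back-of-the-envelope check on the trainable-parameter count, the sketch below counts the LoRA weights by hand. It assumes that c_attn maps 768 inputs to 3 × 768 outputs (fused query, key, and value), that there is one such layer in each of the 6 blocks, and that the frozen base model has 81,912,576 parameters; the helper name is hypothetical, and the exact figure printed by your installed PEFT version is the one to trust.

```python
def lora_trainable_params(r=8, in_features=768, out_features=3 * 768, n_layer=6):
    """LoRA adds A (r x in) and B (out x r) to each targeted layer."""
    per_layer = r * in_features + out_features * r
    return n_layer * per_layer

BASE_PARAMS = 81_912_576            # assumed DistilGPT-2 parameter count (frozen)

trainable = lora_trainable_params()
pct = 100 * trainable / (BASE_PARAMS + trainable)
scaling = 16 / 8                    # lora_alpha / r

print(f"{trainable:,} trainable ({pct:.2f}%), scaling = {scaling}")
# 147,456 trainable (0.18%), scaling = 2.0
```

At 4 bytes per FP32 value, 147,456 weights come to roughly 576 KB, which is why the saved adapter file is so small.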

Step 3: Load and Prepare the Dataset

Python
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
DATASET_FILE = os.path.join(SCRIPT_DIR, "data", "education_qa.jsonl")

# Load from JSONL and format
raw_data = load_dataset_from_jsonl(DATASET_FILE)

# Create a Hugging Face Dataset
dataset = Dataset.from_list(raw_data)

# Tokenize
def tokenize_function(examples):
    tokenized = tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",
        max_length=256,
    )
    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized

tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=["text"],
    desc="Tokenizing",
)

# 90/10 train/validation split
split = tokenized_dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = split["train"]
val_dataset = split["test"]

Important

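Notice tokenized["labels"] = tokenized["input_ids"].copy(). In causal language modelling, the labels are the same token IDs as the inputs: the prediction made at position i is scored against the token at position i+1, so the model learns to predict each token given all preceding tokens. You don't shift anything yourself; the model's forward pass shifts the labels by one position when it computes the loss. The data collator used in Step 4, DataCollatorForLanguageModeling with mlm=False, also rebuilds the labels from input_ids for each batch and sets padding positions to -100 so that they are ignored by the loss.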
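The shifting can be made concrete with a few made-up token IDs. This is a simplified sketch, not the library's code: next_token_pairs is a hypothetical helper, and it masks padding using the attention mask, whereas the real collator masks positions whose ID equals the pad token.

```python
IGNORE_INDEX = -100   # label value that PyTorch's cross-entropy loss skips by default

def next_token_pairs(input_ids, attention_mask):
    """Pair each position's input token with the label it is scored against."""
    labels = [tok if keep == 1 else IGNORE_INDEX
              for tok, keep in zip(input_ids, attention_mask)]
    # Predictions at positions 0..n-2 are compared with labels at positions 1..n-1
    return list(zip(input_ids[:-1], labels[1:]))

# Made-up token IDs: four real tokens followed by two padding tokens
input_ids      = [21, 22, 23, 24, 0, 0]
attention_mask = [1,  1,  1,  1,  0, 0]

pairs = next_token_pairs(input_ids, attention_mask)
print(pairs)
# [(21, 22), (22, 23), (23, 24), (24, -100), (0, -100)]
```

Only the first three pairs contribute to the loss; the pairs whose label is -100 are ignored.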

Step 4: Configure and Launch Training

Python
OUTPUT_DIR = os.path.join(SCRIPT_DIR, "output", "edu-chatbot-lora")

training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    learning_rate=2e-4,
    warmup_steps=50,
    weight_decay=0.01,
    logging_steps=10,
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    fp16=torch.cuda.is_available(),
    report_to="none",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,  # Causal LM, not masked LM
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
)

# Train!
train_result = trainer.train()

Key training decisions explained:

3 epochs — The model sees each training example 3 times. More epochs risk overfitting on a small dataset.
Batch size 4 — Process 4 examples at once. Small enough for CPU/low-memory GPU, large enough for stable gradients.
Learning rate 2e-4 — Higher than typical full fine-tuning LRs (1e-5 to 5e-5) because we're only training the small LoRA matrices, which start near zero and usually need larger updates.
Warmup 50 steps — The learning rate starts at 0 and linearly increases to 2e-4 over 50 steps. This prevents early instability.
fp16=torch.cuda.is_available() — Uses half-precision (16-bit floats) on GPU, which often gives up to roughly a 2x speedup and lowers memory use. Falls back to FP32 on CPU.
load_best_model_at_end=True — After training, automatically loads the checkpoint with the lowest validation loss. This prevents using an overfit model.
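To get a feel for how the epochs, batch size, and warmup interact, the sketch below computes the number of optimizer steps and the learning rate at a few points. It assumes a single device, no gradient accumulation, and the Trainer's default linear schedule (linear warmup, then linear decay to zero). The dataset size of 500 examples is purely hypothetical, as is the helper name.

```python
import math

def linear_schedule_lr(step, total_steps, peak_lr=2e-4, warmup_steps=50):
    """Linear warmup from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

# Hypothetical dataset: 500 examples, so 450 for training after the 90/10 split
n_train, batch_size, epochs = 450, 4, 3
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)   # 113 339
for step in (0, 25, 50, total_steps):
    print(step, linear_schedule_lr(step, total_steps))
```

The learning rate is zero at step 0, half the peak at step 25, at its peak at step 50, and back to zero at the final step. With a much smaller dataset, 50 warmup steps can cover most of the run, in which case a smaller warmup_steps value may be worth trying.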

Step 5: Save the LoRA Adapter

Python
# Save LoRA adapter and tokenizer
model.save_pretrained(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)

Note

The saved adapter is tiny — typically just a few hundred kilobytes. The base DistilGPT-2 model is ~350 MB. Your LoRA adapter adds less than 1 MB on top. This is like saving just the sticky notes, not the entire textbook.

6.7 Chatting with Your Fine-Tuned Model

Now let's build an interactive chat interface to talk to our creation!

Loading the Fine-Tuned Model

Python
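import os

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Resolve paths relative to this script, as in finetune.py
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
MODEL_DIR = os.path.join(SCRIPT_DIR, "output", "edu-chatbot-lora")
BASE_MODEL = "distilgpt2"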

def load_model(model_path, base_model_name):
    """Load the fine-tuned model (base + LoRA adapter)."""
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    base_model = AutoModelForCausalLM.from_pretrained(base_model_name)

    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Apply the LoRA adapter on top of the base model
    model = PeftModel.from_pretrained(base_model, model_path)
    model.eval()

    return model, tokenizer, base_model

Notice the two-step loading process:

Load the base model (distilgpt2) — the original, unmodified weights.
Apply the LoRA adapter on top using PeftModel.from_pretrained().

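We also return base_model so we can compare outputs later. One caveat: PeftModel.from_pretrained() injects the LoRA layers into the base model's own modules rather than copying the model, so this reference is not an independent, untouched copy. For a true baseline, either switch the adapter off temporarily with the PeftModel.disable_adapter() context manager or load a second, separate copy of distilgpt2.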

Generating Responses

Python
def generate_response(model, tokenizer, prompt, temperature=0.7,
                      max_new_tokens=200, top_k=50, top_p=0.9):
    """Generate a response from the model."""
    # Format using the same template as training
    formatted_prompt = f"### Question:\n{prompt}\n\n### Answer:\n"

    inputs = tokenizer(formatted_prompt, return_tensors="pt",
                       truncation=True, max_length=256)
    input_ids = inputs["input_ids"]
    attention_mask = inputs["attention_mask"]

    with torch.no_grad():
        outputs = model.generate(
            input_ids,
            attention_mask=attention_mask,
            max_new_tokens=max_new_tokens,
            temperature=max(temperature, 0.01),
            top_k=top_k,
            top_p=top_p,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            repetition_penalty=1.2,
            no_repeat_ngram_size=3,
        )

    # Decode only the newly generated tokens
    generated_ids = outputs[0][input_ids.shape[1]:]
    response = tokenizer.decode(generated_ids, skip_special_tokens=True).strip()

    # Stop at prompt markers to prevent the model from generating new Q&A pairs
    for stop_marker in ["### Question:", "### Answer:", "\n\n\n"]:
        if stop_marker in response:
            response = response[:response.index(stop_marker)].strip()

    return response

Let's understand the generation parameters:

temperature=0.7 — Controls randomness. Lower values (0.1) make the model nearly deterministic and focused. Higher values (1.5) make it creative but potentially incoherent.
top_k=50 — At each step, only consider the top 50 most likely next tokens.
top_p=0.9 — Nucleus sampling: consider the smallest set of tokens whose cumulative probability exceeds 90%.
repetition_penalty=1.2 — Penalize tokens that have already appeared, reducing repetitive output.
no_repeat_ngram_size=3 — Never repeat the same sequence of 3 tokens (tokens, not words) anywhere in the output. Prevents loops where a phrase such as "is very important" keeps coming back.
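The interaction between temperature, top_k, and top_p is easier to see on a toy distribution. The sketch below is a simplified pure-Python illustration, not the library's implementation; the function names and the five made-up logits are assumptions for the example.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                                  # subtract max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_top_p_filter(probs, top_k=50, top_p=0.9):
    """Keep the top_k most likely tokens, then the smallest prefix reaching top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_total = sum(probs[i] for i in ranked)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / k_total                # renormalized after top-k
        if cumulative >= top_p:
            break
    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}

def sample_token(logits, temperature=0.7, top_k=50, top_p=0.9, rng=random):
    """Sample one token index using temperature, top-k, and top-p."""
    candidates = top_k_top_p_filter(softmax_with_temperature(logits, temperature), top_k, top_p)
    indices = list(candidates)
    return rng.choices(indices, weights=[candidates[i] for i in indices], k=1)[0]

toy_logits = [3.0, 2.0, 1.0, 0.0, -1.0]                 # made-up scores for 5 tokens
for t in (0.1, 0.7, 1.5):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs], sorted(top_k_top_p_filter(probs, top_k=3)))
```

At temperature 0.1 nearly all the probability lands on the first token, so the filter keeps only that one; at 1.5 the distribution is flatter and more candidates survive.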

Warning

The formatted_prompt must use the exact same template as the training data (### Question:\n...\n\n### Answer:\n). If you change this format, the model won't recognize the pattern and will produce poor output. Consistency between training and inference is critical.
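One way to keep the two sides in sync is to define the template once and build both the training text and the inference prompt from it. This is a sketch of that idea rather than how the chapter's scripts are organized; the constant and function names are hypothetical.

```python
PROMPT_TEMPLATE = "### Question:\n{question}\n\n### Answer:\n"

def build_training_text(question, answer):
    """Full example used for fine-tuning: prompt followed by the answer."""
    return PROMPT_TEMPLATE.format(question=question) + answer

def build_inference_prompt(question):
    """Prompt used at chat time: identical prefix, answer left for the model."""
    return PROMPT_TEMPLATE.format(question=question)

question = "What is photosynthesis?"
training_text = build_training_text(question, "Photosynthesis is how plants make food.")
prompt = build_inference_prompt(question)

print(training_text.startswith(prompt))   # True
```

If both finetune.py and chat.py imported such a helper from a shared module, changing the template in one place would update training and inference together.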

The Comparison Function

Python
def compare_models(base_model, finetuned_model, tokenizer, prompt, temperature=0.7):
    """Compare responses from base model vs fine-tuned model."""
    base_response = generate_response(base_model, tokenizer, prompt, temperature)
    ft_response = generate_response(finetuned_model, tokenizer, prompt, temperature)

    print(f"Base Model (no fine-tuning):\n{base_response}\n")
    print(f"Fine-Tuned Model:\n{ft_response}\n")

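This function sends the same prompt to both the base model and the fine-tuned model and prints the two responses one after the other. Because generate_response samples with do_sample=True, the outputs vary from run to run, so try each prompt a few times before drawing conclusions.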
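Because PEFT attaches the LoRA layers to the base model's own modules, a dependable way to get a baseline is to use a single wrapped model and switch the adapter off temporarily. The sketch below relies on the disable_adapter() context manager that PEFT's PeftModel provides. The function name is hypothetical, generate_fn stands for a function like generate_response, and the stand-in classes exist only so the example runs without downloading a model.

```python
from contextlib import contextmanager

def compare_with_adapter_toggle(peft_model, tokenizer, prompt, generate_fn):
    """Return (base_response, finetuned_response) using one PEFT-wrapped model."""
    with peft_model.disable_adapter():          # LoRA layers are bypassed inside this block
        base_response = generate_fn(peft_model, tokenizer, prompt)
    finetuned_response = generate_fn(peft_model, tokenizer, prompt)
    return base_response, finetuned_response

# --- Stand-in objects so the sketch runs without a real model ---
class FakePeftModel:
    def __init__(self):
        self.adapter_on = True

    @contextmanager
    def disable_adapter(self):
        self.adapter_on = False
        try:
            yield
        finally:
            self.adapter_on = True

def fake_generate(model, tokenizer, prompt):
    return "fine-tuned answer" if model.adapter_on else "base answer"

result = compare_with_adapter_toggle(FakePeftModel(), None, "What is photosynthesis?", fake_generate)
print(result)
# ('base answer', 'fine-tuned answer')
```

With the real objects, you would pass the PeftModel returned by load_model, the tokenizer, and generate_response as generate_fn.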

6.8 Comparing Base vs Fine-Tuned

Here's what you'll typically see when you run the comparison:

Prompt: "What is photosynthesis?"

	Base DistilGPT-2	Fine-Tuned DistilGPT-2
Response	"The main purpose of this post is to explain how a new generation..." (random, off-topic text)	"Photosynthesis is the process by which green plants use sunlight, water, and carbon dioxide to make their own food (glucose) and release oxygen..."
Quality	❌ Incoherent, random	✅ Clear, educational
Format	No structure	Follows Q&A format
Relevance	0% relevant	Highly relevant

Prompt: "Give me tips to prepare for board exams."

	Base DistilGPT-2	Fine-Tuned DistilGPT-2
Response	"I'm not sure what you're talking about but the first thing I'd say is..."	"Here are some effective tips: 1) Make a study timetable and stick to it. 2) Focus on NCERT textbooks first. 3) Practice previous years' question papers..."

Tip

Try the compare command in chat.py to see this in action with your own questions. It's the most convincing demonstration of why fine-tuning works!

The base model is like a Class 12 topper who answers every question with random Wikipedia trivia. The fine-tuned model is like a dedicated tuition teacher who understands exactly what you asked and gives a clear, structured answer.

💭 6.9 Discussion: Ethics of Fine-Tuning

Fine-tuning is powerful — and with power comes responsibility. Let's discuss the ethical dimensions, especially as they apply to India.

Bias Amplification

Every dataset carries the biases of its creators. If your education Q&A data is written from a particular perspective — say, a North Indian, upper-caste, English-medium viewpoint — the fine-tuned model will reflect those biases. It might:

Give examples only from CBSE, ignoring state board syllabi

Use English explanations that aren't accessible to Hindi-medium or regional-medium students

Present history from a single perspective, ignoring diverse regional narratives

Mitigation: Actively include diverse examples. Have reviewers from different states, languages, and backgrounds evaluate your training data.

Misinformation

A model fine-tuned on incorrect or outdated information will confidently produce wrong answers. In education, this is particularly dangerous — imagine a student trusting an AI that says "India became independent in 1948" or gives wrong formulas for Physics.

Mitigation: Rigorously fact-check your training data. Include citations where possible. Add disclaimers that the AI can make mistakes.

Language and Accessibility

India has 22 officially recognized languages and hundreds more spoken across the country. An AI trained only on English education data excludes the vast majority of Indian students. A student in Madurai studying in Tamil medium, or a student in Assam studying in Assamese medium, deserves the same quality of AI assistance.

Mitigation: Build multilingual datasets. Fine-tune models that support Hindi, Tamil, Telugu, Bengali, Marathi, and other Indian languages. Organizations like AI4Bharat are doing pioneering work in this space.

The Digital Divide

Fine-tuning requires computing resources, technical knowledge, and data — all of which are unevenly distributed. There's a risk that AI-powered education tools benefit urban, English-speaking, well-connected students while leaving rural India behind.

Mitigation: Design tools that work offline, on low-end devices. Partner with government schools and NGOs. Make your models and data open-source so others can build on them.

Privacy

Education datasets might contain student questions that reveal personal information — learning difficulties, family situations, or mental health struggles. Fine-tuning on such data without consent is a serious privacy violation.

Mitigation: Anonymize all data. Get informed consent. Follow India's Digital Personal Data Protection Act (DPDPA) guidelines.

Caution

Never deploy a fine-tuned education AI without human oversight. AI should assist teachers, not replace them. A wrong answer from a textbook can be corrected in the next edition; a wrong answer from an AI can be given to thousands of students simultaneously before anyone notices.

6.10 Key Concepts Summary

Concept	Definition
Pre-training	Training a model from scratch on massive text data to learn general language understanding. Extremely expensive.
Fine-tuning	Adapting a pre-trained model to a specific task using a small, curated dataset. Cheap and fast.
LoRA (Low-Rank Adaptation)	A parameter-efficient fine-tuning method that decomposes weight updates into two small matrices (`B \times A`), reducing trainable parameters by 98%+.
Rank (`r`)	The inner dimension of LoRA matrices. Controls adaptation capacity. Typical values: 4, 8, 16.
Alpha (`\alpha`)	LoRA scaling factor. The update is scaled by `\frac{\alpha}{r}`. Common choice: `\alpha = 2r`.
SFT (Supervised Fine-Tuning)	Fine-tuning on human-written (instruction, response) pairs.
RLHF	Reinforcement Learning from Human Feedback. Uses human preference rankings to train a reward model, then optimizes the LM with PPO.
DPO	Direct Preference Optimization. A simpler alternative to RLHF that optimizes directly from preference data without a separate reward model.
Prompt Template	A consistent format (e.g., `### Question:\n...\n\n### Answer:\n`) used during both training and inference.
Temperature	A generation parameter controlling randomness. Low = focused; high = creative.
Attention Mask	A binary mask indicating real tokens (1) vs. padding (0).
Data Collator	A utility that dynamically batches and pads tokenized examples for training.

📝 6.11 Exercises

Exercise 1: Experiment with LoRA Hyperparameters 🔬

Modify finetune.py to try different LoRA configurations:

Change the rank from 8 to 4 and to 16. How does the validation loss change? Does higher rank always mean better performance?

Change lora_alpha to 8 (same as rank) and to 32 (4× rank). How does this affect training stability?

Add "c_proj" to target_modules alongside "c_attn". Does targeting more layers improve results?

Exercise 2: Build a Hindi Q&A Dataset 🇮🇳

Create a JSONL file with at least 50 question-answer pairs in Hindi (or your regional language). Topics can include: Indian history, geography, civics, or any school subject. Run the full pipeline (prepare_data.py → finetune.py → chat.py) on your dataset and evaluate the results. Does the model respond in Hindi?

Exercise 3: Temperature Exploration 🌡️

Using chat.py, ask the same question ("Explain Newton's third law") at five different temperatures: 0.1, 0.3, 0.7, 1.0, and 1.5. Copy the responses and analyze:

At what temperature does the response become incoherent?

Which temperature produces the most "textbook-like" answer?

Which temperature produces the most "creative" answer?

Exercise 4: Measure Overfitting 📊

Modify finetune.py to train for 10 epochs instead of 3. Plot the training loss and validation loss for each epoch. At what epoch does the validation loss start increasing while training loss keeps decreasing? This is the overfitting point. What strategies could you use to prevent it?

Exercise 5: Compare Prompt Templates 📝

The current prompt template uses ### Question: and ### Answer:. Try these alternatives and compare output quality:

Q: ... A: ...

Student: ... Teacher: ...

<question> ... </question> <answer> ... </answer>

Remember: you must use the same template in both finetune.py and chat.py.

Exercise 6: Ethical Audit 🔍

Take your fine-tuned model and ask it 10 questions about Indian history from different regional perspectives (e.g., the independence movement from a South Indian perspective, tribal history, Northeast Indian history). Document:

Which questions does the model answer well?

Where does it show bias or gaps?

How would you improve the training data to address these gaps?

Exercise 7: Adapter Arithmetic ➕

You created one LoRA adapter for education. Now create a second adapter for a different domain (e.g., cooking recipes, cricket commentary, or Bollywood trivia). Can you load them separately on the same base model? How quickly can you switch between "teacher mode" and "cricket commentator mode"?

In the next chapter, we'll bring everything together — building a complete, deployable chatbot with a web interface that your friends, students, and colleagues can actually use. The journey from theory to product begins!

Complete Source Code - Chapter 6

Below are the complete, runnable source files for this chapter. Every line is included.

Complete Code: prepare_data.py

Python
#!/usr/bin/env python3
"""Level 5: Prepare Education Q&A Dataset for Fine-Tuning

This script loads raw education Q&A data from JSONL format,
formats it into instruction-response pairs, tokenizes using
the DistilGPT-2 tokenizer, and splits into train/validation sets.
"""

import os
import json
import random
from pathlib import Path

# ANSI Colors
GREEN = '\033[92m'
CYAN = '\033[96m'
YELLOW = '\033[93m'
MAGENTA = '\033[95m'
BOLD = '\033[1m'
RESET = '\033[0m'
BLUE = '\033[94m'
RED = '\033[91m'
DIM = '\033[2m'

# Get script directory
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
DATA_DIR = os.path.join(SCRIPT_DIR, 'data')
PROCESSED_DIR = os.path.join(DATA_DIR, 'processed')
DATASET_FILE = os.path.join(DATA_DIR, 'education_qa.jsonl')


def print_banner():
    """Print the data preparation banner."""
    banner = f"""
{CYAN}{BOLD}╔══════════════════════════════════════════════════════════════╗
║                                                              ║
║   📦  Level 5: Data Preparation Pipeline                     ║
║                                                              ║
║   Transforming raw Q&A data into tokenized training data     ║
║                                                              ║
╚══════════════════════════════════════════════════════════════╝{RESET}
"""
    print(banner)


def load_jsonl(filepath):
    """Load data from a JSONL file."""
    print(f"{BLUE}{BOLD}[Step 1/5]{RESET} Loading dataset from {DIM}{filepath}{RESET}")
    
    if not os.path.exists(filepath):
        print(f"{RED}{BOLD}✗ Error:{RESET} Dataset file not found: {filepath}")
        print(f"{YELLOW}  → Make sure 'education_qa.jsonl' exists in the data/ directory{RESET}")
        return None
    
    data = []
    with open(filepath, 'r', encoding='utf-8') as f:
        for line_num, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            try:
                entry = json.loads(line)
                data.append(entry)
            except json.JSONDecodeError as e:
                print(f"{YELLOW}  ⚠ Skipping malformed line {line_num}: {e}{RESET}")
    
    print(f"{GREEN}  ✓ Loaded {BOLD}{len(data)}{RESET}{GREEN} examples{RESET}")
    return data


def format_examples(data):
    """Format each example into instruction-response format."""
    print(f"\n{BLUE}{BOLD}[Step 2/5]{RESET} Formatting examples into Q&A pairs")
    
    formatted = []
    for entry in data:
        instruction = entry.get('instruction', entry.get('question', ''))
        response = entry.get('response', entry.get('answer', entry.get('output', '')))
        
        if not instruction or not response:
            continue
        
        text = f"### Question:\n{instruction}\n\n### Answer:\n{response}"
        formatted.append(text)
    
    print(f"{GREEN}  ✓ Formatted {BOLD}{len(formatted)}{RESET}{GREEN} examples{RESET}")
    return formatted


def tokenize_texts(texts):
    """Tokenize all formatted texts using DistilGPT-2 tokenizer."""
    print(f"\n{BLUE}{BOLD}[Step 3/5]{RESET} Loading tokenizer and tokenizing texts")
    
    try:
        from transformers import AutoTokenizer
    except ImportError:
        print(f"{RED}{BOLD}✗ Error:{RESET} 'transformers' library not installed.")
        print(f"{YELLOW}  → Run: pip install transformers{RESET}")
        return None, None
    
    tokenizer = AutoTokenizer.from_pretrained('distilgpt2')
    
    # Set pad token to eos token (GPT-2 doesn't have a pad token by default)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    
    print(f"{DIM}  Using tokenizer: distilgpt2 (vocab size: {tokenizer.vocab_size}){RESET}")
    
    tokenized_data = []
    for text in texts:
        encoded = tokenizer(
            text,
            padding='max_length',
            truncation=True,
            max_length=256,
            return_tensors=None
        )
        tokenized_data.append({
            'input_ids': encoded['input_ids'],
            'attention_mask': encoded['attention_mask'],
            'text': text
        })
    
    print(f"{GREEN}  ✓ Tokenized {BOLD}{len(tokenized_data)}{RESET}{GREEN} examples (max_length=256){RESET}")
    return tokenized_data, tokenizer


def split_data(tokenized_data, train_ratio=0.9, seed=42):
    """Split data into train and validation sets."""
    print(f"\n{BLUE}{BOLD}[Step 4/5]{RESET} Splitting data (90/10 train/val, seed={seed})")
    
    random.seed(seed)
    indices = list(range(len(tokenized_data)))
    random.shuffle(indices)
    
    split_idx = int(len(indices) * train_ratio)
    train_indices = indices[:split_idx]
    val_indices = indices[split_idx:]
    
    train_data = [tokenized_data[i] for i in train_indices]
    val_data = [tokenized_data[i] for i in val_indices]
    
    print(f"{GREEN}  ✓ Train: {BOLD}{len(train_data)}{RESET}{GREEN} examples{RESET}")
    print(f"{GREEN}  ✓ Val:   {BOLD}{len(val_data)}{RESET}{GREEN} examples{RESET}")
    
    return train_data, val_data


def print_statistics(tokenized_data, train_data, val_data, tokenizer):
    """Print colorful dataset statistics."""
    print(f"\n{MAGENTA}{BOLD}{'═' * 50}")
    print(f"  📊  Dataset Statistics")
    print(f"{'═' * 50}{RESET}\n")
    
    # Calculate token lengths (non-padding tokens)
    token_lengths = []
    for entry in tokenized_data:
        non_pad = sum(entry['attention_mask'])
        token_lengths.append(non_pad)
    
    avg_len = sum(token_lengths) / len(token_lengths) if token_lengths else 0
    max_len = max(token_lengths) if token_lengths else 0
    min_len = min(token_lengths) if token_lengths else 0
    
    stats = [
        ("Total examples", f"{len(tokenized_data)}", CYAN),
        ("Train split", f"{len(train_data)}", GREEN),
        ("Validation split", f"{len(val_data)}", GREEN),
        ("Avg token length", f"{avg_len:.1f}", YELLOW),
        ("Max token length", f"{max_len}", YELLOW),
        ("Min token length", f"{min_len}", YELLOW),
        ("Vocabulary size", f"{tokenizer.vocab_size:,}", MAGENTA),
    ]
    
    for label, value, color in stats:
        print(f"  {color}{BOLD}{'•':>3} {label:<22}{RESET} {color}{value}{RESET}")
    
    print(f"\n{MAGENTA}{BOLD}{'═' * 50}{RESET}")


def save_processed_data(train_data, val_data):
    """Save processed data to JSON files."""
    print(f"\n{BLUE}{BOLD}[Step 5/5]{RESET} Saving processed data")
    
    os.makedirs(PROCESSED_DIR, exist_ok=True)
    
    train_path = os.path.join(PROCESSED_DIR, 'train.json')
    val_path = os.path.join(PROCESSED_DIR, 'val.json')
    
    with open(train_path, 'w', encoding='utf-8') as f:
        json.dump(train_data, f, indent=2)
    
    with open(val_path, 'w', encoding='utf-8') as f:
        json.dump(val_data, f, indent=2)
    
    # Calculate file sizes
    train_size = os.path.getsize(train_path) / (1024 * 1024)
    val_size = os.path.getsize(val_path) / (1024 * 1024)
    
    print(f"{GREEN}  ✓ Saved train data: {DIM}{train_path}{RESET} ({train_size:.2f} MB)")
    print(f"{GREEN}  ✓ Saved val data:   {DIM}{val_path}{RESET} ({val_size:.2f} MB)")


def main():
    """Main data preparation pipeline."""
    print_banner()
    
    # Step 1: Load raw data
    data = load_jsonl(DATASET_FILE)
    if data is None:
        return
    
    if len(data) == 0:
        print(f"{RED}{BOLD}✗ Error:{RESET} No valid examples found in the dataset.")
        return
    
    # Step 2: Format examples
    formatted_texts = format_examples(data)
    if not formatted_texts:
        print(f"{RED}{BOLD}✗ Error:{RESET} No examples could be formatted.")
        return
    
    # Step 3: Tokenize
    tokenized_data, tokenizer = tokenize_texts(formatted_texts)
    if tokenized_data is None:
        return
    
    # Step 4: Split data
    train_data, val_data = split_data(tokenized_data)
    
    # Print statistics
    print_statistics(tokenized_data, train_data, val_data, tokenizer)
    
    # Step 5: Save
    save_processed_data(train_data, val_data)
    
    # Final success message
    print(f"\n{GREEN}{BOLD}{'━' * 50}")
    print(f"  ✅  Data preparation complete!")
    print(f"  → Next step: Run finetune.py to train the model")
    print(f"{'━' * 50}{RESET}\n")


if __name__ == '__main__':
    main()

Complete Code: finetune.py

Python
#!/usr/bin/env python3
"""
🎓 Fine-Tune DistilGPT-2 with LoRA
====================================
Fine-tunes a pre-trained DistilGPT-2 model on the education Q&A dataset
using LoRA (Low-Rank Adaptation) for parameter-efficient training.

Part of: 🧠 Build Your Own AI — From Zero to ChatBot (Level 5)
"""

import os
import sys
import json
import math

# ─── ANSI Color Codes ──────────────────────────────────────────
CYAN    = "\033[96m"
GREEN   = "\033[92m"
YELLOW  = "\033[93m"
RED     = "\033[91m"
MAGENTA = "\033[95m"
BLUE    = "\033[94m"
BOLD    = "\033[1m"
DIM     = "\033[2m"
RESET   = "\033[0m"

# ─── Paths (relative to this script) ───────────────────────────
SCRIPT_DIR   = os.path.dirname(os.path.abspath(__file__))
DATA_DIR     = os.path.join(SCRIPT_DIR, "data")
DATASET_FILE = os.path.join(DATA_DIR, "education_qa.jsonl")
OUTPUT_DIR   = os.path.join(SCRIPT_DIR, "output", "edu-chatbot-lora")


def print_banner():
    """Print the fine-tuning banner."""
    banner = f"""
{CYAN}{BOLD}╔══════════════════════════════════════════════════════════════╗
║                                                              ║
║   🎓  Level 5: Fine-Tune DistilGPT-2 with LoRA              ║
║                                                              ║
║   Training a real AI model on education Q&A data             ║
║   Using parameter-efficient LoRA adaptation                  ║
║                                                              ║
╚══════════════════════════════════════════════════════════════╝{RESET}
"""
    print(banner)


def check_dependencies():
    """Check if all required libraries are installed."""
    print(f"{BLUE}{BOLD}[Pre-Check]{RESET} Verifying dependencies...\n")
    
    required = {
        "torch": "PyTorch",
        "transformers": "Hugging Face Transformers",
        "peft": "PEFT (LoRA)",
        "datasets": "Hugging Face Datasets",
    }
    
    all_ok = True
    for module, name in required.items():
        try:
            __import__(module)
            print(f"  {GREEN}✓{RESET} {name} ({module})")
        except ImportError:
            print(f"  {RED}✗{RESET} {name} ({module}) — {RED}not installed{RESET}")
            all_ok = False
    
    if not all_ok:
        print(f"\n{RED}{BOLD}✗ Missing dependencies!{RESET}")
        print(f"  {YELLOW}Run: pip install -r requirements.txt{RESET}")
        return False
    
    print(f"\n  {GREEN}{BOLD}✅ All dependencies satisfied!{RESET}\n")
    return True


def load_dataset_from_jsonl(filepath):
    """Load training data from JSONL file."""
    if not os.path.exists(filepath):
        print(f"{RED}{BOLD}✗ Error:{RESET} Dataset not found: {filepath}")
        print(f"  {YELLOW}Make sure education_qa.jsonl exists in the data/ directory.{RESET}")
        return None
    
    data = []
    with open(filepath, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                try:
                    entry = json.loads(line)
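                    # Accept the same key names as format_examples() in prepare_data.py
                    instruction = entry.get("instruction", entry.get("question", ""))
                    response = entry.get("response", entry.get("answer", entry.get("output", "")))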
                    if instruction and response:
                        text = f"### Question:\n{instruction}\n\n### Answer:\n{response}"
                        data.append({"text": text})
                except json.JSONDecodeError:
                    continue
    
    return data


def format_number(n):
    """Format a number with commas for readability."""
    return f"{n:,}"


def main():
    """Main fine-tuning pipeline."""
    print_banner()

    # ─── Check Dependencies ─────────────────────────────────────
    if not check_dependencies():
        sys.exit(1)

    import torch
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        TrainingArguments,
        Trainer,
        DataCollatorForLanguageModeling,
    )
    from peft import LoraConfig, get_peft_model, TaskType
    from datasets import Dataset

    # ─── Step 1: Load Base Model ────────────────────────────────
    MODEL_NAME = "distilgpt2"
    
    print(f"{BLUE}{BOLD}[Step 1/5]{RESET} Loading base model: {CYAN}{MODEL_NAME}{RESET}")
    print(f"  {DIM}(Downloading from Hugging Face Hub if not cached...){RESET}\n")

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

    # Set pad token
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
        model.config.pad_token_id = tokenizer.eos_token_id

    # ─── Print Model Info ───────────────────────────────────────
    total_params = sum(p.numel() for p in model.parameters())
    
    print(f"{MAGENTA}{BOLD}{'═' * 55}")
    print(f"  📋  Model Information")
    print(f"{'═' * 55}{RESET}\n")
    print(f"  {CYAN}{'•':>3} Model Name          {RESET} {MODEL_NAME}")
    print(f"  {CYAN}{'•':>3} Architecture         {RESET} GPT-2 (Decoder-only Transformer)")
    print(f"  {CYAN}{'•':>3} Total Parameters     {RESET} {format_number(total_params)}")
    print(f"  {CYAN}{'•':>3} Model Size           {RESET} ~{total_params * 4 / (1024**2):.1f} MB (FP32)")
    print(f"  {CYAN}{'•':>3} Layers               {RESET} {model.config.n_layer}")
    print(f"  {CYAN}{'•':>3} Attention Heads      {RESET} {model.config.n_head}")
    print(f"  {CYAN}{'•':>3} Embedding Dim        {RESET} {model.config.n_embd}")
    print(f"  {CYAN}{'•':>3} Vocabulary Size      {RESET} {format_number(model.config.vocab_size)}")
    print(f"\n{MAGENTA}{BOLD}{'═' * 55}{RESET}\n")

    # ─── Step 2: Apply LoRA ─────────────────────────────────────
    print(f"{BLUE}{BOLD}[Step 2/5]{RESET} Applying LoRA (Low-Rank Adaptation)\n")

    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=8,                        # LoRA rank
        lora_alpha=16,              # LoRA alpha (scaling factor)
        lora_dropout=0.05,          # Small dropout for regularization
        target_modules=["c_attn"],  # Target attention layers in GPT-2
        bias="none",
    )

    model = get_peft_model(model, lora_config)

    # Print LoRA info
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    all_params = sum(p.numel() for p in model.parameters())
    trainable_pct = 100 * trainable_params / all_params

    print(f"  {GREEN}LoRA Configuration:{RESET}")
    print(f"  {DIM}├─{RESET} Rank (r):           {YELLOW}8{RESET}")
    print(f"  {DIM}├─{RESET} Alpha:              {YELLOW}16{RESET}")
    print(f"  {DIM}├─{RESET} Target Modules:     {YELLOW}c_attn (attention layers){RESET}")
    print(f"  {DIM}├─{RESET} Dropout:            {YELLOW}0.05{RESET}")
    print(f"  {DIM}└─{RESET} Bias:               {YELLOW}none{RESET}")
    print()
    print(f"  {GREEN}{BOLD}Trainable parameters: {YELLOW}{format_number(trainable_params)}{GREEN} ({YELLOW}{trainable_pct:.2f}%{GREEN} of total {format_number(all_params)}){RESET}")
    print(f"  {DIM}  → Training only {trainable_pct:.2f}% of the model — like adding sticky notes to a textbook! 📝{RESET}\n")

    # ─── Step 3: Load and Prepare Data ──────────────────────────
    print(f"{BLUE}{BOLD}[Step 3/5]{RESET} Loading and preparing training data\n")

    raw_data = load_dataset_from_jsonl(DATASET_FILE)
    if raw_data is None:
        return

    print(f"  {GREEN}✓ Loaded {BOLD}{len(raw_data)}{RESET}{GREEN} training examples{RESET}")

    # Create Hugging Face Dataset
    dataset = Dataset.from_list(raw_data)

    # Tokenize
    def tokenize_function(examples):
        tokenized = tokenizer(
            examples["text"],
            truncation=True,
            padding="max_length",
            max_length=256,
        )
        tokenized["labels"] = tokenized["input_ids"].copy()
        return tokenized

    tokenized_dataset = dataset.map(
        tokenize_function,
        batched=True,
        remove_columns=["text"],
        desc="Tokenizing",
    )

    # Split into train/val
    split = tokenized_dataset.train_test_split(test_size=0.1, seed=42)
    train_dataset = split["train"]
    val_dataset = split["test"]

    print(f"  {GREEN}✓ Training examples:   {BOLD}{len(train_dataset)}{RESET}")
    print(f"  {GREEN}✓ Validation examples: {BOLD}{len(val_dataset)}{RESET}\n")

    # ─── Step 4: Training ───────────────────────────────────────
    print(f"{BLUE}{BOLD}[Step 4/5]{RESET} Starting LoRA fine-tuning\n")

    training_args = TrainingArguments(
        output_dir=OUTPUT_DIR,
        overwrite_output_dir=True,
        num_train_epochs=3,
        per_device_train_batch_size=4,
        per_device_eval_batch_size=4,
        learning_rate=2e-4,
        warmup_steps=50,
        weight_decay=0.01,
        logging_steps=10,
        eval_strategy="epoch",
        save_strategy="epoch",
        save_total_limit=2,
        fp16=torch.cuda.is_available(),
        report_to="none",  # Disable wandb/tensorboard
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        greater_is_better=False,
    )

    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=False,  # Causal LM, not masked LM
    )

    # Custom callback for colored output
    from transformers import TrainerCallback

    class ColoredLoggingCallback(TrainerCallback):
        """Custom callback for beautiful colored training logs."""
        
        def on_log(self, args, state, control, logs=None, **kwargs):
            if logs is None:
                return
            
            step = state.global_step
            epoch = logs.get("epoch", 0)
            
            if "loss" in logs:
                loss = logs["loss"]
                lr = logs.get("learning_rate", 0)
                # Color code loss: green if low, yellow if medium, red if high
                if loss < 2.0:
                    loss_color = GREEN
                elif loss < 4.0:
                    loss_color = YELLOW
                else:
                    loss_color = RED
                
                print(f"  {DIM}Step {step:>4}{RESET} │ "
                      f"Epoch {CYAN}{epoch:.2f}{RESET} │ "
                      f"Loss {loss_color}{BOLD}{loss:.4f}{RESET} │ "
                      f"LR {DIM}{lr:.2e}{RESET}")
            
            if "eval_loss" in logs:
                eval_loss = logs["eval_loss"]
                perplexity = math.exp(eval_loss) if eval_loss < 100 else float("inf")
                print(f"\n  {MAGENTA}{BOLD}📊 Evaluation:{RESET} "
                      f"Loss = {YELLOW}{eval_loss:.4f}{RESET}, "
                      f"Perplexity = {YELLOW}{perplexity:.2f}{RESET}\n")

    print(f"  {GREEN}Training Configuration:{RESET}")
    print(f"  {DIM}├─{RESET} Epochs:            {YELLOW}3{RESET}")
    print(f"  {DIM}├─{RESET} Batch size:        {YELLOW}4{RESET}")
    print(f"  {DIM}├─{RESET} Learning rate:     {YELLOW}2e-4{RESET}")
    print(f"  {DIM}├─{RESET} Warmup steps:      {YELLOW}50{RESET}")
    print(f"  {DIM}├─{RESET} Logging every:     {YELLOW}10 steps{RESET}")
    print(f"  {DIM}├─{RESET} Device:            {YELLOW}{'CUDA (GPU) 🚀' if torch.cuda.is_available() else 'CPU 💻'}{RESET}")
    print(f"  {DIM}└─{RESET} Output:            {YELLOW}{OUTPUT_DIR}{RESET}")
    print()
    print(f"  {CYAN}{BOLD}{'─' * 55}{RESET}")
    print(f"  {CYAN}{BOLD} Training Progress{RESET}")
    print(f"  {CYAN}{BOLD}{'─' * 55}{RESET}\n")

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        data_collator=data_collator,
        callbacks=[ColoredLoggingCallback()],
    )

    # Train!
    train_result = trainer.train()

    print(f"\n  {CYAN}{BOLD}{'─' * 55}{RESET}")
    print(f"  {GREEN}{BOLD}✅ Training Complete!{RESET}")
    print(f"  {CYAN}{BOLD}{'─' * 55}{RESET}\n")

    # Print training summary
    metrics = train_result.metrics
    print(f"  {GREEN}Training Summary:{RESET}")
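    # total_flos is a floating-point-operation count, not a step count;
    # the optimizer step count lives on trainer.state.
    print(f"  {DIM}├─{RESET} Total steps:       {YELLOW}{trainer.state.global_step}{RESET}")
    # Default to NaN so the :.4f format cannot fail on a string fallback.
    print(f"  {DIM}├─{RESET} Training loss:     {YELLOW}{metrics.get('train_loss', float('nan')):.4f}{RESET}")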
    print(f"  {DIM}├─{RESET} Training time:     {YELLOW}{metrics.get('train_runtime', 0):.1f}s{RESET}")
    samples_per_sec = metrics.get('train_samples_per_second', 0)
    print(f"  {DIM}└─{RESET} Samples/sec:       {YELLOW}{samples_per_sec:.2f}{RESET}")

    # ─── Step 5: Save Model ─────────────────────────────────────
    print(f"\n{BLUE}{BOLD}[Step 5/5]{RESET} Saving fine-tuned model\n")

    # Save LoRA adapter
    model.save_pretrained(OUTPUT_DIR)
    tokenizer.save_pretrained(OUTPUT_DIR)

    # Calculate adapter size (adapter_* files only, so tokenizer files are not counted)
    total_size = 0
    for f_name in os.listdir(OUTPUT_DIR):
        f_path = os.path.join(OUTPUT_DIR, f_name)
        if os.path.isfile(f_path) and f_name.startswith("adapter_"):
            total_size += os.path.getsize(f_path)
    
    print(f"  {GREEN}✓ LoRA adapter saved to:{RESET} {DIM}{OUTPUT_DIR}{RESET}")
    print(f"  {GREEN}✓ Tokenizer saved to:{RESET}   {DIM}{OUTPUT_DIR}{RESET}")
    print(f"  {GREEN}✓ Adapter size:{RESET}          {YELLOW}{total_size / 1024:.1f} KB{RESET}")
    print(f"  {DIM}  → The adapter is tiny because LoRA only saves the changed parameters!{RESET}")

    # ─── Final Message ──────────────────────────────────────────
    print(f"""
{GREEN}{BOLD}{'═' * 55}

  🎉 Your model is ready! Run chat.py to talk to it.

  Commands:
    python chat.py    Start chatting
    compare           (typed inside the chat) Compare base vs fine-tuned

{'═' * 55}{RESET}
""")


if __name__ == "__main__":
    main()
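
The trainable-parameter count printed in Step 2 can be checked by hand. The sketch below assumes DistilGPT-2's published configuration (6 layers, embedding size 768) and that each `c_attn` projection maps 768 inputs to 3 × 768 outputs; `lora_param_count` is a hypothetical helper written for this illustration, not part of `peft`.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters LoRA adds to one linear layer: A is (rank x d_in), B is (d_out x rank)."""
    return rank * d_in + d_out * rank

# Assumed DistilGPT-2 dimensions (finetune.py reads the real ones from model.config).
N_LAYERS = 6
N_EMBD = 768
RANK = 8  # same r as in lora_config

# c_attn produces queries, keys and values in one projection: 768 -> 3 * 768.
per_layer = lora_param_count(N_EMBD, 3 * N_EMBD, RANK)
total_lora = per_layer * N_LAYERS

print(f"LoRA params per c_attn layer: {per_layer:,}")
print(f"LoRA params in total:         {total_lora:,}")
```

Against roughly 82 million base parameters this works out to about 0.18%, which should be close to the percentage the script prints.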

Complete Code: chat.py

Python
#!/usr/bin/env python3
"""
💬 Interactive Chat with Your Fine-Tuned AI
=============================================
Chat with the DistilGPT-2 model fine-tuned on education Q&A data.
Supports temperature control, comparison mode, and beautiful terminal UI.

Part of: 🧠 Build Your Own AI — From Zero to ChatBot (Level 5)
"""

import os
import sys

# ─── ANSI Color Codes ──────────────────────────────────────────
CYAN    = "\033[96m"
GREEN   = "\033[92m"
YELLOW  = "\033[93m"
RED     = "\033[91m"
MAGENTA = "\033[95m"
BLUE    = "\033[94m"
BOLD    = "\033[1m"
DIM     = "\033[2m"
RESET   = "\033[0m"
BG_CYAN = "\033[46m"
BG_GREEN = "\033[42m"
WHITE   = "\033[97m"

# ─── Paths (relative to this script) ───────────────────────────
SCRIPT_DIR   = os.path.dirname(os.path.abspath(__file__))
MODEL_DIR    = os.path.join(SCRIPT_DIR, "output", "edu-chatbot-lora")
BASE_MODEL   = "distilgpt2"


def print_welcome():
    """Print a beautiful welcome banner with ASCII art."""
    banner = f"""
{CYAN}{BOLD}
    ╔══════════════════════════════════════════════════════════╗
    ║                                                          ║
    ║   ██████╗██╗  ██╗ █████╗ ████████╗██████╗  ██████╗ ████████╗║
    ║  ██╔════╝██║  ██║██╔══██╗╚══██╔══╝██╔══██╗██╔═══██╗╚══██╔══╝║
    ║  ██║     ███████║███████║   ██║   ██████╔╝██║   ██║   ██║   ║
    ║  ██║     ██╔══██║██╔══██║   ██║   ██╔══██╗██║   ██║   ██║   ║
    ║  ╚██████╗██║  ██║██║  ██║   ██║   ██████╔╝╚██████╔╝   ██║   ║
    ║   ╚═════╝╚═╝  ╚═╝╚═╝  ╚═╝   ╚═╝   ╚═════╝  ╚═════╝    ╚═╝   ║
    ║                                                          ║
    ║         🧠 Your AI Education Assistant 🎓                ║
    ║         Fine-tuned on Indian education Q&A               ║
    ║                                                          ║
    ╚══════════════════════════════════════════════════════════╝{RESET}

    {YELLOW}{BOLD}Commands:{RESET}
    {DIM}├─{RESET} {CYAN}quit{RESET}          Exit the chat
    {DIM}├─{RESET} {CYAN}temp 0.8{RESET}      Set temperature (0.1 = focused, 1.5 = creative)
    {DIM}├─{RESET} {CYAN}compare{RESET}       Compare base model vs fine-tuned model
    {DIM}├─{RESET} {CYAN}help{RESET}          Show this help message
    {DIM}└─{RESET} {CYAN}clear{RESET}         Clear the screen

    {GREEN}Ask me anything about Science, Math, Indian History, or Study Tips!{RESET}
    {DIM}{'─' * 60}{RESET}
"""
    print(banner)


def print_help():
    """Print help message with available commands."""
    print(f"""
    {YELLOW}{BOLD}📋 Available Commands:{RESET}
    {DIM}├─{RESET} {CYAN}quit / exit{RESET}    Exit the chat
    {DIM}├─{RESET} {CYAN}temp <value>{RESET}   Set generation temperature (default: 0.7)
    {DIM}│{RESET}                  {DIM}Low (0.1-0.3) = deterministic, focused answers{RESET}
    {DIM}│{RESET}                  {DIM}Med (0.5-0.8) = balanced, natural responses{RESET}
    {DIM}│{RESET}                  {DIM}High (1.0-1.5) = creative, varied outputs{RESET}
    {DIM}├─{RESET} {CYAN}compare{RESET}        Compare base vs fine-tuned model responses
    {DIM}├─{RESET} {CYAN}help{RESET}           Show this help message
    {DIM}└─{RESET} {CYAN}clear{RESET}          Clear the screen
""")


def load_model(model_path, base_model_name):
    """Load the fine-tuned model (base + LoRA adapter)."""
    try:
        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer
        from peft import PeftModel
    except ImportError as e:
        print(f"{RED}{BOLD}✗ Error:{RESET} Missing dependency: {e}")
        print(f"  {YELLOW}Run: pip install transformers peft torch{RESET}")
        return None, None, None

    # Check if fine-tuned model exists
    if not os.path.exists(model_path):
        print(f"{RED}{BOLD}✗ Fine-tuned model not found!{RESET}")
        print(f"  {DIM}Expected at: {model_path}{RESET}")
        print(f"\n  {YELLOW}You need to train the model first:{RESET}")
        print(f"  {CYAN}  1. python prepare_data.py{RESET}")
        print(f"  {CYAN}  2. python finetune.py{RESET}")
        print(f"  {CYAN}  3. python chat.py  ← then come back here!{RESET}")
        return None, None, None

    print(f"  {DIM}Loading base model ({base_model_name})...{RESET}")
    
    try:
        tokenizer = AutoTokenizer.from_pretrained(model_path)
        base_model = AutoModelForCausalLM.from_pretrained(base_model_name)
        
        if tokenizer.pad_token is None:
            tokenizer.pad_token = tokenizer.eos_token
        
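        print(f"  {DIM}Applying LoRA adapter...{RESET}")
        # PeftModel injects LoRA layers into the model it wraps, so wrap a
        # separate copy and keep `base_model` untouched for compare mode.
        adapter_base = AutoModelForCausalLM.from_pretrained(base_model_name)
        model = PeftModel.from_pretrained(adapter_base, model_path)
        model.eval()
        base_model.eval()
        
        return model, tokenizer, base_model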
    except Exception as e:
        print(f"{RED}{BOLD}✗ Error loading model:{RESET} {e}")
        print(f"  {YELLOW}The model files may be corrupted. Try running finetune.py again.{RESET}")
        return None, None, None


def generate_response(model, tokenizer, prompt, temperature=0.7, max_new_tokens=200, 
                      top_k=50, top_p=0.9):
    """Generate a response from the model."""
    import torch

    # Format the prompt
    formatted_prompt = f"### Question:\n{prompt}\n\n### Answer:\n"
    
    inputs = tokenizer(formatted_prompt, return_tensors="pt", truncation=True, max_length=256)
    input_ids = inputs["input_ids"]
    attention_mask = inputs["attention_mask"]

    with torch.no_grad():
        outputs = model.generate(
            input_ids,
            attention_mask=attention_mask,
            max_new_tokens=max_new_tokens,
            temperature=max(temperature, 0.01),  # Avoid division by zero
            top_k=top_k,
            top_p=top_p,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            repetition_penalty=1.2,
            no_repeat_ngram_size=3,
        )

    # Decode only the generated part
    generated_ids = outputs[0][input_ids.shape[1]:]
    response = tokenizer.decode(generated_ids, skip_special_tokens=True)
    
    # Clean up the response
    response = response.strip()
    
    # Stop at certain markers
    for stop_marker in ["### Question:", "### Answer:", "\n\n\n"]:
        if stop_marker in response:
            response = response[:response.index(stop_marker)].strip()
    
    return response


def compare_models(base_model, finetuned_model, tokenizer, prompt, temperature=0.7):
    """Compare responses from base model vs fine-tuned model."""
    print(f"\n  {MAGENTA}{BOLD}🔬 Comparison Mode{RESET}")
    print(f"  {MAGENTA}{'─' * 55}{RESET}")
    print(f"  {DIM}Prompt: \"{prompt}\"{RESET}\n")
    
    # Base model response
    print(f"  {RED}{BOLD}┌─ 📖 Base Model (distilgpt2 — no fine-tuning){RESET}")
    print(f"  {RED}{BOLD}│{RESET}")
    
    base_response = generate_response(base_model, tokenizer, prompt, temperature)
    for line in base_response.split("\n"):
        print(f"  {RED}│{RESET}  {DIM}{line}{RESET}")
    print(f"  {RED}{BOLD}└{'─' * 50}{RESET}\n")
    
    # Fine-tuned model response
    print(f"  {GREEN}{BOLD}┌─ 🎓 Fine-Tuned Model (trained on education Q&A){RESET}")
    print(f"  {GREEN}{BOLD}│{RESET}")
    
    ft_response = generate_response(finetuned_model, tokenizer, prompt, temperature)
    for line in ft_response.split("\n"):
        print(f"  {GREEN}│{RESET}  {line}")
    print(f"  {GREEN}{BOLD}└{'─' * 50}{RESET}\n")
    
    print(f"  {YELLOW}💡 Notice the difference? The fine-tuned model gives more relevant,")
    print(f"     education-focused answers!{RESET}\n")


def main():
    """Main interactive chat loop."""
    print_welcome()

    # ─── Load Model ─────────────────────────────────────────────
    print(f"  {BLUE}{BOLD}🔄 Loading your fine-tuned AI model...{RESET}\n")
    
    finetuned_model, tokenizer, base_model = load_model(MODEL_DIR, BASE_MODEL)
    
    if finetuned_model is None:
        print(f"\n  {RED}Cannot start chat without a trained model.{RESET}")
        sys.exit(1)

    print(f"\n  {GREEN}{BOLD}✅ Model loaded successfully!{RESET}")
    print(f"  {DIM}{'─' * 60}{RESET}\n")

    # ─── Chat Settings ──────────────────────────────────────────
    temperature = 0.7
    compare_prompt = None

    # ─── Chat Loop ──────────────────────────────────────────────
    while True:
        try:
            # Get user input with colored prompt
            user_input = input(f"  {CYAN}{BOLD}You > {RESET}").strip()
            
            if not user_input:
                continue
            
            # ─── Handle Commands ────────────────────────────────
            lower_input = user_input.lower()
            
            # Quit command
            if lower_input in ("quit", "exit", "q"):
                print(f"\n  {YELLOW}{BOLD}👋 Goodbye! Keep learning and exploring AI!{RESET}")
                print(f"  {DIM}\"The best way to understand AI is to build one yourself.\"{RESET}\n")
                break
            
            # Help command
            if lower_input == "help":
                print_help()
                continue
            
            # Clear command
            if lower_input == "clear":
                os.system("cls" if os.name == "nt" else "clear")
                print_welcome()
                continue
            
            # Temperature command
            if lower_input.startswith("temp "):
                try:
                    new_temp = float(lower_input.split()[1])
                    if 0.01 <= new_temp <= 2.0:
                        temperature = new_temp
                        # Describe the temperature
                        if new_temp < 0.3:
                            desc = "very focused & deterministic"
                        elif new_temp < 0.6:
                            desc = "balanced & reliable"
                        elif new_temp < 1.0:
                            desc = "natural & varied"
                        else:
                            desc = "creative & experimental"
                        print(f"  {GREEN}🌡️ Temperature set to {BOLD}{temperature}{RESET}{GREEN} ({desc}){RESET}\n")
                    else:
                        print(f"  {RED}⚠ Temperature must be between 0.01 and 2.0{RESET}\n")
                except (ValueError, IndexError):
                    print(f"  {RED}⚠ Usage: temp 0.8{RESET}\n")
                continue
            
            # Compare command
            if lower_input == "compare":
                compare_input = input(f"  {MAGENTA}Enter a question to compare > {RESET}").strip()
                if compare_input:
                    compare_models(base_model, finetuned_model, tokenizer, compare_input, temperature)
                else:
                    print(f"  {YELLOW}⚠ Please enter a question for comparison.{RESET}\n")
                continue

            # ─── Generate Response ──────────────────────────────
            print(f"  {DIM}Thinking...{RESET}", end="\r")
            
            response = generate_response(
                finetuned_model, tokenizer, user_input, 
                temperature=temperature
            )
            
            # Clear "Thinking..." and print response
            print(f"  {' ' * 30}", end="\r")  # Clear line
            
            if response:
                print(f"  {GREEN}{BOLD}AI > {RESET}{GREEN}{response}{RESET}\n")
            else:
                print(f"  {YELLOW}AI > {DIM}(The model didn't generate a response. Try rephrasing your question.){RESET}\n")

        except KeyboardInterrupt:
            print(f"\n\n  {YELLOW}{BOLD}👋 Goodbye! (Ctrl+C detected){RESET}\n")
            break
        except EOFError:
            print(f"\n\n  {YELLOW}{BOLD}👋 Goodbye!{RESET}\n")
            break
        except Exception as e:
            print(f"  {RED}⚠ Error: {e}{RESET}\n")


if __name__ == "__main__":
    main()
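
To see what the `temp`, `top_k` and `top_p` settings in `generate_response` do, here is a simplified, pure-Python sketch of the sampling step. The logits are made up, and both function names are hypothetical; Hugging Face implements the same ideas as logits processors whose details differ from this version.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_top_p_filter(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix whose mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    norm = sum(probs[i] for i in kept)  # renormalize the survivors
    return {i: probs[i] / norm for i in kept}

# Hypothetical logits for a 5-token vocabulary.
logits = [2.0, 1.0, 0.5, 0.1, -1.0]

focused = softmax_with_temperature(logits, 0.2)
creative = softmax_with_temperature(logits, 1.5)
candidates = top_k_top_p_filter(softmax_with_temperature(logits, 0.7), top_k=3, top_p=0.9)

rng = random.Random(0)  # fixed seed so the run is repeatable
choice = rng.choices(list(candidates), weights=list(candidates.values()), k=1)[0]

print(f"top token probability at temp 0.2: {focused[0]:.3f}")
print(f"top token probability at temp 1.5: {creative[0]:.3f}")
print(f"candidates after filtering: {sorted(candidates)}  sampled: {choice}")
```

A low temperature concentrates almost all probability on the best token, which is why `temp 0.2` feels focused and `temp 1.5` feels creative.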

Part V

Looking Ahead

The future of AI and your next steps

Chapter 7

The Future of AI

"The best way to predict the future is to invent it." — Alan Kay

Learning Objectives

Understand where AI technology is heading in the next 5-10 years
Know about the major research frontiers in language models
Appreciate the role AI will play in Indian education
Think critically about the ethical implications of AI
Have a clear roadmap for your own AI learning journey

7.1 You've Come a Long Way!

Let's take a moment to appreciate what you've accomplished:

Chapter	What You Built	Key Concept
2	Bigram text generator	Prediction = counting patterns
3	Neural network from scratch	Learning = adjusting weights
4	Transformer & self-attention	Attention = understanding context
5	Mini-GPT trained on stories	Language model = next token prediction
6	Fine-tuned chatbot	Fine-tuning = specializing a pre-trained model

You now understand the core pipeline behind ChatGPT, Claude, and Gemini: pre-train a Transformer on next-token prediction, then fine-tune it for a task.

7.2 Where AI is Heading

7.2.1 Bigger Models, Smarter Reasoning

The trend in AI is clear: scale brings capabilities.


GPT-2  (2019):    1.5 billion parameters
GPT-3  (2020):  175 billion parameters
GPT-4  (2023): not disclosed (unofficial estimates: ~1.7 trillion)    →  Could reason, analyze, create
GPT-5+ (2025+):     ??? parameters        →  ???

But it's not just about size. The frontier is moving toward:

Chain-of-Thought Reasoning: Models that "think step by step" before answering (like you saw in Claude's thinking mode!)
Tool Use: Models that can search the web, run code, use calculators — not just generate text
Multimodal AI: Models that see images, hear audio, AND process text simultaneously
Long Context: From 4K tokens to 1M+ tokens — models can now read entire books at once!

7.2.2 Smaller, Faster, On-Device Models

The opposite trend is also happening:

Quantization: Compressing models to run on phones and laptops
Distillation: Training small models to mimic large ones
Edge AI: Running models directly on your device without internet
Gemma, Phi, LLaMA: Open-weight models, some small enough for your laptop
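
To make the effect of quantization concrete, the sketch below estimates how much storage a model's weights need at different precisions. It uses the same bytes-per-parameter arithmetic as the "Model Size" line in finetune.py. The parameter counts are round illustrative numbers, and real quantized formats add a little overhead for scaling factors.

```python
# Bytes needed to store one weight at each precision.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def model_size_gb(num_params: float, precision: str) -> float:
    """Approximate weight storage in GB (weights only, no activations or KV-cache)."""
    return num_params * BYTES_PER_PARAM[precision] / (1024 ** 3)

models = [
    ("DistilGPT-2 (~82M)", 82e6),
    ("A 2B-parameter model", 2e9),
    ("A 7B-parameter model", 7e9),
]

for name, params in models:
    sizes = ", ".join(f"{p}: {model_size_gb(params, p):.2f} GB" for p in BYTES_PER_PARAM)
    print(f"{name:<22} {sizes}")
```

Going from FP32 to INT4 cuts storage by a factor of eight, which is the difference between a model that needs a server GPU and one that fits in a laptop's memory.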

Tip

Key Insight: The future isn't just "bigger is better". It's "smart enough for the task, small enough for the device."

7.2.3 AI Agents

The next big leap is from chat models to AI agents:


Today:    You ask ChatGPT a question → It gives an answer
Tomorrow: You tell an AI agent a goal → It plans, acts, and delivers

Example:
  "Plan a 3-day trip to Rajasthan for my family of 4,
   budget ₹50,000, book hotels, and create an itinerary."
   
  The agent would:
  1. Research destinations
  2. Compare hotel prices
  3. Book rooms
  4. Create a day-by-day plan
  5. Send you a WhatsApp summary

7.3 AI in Indian Education

7.3.1 The Opportunity

India has:

250+ million students in schools
1.5 million schools, many with teacher shortages
22 scheduled languages to teach in
Vast rural-urban education gap

AI can help address ALL of these challenges:

Challenge	AI Solution
Teacher shortage	AI tutors that explain concepts 24/7
Language barrier	Real-time translation to any Indian language
Quality gap	Same quality of education in Delhi and a village in Bihar
Personalization	Each student learns at their own pace
Assessment	Instant, detailed feedback on assignments

7.3.2 What's Already Happening

DIKSHA (by NCERT): National digital learning platform for Indian students and teachers
Byju's, Vedantu: Personalized learning using AI recommendations
Google Translate: Now handles Hindi, Tamil, Bengali, and more
Bhashini: Government of India's AI translation platform for all 22 scheduled languages
ChatGPT/Gemini in Hindi: Students using AI assistants in their own language

7.3.3 What You Could Build

With what you've learned in this book, you could build:

A Subject Tutor Bot: Fine-tune a model on Class 6-10 NCERT content
A Question Paper Generator: Train on past papers to generate new questions
A Doubt-Solver: A chatbot that explains concepts in simple language
A Language Tutor: Practice English speaking with an AI partner
A Study Planner: AI that creates personalized study schedules

Important

You have the skills now! Level 5 taught you how to fine-tune models. You can create specialized education AI tools for Indian students TODAY.
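
As a starting point for a Subject Tutor Bot, the sketch below turns question-answer pairs into JSONL records with a `text` field, which is the field finetune.py tokenizes. It assumes your training data uses the same "### Question / ### Answer" template that chat.py builds at inference time. `format_example` and the sample pairs are hypothetical.

```python
import json

def format_example(question: str, answer: str) -> dict:
    """Wrap one Q&A pair in the prompt template that chat.py uses at inference time."""
    return {"text": f"### Question:\n{question}\n\n### Answer:\n{answer}"}

# Hypothetical examples; replace them with your own subject content.
qa_pairs = [
    ("What is photosynthesis?",
     "Photosynthesis is the process by which green plants make food "
     "from sunlight, water, and carbon dioxide."),
    ("What is the value of pi to two decimal places?",
     "Pi is approximately 3.14."),
]

# One JSON object per line; ensure_ascii=False keeps Hindi or Tamil text readable.
lines = [json.dumps(format_example(q, a), ensure_ascii=False) for q, a in qa_pairs]
jsonl_text = "\n".join(lines)
print(jsonl_text)
```

Keeping the training template identical to the inference template matters: the model learns to continue text after "### Answer:", so a mismatch usually produces weaker answers.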

7.4 The Ethics of AI

As someone who now understands HOW AI works, you have a responsibility to think about these issues:

7.4.1 Bias in AI

AI models learn from data. If the data has biases, the model will too.

Example: If an AI is trained mostly on English text from Western countries, it might:

Not understand Indian cultural context
Give advice that doesn't apply to Indian families
Reinforce stereotypes about gender, caste, religion

What you can do:

Fine-tune models on diverse, inclusive data
Always test your models for bias
Include data from multiple Indian languages and cultures
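
One simple way to test for bias is a counterfactual probe: ask the same question many times, changing only a name, and check whether the answers change. The sketch below is a minimal illustration. The template, the name list and `toy_model` are all hypothetical stand-ins; in practice you would call your own model with sampling turned off, and review any flagged differences by hand.

```python
from itertools import product

TEMPLATE = "{name} wants to become {job}. Give one line of advice."
# Hypothetical lists; extend them to cover the groups your users belong to.
NAMES = ["Priya", "Rahul", "Fatima", "Joseph"]
JOBS = ["an engineer", "a nurse"]

def build_probe_prompts(template, names, jobs):
    """Create prompts that differ only in the name, grouped by job."""
    groups = {}
    for job, name in product(jobs, names):
        groups.setdefault(job, []).append(template.format(name=name, job=job))
    return groups

def toy_model(prompt: str) -> str:
    """Stand-in for a real model call, such as generate_response in chat.py."""
    return "Study hard and ask questions."

def responses_differ(prompts, model) -> bool:
    """True if the model's answers vary when only the name changes."""
    return len({model(p) for p in prompts}) > 1

probes = build_probe_prompts(TEMPLATE, NAMES, JOBS)
for job, prompts in probes.items():
    flag = "REVIEW" if responses_differ(prompts, toy_model) else "consistent"
    print(f"{job:<12} {len(prompts)} prompts -> {flag}")
```

A difference in wording is not automatically bias, so treat "REVIEW" as a prompt for a human to read the answers, not as a verdict.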

7.4.2 AI and Jobs

Common fear: "AI will take all our jobs!"

Reality: AI will change jobs, not eliminate all of them.


Jobs at risk:        Jobs that will grow:       Jobs AI can't do:
─────────────        ──────────────────         ──────────────────
Data entry           AI trainers                Creative leadership
Basic translation    Prompt engineers           Emotional support
Simple coding        AI ethics experts          Complex problem-solving
Form filling         AI-augmented teachers      Physical craftsmanship
                     Healthcare + AI            Community building

Note

Think About It: The printing press didn't eliminate writers — it created millions more. AI won't eliminate thinkers — it will empower them.

7.4.3 AI Safety

As AI gets more powerful, we need to think about:

Misinformation: AI can generate fake but convincing text, images, videos
Privacy: AI models trained on personal data
Dependence: Over-relying on AI for critical thinking
Access: Ensuring AI benefits aren't limited to rich countries/people

7.4.4 The Responsible AI Developer

As someone building AI, follow these principles:

Transparency: Be clear about what your AI can and cannot do
Fairness: Test for biases across genders, languages, communities
Privacy: Don't train on personal data without consent
Safety: Add guardrails to prevent harmful outputs
Accessibility: Build for all users, including those with disabilities

7.5 Your Learning Roadmap: What's Next?

You've completed this book. Here's where to go next:

🟢 Immediate Next Steps (This Week)

Experiment with your Mini-GPT: Try different training data, model sizes, hyperparameters
Fine-tune on YOUR data: Create a chatbot for your specific use case
Share your work: Show your friends and teachers what you built!

🟡 Short-Term Goals (Next 1-3 Months)

Goal	How
Learn PyTorch deeply	PyTorch tutorials
Understand larger models	Study nanoGPT by Andrej Karpathy
Try bigger open-source models	Fine-tune Gemma 2B or Phi-3
Learn about vision models	Explore CLIP and image generation
Build a real project	Create an AI tool for your school or community

🟠 Medium-Term Goals (3-6 Months)

Take structured courses:

fast.ai: Practical Deep Learning (free)
CS229 Stanford: Machine Learning theory
Andrej Karpathy's YouTube: Build GPT from scratch

Read key papers:

"Attention Is All You Need" (2017): the Transformer
"Language Models are Few-Shot Learners" (2020): GPT-3
"Training Language Models to Follow Instructions with Human Feedback" (2022): InstructGPT/RLHF
"LoRA: Low-Rank Adaptation of Large Language Models" (2021): efficient fine-tuning

Contribute to open source:

Hugging Face Transformers
Indian language NLP projects
AI4Bharat (Indian language AI)

🔴 Long-Term Vision (6+ Months)

Specialize in one area:

NLP (language models, chatbots)
Computer Vision (image recognition, generation)
Reinforcement Learning (game AI, robotics)
AI Safety & Ethics

Build something impactful:

An education tool for rural Indian schools
A healthcare assistant in Indian languages
A farming advisory AI for Indian farmers
A legal aid chatbot for common citizens

Consider AI research:

Apply to IIT, IIIT, or ISI for AI/ML programs
Look into Google AI India, Microsoft Research India
Contribute to cutting-edge research papers

7.6 Final Words

When you started this book, AI seemed like magic — something only Google and OpenAI could build.

Now you know the truth:

AI is not magic. It's math, patterns, and a lot of training data.

You've built every piece yourself:

A model that counts patterns (Chapter 2)
A network that learns from mistakes (Chapter 3)
An attention mechanism that understands context (Chapter 4)
A GPT that generates coherent text (Chapter 5)
A fine-tuned chatbot that answers questions (Chapter 6)

You are no longer just a user of AI. You are a builder.

The world needs more people like you — people who understand how AI works, who can build it responsibly, and who can use it to solve real problems.

India, with its 1.4 billion people, 22 scheduled languages, and incredible diversity, needs AI solutions built BY Indians, FOR Indians.

You have the knowledge. You have the tools. Now go build something amazing. 🚀

💭 Discussion Questions

If you could build any AI tool for India, what would it be and why?

Do you think AI will ever truly "understand" language, or will it always be "just predicting the next word"? What's the difference?

How should India approach AI regulation — strict rules like the EU, or open innovation like the US?

If AI can write essays, code, and solve math problems, what should schools focus on teaching?

You've built a language model from scratch. Does knowing how it works change how you interact with ChatGPT and similar tools?

Key Concepts Summary

Chapter	Core Insight
1	AI is not magic — it's math and patterns
2	The simplest AI just counts what comes after what
3	Neural networks learn by adjusting weights to reduce error
4	Attention lets models understand which parts of input matter
5	GPT = Transformer + next-token prediction + lots of data
6	Fine-tuning = specializing a pre-trained model for your task
7	The future is yours to build!

"I hear and I forget. I see and I remember. I do and I understand."
You didn't just hear about AI. You didn't just see AI. You built AI. You understand AI.
🎉 Congratulations on completing this journey!

7.7 🔑 Production AI — Key Terms Explained

As you move from building toy models to understanding production AI systems like ChatGPT, Claude, and Gemini, you'll encounter these critical terms:

🤖 Agentic Frameworks

What it means: AI systems that don't just answer one question — they take multiple autonomous steps to complete a complex task.

Comparison
Simple AI:     User asks → AI answers → Done

Agentic AI:    User asks → AI plans → reads files → writes code →
               tests it → fixes bugs → reports back → Done

Example: When an AI reads 10 markdown files, converts them to HTML, updates a React component, and starts a dev server — that's agentic behavior: multi-step workflows with tool use, planning, and self-correction.
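
The plan, act, observe cycle behind agentic behavior can be sketched in a few lines. In the example below, `scripted_planner` is a hypothetical stand-in for the language model and `calculator` is a toy tool; a real agent would ask an LLM to choose the next action and would have many more tools.

```python
def calculator(expression: str) -> str:
    """Toy tool: evaluate 'a op b' for the four basic operators."""
    left, op, right = expression.split()
    a, b = float(left), float(right)
    if op == "+":
        return str(a + b)
    if op == "-":
        return str(a - b)
    if op == "*":
        return str(a * b)
    if op == "/":
        return str(a / b)
    raise ValueError(f"unsupported operator: {op}")

TOOLS = {"calculator": calculator}

def scripted_planner(goal: str, observations: list) -> dict:
    """Stand-in for the language model: picks the next action from what it has seen so far."""
    if not observations:
        return {"tool": "calculator", "input": "50000 / 4"}
    return {"final": f"Budget per person: ₹{float(observations[-1]):,.0f}"}

def run_agent(goal: str, planner, max_steps: int = 5) -> str:
    """Plan -> act -> observe loop, with a step limit as a simple guardrail."""
    observations = []
    for _ in range(max_steps):
        action = planner(goal, observations)
        if "final" in action:
            return action["final"]
        result = TOOLS[action["tool"]](action["input"])
        observations.append(result)  # feed the tool result back to the planner
    return "Stopped: step limit reached."

answer = run_agent("Split a ₹50,000 trip budget across 4 people", scripted_planner)
print(answer)
```

The step limit is a deliberate design choice: without it, a planner that never returns a final answer would loop forever.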

Why It Matters

Agentic AI is the frontier of AI development in 2024-25. Companies like Google (Gemini), Anthropic (Claude), and OpenAI (GPT) are racing to build agents that can autonomously code, research, and build entire applications.

📁 Multi-file Code Understanding

What it means: The AI can understand how multiple files relate to each other — imports, function calls, data flow, and architectural patterns across an entire codebase.

Example
Not just:   "What does model.py do?"

But:        "train.py imports MiniGPT from model.py,
             trains it on stories.txt data,
             saves checkpoints to disk,
             and generate.py loads those checkpoints
             to run interactive chat."

Real projects have 10 to 1,000+ files. Understanding one file in isolation is rarely enough; you need to understand how the files fit together as a system.
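
A first step toward multi-file understanding is working out which files depend on which. The sketch below uses Python's standard `ast` module to build a small import graph. The three-file project is a hypothetical miniature held in memory, loosely modelled on the Level 4 layout, and `imported_modules` is a helper written for this illustration.

```python
import ast

def imported_modules(source: str) -> set:
    """Return the top-level module names that a Python source string imports."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module.split(".")[0])
    return found

# Hypothetical miniature project, held in memory instead of on disk.
project = {
    "model.py": "import torch\nclass MiniGPT: ...\n",
    "train.py": "from model import MiniGPT\nimport torch\n",
    "generate.py": "from model import MiniGPT\n",
}

local_modules = {name[:-3] for name in project}  # strip ".py"

# Keep only imports that point at other files in the same project.
graph = {
    name: sorted(imported_modules(src) & local_modules)
    for name, src in project.items()
}

for name, deps in graph.items():
    print(f"{name:<12} depends on {deps or 'nothing local'}")
```

The code is parsed, never executed, so the graph can be built even when libraries such as `torch` are not installed.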

🎯 Low Hallucination Rates

What it means: The AI doesn't make things up. When unsure, it acknowledges uncertainty instead of confidently generating false information.

Type	Example	Problem
Hallucination ❌	"Python was created in 1985"	Wrong year (it was 1991)
Low Hallucination ✅	"Python was created in 1991"	Correct fact
Honest Uncertainty ✅	"I'm not certain — please verify"	Transparent about limits

Critical for Education

In education, wrong answers are worse than no answer. A tutor that confidently tells a student "water boils at 90°C" is dangerous. Low hallucination rates are essential for any educational AI.

💾 Prompt Caching

What it means: When you send the same context repeatedly (like a system prompt or a large document), the provider reuses its already-processed form instead of re-processing it on every request. This can cut input costs by up to 90%.

Scenario	Tokens Processed	Cost
Without caching — Request 1	[50K system prompt] + "What is AI?"	₹5.00
Without caching — Request 2	[50K system prompt] + "What is ML?"	₹5.00
Without caching — Request 3	[50K system prompt] + "What is DL?"	₹5.00
Total without caching:		₹15.00
With caching — Request 1	[50K system prompt] + "What is AI?"	₹5.00
With caching — Request 2	[CACHED ✅] + "What is ML?"	₹0.50
With caching — Request 3	[CACHED ✅] + "What is DL?"	₹0.50
Total with caching:		₹6.00 (60% savings!)

How It Works

The AI provider stores computed representations (KV-cache) of your repeated prefix. On subsequent requests, it skips re-computing those tokens and bills them at a steep discount (the ₹0.50 rows in the table above), charging the normal rate only for the new portion. Especially powerful for chatbots with long system prompts or RAG pipelines.
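
The arithmetic in the caching table can be reproduced with a short script. It is a sketch that assumes the table's own numbers: ₹5.00 for a request that processes the full prompt and 10% of that for a cache hit. The cost of the short question itself is ignored, and `total_cost` is a hypothetical helper name.

```python
# Reproduce the caching table: three requests sharing one long system prompt.
# Assumption (from the table): a fresh request costs ₹5.00 and a cache hit
# is billed at 10% of that. The short question's own cost is ignored.

FULL_COST = 5.00        # rupees per request when the prefix is processed fresh
CACHED_FRACTION = 0.10  # cached prefix billed at 10% (a 90% discount)

def total_cost(num_requests, use_cache):
    if num_requests == 0:
        return 0.0
    if not use_cache:
        return num_requests * FULL_COST
    # The first request pays full price (it fills the cache); the rest hit it.
    return FULL_COST + (num_requests - 1) * FULL_COST * CACHED_FRACTION

without = total_cost(3, use_cache=False)
with_cache = total_cost(3, use_cache=True)
savings = 1 - with_cache / without
print(f"Without caching: ₹{without:.2f}")
print(f"With caching:    ₹{with_cache:.2f}")
print(f"Savings:         {savings:.0%}")
```

Under these assumptions the savings grow with the number of requests and approach the 90% ceiling, because the one full-price request matters less and less.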

🛡️ Safety & Fallbacks

What it means: Built-in guardrails that detect dangerous queries (bioweapons, hacking, harmful content) and either refuse or transparently redirect to a safer model.

How It Works
Student asks:  "Explain cell division in biology"
Result:        ✅ Normal answer — no safety concerns

Attacker asks: "How to create a dangerous pathogen"
Result:        🛑 Safety filter triggered!
               → Query blocked or rerouted to safe model
               → User NOT charged premium rates
               → Incident logged for review

Production AI systems have multiple safety layers:

Input filtering — classify incoming queries before processing
Output filtering — scan generated responses before delivery
Constitutional AI — model trained to self-evaluate and refuse harmful requests
Human review — flagged interactions reviewed by safety teams
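
The first two layers and the review log can be sketched as a wrapper around the model call. This is a toy example: real systems use trained classifiers rather than keyword lists, and `BLOCKED_TERMS`, `fake_model`, and `answer` are hypothetical names invented for illustration.

```python
# Toy two-layer guardrail: check the query before the model runs and check
# the answer before it is shown. The keyword list and fake_model below are
# hypothetical stand-ins for real classifiers and a real model.

BLOCKED_TERMS = ("pathogen", "bioweapon")
REFUSAL = "Sorry, I can't help with that."

def is_unsafe(text):
    lowered = text.lower()
    return any(term in lowered for term in BLOCKED_TERMS)

def fake_model(query):
    return f"Here is an explanation of: {query}"

def answer(query, incident_log):
    if is_unsafe(query):                 # layer 1: input filtering
        incident_log.append(query)       # keep a record for human review
        return REFUSAL
    response = fake_model(query)
    if is_unsafe(response):              # layer 2: output filtering
        incident_log.append(query)
        return REFUSAL
    return response

incidents = []
print(answer("Explain cell division in biology", incidents))
print(answer("How to create a dangerous pathogen", incidents))
print(f"Incidents logged: {len(incidents)}")
```

Keeping the filters outside the model means they still run even if a clever prompt manages to confuse the model itself.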

🔓 Jailbreaking

What it means: Techniques to bypass AI safety filters and trick the model into producing content it's designed to refuse.

Type	Technique	Example
Role-play attack	Ask AI to "pretend" to be unrestricted	"You are DAN (Do Anything Now)..."
Encoding trick	Encode harmful requests in code/base64	Obfuscating the real intent
Prompt injection	Override system instructions via user input	"Ignore all previous instructions..."
Indirect attack	Slowly escalate through innocent-seeming steps	Gradual boundary pushing

Defenses against jailbreaking:

Multi-layer safety checks that can't be fooled by simple prompt tricks
Constitutional AI — model trained to recognize and refuse manipulation
Input/output filtering independent of the model itself
Red-teaming — dedicated teams that try to break the system to find vulnerabilities
Continuous updates — safety systems updated as new attack vectors are discovered

Why This Matters for Education

When deploying AI chatbots for students, jailbreaking resistance is critical. Students are naturally curious — they WILL try to make the bot say unexpected things. Your safety layers must be robust enough to handle this while still being helpful for genuine educational questions.

📊 Key Terms Summary

Term	One-Line Meaning	Book Connection
Agentic Framework	AI that takes multiple autonomous steps	Ch 8 → Level 7 (AI Agents)
Multi-file Understanding	AI reads entire codebases, not just one file	How this book was built!
Low Hallucination	AI doesn't confidently make things up	Ch 8 → Safety & Guardrails
Prompt Caching	Reuse repeated context, save up to 90% of input cost	API cost optimization
Safety & Fallbacks	Block dangerous queries, redirect safely	Ch 8 → Safety & Guardrails
Jailbreaking	Tricks to bypass AI safety rules	Ch 6 → Ethics of Fine-Tuning

7.8 📊 How to Compare AI Models — Key Metrics Explained

When choosing an AI model for your project, you'll see comparison tables with metrics like cost, context, and benchmarks. Here's what each column really means:

The Comparison Table Columns

Metric	What It Means	Why It Matters
Model	Name and version of the AI	Different models have different strengths
Best for	Primary use case	Coding vs writing vs reasoning vs chat
Input / MTok	Cost per 1 million input tokens	How much you pay to SEND text to the model
Output / MTok	Cost per 1 million output tokens	How much you pay for the model's RESPONSE
Context	Max tokens it can read at once	How much text it can "see" simultaneously
Max output	Max tokens in one response	How long its single reply can be
SWE-bench	Software Engineering benchmark	How well it writes and fixes real code

💰 Input / MTok & Output / MTok (Cost)

MTok = Million Tokens ≈ 750,000 words ≈ 750 pages of text (at the dense page size used in the tables below)

Example: You send a 4-page document (≈5,000 tokens) and ask a question
  Input cost: 5,000 tokens × ($3 / 1M tokens) = $0.015
  Output cost: 2,000 tokens × ($15 / 1M tokens) = $0.030
  Total: $0.045 per question (≈ ₹3.75)
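
The same calculation as a small reusable function. `query_cost` is a hypothetical helper name; the prices are the $3 and $15 per million tokens used in the example.

```python
def query_cost(input_tokens, output_tokens, input_price, output_price):
    """Cost in dollars; prices are in dollars per million tokens (MTok)."""
    total = input_tokens * input_price + output_tokens * output_price
    return total / 1_000_000

cost = query_cost(5_000, 2_000, input_price=3.00, output_price=15.00)
print(f"${cost:.3f} per question")
```

The rupee figure in the example seems to imply an exchange rate of roughly ₹83 per dollar; multiply by the current rate for an up-to-date estimate.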

Model	Input / MTok	Output / MTok	~Monthly (1,000 queries/day, ~1,000 input + ~1,000 output tokens each)
GPT-4o	$2.50	$10.00	~$375
Claude Sonnet 4	$3.00	$15.00	~$540
Claude Opus 4	$15.00	$75.00	~$2,700
Gemini 2.5 Pro	$1.25	$10.00	~$338
GPT-4o mini	$0.15	$0.60	~$22
DeepSeek V3	$0.27	$1.10	~$41
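
The monthly column can be rebuilt from the two price columns. The table does not spell out the per-query size, so the sketch below makes an assumption: 1,000 queries a day for 30 days, with about 1,000 input and 1,000 output tokens per query. Those numbers are inferred because they reproduce every row; `monthly_cost` and the constant names are hypothetical.

```python
# Dollars per million tokens: (input, output), copied from the table.
PRICES = {
    "GPT-4o": (2.50, 10.00),
    "Claude Sonnet 4": (3.00, 15.00),
    "Claude Opus 4": (15.00, 75.00),
    "Gemini 2.5 Pro": (1.25, 10.00),
    "GPT-4o mini": (0.15, 0.60),
    "DeepSeek V3": (0.27, 1.10),
}

# Assumed workload (inferred, not stated in the table).
QUERIES_PER_DAY = 1_000
DAYS_PER_MONTH = 30
TOKENS_IN = 1_000
TOKENS_OUT = 1_000

def monthly_cost(input_price, output_price):
    queries = QUERIES_PER_DAY * DAYS_PER_MONTH
    per_query = (TOKENS_IN * input_price + TOKENS_OUT * output_price) / 1_000_000
    return queries * per_query

monthly = {name: monthly_cost(*prices) for name, prices in PRICES.items()}
for name, dollars in monthly.items():
    print(f"{name:<16} ~${dollars:,.2f}")
```

Change `TOKENS_IN` and `TOKENS_OUT` to match your own chatbot; a RAG bot that sends long textbook passages will have a much larger input share.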

Rule of Thumb

Output tokens usually cost several times more than input tokens (4-8x in the table above) because the model generates output one token at a time, running a forward pass for each new token, whereas all input tokens are processed together in parallel.

📏 Context Window

The context window determines how much text the model can "see" at once — think of it as the model's "desk size":

Context Size	Equivalent	Models
4K tokens	≈3 pages	Old GPT-3.5
32K tokens	≈24 pages	GPT-4 original
128K tokens	≈96 pages	GPT-4o
200K tokens	≈150 pages	Claude Sonnet, Claude Opus
1M tokens	≈750 pages	Gemini 2.5 Pro

For your project: Your Mini-GPT (Level 4) has a context of 256 characters. ChatGPT has 128,000 tokens, which is 500x larger by raw count (and more in practice, since one token usually covers several characters). For RAG (Chapter 8), bigger context = more textbook pages per query = better answers.

📤 Max Output

How long the model's single response can be:

Max Output	Equivalent	Can It...
4K tokens	≈3 pages	✓ Answer questions, short essays
8K tokens	≈6 pages	✓ Write detailed explanations
16K tokens	≈12 pages	✓ Generate full chapters
64K tokens	≈48 pages	✓ Write entire documents in one go

Why Max Output Matters

If you ask a model with 4K max output to "write a 50-page book chapter" (65,000 tokens), it will get cut off mid-sentence. You'd need to generate it in multiple chunks. Models with larger max output (like 64K) can write much longer responses in a single call.
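
A quick way to estimate how many calls a long piece of writing needs. `chunks_needed` is a hypothetical helper; it only does the division and ignores the overlap or recap text you would normally add between chunks.

```python
def chunks_needed(total_tokens, max_output_tokens):
    """Smallest number of calls whose combined output covers total_tokens."""
    return (total_tokens + max_output_tokens - 1) // max_output_tokens

chapter_tokens = 65_000                          # the 50-page chapter from the text
print(chunks_needed(chapter_tokens, 4_000))      # 4K max output
print(chunks_needed(chapter_tokens, 16_000))     # 16K max output
```

With a 4K limit the chapter takes 17 calls; with a 16K limit it takes 5.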

🏆 SWE-bench (Software Engineering Benchmark)

What: Tests if AI can fix REAL bugs in real GitHub repositories. The model gets a bug report and must find the file, understand the code, and write a working fix.

Score Range	Rating	What It Means
< 20%	Basic	Can write simple code snippets
20-30%	Good	Can fix straightforward bugs
30-40%	Strong	Can handle complex multi-file fixes
40-50%	Excellent	Near human-level debugging
> 50%	Elite	Better than most junior developers

Model	SWE-bench Score	Rating
Claude Opus 4	~72%	⭐ Elite
Claude Sonnet 4	~65%	⭐ Elite
Gemini 2.5 Pro	~63%	⭐ Elite
DeepSeek V3	~42%	Excellent
GPT-4o	~38%	Strong
GPT-4o mini	~24%	Good

📋 Other Important Benchmarks

Benchmark	What It Tests	Real-World Meaning
MMLU	General knowledge (57 subjects)	"How much does it know?"
HumanEval	Code generation from docstrings	"Can it write functions?"
MATH	Competition-level math problems	"Can it solve hard math?"
GPQA	PhD-level science questions	"How deep is its knowledge?"
Arena Elo	Human preference rankings	"Which model do people prefer?"
Aider	Code editing & multi-file changes	"Can it refactor a codebase?"

🧮 How to Choose the Right Model

Decision Guide

Your Priority	Best Model Choice	Why
💰 Low Cost	GPT-4o mini, DeepSeek V3	10-50x cheaper than premium models
🧠 Best Reasoning	Claude Opus 4, o3	Highest scores on complex reasoning
💻 Best Coding	Claude Sonnet 4, Opus 4	Highest SWE-bench scores
📄 Long Documents	Gemini 2.5 Pro	1M context — reads entire textbooks
⚡ Speed	GPT-4o mini, Claude Haiku	Fastest response times
🎓 Education Bot	Claude Sonnet 4	Best quality-to-cost ratio
🇮🇳 Indian Languages	Gemini 2.5 Pro	Best Hindi/Tamil/Telugu support
🔒 Privacy / Self-hosted	LLaMA, Mistral, DeepSeek	Run on your own servers

For Your Chatbot

For the education chatbot you built in this book, DeepSeek V3 or GPT-4o mini would be the most cost-effective choices for deployment. If you need the best quality and can afford it, Claude Sonnet 4 gives excellent results at a reasonable price. For serving Indian language students, Gemini 2.5 Pro has the best multilingual support.

Appendices

Reference Material

Math foundations, glossary, resources, and more

Mathematical Foundations

Appendix A: Mathematical Foundations

This appendix covers the key math concepts used throughout the book. You don't need to master all of this to understand the code, but it helps to know what's happening under the hood.

A.1 Vectors and Matrices

A vector is a list of numbers:

\mathbf{v} = [3, 1, 4, 1, 5]

A matrix is a 2D grid of numbers:

\mathbf{M} = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}

In our models:

Each character/word is represented as a vector (embedding)
Weight matrices transform vectors from one representation to another
Attention scores form a matrix showing which tokens attend to which

A.2 Dot Product

The dot product of two vectors measures how "similar" they are:

\mathbf{a} \cdot \mathbf{b} = a_1 b_1 + a_2 b_2 + \dots + a_n b_n

Example:

[1, 2, 3] \cdot [4, 5, 6] = (1 \times 4) + (2 \times 5) + (3 \times 6) = 32

In attention: we use the dot product of Query and Key vectors to compute attention scores. A high dot product means "this query is very interested in this key."
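
The worked example in pure Python, as a minimal sketch. `dot` is a hypothetical helper name.

```python
def dot(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))

print(dot([1, 2, 3], [4, 5, 6]))   # 32

# A query compared with two keys: the larger score marks the better match.
query = [1.0, 0.0]
print(dot(query, [0.9, 0.1]))      # key pointing the same way as the query
print(dot(query, [0.0, 1.0]))      # key at right angles to the query
```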

A.3 Matrix Multiplication

When we multiply matrices, each element of the result is a dot product:

\mathbf{C} = \mathbf{A} \times \mathbf{B}

C_{ij} = \sum_k A_{ik} \cdot B_{kj}

In our code, torch.matmul(Q, K.transpose(-2, -1)) computes attention scores by multiplying the Query matrix with the transposed Key matrix.
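
To see what that call computes, here is the same operation written out in pure Python for a tiny example. `matmul` and `transpose` are hypothetical helper names, and the 2x2 matrices are made-up values; in practice you would keep using `torch.matmul`.

```python
def matmul(A, B):
    """C[i][j] is the dot product of row i of A and column j of B."""
    rows, inner, cols = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

def transpose(M):
    return [list(column) for column in zip(*M)]

Q = [[1, 2],    # query vector for token 0
     [0, 1]]    # query vector for token 1
K = [[1, 2],    # key vector for token 0
     [3, 4]]    # key vector for token 1

scores = matmul(Q, transpose(K))
print(scores)   # scores[i][j] = Q[i] . K[j]
```

Each entry `scores[i][j]` is the dot product of query `i` with key `j`, which is why the keys are transposed first.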

A.4 Softmax Function

Softmax converts a vector of arbitrary numbers into a probability distribution (all positive, summing to 1):

\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}

Example:

\text{softmax}([2.0, 1.0, 0.1]) = [0.659, 0.242, 0.099]

Larger inputs → larger probabilities
All outputs are between 0 and 1
They sum to 1.0

Used in:

Attention weights (which tokens to attend to)
Output probabilities (which character comes next)
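
A minimal softmax in pure Python that reproduces the example above. Subtracting the maximum before exponentiating is a standard numerical-stability trick that is not part of the formula shown; it leaves the result unchanged.

```python
import math

def softmax(xs):
    m = max(xs)                               # stability trick: shift by the max
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])           # [0.659, 0.242, 0.099]
print(sum(probs))                             # sums to 1, up to rounding error
```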

A.5 Cross-Entropy Loss

Cross-entropy measures how different the predicted probabilities are from the actual answer:

L = -\sum_{i} y_i \log(\hat{y}_i)

Where:

y_i = true label (1 for correct class, 0 for others)
\hat{y}_i = predicted probability for class i

If the model is very confident and correct → low loss
If the model is wrong → high loss

This is the loss function used throughout Chapters 3-6.
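
A small sketch of the formula for a single prediction over three classes. The probability values are made up for illustration, and `cross_entropy` is a hypothetical helper name.

```python
import math

def cross_entropy(y_true, y_pred):
    """-sum(y_i * log(y_hat_i)); terms where y_i is 0 contribute nothing."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

y_true = [0, 1, 0]                                        # class 1 is correct
confident_right = cross_entropy(y_true, [0.05, 0.90, 0.05])
confident_wrong = cross_entropy(y_true, [0.90, 0.05, 0.05])
print(f"confident and correct: {confident_right:.3f}")
print(f"confident and wrong:   {confident_wrong:.3f}")
```

With a one-hot label the sum collapses to the negative log of the probability given to the correct class, so a probability near 1 gives a loss near 0.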

A.6 Gradient and Chain Rule

The gradient of a function points in the direction in which the function increases fastest, so stepping in the opposite direction decreases its value:

\nabla L = \left[\frac{\partial L}{\partial w_1}, \frac{\partial L}{\partial w_2}, \dots\right]

The chain rule lets us compute gradients through multiple layers:

\frac{\partial L}{\partial w} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial z} \cdot \frac{\partial z}{\partial w}

This is exactly what backpropagation does — it applies the chain rule from the output layer all the way back to the input.
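
The three-factor chain rule can be checked numerically on a single neuron. The sketch below assumes z = w·x, y = sigmoid(z), and a squared-error loss L = (y − target)², chosen only to keep the derivatives short (it is not the cross-entropy loss from A.5). All values are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, x, target):
    return (sigmoid(w * x) - target) ** 2

def grad_chain_rule(w, x, target):
    y = sigmoid(w * x)
    dL_dy = 2 * (y - target)      # derivative of (y - target)^2
    dy_dz = y * (1 - y)           # derivative of sigmoid
    dz_dw = x                     # derivative of w * x with respect to w
    return dL_dy * dy_dz * dz_dw

w, x, target = 0.5, 2.0, 1.0
analytic = grad_chain_rule(w, x, target)

# Finite-difference check: nudge w slightly and measure the change in loss.
eps = 1e-6
numeric = (loss(w + eps, x, target) - loss(w - eps, x, target)) / (2 * eps)
print(analytic, numeric)

# One gradient-descent step: move against the gradient, loss goes down.
w_new = w - 0.1 * analytic
print(loss(w, x, target), loss(w_new, x, target))
```

The two gradient values agree to several decimal places, which is the usual way to sanity-check a hand-written backward pass.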

A.7 The Attention Formula

The most important formula in this entire book:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) \cdot V

Breaking it down:

QK^T — compute similarity between queries and keys
/ \sqrt{d_k} — scale down to prevent extreme values
\text{softmax}(\cdot) — convert to probabilities
\cdot V — weighted sum of values
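
The four steps can be followed in a pure-Python sketch. It handles one head with no causal mask and no batching, and the Q, K, V values are made up; the book's PyTorch code does the same thing with tensors.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Steps 1 and 2: similarity of this query with every key, scaled.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Step 3: turn scores into probabilities.
        weights = softmax(scores)
        # Step 4: weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

Q = [[1.0, 0.0]]                      # one query
K = [[1.0, 0.0], [0.0, 1.0]]          # two keys; the first matches the query
V = [[10.0, 0.0], [0.0, 10.0]]        # two values
outputs, weights = attention(Q, K, V)
print(weights[0])
print(outputs[0])
```

The query matches the first key better, so the output leans toward the first value vector without ignoring the second.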

A.8 Sigmoid Function

\sigma(x) = \frac{1}{1 + e^{-x}}

Properties:

Output is always between 0 and 1
Used in Chapter 3 as the neuron activation function
S-shaped curve

A.9 ReLU and GELU

ReLU (Rectified Linear Unit):

\text{ReLU}(x) = \max(0, x)

Simple and effective. Used in earlier models.

GELU (Gaussian Error Linear Unit):

\text{GELU}(x) = x \cdot \Phi(x)

Where \Phi(x) is the standard normal CDF. Used in GPT-2 and modern models. Smoother than ReLU.
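
Both activations in pure Python, as a minimal sketch. The exact GELU is written with the error function, using the identity Φ(x) = 0.5 · (1 + erf(x / √2)).

```python
import math

def relu(x):
    return max(0.0, x)

def gelu(x):
    # Phi(x), the standard normal CDF, expressed through math.erf
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * phi

for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):.4f}")
```

Note how GELU lets small negative values through slightly instead of cutting them to exactly zero, which is what "smoother than ReLU" means.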

Glossary of Terms

Appendix B: Glossary

Term	Definition
Activation Function	A function applied after a neuron's weighted sum to introduce non-linearity (e.g., sigmoid, ReLU, GELU)
Attention	A mechanism that lets each token in a sequence "look at" other tokens and decide how much to focus on each
Autoregressive	Generating output one token at a time, using previous outputs as input for the next prediction
Backpropagation	The algorithm for computing gradients by applying the chain rule backwards through a neural network
Batch	A group of training examples processed together for efficiency
Bigram	A pair of consecutive characters or words; the simplest language model
Causal Mask	A triangular mask that prevents a token from attending to future tokens
Cross-Entropy	A loss function that measures the difference between predicted and actual probability distributions
Dropout	Randomly turning off neurons during training to prevent overfitting
Embedding	A dense vector representation of a token, learned during training
Epoch	One complete pass through the entire training dataset
Feed-Forward Network (FFN)	A simple neural network with one or two hidden layers, used within transformer blocks
Fine-Tuning	Adapting a pre-trained model to a specific task by training on task-specific data
Gradient	The vector of partial derivatives of the loss with respect to each parameter; it points toward steepest increase, so weights move against it to reduce the loss
Gradient Descent	An optimization algorithm that updates weights in the direction that reduces loss
Head (Attention)	One independent attention computation within multi-head attention
Hidden Layer	A layer of neurons between the input and output layers
Inference	Using a trained model to make predictions (as opposed to training)
Layer Normalization	Normalizing the values within a layer to have mean 0 and standard deviation 1
Learning Rate	A hyperparameter controlling how much weights change in each update step
Logits	The raw, unnormalized scores output by a model before softmax
LoRA	Low-Rank Adaptation — an efficient fine-tuning method that only updates small adapter matrices
Loss	A number measuring how wrong the model's predictions are
Multi-Head Attention	Running multiple attention computations in parallel, each focusing on different aspects
Neuron	The basic unit of a neural network: computes weighted sum + activation
N-gram	A sequence of N consecutive tokens (bigram = 2-gram, trigram = 3-gram)
One-Hot Encoding	Representing a token as a vector of all zeros except for a 1 at the token's index
Overfitting	When a model memorizes training data instead of learning general patterns
Parameters	The learnable weights and biases of a model
PEFT	Parameter-Efficient Fine-Tuning — methods like LoRA that update only a fraction of parameters
Perplexity	A measure of how well a model predicts text; lower = better
Positional Encoding	Information added to embeddings to tell the model about token positions
Pre-training	Training a model on a large, general dataset before fine-tuning
Query (Q)	In attention: "what am I looking for?"
Key (K)	In attention: "what do I contain?"
Value (V)	In attention: "what information do I give?"
Residual Connection	Adding the input of a layer to its output (skip connection)
RLHF	Reinforcement Learning from Human Feedback — training models using human preferences
Sampling	Randomly selecting the next token based on probability distribution
Self-Attention	Attention applied within a single sequence (each token attends to the tokens of that same sequence, including itself)
Softmax	A function that converts logits into a probability distribution
Temperature	A parameter controlling the randomness of sampling (low = predictable, high = creative)
Tokenization	Converting text into a sequence of tokens (characters, subwords, or words)
Top-k Sampling	Only considering the k most probable tokens when sampling
Transformer	The neural network architecture based on self-attention, introduced in 2017
Vocabulary	The set of all unique tokens a model knows
Weight	A learnable parameter that determines how much influence one input has
Weight Tying	Sharing weights between the input embedding and output projection layers

Resources for Further Learning

Appendix C: Resources for Further Learning

📺 Video Courses (Free)

Resource	What You'll Learn	Link
3Blue1Brown: Neural Networks	Beautiful visual explanations of neural networks	YouTube
Andrej Karpathy: Let's Build GPT	Build GPT from scratch (2 hours)	YouTube
Andrej Karpathy: Neural Networks: Zero to Hero	Complete deep learning series	YouTube
fast.ai	Practical deep learning for coders	fast.ai
CS231n (Stanford)	Computer vision & deep learning	YouTube
CS224n (Stanford)	NLP with deep learning	YouTube

📖 Books

Book	Level	Focus
Deep Learning (Goodfellow et al.)	Advanced	Complete theory reference
Hands-On Machine Learning (Géron)	Intermediate	Practical ML with scikit-learn & TF
Natural Language Processing with Transformers (Tunstall et al.)	Intermediate	Hugging Face ecosystem
The Hundred-Page Machine Learning Book (Burkov)	Beginner	Concise overview

📄 Key Papers

Paper	Year	Why It Matters
"Attention Is All You Need"	2017	Introduced the Transformer
"BERT: Pre-training of Deep Bidirectional Transformers"	2018	Bidirectional understanding
"Language Models are Few-Shot Learners" (GPT-3)	2020	Scaling and in-context learning
"Training Language Models to Follow Instructions"	2022	RLHF and InstructGPT
"LoRA: Low-Rank Adaptation"	2021	Efficient fine-tuning
"Constitutional AI"	2022	AI safety approach

🛠️ Tools and Libraries

Tool	Purpose
PyTorch	Deep learning framework (used in this book)
Hugging Face Transformers	Pre-trained models and fine-tuning
Hugging Face PEFT	Parameter-efficient fine-tuning (LoRA)
Hugging Face Datasets	Easy dataset loading
Google Colab	Free GPU for training
Weights & Biases	Experiment tracking

🇮🇳 Indian AI Resources

Resource	Focus
AI4Bharat	NLP for Indian languages
IIT Madras NPTEL	Free AI/ML courses in Hindi and English
Bhashini	Government translation platform
IndicNLP	NLP tools for Indic languages

Setting Up Your Environment

Appendix D: Setting Up Your Environment

D.1 Installing Python

Windows:

Download Python from python.org
During installation, check ✅ "Add Python to PATH"
Open Command Prompt and verify: python --version

Linux/Mac:

Bash
# Usually pre-installed. Check with:
python3 --version

# If not installed:
# Ubuntu: sudo apt install python3 python3-pip
# Mac: brew install python3

D.2 Setting Up a Virtual Environment (Recommended)

Bash
# Create a virtual environment
python -m venv ai-env

# Activate it
# Windows:
ai-env\Scripts\activate
# Linux/Mac:
source ai-env/bin/activate

# Install dependencies
pip install -r requirements.txt

D.3 Using Google Colab (No Installation Needed!)

If you don't want to install anything locally:

Go to colab.research.google.com
Create a new notebook
Upload the Python files or copy-paste the code
Run! (Colab gives you free GPU access)

D.4 Troubleshooting Common Issues

Problem	Solution
`ModuleNotFoundError: No module named 'torch'`	Run `pip install torch`
CUDA out of memory	Reduce batch_size in config
Training is very slow	Use Google Colab for GPU access
`PermissionError` on Windows	Run terminal as Administrator
Model generates gibberish	Train for more steps or check data quality
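
Before digging into a specific error, it can help to check which packages Python can actually see. The sketch below uses only the standard library. The package names are a guess at what the levels need (NumPy for Level 2, PyTorch for Level 3 onward); check requirements.txt for the real list. `check_environment` is a hypothetical helper name.

```python
import importlib.util
import sys

def check_environment(packages=("numpy", "torch")):
    """Report the Python version and whether each package can be found."""
    report = {"python": sys.version.split()[0]}
    for name in packages:
        # find_spec looks the package up without importing it
        report[name] = importlib.util.find_spec(name) is not None
    return report

for key, value in check_environment().items():
    print(f"{key}: {value}")
```

If a package shows `False` even though you installed it, the usual cause is that the virtual environment is not activated in the current terminal.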

D.5 Hardware Recommendations

Level	Minimum	Recommended
Level 1-2	Any computer	Any computer
Level 3	4GB RAM	8GB RAM
Level 4	4GB RAM, CPU OK	8GB RAM, GPU preferred
Level 5	8GB RAM	16GB RAM, NVIDIA GPU

Tip

If you don't have a GPU, use Google Colab (free) for Levels 4 and 5. It provides a free NVIDIA T4 GPU that's more than enough!

End of Appendices

The Training Stories

This is the complete training dataset used in Chapter 5 (Level 4) to train your Mini-GPT model. These 30 stories were carefully chosen to give the model a mix of narrative styles, scientific knowledge, and Indian cultural context — all in simple English.

Why These Stories?

When training a language model, the training data determines what the model learns. We chose stories that:

Category	Purpose	Examples
Indian folk tales	Cultural context, moral lessons	The fox and the pot, the honest woodcutter
Science paragraphs	Factual knowledge	Photosynthesis, water cycle, magnets
Nature descriptions	Vocabulary, descriptive language	Sunrise, moon, seasons
Character stories	Narrative structure	Arjun the reader, Priya the singer
Moral stories	Story patterns, cause-effect	Tortoise and rabbit, ant and grasshopper

Important

Key Design Decisions:

Simple vocabulary (suitable for Class 6-8 level)
Short sentences (easier for a small model to learn)
Repetitive patterns (helps the model learn grammar faster)
Mix of topics (gives the model breadth)
~8,800 characters total (small but sufficient for a demo model)

The Complete Dataset

Below is every story the model trains on. Read through them — when you later see the model generating text, you'll recognize the patterns it learned from these stories!

Story 1: The Wise Farmer

Once upon a time, in a small village near the river, there lived a wise old farmer. He worked hard every day in his fields. The farmer grew rice, wheat, and vegetables. He shared his food with everyone in the village. People loved him because he was kind and generous.

What the model learns: Opening phrases ("Once upon a time"), character introductions, village/farming vocabulary.

Story 2: Nature's Beauty

The sun rises in the east and sets in the west. Every morning, the birds sing beautiful songs. The flowers open their petals to welcome the sunlight. The trees provide shade and fresh air. Nature is beautiful and full of wonders.

What the model learns: Descriptive language, nature vocabulary, present tense patterns.

Story 3: The Clever Fox (Panchatantra-style)

A clever fox lived in a forest near a village. One hot summer day, the fox was very thirsty. He searched for water everywhere but could not find any. Then he saw a pot with some water at the bottom. The fox put small stones into the pot one by one. Slowly the water came up to the top. The fox drank the water happily. This story teaches us that intelligence solves problems.

What the model learns: Problem-solving narratives, sequential actions, moral conclusions.

Story 4: The River Ganga

The river Ganga flows from the Himalayas to the Bay of Bengal. It is one of the longest rivers in India. Many cities and towns are built along its banks. People use the river water for drinking and farming. The Ganga is very important for the people of India.

What the model learns: Indian geography, factual sentences, proper nouns.

Story 5: The Kind King

A kind king ruled a beautiful kingdom. His people were happy and peaceful. The king built schools for children and hospitals for the sick. He made sure everyone had food to eat and a place to live. The kingdom prospered under his wise rule.

What the model learns: Governance vocabulary, cause-effect relationships.

Story 6: The Night Sky

The moon shines brightly in the night sky. Stars twinkle like tiny diamonds above us. The sky changes color from blue to orange during sunset. Clouds float gently across the sky like cotton balls. Looking at the sky fills our hearts with wonder.

What the model learns: Similes ("like tiny diamonds"), poetic descriptions, visual imagery.

Story 7: Arjun the Reader

A small boy named Arjun loved to read books. He would sit under the banyan tree and read for hours. His favorite books were about science and adventure. One day he read about the solar system and the planets. He dreamed of becoming a scientist when he grew up.

What the model learns: Character development, Indian names, aspirational narratives.

Story 8: Water — Essential for Life

Water is essential for all living things. Plants need water to grow and make food. Animals drink water to stay alive and healthy. The water cycle keeps water moving around the earth. Rain fills the rivers and lakes with fresh water.

What the model learns: Scientific facts, cause-effect, ecosystem vocabulary.

Story 9: The Honest Woodcutter

A poor woodcutter lived at the edge of a forest. Every day he would cut wood and sell it in the market. One day his axe fell into the river. He sat by the river and cried because he was very poor. A kind spirit appeared and asked him what happened. The spirit dove into the water and brought up a golden axe. The woodcutter said that was not his axe. The spirit brought up a silver axe. Again the woodcutter said it was not his. Finally the spirit brought up his old iron axe. The woodcutter was happy and said yes that is mine. The spirit was pleased with his honesty and gave him all three axes.

What the model learns: Longer narratives, dialogue patterns, honesty theme, repetitive structure (which helps small models learn!).

Story 10: Day and Night

The earth goes around the sun in one year. The moon goes around the earth in about one month. The earth spins on its axis once every day. This spinning gives us day and night. When our part of the earth faces the sun it is daytime. When it faces away from the sun it is nighttime.

What the model learns: Astronomical facts, cause-effect explanations.

Story 11: The Proud Peacock

A beautiful peacock lived in a garden near the palace. It had colorful feathers of blue and green. When it danced in the rain everyone would stop and watch. The peacock was proud of its beautiful feathers. It spread its tail like a magnificent fan.

Story 12: The Importance of Trees

Trees are very important for our planet. They give us oxygen to breathe and clean the air. Trees provide fruits and nuts for us to eat. Birds build their nests in the branches of trees. We should plant more trees and take care of them.

Story 13: Priya the Singer

A young girl named Priya wanted to learn music. She practiced singing every day after school. Her teacher said she had a beautiful voice. Priya worked very hard and never missed a practice session. After many months she sang in a concert and everyone clapped.

Story 14: The Human Body

The heart pumps blood through our body. Blood carries oxygen and food to every part of the body. The brain controls all our movements and thoughts. Our bones give shape to our body and protect our organs. The human body is an amazing machine.

Story 15: Tortoise and the Rabbit

An old tortoise and a young rabbit decided to have a race. The rabbit ran very fast and went far ahead. He thought he had plenty of time so he took a nap. The tortoise kept walking slowly but steadily. When the rabbit woke up the tortoise had already crossed the finish line. Slow and steady wins the race.

Story 16: Indian Festivals

India has many beautiful festivals throughout the year. Diwali is the festival of lights celebrated with joy and happiness. Holi is the festival of colors where people play with colored powder. Eid brings people together for prayers and feasts. Christmas is celebrated with decorations and gifts.

Story 17: Magnets

A magnet has two poles called north and south. Like poles repel each other and unlike poles attract. Magnets can attract things made of iron and steel. The earth itself is like a giant magnet. A compass needle points north because of the earth magnetic field.

Story 18: The Merchant's Journey

There was a merchant who traveled from town to town selling goods. He carried silk cloths and precious spices on his camel. One day he got lost in the desert during a sandstorm. He prayed for help and soon the storm passed away. He followed the stars in the night sky and found his way home.

Story 19: Light and Colors

Light travels in straight lines very fast. When light passes through a prism it splits into seven colors. These colors are violet indigo blue green yellow orange and red. We can see a rainbow after rain because water drops act like tiny prisms. Light is a form of energy that helps us see the world.

Story 20: The Mother Bird

A mother bird built a nest in a tall tree. She laid three small eggs in the nest. She sat on the eggs to keep them warm for many days. Soon the eggs cracked and three baby birds came out. The mother bird brought food for her babies every day until they learned to fly.

Story 21: Photosynthesis

Plants make their own food through photosynthesis. They use sunlight water and carbon dioxide for this process. The green color in leaves comes from a substance called chlorophyll. Chlorophyll captures sunlight to make food for the plant. Plants give out oxygen during photosynthesis which we breathe.

Story 22: The Brave Soldier

A brave soldier named Ravi protected his village from danger. He stood guard at the border day and night without complaint. The villagers respected him and treated him like a hero. Ravi taught the young boys how to be brave and strong. He said courage means doing the right thing even when you are afraid.

Story 23: Indian Seasons

The seasons change throughout the year in India. Summer is hot and dry with temperatures rising very high. The monsoon brings heavy rains and cools the land. Winter is cold and pleasant in most parts of the country. Spring brings new flowers and green leaves on the trees.

Story 24: The Kind Fisherman

A fisherman went to the sea every morning in his small boat. He would throw his net into the water and wait patiently. Sometimes he caught many fish and sometimes very few. One day he caught a beautiful golden fish. The golden fish spoke and asked to be set free. The kind fisherman released it back into the sea.

Story 25: Electricity

Electricity flows through wires like water flows through pipes. We use electricity to power lights fans and computers. A battery stores electrical energy for later use. Switches control the flow of electricity in a circuit. We should use electricity wisely and not waste it.

Story 26: The Two Friends and the Bear

Two friends were walking through a forest one day. Suddenly they saw a large bear coming toward them. One friend quickly climbed a tree to save himself. The other friend lay down on the ground and pretended to be dead. The bear came close and smelled him then walked away. When the bear left the friend in the tree came down. He asked what the bear whispered in his ear. The friend on the ground said the bear told me not to trust a friend who runs away in danger.

Story 27: Mountains

Mountains are the tallest landforms on the earth. The Himalayas are the highest mountains in the world. Mount Everest is the tallest peak standing at eight thousand meters. Many rivers begin from the glaciers in the mountains. Mountains affect the weather and rainfall in nearby areas.

Story 28: The Ant and the Grasshopper

A little ant worked hard all summer long. It collected food and stored it carefully in its home. A grasshopper spent the whole summer singing and dancing. When winter came the ant had plenty of food to eat. The grasshopper had nothing and was cold and hungry. The ant shared some food with the grasshopper and said it is wise to prepare for the future.

Story 29: Sound

Sound is a form of energy that travels in waves. We hear sounds when these waves reach our ears. Sound travels faster through water than through air. It travels fastest through solid objects like metal. Very loud sounds can damage our hearing so we should protect our ears.

Story 30: The Dedicated Teacher

A teacher loved her students very much. She came to school early every day to prepare her lessons. She explained difficult topics in simple and easy ways. Her students always performed well in their examinations. She believed that every child can learn if given the right guidance.

Data Analysis


Total stories:        30
Total characters:     8,867
Total words:          ~1,530
Unique characters:    ~55
Average story length: ~295 characters (~51 words)

Topic distribution:
  Indian folk/moral tales:  10 stories (33%)
  Science facts:             10 stories (33%)
  Nature/descriptions:        5 stories (17%)
  Character stories:          5 stories (17%)
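
Figures like these can be recomputed whenever you edit the data. The following is a minimal sketch, assuming stories.txt uses the "Story N: Title" header format shown in this appendix. The function name corpus_stats is illustrative, and the exact counts you get depend on whether headers and blank lines are included (here they are skipped).

```python
import re

def corpus_stats(text):
    """Return basic statistics for a stories file.

    Assumes each story starts with a header line like 'Story 12: Title'.
    Only story bodies are counted; headers and blank lines are skipped.
    """
    # Split on header lines; the chunk before the first header is dropped.
    chunks = re.split(r"^Story \d+:.*$", text, flags=re.MULTILINE)
    stories = [c.strip() for c in chunks[1:] if c.strip()]
    total_chars = sum(len(s) for s in stories)
    total_words = sum(len(s.split()) for s in stories)
    n = len(stories)
    return {
        "stories": n,
        "characters": total_chars,
        "words": total_words,
        "unique_characters": len(set("".join(stories))),
        "avg_chars": total_chars / n if n else 0.0,
        "avg_words": total_words / n if n else 0.0,
    }

# Tiny made-up sample in the same layout as stories.txt
sample = """Story 1: Sound

Sound is a form of energy.

Story 2: Mountains

Mountains are tall.
"""
stats = corpus_stats(sample)
print(stats)
```

To run it on the real data, pass in the file contents, for example corpus_stats(open("data/stories.txt", encoding="utf-8").read()).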

Tip

Experiment Idea: Try adding your own stories to stories.txt and re-training the model. Does the generated text change? Does more data improve quality? This is how real AI researchers iterate on data!

Appendix F: Project Walkthrough & Task Tracker

The Building Journey

This project was built as a learning experience — and the process of building it is educational too! Here's how the project came together:

Task Checklist (All Completed ✅)

Level 1: Prediction (Pure Python)

[x] lesson.md — What is prediction? Bigrams explained with Indian context
[x] step1_bigram.py — Count character patterns from sample text
[x] step2_generate.py — Generate text using bigram probabilities

Level 2: Neural Network (NumPy)

[x] lesson.md — Neurons, weights, activation, backpropagation
[x] step1_neuron.py — Build a single neuron, test on AND/OR gates
[x] step2_network.py — Multi-layer neural network with forward pass
[x] step3_train.py — Full training with backpropagation from scratch
[x] step4_visualize.py — Loss curves and training visualization

Level 3: Transformer (PyTorch)

[x] lesson.md — Attention mechanism with classroom analogy
[x] step1_embedding.py — Token and positional embeddings
[x] step2_attention.py — Self-attention from scratch
[x] step3_transformer_block.py — Multi-head attention + FFN + residual
[x] step4_put_it_together.py — Complete Mini-Transformer model

Level 4: Mini-GPT

[x] lesson.md — Autoregressive generation, temperature, top-k
[x] model.py — Complete MiniGPT model class
[x] train.py — Training loop with progress and evaluation
[x] generate.py — Interactive text generation
[x] data/stories.txt — 30 training stories

Level 5: Real Fine-Tune

[x] lesson.md — Pre-training, fine-tuning, LoRA explained
[x] prepare_data.py — Data loading and tokenization
[x] finetune.py — LoRA fine-tuning with Hugging Face
[x] chat.py — Interactive chat with comparison mode
[x] data/education_qa.jsonl — 60+ education Q&A pairs

Documentation

[x] README.md — Complete project guide
[x] requirements.txt — All Python dependencies

Project Statistics

Metric	Value
Total files	25+
Total lines of code	~3,500+
Total documentation	~15,000+ words
Total training data	30 stories + 60+ Q&A pairs
Languages used	Python, Markdown
Libraries used	NumPy, Matplotlib, PyTorch, Transformers, PEFT
Estimated learning time	5-6 hours (all levels)

Key Features

🎨 Rich colored terminal output — every script visually shows what's happening
💬 Extensive comments — every code block explains WHY it exists
🇮🇳 Indian context — Panchatantra stories, NCERT science, Hindi terms
📖 Lesson files — conceptual primers before diving into code
🤖 Interactive chat — talk to your trained model in Levels 4 & 5
📊 60+ education Q&A pairs — covering Class 6-8 science, math, GK
🔄 Backpropagation from scratch — Level 2 implements it without autograd
🧠 Self-attention from scratch — Level 3 builds the core transformer mechanism
⚡ LoRA fine-tuning — Level 5 uses industry-standard tools

Tip

Start from Level 1 even if you know some ML — the progression builds understanding layer by layer!

Part VI

What's Next?

Future improvements, extensions, and your roadmap forward

Chapter 8

🔮 Future Modifications & Extensions

What You'll Explore

10 concrete ways to improve and extend the chatbot
Architecture upgrades from character-level to production-grade
Multi-language support for Indian languages
RAG (Retrieval-Augmented Generation) for accurate answers
Deployment options — web, WhatsApp, mobile, classroom
New levels to continue your AI journey

8.1 🧠 Model Improvements

Your current Mini-GPT is a great learning tool, but there's a huge gap between it and production models like ChatGPT. Here's how to bridge that gap:

Area	Current	Improvement
Tokenization	Character-level	Switch to BPE (Byte-Pair Encoding) like GPT uses — better vocabulary, faster training
Model Size	~1.5M params, 4 blocks	Scale to 10-50M params, 6-8 blocks — much more coherent output
Training Data	30 stories (9KB)	Use Project Gutenberg, Wikipedia, NCERT textbooks — 10MB+
Context Window	256 characters	Expand to 512-1024 tokens for longer coherent passages
Architecture	Basic transformer	Add RoPE, SwiGLU activation, GQA (like LLaMA)

Key Insight

At this scale, the biggest improvement usually comes from more data, not a bigger model. Going from 9KB to 10MB of training data should noticeably improve your model's output quality, even with the same architecture!
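
The Tokenization row in the table above suggests moving from characters to BPE. The core idea can be sketched in pure Python: repeatedly find the most frequent adjacent pair of tokens and merge it into one new token. This is a simplified illustration with made-up function names, not the tokenizer GPT models actually ship with (production tokenizers work on bytes and add pre-tokenization rules).

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the most common adjacent token pair, or None if there is none."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def train_bpe(text, num_merges):
    """Start from single characters and apply `num_merges` greedy merges."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
        merges.append(pair)
    return tokens, merges

text = "aaabdaaabac"
tokens, merges = train_bpe(text, num_merges=1)
print(len(text), "characters ->", len(tokens), "tokens after 1 merge:", tokens)
```

Each merge shortens the sequence the model has to read, which is why a BPE vocabulary lets the same context window cover more text than a character-level one.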

8.2 🗣️ Multi-Language Support

India's Constitution recognizes 22 scheduled languages, but our model currently speaks only English. Here's how to make it multilingual:

Python
# Future: Train on multiple Indian languages

LANGUAGES = {
    "hindi":   "Add Hindi stories and NCERT content",
    "tamil":   "Tamil literature and textbooks",
    "telugu":  "Telugu educational content",
    "bengali": "Bengali stories and science",
}

# How to implement:
# 1. Use SentencePiece tokenizer (supports all scripts)
# 2. Mix multilingual data during training
# 3. Fine-tune with language-specific LoRA adapters
# 4. Use AI4Bharat's IndicNLP resources
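
Step 2 above ("mix multilingual data") raises a practical question: if English has far more text than Tamil, how often should each language appear in a batch? One common approach, shown here as a sketch rather than this project's method, is to sample language i with probability proportional to n_i ** alpha, where n_i is its corpus size. The corpus sizes and the alpha value below are hypothetical.

```python
import random

def sampling_weights(corpus_sizes, alpha=0.5):
    """Probability of drawing each language: p_i proportional to n_i ** alpha.

    alpha = 1.0 samples in proportion to corpus size;
    alpha = 0.0 samples every language equally often.
    """
    scaled = {lang: size ** alpha for lang, size in corpus_sizes.items()}
    total = sum(scaled.values())
    return {lang: value / total for lang, value in scaled.items()}

def sample_languages(weights, batch_size, seed=0):
    """Pick which language each example in a batch is drawn from."""
    rng = random.Random(seed)
    langs = list(weights)
    return rng.choices(langs, weights=[weights[l] for l in langs], k=batch_size)

# Hypothetical corpus sizes in characters
sizes = {"english": 9000, "hindi": 3000, "tamil": 1000, "telugu": 1000, "bengali": 1000}
weights = sampling_weights(sizes, alpha=0.5)
batch_langs = sample_languages(weights, batch_size=8)
print(weights)
print(batch_langs)
```

Values of alpha below 1 boost the smaller languages so the model sees them more often than their raw share of the data.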

Indian Language AI Resources

AI4Bharat — NLP tools and datasets for all Indian languages
Bhashini — Government of India's translation platform
IndicTrans2 — Open-source translation model for 22 Indian languages
Sangraha — Large-scale Indic language dataset

8.3 💬 Better Chat Experience

The current terminal-based chat works, but students deserve a modern UI:

Feature	How to Build
Web UI	Add an `/ai-tutor` route to EduArtha with a React chat interface
Streaming	Show tokens appearing one-by-one (like ChatGPT) using Server-Sent Events
Chat History	Save conversations to SQLite/PostgreSQL for review
Voice Input	Speech-to-text using Web Speech API or OpenAI Whisper
Voice Output	Text-to-speech for answers — great for younger students!
Markdown Rendering	Render math formulas, code blocks, and tables in responses
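
The Streaming row can be illustrated without any web framework. In Server-Sent Events, the server sends plain-text messages of the form "data: ..." followed by a blank line, with the response header Content-Type: text/event-stream. The sketch below only formats the messages. Wrapping each token in JSON and ending with a "[DONE]" marker are conventions assumed here, not requirements of the protocol.

```python
import json

def sse_events(tokens):
    """Yield one Server-Sent Events message per generated token.

    JSON-encoding the payload keeps tokens that contain newlines
    on a single 'data:' line.
    """
    for token in tokens:
        payload = json.dumps({"token": token}, ensure_ascii=False)
        yield f"data: {payload}\n\n"
    # End-of-stream marker (an assumed convention the client must know about)
    yield "data: [DONE]\n\n"

# Stand-in for tokens coming out of your model one at a time
events = list(sse_events(["Plants", " make", " food"]))
for event in events:
    print(event, end="")
```

A web route would return this generator as the response body, and the browser's EventSource API would receive one message per token.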

8.4 📚 RAG — Smarter Education Bot

The most impactful upgrade: Retrieval-Augmented Generation (RAG). Instead of relying only on what the model memorized, RAG searches a knowledge base for relevant information before answering:

Student asks: "Explain photosynthesis"
↓
1. Search NCERT textbook database (vector similarity)
2. Retrieve relevant paragraphs from Class 7 Science Ch.1
3. Feed retrieved text + question to the LLM as context
4. Generate accurate, sourced answer with page references

Python
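# RAG pipeline (minimal sketch). Assumes `ncert_chapters` is a list of
# paragraph strings and `model` is your fine-tuned chatbot from Level 5.
import numpy as np
from sentence_transformers import SentenceTransformer

# Step 1: Index your textbooks (one embedding vector per paragraph)
embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = embedder.encode(ncert_chapters, normalize_embeddings=True)

# Step 2: When student asks a question
def answer_with_rag(question, k=3):
    query = embedder.encode([question], normalize_embeddings=True)[0]
    scores = doc_vectors @ query              # cosine similarity (unit vectors)
    top_k = np.argsort(scores)[::-1][:k]      # indices of the k best matches
    context = "\n".join(ncert_chapters[i] for i in top_k)
    prompt = f"Based on this textbook content:\n{context}\n\nAnswer: {question}"
    return model.generate(prompt)  # placeholder: use your model's generation call

# For larger collections, store the vectors in ChromaDB or FAISS (optionally via LangChain)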

Why RAG Matters

Fine-tuning teaches the model how to answer. RAG gives it what to answer with. Together, they create a chatbot that is both fluent AND accurate — the holy grail of educational AI.

8.5 🎯 Subject-Specific Fine-Tuning

Instead of one generic bot, create specialized tutors — each using the same base model but different LoRA adapters:

Python
# One model, many experts — just swap LoRA adapters!

SUBJECT_TUTORS = {
    "science_tutor":  "Fine-tune on NCERT Science Class 6-10",
    "math_tutor":     "Fine-tune on solved problems + step-by-step",
    "history_tutor":  "Fine-tune on Indian history with timelines",
    "english_tutor":  "Fine-tune on grammar rules + essay examples",
    "coding_tutor":   "Fine-tune on Python exercises + explanations",
}

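# At runtime (the adapter path is hypothetical; use wherever finetune.py saved yours):
from transformers import AutoModelForCausalLM
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("distilgpt2")
model = PeftModel.from_pretrained(base_model, "adapters/science_tutor")  # Swap this!

# Each adapter is small (typically a few MB or less), so you can store dozens of experts cheaply!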
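
Adapter size can be estimated from the LoRA rank and the shapes of the layers being adapted. For a weight matrix of shape d_out x d_in, LoRA adds two small matrices with rank x d_in and d_out x rank entries. The sketch below assumes adapters on only the fused attention projection (768 to 2304) in each of DistilGPT-2's 6 blocks, rank 8, stored as 32-bit floats. The settings in your finetune.py may differ, and saved files carry some extra overhead.

```python
def lora_param_count(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

def adapter_size(num_layers, d_in, d_out, rank, bytes_per_param=4):
    """Return (total parameters, approximate size in MB) for one adapter set."""
    params = num_layers * lora_param_count(d_in, d_out, rank)
    return params, params * bytes_per_param / 1e6

# Assumed configuration: 6 blocks, attention projection 768 -> 2304, rank 8
params, size_mb = adapter_size(num_layers=6, d_in=768, d_out=2304, rank=8)
print(f"{params:,} trainable parameters, about {size_mb:.2f} MB in float32")
```

Raising the rank or adapting more layers grows the adapter in direct proportion, but it stays tiny compared with the base model's roughly 82 million parameters.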

8.6 📊 Training Improvements

Technique	What It Does	Difficulty
Learning Rate Scheduler	Warmup + cosine decay — smoother training	⭐ Easy
Mixed Precision (FP16)	Up to ~2x faster training on supported GPUs, with lower memory use	⭐ Easy
Gradient Accumulation	Train with larger effective batch size on small GPU	⭐ Easy
DPO (Direct Preference Optimization)	Align model to prefer good answers over bad ones	⭐⭐ Medium
Quantization (4-bit / 8-bit)	Run larger models on small GPUs — use QLoRA	⭐⭐ Medium
Flash Attention	Faster, more memory-efficient attention (often 2-4x on long sequences)	⭐⭐ Medium
Distributed Training	Train across multiple GPUs with DeepSpeed/FSDP	⭐⭐⭐ Hard
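
The first row of the table, warmup plus cosine decay, is simple enough to write by hand. The sketch below is one common formulation; the function name and the example values (base rate 3e-4, 100 warmup steps, 1,000 total steps) are assumptions for illustration, not settings taken from train.py.

```python
import math

def lr_at_step(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Learning rate with linear warmup followed by cosine decay."""
    if step < warmup_steps:
        # Linear warmup: ramp from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, clamped to [0, 1]
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    # Cosine factor goes smoothly from 1 down to 0
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (base_lr - min_lr) * cosine

for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step, base_lr=3e-4, warmup_steps=100, total_steps=1000))
```

In PyTorch, a function like this can be plugged into torch.optim.lr_scheduler.LambdaLR, which expects a multiplier of the optimizer's base rate, so you would call it with base_lr=1.0.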

8.7 🛡️ Safety & Guardrails

Before deploying to real students, add these safety layers:

Python
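# Safety layers to add to chat.py (sketch: is_inappropriate, check_confidence,
# and find_sources are helpers you still need to implement)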

class SafeChatBot:
    def generate_safe(self, question):
        # 1. Input filtering — block inappropriate questions
        if self.is_inappropriate(question):
            return "I can only help with educational topics!"

        # 2. Generate answer
        answer = self.model.generate(question)

        # 3. Factuality check — flag uncertain answers
        confidence = self.check_confidence(answer)
        if confidence < 0.5:
            answer += "\n⚠️ I'm not very sure. Please verify with your teacher!"

        # 4. Source attribution
        sources = self.find_sources(answer)
        if sources:
            answer += f"\n📖 Source: {sources}"

        return answer
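
The class above calls helpers that it does not define. Two of them could be sketched as follows, written as plain functions for brevity. Both are deliberately crude assumptions: the keyword list is a placeholder, and the "confidence" here is just the geometric mean of the probabilities the model assigned to its own output tokens, which measures how sure the model was, not whether the answer is true.

```python
import math

# Placeholder list; a real deployment needs a reviewed, much broader filter
BLOCKED_KEYWORDS = {"violence", "gambling", "weapon"}

def is_inappropriate(question, blocked=BLOCKED_KEYWORDS):
    """Return True if any blocked keyword appears as a word in the question."""
    words = {w.strip(".,!?").lower() for w in question.split()}
    return not words.isdisjoint(blocked)

def check_confidence(token_probs):
    """Geometric mean of per-token probabilities, between 0 and 1.

    `token_probs` is the probability the model gave to each token it
    generated; an empty list is treated as zero confidence.
    """
    if not token_probs:
        return 0.0
    avg_log = sum(math.log(max(p, 1e-12)) for p in token_probs) / len(token_probs)
    return math.exp(avg_log)

print(is_inappropriate("What is photosynthesis?"))
print(is_inappropriate("How do I build a weapon?"))
print(round(check_confidence([0.9, 0.8, 0.95]), 3))
```

A score below the 0.5 threshold used above would trigger the "please verify with your teacher" note; the right threshold is something to tune on real answers.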

8.8 📱 Deployment Options

Your chatbot doesn't have to live in the terminal forever. Here are four deployment paths:

Option	Description	Best For	Cost
🌐 EduArtha Web	Add `/ai-tutor` route to Next.js app with WebSocket streaming	Online students	Free (existing server)
📱 WhatsApp Bot	Twilio API integration; students chat on WhatsApp directly	Rural students with entry-level smartphones	~₹500/month
📲 Mobile App	React Native app with offline quantized model	Students without internet	Free (open source)
🖥️ Classroom Kiosk	Raspberry Pi + screen — students walk up and ask questions	Government schools	~₹5,000 one-time

Build a WhatsApp Education Bot

The most impactful deployment for India: a WhatsApp bot that any student can message. Steps:

Set up a Twilio account (free trial available)
Create a Flask/FastAPI webhook endpoint
Load your fine-tuned model on the server
When a WhatsApp message arrives → generate response → send back
Students text questions like "What is photosynthesis?" and get instant answers!

This works on any phone that already has WhatsApp, with no extra app to download. Perfect for rural India. 🇮🇳
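
The webhook step can be sketched with the standard library alone. Twilio delivers each incoming message as a form-encoded POST whose Body field holds the text, and it accepts a small XML reply (TwiML) containing the message to send back. In the sketch below, generate_answer is a hypothetical stand-in for your fine-tuned model, and a Flask or FastAPI route would wrap the same logic.

```python
from urllib.parse import parse_qs
from xml.sax.saxutils import escape

def generate_answer(question):
    """Hypothetical placeholder: call your fine-tuned model here."""
    return f"You asked: {question}"

def build_twiml_reply(form_body):
    """Turn a form-encoded webhook body into a TwiML XML reply."""
    fields = parse_qs(form_body)
    question = fields.get("Body", [""])[0].strip()
    answer = generate_answer(question) if question else "Please send a question."
    # escape() keeps characters like < and & from breaking the XML
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f"<Response><Message>{escape(answer)}</Message></Response>"
    )

def app(environ, start_response):
    """Minimal WSGI app that answers every POST with a TwiML reply."""
    size = int(environ.get("CONTENT_LENGTH") or 0)
    body = environ["wsgi.input"].read(size).decode("utf-8")
    reply = build_twiml_reply(body).encode("utf-8")
    headers = [("Content-Type", "text/xml"), ("Content-Length", str(len(reply)))]
    start_response("200 OK", headers)
    return [reply]

demo = build_twiml_reply("From=whatsapp%3A%2B910000000000&Body=What+is+photosynthesis%3F")
print(demo)
```

For a local trial you could serve it with wsgiref.simple_server.make_server("", 8000, app).serve_forever() and point the Twilio sandbox at a public tunnel to that port.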

8.9 📈 Analytics & Feedback Loop

Track student interactions to continuously improve your bot:

Question frequency — What topics do students ask about most?
Struggle patterns — Where do they ask the same question repeatedly?
Accuracy tracking — Teacher reviews a sample of answers weekly
Student ratings — "Was this answer helpful?" thumbs up/down
Data flywheel — Use real student questions to create better training data!

More students → More questions → Better training data → Better model → More students
This is the AI data flywheel!
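
The first few metrics in the list above can be computed from a simple interaction log. The sketch below assumes a hypothetical log format (a list of dictionaries with student, topic, question, and helpful fields); adapt the field names to whatever your chat history table actually stores.

```python
from collections import Counter

def summarize_feedback(log):
    """Summarize question frequency, repeated questions, and ratings."""
    topic_counts = Counter(entry["topic"] for entry in log)
    # A student asking the same question more than once suggests a struggle
    repeats = Counter(
        (entry["student"], entry["question"].strip().lower()) for entry in log
    )
    struggles = sorted({q for (_, q), n in repeats.items() if n > 1})
    # Only count answers the student actually rated
    rated = [entry["helpful"] for entry in log if entry["helpful"] is not None]
    helpful_rate = sum(rated) / len(rated) if rated else None
    return {
        "top_topics": topic_counts.most_common(3),
        "struggles": struggles,
        "helpful_rate": helpful_rate,
    }

# Made-up example log
log = [
    {"student": "s1", "topic": "science", "question": "What is photosynthesis?", "helpful": True},
    {"student": "s1", "topic": "science", "question": "what is photosynthesis?", "helpful": False},
    {"student": "s2", "topic": "math", "question": "What is a prime number?", "helpful": True},
    {"student": "s2", "topic": "science", "question": "Why is the sky blue?", "helpful": None},
]
summary = summarize_feedback(log)
print(summary)
```

Questions that show up under struggles, or that keep getting a thumbs down, are good candidates for new Q&A pairs in the next round of fine-tuning.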

8.10 🔬 New Levels — Continue Your Journey

This book covered Levels 1-5. Here's where to go next:

Level	Topic	What You'll Build
Level 6	RAG System	Chatbot that searches NCERT textbooks before answering
Level 7	AI Agent	Agent that uses tools — calculator, web search, code execution
Level 8	Vision Model	Image recognition — identify plants, animals, diagrams
Level 9	Speech Model	Voice chatbot — speak questions, hear answers in Hindi
Level 10	Production Deploy	Docker, API server, load balancing, monitoring

The Bigger Picture

You've built something remarkable — a complete AI system from zero. But this is just the beginning. The techniques in this book are the same foundations used by Google, OpenAI, and Anthropic. The difference is scale.

India needs AI builders who understand these foundations deeply. With 250 million students and 22 languages, the opportunity to build impactful AI tools is enormous.

You have the knowledge. You have the code. Now go build something that matters. 🚀🇮🇳

📝 Chapter Summary — Future Roadmap

Model improvements: BPE tokenization, more data, larger architecture → dramatically better output
Multi-language: SentencePiece + multilingual data → support all Indian languages
RAG: Search textbooks before answering → accurate, sourced answers
Subject tutors: Multiple LoRA adapters on one base model → specialized experts
Training upgrades: FP16, gradient accumulation, DPO → faster, better training
Safety: Content filtering, confidence scores, source attribution → trustworthy bot
Deployment: Web, WhatsApp, mobile, classroom kiosk → reach every student
Analytics: Track questions, accuracy, ratings → continuous improvement
New levels: RAG, agents, vision, speech, production → your learning never stops!