Transformer Architecture

What Is Self-Attention?


Self-attention is the core mechanism behind the transformer architecture. It solves a fundamental problem in sequence modeling: how to capture relationships between elements that are far apart in a sequence.

Traditional models like RNNs process input step by step, which makes it hard to connect distant elements. Self-attention allows every element to directly attend to every other element – all at once, regardless of position.

The Intuition

In natural language, the meaning of a word depends on its context. Take the sentence:

The cat sat on the mat because it was soft.

To understand what "it" refers to, a model needs to connect it to "mat" – even though they are several words apart. Self-attention enables this by letting each word compute how much it should focus on every other word in the sequence.

How It Works

For each word in a sequence, self-attention produces a score against every other word. These scores determine how much information to pull from each position when building the representation of the current word.

Take the sentence "She ate the apple." When processing "ate", the model asks: which other words are most relevant to me?

  • "ate" → "apple": high score, since "apple" is the object of the action;
  • "ate" → "She": moderate score, since "She" is the subject;
  • "ate" → "the": low score, since "the" carries little meaning.

The scores are normalized so they sum to 1, then used to compute a weighted combination of all word representations. Words with higher scores contribute more to the output.
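This normalization is done with the softmax function. As a minimal sketch, here is softmax applied to a made-up set of raw scores for "ate" against the other words (the score values are illustrative, not computed by any model):

```python
import torch
import torch.nn.functional as F

# Hypothetical raw scores for "ate" against ["She", "ate", "the", "apple"]
scores = torch.tensor([1.2, 2.0, 0.1, 3.5])

# Softmax turns raw scores into positive weights that sum to 1
weights = F.softmax(scores, dim=-1)

print(weights)        # largest weight on "apple", smallest on "the"
print(weights.sum())  # sums to 1
```

Because softmax exponentiates before normalizing, larger scores receive disproportionately more weight, which is what lets attention focus sharply on the most relevant words.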

A Minimal Example in PyTorch

The simplest version of self-attention computes scores as dot products between word vectors, then applies softmax to normalize them:

import torch
import torch.nn.functional as F

# Four word vectors of dimension 4, representing "She ate the apple"
x = torch.tensor([
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
])

# Compute raw attention scores via dot products
scores = x @ x.T  # Shape: (4, 4)

# Normalize scores into attention weights
weights = F.softmax(scores, dim=-1)

# Compute the output as a weighted combination of input vectors
output = weights @ x  # Shape: (4, 4)

print(weights)
print(output)

Each row of weights shows how much one word attends to every other word. The output is a new representation for each word that blends information from the whole sequence.

Run this locally to inspect the attention weights and see which words attend most strongly to each other.
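The version above scores words against their raw vectors. Transformers refine this: each word vector is projected into separate query, key, and value vectors by learned matrices, and the dot products are scaled by the square root of the dimension to keep softmax well-behaved. Here is a sketch of that scaled dot-product attention; the projection matrices are random stand-ins for what would be learned parameters:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 4
x = torch.randn(4, d)  # four word vectors, as before

# Random stand-ins for learned query/key/value projection matrices
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))

Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Scaling by sqrt(d) keeps the dot products in a range where softmax
# does not saturate
scores = Q @ K.T / d ** 0.5        # Shape: (4, 4)
weights = F.softmax(scores, dim=-1)  # each row sums to 1

output = weights @ V  # Shape: (4, 4)
print(output.shape)
```

The separate projections let a word ask one question ("what am I looking for?") with its query while advertising something different ("what do I offer?") with its key, which a plain dot product of raw vectors cannot do.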

Why is self-attention crucial for sequence modeling tasks such as language understanding?
