# Task

You have all the building blocks: scaled dot-product attention from the previous challenge, and the intuition behind multiple heads from the last chapter. Now put them together.

Implement a `MultiHeadAttention` module as an `nn.Module` class. It should:

1. Accept `d_model` and `num_heads` as constructor arguments – assert that `d_model % num_heads == 0`;
2. Define separate linear projections for `Q`, `K`, `V`, and a final output projection;
3. In `forward(x)`, split the projections into `num_heads` heads of dimension `d_model // num_heads`;
4. Run scaled dot-product attention independently per head;
5. Concatenate the head outputs and pass through the output projection.



Implement the module locally.

Master the transformer architecture by building its core components from scratch. Explore self-attention, multi-head attention, positional encoding, encoder-decoder structure, and layer normalization, all within a consistent domain. Finish by assembling a complete transformer and tackling a capstone challenge.

Build a deep understanding of the transformer architecture by constructing each component from scratch, culminating in a capstone challenge.

Challenge: Implementing Multi-Head Attention

Task