# Task

You now have all the pieces to implement scaled dot-product attention from scratch. Using the formula from the previous chapter, write a function `scaled_dot_product_attention` that:

1. Takes `Q`, `K`, `V` tensors of shape `(batch_size, seq_len, d_k)` as input;
2. Accepts an optional `mask` tensor of shape `(batch_size, seq_len_q, seq_len_k)` — when provided, positions where `mask == 0` should be set to `-inf` before softmax;
3. Returns the output tensor and the attention weights.

Implement the function locally.

Master the transformer architecture by building its core components from scratch. Explore self-attention, multi-head attention, positional encoding, encoder-decoder structure, and layer normalization, all within a consistent domain. Finish by assembling a complete transformer and tackling a capstone challenge.

Build a deep understanding of the transformer architecture by constructing each component from scratch, culminating in a capstone challenge.

Challenge: Implementing Scaled Dot-Product Attention

Task