Transformer Architecture

Understanding Positional Encoding


Transformers process all tokens in a sequence simultaneously – unlike RNNs, they have no built-in notion of order. This means without additional information, the model cannot distinguish between "dog bites man" and "man bites dog".

Positional encoding solves this by adding a position-dependent vector to each token's embedding before it enters the transformer. The model then has access to both the token's meaning and its position in the sequence.
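This addition is purely element-wise. A minimal sketch of the shapes involved, using random stand-ins for the learned token embeddings and the position vectors from the worked example later in this page:

```python
import numpy as np

# Hypothetical toy setup: 3 tokens with d_model = 4. Real token
# embeddings come from a learned lookup table; random numbers
# stand in for them here.
rng = np.random.default_rng(0)
tok = rng.standard_normal((3, 4))     # (seq_len, d_model) token embeddings

pos_enc = np.array([                  # one encoding vector per position
    [0.000,  1.000, 0.000, 1.000],    # position 0
    [0.841,  0.540, 0.010, 1.000],    # position 1
    [0.909, -0.416, 0.020, 1.000],    # position 2
])

x = tok + pos_enc                     # element-wise sum enters the first layer
```

Because the sum keeps the shape `(seq_len, d_model)`, the rest of the transformer is unchanged; it simply sees vectors that encode both meaning and position.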

Sinusoidal Encoding

The original transformer uses sine and cosine functions of varying frequencies to construct a unique encoding for each position:

  • Even dimensions:
    $PE_{(pos,\,2i)} = \sin\!\left(\dfrac{pos}{10000^{2i/d_{model}}}\right)$
  • Odd dimensions:
    $PE_{(pos,\,2i+1)} = \cos\!\left(\dfrac{pos}{10000^{2i/d_{model}}}\right)$

Different dimensions use different frequencies – lower dimensions oscillate quickly, higher dimensions change slowly. Together they form a unique fingerprint for each position that generalizes to sequence lengths not seen during training.
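The two formulas above can be implemented in a few lines of NumPy. This is a sketch rather than any particular library's implementation; the function name is our own:

```python
import numpy as np

def sinusoidal_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sinusoidal positional encodings."""
    positions = np.arange(seq_len)[:, np.newaxis]   # (seq_len, 1)
    dims = np.arange(0, d_model, 2)                 # even dimension indices 2i
    freqs = 1.0 / (10000 ** (dims / d_model))       # one frequency per sin/cos pair
    angles = positions * freqs                      # (seq_len, d_model / 2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                    # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)                    # odd dimensions get cosine
    return pe
```

Note that each sine/cosine pair shares one frequency, and the frequencies decay geometrically from 1 down to 1/10000 across the dimensions, which produces the fast-to-slow oscillation pattern described above.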

A Worked Example

For a sequence of length 3 with d_model = 4:

| Position | PE(pos, 0) | PE(pos, 1) | PE(pos, 2) | PE(pos, 3) |
| --- | --- | --- | --- | --- |
| 0 | sin(0) = 0.0 | cos(0) = 1.0 | sin(0) = 0.0 | cos(0) = 1.0 |
| 1 | sin(1) ≈ 0.841 | cos(1) ≈ 0.540 | sin(0.01) ≈ 0.010 | cos(0.01) ≈ 1.000 |
| 2 | sin(2) ≈ 0.909 | cos(2) ≈ −0.416 | sin(0.02) ≈ 0.020 | cos(0.02) ≈ 1.000 |

Each row is a unique vector added to the corresponding token embedding. Notice how columns 0–1 change rapidly while columns 2–3 change slowly – this multi-frequency structure is what makes each position distinguishable.
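The table can be reproduced directly from the formulas. For d_model = 4 the pairwise frequencies are 1/10000^0 = 1.0 and 1/10000^(2/4) = 0.01, which is why columns 2–3 use arguments of 0.01 per position step:

```python
import numpy as np

# Recompute the table above for seq_len = 3, d_model = 4.
pos = np.arange(3)[:, None]                          # positions 0, 1, 2
freqs = 1.0 / (10000 ** (np.arange(0, 4, 2) / 4))    # frequencies [1.0, 0.01]
pe = np.zeros((3, 4))
pe[:, 0::2] = np.sin(pos * freqs)                    # even columns: sine
pe[:, 1::2] = np.cos(pos * freqs)                    # odd columns: cosine

# Rounded to 3 decimals, the rows are approximately:
#   [0.0,    1.0,    0.0,   1.0  ]
#   [0.841,  0.540,  0.010, 1.000]
#   [0.909, -0.416,  0.020, 1.000]
```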
