Transformer Architecture

What Is Self-Attention?


Self-attention is the core mechanism behind the transformer architecture. It solves a fundamental problem in sequence modeling: how to capture relationships between elements that are far apart in a sequence.

Traditional models like RNNs process input step by step, which makes it hard to connect distant elements. Self-attention allows every element to directly attend to every other element – all at once, regardless of position.

The Intuition

In natural language, the meaning of a word depends on its context. Take the sentence:

The cat sat on the mat because it was soft.

To understand what "it" refers to, a model needs to connect it to "mat" – even though they are several words apart. Self-attention enables this by letting each word compute how much it should focus on every other word in the sequence.

How It Works

For each word in a sequence, self-attention produces a score against every other word. These scores determine how much information to pull from each position when building the representation of the current word.

Take the sentence "She ate the apple." When processing "ate", the model asks: which other words are most relevant to me?

  • "ate" → "apple": high score, since "apple" is the object of the action;
  • "ate" → "She": moderate score, since "She" is the subject;
  • "ate" → "the": low score, since "the" carries little meaning.

The scores are normalized so they sum to 1, then used to compute a weighted combination of all word representations. Words with higher scores contribute more to the output.
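This normalization is done with the softmax function. As a minimal sketch, here is softmax applied to a made-up set of raw scores for "ate" against the other words (the score values are illustrative, not computed by any model):

```python
import torch
import torch.nn.functional as F

# Hypothetical raw scores for "ate" against ["She", "ate", "the", "apple"]
scores = torch.tensor([1.2, 2.0, 0.1, 3.5])

# Softmax turns raw scores into positive weights that sum to 1
weights = F.softmax(scores, dim=-1)

print(weights)        # largest weight on "apple", smallest on "the"
print(weights.sum())  # sums to 1
```

Because softmax exponentiates before normalizing, larger scores receive disproportionately more weight, which is what lets attention focus sharply on the most relevant words.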

A Minimal Example in PyTorch

The simplest version of self-attention computes scores as dot products between word vectors, then applies softmax to normalize them:

import torch
import torch.nn.functional as F

# Four word vectors of dimension 4, representing "She ate the apple"
x = torch.tensor([
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
])

# Compute raw attention scores via dot products
scores = x @ x.T  # Shape: (4, 4)

# Normalize scores into attention weights
weights = F.softmax(scores, dim=-1)

# Compute the output as a weighted combination of input vectors
output = weights @ x  # Shape: (4, 4)

print(weights)
print(output)

Each row of weights shows how much one word attends to every other word. The output is a new representation for each word that blends information from the whole sequence.

Run this locally to inspect the attention weights and see which words attend most strongly to each other.
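The version above scores words against their raw vectors. Transformers refine this: each word vector is projected into separate query, key, and value vectors by learned matrices, and the dot products are scaled by the square root of the dimension to keep softmax well-behaved. Here is a sketch of that scaled dot-product attention; the projection matrices are random stand-ins for what would be learned parameters:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 4
x = torch.randn(4, d)  # four word vectors, as before

# Random stand-ins for learned query/key/value projection matrices
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))

Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Scaling by sqrt(d) keeps the dot products in a range where softmax
# does not saturate
scores = Q @ K.T / d ** 0.5        # Shape: (4, 4)
weights = F.softmax(scores, dim=-1)  # each row sums to 1

output = weights @ V  # Shape: (4, 4)
print(output.shape)
```

The separate projections let a word ask one question ("what am I looking for?") with its query while advertising something different ("what do I offer?") with its key, which a plain dot product of raw vectors cannot do.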

Why is self-attention crucial for sequence modeling tasks such as language understanding?
