What Is Self-Attention?
Self-attention is the core mechanism behind the transformer architecture. It solves a fundamental problem in sequence modeling: how to capture relationships between elements that are far apart in a sequence.
Traditional models like RNNs process input step by step, which makes it hard to connect distant elements. Self-attention allows every element to directly attend to every other element – all at once, regardless of position.
The Intuition
In natural language, the meaning of a word depends on its context. Take the sentence:
The cat sat on the mat because it was soft.
To understand what "it" refers to, a model needs to connect it to "mat" – even though they are several words apart. Self-attention enables this by letting each word compute how much it should focus on every other word in the sequence.
How It Works
For each word in a sequence, self-attention produces a score against every other word. These scores determine how much information to pull from each position when building the representation of the current word.
Take the sentence "She ate the apple." When processing "ate", the model asks: which other words are most relevant to me?
"ate"→"apple": high score —"apple"is the object of the action;"ate"→"She": moderate score —"She"is the subject;"ate"→"the": low score —"the"carries little meaning.
The scores are normalized so they sum to 1, then used to compute a weighted combination of all word representations. Words with higher scores contribute more to the output.
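The normalization step can be seen in isolation with softmax. A small sketch, using made-up raw scores for "ate" against the words of "She ate the apple" (the numbers are illustrative, not computed by a model):

```python
import torch
import torch.nn.functional as F

# Hypothetical raw scores for "ate" against ["She", "ate", "the", "apple"]
scores = torch.tensor([2.0, 3.0, 0.5, 4.0])

# Softmax rescales the scores into positive weights that sum to 1
weights = F.softmax(scores, dim=-1)
print(weights)        # largest weight on "apple", smallest on "the"
print(weights.sum())  # 1.0
```

Because softmax exponentiates before normalizing, larger raw scores are amplified: "apple" ends up dominating the weighted combination.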
A Minimal Example in PyTorch
The simplest version of self-attention computes scores as dot products between word vectors, then applies softmax to normalize them:
```python
import torch
import torch.nn.functional as F

# Four word vectors of dimension 4, representing "She ate the apple"
x = torch.tensor([
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
])

# Computing raw attention scores via dot products
scores = x @ x.T  # Shape: (4, 4)

# Normalizing scores into attention weights
weights = F.softmax(scores, dim=-1)

# Computing the output as a weighted combination of input vectors
output = weights @ x  # Shape: (4, 4)

print(weights)
print(output)
```
Each row of weights shows how much one word attends to every other word. The output is a new representation for each word that blends information from the whole sequence.
Run this locally to inspect the attention weights and see which words attend most strongly to each other.
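In transformer models, the input vectors are first passed through learned linear projections into queries, keys, and values, and the dot products are scaled by the square root of the key dimension before softmax. A minimal sketch of this scaled dot-product variant, reusing the vectors above (the random projection weights are purely illustrative, not trained):

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)

d_model = 4  # same dimension as the word vectors above
x = torch.tensor([
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
])

# Learned projections (randomly initialized here for illustration)
W_q = torch.nn.Linear(d_model, d_model, bias=False)
W_k = torch.nn.Linear(d_model, d_model, bias=False)
W_v = torch.nn.Linear(d_model, d_model, bias=False)

q, k, v = W_q(x), W_k(x), W_v(x)

# Scaled dot-product attention: scale scores, softmax, weighted sum of values
scores = q @ k.T / math.sqrt(d_model)  # Shape: (4, 4)
weights = F.softmax(scores, dim=-1)    # each row sums to 1
output = weights @ v                   # Shape: (4, 4)

print(weights)
```

The scaling keeps the dot products from growing with dimension, which would otherwise push softmax toward near one-hot weights.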