What Is Self-Attention?
Self-attention is the core mechanism behind the transformer architecture. It solves a fundamental problem in sequence modeling: how to capture relationships between elements that are far apart in a sequence.
Traditional models like RNNs process input step by step, which makes it hard to connect distant elements. Self-attention allows every element to directly attend to every other element – all at once, regardless of position.
The Intuition
In natural language, the meaning of a word depends on its context. Take the sentence:
The cat sat on the mat because it was soft.
To understand what "it" refers to, a model needs to connect it to "mat" – even though they are several words apart. Self-attention enables this by letting each word compute how much it should focus on every other word in the sequence.
How It Works
For each word in a sequence, self-attention produces a score against every other word. These scores determine how much information to pull from each position when building the representation of the current word.
Take the sentence "She ate the apple." When processing "ate", the model asks: which other words are most relevant to me?
"ate"→"apple": high score —"apple"is the object of the action;"ate"→"She": moderate score —"She"is the subject;"ate"→"the": low score —"the"carries little meaning.
The scores are normalized so they sum to 1, then used to compute a weighted combination of all word representations. Words with higher scores contribute more to the output.
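The normalization step can be sketched in plain Python. The raw scores below are made-up values for "ate" attending to each word of "She ate the apple", chosen only to illustrate the softmax:

```python
import math

# Hypothetical raw scores for "ate" attending to each word
# in "She ate the apple" (illustrative values, not model outputs)
words = ["She", "ate", "the", "apple"]
scores = [1.0, 2.0, 0.1, 3.0]

# Softmax: exponentiate each score, then divide by the sum
# so the resulting weights are positive and sum to 1
exps = [math.exp(s) for s in scores]
total = sum(exps)
weights = [e / total for e in exps]

for word, w in zip(words, weights):
    print(f"{word}: {w:.3f}")
```

"apple", with the highest raw score, ends up with the largest weight, so it contributes most to the weighted combination that follows.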
A Minimal Example in PyTorch
The simplest version of self-attention computes scores as dot products between word vectors, then applies softmax to normalize them:
```python
import torch
import torch.nn.functional as F

# Four word vectors of dimension 4 — representing "She ate the apple"
x = torch.tensor([
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
])

# Computing raw attention scores via dot products
scores = x @ x.T  # Shape: (4, 4)

# Normalizing scores into attention weights
weights = F.softmax(scores, dim=-1)

# Computing the output as a weighted combination of input vectors
output = weights @ x  # Shape: (4, 4)

print(weights)
print(output)
```
Each row of weights shows how much one word attends to every other word. The output is a new representation for each word that blends information from the whole sequence.
Run this locally to inspect the attention weights and see which words attend most strongly to each other.
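In full transformers, the inputs are not compared directly: each vector is first projected into query, key, and value vectors by learned weight matrices, and the dot products are scaled before the softmax. A minimal sketch of that scaled dot-product form, reusing the same four vectors (the random matrices here merely stand in for learned parameters):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model = 4  # embedding dimension, matching the example above

# Same four word vectors as before
x = torch.tensor([
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
])

# Stand-ins for learned projection matrices (randomly initialized)
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

q, k, v = x @ W_q, x @ W_k, x @ W_v

# Scaled dot-product attention: dividing by sqrt(d_model) keeps the
# scores in a range where the softmax does not saturate
scores = (q @ k.T) / (d_model ** 0.5)
weights = F.softmax(scores, dim=-1)
output = weights @ v

print(weights.shape)  # (4, 4)
print(output.shape)   # (4, 4)
```

The separate projections let the model learn different notions of "what I'm looking for" (queries), "what I offer" (keys), and "what I pass along" (values), rather than reusing one vector for all three roles.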