Scoring Relevance: Similarity Functions in Attention
To understand how attention mechanisms determine which pieces of information are most relevant, you need to grasp how similarity functions are used to score the relationship between queries and keys. The most common similarity functions in attention are the dot product and cosine similarity. Both of these functions take two vectors—such as a query and a key—and produce a single number that represents how "aligned" or "relevant" they are to each other. In attention, this number becomes the attention score: a higher score means the key is more relevant to the query, and thus the corresponding value should be weighted more heavily in the output.
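Below is a minimal NumPy sketch of this idea (the query, key, and value vectors are made up for illustration): the dot product scores one query against three keys, a softmax (the usual normalization step in attention) turns the scores into weights, and the output is the weighted average of the values.

```python
import numpy as np

# Hypothetical 4-dimensional embeddings: one query, three keys, three values.
query = np.array([1.0, 0.5, -0.2, 0.3])
keys = np.array([
    [0.9, 0.4, -0.1, 0.2],   # points in nearly the same direction as the query
    [-1.0, 0.2, 0.8, -0.5],  # mostly opposite direction
    [0.1, 0.1, 0.0, 0.1],    # weakly related
])
values = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.5, 0.5],
])

# Dot-product scores: one relevance number per key.
scores = keys @ query                      # shape: (3,)

# Softmax turns raw scores into weights that sum to 1.
weights = np.exp(scores) / np.exp(scores).sum()

# The output is a weighted average of the values.
output = weights @ values
print(scores.round(3), weights.round(3), output.round(3))
```

The first key gets the largest score and therefore the largest share of the output, which is exactly the "more relevant keys contribute more" behavior described above.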
The dot product's role as a measure of alignment can be derived mathematically. Recall that the dot product between two vectors q and k can also be written as:
q ⋅ k = ||q|| ||k|| cos(θ)

where ||q|| and ||k|| are the magnitudes (lengths) of the vectors and θ is the angle between them. This means the dot product combines information about both the lengths of the vectors and their orientation relative to each other. When the vectors are normalized (have length 1), the dot product reduces to the cosine similarity, which directly measures the angle between the vectors:
cosine similarity(q, k) = (q ⋅ k) / (||q|| ||k||)

Cosine similarity always ranges from -1 (opposite directions) to 1 (identical directions), regardless of the vectors' magnitudes.
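A short sketch with two arbitrary vectors verifies both relationships numerically: the dot product factors into the product of the magnitudes and cos(θ), and normalizing the vectors first makes the dot product coincide with the cosine similarity.

```python
import numpy as np

q = np.array([2.0, 1.0, 0.0])
k = np.array([1.0, 3.0, -1.0])

dot = np.dot(q, k)
norm_q, norm_k = np.linalg.norm(q), np.linalg.norm(k)

# Cosine similarity: dot product divided by the product of the magnitudes.
cosine = dot / (norm_q * norm_k)

# The identity q·k = ||q|| * ||k|| * cos(θ) holds:
print(dot, norm_q * norm_k * cosine)     # same value

# Normalizing both vectors makes the dot product equal the cosine similarity.
q_hat, k_hat = q / norm_q, k / norm_k
print(np.dot(q_hat, k_hat), cosine)      # same value, always in [-1, 1]
```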
The dot product is preferred in attention mechanisms because it is computationally efficient and naturally aligns with matrix multiplication operations on modern hardware. Geometrically, the dot product measures how much one vector "points in the direction" of another, making it an intuitive fit for scoring relevance between queries and keys. When vectors are not normalized, the dot product also takes into account the magnitude of the vectors, which can encode additional information.
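That efficiency is easy to see in code: with the queries and keys for a whole sequence stacked into matrices, every pairwise dot-product score comes from a single matrix multiplication. The shapes and random values below are toy choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_k = 5, 8                    # assumed toy dimensions

Q = rng.normal(size=(seq_len, d_k))    # one query vector per position
K = rng.normal(size=(seq_len, d_k))    # one key vector per position

# All pairwise dot-product scores at once:
# scores[i, j] = dot product of query i with key j.
scores = Q @ K.T                       # shape: (seq_len, seq_len)
print(scores.shape)
```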
While the dot product and cosine similarity are the most common choices, other scoring functions have been proposed. These include additive (or concatenation-based) scoring, where the query and key are concatenated and passed through a small neural network, and bilinear scoring, which introduces a learned weight matrix between the query and key vectors. Each scoring function shapes how the model learns to represent similarity and relevance. For example (both variants are sketched in code after this list):
- Additive scoring can capture more complex relationships but may be slower to compute;
- Bilinear scoring can learn more flexible matching patterns at the cost of additional parameters.
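Here is a rough sketch of both variants with randomly initialized stand-in parameters; in a real model W, W_a, and v_a would be learned, and the additive form shown is just one common concatenation-based formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
q, k = rng.normal(size=d), rng.normal(size=d)

# Bilinear scoring: a learned weight matrix W sits between query and key.
W = rng.normal(size=(d, d))            # stand-in for a learned matrix
bilinear_score = q @ W @ k             # scalar score

# Additive (concatenation-based) scoring: concatenate q and k, pass the
# result through a small network (one tanh layer, then a scalar projection).
W_a = rng.normal(size=(2 * d, d))      # stand-in for learned weights
v_a = rng.normal(size=d)               # stand-in for a learned projection
additive_score = v_a @ np.tanh(np.concatenate([q, k]) @ W_a)

print(bilinear_score, additive_score)
```

The trade-offs in the list show up directly here: the bilinear score is still a couple of matrix-vector products, while the additive score requires an extra non-linear layer per query-key pair.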