KV-Cache and Efficient Inference
Transformers generate text by predicting one token at a time, using previous tokens as context. During this process, the model repeatedly applies self-attention to all tokens generated so far. Without any optimization, each new token requires recomputing attention for the entire context, which becomes increasingly expensive as the sequence grows.

To address this, transformers use a key-value cache (KV-cache). The KV-cache stores the computed key and value tensors for each past token. When generating a new token, the model only needs to compute keys and values for the latest token, then append these to the cache. The attention mechanism then uses the cached keys and values from all previous tokens, rather than recalculating them from scratch. This approach significantly reduces redundant computation and speeds up the decoding process, especially for long sequences.
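The mechanics can be sketched with a minimal single-head attention loop in NumPy. This is an illustrative toy, not a real model: the projection weights are random, there is no batching, multi-head split, or positional encoding, and the names (`decode_step`, `k_cache`, `v_cache`) are assumptions for this example. The key point is that each step computes keys and values only for the newest token and appends them to the cache:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # head dimension (illustrative)

# Random projection weights for one attention head (toy example).
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

k_cache, v_cache = [], []  # grows by one entry per generated token

def decode_step(x):
    """Process one new token embedding x, reusing cached K/V."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    k_cache.append(k)      # only the newest token's K and V are computed;
    v_cache.append(v)      # all earlier entries are reused from the cache
    K = np.stack(k_cache)  # shape (t, d): full history of keys
    V = np.stack(v_cache)  # shape (t, d): full history of values
    scores = K @ q / np.sqrt(d)           # attend over the whole history
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax over cached positions
    return weights @ V                    # attended output for this step

# Simulate generating 5 tokens autoregressively.
for t in range(5):
    out = decode_step(rng.normal(size=d))

print(len(k_cache))  # cache holds one K/V pair per token: 5
```

Note that the query for past tokens is never needed again, which is why only keys and values are cached.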
The memory-speed trade-off in transformer inference refers to the balance between using more memory to store cached key-value pairs (for faster inference) versus saving memory by recomputing these values (at the cost of slower inference).
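The memory side of this trade-off is easy to estimate: the cache holds one key tensor and one value tensor per layer, per attention head, per token. The sketch below uses a hypothetical 7B-class configuration (32 layers, 32 KV heads, head dimension 128, fp16); the function name and numbers are assumptions for illustration, not any specific model's published figures:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch,
                   bytes_per_elem=2):
    # Factor of 2: one tensor for keys and one for values, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Hypothetical 7B-class configuration, fp16, batch size 1.
gib = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                     seq_len=4096, batch=1) / 2**30
print(f"{gib:.1f} GiB")  # 2.0 GiB at a 4096-token context
```

Because the cost scales linearly with sequence length and batch size, long contexts or large batches can make the cache rival the model weights themselves in memory, which is why techniques like grouped-query attention and cache quantization exist.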
Without a KV-cache, the model must recompute keys, values, and attention for the entire sequence every time it generates a new token, so the cost of each step grows with sequence length. With a KV-cache, the model stores all previously computed keys and values and computes them only for the newest token, reusing the cached history and dramatically reducing the work required at each decoding step.
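The difference in growth rates can be made concrete by counting key/value projections per token generated. Without the cache, step t recomputes projections for all t tokens of the prefix, so the total over n steps is n(n+1)/2 (quadratic); with the cache it is n (linear). A small counting sketch (the helper name is an assumption for this example):

```python
def kv_projections(n_tokens, use_cache):
    """Count key/value projection computations over a full generation."""
    total = 0
    for t in range(1, n_tokens + 1):
        # cached: only the newest token; uncached: the entire prefix
        total += 1 if use_cache else t
    return total

print(kv_projections(1024, use_cache=False))  # 524800 -> quadratic growth
print(kv_projections(1024, use_cache=True))   # 1024   -> linear growth
```

This counts only K/V projections; the attention score computation over the cached history still grows with context length either way, so caching removes the redundant recomputation rather than making attention free.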