Transformers Theory Essentials

KV-Cache and Efficient Inference

Transformers generate text by predicting one token at a time, using previous tokens as context. During this process, the model repeatedly applies self-attention to all tokens generated so far. Without any optimization, each new token requires recomputing attention for the entire context, which becomes increasingly expensive as the sequence grows. To address this, transformers use a key-value cache (KV-cache). The KV-cache stores the computed key and value tensors for each past token. When generating a new token, the model only needs to compute keys and values for the latest token, then append these to the cache. The attention mechanism then uses the cached keys and values from all previous tokens, rather than recalculating them from scratch. This approach significantly reduces redundant computation and speeds up the decoding process, especially for long sequences.
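To make the cache-and-reuse pattern concrete, here is a minimal, self-contained PyTorch sketch of a single attention head decoding with a growing KV-cache. All names and sizes (`d_model = 16`, `W_q`, `W_k`, `W_v`, `decode_step`) are illustrative toys, not part of any real model; there is no batching, masking, or positional encoding, only the mechanism described above.

```python
import torch

torch.manual_seed(0)

d_model = 16                          # toy embedding / head dimension
W_q = torch.randn(d_model, d_model)   # projection matrices of one attention head
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

k_cache, v_cache = [], []             # the KV-cache: keys/values of all past tokens

def decode_step(x_new):
    """Run one attention step for the newest token only.

    x_new: (1, d_model) embedding of the token just generated.
    Keys and values are computed only for x_new and appended to the cache;
    attention then runs over the full cached history.
    """
    q = x_new @ W_q                    # query for the new token
    k_cache.append(x_new @ W_k)        # compute K and V once, then cache them
    v_cache.append(x_new @ W_v)
    K = torch.cat(k_cache, dim=0)      # (seq_len, d_model) cached keys
    V = torch.cat(v_cache, dim=0)      # (seq_len, d_model) cached values
    scores = (q @ K.T) / d_model ** 0.5
    weights = torch.softmax(scores, dim=-1)
    return weights @ V                 # (1, d_model) attention output

# Simulate decoding 5 tokens: each step projects only the newest token,
# yet attends over every cached position.
for t in range(5):
    x = torch.randn(1, d_model)        # stand-in for the new token's embedding
    out = decode_step(x)
    print(f"step {t}: cache holds {len(k_cache)} key/value pairs")
```

Without the cache, every step would have to re-project keys and values for the whole prefix; with it, each step adds exactly one new key/value pair and reuses the rest.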

Definition

The memory-speed trade-off in transformer inference refers to the balance between using more memory to store cached key-value pairs (for faster inference) versus saving memory by recomputing these values (at the cost of slower inference).

Without KV-cache, the model must recompute keys, values, and attention for the entire sequence every time it generates a new token, so the cost of each step grows with the sequence length. With KV-cache, the model stores all previously computed keys and values, computes them only for the newest token, and reuses the cached history, dramatically reducing the work required at each decoding step.
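To see why the memory side of this trade-off matters, the hypothetical helper below estimates the cache size from the model shape. The configuration values (32 layers, 32 heads, head dimension 128, fp16 storage) are illustrative of a 7B-class decoder and are assumptions, not figures from this lesson.

```python
def kv_cache_bytes(num_layers, num_heads, head_dim,
                   seq_len, batch_size, bytes_per_value=2):
    """Memory needed to cache keys and values (the leading factor of 2)
    for every layer, head, and position, assuming fp16/bf16 storage."""
    return 2 * num_layers * num_heads * head_dim * seq_len * batch_size * bytes_per_value

# Illustrative 7B-class decoder: 32 layers, 32 heads of dimension 128, fp16 values.
size = kv_cache_bytes(num_layers=32, num_heads=32, head_dim=128,
                      seq_len=2048, batch_size=1)
print(f"{size / 2**30:.2f} GiB")   # 1.00 GiB for a single 2048-token sequence
```

Because the formula is linear in both sequence length and batch size, doubling either one doubles the cache footprint, which is why long-context serving tends to be memory-bound rather than compute-bound.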

Question

Which statements correctly describe the effects of using KV-cache in transformer text generation?

