Kullback–Leibler (KL) Divergence: Information-Theoretic Loss
The Kullback–Leibler (KL) divergence is a fundamental concept in information theory and machine learning, measuring how one probability distribution diverges from a second, expected distribution. Mathematically, it is defined as:
$$D_{\mathrm{KL}}(P \parallel Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}$$

Here, P and Q are discrete probability distributions over the same set of events or outcomes, and the sum runs over all possible events i. In this formula, P(i) represents the true probability of event i, while Q(i) represents the probability assigned to event i by the model or approximation.
KL divergence quantifies the inefficiency of assuming Q when the true distribution is P. It can be interpreted as the extra number of bits needed to encode samples from P using a code optimized for Q instead of the optimal code for P.
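As a minimal sketch of this definition, the Python function below computes the divergence for two discrete distributions given as arrays. Using base-2 logarithms makes the result directly interpretable in bits, matching the coding interpretation above; the function name and the example distributions are illustrative, not part of the original text.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) in bits for discrete distributions over the same outcomes."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Terms where P(i) == 0 contribute nothing, by the convention 0 * log 0 = 0;
    # if Q(i) == 0 while P(i) > 0, the divergence is infinite.
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

# True distribution P and model approximation Q over three outcomes.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q))  # extra bits per sample paid for coding P with a code built for Q
```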
KL divergence has several important properties. First, it is asymmetric:
$$D_{\mathrm{KL}}(P \parallel Q) \neq D_{\mathrm{KL}}(Q \parallel P)$$

This means the divergence from P to Q is not the same as from Q to P, reflecting that the "cost" of assuming Q when P is true is not the same as the reverse.
Second, KL divergence is non-negative:
$$D_{\mathrm{KL}}(P \parallel Q) \geq 0$$

for all valid probability distributions P and Q, with equality if and only if P = Q everywhere.
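Continuing the sketch above (and reusing its illustrative kl_divergence helper), a short numerical check shows both properties on a pair of made-up distributions:

```python
p = [0.9, 0.1]
q = [0.6, 0.4]

print(kl_divergence(p, q))  # ≈ 0.33 bits
print(kl_divergence(q, p))  # ≈ 0.45 bits: a different value, so the measure is asymmetric
print(kl_divergence(p, p))  # 0.0: the divergence vanishes only when the distributions coincide
```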
In machine learning, KL divergence is widely used as a loss function, especially in scenarios involving probability distributions. It plays a central role in variational inference, where it measures how close an approximate distribution is to the true posterior. Additionally, KL divergence often appears as a regularization term in models that seek to prevent overfitting by encouraging distributions predicted by the model to remain close to a prior or reference distribution.
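In practice, such a loss is usually taken from a library rather than written by hand. As one illustrative route (the text above does not prescribe a framework), PyTorch's kl_div expects the model's predictions as log-probabilities and the target as probabilities; the tensors below are made up for demonstration.

```python
import torch
import torch.nn.functional as F

# Target (true) distribution P and model logits for Q, over 4 classes and a batch of 2.
p = torch.tensor([[0.10, 0.40, 0.40, 0.10],
                  [0.25, 0.25, 0.25, 0.25]])
q_logits = torch.tensor([[0.2, 1.5, 0.9, -0.3],
                         [0.0, 0.0, 0.0,  0.0]])

log_q = F.log_softmax(q_logits, dim=-1)           # model outputs as log-probabilities
loss = F.kl_div(log_q, p, reduction="batchmean")  # mean D_KL(P || Q) over the batch
print(loss)
```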