Understanding Loss Functions in Machine Learning

Kullback–Leibler (KL) Divergence: Information-Theoretic Loss

The Kullback–Leibler (KL) divergence is a fundamental concept in information theory and machine learning, measuring how one probability distribution diverges from a second, expected distribution. Mathematically, it is defined as:

D_{KL}(P \| Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}

Here, P and Q are discrete probability distributions over the same set of events or outcomes, and the sum runs over all possible events i. In this formula, P(i) represents the true probability of event i, while Q(i) represents the probability assigned to event i by the model or approximation.
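To make the definition concrete, here is a minimal NumPy sketch that sums P(i) log(P(i)/Q(i)) over all outcomes; the example distributions p and q below are made up purely for illustration:

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL divergence D_KL(P || Q), in nats.

    p, q: arrays of probabilities over the same outcomes (each sums to 1).
    Terms with p[i] == 0 contribute 0 by the usual convention 0 * log 0 = 0.
    If q[i] == 0 where p[i] > 0, the divergence is infinite.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# Example: true distribution P vs. a model's approximation Q
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q))  # small positive value, since Q is close to P
```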

Note

KL divergence quantifies the inefficiency of assuming Q when the true distribution is P. It can be interpreted as the extra number of bits needed to encode samples from P using a code optimized for Q instead of the optimal code for P.
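With base-2 logarithms, the definition splits into a cross-entropy term and an entropy term, which makes this "extra bits" reading explicit:

D_{KL}(P \| Q) = \underbrace{-\sum_i P(i) \log_2 Q(i)}_{H(P, Q)} - \underbrace{\Big(-\sum_i P(i) \log_2 P(i)\Big)}_{H(P)}

The first term, the cross-entropy H(P, Q), is the average code length when encoding outcomes drawn from P with a code built for Q; the second, the entropy H(P), is the best achievable average length, so their difference is exactly the encoding overhead.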

KL divergence has several important properties. First, it is asymmetric:

D_{KL}(P \| Q) \neq D_{KL}(Q \| P)

This means the divergence from P to Q is not the same as from Q to P, reflecting that the "cost" of assuming Q when P is true is not the same as the reverse.

Second, KL divergence is non-negative:

D_{KL}(P \| Q) \geq 0

for all valid probability distributions P and Q, with equality if and only if P = Q everywhere.
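Both properties are easy to check numerically. Continuing with the kl_divergence sketch above and two made-up distributions:

```python
p = [0.7, 0.2, 0.1]
q = [0.3, 0.3, 0.4]

print(kl_divergence(p, q))  # about 0.37 nats
print(kl_divergence(q, p))  # about 0.42 nats: D_KL(P||Q) != D_KL(Q||P)
print(kl_divergence(p, p))  # 0.0: the divergence vanishes only when the distributions match
```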

In machine learning, KL divergence is widely used as a loss function, especially in scenarios involving probability distributions. It plays a central role in variational inference, where it measures how close an approximate distribution is to the true posterior. Additionally, KL divergence often appears as a regularization term in models that seek to prevent overfitting by encouraging distributions predicted by the model to remain close to a prior or reference distribution.
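As a sketch of how this is typically wired up in practice, assuming PyTorch as the framework (the logits and soft targets below are made-up examples), torch.nn.functional.kl_div takes the model's log-probabilities for Q and the target probabilities for P:

```python
import torch
import torch.nn.functional as F

# Hypothetical model outputs (logits) for a batch of 2 examples, 3 classes
logits = torch.tensor([[1.2, 0.3, -0.5],
                       [0.1, 0.9, 0.4]])

# Target distributions P (e.g. soft labels or a reference/prior distribution)
target = torch.tensor([[0.7, 0.2, 0.1],
                       [0.1, 0.6, 0.3]])

# kl_div expects log-probabilities for the model (Q) and probabilities for the target (P);
# 'batchmean' averages the per-example divergences, matching the mathematical definition.
log_q = F.log_softmax(logits, dim=-1)
loss = F.kl_div(log_q, target, reduction="batchmean")
print(loss)  # D_KL(P || Q) averaged over the batch; differentiable, so it can be minimized
```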


