Jensen–Shannon Divergence and GANs
The Jensen–Shannon (JS) divergence is a fundamental concept in information theory and machine learning, especially in the context of generative adversarial networks (GANs). It builds upon the Kullback–Leibler (KL) divergence by introducing a symmetrized and smoothed measure of the difference between two probability distributions. The formula for the Jensen–Shannon divergence is:
$$\mathrm{JS}(P \parallel Q) = \tfrac{1}{2}\, D_{\mathrm{KL}}(P \parallel M) + \tfrac{1}{2}\, D_{\mathrm{KL}}(Q \parallel M), \qquad M = \tfrac{1}{2}(P + Q),$$

where $D_{\mathrm{KL}}$ denotes the KL divergence.
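To make the formula concrete, here is a minimal sketch of the computation for discrete distributions, assuming NumPy and SciPy are available; the `kl_divergence` and `js_divergence` helpers are illustrative names, not part of any library.

```python
import numpy as np
from scipy.special import rel_entr  # elementwise p * log(p / q), with rel_entr(0, y) = 0

def kl_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """D_KL(P || Q) for discrete distributions given as probability vectors."""
    return float(np.sum(rel_entr(p, q)))

def js_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """JS(P || Q) = 0.5 * D_KL(P || M) + 0.5 * D_KL(Q || M), with M = 0.5 * (P + Q)."""
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.3, 0.3, 0.4])
print(js_divergence(p, q))  # a small positive value, in nats
```

For comparison, `scipy.spatial.distance.jensenshannon` computes the square root of this quantity (the Jensen–Shannon distance), so its output must be squared to match the divergence above.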
Jensen–Shannon divergence is symmetric and bounded, making it suitable for adversarial learning.
Symmetry means that $\mathrm{JS}(P \parallel Q) = \mathrm{JS}(Q \parallel P)$, unlike the KL divergence, which is asymmetric. The JS divergence is also always bounded, between 0 and $\log 2$ with the natural logarithm (or between 0 and 1 with base-2 logarithms), making it numerically stable and easier to interpret in optimization problems.
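Both properties are easy to check numerically. The snippet below reuses the `js_divergence` helper from the sketch above with two arbitrary probability vectors; in natural-log units the upper bound is $\log 2 \approx 0.693$.

```python
import numpy as np

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.6, 0.1, 0.3])

# Symmetry: swapping the arguments gives the same value.
print(np.isclose(js_divergence(p, q), js_divergence(q, p)))  # True

# Boundedness: the value always lies in [0, log 2] (natural-log units).
print(0.0 <= js_divergence(p, q) <= np.log(2))               # True
```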
In the context of GANs, the generator and discriminator are engaged in a minimax game: the generator tries to produce samples that are indistinguishable from the real data, while the discriminator tries to tell them apart. The original GAN formulation uses the JS divergence as the theoretical basis for its loss function: when the discriminator is optimal, the generator's objective reduces (up to constants) to minimizing the JS divergence between the real and generated data distributions. This choice is motivated by the fact that the JS divergence provides a meaningful, symmetric measure of the overlap between those two distributions. If they do not overlap at all, the JS divergence sits at its maximum; as the generator improves and the distributions begin to overlap, the divergence decreases, guiding the model toward better performance.
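The sketch below shows how this minimax game is typically written down in code, assuming PyTorch; the tiny MLPs, batch size, and latent dimension are purely illustrative and not tied to any particular GAN architecture.

```python
import torch
import torch.nn as nn

# Illustrative toy networks: 2-D "data", 16-D latent noise.
G = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))  # generator
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))   # discriminator (outputs logits)

bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

real = torch.randn(64, 2)  # stand-in for a batch of real samples
z = torch.randn(64, 16)    # latent noise

# Discriminator step: maximize log D(x) + log(1 - D(G(z))),
# i.e. minimize the two binary cross-entropy terms below.
fake = G(z).detach()
d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
opt_d.zero_grad()
d_loss.backward()
opt_d.step()

# Generator step, minimax form: minimize log(1 - D(G(z))).
# (In practice the non-saturating variant, bce(D(G(z)), ones), is usually
# preferred because it gives stronger gradients early in training.)
g_loss = -bce(D(G(z)), torch.zeros(64, 1))
opt_g.zero_grad()
g_loss.backward()
opt_g.step()
```

Note that the discriminator outputs raw logits and the sigmoid lives inside `BCEWithLogitsLoss`, which is the numerically stable way to express these cross-entropy terms in PyTorch.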
However, the bounded nature of the JS divergence can also lead to problematic training dynamics. When the real and generated data distributions have little or no overlap, the divergence saturates near its maximum of $\log 2$, so the gradient it provides to the generator can vanish, making it difficult for the generator to improve. This phenomenon is known as the vanishing gradient problem in GANs. Despite this limitation, the JS divergence remains a foundational concept for understanding how GANs measure the distance between probability distributions and why certain loss functions are chosen for adversarial learning.
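The saturation behaviour is easy to reproduce numerically. Reusing the `js_divergence` helper from the first sketch, two point masses with disjoint support give $\mathrm{JS} = \log 2$ no matter how far apart they sit, so nudging the generated distribution "closer" changes nothing about the loss:

```python
import numpy as np

# Reuses js_divergence from the first sketch. Two point masses that never
# share support: however far apart they are placed, JS stays at log 2,
# so the loss surface gives the generator nothing to follow.
n = 10
for offset in range(1, 6):
    p = np.zeros(n)
    p[0] = 1.0        # "real" distribution: a point mass at bin 0
    q = np.zeros(n)
    q[offset] = 1.0   # "generated" distribution: a point mass shifted by `offset`
    print(offset, round(js_divergence(p, q), 4), round(np.log(2), 4))
```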