Divergences, Latent Variables, and the Core Questions
At the heart of modern Artificial Intelligence lies a beautifully complex challenge: the ability to model intricate data distributions that govern the world around us. Consider the rich variety of information we encounter daily—the subtle interplay of pixels in photographic images, the nuanced flow of natural language, the temporal patterns of human speech, the elegant folding structures of proteins, or the seemingly chaotic yet meaningful fluctuations of financial markets. All of these diverse domains share a profound mathematical commonality: the data we observe emerges from some unknown, high-dimensional probability distribution.
This fundamental insight leads us to the core questions that generative modeling seeks to address:
1. Realistic Sample Generation
Modern AI’s most impressive achievements stem from successfully learning data distributions. GANs create photorealistic images indistinguishable from real photographs. Diffusion models generate stunning text-to-image outputs that rival human artistry. Large language models produce coherent, contextually rich text that captures the subtleties of human communication. Each breakthrough becomes possible because these models learn to approximate $P_X$—the probability distribution underlying their training data—enabling them to sample from this learned distribution to create genuinely novel outputs.
2. Representation Learning Through Distribution Modeling
When models attempt to capture $P_X$, they naturally discover and encode meaningful latent representations of the data. These representations possess a remarkable quality: they capture semantic relationships without explicit supervision. In facial recognition, models learn to distinguish “smiling” from “frowning,” “young” from “old,” “casual” from “formal”—all without labeled examples. These rich, unsupervised features become invaluable assets for downstream applications across diverse domains.
3. Uncertainty Quantification
Unlike deterministic models that output single predictions, generative models provide complete probabilistic descriptions of data. This framework enables principled uncertainty reasoning—absolutely crucial in scientific research, medical diagnosis, and engineering applications where understanding confidence levels and potential variations matters as much as the predictions themselves. The probabilistic nature enables Monte Carlo methods, Bayesian inference, and robust decision-making under uncertainty.
4. Exploring the Rare and Unseen
Generative modeling unlocks the ability to simulate scenarios that may be sparse or absent from training data. Want to model century-scale climate events? Design novel molecular structures? Anticipate unprecedented market conditions? Through their learned understanding of underlying distributions, these models can extrapolate beyond historical observations, generating plausible scenarios that expand our understanding and preparation capabilities.
5. Theoretical Elegance
From a mathematical perspective, generative modeling represents a beautiful intersection of information theory, probability, and optimization. Every major approach—GANs, VAEs, Diffusion Models—can be understood as different methods for projecting one probability distribution onto another. Despite their varied formulations and optimization strategies, they share the elegant goal of learning complex, high-dimensional distributions through mathematically tractable frameworks.
Consider that we are given a dataset
\begin{equation} D = \{x_1, x_2, \dots, x_n\}, \quad x_i \overset{\text{iid}}{\sim} P_X \label{eq:data-sample} \end{equation}
where $P_X$ represents the true but unknown distribution governing our data—whether we’re dealing with natural images, audio waveforms, text sequences, or any other complex data modality.
Our Goal: Estimate the underlying distribution $P_X$ and develop the capability to generate new samples that appear as though they were authentically drawn from this same distribution.
The fundamental approach to generative modeling can be elegantly decomposed into three interconnected steps:
1. Assume a Model Family
We begin by selecting a parametric family of distributions $\{P_\theta : \theta \in \Theta\}$ that we hypothesize can effectively approximate the true distribution $P_X$. In contemporary practice, $P_\theta$ is typically represented through the expressive power of deep neural networks, where $\theta$ encompasses all the learnable parameters—the weights and biases that define our model’s capacity to capture complex patterns.
2. Define a Divergence Measure
The next crucial step involves introducing a mathematically principled measure of difference between distributions—a divergence or distance metric—that quantifies how far our model distribution $P_\theta$ deviates from the true data distribution $P_X$. We denote this divergence as $D(P_{X} \, \| \, P_{\theta})$. The choice of divergence profoundly influences the learning dynamics and the types of solutions our model will discover.
3. Optimization Process
Finally, we formulate the learning problem as an optimization challenge:
\begin{equation} \theta^{*} = \arg \min_{\theta} D(P_{X} \, || \, P_{\theta}) \label{eq:kl-min} \end{equation}
This optimization yields parameters $\theta^*$ that minimize the chosen divergence, ensuring our model distribution $P_{\theta^*}$ approximates the true data distribution as closely as possible under our mathematical framework.
Once we successfully complete this three-step process, the magic happens: we can generate entirely new samples by drawing from our learned distribution $P_{\theta^*}$. These samples, while never seen during training, should capture the essential characteristics and patterns present in the original data distribution.
This framework’s beauty lies in its generality—different generative models (GANs, VAEs, Diffusion Models) essentially differ in their choices for steps 1 and 2, but they all follow this fundamental mathematical blueprint for learning to generate realistic data.
Consider starting with a latent variable $z \in \mathbb{R}^k$—a vector in some lower-dimensional space—drawn from a simple, well-understood distribution such as
\[z \sim \mathcal{N}(0, I).\]
This choice of a standard multivariate Gaussian provides us with an easily sampled, mathematically tractable foundation from which to build complexity.
Now we introduce a transformation function
\[g_\theta : \mathbb{R}^k \to \mathcal{X},\]
parameterized by learnable parameters $\theta$, which maps our simple latent variable $z$ into the complex data space $\mathcal{X}$. This function generates data-like samples through the transformation
\[\tilde{x} = g_\theta(z).\]
The resulting distribution of $\tilde{x}$ depends entirely on how we design and parameterize $g_\theta$, revealing the profound impact of our modeling choices:
Linear Transformations: If $g_\theta$ represents a linear mapping, the generated samples $\tilde{x}$ will maintain the Gaussian structure of the input, resulting in another Gaussian distribution—useful but limited in expressiveness.
Deep Neural Networks: When $g_\theta$ is implemented as a deep neural network with nonlinear activations, the generated samples $\tilde{x}$ can exhibit extraordinarily complex distributional properties. The network’s layers progressively transform the simple Gaussian input through a series of nonlinear operations, potentially capturing intricate patterns, multimodal structures, and sophisticated dependencies that characterize real-world data.
We formally denote the distribution of the generated samples $\tilde{x}$ as $P_\theta$. This distribution represents the pushforward of the simple Gaussian prior through our transformation function—essentially, $P_\theta$ captures the distribution of all possible samples we can generate by passing Gaussian noise through our parameterized neural network $g_\theta$.
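To make the pushforward concrete, here is a minimal NumPy sketch (all shapes, seeds, and parameter values are illustrative, not part of any particular model) contrasting a linear map, whose output stays Gaussian, with a small nonlinear network:

```python
import numpy as np

rng = np.random.default_rng(0)

def g_linear(z, A, b):
    # A linear map keeps the pushforward Gaussian: x ~ N(b, A A^T).
    return z @ A.T + b

def g_mlp(z, W1, b1, W2, b2):
    # One hidden layer with tanh already bends the Gaussian prior
    # into a generally non-Gaussian pushforward distribution.
    h = np.tanh(z @ W1.T + b1)
    return h @ W2.T + b2

k, d = 2, 2
z = rng.standard_normal((10_000, k))          # z ~ N(0, I)

A = rng.standard_normal((d, k)); b = np.zeros(d)
x_lin = g_linear(z, A, b)                     # still Gaussian

W1 = rng.standard_normal((16, k)); b1 = np.zeros(16)
W2 = rng.standard_normal((d, 16)); b2 = np.zeros(d)
x_mlp = g_mlp(z, W1, b1, W2, b2)              # generally non-Gaussian
```

Both `x_lin` and `x_mlp` are samples from a pushforward distribution $P_\theta$; only the nonlinear version has the capacity to leave the Gaussian family.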
Our objective is to minimize the discrepancy between the model distribution $P_\theta$ and the true data distribution $P_X$, which yields the optimization problem shown in Equation \eqref{eq:kl-min}.
Upon successful convergence to the optimal parameters $\theta^*$, the optimization establishes a formal equivalence between two sampling procedures: drawing $z \sim \mathcal{N}(0, I)$ and computing $\tilde{x} = g_{\theta^*}(z)$, versus drawing $x \sim P_X$ directly.
This equivalence can be stated formally as: the pushforward measure of the prior distribution $\mathcal{N}(0, I)$ under the learned transformation $g_{\theta^*}$ approximates the true data distribution $P_X$. The framework demonstrates that we have successfully parameterized a complex, high-dimensional probability distribution $P_X$ through the composition of a simple prior and a learnable deterministic mapping. The generator function $g_{\theta^*}$ effectively encodes the statistical structure of the data manifold, enabling tractable sampling from an otherwise intractable distribution through the transformation of easily sampled noise.
A fundamental breakthrough in generative modeling stems from a powerful theoretical result: every $f$-divergence can be expressed as a variational optimization problem involving only expectations. This dual formulation eliminates the need for explicit density computations and forms the mathematical foundation for adversarial training methods.
The proof leverages the Fenchel–Young inequality from convex analysis, $f(t) \geq t\,u - f^{*}(u)$, where $f^{*}$ denotes the convex conjugate of $f$. Applying it pointwise to the likelihood ratio $p(x)/q(x)$ yields the variational representation
\[D_f(P \,\|\, Q) \;=\; \sup_{T} \; \mathbb{E}_{x\sim P}[T(x)] \;-\; \mathbb{E}_{y\sim Q}[f^{*}(T(y))],\]
where the supremum ranges over critic functions $T : \mathcal{X} \to \mathbb{R}$.
To see how the variational representation works in practice, let’s compute the forward KL divergence between two Bernoulli distributions.
Let \(P = \mathrm{Bernoulli}(p), \quad Q = \mathrm{Bernoulli}(q),\) with support $\{0,1\}$.
For forward KL, we have \(f(t) = t \log t, \qquad f^{*}(u) = e^{u-1}.\)
Suppose the critic $T$ takes values \(T(0) = a, \quad T(1) = b.\)
The variational objective is \(\mathcal{J}(a,b) = \underbrace{\mathbb{E}_{x\sim P}[T(x)]}_{(1-p)a + p b} - \underbrace{\mathbb{E}_{y\sim Q}[f^{*}(T(y))]}_{(1-q)e^{a-1} + q e^{b-1}}.\)
Step 1. Optimize over $a,b$ (the critic).
Take derivatives and set to zero: \(\frac{\partial \mathcal{J}}{\partial a} = (1-p) - (1-q)e^{a-1} = 0 \;\;\Rightarrow\;\; a^{*} = 1 + \log\frac{1-p}{1-q},\) \(\frac{\partial \mathcal{J}}{\partial b} = p - q e^{b-1} = 0 \;\;\Rightarrow\;\; b^{*} = 1 + \log\frac{p}{q}.\)
Step 2. Plug back.
\[\sup_{a,b}\,\mathcal{J}(a,b) = (1-p)\log\frac{1-p}{1-q} + p \log\frac{p}{q}.\]
But this is exactly the forward KL divergence: \(D_{\mathrm{KL}}(P\|Q) = p\log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q}.\)
Intuition.
This variational representation transforms an intractable density-based computation into a tractable optimization problem over critic functions $T$. In practice, $T$ is parameterized by a neural network (the discriminator in GANs), enabling gradient-based optimization without explicit density estimation—the mathematical foundation that makes adversarial training possible.
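The Bernoulli derivation above is easy to check numerically. The sketch below (with arbitrary values of $p$ and $q$) plugs the optimal critic values $a^*, b^*$ into $\mathcal{J}$ and compares against the closed-form KL:

```python
import numpy as np

def kl_bernoulli(p, q):
    # Closed-form forward KL between Bernoulli(p) and Bernoulli(q).
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def variational_objective(a, b, p, q):
    # J(a, b) = E_P[T] - E_Q[f*(T)] with f*(u) = exp(u - 1).
    return (1 - p) * a + p * b - (1 - q) * np.exp(a - 1) - q * np.exp(b - 1)

p, q = 0.7, 0.4
# Optimal critic values obtained by setting the derivatives to zero.
a_star = 1 + np.log((1 - p) / (1 - q))
b_star = 1 + np.log(p / q)

sup_J = variational_objective(a_star, b_star, p, q)
print(sup_J, kl_bernoulli(p, q))  # the two values agree
```

Any suboptimal critic, e.g. $a = b = 0$, yields a strictly smaller value of $\mathcal{J}$, illustrating that the variational bound is tight only at the optimal critic.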
The mathematical framework immediately gives rise to two fundamental questions that determine both the theoretical foundation and practical implementation of any generative modeling approach: how can the divergence be estimated from data alone, and which divergence should we choose?
The practical reality constrains us to work exclusively with samples: a finite dataset $x_1, \dots, x_n$ drawn from $P_X$, and samples $\tilde{x} = g_\theta(z)$ drawn from $P_\theta$. The densities $p_X(x)$ and, in many models, $p_\theta(x)$ are never available in closed form.
Consequently, all divergence estimation must proceed through sample-based approximations rather than analytical density computations.
1. Likelihood-Based Models
When the model density $p_\theta(x)$ admits tractable computation (as in normalizing flows), we can directly maximize the data likelihood:
\[\theta^* = \arg\max_\theta \sum_{i=1}^n \log p_\theta(x_i),\]
which corresponds precisely to minimizing the forward KL divergence $D_{\mathrm{KL}}(P_X \,\|\, P_\theta)$.
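As a sanity check on this correspondence, the sketch below fits a one-dimensional Gaussian model family by maximum likelihood (the data-generating parameters are arbitrary): the sample mean and standard deviation maximize the average log-likelihood, and any perturbation lowers it.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.5, size=50_000)   # samples from P_X

def avg_log_lik(x, mu, sigma):
    # (1/n) * sum_i log p_theta(x_i) for a Gaussian model family.
    return np.mean(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (x - mu)**2 / (2 * sigma**2))

# The MLE solution is the sample mean and (biased) sample std.
mu_hat, sigma_hat = x.mean(), x.std()

# Any other parameter choice gives a lower average log-likelihood,
# i.e. a larger forward KL to the empirical data distribution.
best = avg_log_lik(x, mu_hat, sigma_hat)
assert best > avg_log_lik(x, mu_hat + 0.5, sigma_hat)
assert best > avg_log_lik(x, mu_hat, sigma_hat * 1.5)
```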
2. Variational Lower Bounds
For intractable densities $p_\theta(x)$, we construct tractable lower bounds. Variational Autoencoders employ the Evidence Lower Bound (ELBO) to approximate $\log p_\theta(x)$ through a differentiable surrogate objective amenable to gradient-based optimization.
3. Adversarial Estimation via Dual Representations
Recall from the Fenchel duality result that $f$-divergences can be written as a supremum over critics $T$.
In practice, we approximate this supremum using neural networks as critics/discriminators.
Training alternates between maximizing over $T$ (critic) and minimizing over $\theta$ (generator), which is precisely the principle behind GANs.
4. Optimal Transport Theory
A second powerful divergence family comes from Optimal Transport.
Instead of comparing distributions pointwise (as with $f$-divergences), Optimal Transport measures the cost of moving probability mass from one distribution to another.
Definition (Primal Wasserstein-1 distance).
For two distributions $P$ and $Q$ on a metric space $(\mathcal{X}, d)$,
\begin{equation} W_1(P,Q) \;=\; \inf_{\pi \in \Pi(P,Q)} \; \mathbb{E}_{(x,y)\sim \pi}[\,d(x,y)\,] \label{eq:wasserstein-1} \end{equation}
where $\Pi(P,Q)$ is the set of all couplings (joint distributions) whose marginals are $P$ and $Q$.
The Kantorovich–Rubinstein duality theorem establishes equality between this primal optimal transport formulation and a dual formulation over 1-Lipschitz functions:
\[W_1(P,Q) \;=\; \sup_{\|T\|_{L} \le 1} \; \mathbb{E}_{x\sim P}[T(x)] - \mathbb{E}_{y\sim Q}[T(y)].\]
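In one dimension the infimum in Eq. \eqref{eq:wasserstein-1} is attained by the monotone (quantile) coupling, which pairs the $i$-th smallest sample of $P$ with the $i$-th smallest sample of $Q$. The sketch below (sample sizes and Gaussians chosen arbitrarily) uses this to estimate $W_1$ between two equal-variance Gaussians, which is known to equal the mean shift:

```python
import numpy as np

def w1_empirical(x, y):
    # In 1D the optimal coupling is the monotone (quantile) coupling,
    # so W_1 reduces to the mean gap between sorted samples.
    assert len(x) == len(y)
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

rng = np.random.default_rng(2)
n = 100_000
x = rng.normal(0.0, 1.0, n)
y = rng.normal(3.0, 1.0, n)

# For two Gaussians with equal variance, W_1 equals the mean shift.
print(w1_empirical(x, y))  # close to 3.0
```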
5. Score Function Matching
Diffusion models circumvent explicit likelihood computation by matching score functions—the gradients of log-densities—between data and model distributions across multiple noise scales. This approach avoids density estimation entirely while maintaining theoretical rigor.
The selection of divergence measure fundamentally determines the algorithmic behavior, optimization dynamics, and failure modes of the resulting generative model. Each divergence embodies different mathematical principles and leads to distinct practical outcomes.
To render the optimization problem in Eq. \eqref{eq:kl-min} mathematically well-defined, we must establish a rigorous divergence measure that quantifies the discrepancy between probability distributions. A particularly elegant and theoretically unified framework is provided by the $f$-divergence family.
Consider two probability distributions $P$ and $Q$ defined over the same measurable space $\mathcal{X}$, with corresponding probability density functions $p(x)$ and $q(x)$ with respect to some common dominating measure.
For a strictly convex function $f: \mathbb{R}^+ \to \mathbb{R}$ satisfying the normalization condition $f(1) = 0$, the $f$-divergence is formally defined as:
\begin{equation}
\begin{aligned}
D_f(P \,\|\, Q) &= \int_{\mathcal{X}} q(x)\, f\!\left(\tfrac{p(x)}{q(x)}\right) dx \\
&= \mathbb{E}_{x \sim Q}\!\left[f\!\left(\tfrac{p(x)}{q(x)}\right)\right]
\end{aligned}
\label{eq:f-div}
\end{equation}
The quantity $\frac{p(x)}{q(x)}$ represents the likelihood ratio (equivalently termed the Radon–Nikodym derivative $\frac{dP}{dQ}$), which encodes the local preference of distribution $P$ relative to distribution $Q$ at each point $x$.
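Equation \eqref{eq:f-div} translates directly into code on a finite support. The sketch below (the distributions are arbitrary) evaluates $\mathbb{E}_{x\sim Q}[f(p(x)/q(x))]$ for two choices of $f$ and cross-checks against the direct formulas:

```python
import numpy as np

def f_divergence(p, q, f):
    # D_f(P || Q) = E_{x~Q}[ f(p(x)/q(x)) ] on a finite support.
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(q * f(p / q))

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.5, 0.4, 0.1])

kl = f_divergence(p, q, lambda u: u * np.log(u))        # forward KL
tv = f_divergence(p, q, lambda u: 0.5 * np.abs(u - 1))  # total variation

# Cross-check against the direct density-based formulas.
assert np.isclose(kl, np.sum(p * np.log(p / q)))
assert np.isclose(tv, 0.5 * np.sum(np.abs(p - q)))
```

Swapping in a different generator $f$ (reverse KL, JS, and so on) requires no other change, which is precisely the generality the framework promises.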
The $f$-divergence framework possesses several crucial theoretical properties:
1. Non-Negativity
\[D_f(P\|Q) \;\geq\; 0 \quad \text{for all probability distributions } P, Q.\]
2. Identity of Indiscernibles
\[D_f(P\|Q) \;=\; 0 \quad \text{if and only if} \quad P = Q \text{ almost everywhere}.\]
These properties follow rigorously from Jensen’s inequality applied to the convex function $f$ and the probabilistic interpretation of the expectation operator.
The generality of the $f$-divergence framework encompasses numerous classical divergence measures through appropriate selection of the generating function $f$:
1. Kullback–Leibler (Forward KL) Divergence
\begin{equation}
\begin{aligned}
f(u) &= u \log u, \\
D_{\mathrm{KL}}(P \,\|\, Q) &= \int p(x)\, \log \frac{p(x)}{q(x)}\, dx
\end{aligned}
\label{eq:kl-def}
\end{equation}
Mode-covering behavior: Penalizes $Q$ for assigning low probability to regions where $P$ has significant mass. Consider a dataset of human faces where the true distribution $P$ has three distinct modes: 60% young faces, 30% middle-aged faces, and 10% elderly faces. If our model $Q$ learns to generate only young and middle-aged faces while completely ignoring elderly faces (setting $q(x) \approx 0$ for elderly face regions), the forward KL divergence $D_{\mathrm{KL}}(P \, \| \, Q) = \int p(x) \log \frac{p(x)}{q(x)} dx$ will explode because $\frac{p(x)}{q(x)} \to \infty$ in elderly face regions where $p(x) = 0.1$ but $q(x) \approx 0$. To minimize this divergence, the model is forced to assign reasonable probability to all face types, including the rare elderly faces—it cannot simply ignore 10% of the data distribution without suffering enormous penalties. This results in a generator that covers all modes of the face distribution, producing diverse outputs across all age groups, though individual samples might appear somewhat blurred or averaged as the model spreads its probability mass broadly to avoid missing any demographic.
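The face-age example can be played out numerically. In the hypothetical sketch below (all probabilities are illustrative), a model that nearly ignores the 10% elderly mode pays a far larger forward KL than one that covers all three modes:

```python
import numpy as np

def kl(p, q):
    # Forward KL on a finite support.
    return np.sum(p * np.log(p / q))

# Hypothetical face-age distribution: young, middle-aged, elderly.
p = np.array([0.60, 0.30, 0.10])

q_covering = np.array([0.55, 0.35, 0.10])     # keeps all modes
q_dropping = np.array([0.65, 0.349, 0.001])   # nearly ignores elderly

print(kl(p, q_covering))  # small
print(kl(p, q_dropping))  # large: the 0.1 * log(0.10/0.001) term dominates
```

Driving the third entry of `q_dropping` further toward zero makes the divergence grow without bound, which is exactly the mode-covering pressure described above.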
2. Reverse KL Divergence
\begin{equation} \begin{aligned} f(u) &= -\log u, \\ D_{\mathrm{KL}}(Q \,\|\, P) &= \int q(x)\, \log \frac{q(x)}{p(x)}\, dx \end{aligned} \label{eq:reverse-kl} \end{equation}
Mode-seeking behavior: the penalty structure is reversed. $Q$ is punished for placing mass where $P$ is small, but not for ignoring regions where $P$ has mass, so a model minimizing reverse KL can drop minor modes entirely and concentrate on the dominant ones, trading diversity for sharper individual samples.
3. Jensen–Shannon (JS) Divergence
\[f(u) = \frac{1}{2}\left(u \log u - (u+1)\log\frac{u+1}{2}\right),\]
yielding the symmetric formulation:
\begin{equation} D_{\mathrm{JS}}(P \,\|\, Q) = \tfrac{1}{2}\, D_{\mathrm{KL}}\left(P \,\Big\|\, \tfrac{P+Q}{2}\right) + \tfrac{1}{2}\, D_{\mathrm{KL}}\left(Q \,\Big\|\, \tfrac{P+Q}{2}\right) \label{eq:js} \end{equation}
4. Total Variation (TV) Distance
\begin{equation} \begin{aligned} f(u) &= \tfrac{1}{2}\,|u-1|, \\ D_{\mathrm{TV}}(P, Q) &= \tfrac{1}{2} \int |p(x) - q(x)|\, dx \end{aligned} \label{eq:tv} \end{equation}
So far, we’ve established the mathematical blueprint of generative modeling: assume a parametric family $\{P_\theta\}$, define a divergence $D(P_X \,\|\, P_\theta)$, and optimize $\theta$ to minimize it.
We’ve also seen how $f$-divergences and Wasserstein distances provide the theoretical building blocks for this optimization. But there’s still one crucial gap: how do we actually compute and minimize these divergences when the densities $p_X$ and $p_\theta$ are unknown or intractable?
The answer comes from two key insights:
Sample-based expectations.
By the Law of Large Numbers, integrals with respect to $p_X$ or $p_\theta$ can be replaced by averages over samples. This means we can approximate expectations without ever writing down explicit densities.
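A one-line illustration of this idea: the second moment of a standard Gaussian is exactly 1, and a plain sample average recovers it without ever evaluating a density (the sample size is arbitrary).

```python
import numpy as np

rng = np.random.default_rng(3)

# E_{z ~ N(0,1)}[z^2] = 1 exactly; a Monte Carlo average over
# samples approximates the expectation density-free.
z = rng.standard_normal(1_000_000)
estimate = np.mean(z**2)
print(estimate)  # close to 1.0
```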
Variational Divergence Minimization (VDM).
Using Fenchel conjugacy, any $f$-divergence can be rewritten as a variational optimization problem involving a critic function $T_w$. Parameterizing $T_w$ with a neural network and alternating optimization of $(\theta, w)$ gives rise to a saddle-point problem:
\[\min_{\theta}\, \max_{w}\; \mathbb{E}_{x\sim P_{X}}[T_w(x)] \;-\; \mathbb{E}_{y\sim P_{\theta}}[f^{*}(T_w(y))].\]
This adversarial min–max game is not just an abstract construction. When we choose the $f$ corresponding to the Jensen–Shannon divergence (with the critic reparameterized through a discriminator), the resulting objective recovers, up to an additive constant, the Generative Adversarial Network (GAN) loss:
\[J_{\text{GAN}}(\theta,w) = \mathbb{E}_{x\sim P_{X}}[\log D_w(x)] + \mathbb{E}_{y\sim P_{\theta}}[\log(1-D_w(y))],\]
where $D_w(x)$ is the discriminator network.
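On a finite support this correspondence can be verified directly: with the optimal discriminator $D^*(x) = p(x)/(p(x)+q(x))$, the GAN objective equals $2\,D_{\mathrm{JS}}(P \,\|\, Q) - \log 4$. A sketch with arbitrary discrete distributions:

```python
import numpy as np

def kl(p, q):
    # Forward KL with the 0 * log 0 = 0 convention.
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def js(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.6, 0.3, 0.1])   # "data" distribution
q = np.array([0.2, 0.5, 0.3])   # "generator" distribution

d_star = p / (p + q)            # optimal discriminator per outcome
j_gan = np.sum(p * np.log(d_star)) + np.sum(q * np.log(1 - d_star))

# At the optimal discriminator, the GAN objective equals
# 2 * JS(P || Q) - log 4.
assert np.isclose(j_gan, 2 * js(p, q) - np.log(4))
```

This is the discrete-support version of the classical result that training the generator against an optimal discriminator minimizes the Jensen–Shannon divergence.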
What began as a purely mathematical problem—minimizing divergences between distributions—naturally evolves into the adversarial training paradigm that underlies GANs.
In the next part of this series, we will explore this connection in detail: how the general theory of $f$-divergences and variational duality gives rise to GANs and their modern extensions.