Divergences, Latent Variables, and the Core Questions
At the heart of modern Artificial Intelligence lies a beautifully complex challenge: the ability to model intricate data distributions that govern the world around us. Consider the rich variety of information we encounter daily—the subtle interplay of pixels in photographic images, the nuanced flow of natural language, the temporal patterns of human speech, the elegant folding structures of proteins, or the seemingly chaotic yet meaningful fluctuations of financial markets. All of these diverse domains share a profound mathematical commonality: the data we observe emerges from some unknown, high-dimensional probability distribution.
This fundamental insight leads us to the core questions that generative modeling seeks to address:
1. Realistic Sample Generation
Modern AI’s most impressive achievements stem from successfully learning data distributions. GANs create photorealistic images indistinguishable from real photographs. Diffusion models generate stunning text-to-image outputs that rival human artistry. Large language models produce coherent, contextually rich text that captures the subtleties of human communication. Each breakthrough becomes possible because these models learn to approximate $P_X$—the probability distribution underlying their training data—enabling them to sample from this learned distribution to create genuinely novel outputs.
2. Representation Learning Through Distribution Modeling
When models attempt to capture $P_X$, they naturally discover and encode meaningful latent representations of the data. These representations possess a remarkable quality: they capture semantic relationships without explicit supervision. In facial recognition, models learn to distinguish “smiling” from “frowning,” “young” from “old,” “casual” from “formal”—all without labeled examples. These rich, unsupervised features become invaluable assets for downstream applications across diverse domains.
3. Uncertainty Quantification
Unlike deterministic models that output single predictions, generative models provide complete probabilistic descriptions of data. This framework enables principled uncertainty reasoning—absolutely crucial in scientific research, medical diagnosis, and engineering applications where understanding confidence levels and potential variations matters as much as the predictions themselves. The probabilistic nature enables Monte Carlo methods, Bayesian inference, and robust decision-making under uncertainty.
4. Exploring the Rare and Unseen
Generative modeling unlocks the ability to simulate scenarios that may be sparse or absent from training data. Want to model century-scale climate events? Design novel molecular structures? Anticipate unprecedented market conditions? Through their learned understanding of underlying distributions, these models can extrapolate beyond historical observations, generating plausible scenarios that expand our understanding and preparation capabilities.
5. Theoretical Elegance
From a mathematical perspective, generative modeling represents a beautiful intersection of information theory, probability, and optimization. Every major approach—GANs, VAEs, Diffusion Models—can be understood as different methods for projecting one probability distribution onto another. Despite their varied formulations and optimization strategies, they share the elegant goal of learning complex, high-dimensional distributions through mathematically tractable frameworks.
Consider that we are given a dataset
\begin{equation} D = \{x_1, x_2, \dots, x_n\}, \quad x_i \overset{\text{iid}}{\sim} P_X \label{eq:data-sample} \end{equation}
where $P_X$ represents the true but unknown distribution governing our data—whether we’re dealing with natural images, audio waveforms, text sequences, or any other complex data modality.
Our Goal: Estimate the underlying distribution $P_X$ and develop the capability to generate new samples that appear as though they were authentically drawn from this same distribution.
The fundamental approach to generative modeling can be elegantly decomposed into three interconnected steps:
1. Assume a Model Family
We begin by selecting a parametric family of distributions $\{P_\theta : \theta \in \Theta\}$ that we hypothesize can effectively approximate the true distribution $P_X$. In contemporary practice, $P_\theta$ is typically represented through the expressive power of deep neural networks, where $\theta$ encompasses all the learnable parameters—the weights and biases that define our model’s capacity to capture complex patterns.
2. Define a Divergence Measure
The next crucial step involves introducing a mathematically principled measure of difference between distributions—a divergence or distance metric—that quantifies how far our model distribution $P_\theta$ deviates from the true data distribution $P_X$. We denote this divergence as $D(P_{X} \, \| \, P_{\theta})$. The choice of divergence profoundly influences the learning dynamics and the types of solutions our model will discover.
3. Optimization Process
Finally, we formulate the learning problem as an optimization challenge:
\begin{equation} \theta^{*} = \arg \min_{\theta} D(P_{X} \, || \, P_{\theta}) \label{eq:kl-min} \end{equation}
This optimization yields parameters $\theta^*$ that minimize the chosen divergence, ensuring our model distribution $P_{\theta^*}$ approximates the true data distribution as closely as possible under our mathematical framework.
Once we successfully complete this three-step process, the magic happens: we can generate entirely new samples by drawing from our learned distribution $P_{\theta^*}$. These samples, while never seen during training, should capture the essential characteristics and patterns present in the original data distribution.
This framework’s beauty lies in its generality—different generative models (GANs, VAEs, Diffusion Models) essentially differ in their choices for steps 1 and 2, but they all follow this fundamental mathematical blueprint for learning to generate realistic data.
Consider starting with a latent variable $z \in \mathbb{R}^k$—a vector in some lower-dimensional space—drawn from a simple, well-understood distribution such as
\[z \sim \mathcal{N}(0, I).\]
This choice of a standard multivariate Gaussian provides us with an easily sampled, mathematically tractable foundation from which to build complexity.
Now we introduce a transformation function
\[g_\theta : \mathbb{R}^k \to \mathcal{X},\]
parameterized by learnable parameters $\theta$, which maps our simple latent variable $z$ into the complex data space $\mathcal{X}$. This function generates data-like samples through the transformation
\[\tilde{x} = g_\theta(z).\]
The resulting distribution of $\tilde{x}$ depends entirely on how we design and parameterize $g_\theta$, revealing the profound impact of our modeling choices:
Linear Transformations: If $g_\theta$ represents a linear mapping, the generated samples $\tilde{x}$ will maintain the Gaussian structure of the input, resulting in another Gaussian distribution—useful but limited in expressiveness.
Deep Neural Networks: When $g_\theta$ is implemented as a deep neural network with nonlinear activations, the generated samples $\tilde{x}$ can exhibit extraordinarily complex distributional properties. The network’s layers progressively transform the simple Gaussian input through a series of nonlinear operations, potentially capturing intricate patterns, multimodal structures, and sophisticated dependencies that characterize real-world data.
We formally denote the distribution of the generated samples $\tilde{x}$ as $P_\theta$. This distribution represents the pushforward of the simple Gaussian prior through our transformation function—essentially, $P_\theta$ captures the distribution of all possible samples we can generate by passing Gaussian noise through our parameterized neural network $g_\theta$.
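To make the pushforward concrete, here is a minimal NumPy sketch (all shapes, seeds, and parameter values are illustrative, not part of any particular model) contrasting a linear map, whose output stays Gaussian, with a small nonlinear network:

```python
import numpy as np

rng = np.random.default_rng(0)

def g_linear(z, A, b):
    # A linear map keeps the pushforward Gaussian: x ~ N(b, A A^T).
    return z @ A.T + b

def g_mlp(z, W1, b1, W2, b2):
    # One hidden layer with tanh already bends the Gaussian prior
    # into a generally non-Gaussian pushforward distribution.
    h = np.tanh(z @ W1.T + b1)
    return h @ W2.T + b2

k, d = 2, 2
z = rng.standard_normal((10_000, k))          # z ~ N(0, I)

A = rng.standard_normal((d, k)); b = np.zeros(d)
x_lin = g_linear(z, A, b)                     # still Gaussian

W1 = rng.standard_normal((16, k)); b1 = np.zeros(16)
W2 = rng.standard_normal((d, 16)); b2 = np.zeros(d)
x_mlp = g_mlp(z, W1, b1, W2, b2)              # generally non-Gaussian
```

Both `x_lin` and `x_mlp` are samples from a pushforward distribution $P_\theta$; only the nonlinear version has the capacity to leave the Gaussian family.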
Our objective is to minimize the discrepancy between the model distribution $P_\theta$ and the true data distribution $P_X$, which yields the optimization problem shown in Equation \eqref{eq:kl-min}.
Upon successful convergence to the optimal parameters $\theta^*$, the optimization establishes a formal equivalence between two sampling procedures: drawing $z \sim \mathcal{N}(0, I)$ and computing $\tilde{x} = g_{\theta^*}(z)$, versus drawing $x \sim P_X$ directly.
This equivalence can be stated formally as: the pushforward measure of the prior distribution $\mathcal{N}(0, I)$ under the learned transformation $g_{\theta^*}$ approximates the true data distribution $P_X$. The framework demonstrates that we have successfully parameterized a complex, high-dimensional probability distribution $P_X$ through the composition of a simple prior and a learnable deterministic mapping. The generator function $g_{\theta^*}$ effectively encodes the statistical structure of the data manifold, enabling tractable sampling from an otherwise intractable distribution through the transformation of easily sampled noise.
A fundamental breakthrough in generative modeling stems from a powerful theoretical result: every $f$-divergence can be expressed as a variational optimization problem involving only expectations. This dual formulation eliminates the need for explicit density computations and forms the mathematical foundation for adversarial training methods.
The proof leverages the Fenchel–Young inequality from convex analysis, $f(t) \geq t\,u - f^{*}(u)$, where $f^{*}$ denotes the convex conjugate of $f$. Applying it pointwise to the likelihood ratio $p(x)/q(x)$ yields the variational representation
\[D_f(P \,\|\, Q) \;=\; \sup_{T} \; \mathbb{E}_{x\sim P}[T(x)] \;-\; \mathbb{E}_{y\sim Q}[f^{*}(T(y))],\]
where the supremum ranges over critic functions $T : \mathcal{X} \to \mathbb{R}$.
To see how the variational representation works in practice, let’s compute the forward KL divergence between two Bernoulli distributions.
Let \(P = \mathrm{Bernoulli}(p), \quad Q = \mathrm{Bernoulli}(q),\) with support $\{0,1\}$.
For forward KL, we have \(f(t) = t \log t, \qquad f^{*}(u) = e^{u-1}.\)
Suppose the critic $T$ takes values \(T(0) = a, \quad T(1) = b.\)
The variational objective is \(\mathcal{J}(a,b) = \underbrace{\mathbb{E}_{x\sim P}[T(x)]}_{(1-p)a + p b} - \underbrace{\mathbb{E}_{y\sim Q}[f^{*}(T(y))]}_{(1-q)e^{a-1} + q e^{b-1}}.\)
Step 1. Optimize over $a,b$ (the critic).
Take derivatives and set to zero: \(\frac{\partial \mathcal{J}}{\partial a} = (1-p) - (1-q)e^{a-1} = 0 \;\;\Rightarrow\;\; a^{*} = 1 + \log\frac{1-p}{1-q},\) \(\frac{\partial \mathcal{J}}{\partial b} = p - q e^{b-1} = 0 \;\;\Rightarrow\;\; b^{*} = 1 + \log\frac{p}{q}.\)
Step 2. Plug back.
\[\sup_{a,b}\,\mathcal{J}(a,b) = (1-p)\log\frac{1-p}{1-q} + p \log\frac{p}{q}.\]
But this is exactly the forward KL divergence: \(D_{\mathrm{KL}}(P\|Q) = p\log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q}.\)
Intuition.
This variational representation transforms an intractable density-based computation into a tractable optimization problem over critic functions $T$. In practice, $T$ is parameterized by a neural network (the discriminator in GANs), enabling gradient-based optimization without explicit density estimation—the mathematical foundation that makes adversarial training possible.
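The Bernoulli derivation above is easy to check numerically. The sketch below (with arbitrary values of $p$ and $q$) plugs the optimal critic values $a^*, b^*$ into $\mathcal{J}$ and compares against the closed-form KL:

```python
import numpy as np

def kl_bernoulli(p, q):
    # Closed-form forward KL between Bernoulli(p) and Bernoulli(q).
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def variational_objective(a, b, p, q):
    # J(a, b) = E_P[T] - E_Q[f*(T)] with f*(u) = exp(u - 1).
    return (1 - p) * a + p * b - (1 - q) * np.exp(a - 1) - q * np.exp(b - 1)

p, q = 0.7, 0.4
# Optimal critic values obtained by setting the derivatives to zero.
a_star = 1 + np.log((1 - p) / (1 - q))
b_star = 1 + np.log(p / q)

sup_J = variational_objective(a_star, b_star, p, q)
print(sup_J, kl_bernoulli(p, q))  # the two values agree
```

Any suboptimal critic, e.g. $a = b = 0$, yields a strictly smaller value of $\mathcal{J}$, illustrating that the variational bound is tight only at the optimal critic.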
The mathematical framework immediately gives rise to two fundamental questions that determine both the theoretical foundation and practical implementation of any generative modeling approach: how can the divergence be estimated from data alone, and which divergence should we choose?
The practical reality constrains us to work exclusively with samples: a finite dataset $x_1, \dots, x_n$ drawn from $P_X$, and samples $\tilde{x} = g_\theta(z)$ drawn from $P_\theta$. The densities $p_X(x)$ and, in many models, $p_\theta(x)$ are never available in closed form.
Consequently, all divergence estimation must proceed through sample-based approximations rather than analytical density computations.
1. Likelihood-Based Models
When the model density $p_\theta(x)$ admits tractable computation (as in normalizing flows), we can directly maximize the data likelihood:
\[\theta^* = \arg\max_\theta \sum_{i=1}^n \log p_\theta(x_i),\]
which corresponds precisely to minimizing the forward KL divergence $D_{\mathrm{KL}}(P_X \,\|\, P_\theta)$.
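As a sanity check on this correspondence, the sketch below fits a one-dimensional Gaussian model family by maximum likelihood (the data-generating parameters are arbitrary): the sample mean and standard deviation maximize the average log-likelihood, and any perturbation lowers it.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.5, size=50_000)   # samples from P_X

def avg_log_lik(x, mu, sigma):
    # (1/n) * sum_i log p_theta(x_i) for a Gaussian model family.
    return np.mean(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (x - mu)**2 / (2 * sigma**2))

# The MLE solution is the sample mean and (biased) sample std.
mu_hat, sigma_hat = x.mean(), x.std()

# Any other parameter choice gives a lower average log-likelihood,
# i.e. a larger forward KL to the empirical data distribution.
best = avg_log_lik(x, mu_hat, sigma_hat)
assert best > avg_log_lik(x, mu_hat + 0.5, sigma_hat)
assert best > avg_log_lik(x, mu_hat, sigma_hat * 1.5)
```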
2. Variational Lower Bounds
For intractable densities $p_\theta(x)$, we construct tractable lower bounds. Variational Autoencoders employ the Evidence Lower Bound (ELBO) to approximate $\log p_\theta(x)$ through a differentiable surrogate objective amenable to gradient-based optimization.
3. Adversarial Estimation via Dual Representations
Recall from the Fenchel duality result that $f$-divergences can be written as a supremum over critics $T$.
In practice, we approximate this supremum using neural networks as critics/discriminators.
Training alternates between maximizing over $T$ (critic) and minimizing over $\theta$ (generator), which is precisely the principle behind GANs.
4. Optimal Transport Theory
A second powerful divergence family comes from Optimal Transport.
Instead of comparing distributions pointwise (as with $f$-divergences), Optimal Transport measures the cost of moving probability mass from one distribution to another.
Definition (Primal Wasserstein-1 distance).
For two distributions $P$ and $Q$ on a metric space $(\mathcal{X}, d)$,
\begin{equation} W_1(P,Q) \;=\; \inf_{\pi \in \Pi(P,Q)} \; \mathbb{E}_{(x,y)\sim \pi}[\,d(x,y)\,] \label{eq:wasserstein-1} \end{equation}
where $\Pi(P,Q)$ is the set of all couplings (joint distributions) whose marginals are $P$ and $Q$.
The Kantorovich–Rubinstein duality theorem establishes equality between this primal optimal transport formulation and a dual formulation over 1-Lipschitz functions:
\[W_1(P,Q) \;=\; \sup_{\|T\|_{L} \le 1} \; \mathbb{E}_{x\sim P}[T(x)] - \mathbb{E}_{y\sim Q}[T(y)].\]
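In one dimension the infimum in Eq. \eqref{eq:wasserstein-1} is attained by the monotone (quantile) coupling, which pairs the $i$-th smallest sample of $P$ with the $i$-th smallest sample of $Q$. The sketch below (sample sizes and Gaussians chosen arbitrarily) uses this to estimate $W_1$ between two equal-variance Gaussians, which is known to equal the mean shift:

```python
import numpy as np

def w1_empirical(x, y):
    # In 1D the optimal coupling is the monotone (quantile) coupling,
    # so W_1 reduces to the mean gap between sorted samples.
    assert len(x) == len(y)
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

rng = np.random.default_rng(2)
n = 100_000
x = rng.normal(0.0, 1.0, n)
y = rng.normal(3.0, 1.0, n)

# For two Gaussians with equal variance, W_1 equals the mean shift.
print(w1_empirical(x, y))  # close to 3.0
```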
5. Score Function Matching
Diffusion models circumvent explicit likelihood computation by matching score functions—the gradients of log-densities—between data and model distributions across multiple noise scales. This approach avoids density estimation entirely while maintaining theoretical rigor.
The selection of divergence measure fundamentally determines the algorithmic behavior, optimization dynamics, and failure modes of the resulting generative model. Each divergence embodies different mathematical principles and leads to distinct practical outcomes.
To render the optimization problem in Eq. \eqref{eq:kl-min} mathematically well-defined, we must establish a rigorous divergence measure that quantifies the discrepancy between probability distributions. A particularly elegant and theoretically unified framework is provided by the $f$-divergence family.
Consider two probability distributions $P$ and $Q$ defined over the same measurable space $\mathcal{X}$, with corresponding probability density functions $p(x)$ and $q(x)$ with respect to some common dominating measure.
For a strictly convex function $f: \mathbb{R}^+ \to \mathbb{R}$ satisfying the normalization condition $f(1) = 0$, the $f$-divergence is formally defined as:
\begin{equation}
\begin{aligned}
D_f(P \,\|\, Q) &= \int_{\mathcal{X}} q(x)\, f\!\left(\tfrac{p(x)}{q(x)}\right) dx \\
&= \mathbb{E}_{x \sim Q}\!\left[f\!\left(\tfrac{p(x)}{q(x)}\right)\right]
\end{aligned}
\label{eq:f-div}
\end{equation}
The quantity $\frac{p(x)}{q(x)}$ represents the likelihood ratio (equivalently termed the Radon–Nikodym derivative $\frac{dP}{dQ}$), which encodes the local preference of distribution $P$ relative to distribution $Q$ at each point $x$.
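Equation \eqref{eq:f-div} translates directly into code on a finite support. The sketch below (the distributions are arbitrary) evaluates $\mathbb{E}_{x\sim Q}[f(p(x)/q(x))]$ for two choices of $f$ and cross-checks against the direct formulas:

```python
import numpy as np

def f_divergence(p, q, f):
    # D_f(P || Q) = E_{x~Q}[ f(p(x)/q(x)) ] on a finite support.
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(q * f(p / q))

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.5, 0.4, 0.1])

kl = f_divergence(p, q, lambda u: u * np.log(u))        # forward KL
tv = f_divergence(p, q, lambda u: 0.5 * np.abs(u - 1))  # total variation

# Cross-check against the direct density-based formulas.
assert np.isclose(kl, np.sum(p * np.log(p / q)))
assert np.isclose(tv, 0.5 * np.sum(np.abs(p - q)))
```

Swapping in a different generator $f$ (reverse KL, JS, and so on) requires no other change, which is precisely the generality the framework promises.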
The $f$-divergence framework possesses several crucial theoretical properties:
1. Non-Negativity
\[D_f(P\|Q) \;\geq\; 0 \quad \text{for all probability distributions } P, Q.\]
2. Identity of Indiscernibles
\[D_f(P\|Q) \;=\; 0 \quad \text{if and only if} \quad P = Q \text{ almost everywhere}.\]
These properties follow rigorously from Jensen’s inequality applied to the convex function $f$ and the probabilistic interpretation of the expectation operator.
The generality of the $f$-divergence framework encompasses numerous classical divergence measures through appropriate selection of the generating function $f$:
1. Kullback–Leibler (Forward KL) Divergence
\begin{equation}
\begin{aligned}
f(u) &= u \log u, \\
D_{\mathrm{KL}}(P \,\|\, Q) &= \int p(x)\, \log \frac{p(x)}{q(x)}\, dx
\end{aligned}
\label{eq:kl-def}
\end{equation}
Mode-covering behavior: Penalizes $Q$ for assigning low probability to regions where $P$ has significant mass. Consider a dataset of human faces where the true distribution $P$ has three distinct modes: 60% young faces, 30% middle-aged faces, and 10% elderly faces. If our model $Q$ learns to generate only young and middle-aged faces while completely ignoring elderly faces (setting $q(x) \approx 0$ for elderly face regions), the forward KL divergence $D_{\mathrm{KL}}(P \, \| \, Q) = \int p(x) \log \frac{p(x)}{q(x)} dx$ will explode because $\frac{p(x)}{q(x)} \to \infty$ in elderly face regions where $p(x) = 0.1$ but $q(x) \approx 0$. To minimize this divergence, the model is forced to assign reasonable probability to all face types, including the rare elderly faces—it cannot simply ignore 10% of the data distribution without suffering enormous penalties. This results in a generator that covers all modes of the face distribution, producing diverse outputs across all age groups, though individual samples might appear somewhat blurred or averaged as the model spreads its probability mass broadly to avoid missing any demographic.
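The face-age example can be played out numerically. In the hypothetical sketch below (all probabilities are illustrative), a model that nearly ignores the 10% elderly mode pays a far larger forward KL than one that covers all three modes:

```python
import numpy as np

def kl(p, q):
    # Forward KL on a finite support.
    return np.sum(p * np.log(p / q))

# Hypothetical face-age distribution: young, middle-aged, elderly.
p = np.array([0.60, 0.30, 0.10])

q_covering = np.array([0.55, 0.35, 0.10])     # keeps all modes
q_dropping = np.array([0.65, 0.349, 0.001])   # nearly ignores elderly

print(kl(p, q_covering))  # small
print(kl(p, q_dropping))  # large: the 0.1 * log(0.10/0.001) term dominates
```

Driving the third entry of `q_dropping` further toward zero makes the divergence grow without bound, which is exactly the mode-covering pressure described above.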
2. Reverse KL Divergence
\begin{equation} \begin{aligned} f(u) &= -\log u, \\ D_{\mathrm{KL}}(Q \,\|\, P) &= \int q(x)\, \log \frac{q(x)}{p(x)}\, dx \end{aligned} \label{eq:reverse-kl} \end{equation}
Mode-seeking behavior: the penalty structure is reversed. $Q$ is punished for placing mass where $P$ is small, but not for ignoring regions where $P$ has mass, so a model minimizing reverse KL can drop minor modes entirely and concentrate on the dominant ones, trading diversity for sharper individual samples.
3. Jensen–Shannon (JS) Divergence
\[f(u) = \frac{1}{2}\left(u \log u - (u+1)\log\frac{u+1}{2}\right),\]
yielding the symmetric formulation:
\begin{equation} D_{\mathrm{JS}}(P \,\|\, Q) = \tfrac{1}{2}\, D_{\mathrm{KL}}\left(P \,\Big\|\, \tfrac{P+Q}{2}\right) + \tfrac{1}{2}\, D_{\mathrm{KL}}\left(Q \,\Big\|\, \tfrac{P+Q}{2}\right) \label{eq:js} \end{equation}
4. Total Variation (TV) Distance
\begin{equation} \begin{aligned} f(u) &= \tfrac{1}{2}\,|u-1|, \\ D_{\mathrm{TV}}(P, Q) &= \tfrac{1}{2} \int |p(x) - q(x)|\, dx \end{aligned} \label{eq:tv} \end{equation}
So far, we’ve established the mathematical blueprint of generative modeling: assume a parametric family $\{P_\theta\}$, define a divergence $D(P_X \,\|\, P_\theta)$, and optimize $\theta$ to minimize it.
We’ve also seen how $f$-divergences and Wasserstein distances provide the theoretical building blocks for this optimization. But there’s still one crucial gap: how do we actually compute and minimize these divergences when the densities $p_X$ and $p_\theta$ are unknown or intractable?
The answer comes from two key insights:
Sample-based expectations.
By the Law of Large Numbers, integrals with respect to $p_X$ or $p_\theta$ can be replaced by averages over samples. This means we can approximate expectations without ever writing down explicit densities.
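A one-line illustration of this idea: the second moment of a standard Gaussian is exactly 1, and a plain sample average recovers it without ever evaluating a density (the sample size is arbitrary).

```python
import numpy as np

rng = np.random.default_rng(3)

# E_{z ~ N(0,1)}[z^2] = 1 exactly; a Monte Carlo average over
# samples approximates the expectation density-free.
z = rng.standard_normal(1_000_000)
estimate = np.mean(z**2)
print(estimate)  # close to 1.0
```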
Variational Divergence Minimization (VDM).
Using Fenchel conjugacy, any $f$-divergence can be rewritten as a variational optimization problem involving a critic function $T_w$. Parameterizing $T_w$ with a neural network and alternating optimization of $(\theta, w)$ gives rise to a saddle-point problem:
\[\min_{\theta}\, \max_{w}\; \mathbb{E}_{x\sim P_{X}}[T_w(x)] \;-\; \mathbb{E}_{y\sim P_{\theta}}[f^{*}(T_w(y))].\]
This adversarial min–max game is not just an abstract construction. When we choose the $f$ corresponding to the Jensen–Shannon divergence (with the critic reparameterized through a discriminator), the resulting objective recovers, up to an additive constant, the Generative Adversarial Network (GAN) loss:
\[J_{\text{GAN}}(\theta,w) = \mathbb{E}_{x\sim P_{X}}[\log D_w(x)] + \mathbb{E}_{y\sim P_{\theta}}[\log(1-D_w(y))],\]
where $D_w(x)$ is the discriminator network.
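On a finite support this correspondence can be verified directly: with the optimal discriminator $D^*(x) = p(x)/(p(x)+q(x))$, the GAN objective equals $2\,D_{\mathrm{JS}}(P \,\|\, Q) - \log 4$. A sketch with arbitrary discrete distributions:

```python
import numpy as np

def kl(p, q):
    # Forward KL with the 0 * log 0 = 0 convention.
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def js(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.6, 0.3, 0.1])   # "data" distribution
q = np.array([0.2, 0.5, 0.3])   # "generator" distribution

d_star = p / (p + q)            # optimal discriminator per outcome
j_gan = np.sum(p * np.log(d_star)) + np.sum(q * np.log(1 - d_star))

# At the optimal discriminator, the GAN objective equals
# 2 * JS(P || Q) - log 4.
assert np.isclose(j_gan, 2 * js(p, q) - np.log(4))
```

This is the discrete-support version of the classical result that training the generator against an optimal discriminator minimizes the Jensen–Shannon divergence.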
What began as a purely mathematical problem—minimizing divergences between distributions—naturally evolves into the adversarial training paradigm that underlies GANs.
In the next part of this series, we will explore this connection in detail: how the general theory of $f$-divergences and variational duality gives rise to GANs and their modern extensions.