Deriving adversarial training from the variational form of $f$-divergences
In the previous post, we defined the general principle of generative modeling as learning a model distribution $P_\theta$ that approximates the true data distribution $P_X$ by minimizing a divergence:
\[\theta^{*} = \arg\min_\theta D(P_X \,\|\, P_\theta).\]But in practice, both $p_X(x)$ and $p_\theta(x)$ are intractable.
We only have samples: a dataset $x_1, \dots, x_n$ drawn from $P_X$, and samples we can generate from $P_\theta$.
So, how can we minimize a divergence without knowing either density?
Let’s start with a basic idea from probability theory: by the law of large numbers, the empirical average of $h(x)$ over i.i.d. samples $x_1, \dots, x_n \sim P_X$ converges to the expectation $\mathbb{E}_{x\sim P_X}[h(x)]$. This means integrals over $p_X$ can be approximated by finite sample averages:
\[\int h(x)p_X(x)\,dx \;\approx\; \frac{1}{n}\sum_{i=1}^n h(x_i).\]This is the foundation of Monte Carlo estimation used across all generative learning paradigms (GANs, VAEs, diffusion models), which optimize expectations rather than explicit densities.
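A minimal numeric sketch of this idea, using a standard normal as a stand-in for $P_X$ and $h(x) = x^2$ (whose true expectation under $\mathcal{N}(0,1)$ is $1$):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw n i.i.d. samples from P_X (here, a standard normal as a stand-in for data).
n = 100_000
samples = rng.standard_normal(n)

# Monte Carlo estimate of E[h(x)] for h(x) = x**2; the true value is Var(x) = 1.
h = lambda x: x ** 2
estimate = np.mean(h(samples))
print(estimate)  # close to 1.0
```

No density ever appears: the estimate uses only the samples, which is exactly the property the variational reformulation below will exploit.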
Recall that for a convex function $f:\mathbb{R}_{+} \to \mathbb{R}$ with $f(1)=0$,
\[D_f(P_X\|P_\theta) = \int p_\theta(x)\,f\!\left(\frac{p_X(x)}{p_\theta(x)}\right)\,dx.\]Since this involves unknown densities, we need a reformulation that depends only on expectations — quantities we can estimate from samples.
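As a concrete instance of this definition, taking $f(u) = u\log u$ (convex, with $f(1)=0$) recovers the KL divergence. A quick numeric check on two small discrete distributions (the specific probability values are arbitrary, chosen only for illustration):

```python
import numpy as np

# KL divergence is the f-divergence with f(u) = u * log(u).
f = lambda u: u * np.log(u)

# Two discrete distributions standing in for p_X and p_theta.
p_x     = np.array([0.5, 0.3, 0.2])
p_theta = np.array([0.4, 0.4, 0.2])

# D_f(P_X || P_theta) = sum_x p_theta(x) * f(p_X(x) / p_theta(x))
d_f = np.sum(p_theta * f(p_x / p_theta))

# Direct KL: sum_x p_X(x) * log(p_X(x) / p_theta(x))
kl = np.sum(p_x * np.log(p_x / p_theta))
print(d_f, kl)  # the two agree
```

Note that computing $D_f$ this way required both densities explicitly, which is precisely what we do not have in practice.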
Define the convex conjugate $f^*(t) = \sup_u \big(ut - f(u)\big)$. By the Fenchel–Young inequality, \(ut \le f(u) + f^*(t), \quad \forall\,u,t,\) with equality when $t = f'(u)$.
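Continuing the KL example: for $f(u) = u\log u$ one can solve the supremum in closed form to get $f^*(t) = e^{t-1}$. A sketch verifying the inequality, and that equality holds exactly at $t = f'(u) = \log u + 1$:

```python
import numpy as np

# For f(u) = u*log(u), the conjugate is f*(t) = sup_u (u*t - f(u)) = exp(t - 1).
f      = lambda u: u * np.log(u)
f_star = lambda t: np.exp(t - 1)

u = 2.0
# Fenchel–Young: u*t <= f(u) + f*(t) for every t ...
for t in [-1.0, 0.0, 1.0, 2.0]:
    assert u * t <= f(u) + f_star(t) + 1e-12

# ... with equality at t = f'(u) = log(u) + 1.
t_star = np.log(u) + 1
gap = f(u) + f_star(t_star) - u * t_star
print(gap)  # ~0
```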
We apply the conjugate in its dual form, $f(u) = \sup_t \big(ut - f^*(t)\big)$, to rewrite $f$ inside the divergence:
\[D_f(P_X\|P_\theta) = \int p_\theta(x)\,\sup_t\Big(t\,\frac{p_X(x)}{p_\theta(x)} - f^*(t)\Big)\,dx.\]Replacing the pointwise supremum with a single test function $T(x)$ can only decrease the value, which yields the variational representation:
\[\boxed{ D_f(P_X\|P_\theta) \ge \sup_T \Big(\mathbb{E}_{x\sim P_X}[T(x)] - \mathbb{E}_{x\sim P_\theta}[f^*(T(x))]\Big). }\]The inequality appears because the space of test functions $T$ we optimize over may not include the exact optimum $T^*(x)=f'\!\big(\frac{p_X(x)}{p_\theta(x)}\big)$.
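Crucially, the right-hand side involves only expectations, so it can be estimated from samples alone via Monte Carlo. A minimal sketch on the discrete KL example (densities are used here only to construct the optimal $T^*$ for checking tightness; the bound itself is computed purely from samples, as a neural discriminator would in an f-GAN):

```python
import numpy as np

rng = np.random.default_rng(0)

# KL as an f-divergence: f(u) = u*log(u), with conjugate f*(t) = exp(t - 1).
f_star = lambda t: np.exp(t - 1)

# Discrete toy distributions over the support {0, 1, 2}.
p_x     = np.array([0.5, 0.3, 0.2])
p_theta = np.array([0.4, 0.4, 0.2])

n = 200_000
x_data  = rng.choice(3, size=n, p=p_x)      # "real" samples from P_X
x_model = rng.choice(3, size=n, p=p_theta)  # "generated" samples from P_theta

# Monte Carlo estimate of the variational bound for a test function T
# (here just a vector of values over the support).
def bound(T):
    return np.mean(T[x_data]) - np.mean(f_star(T[x_model]))

# An arbitrary suboptimal T gives a strict lower bound ...
T_bad = np.zeros(3)
# ... while the optimal T*(x) = f'(p_X/p_theta) = log(p_X/p_theta) + 1 is tight.
T_opt = np.log(p_x / p_theta) + 1

kl = np.sum(p_x * np.log(p_x / p_theta))
print(bound(T_bad), bound(T_opt), kl)  # bound(T_opt) ≈ kl
```

Adversarial training replaces the exhaustive sup over $T$ with an inner maximization over a parametric family (the discriminator), while the generator minimizes the resulting bound.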