Optimizing on the Probability Simplex: PGD vs Softmax Repara

The Problem: Optimizing a Non-Convex Function on the Probability Simplex

You have a categorical probability distribution with $k$ categories, represented as a vector $\mathbf{x} \in \mathbb{R}^k$ on the probability simplex $\Delta^{k-1}$: $\sum x_i = 1$, $x_i \geq 0$. You also have a non-convex function $f(\mathbf{x})$ that is expensive to evaluate and differentiate. Your goal: find $\mathbf{x}^* = \arg\min_{\mathbf{x} \in \Delta^{k-1}} f(\mathbf{x})$.

The motivating example: protein binder design via hallucination. $\mathbf{x}$ is a distribution over 20 amino acids at each position in a sequence (a position-specific scoring matrix), and $f$ is a folding model like AlphaFold. You want the sequence that gives the best fold.

Approach 1: Softmax Reparameterization (The 'Easy' Way)

Your first instinct: eliminate constraints by reparameterizing. Use softmax: $x_i = e^{\ell_i} / \sum_j e^{\ell_j}$, where $\ell \in \mathbb{R}^k$ are unconstrained logits. Now optimize $\ell$ with gradient descent. This is standard practice in ML — it's what you do for classification.

But there's a hidden cost: the geometry of the optimization changes. The gradient of $f$ with respect to $\ell$ involves the Jacobian of softmax, which can slow convergence or lead to different local minima. Since $f$ is non-convex, your choice of parameterization matters.

Approach 2: Projected Gradient Descent (PGD)

Instead of reparameterizing, stay on the simplex and correct any violations. The update:

$$\mathbf{x}{t+1} = \Pi{\Delta^{k-1}}[\mathbf{x}_t - \eta \nabla f(\mathbf{x}_t)]$$

where $\Pi_{\Delta^{k-1}}$ projects onto the simplex. The gradient must be centered (subtract its mean) to avoid breaking normalization. The projection itself runs in $O(k)$ time (see the ICML 2008 paper by Duchi et al.).

The Sparsity Problem in High Dimensions

In low dimensions (e.g., $k=3$), PGD works fine. But as $k$ grows, random steps almost always land outside the simplex. The author ran a simulation: for $k=9$, the probability of staying inside after a random step from a vertex is about $10^{-7}$; for $k=20$, it's essentially zero. As a result, PGD tends to project onto lower-dimensional faces, forcing some coordinates to zero. This produces sparse solutions — which may or may not be desirable.

Approach 3: Mirror Descent with Bregman Divergences

The article introduces mirror descent as a unifying framework. Standard gradient descent can be seen as minimizing a local linear approximation of $f$ plus an $L2$ penalty (squared Euclidean distance). But you can replace the $L2$ penalty with any Bregman divergence $D_h(\mathbf{y}||\mathbf{x}) = h(\mathbf{y}) - h(\mathbf{x}) - \langle \nabla h(\mathbf{x}), \mathbf{y} - \mathbf{x} \rangle$.

Choosing $h(\mathbf{x}) = \sum_i x_i \log x_i$ (the negative entropy) gives the KL divergence. The resulting update is multiplicative: $x_i \propto \exp(-\eta \nabla f(\mathbf{x})_i)$. This is exactly the softmax reparameterization update seen through a different lens.

Which One Should You Use?

Softmax reparameterization is simple, works out of the box with any optimizer (Adam, SGD), and doesn't require a projection step. It's ideal if you want dense solutions and have a moderate number of categories.
PGD is conceptually simpler (just gradient descent + projection) but suffers from sparsity in high dimensions. Use it if you want sparse solutions or if the projection is cheap relative to gradient evaluation.
Mirror descent with KL is mathematically elegant and can converge faster on the simplex, but requires custom update rules.

For the protein binder problem ($k=20$), the author suggests PGD may naturally yield sparse amino acid distributions, which could be beneficial for design. But the article doesn't provide experimental results — it's a conceptual comparison.

Key Takeaway

Optimization on constrained domains isn't one-size-fits-all. The choice of parameterization or projection method changes the optimization landscape and the quality of solutions. Understanding the geometry of your constraint set (here the simplex) helps you pick the right tool.