Convex Optimization: Part 2

Tags: Convex Optimization, Mathematics, Machine Learning, KKT Conditions, Loss Functions
Convex optimization illustration

Image credit: Huang et al. (2022)

In this blog series, we are exploring:

  • How to determine if a function is convex.
  • Properties and operations that preserve convexity.
  • Standard forms of convex optimization problems.
  • The Karush-Kuhn-Tucker (KKT) conditions: what they are and when they apply.
  • Mathematical derivations, with an SVM-like example and various convex loss functions.
  • A broad survey of commonly used convex functions in optimization.

Series Division:

  • Part 1: Fundamentals, Convexity Criteria, and Examples.
  • Part 2: Duality, KKT Conditions, and Advanced Applications.
  • Part 3: Convex Loss Functions.

Building on the fundamentals and examples covered in Part 1, this post digs deeper into the structural aspects of convex optimization. We examine the duality framework and the Karush-Kuhn-Tucker (KKT) conditions, which not only bridge the primal and dual formulations but also provide the critical criteria for optimality. We then explore applications, the soft-margin SVM and a distributionally robust cross-entropy problem, that show how these theoretical tools are put into practice to solve optimization challenges.

3. Duality and Optimality Conditions in Convex Optimization

Once we understand convexity, the next step is to explore how these optimization problems can be viewed from different angles to deepen our understanding and improve our solution methods. In many cases, we can express a problem in its original, or primal, form and derive a corresponding dual form that offers complementary insights or computational benefits. At the heart of this discussion are the Karush-Kuhn-Tucker (KKT) conditions, which provide the necessary and, under certain conditions, sufficient criteria for optimality in the presence of constraints.
In this section, we'll dive into these ideas, discussing how the primal and dual formulations relate and how the KKT conditions help us confirm when we've reached an optimal solution.

3.1 Standard Form of a Convex Optimization Problem

A convex optimization problem typically takes the form:

$$
\begin{aligned}
\min_{x \in \mathbb{R}^n} \quad & f(x) \\
\text{subject to} \quad & g_i(x) \le 0, \quad i = 1, \dots, m, \\
& h_j(x) = 0, \quad j = 1, \dots, p,
\end{aligned}
$$

where $f$ is a convex function, each $g_i$ is convex, and each $h_j$ is affine. If these conditions hold, and if there exists a feasible point that satisfies Slater's condition (strict feasibility), powerful results like strong duality often kick in.
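To ground the standard form, here is a minimal numeric sketch (an illustrative toy, not from the post): a two-variable problem with convex objective $f(x,y) = (x-2)^2 + (y-2)^2$ and a single affine inequality $x + y \le 2$, solved by projected gradient descent. The step size and iteration count are arbitrary choices.

```python
# Toy convex problem: minimize f(x, y) = (x-2)^2 + (y-2)^2
# subject to g(x, y) = x + y - 2 <= 0 (one affine inequality).
# Projected gradient descent: step along -grad(f), then project
# back onto the halfspace {x + y <= 2} whenever it is violated.

def solve_toy_qp(steps=2000, lr=0.1):
    x, y = 0.0, 0.0
    for _ in range(steps):
        # gradient step on the smooth convex objective
        x -= lr * 2 * (x - 2)
        y -= lr * 2 * (y - 2)
        # Euclidean projection onto the halfspace x + y <= 2
        excess = x + y - 2
        if excess > 0:
            x -= excess / 2
            y -= excess / 2
    return x, y

x_star, y_star = solve_toy_qp()
```

The iterate settles on the constraint boundary at $(1, 1)$, where the inequality is active.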

3.2 Karush-Kuhn-Tucker (KKT) Conditions

The KKT conditions are necessary (and often sufficient) for optimality in convex optimization problems with inequality and equality constraints. Given a problem:

$$
\begin{aligned}
\min_x \quad & f(x) \\
\text{subject to} \quad & g_i(x) \le 0,\; i=1,\dots,m, \\
& h_j(x) = 0,\; j=1,\dots,p,
\end{aligned}
$$

there exist Lagrange multipliers $\lambda_i \ge 0$ and $\nu_j$ such that at the solution $x^*$:

  1. Stationarity:

     $$\nabla f(x^*) + \sum_{i=1}^{m} \lambda_i^* \nabla g_i(x^*) + \sum_{j=1}^{p} \nu_j^* \nabla h_j(x^*) = 0.$$
  2. Primal Feasibility:

     $$g_i(x^*) \le 0, \quad h_j(x^*) = 0.$$
  3. Dual Feasibility:

     $$\lambda_i^* \ge 0 \quad \text{for all } i.$$
  4. Complementary Slackness:

     $$\lambda_i^* \, g_i(x^*) = 0 \quad \text{for all } i.$$

If the problem is convex and a constraint qualification such as Slater's condition holds, these conditions are both necessary and sufficient for global optimality.
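To make the four conditions concrete, here is a minimal numeric check (an illustrative toy, not from the post): for the problem $\min x^2$ subject to $x \ge 1$, written as $g(x) = 1 - x \le 0$, the candidate pair $x^* = 1$, $\lambda^* = 2$ satisfies all four conditions.

```python
# KKT check for: minimize f(x) = x^2 subject to g(x) = 1 - x <= 0.
# Candidate primal/dual pair: x* = 1, lambda* = 2.
x_star, lam = 1.0, 2.0

grad_f = 2 * x_star   # f'(x) = 2x
grad_g = -1.0         # g'(x) = -1

stationarity = grad_f + lam * grad_g   # should be exactly 0
primal_feasible = (1 - x_star) <= 0    # g(x*) <= 0
dual_feasible = lam >= 0               # lambda* >= 0
comp_slack = lam * (1 - x_star)        # lambda* g(x*), should be 0
```

Since the problem is convex and strictly feasible (e.g. $x = 2$), these checks certify that $x^* = 1$ is the global minimizer.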

Application Case I: Soft-Margin SVM

Here we derive the soft-margin SVM's dual from its primal form via the KKT conditions. The starting point is the primal formulation:

$$\min_{w,b,\xi} \quad \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i,$$

subject to margin and nonnegativity constraints:

$$y_i(w^\top x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0 \quad \text{for } i = 1,\dots,n.$$

Let $\alpha_i \ge 0$ and $\beta_i \ge 0$ be the Lagrange multipliers. Form the Lagrangian:

$$
\begin{aligned}
L(w,b,\xi,\alpha,\beta) &= \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i \\
&\quad + \sum_{i=1}^n \alpha_i\bigl[1 - \xi_i - y_i(w^\top x_i + b)\bigr] - \sum_{i=1}^n \beta_i\,\xi_i.
\end{aligned}
$$

To find the dual, differentiate $L$ with respect to the primal variables and set the derivatives to zero:

  1. Stationarity w.r.t. $w$:

     $$w - \sum_{i=1}^n \alpha_i y_i x_i = 0 \quad \Longrightarrow \quad w = \sum_{i=1}^n \alpha_i y_i x_i.$$
  2. Stationarity w.r.t. $b$:

     $$-\sum_{i=1}^n \alpha_i y_i = 0 \quad \Longrightarrow \quad \sum_{i=1}^n \alpha_i y_i = 0.$$
  3. Stationarity w.r.t. $\xi_i$:

     $$C - \alpha_i - \beta_i = 0 \quad \Longrightarrow \quad \beta_i = C - \alpha_i.$$

     Since $\beta_i \ge 0$, we get $0 \le \alpha_i \le C$.

We then substitute back into the Lagrangian, eliminating $(w, b, \xi)$ to obtain the dual problem:

$$\max_{\alpha} \quad \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j \,(x_i^\top x_j),$$

subject to $0 \le \alpha_i \le C$ and $\sum_{i=1}^n \alpha_i y_i = 0$. This exemplifies how the KKT conditions unify primal and dual formulations in a convex setting.
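As a sanity check on the dual (an illustrative toy, not the post's code), consider two scalar points $x_1 = +1$, $x_2 = -1$ with labels $y_1 = +1$, $y_2 = -1$. The constraint $\sum_i \alpha_i y_i = 0$ forces $\alpha_1 = \alpha_2 = a$, and the dual objective collapses to $D(a) = 2a - 2a^2$, which projected gradient ascent maximizes at $a^* = 1/2$.

```python
# Toy soft-margin SVM dual with two points: x1 = +1 (y1 = +1),
# x2 = -1 (y2 = -1).  With alpha_1 = alpha_2 = a (forced by the
# equality constraint), the dual objective is D(a) = 2a - 2a^2.

def solve_toy_svm_dual(C=1.0, steps=500, lr=0.1):
    a = 0.0
    for _ in range(steps):
        a += lr * (2 - 4 * a)     # gradient ascent on D(a)
        a = min(max(a, 0.0), C)   # enforce the box 0 <= a <= C
    return a

a_star = solve_toy_svm_dual()
# Recover the primal weight: w = sum_i alpha_i y_i x_i
w = a_star * (+1) * (+1) + a_star * (-1) * (-1)
```

Recovering $w = \sum_i \alpha_i y_i x_i$ gives $w = 1$, which (with $b = 0$) places both points exactly on the margin, as expected for support vectors.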

Application Case II: A Distributionally Robust Cross-Entropy Problem with Per-Sample Temperature

Consider a fixed data point $(x, y)$, and define

$$s_k = w_k^\top x - w_y^\top x \quad\text{for}\; k = 1,\dots,K.$$

Think of $s_k$ as the (shifted) logit score for class $k$. We introduce a distribution $\mathbf{p} = (p_1,\dots,p_K)$ over these $K$ classes. Our primal problem is:

$$\max_{\mathbf{p}\in \mathbb{R}^K} \sum_{k=1}^K p_k\,s_k$$

subject to

  1. $\sum_{k=1}^K p_k = 1,$
  2. $\sum_{k=1}^K p_k \,\log\bigl(p_k K\bigr) \le \rho,$
  3. $p_k \ge 0 \quad \forall k.$

In words:

  • We want to maximize a linear function of $\mathbf{p}$.
  • $\mathbf{p}$ must lie in the probability simplex.
  • We impose an entropic constraint $\sum_k p_k \log(p_k K) \le \rho$, which restricts how sharply peaked or uniform $\mathbf{p}$ can be.

Proving Convexity

Although the objective is a maximization of a linear function, we can equivalently view it as

$$\min_{\mathbf{p}\in \mathcal{F}} \;-\sum_{k=1}^K p_k\,s_k,$$

where $\mathcal{F}$ is the feasible set:

  • Probability simplex: $p_k \ge 0$, $\sum_{k=1}^K p_k = 1$.
  • Entropic constraint: $\sum_{k=1}^K p_k \,\log(p_k K) \le \rho$.

Linear functions are both convex and concave. The probability simplex is convex, and the set $\{\mathbf{p}: \sum_k p_k \log(p_k K) \le \rho\}$ is convex because $\mathbf{p} \mapsto \sum_k p_k \log p_k$ (the negative entropy) is convex, so the constraint set is a sublevel set of a convex function. The intersection of convex sets is convex, making the overall formulation a convex optimization problem (when viewed in minimization form).
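A quick numerical spot-check of the convexity claim (illustrative only; two hand-picked distributions, not a proof): midpoint convexity requires $\Phi\bigl(\tfrac{p+q}{2}\bigr) \le \tfrac{1}{2}\bigl(\Phi(p) + \Phi(q)\bigr)$ for $\Phi(\mathbf{p}) = \sum_k p_k \log(p_k K)$.

```python
import math

# Phi(p) = sum_k p_k log(p_k K), the KL divergence from p to the
# uniform distribution.  Convexity implies the midpoint inequality
# Phi((p+q)/2) <= (Phi(p) + Phi(q)) / 2.

def phi(p):
    K = len(p)
    return sum(pk * math.log(pk * K) for pk in p if pk > 0)

p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]
m = [(a + b) / 2 for a, b in zip(p, q)]

midpoint_gap = (phi(p) + phi(q)) / 2 - phi(m)  # nonnegative if convex
```

At the uniform distribution, $\Phi$ evaluates to $0$, its minimum on the simplex, consistent with the Slater argument below.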

Strict Feasibility and Slater’s Condition

To use strong duality, we typically invoke Slater’s condition. We need a strictly feasible point that satisfies

  • $\sum_{k=1}^K p_k = 1,$
  • $p_k > 0$ for all $k$,
  • $\sum_{k=1}^K p_k \,\log\bigl(p_k K\bigr) < \rho.$

A simple choice is the uniform distribution $p_k = 1/K$. Then

$$\sum_{k=1}^K \frac{1}{K} \,\log\Bigl(\tfrac{1}{K}\cdot K\Bigr) = \sum_{k=1}^K \frac{1}{K}\,\log(1) = 0.$$

If $\rho > 0$, this strictly satisfies the constraint. Hence, Slater's condition holds.

Forming the Lagrangian

Let $\alpha$ be the Lagrange multiplier for $\sum_k p_k = 1$, and let $\beta \ge 0$ be the multiplier for the entropic constraint $\sum_k p_k \log(p_k K) \le \rho$. Define

$$\Phi(\mathbf{p}) = \sum_{k=1}^K p_k \,\log\bigl(p_k K\bigr) - \rho \le 0.$$

The Lagrangian is

$$\mathcal{L}(\mathbf{p},\alpha,\beta) = \sum_{k=1}^K p_k\,s_k - \alpha \Bigl(\sum_{k=1}^K p_k - 1\Bigr) - \beta \Bigl(\sum_{k=1}^K p_k \log(p_k K) - \rho\Bigr).$$

Rewriting,

$$\mathcal{L}(\mathbf{p},\alpha,\beta) = \sum_{k=1}^K p_k\,s_k - \alpha \sum_{k=1}^K p_k - \beta \sum_{k=1}^K p_k \log\bigl(p_k K\bigr) + \alpha + \beta\,\rho.$$

Solving for $\mathbf{p}$

The dual function $d(\alpha,\beta)$ is given by

$$d(\alpha,\beta) = \max_{\mathbf{p}\ge 0,\;\sum_k p_k=1} \;\mathcal{L}(\mathbf{p},\alpha,\beta).$$

Taking the derivative w.r.t. $p_k$ and setting it to zero:

$$\frac{\partial \mathcal{L}}{\partial p_k} = s_k - \alpha - \beta\Bigl[\log\bigl(p_k K\bigr) + 1\Bigr] = 0.$$

Hence,

$$s_k - \alpha - \beta \log(p_k K) - \beta = 0,$$

implying

$$\log\bigl(p_k K\bigr) = \frac{1}{\beta}\bigl(s_k - \alpha - \beta\bigr), \quad p_k = \frac{1}{K}\exp\!\Bigl(\tfrac{s_k - \alpha - \beta}{\beta}\Bigr).$$
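Before enforcing normalization, we can sanity-check this closed form numerically (an illustrative check with arbitrary $\alpha$, $\beta$, and scores): plugging $p_k$ back into the stationarity condition should give zero for every $k$.

```python
import math

# For arbitrary alpha and beta > 0, the closed form
# p_k = (1/K) exp((s_k - alpha - beta)/beta) should make the
# stationarity residual s_k - alpha - beta*(log(p_k K) + 1)
# vanish for every class k.

s = [1.5, 0.3, -0.8]   # illustrative shifted logits
alpha, beta = 0.4, 2.0
K = len(s)

p = [math.exp((sk - alpha - beta) / beta) / K for sk in s]
residuals = [sk - alpha - beta * (math.log(pk * K) + 1)
             for sk, pk in zip(s, p)]
```

The residuals are zero up to floating-point rounding, confirming the closed form before the multipliers are pinned down.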

We must also enforce $\sum_{k=1}^K p_k = 1$. That yields

$$\sum_{k=1}^K p_k = \sum_{k=1}^K \tfrac{1}{K} \exp\!\Bigl(\tfrac{s_k - \alpha - \beta}{\beta}\Bigr) = 1.$$

Solving for $\alpha$ in terms of $\beta$ leads to

$$-\frac{\alpha+\beta}{\beta} = \log(K) - \log\Bigl(\sum_{k=1}^K \exp(s_k/\beta)\Bigr),$$

so

$$\alpha = -\beta \log(K) + \beta \log\Bigl(\sum_{k=1}^K e^{s_k/\beta}\Bigr) - \beta.$$

Substituting back, the factor $1/K$ cancels against $e^{-(\alpha+\beta)/\beta} = K/\sum_{\ell} e^{s_\ell/\beta}$, leaving the primal-optimal distribution:

$$p_k^*(\beta) = \frac{\exp\!\bigl(\tfrac{s_k}{\beta}\bigr)}{\sum_{\ell=1}^K \exp\!\bigl(\tfrac{s_\ell}{\beta}\bigr)}.$$

Constructing the Dual Problem

We then substitute $p_k^*(\beta)$ into $\mathcal{L}$ to get the dual function $d(\alpha,\beta)$ (with $\alpha$ eliminated in favor of $\beta$). The entropic constraint $\sum_k p_k \log(p_k K) \le \rho$ implies a complementary slackness condition:

$$\beta\Bigl(\sum_{k=1}^K p_k^*(\beta)\,\log\bigl(p_k^*(\beta)\,K\bigr) - \rho\Bigr) = 0.$$

In other words,

  • If the constraint is slack, $\sum_k p_k^*(\beta)\,\log\bigl(p_k^*(\beta)\,K\bigr) < \rho$, then $\beta = 0$.
  • If $\beta > 0$, the constraint must be active: $\sum_k p_k^*(\beta)\,\log\bigl(p_k^*(\beta)\,K\bigr) = \rho$.

Hence, the dual problem is typically a one-dimensional minimization over $\beta \ge 0$, where we check whether the constraint is active. Once $\beta^*$ is found, we get $p_k^*$ from the expression above.
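A minimal sketch of that one-dimensional search (an assumed implementation with illustrative scores and $\rho$): the KL divergence to uniform, $\sum_k p_k \log(p_k K)$ with $p = \mathrm{softmax}(s/\beta)$, decreases monotonically from $\log K$ (as $\beta \to 0$) to $0$ (as $\beta \to \infty$), so whenever $\rho < \log K$ bisection on $\beta$ finds the point where the constraint becomes tight.

```python
import math

def softmax_over_beta(s, beta):
    m = max(s)  # subtract max for numerical stability
    e = [math.exp((v - m) / beta) for v in s]
    z = sum(e)
    return [v / z for v in e]

def kl_to_uniform(p):
    K = len(p)
    return sum(pk * math.log(pk * K) for pk in p if pk > 0)

def solve_beta(s, rho, lo=1e-6, hi=1e6, iters=100):
    # Bisection: KL decreases in beta, so if the distribution is
    # still too peaked (KL > rho), beta must increase.
    for _ in range(iters):
        mid = (lo + hi) / 2
        if kl_to_uniform(softmax_over_beta(s, mid)) > rho:
            lo = mid
        else:
            hi = mid
    return hi

s = [2.0, 1.0, 0.0]   # illustrative shifted logits for K = 3
rho = 0.1             # illustrative entropic budget (< log 3)
beta_star = solve_beta(s, rho)
p_star = softmax_over_beta(s, beta_star)
```

The returned distribution sums to one and sits on the boundary of the entropic constraint, matching the complementary slackness analysis above.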

Primal Solution from Optimal Dual Variables

Once the optimal $\beta^*$ is determined (depending on whether the entropic constraint is binding), the final solution is

$$p_k^* = \frac{\exp\!\bigl(\tfrac{s_k}{\beta^*}\bigr)}{\sum_{\ell=1}^K \exp\!\bigl(\tfrac{s_\ell}{\beta^*}\bigr)}.$$

This is exactly a softmax distribution with effective temperature $\beta^*$ (the logits are scaled by $1/\beta^*$). If the constraint is active, $\beta^*$ is chosen so that the distribution's entropy hits the prescribed level.

Interpretation

  • Learning individual $\tau$: in knowledge distillation, one might want a separate temperature for each sample. The above derivation shows how $\beta^*$ can play that role.
  • Distributional robustness: the entropic constraint limits how peaked $p_k$ can be, thus mitigating the impact of uncertain labels or outliers.
  • Entropy regularization view: $\sum_k p_k \log p_k$ is the negative entropy, so adding or constraining it aligns with approaches like KL-divergence regularization.

Conclusion:

This part has demonstrated how duality and the KKT conditions serve as powerful instruments for dissecting and solving convex optimization problems, with practical applications that highlight their utility in machine learning and beyond. To continue the series, proceed to Part 3: Convex Loss Functions, where we explore the loss functions commonly used in optimization frameworks.

References & Further Reading

  1. Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.
  2. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning.
  3. Schölkopf, B., & Smola, A. (2001). Learning with Kernels. MIT Press.
  4. García-Gonzalo, E., et al. (2016). A Study on Kernel Methods for Classification.