Convex Optimization: Part 2

Tags: Convex Optimization, Mathematics, Machine Learning, KKT Conditions, Loss Functions
Convex optimization illustration

Image credit: Huang et al. (2022)

In this blog series, we are exploring:

  • How to determine if a function is convex.
  • Properties and operations that preserve convexity.
  • Standard forms of convex optimization problems.
  • The Karush-Kuhn-Tucker (KKT) conditions: what they are and when they apply.
  • Mathematical derivations, with an SVM-like example and various convex loss functions.
  • A broad survey of commonly used convex functions in optimization.

Series Division:

  • Part 1: Fundamentals, Convexity Criteria, and Examples.
  • Part 2: Duality, KKT Conditions, and Advanced Applications.
  • Part 3: Convex Loss Functions.

Building on the fundamentals and examples covered in Part 1, this post digs deeper into the structural aspects of convex optimization. We examine the duality framework and the Karush-Kuhn-Tucker (KKT) conditions, which not only bridge the primal and dual formulations but also provide the critical criteria for optimality. We then explore applications, the soft-margin SVM and a distributionally robust cross-entropy problem, that show how these theoretical tools are put into practice to solve optimization challenges.

3. Duality and Optimality Conditions in Convex Optimization

Once we understand convexity, the next step is to explore how these optimization problems can be viewed from different angles to deepen our understanding and improve our solution methods. In many cases, we can express a problem in its original, or primal, form and derive a corresponding dual form that offers complementary insights or computational benefits. At the heart of this discussion are the Karush-Kuhn-Tucker (KKT) conditions, which provide the necessary and, under certain conditions, sufficient criteria for optimality in the presence of constraints.
In this section, we'll dive into these ideas, discussing how the primal and dual formulations relate and how the KKT conditions help us confirm when we've reached an optimal solution.

3.1 Standard Form of a Convex Optimization Problem

A convex optimization problem typically takes the form:

$$
\begin{aligned}
\min_{x \in \mathbb{R}^n} \quad & f(x) \\
\text{subject to} \quad & g_i(x) \le 0, \quad i = 1, \dots, m, \\
& h_j(x) = 0, \quad j = 1, \dots, p,
\end{aligned}
$$

where $f$ is a convex function, each $g_i$ is convex, and each $h_j$ is affine. If these conditions hold, and if there exists a feasible point that satisfies Slater's condition (strict feasibility), powerful results like strong duality often kick in.
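To ground the standard form, here is a minimal numeric sketch (an illustrative toy, not from the post): a two-variable problem with convex objective $f(x,y) = (x-2)^2 + (y-2)^2$ and a single affine inequality $x + y \le 2$, solved by projected gradient descent. The step size and iteration count are arbitrary choices.

```python
# Toy convex problem: minimize f(x, y) = (x-2)^2 + (y-2)^2
# subject to g(x, y) = x + y - 2 <= 0 (one affine inequality).
# Projected gradient descent: step along -grad(f), then project
# back onto the halfspace {x + y <= 2} whenever it is violated.

def solve_toy_qp(steps=2000, lr=0.1):
    x, y = 0.0, 0.0
    for _ in range(steps):
        # gradient step on the smooth convex objective
        x -= lr * 2 * (x - 2)
        y -= lr * 2 * (y - 2)
        # Euclidean projection onto the halfspace x + y <= 2
        excess = x + y - 2
        if excess > 0:
            x -= excess / 2
            y -= excess / 2
    return x, y

x_star, y_star = solve_toy_qp()
```

The iterate settles on the constraint boundary at $(1, 1)$, where the inequality is active.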

3.2 Karush-Kuhn-Tucker (KKT) Conditions

The KKT conditions are necessary (and often sufficient) for optimality in convex optimization problems with inequality and equality constraints. Given a problem:

$$
\begin{aligned}
\min_x \quad & f(x) \\
\text{subject to} \quad & g_i(x) \le 0,\; i=1,\dots,m, \\
& h_j(x) = 0,\; j=1,\dots,p,
\end{aligned}
$$

there exist Lagrange multipliers $\lambda_i \ge 0$ and $\nu_j$ such that at the solution $x^*$:

  1. Stationarity:

     $$\nabla f(x^*) + \sum_{i=1}^{m} \lambda_i^* \nabla g_i(x^*) + \sum_{j=1}^{p} \nu_j^* \nabla h_j(x^*) = 0.$$
  2. Primal Feasibility:

     $$g_i(x^*) \le 0, \quad h_j(x^*) = 0.$$
  3. Dual Feasibility:

     $$\lambda_i^* \ge 0 \quad \text{for all } i.$$
  4. Complementary Slackness:

     $$\lambda_i^* \, g_i(x^*) = 0 \quad \text{for all } i.$$

If the problem is convex and a constraint qualification such as Slater's condition holds, these conditions are both necessary and sufficient for global optimality.
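To make the four conditions concrete, here is a minimal numeric check (an illustrative toy, not from the post): for the problem $\min x^2$ subject to $x \ge 1$, written as $g(x) = 1 - x \le 0$, the candidate pair $x^* = 1$, $\lambda^* = 2$ satisfies all four conditions.

```python
# KKT check for: minimize f(x) = x^2 subject to g(x) = 1 - x <= 0.
# Candidate primal/dual pair: x* = 1, lambda* = 2.
x_star, lam = 1.0, 2.0

grad_f = 2 * x_star   # f'(x) = 2x
grad_g = -1.0         # g'(x) = -1

stationarity = grad_f + lam * grad_g   # should be exactly 0
primal_feasible = (1 - x_star) <= 0    # g(x*) <= 0
dual_feasible = lam >= 0               # lambda* >= 0
comp_slack = lam * (1 - x_star)        # lambda* g(x*), should be 0
```

Since the problem is convex and strictly feasible (e.g. $x = 2$), these checks certify that $x^* = 1$ is the global minimizer.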

Application Case I: Soft-Margin SVM

Here we derive the soft-margin SVM's dual from its primal form via the KKT conditions. The starting point is the primal formulation:

$$\min_{w,b,\xi} \quad \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i,$$

subject to margin and nonnegativity constraints:

$$y_i(w^\top x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0 \quad \text{for } i = 1,\dots,n.$$

Let $\alpha_i \ge 0$ and $\beta_i \ge 0$ be the Lagrange multipliers. Form the Lagrangian:

$$
\begin{aligned}
L(w,b,\xi,\alpha,\beta) &= \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i \\
&\quad + \sum_{i=1}^n \alpha_i\bigl[1 - \xi_i - y_i(w^\top x_i + b)\bigr] - \sum_{i=1}^n \beta_i\,\xi_i.
\end{aligned}
$$

To find the dual, differentiate $L$ with respect to the primal variables and set the derivatives to zero:

  1. Stationarity w.r.t. $w$:

     $$w - \sum_{i=1}^n \alpha_i y_i x_i = 0 \quad \Longrightarrow \quad w = \sum_{i=1}^n \alpha_i y_i x_i.$$
  2. Stationarity w.r.t. $b$:

     $$-\sum_{i=1}^n \alpha_i y_i = 0 \quad \Longrightarrow \quad \sum_{i=1}^n \alpha_i y_i = 0.$$
  3. Stationarity w.r.t. $\xi_i$:

     $$C - \alpha_i - \beta_i = 0 \quad \Longrightarrow \quad \beta_i = C - \alpha_i.$$

     Since $\beta_i \ge 0$, we get $0 \le \alpha_i \le C$.

We then substitute back into the Lagrangian, eliminating $(w, b, \xi)$ to obtain the dual problem:

$$\max_{\alpha} \quad \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j \,(x_i^\top x_j),$$

subject to $0 \le \alpha_i \le C$ and $\sum_{i=1}^n \alpha_i y_i = 0$. This exemplifies how the KKT conditions unify primal and dual formulations in a convex setting.
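As a sanity check on the dual (an illustrative toy, not the post's code), consider two scalar points $x_1 = +1$, $x_2 = -1$ with labels $y_1 = +1$, $y_2 = -1$. The constraint $\sum_i \alpha_i y_i = 0$ forces $\alpha_1 = \alpha_2 = a$, and the dual objective collapses to $D(a) = 2a - 2a^2$, which projected gradient ascent maximizes at $a^* = 1/2$.

```python
# Toy soft-margin SVM dual with two points: x1 = +1 (y1 = +1),
# x2 = -1 (y2 = -1).  With alpha_1 = alpha_2 = a (forced by the
# equality constraint), the dual objective is D(a) = 2a - 2a^2.

def solve_toy_svm_dual(C=1.0, steps=500, lr=0.1):
    a = 0.0
    for _ in range(steps):
        a += lr * (2 - 4 * a)     # gradient ascent on D(a)
        a = min(max(a, 0.0), C)   # enforce the box 0 <= a <= C
    return a

a_star = solve_toy_svm_dual()
# Recover the primal weight: w = sum_i alpha_i y_i x_i
w = a_star * (+1) * (+1) + a_star * (-1) * (-1)
```

Recovering $w = \sum_i \alpha_i y_i x_i$ gives $w = 1$, which (with $b = 0$) places both points exactly on the margin, as expected for support vectors.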

Application Case II: A Distributionally Robust Cross-Entropy Problem with Per-Sample Temperature

Consider a fixed data point $(x, y)$, and define

$$s_k = w_k^\top x - w_y^\top x \quad\text{for}\; k = 1,\dots,K.$$

Think of $s_k$ as the (shifted) logit score for class $k$. We introduce a distribution $\mathbf{p} = (p_1,\dots,p_K)$ over these $K$ classes. Our primal problem is:

$$\max_{\mathbf{p}\in \mathbb{R}^K} \sum_{k=1}^K p_k\,s_k$$

subject to

  1. $\sum_{k=1}^K p_k = 1,$
  2. $\sum_{k=1}^K p_k \,\log\bigl(p_k K\bigr) \le \rho,$
  3. $p_k \ge 0 \quad \forall k.$

In words:

  • We want to maximize a linear function of $\mathbf{p}$.
  • $\mathbf{p}$ must lie in the probability simplex.
  • We impose an entropic constraint $\sum_k p_k \log(p_k K) \le \rho$, which restricts how sharply peaked or uniform $\mathbf{p}$ can be.

Proving Convexity

Although the objective is a maximization of a linear function, we can equivalently view it as

$$\min_{\mathbf{p}\in \mathcal{F}} \;-\sum_{k=1}^K p_k\,s_k,$$

where $\mathcal{F}$ is the feasible set:

  • Probability simplex: $p_k \ge 0$, $\sum_{k=1}^K p_k = 1$.
  • Entropic constraint: $\sum_{k=1}^K p_k \,\log(p_k K) \le \rho$.

Linear functions are both convex and concave. The probability simplex is convex, and the set $\{\mathbf{p}: \sum_k p_k \log(p_k K) \le \rho\}$ is convex because $\mathbf{p} \mapsto \sum_k p_k \log p_k$ (the negative entropy) is convex, so the constraint set is a sublevel set of a convex function. The intersection of convex sets is convex, making the overall formulation a convex optimization problem (when viewed in minimization form).
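A quick numerical spot-check of the convexity claim (illustrative only; two hand-picked distributions, not a proof): midpoint convexity requires $\Phi\bigl(\tfrac{p+q}{2}\bigr) \le \tfrac{1}{2}\bigl(\Phi(p) + \Phi(q)\bigr)$ for $\Phi(\mathbf{p}) = \sum_k p_k \log(p_k K)$.

```python
import math

# Phi(p) = sum_k p_k log(p_k K), the KL divergence from p to the
# uniform distribution.  Convexity implies the midpoint inequality
# Phi((p+q)/2) <= (Phi(p) + Phi(q)) / 2.

def phi(p):
    K = len(p)
    return sum(pk * math.log(pk * K) for pk in p if pk > 0)

p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]
m = [(a + b) / 2 for a, b in zip(p, q)]

midpoint_gap = (phi(p) + phi(q)) / 2 - phi(m)  # nonnegative if convex
```

At the uniform distribution, $\Phi$ evaluates to $0$, its minimum on the simplex, consistent with the Slater argument below.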

Strict Feasibility and Slater’s Condition

To use strong duality, we typically invoke Slater’s condition. We need a strictly feasible point that satisfies

  • $\sum_{k=1}^K p_k = 1,$
  • $p_k > 0$ for all $k$,
  • $\sum_{k=1}^K p_k \,\log\bigl(p_k K\bigr) < \rho.$

A simple choice is the uniform distribution $p_k = 1/K$. Then

$$\sum_{k=1}^K \frac{1}{K} \,\log\Bigl(\tfrac{1}{K}\cdot K\Bigr) = \sum_{k=1}^K \frac{1}{K}\,\log(1) = 0.$$

If $\rho > 0$, this strictly satisfies the constraint. Hence, Slater's condition holds.

Forming the Lagrangian

Let $\alpha$ be the Lagrange multiplier for $\sum_k p_k = 1$, and let $\beta \ge 0$ be the multiplier for the entropic constraint $\sum_k p_k \log(p_k K) \le \rho$. Define

$$\Phi(\mathbf{p}) = \sum_{k=1}^K p_k \,\log\bigl(p_k K\bigr) - \rho \le 0.$$

The Lagrangian is

$$\mathcal{L}(\mathbf{p},\alpha,\beta) = \sum_{k=1}^K p_k\,s_k - \alpha \Bigl(\sum_{k=1}^K p_k - 1\Bigr) - \beta \Bigl(\sum_{k=1}^K p_k \log(p_k K) - \rho\Bigr).$$

Rewriting,

$$\mathcal{L}(\mathbf{p},\alpha,\beta) = \sum_{k=1}^K p_k\,s_k - \alpha \sum_{k=1}^K p_k - \beta \sum_{k=1}^K p_k \log\bigl(p_k K\bigr) + \alpha + \beta\,\rho.$$

Solving for $\mathbf{p}$

The dual function $d(\alpha,\beta)$ is given by

$$d(\alpha,\beta) = \max_{\mathbf{p}\ge 0,\;\sum_k p_k=1} \;\mathcal{L}(\mathbf{p},\alpha,\beta).$$

Taking the derivative w.r.t. $p_k$ and setting it to zero:

$$\frac{\partial \mathcal{L}}{\partial p_k} = s_k - \alpha - \beta\Bigl[\log\bigl(p_k K\bigr) + 1\Bigr] = 0.$$

Hence,

$$s_k - \alpha - \beta \log(p_k K) - \beta = 0,$$

implying

$$\log\bigl(p_k K\bigr) = \frac{1}{\beta}\bigl(s_k - \alpha - \beta\bigr), \quad p_k = \frac{1}{K}\exp\!\Bigl(\tfrac{s_k - \alpha - \beta}{\beta}\Bigr).$$
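Before enforcing normalization, we can sanity-check this closed form numerically (an illustrative check with arbitrary $\alpha$, $\beta$, and scores): plugging $p_k$ back into the stationarity condition should give zero for every $k$.

```python
import math

# For arbitrary alpha and beta > 0, the closed form
# p_k = (1/K) exp((s_k - alpha - beta)/beta) should make the
# stationarity residual s_k - alpha - beta*(log(p_k K) + 1)
# vanish for every class k.

s = [1.5, 0.3, -0.8]   # illustrative shifted logits
alpha, beta = 0.4, 2.0
K = len(s)

p = [math.exp((sk - alpha - beta) / beta) / K for sk in s]
residuals = [sk - alpha - beta * (math.log(pk * K) + 1)
             for sk, pk in zip(s, p)]
```

The residuals are zero up to floating-point rounding, confirming the closed form before the multipliers are pinned down.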

We must also enforce $\sum_{k=1}^K p_k = 1$. That yields

$$\sum_{k=1}^K p_k = \sum_{k=1}^K \tfrac{1}{K} \exp\!\Bigl(\tfrac{s_k - \alpha - \beta}{\beta}\Bigr) = 1.$$

Solving for $\alpha$ in terms of $\beta$ leads to

$$-\frac{\alpha+\beta}{\beta} = \log(K) - \log\Bigl(\sum_{k=1}^K \exp(s_k/\beta)\Bigr),$$

so

$$\alpha = -\beta \log(K) + \beta \log\Bigl(\sum_{k=1}^K e^{s_k/\beta}\Bigr) - \beta.$$

Substituting back, the factor $1/K$ cancels against $e^{-(\alpha+\beta)/\beta} = K/\sum_{\ell} e^{s_\ell/\beta}$, leaving the primal-optimal distribution:

$$p_k^*(\beta) = \frac{\exp\!\bigl(\tfrac{s_k}{\beta}\bigr)}{\sum_{\ell=1}^K \exp\!\bigl(\tfrac{s_\ell}{\beta}\bigr)}.$$

Constructing the Dual Problem

We then substitute $p_k^*(\beta)$ into $\mathcal{L}$ to get the dual function $d(\alpha,\beta)$ (with $\alpha$ eliminated in favor of $\beta$). The entropic constraint $\sum_k p_k \log(p_k K) \le \rho$ implies a complementary slackness condition:

$$\beta\Bigl(\sum_{k=1}^K p_k^*(\beta)\,\log\bigl(p_k^*(\beta)\,K\bigr) - \rho\Bigr) = 0.$$

In other words,

  • If the constraint is slack, $\sum_k p_k^*(\beta)\,\log\bigl(p_k^*(\beta)\,K\bigr) < \rho$, then $\beta = 0$.
  • If $\beta > 0$, the constraint must be active: $\sum_k p_k^*(\beta)\,\log\bigl(p_k^*(\beta)\,K\bigr) = \rho$.

Hence, the dual problem is typically a one-dimensional minimization over $\beta \ge 0$, where we check whether the constraint is active. Once $\beta^*$ is found, we get $p_k^*$ from the expression above.
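A minimal sketch of that one-dimensional search (an assumed implementation with illustrative scores and $\rho$): the KL divergence to uniform, $\sum_k p_k \log(p_k K)$ with $p = \mathrm{softmax}(s/\beta)$, decreases monotonically from $\log K$ (as $\beta \to 0$) to $0$ (as $\beta \to \infty$), so whenever $\rho < \log K$ bisection on $\beta$ finds the point where the constraint becomes tight.

```python
import math

def softmax_over_beta(s, beta):
    m = max(s)  # subtract max for numerical stability
    e = [math.exp((v - m) / beta) for v in s]
    z = sum(e)
    return [v / z for v in e]

def kl_to_uniform(p):
    K = len(p)
    return sum(pk * math.log(pk * K) for pk in p if pk > 0)

def solve_beta(s, rho, lo=1e-6, hi=1e6, iters=100):
    # Bisection: KL decreases in beta, so if the distribution is
    # still too peaked (KL > rho), beta must increase.
    for _ in range(iters):
        mid = (lo + hi) / 2
        if kl_to_uniform(softmax_over_beta(s, mid)) > rho:
            lo = mid
        else:
            hi = mid
    return hi

s = [2.0, 1.0, 0.0]   # illustrative shifted logits for K = 3
rho = 0.1             # illustrative entropic budget (< log 3)
beta_star = solve_beta(s, rho)
p_star = softmax_over_beta(s, beta_star)
```

The returned distribution sums to one and sits on the boundary of the entropic constraint, matching the complementary slackness analysis above.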

Primal Solution from Optimal Dual Variables

Once the optimal $\beta^*$ is determined (depending on whether the entropic constraint is binding), the final solution is

$$p_k^* = \frac{\exp\!\bigl(\tfrac{s_k}{\beta^*}\bigr)}{\sum_{\ell=1}^K \exp\!\bigl(\tfrac{s_\ell}{\beta^*}\bigr)}.$$

This is exactly a softmax distribution with effective temperature $\beta^*$ (the logits are scaled by $1/\beta^*$). If the constraint is active, $\beta^*$ is chosen so that the distribution's entropy hits the prescribed level.

Interpretation

  • Learning individual $\tau$: in knowledge distillation, one might want a separate temperature for each sample. The above derivation shows how $\beta^*$ can play that role.
  • Distributional robustness: the entropic constraint limits how peaked $p_k$ can be, thus mitigating the impact of uncertain labels or outliers.
  • Entropy regularization view: $\sum_k p_k \log p_k$ is the negative entropy, so adding or constraining it aligns with approaches like KL-divergence regularization.

Conclusion:

This part has demonstrated how duality and the KKT conditions serve as powerful instruments for dissecting and solving convex optimization problems, with practical applications that highlight their utility in machine learning and beyond. To continue the series, proceed to Part 3: Convex Loss Functions, where we explore the loss functions commonly used in optimization frameworks.

References & Further Reading

  1. Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.
  2. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning.
  3. Schölkopf, B., & Smola, A. (2001). Learning with Kernels. MIT Press.
  4. García-Gonzalo, E., et al. (2016). A Study on Kernel Methods for Classification.