Convex Optimization: Part 3

Tags: Convex Optimization · Mathematics · Machine Learning · KKT Conditions · Loss Functions
Convex optimization illustration

Image credit: Huang et al. (2022)

In this blog series, we are exploring:

  • How to determine if a function is convex.
  • Properties and operations that preserve convexity.
  • Standard forms of convex optimization problems.
  • The Karush-Kuhn-Tucker (KKT) conditions: what they are and when they apply.
  • Mathematical derivations, with an SVM-like example and various convex loss functions.
  • A broad survey of commonly used convex functions in optimization.

Series Division:

  • Part 1: Fundamentals, Convexity Criteria, and Examples.
  • Part 2: Duality, KKT Conditions, and Advanced Applications.
  • Part 3: Convex Loss Functions.

Building on the insights from Parts 1 and 2, in this final installment we shift our focus entirely to convex loss functions—the critical components that power many machine learning algorithms. In this section, we explore a variety of loss functions, from simple linear and quadratic formulations to more complex constructs like the hinge loss and cross-entropy loss. By dissecting their properties, gradients, and underlying convexity, we provide you with the essential tools to understand and effectively utilize these functions in practical optimization problems.

4. Common Convex Loss Functions

Many machine learning algorithms hinge on convex losses to ensure tractable optimization and theoretical guarantees. Beyond loss functions, many function families are used as building blocks in convex analysis and optimization problems. Below is a (non-exhaustive) list of important convex functions, along with their gradients or subgradients.

As a reminder, here is the general thinking process when verifying convexity:

  • Definition Check: Confirm that for any two points $x$ and $y$ in the domain and any $\theta \in [0,1]$, $f(\theta x + (1-\theta)y) \le \theta f(x) + (1-\theta)f(y)$.
  • First-Order Condition (if differentiable): Verify that $f(y) \ge f(x) + \nabla f(x)^\top (y - x)$ holds for all $x, y$.
  • Second-Order Condition (if twice differentiable): Check the Hessian is positive semidefinite everywhere.
  • Composition Rules: Recognize that sums, nonnegative scalings, affine transformations, pointwise maxima of convex functions, and suitable compositions preserve convexity.
  • Reformulation & Relaxation: If needed, rewrite the function in terms of simpler convex components or use known convex surrogates.

Below, each example is tied back to one or more of these rules to illustrate why it is convex.
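To make the Definition Check concrete, here is a small numerical sanity check. This is a sketch only: random sampling can disprove convexity by finding a violation of the inequality, but passing every trial never proves it. The function name `definition_check` and the sampling ranges are illustrative choices of mine, not from any library.

```python
import random

random.seed(0)  # reproducible sampling

def definition_check(f, lo=-5.0, hi=5.0, trials=1000, tol=1e-9):
    """Randomly test f(theta*x + (1-theta)*y) <= theta*f(x) + (1-theta)*f(y).
    A single violation disproves convexity; passing merely fails to disprove it."""
    for _ in range(trials):
        x, y = random.uniform(lo, hi), random.uniform(lo, hi)
        theta = random.random()
        lhs = f(theta * x + (1 - theta) * y)
        rhs = theta * f(x) + (1 - theta) * f(y)
        if lhs > rhs + tol:
            return False
    return True

print(definition_check(lambda x: x * x))   # quadratic: no violation found
print(definition_check(lambda x: x ** 3))  # cubic: violations found on (-5, 5)
```

The cubic fails because it is concave for $x < 0$, so the sampler quickly finds a violating pair of points there.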

4.1 Linear and Affine Functions

  • Linear Function
    $f(\mathbf{x}) = \mathbf{a}^\top \mathbf{x}, \quad \mathbf{a} \in \mathbb{R}^n.$
    This is trivially convex in $\mathbf{x}$.
    Gradient: $\nabla f(\mathbf{x}) = \mathbf{a}$.

Why It’s Convex: A linear function has zero curvature (its Hessian is the zero matrix). By the Second-Order Condition, a zero Hessian is positive semidefinite, hence convex. Also, by Definition Check, linear functions satisfy the convexity inequality exactly.


  • Affine Function
    $f(\mathbf{x}) = \mathbf{a}^\top \mathbf{x} + b, \quad \mathbf{a} \in \mathbb{R}^n, \; b \in \mathbb{R}.$
    Affine functions are both convex and concave (their Hessian is zero).
    Gradient: $\nabla f(\mathbf{x}) = \mathbf{a}$.

Why It’s Convex: An affine function is just a linear function plus a constant shift. The same reasoning (zero Hessian, or direct verification by definition) shows it is convex. In fact, it’s both convex and concave.


4.2 Quadratic Functions

  • Univariate Quadratic
    $f(x) = ax^2 + bx + c, \quad a \ge 0.$
    Convex if $a \ge 0$.
    Gradient: $\nabla f(x) = 2ax + b$.

Why It’s Convex: The Second-Order Condition in one dimension says a function is convex if its second derivative is nonnegative everywhere. Here the second derivative is $2a$, which is $\ge 0$ precisely when $a \ge 0$. Thus, convex.


  • Multivariate Quadratic
    $f(\mathbf{x}) = \mathbf{x}^\top A \mathbf{x} + \mathbf{b}^\top \mathbf{x} + c, \quad A \succeq 0.$
    Convex when $A \succeq 0$ (with $A$ symmetric).
    Gradient: $\nabla f(\mathbf{x}) = 2A\mathbf{x} + \mathbf{b}$ for symmetric $A$; in general $\nabla f(\mathbf{x}) = (A + A^\top)\mathbf{x} + \mathbf{b}$.

Why It’s Convex: By the Second-Order Condition, we check the Hessian $2A$, which is positive semidefinite if $A \succeq 0$. Hence, $f$ is convex.
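Checking $A \succeq 0$ numerically amounts to checking the eigenvalues of the Hessian $A + A^\top$. A minimal NumPy sketch (the helper name `is_convex_quadratic` is my own):

```python
import numpy as np

def is_convex_quadratic(A, tol=1e-10):
    """f(x) = x^T A x + b^T x + c is convex iff the Hessian A + A^T
    is positive semidefinite (b and c never affect curvature)."""
    H = A + A.T                                  # Hessian of x^T A x
    return bool(np.all(np.linalg.eigvalsh(H) >= -tol))

A_psd = np.array([[2.0, 1.0], [1.0, 2.0]])       # eigenvalues 1 and 3
A_ind = np.array([[1.0, 0.0], [0.0, -1.0]])      # indefinite

print(is_convex_quadratic(A_psd))   # True
print(is_convex_quadratic(A_ind))   # False
```

`eigvalsh` is used because the Hessian is symmetric by construction; the small tolerance absorbs floating-point noise around zero eigenvalues.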


4.3 Norm Functions

  • $\ell_1$-Norm
    $f(\mathbf{x}) = \|\mathbf{x}\|_1 = \sum_{i=1}^n |x_i|.$
    Subgradient: $\operatorname{sign}(\mathbf{x})$ elementwise (any value in $[-1, 1]$ at coordinates where $x_i = 0$).

Why It’s Convex: The absolute value $|x_i|$ is convex (by Definition Check or First-Order Condition), and the $\ell_1$-norm is a sum of these convex terms. The sum rule preserves convexity.


  • $\ell_2$-Norm
    $f(\mathbf{x}) = \|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^n x_i^2}.$
    Gradient: $\nabla f(\mathbf{x}) = \frac{\mathbf{x}}{\|\mathbf{x}\|_2}$ (for $\mathbf{x} \neq \mathbf{0}$).

Why It’s Convex: Like every norm, $\|\mathbf{x}\|_2$ satisfies the triangle inequality and absolute homogeneity, which together give $\|\theta \mathbf{x} + (1-\theta)\mathbf{y}\|_2 \le \theta\|\mathbf{x}\|_2 + (1-\theta)\|\mathbf{y}\|_2$, which is exactly the Definition Check.


  • Max Norm ($\ell_\infty$-Norm)
    $f(\mathbf{x}) = \|\mathbf{x}\|_\infty = \max_i |x_i|.$
    Subgradient: depends on which coordinate(s) achieve the maximum in absolute value.

Why It’s Convex: A maximum of convex functions $\{|x_1|, \dots, |x_n|\}$ is still convex by the Pointwise Maximum rule. Since each $|x_i|$ is convex, $\max_i |x_i|$ must be convex.
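The three norms and one valid subgradient for each can be sketched in a few lines of NumPy. The particular choices at non-smooth points below are illustrative, since subgradients there are not unique:

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])

l1 = np.sum(np.abs(x))          # l1 norm: 8.0
l2 = np.sqrt(np.sum(x ** 2))    # l2 norm: sqrt(26)
linf = np.max(np.abs(x))        # l_inf norm: 4.0

g1 = np.sign(x)                 # subgradient of the l1 norm
g2 = x / np.linalg.norm(x)      # gradient of the l2 norm (x != 0)
i = np.argmax(np.abs(x))        # a coordinate achieving the max
ginf = np.zeros_like(x)
ginf[i] = np.sign(x[i])         # subgradient of the l_inf norm

# Sanity check: the subgradient inequality f(y) >= f(x) + g^T (y - x)
y = np.array([1.0, 2.0, -2.0])
assert np.max(np.abs(y)) >= linf + ginf @ (y - x)
```

The final assertion is the First-Order Condition stated with a subgradient in place of the gradient, which is how non-smooth convexity is certified in practice.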


4.4 Exponential and Logarithmic Functions

  • Exponential Function
    $f(x) = e^x.$
    Convex in $x$.
    Gradient: $\nabla f(x) = e^x$.

Why It’s Convex: The Second-Order Condition shows its second derivative $e^x$ is positive for all $x$. Thus, the Hessian (in one dimension) is always $> 0$, satisfying convexity.


  • Logarithmic Function
    $f(x) = \log(x), \quad x > 0.$
    $\log(x)$ is concave, but $-\log(x)$ is convex.
    Gradient: $\nabla f(x) = 1/x$.

Why It’s Convex: $\log(x)$ itself is concave; it is the negation $-\log(x)$ that is convex. By the Second-Order Condition, $\frac{d^2}{dx^2}\bigl(-\log(x)\bigr) = \frac{1}{x^2} \ge 0$, so $-\log(x)$ is convex.


4.5 Power Functions

  • Power Function
    $f(x) = x^p, \quad x \ge 0, \; p \ge 1.$
    Convex in $x$.
    Gradient: $\nabla f(x) = p x^{p-1}$.

Why It’s Convex: For $x \ge 0$, the Second-Order Condition confirms that the second derivative $p(p-1)x^{p-2}$ is $\ge 0$ when $p \ge 1$. This ensures convexity.


  • Reciprocal Function
    $f(x) = \frac{1}{x}, \quad x > 0.$
    Convex in $x$.
    Gradient: $\nabla f(x) = -\frac{1}{x^2}$.

Why It’s Convex: A quick Second-Order Condition check shows the second derivative $\frac{2}{x^3}$ is positive for $x > 0$. Alternatively, view $\frac{1}{x} = x^{-1}$ as a power function with exponent $p = -1$, for which $p(p-1) = 2 > 0$ on $(0, \infty)$.


4.6 Entropy and Related Functions

  • Negative Entropy
    $f(\mathbf{p}) = \sum_{i=1}^n p_i \log(p_i), \quad p_i \ge 0, \; \sum_i p_i = 1.$
    Convex in $\mathbf{p}$.
    Gradient: $\nabla f(\mathbf{p}) = 1 + \log(\mathbf{p})$ (elementwise).

Why It’s Convex: The Shannon entropy $-\sum p_i \log(p_i)$ is concave in $\mathbf{p}$, so its negation $\sum p_i \log(p_i)$ is convex. Alternatively, apply the Second-Order Condition directly: the Hessian is diagonal with entries $1/p_i > 0$, hence positive definite on the interior of the simplex.


  • Log-Sum-Exp Function
    $f(\mathbf{x}) = \log\bigl(\sum_{i=1}^n e^{x_i}\bigr).$
    Convex in $\mathbf{x}$ (though not strictly convex: it is affine along the direction $\mathbf{1}$, since $f(\mathbf{x} + t\mathbf{1}) = f(\mathbf{x}) + t$).
    Gradient: $\nabla f(\mathbf{x}) = \frac{e^{\mathbf{x}}}{\sum_{i=1}^n e^{x_i}}$ (elementwise), i.e., the softmax of $\mathbf{x}$.

Why It’s Convex: By the Second-Order Condition, its Hessian has a covariance-like structure that is positive semidefinite. Or see it as a smooth approximation to $\max_i x_i$, and $\max(\cdot)$ is convex.
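Because $e^{x_i}$ overflows for even moderately large $x_i$, log-sum-exp is almost always computed with the shift identity $\log\sum_i e^{x_i} = m + \log\sum_i e^{x_i - m}$ for $m = \max_i x_i$. A minimal NumPy sketch (SciPy ships an equivalent `scipy.special.logsumexp`):

```python
import numpy as np

def logsumexp(x):
    """Stable log(sum(exp(x))): shifting by the max keeps every
    exponent <= 0, so nothing can overflow."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

x = np.array([1000.0, 1000.0])
print(logsumexp(x))                  # 1000 + log(2); naive exp would overflow

# Its gradient is the softmax, which reuses the same shift:
softmax = np.exp(x - logsumexp(x))
print(softmax)                       # [0.5 0.5]
```

The gradient computation doubles as a numerically stable softmax, which is why the two functions are usually implemented together.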


4.7 Hinge Loss Function

  • Hinge Loss
    $f(x) = \max(0, \, 1 - y \cdot x), \quad y \in \{-1, +1\}.$
    Convex in $x$.
    Subgradient:

    $g(x) = \begin{cases} -y, & \text{if } 1 - y \cdot x > 0, \\ 0, & \text{if } 1 - y \cdot x < 0, \\ \text{any value between } -y \text{ and } 0, & \text{if } 1 - y \cdot x = 0. \end{cases}$

Why It’s Convex: A Pointwise Maximum of two affine/linear functions ($0$ and $1 - y \cdot x$) is convex. Also, each piece is linear, and the maximum of linear functions is convex by construction.
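A direct translation of the hinge loss and one valid subgradient choice. At the kink $1 - y \cdot x = 0$ any value between $-y$ and $0$ is admissible; the sketch below picks $0$ there:

```python
def hinge_loss(x, y):
    """max(0, 1 - y*x) for a score x and a label y in {-1, +1}."""
    return max(0.0, 1.0 - y * x)

def hinge_subgradient(x, y):
    """One valid subgradient; at the kink 1 - y*x == 0 we pick 0."""
    return -y if 1.0 - y * x > 0 else 0.0

print(hinge_loss(0.5, +1))          # 0.5  (inside the margin: penalized)
print(hinge_loss(2.0, +1))          # 0.0  (classified with margin: no loss)
print(hinge_subgradient(0.5, +1))   # -1   (pushes the score upward)
```

This is the per-example loss used in SVM-style training; a subgradient descent step simply subtracts a multiple of `hinge_subgradient` from the score's parameters.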


4.8 Absolute Value and Piecewise Linear Functions

  • Absolute Value
    $f(x) = |x|.$
    Convex in $x$.
    Subgradient: $\operatorname{sign}(x)$ for $x \neq 0$; any value in $[-1, 1]$ at $x = 0$.

Why It’s Convex: Using the Definition Check and the triangle inequality, for $\alpha, \beta \in \mathbb{R}_{\ge 0}$ with $\alpha + \beta = 1$:
$$f(\alpha x + \beta y) = |\alpha x + \beta y| \le |\alpha x| + |\beta y| = \alpha|x| + \beta|y| = \alpha f(x) + \beta f(y).$$
This directly satisfies the definition of convexity. Alternatively, one can note $|x| = \max(x, -x)$, a pointwise maximum of two linear functions, ensuring convexity. And away from $x = 0$, the Second-Order Condition also holds (the function is piecewise linear with zero curvature).

  • ReLU
    $f(x) = \max(0, x).$
    Convex in $x$.
    Subgradient: $g(x) = \begin{cases} 1, & x > 0, \\ 0, & x < 0, \\ \text{any value in } [0, 1], & x = 0. \end{cases}$

Why It’s Convex: ReLU is a Pointwise Maximum of the linear function $x$ and the constant $0$. Hence, by that rule, it is convex.


4.9 Log-Barrier and Soft Constraints

A log barrier function is used in optimization to enforce inequality constraints. It adds a term to the objective function that becomes very large (tends to infinity) as the solution approaches the boundary of the feasible region, effectively “penalizing” moves toward infeasibility.

Consider the problem of minimizing a simple quadratic function with the constraint that the variable stays positive. For example, suppose you want to minimize: $f(x) = (x-1)^2 \quad \text{subject to} \quad x > 0.$

A log barrier function adds a penalty term to the objective that grows large as $x$ approaches the boundary $x = 0$. In this case, the barrier term is $-\mu \ln(x)$, where $\mu > 0$ is a parameter that controls the strength of the barrier. The modified objective function becomes: $\min_x \; (x-1)^2 - \mu \ln(x).$

This formulation ensures that as $x$ gets closer to zero, the $-\mu \ln(x)$ term tends to infinity, discouraging solutions that violate the $x > 0$ constraint.
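For this particular toy problem the barrier subproblem can be solved in closed form, which makes the effect of shrinking $\mu$ easy to see: setting the derivative $2(x-1) - \mu/x$ to zero gives $2x^2 - 2x - \mu = 0$, whose positive root is taken below (the helper name is my own):

```python
import math

def barrier_minimizer(mu):
    """Positive root of 2x^2 - 2x - mu = 0, i.e. the minimizer of
    (x - 1)^2 - mu*ln(x) over x > 0."""
    return (1.0 + math.sqrt(1.0 + 2.0 * mu)) / 2.0

for mu in [1.0, 0.1, 0.01, 0.001]:
    x_star = barrier_minimizer(mu)
    # As mu -> 0, x* slides toward the unconstrained optimum x = 1,
    # always staying strictly inside the feasible region x > 0.
    print(mu, x_star)
```

This is the core idea behind interior-point methods: solve a sequence of barrier subproblems while driving $\mu$ toward zero.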

  • Log-Barrier Function
    $f(x) = -\log(x), \quad x > 0.$
    Convex in $x$.
    Gradient: $\nabla f(x) = -\frac{1}{x}$.

Why It’s Convex: As above, $\log(x)$ is concave, so $-\log(x)$ is convex. The Second-Order Condition reveals its second derivative $\frac{1}{x^2}$ is $> 0$ for $x > 0$.


  • Softplus Function
    $f(x) = \log(1 + e^x).$
    Convex in $x$.
    Gradient: $\nabla f(x) = \frac{1}{1 + e^{-x}}$ (the sigmoid function).

Why It’s Convex: This is the one-dimensional log-sum-exp, $\log(e^0 + e^x)$, which is convex as shown above. Equivalently, by the Second-Order Condition, the second derivative $\frac{e^x}{(1+e^x)^2}$ is nonnegative everywhere.
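Computing $\log(1 + e^x)$ naively overflows for large $x$; the standard rewrite $\log(1 + e^x) = \max(x, 0) + \log(1 + e^{-|x|})$ avoids this by ensuring the exponential only ever sees non-positive arguments. A small sketch:

```python
import math

def softplus(x):
    """Stable log(1 + e^x), via the identity
    log(1 + e^x) = max(x, 0) + log(1 + e^{-|x|})."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

print(softplus(0.0))       # log(2)
print(softplus(1000.0))    # ~1000.0 (the naive form would overflow)
print(softplus(-1000.0))   # ~0.0 (underflows gracefully)
```

`log1p` is used instead of `log(1 + …)` because it stays accurate when its argument is tiny.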


4.10 Distance Functions

  • Squared Euclidean Distance
    $f(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\|_2^2 = (\mathbf{x} - \mathbf{y})^\top (\mathbf{x} - \mathbf{y}).$
    Jointly convex in $(\mathbf{x}, \mathbf{y})$, and in particular convex in each argument separately.
    Gradient w.r.t. $\mathbf{x}$: $\nabla_{\mathbf{x}} f = 2(\mathbf{x} - \mathbf{y})$.

Why It’s Convex: This is a sum of squares $(x_i - y_i)^2$, each of which is convex (a convex quadratic composed with an affine map). Summation preserves convexity, so $\|\mathbf{x} - \mathbf{y}\|_2^2$ is convex in $\mathbf{x}$ (and in $\mathbf{y}$).


4.11 Matrix Norms and Functions

  • Frobenius Norm
    $f(X) = \|X\|_F = \sqrt{\sum_{i,j} X_{ij}^2}, \quad X \in \mathbb{R}^{m \times n}.$
    Convex in $X$.
    Gradient: $\nabla f(X) = \frac{X}{\|X\|_F}$ (for $X \neq 0$).

Why It’s Convex: It is the $\ell_2$-norm applied to the entries of $X$, and all norms are convex: the triangle inequality and absolute homogeneity together yield the convexity inequality of the Definition Check.


  • Log-Determinant Function
    $f(X) = -\log(\det(X)), \quad X \succ 0, \; X \in \mathbb{R}^{n \times n}.$
    Convex in $X$.
    Gradient: $\nabla f(X) = -X^{-1}$.

Why It’s Convex: $\log\det(X)$ is concave on the cone of positive definite matrices, so $-\log\det(X)$ is convex. A standard proof restricts $f$ to an arbitrary line $X + tV$ and shows the resulting one-dimensional function $-\sum_i \log(1 + t\lambda_i)$, with $\lambda_i$ the eigenvalues of $X^{-1/2} V X^{-1/2}$, is convex in $t$.
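In code, $-\log\det(X)$ should not be computed by forming $\det(X)$ first, since the determinant itself over- or underflows for large $n$; NumPy's `slogdet` returns the log-determinant directly. A sketch (the helper name is my own):

```python
import numpy as np

def neg_log_det(X):
    """-log det(X) for positive definite X, via slogdet so the
    determinant itself is never formed."""
    sign, logdet = np.linalg.slogdet(X)
    if sign <= 0:
        raise ValueError("X must be positive definite")
    return -logdet

X = np.array([[2.0, 0.5],
              [0.5, 1.0]])        # det = 1.75, so f(X) = -log(1.75)
print(neg_log_det(X))
print(-np.linalg.inv(X))          # the gradient -X^{-1}
```

A quick midpoint check ($f$ of an average never exceeding the average of $f$) is an easy numerical way to spot-check the claimed convexity on sample matrices.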


4.12 Cross-Entropy Loss Function

  • Cross-Entropy Loss
    $f(\mathbf{p}, \mathbf{y}) = -\sum_{i=1}^n y_i \log(p_i),$ where $p_i$ is a predicted probability and $y_i$ is the actual label. Often used in classification scenarios (e.g., multi-class).
    Gradient (w.r.t. $\mathbf{p}$): $\nabla f(\mathbf{p}) = -\frac{\mathbf{y}}{\mathbf{p}}$ (elementwise).

Why It’s Convex: With $\mathbf{y}$ fixed and $\mathbf{p}$ in the simplex, $-\sum y_i \log(p_i)$ is a nonnegative combination of the convex functions $-\log(p_i)$, so it is convex by the sum rule. The Second-Order Condition confirms it: the Hessian is diagonal with entries $y_i / p_i^2 \ge 0$.
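In multi-class classification the probabilities $\mathbf{p}$ usually come from a softmax over logits $\mathbf{z}$, and the composed loss has the well-known gradient $\mathbf{p} - \mathbf{y}$ with respect to $\mathbf{z}$. A minimal NumPy sketch (the `eps` clamp guarding $\log(0)$ is my own illustrative choice):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))      # shift for numerical stability
    return e / e.sum()

def cross_entropy(p, y, eps=1e-12):
    """-sum_i y_i log(p_i) for one-hot y; eps guards against log(0)."""
    return -np.sum(y * np.log(p + eps))

z = np.array([2.0, 1.0, 0.1])      # logits
y = np.array([1.0, 0.0, 0.0])      # one-hot true label
p = softmax(z)

print(cross_entropy(p, y))         # equals -log(p[0]) for this one-hot y
print(p - y)                       # gradient of the loss w.r.t. the logits z
```

The `p - y` identity is why softmax and cross-entropy are fused in practice: the composed gradient is cheaper and more stable than chaining the two pieces separately.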


Final Thoughts

This part has provided a comprehensive overview of convex loss functions, illustrating how their inherent properties contribute to robust and tractable optimization in diverse applications. With the foundation laid across the three parts of this series, from fundamental convexity criteria to duality and the KKT conditions, and finally to specialized loss functions, you are now equipped to tackle complex optimization challenges. For a refresher on earlier topics, feel free to revisit Part 1 and Part 2. I hope this expanded guide clarifies key aspects of convex optimization, from fundamental definitions to applying them to actual optimization challenges.

References & Further Reading

  1. Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.
  2. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning.
  3. Schölkopf, B., & Smola, A. (2001). Learning with Kernels. MIT Press.
  4. García-Gonzalo, E., et al. (2016). A Study on Kernel Methods for Classification.