Convex Optimization: Part 3

Tags: Convex Optimization · Mathematics · Machine Learning · KKT Conditions · Loss Functions
Convex optimization illustration

Image credit: Huang et al. (2022)

In this blog series, we are exploring:

  • How to determine if a function is convex.
  • Properties and operations that preserve convexity.
  • Standard forms of convex optimization problems.
  • The Karush-Kuhn-Tucker (KKT) conditions: what they are and when they apply.
  • Mathematical derivations, with an SVM-like example and various convex loss functions.
  • A broad survey of commonly used convex functions in optimization.

Series Division:

  • Part 1: Fundamentals, Convexity Criteria, and Examples.
  • Part 2: Duality, KKT Conditions, and Advanced Applications.
  • Part 3: Convex Loss Functions.

Building on the insights from Parts 1 and 2, in this final installment we shift our focus entirely to convex loss functions—the critical components that power many machine learning algorithms. In this section, we explore a variety of loss functions, from simple linear and quadratic formulations to more complex constructs like the hinge loss and cross-entropy loss. By dissecting their properties, gradients, and underlying convexity, we provide you with the essential tools to understand and effectively utilize these functions in practical optimization problems.

4. Common Convex Loss Functions

Many machine learning algorithms hinge on convex losses to ensure tractable optimization and theoretical guarantees. Beyond loss functions, many function families are used as building blocks in convex analysis and optimization problems. Below is a (non-exhaustive) list of important convex functions, along with their gradients or subgradients.

As a reminder, here is the general thinking process when verifying convexity:

  • Definition Check: Confirm that for any two points $x$ and $y$ in the domain and any $\theta \in [0,1]$, $f(\theta x + (1-\theta)y) \le \theta f(x) + (1-\theta)f(y)$.
  • First-Order Condition (if differentiable): Verify that $f(y) \ge f(x) + \nabla f(x)^\top (y - x)$ holds for all $x, y$.
  • Second-Order Condition (if twice differentiable): Check the Hessian is positive semidefinite everywhere.
  • Composition Rules: Recognize that sums, nonnegative scalings, affine transformations, pointwise maxima of convex functions, and suitable compositions preserve convexity.
  • Reformulation & Relaxation: If needed, rewrite the function in terms of simpler convex components or use known convex surrogates.

Below, each example is tied back to one or more of these rules to illustrate why it is convex.
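To make the Definition Check concrete, here is a small numerical sanity check. This is a sketch only: random sampling can disprove convexity by finding a violation of the inequality, but passing every trial never proves it. The function name `definition_check` and the sampling ranges are illustrative choices of mine, not from any library.

```python
import random

random.seed(0)  # reproducible sampling

def definition_check(f, lo=-5.0, hi=5.0, trials=1000, tol=1e-9):
    """Randomly test f(theta*x + (1-theta)*y) <= theta*f(x) + (1-theta)*f(y).
    A single violation disproves convexity; passing merely fails to disprove it."""
    for _ in range(trials):
        x, y = random.uniform(lo, hi), random.uniform(lo, hi)
        theta = random.random()
        lhs = f(theta * x + (1 - theta) * y)
        rhs = theta * f(x) + (1 - theta) * f(y)
        if lhs > rhs + tol:
            return False
    return True

print(definition_check(lambda x: x * x))   # quadratic: no violation found
print(definition_check(lambda x: x ** 3))  # cubic: violations found on (-5, 5)
```

The cubic fails because it is concave for $x < 0$, so the sampler quickly finds a violating pair of points there.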

4.1 Linear and Affine Functions

  • Linear Function
    $f(\mathbf{x}) = \mathbf{a}^\top \mathbf{x}, \quad \mathbf{a} \in \mathbb{R}^n.$
    This is trivially convex in $\mathbf{x}$.
    Gradient: $\nabla f(\mathbf{x}) = \mathbf{a}$.

Why It’s Convex: A linear function has zero curvature (its Hessian is the zero matrix). By the Second-Order Condition, a zero Hessian is positive semidefinite, hence convex. Also, by Definition Check, linear functions satisfy the convexity inequality exactly.


  • Affine Function
    $f(\mathbf{x}) = \mathbf{a}^\top \mathbf{x} + b, \quad \mathbf{a} \in \mathbb{R}^n, \; b \in \mathbb{R}.$
    Affine functions are both convex and concave (their Hessian is zero).
    Gradient: $\nabla f(\mathbf{x}) = \mathbf{a}$.

Why It’s Convex: An affine function is just a linear function plus a constant shift. The same reasoning (zero Hessian, or direct verification by definition) shows it is convex. In fact, it’s both convex and concave.


4.2 Quadratic Functions

  • Univariate Quadratic
    $f(x) = ax^2 + bx + c, \quad a \ge 0.$
    Convex if $a \ge 0$.
    Gradient: $\nabla f(x) = 2ax + b$.

Why It’s Convex: The Second-Order Condition in one dimension says a function is convex if its second derivative is nonnegative everywhere. Here the second derivative is $2a$, which is $\ge 0$ precisely when $a \ge 0$. Thus, convex.


  • Multivariate Quadratic
    $f(\mathbf{x}) = \mathbf{x}^\top A \mathbf{x} + \mathbf{b}^\top \mathbf{x} + c, \quad A \succeq 0.$
    Convex when $A \succeq 0$ (with $A$ symmetric).
    Gradient: $\nabla f(\mathbf{x}) = 2A\mathbf{x} + \mathbf{b}$ for symmetric $A$; in general $\nabla f(\mathbf{x}) = (A + A^\top)\mathbf{x} + \mathbf{b}$.

Why It’s Convex: By the Second-Order Condition, we check the Hessian $2A$, which is positive semidefinite if $A \succeq 0$. Hence, $f$ is convex.
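Checking $A \succeq 0$ numerically amounts to checking the eigenvalues of the Hessian $A + A^\top$. A minimal NumPy sketch (the helper name `is_convex_quadratic` is my own):

```python
import numpy as np

def is_convex_quadratic(A, tol=1e-10):
    """f(x) = x^T A x + b^T x + c is convex iff the Hessian A + A^T
    is positive semidefinite (b and c never affect curvature)."""
    H = A + A.T                                  # Hessian of x^T A x
    return bool(np.all(np.linalg.eigvalsh(H) >= -tol))

A_psd = np.array([[2.0, 1.0], [1.0, 2.0]])       # eigenvalues 1 and 3
A_ind = np.array([[1.0, 0.0], [0.0, -1.0]])      # indefinite

print(is_convex_quadratic(A_psd))   # True
print(is_convex_quadratic(A_ind))   # False
```

`eigvalsh` is used because the Hessian is symmetric by construction; the small tolerance absorbs floating-point noise around zero eigenvalues.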


4.3 Norm Functions

  • $\ell_1$-Norm
    $f(\mathbf{x}) = \|\mathbf{x}\|_1 = \sum_{i=1}^n |x_i|.$
    Subgradient: $\operatorname{sign}(\mathbf{x})$ elementwise (any value in $[-1, 1]$ at coordinates where $x_i = 0$).

Why It’s Convex: The absolute value $|x_i|$ is convex (by Definition Check or First-Order Condition), and the $\ell_1$-norm is a sum of these convex terms. The sum rule preserves convexity.


  • $\ell_2$-Norm
    $f(\mathbf{x}) = \|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^n x_i^2}.$
    Gradient: $\nabla f(\mathbf{x}) = \frac{\mathbf{x}}{\|\mathbf{x}\|_2}$ (for $\mathbf{x} \neq \mathbf{0}$).

Why It’s Convex: Like every norm, $\|\mathbf{x}\|_2$ satisfies the triangle inequality and absolute homogeneity, which together give $\|\theta \mathbf{x} + (1-\theta)\mathbf{y}\|_2 \le \theta\|\mathbf{x}\|_2 + (1-\theta)\|\mathbf{y}\|_2$, which is exactly the Definition Check.


  • Max Norm ($\ell_\infty$-Norm)
    $f(\mathbf{x}) = \|\mathbf{x}\|_\infty = \max_i |x_i|.$
    Subgradient: depends on which coordinate(s) achieve the maximum in absolute value.

Why It’s Convex: A maximum of convex functions $\{|x_1|, \dots, |x_n|\}$ is still convex by the Pointwise Maximum rule. Since each $|x_i|$ is convex, $\max_i |x_i|$ must be convex.
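The three norms and one valid subgradient for each can be sketched in a few lines of NumPy. The particular choices at non-smooth points below are illustrative, since subgradients there are not unique:

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])

l1 = np.sum(np.abs(x))          # l1 norm: 8.0
l2 = np.sqrt(np.sum(x ** 2))    # l2 norm: sqrt(26)
linf = np.max(np.abs(x))        # l_inf norm: 4.0

g1 = np.sign(x)                 # subgradient of the l1 norm
g2 = x / np.linalg.norm(x)      # gradient of the l2 norm (x != 0)
i = np.argmax(np.abs(x))        # a coordinate achieving the max
ginf = np.zeros_like(x)
ginf[i] = np.sign(x[i])         # subgradient of the l_inf norm

# Sanity check: the subgradient inequality f(y) >= f(x) + g^T (y - x)
y = np.array([1.0, 2.0, -2.0])
assert np.max(np.abs(y)) >= linf + ginf @ (y - x)
```

The final assertion is the First-Order Condition stated with a subgradient in place of the gradient, which is how non-smooth convexity is certified in practice.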


4.4 Exponential and Logarithmic Functions

  • Exponential Function
    $f(x) = e^x.$
    Convex in $x$.
    Gradient: $\nabla f(x) = e^x$.

Why It’s Convex: The Second-Order Condition shows its second derivative $e^x$ is positive for all $x$. Thus, the Hessian (in one dimension) is always $> 0$, satisfying convexity.


  • Logarithmic Function
    $f(x) = \log(x), \quad x > 0.$
    $\log(x)$ is concave, but $-\log(x)$ is convex.
    Gradient: $\nabla f(x) = 1/x$.

Why It’s Convex: $\log(x)$ itself is concave; it is the negation $-\log(x)$ that is convex. By the Second-Order Condition, $\frac{d^2}{dx^2}\bigl(-\log(x)\bigr) = \frac{1}{x^2} \ge 0$, so $-\log(x)$ is convex.


4.5 Power Functions

  • Power Function
    $f(x) = x^p, \quad x \ge 0, \; p \ge 1.$
    Convex in $x$.
    Gradient: $\nabla f(x) = p x^{p-1}$.

Why It’s Convex: For $x \ge 0$, the Second-Order Condition confirms that the second derivative $p(p-1)x^{p-2}$ is $\ge 0$ when $p \ge 1$. This ensures convexity.


  • Reciprocal Function
    $f(x) = \frac{1}{x}, \quad x > 0.$
    Convex in $x$.
    Gradient: $\nabla f(x) = -\frac{1}{x^2}$.

Why It’s Convex: A quick Second-Order Condition check shows the second derivative $\frac{2}{x^3}$ is positive for $x > 0$. Alternatively, view $\frac{1}{x} = x^{-1}$ as a power function with exponent $p = -1$, for which $p(p-1) = 2 > 0$ on $(0, \infty)$.


4.6 Entropy and Related Functions

  • Negative Entropy
    $f(\mathbf{p}) = \sum_{i=1}^n p_i \log(p_i), \quad p_i \ge 0, \; \sum_i p_i = 1.$
    Convex in $\mathbf{p}$.
    Gradient: $\nabla f(\mathbf{p}) = 1 + \log(\mathbf{p})$ (elementwise).

Why It’s Convex: The Shannon entropy $-\sum p_i \log(p_i)$ is concave in $\mathbf{p}$, so its negation $\sum p_i \log(p_i)$ is convex. Alternatively, apply the Second-Order Condition directly: the Hessian is diagonal with entries $1/p_i > 0$, hence positive definite on the interior of the simplex.


  • Log-Sum-Exp Function
    $f(\mathbf{x}) = \log\bigl(\sum_{i=1}^n e^{x_i}\bigr).$
    Convex in $\mathbf{x}$ (though not strictly convex: it is affine along the direction $\mathbf{1}$, since $f(\mathbf{x} + t\mathbf{1}) = f(\mathbf{x}) + t$).
    Gradient: $\nabla f(\mathbf{x}) = \frac{e^{\mathbf{x}}}{\sum_{i=1}^n e^{x_i}}$ (elementwise), i.e., the softmax of $\mathbf{x}$.

Why It’s Convex: By the Second-Order Condition, its Hessian has a covariance-like structure that is positive semidefinite. Or see it as a smooth approximation to $\max_i x_i$, and $\max(\cdot)$ is convex.
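Because $e^{x_i}$ overflows for even moderately large $x_i$, log-sum-exp is almost always computed with the shift identity $\log\sum_i e^{x_i} = m + \log\sum_i e^{x_i - m}$ for $m = \max_i x_i$. A minimal NumPy sketch (SciPy ships an equivalent `scipy.special.logsumexp`):

```python
import numpy as np

def logsumexp(x):
    """Stable log(sum(exp(x))): shifting by the max keeps every
    exponent <= 0, so nothing can overflow."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

x = np.array([1000.0, 1000.0])
print(logsumexp(x))                  # 1000 + log(2); naive exp would overflow

# Its gradient is the softmax, which reuses the same shift:
softmax = np.exp(x - logsumexp(x))
print(softmax)                       # [0.5 0.5]
```

The gradient computation doubles as a numerically stable softmax, which is why the two functions are usually implemented together.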


4.7 Hinge Loss Function

  • Hinge Loss
    $f(x) = \max(0, \, 1 - y \cdot x), \quad y \in \{-1, +1\}.$
    Convex in $x$.
    Subgradient:

    $g(x) = \begin{cases} -y, & \text{if } 1 - y \cdot x > 0, \\ 0, & \text{if } 1 - y \cdot x < 0, \\ \text{any value between } -y \text{ and } 0, & \text{if } 1 - y \cdot x = 0. \end{cases}$

Why It’s Convex: A Pointwise Maximum of two affine/linear functions ($0$ and $1 - y \cdot x$) is convex. Also, each piece is linear, and the maximum of linear functions is convex by construction.
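A direct translation of the hinge loss and one valid subgradient choice. At the kink $1 - y \cdot x = 0$ any value between $-y$ and $0$ is admissible; the sketch below picks $0$ there:

```python
def hinge_loss(x, y):
    """max(0, 1 - y*x) for a score x and a label y in {-1, +1}."""
    return max(0.0, 1.0 - y * x)

def hinge_subgradient(x, y):
    """One valid subgradient; at the kink 1 - y*x == 0 we pick 0."""
    return -y if 1.0 - y * x > 0 else 0.0

print(hinge_loss(0.5, +1))          # 0.5  (inside the margin: penalized)
print(hinge_loss(2.0, +1))          # 0.0  (classified with margin: no loss)
print(hinge_subgradient(0.5, +1))   # -1   (pushes the score upward)
```

This is the per-example loss used in SVM-style training; a subgradient descent step simply subtracts a multiple of `hinge_subgradient` from the score's parameters.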


4.8 Absolute Value and Piecewise Linear Functions

  • Absolute Value
    $f(x) = |x|.$
    Convex in $x$.
    Subgradient: $\operatorname{sign}(x)$ for $x \neq 0$; any value in $[-1, 1]$ at $x = 0$.

Why It’s Convex: Using the Definition Check and the triangle inequality, for $\alpha, \beta \in \mathbb{R}_{\ge 0}$ with $\alpha + \beta = 1$:
$$f(\alpha x + \beta y) = |\alpha x + \beta y| \le |\alpha x| + |\beta y| = \alpha|x| + \beta|y| = \alpha f(x) + \beta f(y).$$
This directly satisfies the definition of convexity. Alternatively, one can note $|x| = \max(x, -x)$, a pointwise maximum of two linear functions, ensuring convexity. And away from $x = 0$, the Second-Order Condition also holds (the function is piecewise linear with zero curvature).

  • ReLU
    $f(x) = \max(0, x).$
    Convex in $x$.
    Subgradient: $g(x) = \begin{cases} 1, & x > 0, \\ 0, & x < 0, \\ \text{any value in } [0, 1], & x = 0. \end{cases}$

Why It’s Convex: ReLU is a Pointwise Maximum of the linear function $x$ and the constant $0$. Hence, by that rule, it is convex.


4.9 Log-Barrier and Soft Constraints

A log barrier function is used in optimization to enforce inequality constraints. It adds a term to the objective function that becomes very large (tends to infinity) as the solution approaches the boundary of the feasible region, effectively “penalizing” moves toward infeasibility.

Consider the problem of minimizing a simple quadratic function with the constraint that the variable stays positive. For example, suppose you want to minimize: $f(x) = (x-1)^2 \quad \text{subject to} \quad x > 0.$

A log barrier function adds a penalty term to the objective that grows large as $x$ approaches the boundary $x = 0$. In this case, the barrier term is $-\mu \ln(x)$, where $\mu > 0$ is a parameter that controls the strength of the barrier. The modified objective function becomes: $\min_x \; (x-1)^2 - \mu \ln(x).$

This formulation ensures that as $x$ gets closer to zero, the $-\mu \ln(x)$ term tends to infinity, discouraging solutions that violate the $x > 0$ constraint.
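For this particular toy problem the barrier subproblem can be solved in closed form, which makes the effect of shrinking $\mu$ easy to see: setting the derivative $2(x-1) - \mu/x$ to zero gives $2x^2 - 2x - \mu = 0$, whose positive root is taken below (the helper name is my own):

```python
import math

def barrier_minimizer(mu):
    """Positive root of 2x^2 - 2x - mu = 0, i.e. the minimizer of
    (x - 1)^2 - mu*ln(x) over x > 0."""
    return (1.0 + math.sqrt(1.0 + 2.0 * mu)) / 2.0

for mu in [1.0, 0.1, 0.01, 0.001]:
    x_star = barrier_minimizer(mu)
    # As mu -> 0, x* slides toward the unconstrained optimum x = 1,
    # always staying strictly inside the feasible region x > 0.
    print(mu, x_star)
```

This is the core idea behind interior-point methods: solve a sequence of barrier subproblems while driving $\mu$ toward zero.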

  • Log-Barrier Function
    $f(x) = -\log(x), \quad x > 0.$
    Convex in $x$.
    Gradient: $\nabla f(x) = -\frac{1}{x}$.

Why It’s Convex: As above, $\log(x)$ is concave, so $-\log(x)$ is convex. The Second-Order Condition reveals its second derivative $\frac{1}{x^2}$ is $> 0$ for $x > 0$.


  • Softplus Function
    $f(x) = \log(1 + e^x).$
    Convex in $x$.
    Gradient: $\nabla f(x) = \frac{1}{1 + e^{-x}}$ (the sigmoid function).

Why It’s Convex: This is the one-dimensional log-sum-exp, $\log(e^0 + e^x)$, which is convex as shown above. Equivalently, by the Second-Order Condition, the second derivative $\frac{e^x}{(1+e^x)^2}$ is nonnegative everywhere.
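Computing $\log(1 + e^x)$ naively overflows for large $x$; the standard rewrite $\log(1 + e^x) = \max(x, 0) + \log(1 + e^{-|x|})$ avoids this by ensuring the exponential only ever sees non-positive arguments. A small sketch:

```python
import math

def softplus(x):
    """Stable log(1 + e^x), via the identity
    log(1 + e^x) = max(x, 0) + log(1 + e^{-|x|})."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

print(softplus(0.0))       # log(2)
print(softplus(1000.0))    # ~1000.0 (the naive form would overflow)
print(softplus(-1000.0))   # ~0.0 (underflows gracefully)
```

`log1p` is used instead of `log(1 + …)` because it stays accurate when its argument is tiny.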


4.10 Distance Functions

  • Squared Euclidean Distance
    $f(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\|_2^2 = (\mathbf{x} - \mathbf{y})^\top (\mathbf{x} - \mathbf{y}).$
    Jointly convex in $(\mathbf{x}, \mathbf{y})$, and in particular convex in each argument separately.
    Gradient w.r.t. $\mathbf{x}$: $\nabla_{\mathbf{x}} f = 2(\mathbf{x} - \mathbf{y})$.

Why It’s Convex: This is a sum of squares $(x_i - y_i)^2$, each of which is convex (a convex quadratic composed with an affine map). Summation preserves convexity, so $\|\mathbf{x} - \mathbf{y}\|_2^2$ is convex in $\mathbf{x}$ (and in $\mathbf{y}$).


4.11 Matrix Norms and Functions

  • Frobenius Norm
    $f(X) = \|X\|_F = \sqrt{\sum_{i,j} X_{ij}^2}, \quad X \in \mathbb{R}^{m \times n}.$
    Convex in $X$.
    Gradient: $\nabla f(X) = \frac{X}{\|X\|_F}$ (for $X \neq 0$).

Why It’s Convex: It is the $\ell_2$-norm applied to the entries of $X$, and all norms are convex: the triangle inequality and absolute homogeneity together yield the convexity inequality of the Definition Check.


  • Log-Determinant Function
    $f(X) = -\log(\det(X)), \quad X \succ 0, \; X \in \mathbb{R}^{n \times n}.$
    Convex in $X$.
    Gradient: $\nabla f(X) = -X^{-1}$.

Why It’s Convex: $\log\det(X)$ is concave on the cone of positive definite matrices, so $-\log\det(X)$ is convex. A standard proof restricts $f$ to an arbitrary line $X + tV$ and shows the resulting one-dimensional function $-\sum_i \log(1 + t\lambda_i)$, with $\lambda_i$ the eigenvalues of $X^{-1/2} V X^{-1/2}$, is convex in $t$.
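In code, $-\log\det(X)$ should not be computed by forming $\det(X)$ first, since the determinant itself over- or underflows for large $n$; NumPy's `slogdet` returns the log-determinant directly. A sketch (the helper name is my own):

```python
import numpy as np

def neg_log_det(X):
    """-log det(X) for positive definite X, via slogdet so the
    determinant itself is never formed."""
    sign, logdet = np.linalg.slogdet(X)
    if sign <= 0:
        raise ValueError("X must be positive definite")
    return -logdet

X = np.array([[2.0, 0.5],
              [0.5, 1.0]])        # det = 1.75, so f(X) = -log(1.75)
print(neg_log_det(X))
print(-np.linalg.inv(X))          # the gradient -X^{-1}
```

A quick midpoint check ($f$ of an average never exceeding the average of $f$) is an easy numerical way to spot-check the claimed convexity on sample matrices.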


4.12 Cross-Entropy Loss Function

  • Cross-Entropy Loss
    $f(\mathbf{p}, \mathbf{y}) = -\sum_{i=1}^n y_i \log(p_i),$ where $p_i$ is a predicted probability and $y_i$ is the actual label. Often used in classification scenarios (e.g., multi-class).
    Gradient (w.r.t. $\mathbf{p}$): $\nabla f(\mathbf{p}) = -\frac{\mathbf{y}}{\mathbf{p}}$ (elementwise).

Why It’s Convex: With $\mathbf{y}$ fixed and $\mathbf{p}$ in the simplex, $-\sum y_i \log(p_i)$ is a nonnegative combination of the convex functions $-\log(p_i)$, so it is convex by the sum rule. The Second-Order Condition confirms it: the Hessian is diagonal with entries $y_i / p_i^2 \ge 0$.
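In multi-class classification the probabilities $\mathbf{p}$ usually come from a softmax over logits $\mathbf{z}$, and the composed loss has the well-known gradient $\mathbf{p} - \mathbf{y}$ with respect to $\mathbf{z}$. A minimal NumPy sketch (the `eps` clamp guarding $\log(0)$ is my own illustrative choice):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))      # shift for numerical stability
    return e / e.sum()

def cross_entropy(p, y, eps=1e-12):
    """-sum_i y_i log(p_i) for one-hot y; eps guards against log(0)."""
    return -np.sum(y * np.log(p + eps))

z = np.array([2.0, 1.0, 0.1])      # logits
y = np.array([1.0, 0.0, 0.0])      # one-hot true label
p = softmax(z)

print(cross_entropy(p, y))         # equals -log(p[0]) for this one-hot y
print(p - y)                       # gradient of the loss w.r.t. the logits z
```

The `p - y` identity is why softmax and cross-entropy are fused in practice: the composed gradient is cheaper and more stable than chaining the two pieces separately.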


Final Thoughts

This part has provided a comprehensive overview of convex loss functions, illustrating how their inherent properties contribute to robust and tractable optimization in diverse applications. With the foundation laid across the three parts of this series, from fundamental convexity criteria to duality and the KKT conditions, and finally to specialized loss functions, you are now equipped to tackle complex optimization challenges. For a refresher on earlier topics, feel free to revisit Part 1 and Part 2. I hope this expanded guide clarifies key aspects of convex optimization, from fundamental definitions to applying them to actual optimization challenges.

References & Further Reading

  1. Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.
  2. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning.
  3. Schölkopf, B., & Smola, A. (2001). Learning with Kernels. MIT Press.
  4. García-Gonzalo, E., et al. (2016). A Study on Kernel Methods for Classification.