

Vector and Matrix Derivatives: Essential Tools for Optimization

In many areas of machine learning and optimization, especially when dealing with gradient-based methods, it is crucial to know the derivatives of vector and matrix functions. In this article, we summarize some common derivative formulas, specify the dimensions involved, and discuss how these results are applied in practice.


Dimensions and Notation

We use the following dimensions throughout:

$$\mathbf{a},\ \mathbf{b},\ \mathbf{Y} \in \mathbb{R}^{n \times 1}, \qquad A \in \mathbb{R}^{n \times n}, \qquad B \in \mathbb{R}^{p \times n}, \qquad X \in \mathbb{R}^{n \times p}.$$


Vector Derivatives

Writing gradients as column vectors:

$$\frac{\partial \mathbf{a}^\top \mathbf{b}}{\partial \mathbf{a}} = \mathbf{b}, \qquad \frac{\partial \mathbf{b}^\top \mathbf{a}}{\partial \mathbf{a}} = \mathbf{b}, \qquad \frac{\partial \mathbf{a}^\top A \mathbf{a}}{\partial \mathbf{a}} = (A + A^\top)\,\mathbf{a}, \qquad \frac{\partial (A\,\mathbf{a})}{\partial \mathbf{a}} = A.$$

The last expression is the Jacobian of the linear map $\mathbf{a} \mapsto A\,\mathbf{a}$.
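These identities are easy to sanity-check numerically. The sketch below (a NumPy illustration, with arbitrary variable names not from the article) compares the analytic gradient of the quadratic form $\mathbf{a}^\top A \mathbf{a}$ against a central finite-difference approximation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
a = rng.standard_normal(n)

# Analytic gradient of f(a) = a^T A a with respect to a.
grad = (A + A.T) @ a

# Central finite-difference approximation of the same gradient.
eps = 1e-6
fd = np.zeros(n)
for i in range(n):
    e = np.zeros(n)
    e[i] = eps
    fd[i] = ((a + e) @ A @ (a + e) - (a - e) @ A @ (a - e)) / (2 * eps)

print(np.allclose(grad, fd, atol=1e-4))  # True
```

Because $f$ is quadratic, the central difference is exact up to floating-point rounding, so the two gradients agree to high precision.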

Matrix Derivatives

$$\frac{\partial \mathbf{a}^\top A \mathbf{b}}{\partial A} = \mathbf{a}\,\mathbf{b}^\top, \qquad \frac{\partial \mathbf{a}^\top A^\top \mathbf{b}}{\partial A} = \mathbf{b}\,\mathbf{a}^\top, \qquad \frac{\partial \mathbf{a}^\top A^\top A\,\mathbf{a}}{\partial A} = 2\,A\,\mathbf{a}\,\mathbf{a}^\top,$$

$$\frac{\partial \operatorname{Tr}(A)}{\partial A} = I, \qquad \frac{\partial \operatorname{Tr}(A^\top X)}{\partial X} = \frac{\partial \operatorname{Tr}(X^\top A)}{\partial X} = A,$$

$$\frac{\partial \operatorname{Tr}(A X B)}{\partial X} = A^\top B^\top, \qquad \frac{\partial \operatorname{Tr}(B^\top X^\top A)}{\partial X} = A\,B^\top.$$
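The trace derivatives can be verified the same way. A minimal NumPy sketch, with dimensions chosen to match the notation above ($A \in \mathbb{R}^{n \times n}$, $X \in \mathbb{R}^{n \times p}$, $B \in \mathbb{R}^{p \times n}$), checks $\partial \operatorname{Tr}(AXB)/\partial X = A^\top B^\top$ entry by entry:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 3, 5
A = rng.standard_normal((n, n))
B = rng.standard_normal((p, n))
X = rng.standard_normal((n, p))

# Analytic gradient of f(X) = Tr(A X B) with respect to X.
grad = A.T @ B.T

# Finite-difference check: perturb one entry of X at a time.
eps = 1e-6
fd = np.zeros_like(X)
for i in range(n):
    for j in range(p):
        E = np.zeros_like(X)
        E[i, j] = eps
        fd[i, j] = (np.trace(A @ (X + E) @ B)
                    - np.trace(A @ (X - E) @ B)) / (2 * eps)

print(np.allclose(grad, fd, atol=1e-4))  # True
```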

$L^2$ Norm Derivatives

$$\frac{\partial \|\mathbf{Y} - W X\|_2^2}{\partial W} = -2\,(\mathbf{Y} - W X)\,X^\top, \qquad \frac{\partial \|\mathbf{Y} - X W\|_2^2}{\partial W} = -2\,X^\top\,(\mathbf{Y} - X W),$$

where $W$ is a weight matrix whose dimensions are chosen so that each product is defined.
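The second of these gradients is the workhorse of least-squares regression: setting it to zero yields the normal equations $X^\top X\,W = X^\top \mathbf{Y}$. The NumPy sketch below (illustrative variable names) verifies the gradient by finite differences and confirms that the gradient vanishes at the normal-equations solution:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 6, 3
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)
W = rng.standard_normal(p)

# Analytic gradient of f(W) = ||Y - X W||_2^2 with respect to W.
grad = -2 * X.T @ (Y - X @ W)

# Central finite-difference approximation.
eps = 1e-6
fd = np.zeros(p)
for j in range(p):
    e = np.zeros(p)
    e[j] = eps
    fd[j] = (np.sum((Y - X @ (W + e)) ** 2)
             - np.sum((Y - X @ (W - e)) ** 2)) / (2 * eps)

# Setting the gradient to zero gives the normal equations X^T X W = X^T Y.
W_star = np.linalg.solve(X.T @ X, X.T @ Y)

print(np.allclose(grad, fd, atol=1e-4))                    # True
print(np.allclose(X.T @ (Y - X @ W_star), 0, atol=1e-8))   # True
```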

Other Common Derivatives

$$\frac{\partial \ln(\mathbf{x}^\top \mathbf{a})}{\partial \mathbf{x}} = \frac{\mathbf{a}}{\mathbf{x}^\top \mathbf{a}}, \qquad \frac{\partial \ln(\det(A))}{\partial A} = (A^{-1})^\top, \qquad \frac{\partial \|A\|_F^2}{\partial A} = 2A,$$

where $\|A\|_F$ denotes the Frobenius norm of $A$.
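The log-determinant identity, which appears in Gaussian log-likelihoods, can also be checked numerically. The sketch below uses a symmetric positive-definite matrix so that $\det(A) > 0$, and evaluates $\ln \det$ stably via `np.linalg.slogdet`:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)   # symmetric positive definite, so det(A) > 0

# Analytic gradient of f(A) = ln det(A) with respect to A.
grad = np.linalg.inv(A).T

# Finite-difference check, perturbing one entry of A at a time.
eps = 1e-6
fd = np.zeros_like(A)
for i in range(n):
    for j in range(n):
        E = np.zeros_like(A)
        E[i, j] = eps
        fd[i, j] = (np.linalg.slogdet(A + E)[1]
                    - np.linalg.slogdet(A - E)[1]) / (2 * eps)

print(np.allclose(grad, fd, atol=1e-4))  # True
```

Note that the perturbation here treats all $n^2$ entries of $A$ as independent; for a derivative constrained to symmetric matrices the formula takes a different form.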


Useful Resources

For further reading on matrix calculus, a highly recommended resource is the Matrix Cookbook. Other valuable references include: