Multivariate Logistic Regression


Multivariate Logistic Regression: An Extension for Multiclass Problems

Multivariate logistic regression (often referred to as multinomial logistic regression) extends the binary logistic regression framework to problems where the outcome variable Y can take one of K classes. In this article, we introduce the model setup, detail the dimensions and assumptions, derive the log-likelihood and its derivative in a concise mathematical sequence, and discuss practical applications.


1. Model Setup

For each observation i = 1, \dots, n, let \mathbf{x}_i \in \mathbb{R}^{(p+1) \times 1} denote the predictor vector (including the intercept), defined as
\mathbf{x}_i = (1,\, x_{i1},\, x_{i2},\, \dots,\, x_{ip})^T.
In the binary case (K = 2), the response Y_i is simply a scalar in \{0, 1\}. For a multiclass setting with K > 2, we instead represent Y_i as a one-hot encoded vector in \{0,1\}^K: exactly one entry equals 1 and the remaining entries equal 0. For example, if K = 3 and the true class for observation i is 2, then Y_i = (0,\, 1,\, 0)^T.
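Concretely, the one-hot encoding can be built from integer labels in a few lines. A minimal NumPy sketch (the `one_hot` helper and the 0-indexed labels are illustrative choices, not part of the model; the article's "class 2" corresponds to index 1 here):

```python
import numpy as np

def one_hot(labels, K):
    """Map integer class labels (0-indexed) to an n x K one-hot matrix."""
    labels = np.asarray(labels)
    Y = np.zeros((labels.size, K))
    Y[np.arange(labels.size), labels] = 1.0
    return Y

# 0-indexed label 1 is the second of K = 3 classes, so its row is (0, 1, 0).
Y = one_hot([1, 0, 2], K=3)
```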

In multinomial logistic regression, we have a parameter vector for each class k = 1, \dots, K, where each \boldsymbol{\beta}_k \in \mathbb{R}^{(p+1) \times 1}. These parameter vectors are collected in the matrix
\boldsymbol{B} = \bigl[\boldsymbol{\beta}_1,\, \boldsymbol{\beta}_2,\, \dots,\, \boldsymbol{\beta}_K\bigr] \in \mathbb{R}^{(p+1) \times K}.
The model for the probability that Y_i belongs to class k is defined via the softmax function:
\Pr(Y_i = k \mid \mathbf{x}_i) = \frac{\exp\!\bigl(x_i^\top \boldsymbol{\beta}_k\bigr)}{\sum_{j=1}^{K} \exp\!\bigl(x_i^\top \boldsymbol{\beta}_j\bigr)}.

Softmax Probability for Class k

\Pr(Y_i = k \mid \mathbf{x}_i) = \pi_{ik} = \sigma_k(\mathbf{x}_i; \boldsymbol{B}) = \mathrm{softmax}_k(\mathbf{x}_i; \boldsymbol{B}) = \frac{\exp\!\bigl(x_i^\top \boldsymbol{\beta}_k\bigr)}{\sum_{j=1}^{K} \exp\!\bigl(x_i^\top \boldsymbol{\beta}_j\bigr)}
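In code, the softmax is usually computed after subtracting the row-wise maximum of the scores; the shift cancels in the ratio, so the probabilities are unchanged, but it prevents overflow in the exponentials. A minimal NumPy sketch (the function name and shapes are illustrative assumptions):

```python
import numpy as np

def softmax_probs(X, B):
    """Pi[i, k] = exp(x_i^T beta_k) / sum_j exp(x_i^T beta_j).

    X : (n, p+1) design matrix (first column all ones for the intercept)
    B : (p+1, K) parameter matrix [beta_1, ..., beta_K]
    """
    S = X @ B                             # (n, K) scores x_i^T beta_k
    S = S - S.max(axis=1, keepdims=True)  # stability shift; cancels in the ratio
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

# With B = 0, every class gets probability 1/K.
X = np.array([[1.0, 2.0], [1.0, -1.0]])
Pi = softmax_probs(X, np.zeros((2, 3)))
```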

2. Likelihood and Estimation

Assume the data are encoded in one-hot form so that y_{ik} = 1 if observation i belongs to class k and 0 otherwise, with \sum_{k=1}^{K} y_{ik} = 1. The likelihood is given by:

L(\boldsymbol{B}) = \prod_{i=1}^{n} \prod_{k=1}^{K} \pi_{ik}^{\,y_{ik}}

Taking the logarithm, we have the log-likelihood:

\ell(\boldsymbol{B}) = \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \ln\!\left(\frac{\exp\!\bigl(x_i^\top \boldsymbol{\beta}_k\bigr)}{\sum_{j=1}^{K} \exp\!\bigl(x_i^\top \boldsymbol{\beta}_j\bigr)}\right)

Expanding the logarithm yields:

\ell(\boldsymbol{B}) = \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \Bigl[ x_i^\top \boldsymbol{\beta}_k - \ln\!\Bigl(\sum_{j=1}^{K} \exp\!\bigl(x_i^\top \boldsymbol{\beta}_j\bigr)\Bigr) \Bigr]

Next, we distribute the sum \sum_{k=1}^{K} y_{ik} across the two terms:

\ell(\boldsymbol{B}) = \sum_{i=1}^{n} \left[ \sum_{k=1}^{K} y_{ik}\, x_i^\top \boldsymbol{\beta}_k \;-\; \sum_{k=1}^{K} y_{ik}\, \ln\!\left(\sum_{j=1}^{K} \exp\!\bigl(x_i^\top \boldsymbol{\beta}_j\bigr)\right) \right].

Since the term \ln\!\left(\sum_{j=1}^{K} \exp(x_i^\top \boldsymbol{\beta}_j)\right) does not depend on k, we can factor it out of the summation over k.

\sum_{k=1}^{K} y_{ik}\,\ln\!\left(\sum_{j=1}^{K} \exp(x_i^\top \boldsymbol{\beta}_j)\right) = \ln\!\left(\sum_{j=1}^{K} \exp(x_i^\top \boldsymbol{\beta}_j)\right) \sum_{k=1}^{K} y_{ik}.

Using the one-hot property \sum_{k=1}^{K} y_{ik} = 1 (recall that y_i is a one-hot encoded vector; for example, if K = 3 and the true class is 2, then y_i = (0,\,1,\,0)^T, so 0 + 1 + 0 = 1), we obtain

\ell(\boldsymbol{B}) = \sum_{i=1}^{n} \left[ \sum_{k=1}^{K} y_{ik}\, x_i^\top \boldsymbol{\beta}_k - \ln\!\left(\sum_{j=1}^{K} \exp(x_i^\top \boldsymbol{\beta}_j)\right) \right].
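This final form maps directly onto a vectorized implementation: the first term is a sum of entrywise products, and the second is a row-wise log-sum-exp. A NumPy sketch (computing the log-sum-exp with the usual max-shift for numerical stability; names are illustrative):

```python
import numpy as np

def log_likelihood(B, X, Y):
    """ell(B) = sum_i [ sum_k y_ik x_i^T beta_k - ln sum_j exp(x_i^T beta_j) ]."""
    S = X @ B                                 # (n, K) scores x_i^T beta_k
    m = S.max(axis=1, keepdims=True)          # shift for a stable log-sum-exp
    lse = m + np.log(np.exp(S - m).sum(axis=1, keepdims=True))
    return float(np.sum(Y * S) - np.sum(lse))

# Sanity check: at B = 0 every pi_ik = 1/K, so ell = -n * ln(K).
n, K = 5, 3
X = np.hstack([np.ones((n, 1)), np.arange(n, dtype=float).reshape(-1, 1)])
Y = np.eye(K)[np.array([0, 1, 2, 1, 0])]
ll0 = log_likelihood(np.zeros((2, K)), X, Y)
```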

3. Assumptions

The multinomial logistic regression model is built on several key assumptions:

  • Linearity in the Log-Odds: The log-odds for each class are modeled as a linear function of the predictors.
  • Independence: Each observation is assumed to be independent.
  • No Perfect Separation: No combination of predictors perfectly predicts the outcome.
  • Absence of High Multicollinearity: Excessively correlated predictors can lead to unstable estimates.
  • Correct Link Function: The softmax link function is appropriate for the multiclass outcome.

4. Derivation of the Log-Likelihood Derivative

Starting from

\ell(\boldsymbol{B}) = \sum_{i=1}^{n} \left[ \sum_{k=1}^{K} y_{ik}\, x_i^\top \boldsymbol{\beta}_k - \ln\!\left(\sum_{j=1}^{K} \exp\!\bigl(x_i^\top \boldsymbol{\beta}_j\bigr)\right) \right],

we take the derivative with respect to \boldsymbol{\beta}_r (for a fixed class r). In the first sum, only the k = r term depends on \boldsymbol{\beta}_r; all other terms differentiate to zero.

\frac{\partial \ell(\boldsymbol{B})}{\partial \boldsymbol{\beta}_r} = \sum_{i=1}^{n}\left[\frac{\partial}{\partial \boldsymbol{\beta}_r}\Bigl(y_{ir}\, x_i^\top \boldsymbol{\beta}_r\Bigr) - \frac{\partial}{\partial \boldsymbol{\beta}_r}\ln\!\left(\sum_{j=1}^{K} \exp(x_i^\top \boldsymbol{\beta}_j)\right)\right].

For the first term, we differentiate y_{ir}\, x_i^\top \boldsymbol{\beta}_r with respect to \boldsymbol{\beta}_r. Applying the basic rule of matrix calculus that, for a constant vector \mathbf{a}, \frac{\partial}{\partial \boldsymbol{\beta}} (\mathbf{a}^\top \boldsymbol{\beta}) = \mathbf{a}, we obtain:

\frac{\partial}{\partial \boldsymbol{\beta}_r}\Bigl(y_{ir}\, x_i^\top \boldsymbol{\beta}_r\Bigr) = y_{ir}\, x_i.

Similarly, differentiate the second term using the chain rule:

\frac{\partial}{\partial \boldsymbol{\beta}_r}\ln\!\left(\sum_{j=1}^{K} \exp(x_i^\top \boldsymbol{\beta}_j)\right) = \frac{1}{\sum_{j=1}^{K} \exp(x_i^\top \boldsymbol{\beta}_j)} \cdot \frac{\partial}{\partial \boldsymbol{\beta}_r}\exp(x_i^\top \boldsymbol{\beta}_r) = \frac{\exp(x_i^\top \boldsymbol{\beta}_r)}{\sum_{j=1}^{K} \exp(x_i^\top \boldsymbol{\beta}_j)}\, x_i.

Here, only the j = r term depends on \boldsymbol{\beta}_r; defining \pi_{ir} = \frac{\exp(x_i^\top \boldsymbol{\beta}_r)}{\sum_{j=1}^{K} \exp(x_i^\top \boldsymbol{\beta}_j)} gives the result \pi_{ir}\, x_i.

\frac{\partial}{\partial \boldsymbol{\beta}_r}\ln\!\left(\sum_{j=1}^{K} \exp\!\bigl(x_i^\top \boldsymbol{\beta}_j\bigr)\right) = \pi_{ir}\, x_i.

Thus, we have:

\frac{\partial \ell(\boldsymbol{B})}{\partial \boldsymbol{\beta}_r} = \sum_{i=1}^{n}\left[y_{ir}\, x_i - \pi_{ir}\, x_i\right] = \sum_{i=1}^{n}\left(y_{ir} - \pi_{ir}\right) x_i.

This is the gradient of the log-likelihood with respect to \boldsymbol{\beta}_r.

Define X \in \mathbb{R}^{n \times (p+1)} as the design matrix (each row x_i^\top), Y \in \mathbb{R}^{n \times K} as the one-hot encoded response matrix, and \Pi \in \mathbb{R}^{n \times K} with entries \pi_{ik}. Then, stacking the gradients for r = 1, \dots, K yields the overall gradient:

\nabla_{\boldsymbol{B}} \ell(\boldsymbol{B}) = X^\top (Y - \Pi).
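The compact expression X^⊤(Y − Π) is easy to implement and to verify numerically: a central finite difference of ℓ in a single coordinate of B should match the corresponding gradient entry. A self-contained sketch (the random data and the checked coordinate (1, 2) are arbitrary illustrative choices):

```python
import numpy as np

def softmax_probs(X, B):
    S = X @ B
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def log_likelihood(B, X, Y):
    S = X @ B
    m = S.max(axis=1, keepdims=True)
    lse = m + np.log(np.exp(S - m).sum(axis=1, keepdims=True))
    return float(np.sum(Y * S) - np.sum(lse))

def grad_log_likelihood(B, X, Y):
    """nabla_B ell(B) = X^T (Y - Pi), shape (p+1, K)."""
    return X.T @ (Y - softmax_probs(X, B))

rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])  # n=20, p=2
Y = np.eye(3)[rng.integers(0, 3, size=20)]                   # K=3, one-hot
B = rng.normal(size=(3, 3))

G = grad_log_likelihood(B, X, Y)
E = np.zeros_like(B)
E[1, 2] = 1.0                                                # perturb one entry
num = (log_likelihood(B + 1e-6 * E, X, Y)
       - log_likelihood(B - 1e-6 * E, X, Y)) / 2e-6
# num should agree with G[1, 2] up to finite-difference error.
```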

5. Gradient Ascent Update

Since our objective is to maximize the log-likelihood \ell(\boldsymbol{B}), we use gradient ascent. (If we were minimizing the negative log-likelihood, we would instead use gradient descent.) At iteration t, the update rule is:

\boldsymbol{B}^{(t+1)} = \boldsymbol{B}^{(t)} + \eta\, \nabla_{\boldsymbol{B}} \ell\bigl(\boldsymbol{B}^{(t)}\bigr),

where \eta > 0 is the learning rate. Using the vectorized gradient expression \nabla_{\boldsymbol{B}} \ell(\boldsymbol{B}) = X^\top (Y - \Pi), the update becomes:

\boldsymbol{B}^{(t+1)} = \boldsymbol{B}^{(t)} + \eta\, X^\top \bigl(Y - \Pi^{(t)}\bigr),

with \Pi^{(t)} denoting the softmax probability matrix at iteration t, whose (i,k)-th entry is
\pi_{ik}^{(t)} = \frac{\exp\!\bigl(x_i^\top \boldsymbol{\beta}_k^{(t)}\bigr)}{\sum_{j=1}^{K} \exp\!\bigl(x_i^\top \boldsymbol{\beta}_j^{(t)}\bigr)}.
Iterations continue until convergence, for example when the change in \ell(\boldsymbol{B}) falls below a preset threshold.
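Putting the update rule into a loop gives a bare-bones fitting routine. A sketch under simple assumptions (fixed learning rate η, zero initialization, change in ℓ as the stopping criterion; not a production implementation):

```python
import numpy as np

def softmax_probs(X, B):
    S = X @ B
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def fit_multinomial(X, Y, eta=0.002, tol=1e-8, max_iter=20_000):
    """Gradient ascent: B <- B + eta * X^T (Y - Pi), until ell stabilizes."""
    B = np.zeros((X.shape[1], Y.shape[1]))
    prev_ll = -np.inf
    for _ in range(max_iter):
        Pi = softmax_probs(X, B)
        ll = float(np.sum(Y * np.log(Pi + 1e-12)))  # current log-likelihood
        if ll - prev_ll < tol:                      # convergence check
            break
        prev_ll = ll
        B = B + eta * X.T @ (Y - Pi)                # ascent step
    return B

# Usage: three classes separated along a single predictor.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(m, 1.0, 30) for m in (-3.0, 0.0, 3.0)])
X = np.column_stack([np.ones(90), x])
labels = np.repeat([0, 1, 2], 30)
Y = np.eye(3)[labels]
B_hat = fit_multinomial(X, Y)
accuracy = (softmax_probs(X, B_hat).argmax(axis=1) == labels).mean()
```

In practice a line search, a decaying step size, or a second-order method (Newton/IRLS) converges faster than a fixed η, but the fixed-step loop above mirrors the update rule exactly as stated.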


6. Applications in Real-World Problems

Multinomial logistic regression is especially useful when the response variable can take more than two categories. Its real-world applications include:

  • Epidemiology: Predicting the type or stage of a disease based on multiple risk factors.
  • Marketing: Segmenting customers into distinct groups based on purchasing behavior.
  • Natural Language Processing: Classifying text documents into multiple categories.
  • Credit Scoring: Assessing the creditworthiness of applicants by categorizing them into risk levels.
  • Medical Diagnosis: Differentiating among various diagnoses using multiple clinical indicators.

Multinomial logistic regression stands out because it not only separates data into multiple categories but also provides clear, interpretable numbers that show how each predictor affects the outcome for each class. This clarity makes it a reliable choice for real-world problems—whether you're trying to predict disease stages, segment customers, classify texts, assess credit risk, or pinpoint a diagnosis. By breaking down the model structure and the math behind it, practitioners can confidently use this method, check that its assumptions hold, and trust the predictions it makes in everyday applications.


Conclusion

In this article, we extended the traditional binary logistic regression framework to the multivariate (multinomial) case for multiclass problems. We introduced the model by defining the predictor vector \mathbf{x}_i \in \mathbb{R}^{(p+1) \times 1}, the one-hot encoded response Y_i \in \{0,1\}^K, and the parameter matrix \boldsymbol{B} \in \mathbb{R}^{(p+1) \times K}. The softmax function was employed to model the probability of each class. We derived the log-likelihood and provided a detailed mathematical derivation of its gradient using matrix calculus and the chain rule. Finally, we presented the gradient ascent update rule for maximizing the log-likelihood, noting that if the negative log-likelihood were minimized instead, the update would follow a gradient descent scheme.

This concise overview establishes the theoretical foundation of multinomial logistic regression and its estimation via gradient-based optimization, setting the stage for applying these methods to practical multiclass classification problems in various fields.