Multivariate Logistic Regression


Multivariate Logistic Regression: An Extension for Multiclass Problems

Multivariate logistic regression (often referred to as multinomial logistic regression) extends the binary logistic regression framework to problems where the outcome variable Y can take one of K classes. In this article, we introduce the model setup, detail the dimensions and assumptions, derive the log-likelihood and its derivative in a concise mathematical sequence, and discuss practical applications.


1. Model Setup

For each observation i = 1, \dots, n, let \mathbf{x}_i \in \mathbb{R}^{(p+1) \times 1} denote the predictor vector (including the intercept), defined as
\mathbf{x}_i = (1,\, x_{i1},\, x_{i2},\, \dots,\, x_{ip})^T.
In the binary case (K = 2), the response Y_i is simply a scalar in \{0, 1\}. For a multiclass setting with K > 2, we instead represent Y_i as a one-hot encoded vector in \{0,1\}^K: exactly one entry equals 1 and the remaining entries equal 0. For example, if K = 3 and the true class for observation i is 2, then Y_i = (0,\, 1,\, 0)^T.
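Concretely, the one-hot encoding can be built from integer labels in a few lines. A minimal NumPy sketch (the `one_hot` helper and the 0-indexed labels are illustrative choices, not part of the model; the article's "class 2" corresponds to index 1 here):

```python
import numpy as np

def one_hot(labels, K):
    """Map integer class labels (0-indexed) to an n x K one-hot matrix."""
    labels = np.asarray(labels)
    Y = np.zeros((labels.size, K))
    Y[np.arange(labels.size), labels] = 1.0
    return Y

# 0-indexed label 1 is the second of K = 3 classes, so its row is (0, 1, 0).
Y = one_hot([1, 0, 2], K=3)
```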

In multinomial logistic regression, we have a parameter vector for each class k = 1, \dots, K, where each \boldsymbol{\beta}_k \in \mathbb{R}^{(p+1) \times 1}. These parameter vectors are collected in the matrix
\boldsymbol{B} = \bigl[\boldsymbol{\beta}_1,\, \boldsymbol{\beta}_2,\, \dots,\, \boldsymbol{\beta}_K\bigr] \in \mathbb{R}^{(p+1) \times K}.
The model for the probability that Y_i belongs to class k is defined via the softmax function:
\Pr(Y_i = k \mid \mathbf{x}_i) = \frac{\exp\!\bigl(x_i^\top \boldsymbol{\beta}_k\bigr)}{\sum_{j=1}^{K} \exp\!\bigl(x_i^\top \boldsymbol{\beta}_j\bigr)}.

Softmax Probability for Class k

\Pr(Y_i = k \mid \mathbf{x}_i) = \pi_{ik} = \sigma_k(\mathbf{x}_i; \boldsymbol{B}) = \mathrm{softmax}_k(\mathbf{x}_i; \boldsymbol{B}) = \frac{\exp\!\bigl(x_i^\top \boldsymbol{\beta}_k\bigr)}{\sum_{j=1}^{K} \exp\!\bigl(x_i^\top \boldsymbol{\beta}_j\bigr)}
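In code, the softmax is usually computed after subtracting the row-wise maximum of the scores; the shift cancels in the ratio, so the probabilities are unchanged, but it prevents overflow in the exponentials. A minimal NumPy sketch (the function name and shapes are illustrative assumptions):

```python
import numpy as np

def softmax_probs(X, B):
    """Pi[i, k] = exp(x_i^T beta_k) / sum_j exp(x_i^T beta_j).

    X : (n, p+1) design matrix (first column all ones for the intercept)
    B : (p+1, K) parameter matrix [beta_1, ..., beta_K]
    """
    S = X @ B                             # (n, K) scores x_i^T beta_k
    S = S - S.max(axis=1, keepdims=True)  # stability shift; cancels in the ratio
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

# With B = 0, every class gets probability 1/K.
X = np.array([[1.0, 2.0], [1.0, -1.0]])
Pi = softmax_probs(X, np.zeros((2, 3)))
```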

2. Likelihood and Estimation

Assume the data are encoded in one-hot form so that y_{ik} = 1 if observation i belongs to class k and 0 otherwise, with \sum_{k=1}^{K} y_{ik} = 1. The likelihood is given by:

L(\boldsymbol{B}) = \prod_{i=1}^{n} \prod_{k=1}^{K} \pi_{ik}^{\,y_{ik}}

Taking the logarithm, we have the log-likelihood:

\ell(\boldsymbol{B}) = \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \ln\!\left(\frac{\exp\!\bigl(x_i^\top \boldsymbol{\beta}_k\bigr)}{\sum_{j=1}^{K} \exp\!\bigl(x_i^\top \boldsymbol{\beta}_j\bigr)}\right)

Expanding the logarithm yields:

\ell(\boldsymbol{B}) = \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \Bigl[ x_i^\top \boldsymbol{\beta}_k - \ln\!\Bigl(\sum_{j=1}^{K} \exp\!\bigl(x_i^\top \boldsymbol{\beta}_j\bigr)\Bigr) \Bigr]

Next, we distribute the sum \sum_{k=1}^{K} y_{ik} across the two terms:

\ell(\boldsymbol{B}) = \sum_{i=1}^{n} \left[ \sum_{k=1}^{K} y_{ik}\, x_i^\top \boldsymbol{\beta}_k \;-\; \sum_{k=1}^{K} y_{ik}\, \ln\!\left(\sum_{j=1}^{K} \exp\!\bigl(x_i^\top \boldsymbol{\beta}_j\bigr)\right) \right].

Since the term \ln\!\left(\sum_{j=1}^{K} \exp(x_i^\top \boldsymbol{\beta}_j)\right) does not depend on k, we can factor it out of the summation over k.

\sum_{k=1}^{K} y_{ik}\,\ln\!\left(\sum_{j=1}^{K} \exp(x_i^\top \boldsymbol{\beta}_j)\right) = \ln\!\left(\sum_{j=1}^{K} \exp(x_i^\top \boldsymbol{\beta}_j)\right) \sum_{k=1}^{K} y_{ik}.

Using the one-hot property \sum_{k=1}^{K} y_{ik} = 1 (recall that y_i is a one-hot encoded vector; for example, if K = 3 and the true class is 2, then y_i = (0,\,1,\,0)^T, so 0 + 1 + 0 = 1), we obtain

\ell(\boldsymbol{B}) = \sum_{i=1}^{n} \left[ \sum_{k=1}^{K} y_{ik}\, x_i^\top \boldsymbol{\beta}_k - \ln\!\left(\sum_{j=1}^{K} \exp(x_i^\top \boldsymbol{\beta}_j)\right) \right].
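This final form maps directly onto a vectorized implementation: the first term is a sum of entrywise products, and the second is a row-wise log-sum-exp. A NumPy sketch (computing the log-sum-exp with the usual max-shift for numerical stability; names are illustrative):

```python
import numpy as np

def log_likelihood(B, X, Y):
    """ell(B) = sum_i [ sum_k y_ik x_i^T beta_k - ln sum_j exp(x_i^T beta_j) ]."""
    S = X @ B                                 # (n, K) scores x_i^T beta_k
    m = S.max(axis=1, keepdims=True)          # shift for a stable log-sum-exp
    lse = m + np.log(np.exp(S - m).sum(axis=1, keepdims=True))
    return float(np.sum(Y * S) - np.sum(lse))

# Sanity check: at B = 0 every pi_ik = 1/K, so ell = -n * ln(K).
n, K = 5, 3
X = np.hstack([np.ones((n, 1)), np.arange(n, dtype=float).reshape(-1, 1)])
Y = np.eye(K)[np.array([0, 1, 2, 1, 0])]
ll0 = log_likelihood(np.zeros((2, K)), X, Y)
```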

3. Assumptions

The multinomial logistic regression model is built on several key assumptions:

  • Linearity in the Log-Odds: The log-odds for each class are modeled as a linear function of the predictors.
  • Independence: Each observation is assumed to be independent.
  • No Perfect Separation: No combination of predictors perfectly predicts the outcome.
  • Absence of High Multicollinearity: Excessively correlated predictors can lead to unstable estimates.
  • Correct Link Function: The softmax link function is appropriate for the multiclass outcome.

4. Derivation of the Log-Likelihood Derivative

Starting from

\ell(\boldsymbol{B}) = \sum_{i=1}^{n} \left[ \sum_{k=1}^{K} y_{ik}\, x_i^\top \boldsymbol{\beta}_k - \ln\!\left(\sum_{j=1}^{K} \exp\!\bigl(x_i^\top \boldsymbol{\beta}_j\bigr)\right) \right],

we take the derivative with respect to \boldsymbol{\beta}_r (for a fixed class r). In the first sum, only the k = r term depends on \boldsymbol{\beta}_r; all other terms differentiate to zero.

\frac{\partial \ell(\boldsymbol{B})}{\partial \boldsymbol{\beta}_r} = \sum_{i=1}^{n}\left[\frac{\partial}{\partial \boldsymbol{\beta}_r}\Bigl(y_{ir}\, x_i^\top \boldsymbol{\beta}_r\Bigr) - \frac{\partial}{\partial \boldsymbol{\beta}_r}\ln\!\left(\sum_{j=1}^{K} \exp(x_i^\top \boldsymbol{\beta}_j)\right)\right].

For the first term, we differentiate y_{ir}\, x_i^\top \boldsymbol{\beta}_r with respect to \boldsymbol{\beta}_r. Applying the basic rule of matrix calculus that, for a constant vector \mathbf{a}, \frac{\partial}{\partial \boldsymbol{\beta}} (\mathbf{a}^\top \boldsymbol{\beta}) = \mathbf{a}, we obtain:

\frac{\partial}{\partial \boldsymbol{\beta}_r}\Bigl(y_{ir}\, x_i^\top \boldsymbol{\beta}_r\Bigr) = y_{ir}\, x_i.

Similarly, differentiate the second term using the chain rule:

\frac{\partial}{\partial \boldsymbol{\beta}_r}\ln\!\left(\sum_{j=1}^{K} \exp(x_i^\top \boldsymbol{\beta}_j)\right) = \frac{1}{\sum_{j=1}^{K} \exp(x_i^\top \boldsymbol{\beta}_j)} \cdot \frac{\partial}{\partial \boldsymbol{\beta}_r}\exp(x_i^\top \boldsymbol{\beta}_r) = \frac{\exp(x_i^\top \boldsymbol{\beta}_r)}{\sum_{j=1}^{K} \exp(x_i^\top \boldsymbol{\beta}_j)}\, x_i.

Here, only the j = r term depends on \boldsymbol{\beta}_r; defining \pi_{ir} = \frac{\exp(x_i^\top \boldsymbol{\beta}_r)}{\sum_{j=1}^{K} \exp(x_i^\top \boldsymbol{\beta}_j)} gives the result \pi_{ir}\, x_i.

\frac{\partial}{\partial \boldsymbol{\beta}_r}\ln\!\left(\sum_{j=1}^{K} \exp\!\bigl(x_i^\top \boldsymbol{\beta}_j\bigr)\right) = \pi_{ir}\, x_i.

Thus, we have:

\frac{\partial \ell(\boldsymbol{B})}{\partial \boldsymbol{\beta}_r} = \sum_{i=1}^{n}\left[y_{ir}\, x_i - \pi_{ir}\, x_i\right] = \sum_{i=1}^{n}\left(y_{ir} - \pi_{ir}\right) x_i.

This is the gradient of the log-likelihood with respect to \boldsymbol{\beta}_r.

Define X \in \mathbb{R}^{n \times (p+1)} as the design matrix (each row x_i^\top), Y \in \mathbb{R}^{n \times K} as the one-hot encoded response matrix, and \Pi \in \mathbb{R}^{n \times K} with entries \pi_{ik}. Then, stacking the gradients for r = 1, \dots, K yields the overall gradient:

\nabla_{\boldsymbol{B}} \ell(\boldsymbol{B}) = X^\top (Y - \Pi).
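The compact expression X^⊤(Y − Π) is easy to implement and to verify numerically: a central finite difference of ℓ in a single coordinate of B should match the corresponding gradient entry. A self-contained sketch (the random data and the checked coordinate (1, 2) are arbitrary illustrative choices):

```python
import numpy as np

def softmax_probs(X, B):
    S = X @ B
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def log_likelihood(B, X, Y):
    S = X @ B
    m = S.max(axis=1, keepdims=True)
    lse = m + np.log(np.exp(S - m).sum(axis=1, keepdims=True))
    return float(np.sum(Y * S) - np.sum(lse))

def grad_log_likelihood(B, X, Y):
    """nabla_B ell(B) = X^T (Y - Pi), shape (p+1, K)."""
    return X.T @ (Y - softmax_probs(X, B))

rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])  # n=20, p=2
Y = np.eye(3)[rng.integers(0, 3, size=20)]                   # K=3, one-hot
B = rng.normal(size=(3, 3))

G = grad_log_likelihood(B, X, Y)
E = np.zeros_like(B)
E[1, 2] = 1.0                                                # perturb one entry
num = (log_likelihood(B + 1e-6 * E, X, Y)
       - log_likelihood(B - 1e-6 * E, X, Y)) / 2e-6
# num should agree with G[1, 2] up to finite-difference error.
```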

5. Gradient Ascent Update

Since our objective is to maximize the log-likelihood \ell(\boldsymbol{B}), we use gradient ascent. (If we were minimizing the negative log-likelihood, we would instead use gradient descent.) At iteration t, the update rule is:

\boldsymbol{B}^{(t+1)} = \boldsymbol{B}^{(t)} + \eta\, \nabla_{\boldsymbol{B}} \ell\bigl(\boldsymbol{B}^{(t)}\bigr),

where \eta > 0 is the learning rate. Using the vectorized gradient expression \nabla_{\boldsymbol{B}} \ell(\boldsymbol{B}) = X^\top (Y - \Pi), the update becomes:

\boldsymbol{B}^{(t+1)} = \boldsymbol{B}^{(t)} + \eta\, X^\top \bigl(Y - \Pi^{(t)}\bigr),

with \Pi^{(t)} denoting the softmax probability matrix at iteration t, whose (i,k)-th entry is
\pi_{ik}^{(t)} = \frac{\exp\!\bigl(x_i^\top \boldsymbol{\beta}_k^{(t)}\bigr)}{\sum_{j=1}^{K} \exp\!\bigl(x_i^\top \boldsymbol{\beta}_j^{(t)}\bigr)}.
Iterations continue until convergence, for example when the change in \ell(\boldsymbol{B}) falls below a preset threshold.
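Putting the update rule into a loop gives a bare-bones fitting routine. A sketch under simple assumptions (fixed learning rate η, zero initialization, change in ℓ as the stopping criterion; not a production implementation):

```python
import numpy as np

def softmax_probs(X, B):
    S = X @ B
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def fit_multinomial(X, Y, eta=0.002, tol=1e-8, max_iter=20_000):
    """Gradient ascent: B <- B + eta * X^T (Y - Pi), until ell stabilizes."""
    B = np.zeros((X.shape[1], Y.shape[1]))
    prev_ll = -np.inf
    for _ in range(max_iter):
        Pi = softmax_probs(X, B)
        ll = float(np.sum(Y * np.log(Pi + 1e-12)))  # current log-likelihood
        if ll - prev_ll < tol:                      # convergence check
            break
        prev_ll = ll
        B = B + eta * X.T @ (Y - Pi)                # ascent step
    return B

# Usage: three classes separated along a single predictor.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(m, 1.0, 30) for m in (-3.0, 0.0, 3.0)])
X = np.column_stack([np.ones(90), x])
labels = np.repeat([0, 1, 2], 30)
Y = np.eye(3)[labels]
B_hat = fit_multinomial(X, Y)
accuracy = (softmax_probs(X, B_hat).argmax(axis=1) == labels).mean()
```

In practice a line search, a decaying step size, or a second-order method (Newton/IRLS) converges faster than a fixed η, but the fixed-step loop above mirrors the update rule exactly as stated.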


6. Applications in Real-World Problems

Multinomial logistic regression is especially useful when the response variable can take more than two categories. Its real-world applications include:

  • Epidemiology: Predicting the type or stage of a disease based on multiple risk factors.
  • Marketing: Segmenting customers into distinct groups based on purchasing behavior.
  • Natural Language Processing: Classifying text documents into multiple categories.
  • Credit Scoring: Assessing the creditworthiness of applicants by categorizing them into risk levels.
  • Medical Diagnosis: Differentiating among various diagnoses using multiple clinical indicators.

Multinomial logistic regression stands out because it not only separates data into multiple categories but also provides clear, interpretable numbers that show how each predictor affects the outcome for each class. This clarity makes it a reliable choice for real-world problems—whether you're trying to predict disease stages, segment customers, classify texts, assess credit risk, or pinpoint a diagnosis. By breaking down the model structure and the math behind it, practitioners can confidently use this method, check that its assumptions hold, and trust the predictions it makes in everyday applications.


Conclusion

In this article, we extended the traditional binary logistic regression framework to the multivariate (multinomial) case for multiclass problems. We introduced the model by defining the predictor vector \mathbf{x}_i \in \mathbb{R}^{(p+1) \times 1}, the one-hot encoded response Y_i \in \{0,1\}^K, and the parameter matrix \boldsymbol{B} \in \mathbb{R}^{(p+1) \times K}. The softmax function was employed to model the probability of each class. We derived the log-likelihood and provided a detailed mathematical derivation of its gradient using matrix calculus and the chain rule. Finally, we presented the gradient ascent update rule for maximizing the log-likelihood, noting that if the negative log-likelihood were minimized instead, the update would follow a gradient descent scheme.

This concise overview establishes the theoretical foundation of multinomial logistic regression and its estimation via gradient-based optimization, setting the stage for applying these methods to practical multiclass classification problems in various fields.