Multivariate Logistic Regression: An Extension for Multiclass Problems
Multivariate logistic regression (often referred to as multinomial logistic regression) extends the binary logistic regression framework to problems where the outcome variable Y can take one of K classes. In this article, we introduce the model setup, detail the dimensions and assumptions, derive the log-likelihood and its derivative in a concise mathematical sequence, and discuss practical applications.
1. Model Setup
For each observation $i=1,\dots,n$, let $x_i \in \mathbb{R}^{(p+1)\times 1}$ denote the predictor vector (including the intercept), defined as
$$x_i = (1, x_{i1}, x_{i2}, \dots, x_{ip})^\top.$$
The response $Y_i$ is represented as a one-hot encoded vector in $\{0,1\}^K$; that is, $Y_i \in \mathbb{R}^K$ with exactly one entry equal to 1 and the remaining entries equal to 0.
Note that in the binary case ($K=2$), the response $Y_i$ would simply be a scalar in $\{0,1\}$. For a multiclass setting with $K>2$, however, we represent $Y_i$ as a one-hot encoded vector in $\{0,1\}^K$.
For example, if $K=3$ and the true class for observation $i$ is 2, then $Y_i = (0,1,0)^\top$.
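To make the encoding concrete, here is a minimal NumPy sketch (the variable names are ours, chosen for illustration):

```python
import numpy as np

# Class labels for n = 4 observations and K = 3 classes.
# Labels are 0-indexed here, so the article's "class 2" is label 1.
labels = np.array([1, 0, 2, 1])
K = 3

# One-hot encode: row i has a 1 in column labels[i] and zeros elsewhere.
Y = np.eye(K)[labels]
print(Y[0])  # [0. 1. 0.], matching the example above
```

Each row of `Y` sums to 1, which is exactly the one-hot property used later in the derivation.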
In multinomial logistic regression, we have a parameter vector for each class $k=1,\dots,K$, where each $\beta_k \in \mathbb{R}^{(p+1)\times 1}$. These parameter vectors are collected in the matrix $B = [\beta_1, \beta_2, \dots, \beta_K] \in \mathbb{R}^{(p+1)\times K}$. The model for the probability that $Y_i$ belongs to class $k$ is defined via the softmax function:
$$\Pr(Y_i = k \mid x_i) = \frac{\exp(x_i^\top \beta_k)}{\sum_{j=1}^{K} \exp(x_i^\top \beta_j)}.$$
Softmax Probability for Class k
$$\Pr(Y_i = k \mid x_i) = \pi_{ik} = \sigma_k(x_i; B) = \mathrm{softmax}_k(x_i; B) = \frac{\exp(x_i^\top \beta_k)}{\sum_{j=1}^{K} \exp(x_i^\top \beta_j)}$$
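The softmax probabilities can be computed in a few lines of NumPy. The sketch below subtracts the row-wise maximum logit before exponentiating, a standard numerical-stability trick that leaves the probabilities unchanged; all names are illustrative:

```python
import numpy as np

def softmax_probs(X, B):
    """Softmax probabilities pi_ik for design matrix X (n x (p+1))
    and parameter matrix B ((p+1) x K)."""
    Z = X @ B                                # logits x_i^T beta_k, shape (n, K)
    Z = Z - Z.max(axis=1, keepdims=True)     # stability: softmax is shift-invariant
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)  # each row sums to 1

rng = np.random.default_rng(0)
X = np.hstack([np.ones((5, 1)), rng.normal(size=(5, 2))])  # intercept + p = 2
B = rng.normal(size=(3, 4))                                # (p+1) = 3, K = 4
P = softmax_probs(X, B)
print(P.sum(axis=1))  # every row sums to 1 (up to floating point)
```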
2. Likelihood and Estimation
Assume the data are encoded in one-hot form so that $y_{ik} = 1$ if observation $i$ belongs to class $k$ and 0 otherwise, with $\sum_{k=1}^{K} y_{ik} = 1$. The likelihood is given by:
$$L(B) = \prod_{i=1}^{n} \prod_{k=1}^{K} \pi_{ik}^{\,y_{ik}}.$$
Taking the logarithm, we have the log-likelihood:
$$\ell(B) = \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \ln\!\left( \frac{\exp(x_i^\top \beta_k)}{\sum_{j=1}^{K} \exp(x_i^\top \beta_j)} \right).$$
Expanding the logarithm yields:
$$\ell(B) = \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \left[ x_i^\top \beta_k - \ln\!\left( \sum_{j=1}^{K} \exp(x_i^\top \beta_j) \right) \right].$$
Next, we distribute the sum $\sum_{k=1}^{K} y_{ik}$ across the two terms:
$$\ell(B) = \sum_{i=1}^{n} \left[ \sum_{k=1}^{K} y_{ik}\, x_i^\top \beta_k - \sum_{k=1}^{K} y_{ik} \ln\!\left( \sum_{j=1}^{K} \exp(x_i^\top \beta_j) \right) \right].$$
Since the term $\ln\!\left(\sum_{j=1}^{K} \exp(x_i^\top \beta_j)\right)$ does not depend on $k$, we can factor it out of the summation over $k$:
$$\sum_{k=1}^{K} y_{ik} \ln\!\left( \sum_{j=1}^{K} \exp(x_i^\top \beta_j) \right) = \ln\!\left( \sum_{j=1}^{K} \exp(x_i^\top \beta_j) \right) \sum_{k=1}^{K} y_{ik}.$$
Using the one-hot property, $\sum_{k=1}^{K} y_{ik} = 1$ (recall that $y_i$ is a one-hot encoded vector; for example, if $K=3$ and the true class is 2, then $y_i = (0,1,0)^\top$, so that $0+1+0 = 1$), we obtain
$$\ell(B) = \sum_{i=1}^{n} \left[ \sum_{k=1}^{K} y_{ik}\, x_i^\top \beta_k - \ln\!\left( \sum_{j=1}^{K} \exp(x_i^\top \beta_j) \right) \right].$$
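The final expression translates directly into code. The sketch below evaluates $\ell(B)$ using a stabilized log-sum-exp for the second term (an assumption on our part for numerical robustness; the function and variable names are illustrative):

```python
import numpy as np

def log_likelihood(X, Y, B):
    """ell(B) = sum_i [ sum_k y_ik x_i^T beta_k - ln(sum_j exp(x_i^T beta_j)) ]."""
    Z = X @ B                                          # logits, shape (n, K)
    m = Z.max(axis=1, keepdims=True)                   # stabilize log-sum-exp
    lse = m[:, 0] + np.log(np.exp(Z - m).sum(axis=1))  # ln(sum_j exp(x_i^T beta_j))
    return float(((Y * Z).sum(axis=1) - lse).sum())

rng = np.random.default_rng(0)
X = np.hstack([np.ones((4, 1)), rng.normal(size=(4, 2))])  # n = 4, p = 2
Y = np.eye(3)[np.array([0, 2, 1, 1])]                      # K = 3, one-hot
B = rng.normal(size=(3, 3))
print(log_likelihood(X, Y, B))  # negative, since each class probability is < 1
```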
3. Assumptions
The multinomial logistic regression model is built on several key assumptions:
- Linearity in the Log-Odds: The log-odds for each class are modeled as a linear function of the predictors.
- Independence: Each observation is assumed to be independent.
- No Perfect Separation: No combination of predictors perfectly predicts the outcome.
- Absence of High Multicollinearity: Excessively correlated predictors can lead to unstable estimates.
- Correct Link Function: The softmax link function is appropriate for the multiclass outcome.
4. Derivation of the Log-Likelihood Derivative
Starting from
$$\ell(B) = \sum_{i=1}^{n} \left[ \sum_{k=1}^{K} y_{ik}\, x_i^\top \beta_k - \ln\!\left( \sum_{j=1}^{K} \exp(x_i^\top \beta_j) \right) \right],$$
we take the derivative with respect to $\beta_r$ (for a fixed class $r$). In the first summation, only the $k=r$ term involves $\beta_r$; every other term differentiates to zero. Hence
$$\frac{\partial \ell(B)}{\partial \beta_r} = \sum_{i=1}^{n} \left[ \frac{\partial}{\partial \beta_r}\left( y_{ir}\, x_i^\top \beta_r \right) - \frac{\partial}{\partial \beta_r} \ln\!\left( \sum_{j=1}^{K} \exp(x_i^\top \beta_j) \right) \right].$$
For the first term, we differentiate
$$y_{ir}\, x_i^\top \beta_r$$
with respect to $\beta_r$. Applying the basic rule of matrix calculus, namely that for a constant vector $a$,
$$\frac{\partial}{\partial \beta}\left( a^\top \beta \right) = a,$$
we obtain:
$$\frac{\partial}{\partial \beta_r}\left( y_{ir}\, x_i^\top \beta_r \right) = y_{ir}\, x_i.$$
Similarly, we differentiate the second term using the chain rule:
$$\frac{\partial}{\partial \beta_r} \ln\!\left( \sum_{j=1}^{K} \exp(x_i^\top \beta_j) \right) = \frac{1}{\sum_{j=1}^{K} \exp(x_i^\top \beta_j)} \cdot \frac{\partial}{\partial \beta_r} \exp(x_i^\top \beta_r) = \frac{\exp(x_i^\top \beta_r)}{\sum_{j=1}^{K} \exp(x_i^\top \beta_j)}\, x_i.$$
Here, only the $j=r$ term depends on $\beta_r$; recognizing $\pi_{ir} = \frac{\exp(x_i^\top \beta_r)}{\sum_{j=1}^{K} \exp(x_i^\top \beta_j)}$ gives the result
$$\frac{\partial}{\partial \beta_r} \ln\!\left( \sum_{j=1}^{K} \exp(x_i^\top \beta_j) \right) = \pi_{ir}\, x_i.$$
Thus, we have:
$$\frac{\partial \ell(B)}{\partial \beta_r} = \sum_{i=1}^{n} \left[ y_{ir}\, x_i - \pi_{ir}\, x_i \right] = \sum_{i=1}^{n} \left( y_{ir} - \pi_{ir} \right) x_i.$$
This is the gradient of the log-likelihood with respect to $\beta_r$. To express the full gradient compactly, define $X \in \mathbb{R}^{n\times(p+1)}$ as the design matrix (whose $i$-th row is $x_i^\top$), $Y \in \mathbb{R}^{n\times K}$ as the one-hot encoded response matrix, and $\Pi \in \mathbb{R}^{n\times K}$ as the matrix with entries $\pi_{ik}$. Stacking the per-class gradients for $r=1,\dots,K$ as the columns of a $(p+1)\times K$ matrix then yields the overall gradient with respect to the parameter matrix $B$:
$$\nabla_B\, \ell(B) = X^\top (Y - \Pi).$$
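The vectorized gradient $X^\top (Y - \Pi)$ is one line of NumPy, and it can be sanity-checked against a central finite difference of the log-likelihood, a standard verification technique. This is a sketch with illustrative names, not code from the article:

```python
import numpy as np

def softmax_probs(X, B):
    Z = X @ B
    Z -= Z.max(axis=1, keepdims=True)        # numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def log_likelihood(X, Y, B):
    Z = X @ B
    m = Z.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(Z - m).sum(axis=1))
    return float(((Y * Z).sum(axis=1) - lse).sum())

def grad_log_likelihood(X, Y, B):
    """nabla_B ell(B) = X^T (Y - Pi)."""
    return X.T @ (Y - softmax_probs(X, B))

rng = np.random.default_rng(1)
X = np.hstack([np.ones((6, 1)), rng.normal(size=(6, 2))])
Y = np.eye(3)[rng.integers(0, 3, size=6)]
B = rng.normal(size=(3, 3))

# Compare one gradient entry against a central finite difference.
G = grad_log_likelihood(X, Y, B)
eps = 1e-6
Bp, Bm = B.copy(), B.copy()
Bp[0, 0] += eps
Bm[0, 0] -= eps
fd = (log_likelihood(X, Y, Bp) - log_likelihood(X, Y, Bm)) / (2 * eps)
print(abs(G[0, 0] - fd))  # tiny, confirming the closed-form gradient
```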
5. Gradient Ascent Update
Since our objective is to maximize the log-likelihood ℓ(B), we use gradient ascent. (If we were minimizing the negative log-likelihood, we would instead use gradient descent.) At iteration t, the update rule is:
$$B^{(t+1)} = B^{(t)} + \eta\, \nabla_B\, \ell\!\left(B^{(t)}\right),$$
where η>0 is the learning rate. Using the vectorized gradient expression
$$\nabla_B\, \ell(B) = X^\top (Y - \Pi),$$
the update becomes:
$$B^{(t+1)} = B^{(t)} + \eta\, X^\top \!\left(Y - \Pi^{(t)}\right),$$
with $\Pi^{(t)}$ denoting the softmax probability matrix at iteration $t$, whose $(i,k)$-th entry is given by
$$\pi_{ik}^{(t)} = \frac{\exp\!\left(x_i^\top \beta_k^{(t)}\right)}{\sum_{j=1}^{K} \exp\!\left(x_i^\top \beta_j^{(t)}\right)}.$$
This gradient ascent procedure maximizes the log-likelihood, and iterations continue until convergence (for example, when the change in ℓ(B) is below a preset threshold).
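The whole procedure can be sketched as a short training loop. Note one deliberate deviation from the update rule above: we scale the gradient by $1/n$, a common variant that makes the step size less sensitive to the sample size. The learning rate, iteration budget, toy data, and function names are all our own illustrative choices:

```python
import numpy as np

def softmax_probs(X, B):
    Z = X @ B
    Z -= Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def fit_multinomial(X, Y, eta=0.5, iters=2000, tol=1e-10):
    """Gradient ascent B <- B + eta * X^T (Y - Pi) / n until ell(B) stabilizes."""
    n, K = Y.shape
    B = np.zeros((X.shape[1], K))
    prev = -np.inf
    for _ in range(iters):
        P = softmax_probs(X, B)
        ll = float((Y * np.log(P + 1e-12)).sum())  # monitor the log-likelihood
        if abs(ll - prev) < tol:                   # converged: ell stopped changing
            break
        prev = ll
        B += eta * X.T @ (Y - P) / n               # ascent step
    return B

# Toy data: three well-separated 2-D clusters, 30 points each.
rng = np.random.default_rng(2)
centers = np.array([[-3.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
labels = np.repeat(np.arange(3), 30)
pts = centers[labels] + rng.normal(scale=0.5, size=(90, 2))
X = np.hstack([np.ones((90, 1)), pts])             # intercept column
Y = np.eye(3)[labels]

B_hat = fit_multinomial(X, Y)
pred = softmax_probs(X, B_hat).argmax(axis=1)
print((pred == labels).mean())  # near 1.0 on this separable toy problem
```

On separable data the unpenalized likelihood keeps improving as the coefficients grow, so in practice one stops on a tolerance or iteration budget as done here, or adds regularization.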
6. Applications in Real-World Problems
Multinomial logistic regression is especially useful when the response variable can take more than two categories. Its real-world applications include:
- Epidemiology: Predicting the type or stage of a disease based on multiple risk factors.
- Marketing: Segmenting customers into distinct groups based on purchasing behavior.
- Natural Language Processing: Classifying text documents into multiple categories.
- Credit Scoring: Assessing the creditworthiness of applicants by categorizing them into risk levels.
- Medical Diagnosis: Differentiating among various diagnoses using multiple clinical indicators.
Multinomial logistic regression stands out because it not only separates data into multiple categories but also provides clear, interpretable numbers that show how each predictor affects the outcome for each class. This clarity makes it a reliable choice for real-world problems—whether you're trying to predict disease stages, segment customers, classify texts, assess credit risk, or pinpoint a diagnosis. By breaking down the model structure and the math behind it, practitioners can confidently use this method, check that its assumptions hold, and trust the predictions it makes in everyday applications.
Conclusion
In this article, we extended the traditional binary logistic regression framework to the multivariate (multinomial) case for multiclass problems. We introduced the model by defining the predictor vector $x_i \in \mathbb{R}^{(p+1)\times 1}$, the one-hot encoded response $Y_i \in \{0,1\}^K$, and the parameter matrix $B \in \mathbb{R}^{(p+1)\times K}$. The softmax function was employed to model the probability of each class. We derived the log-likelihood and provided a detailed mathematical derivation of its gradient using matrix calculus and the chain rule. Finally, we presented the gradient ascent update rule for maximizing the log-likelihood, noting that if the negative log-likelihood were minimized instead, the update would follow a gradient descent scheme.
This concise overview establishes the theoretical foundation of multinomial logistic regression and its estimation via gradient-based optimization, setting the stage for applying these methods to practical multiclass classification problems in various fields.