Logistic Regression: A Fundamental Classification Method
Logistic regression is a fundamental method for binary classification. Binary means having two outcomes, such as A or B, or 0 or 1. It models the probability of belonging to a certain class (often labeled "1") using the logistic function (i.e., σ(z) = 1 / (1 + exp(−z))). Before we dive into the details of parameter estimation, it is important to understand the two main approaches to logistic regression. This article covers the Frequentist derivation (via Maximum Likelihood Estimation), introduces the Bayesian viewpoint (placing priors on coefficients), and finally discusses the assumptions behind logistic regression and strategies to check them.
Key Logistic Regression Framework
At its core, logistic regression predicts the probability that a binary outcome Y ∈ {0, 1} is "1" given predictors x_1, x_2, …, x_p. Denote x_i = (1, x_{i1}, x_{i2}, …, x_{ip})^⊤ for observation i (including the intercept). The model is written as:
Component Form:
logit(Pr(Y_i = 1 | x_i)) = β_0 + β_1 x_{i1} + ⋯ + β_p x_{ip}
Vectorized Form:
logit(π_i) = x_i^⊤ β,  where β = (β_0, β_1, …, β_p)^⊤
where the logit function logit(p) = ln(p / (1 − p)) serves as the link function in logistic regression, mapping probabilities to the real line.
The inverse of this link is the sigmoid function, which satisfies σ(logit(p))=p and logit(σ(z))=z.
Component Form:
Pr(Y_i = 1 | x_i) = 1 / (1 + exp(−(β_0 + β_1 x_{i1} + ⋯ + β_p x_{ip})))
Vectorized Form:
π_i = σ(x_i^⊤ β) = 1 / (1 + exp(−x_i^⊤ β))
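As a quick numerical check, the logit/sigmoid relationship above can be sketched in a few lines of Python (a minimal illustration, not from the original article):

```python
import math

def sigmoid(z):
    # Logistic function 1 / (1 + exp(-z)), split by sign to avoid overflow
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def logit(p):
    # Log-odds ln(p / (1 - p)), the inverse of the sigmoid
    return math.log(p / (1.0 - p))

# The two functions undo each other: sigmoid(logit(p)) == p
p = 0.8
assert abs(sigmoid(logit(p)) - p) < 1e-12
assert abs(logit(sigmoid(2.0)) - 2.0) < 1e-9
```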
1. Frequentist Approach
In the Frequentist setting, we treat the parameters β = (β_0, β_1, …, β_p)^⊤ as fixed but unknown. We estimate them by maximizing the likelihood of the observed data under the logistic model.
Likelihood and MLE
Given n independent observations {(x_i, y_i)}_{i=1}^n, with y_i ∈ {0, 1}, the probability of y_i = 1 is π_i = σ(x_i^⊤ β). The likelihood is:
Component Form:
L(β) = ∏_{i=1}^n π_i^{y_i} (1 − π_i)^{1 − y_i}
Log-Likelihood:
ℓ(β) = ∑_{i=1}^n [ y_i ln(π_i) + (1 − y_i) ln(1 − π_i) ]
Derivative of the Log-Likelihood
To maximize ℓ(β), compute the gradient ∂ℓ/∂β:
1. Log-Likelihood Function
For a dataset of n observations, each with predictor vector x_i and binary outcome y_i ∈ {0, 1}, the log-likelihood under the logistic regression model is:
ℓ(β) = ∑_{i=1}^n [ y_i ln(σ(x_i^⊤ β)) + (1 − y_i) ln(1 − σ(x_i^⊤ β)) ]
2. Sigmoid Derivative
We first recall how to differentiate σ(z):
dσ(z)/dz = σ(z)(1 − σ(z)).
This property is crucial when applying the chain rule to the log-likelihood terms.
3. Differentiating Each Term
Consider the i-th term of the sum:
L_i(β) = y_i ln(σ(x_i^⊤ β)) + (1 − y_i) ln(1 − σ(x_i^⊤ β)).
Let z_i = x_i^⊤ β, so that σ(x_i^⊤ β) = σ(z_i). Using the chain rule:
1. Derivative of ln(σ(z_i)):
∂/∂β [ln(σ(z_i))] = (1 / σ(z_i)) · σ(z_i)(1 − σ(z_i)) · x_i = (1 − σ(z_i)) x_i.
2. Derivative of ln(1 − σ(z_i)):
∂/∂β [ln(1 − σ(z_i))] = (1 / (1 − σ(z_i))) · [−σ(z_i)(1 − σ(z_i))] · x_i = −σ(z_i) x_i.
Combining both parts for the i-th term:
∂L_i/∂β = y_i (1 − σ(z_i)) x_i − (1 − y_i) σ(z_i) x_i.
Rearranging terms:
∂L_i/∂β = x_i [ y_i − σ(z_i) ].
4. Summing Over All Observations
The total gradient is then:
∇_β ℓ(β) = ∑_{i=1}^n ∂L_i/∂β = ∑_{i=1}^n x_i [ y_i − σ(x_i^⊤ β) ].
Stacking the x_i^⊤ as the rows of a matrix X and defining the vector of predicted probabilities p = σ(Xβ) (applied elementwise), we obtain:
∇_β ℓ(β) = X^⊤ (y − p).
5. Final Expressions
- Component Form (coefficient β_j):
  ∂ℓ/∂β_j = ∑_{i=1}^n [ y_i − σ(x_i^⊤ β) ] x_{ij}.
- Vectorized Form (gradient vector):
  ∇ℓ(β) = X^⊤ (y − σ(Xβ)) = X^⊤ (y − p).
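The gradient derived above is easy to verify numerically. Below is a minimal NumPy sketch (the toy dataset is invented for illustration) that checks the analytic gradient X^⊤(y − p) against a finite-difference approximation of ℓ(β):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(beta, X, y):
    p = sigmoid(X @ beta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(beta, X, y):
    # Vectorized form: X^T (y - p)
    return X.T @ (y - sigmoid(X @ beta))

# Tiny toy dataset (first column is the intercept)
X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, 0.1]])
y = np.array([1.0, 0.0, 1.0, 0.0])
beta = np.array([0.1, -0.3])

# Compare each partial derivative against a central finite difference
eps = 1e-6
for j in range(len(beta)):
    e = np.zeros_like(beta); e[j] = eps
    fd = (log_likelihood(beta + e, X, y) - log_likelihood(beta - e, X, y)) / (2 * eps)
    assert abs(fd - gradient(beta, X, y)[j]) < 1e-6
```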
Hessian Matrix
To proceed beyond the gradient and perform methods like Newton-Raphson, we need the Hessian H, which is the matrix of second derivatives of the log-likelihood.
- Component Form: for coefficients β_j and β_k, the second partial derivative is
  ∂²ℓ / ∂β_j ∂β_k = −∑_{i=1}^n x_{ij} x_{ik} π_i (1 − π_i),
  where π_i = σ(x_i^⊤ β).
- Vectorized Form: define the diagonal matrix W = diag(π_i (1 − π_i)). Then the Hessian matrix is
  H = ∂²ℓ(β) / ∂β ∂β^⊤ = −X^⊤ W X.
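A short NumPy sketch of the vectorized Hessian (the toy design matrix is made up). At β = 0 every π_i = 0.5, so W = 0.25·I and H reduces to −0.25·X^⊤X, which gives a convenient sanity check:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hessian(beta, X):
    # H = -X^T W X with W = diag(pi * (1 - pi))
    p = sigmoid(X @ beta)
    W = np.diag(p * (1 - p))
    return -X.T @ W @ X

X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]])
H = hessian(np.zeros(2), X)
# At beta = 0, every pi = 0.5, so H = -0.25 * X^T X
assert np.allclose(H, -0.25 * X.T @ X)
```

Because W has nonnegative entries, H is negative semidefinite everywhere, which is why ℓ(β) is concave and Newton-type methods behave well.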
Why No Closed-Form Solution?
Setting ∇ℓ(β)=0 gives:
X^⊤ (y − p) = 0
This is a system of nonlinear equations because p depends nonlinearly on β via σ(⋅). No analytical solution exists, necessitating numerical methods like:
- Newton-Raphson: Iteratively update β using:
β^{(t+1)} = β^{(t)} − H^{−1} ∇ℓ(β^{(t)})
- Gradient Ascent/Descent: Update via β^{(t+1)} = β^{(t)} + η ∇ℓ(β^{(t)}) with step size η (the plus sign because we ascend on ℓ; equivalently, gradient descent on −ℓ).
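The Newton-Raphson iteration can be sketched end to end. This is a minimal, illustrative implementation on an invented toy dataset, not production code (no step-halving or other safeguards):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_newton(X, y, tol=1e-10, max_iter=50):
    # Newton-Raphson: beta <- beta + (X^T W X)^{-1} X^T (y - p)
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = sigmoid(X @ beta)
        W = np.diag(p * (1 - p))
        step = np.linalg.solve(X.T @ W @ X, X.T @ (y - p))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Toy data: the outcome loosely increases with the predictor (no separation)
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 0.0],
              [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
beta_hat = fit_newton(X, y)

# At the MLE the gradient should be (numerically) zero
assert np.max(np.abs(X.T @ (y - sigmoid(X @ beta_hat)))) < 1e-8
```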
2. Bayesian Approach
Bayesian logistic regression treats β as random variables, with a specified prior p(β). A common choice is a normal prior:
β ∼ N(0, α^{−1} I)
Posterior Inference
The posterior is proportional to the likelihood times the prior:
p(β∣y,X)∝p(y∣X,β)p(β)
Log-Posterior:
ln p(β | y, X) = ℓ(β) − (α/2) β^⊤ β + constant
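The log-posterior and its gradient are straightforward to code. A minimal NumPy sketch (toy data invented for illustration); note the gradient is the MLE gradient X^⊤(y − p) minus the prior's shrinkage term αβ:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_posterior(beta, X, y, alpha):
    # l(beta) - (alpha/2) * beta^T beta, dropping the additive constant
    p = sigmoid(X @ beta)
    ll = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return ll - 0.5 * alpha * beta @ beta

def log_posterior_grad(beta, X, y, alpha):
    # Gradient: X^T (y - p) - alpha * beta
    return X.T @ (y - sigmoid(X @ beta)) - alpha * beta

X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])

# Finite-difference check of the gradient at an arbitrary point
beta, eps = np.array([0.2, -0.1]), 1e-6
for j in range(2):
    e = np.zeros(2); e[j] = eps
    fd = (log_posterior(beta + e, X, y, 1.0)
          - log_posterior(beta - e, X, y, 1.0)) / (2 * eps)
    assert abs(fd - log_posterior_grad(beta, X, y, 1.0)[j]) < 1e-6
```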
Why No Closed-Form Posterior?
The logistic likelihood is not conjugate to the Gaussian prior, making the posterior intractable. Approximation methods are required:
- Markov Chain Monte Carlo (MCMC): Sample from the posterior using algorithms like Metropolis-Hastings or Hamiltonian Monte Carlo.
- Variational Inference: Approximate the posterior with a tractable distribution (e.g., Gaussian) by maximizing the Evidence Lower Bound (ELBO).
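As one concrete example of the MCMC route, here is a minimal random-walk Metropolis sampler for this log-posterior (toy data, the prior precision α = 1, and the proposal scale are all invented for illustration; real analyses would use a mature sampler such as HMC in Stan or PyMC):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_post(beta, X, y, alpha=1.0):
    # log p(y|X,beta) + log prior, written via y*z - log(1 + exp(z))
    z = X @ beta
    ll = np.sum(y * z - np.log1p(np.exp(z)))
    return ll - 0.5 * alpha * beta @ beta

def metropolis(X, y, n_steps=2000, scale=0.5):
    # Random-walk Metropolis: propose beta' ~ N(beta, scale^2 I),
    # accept with probability min(1, post(beta') / post(beta))
    beta = np.zeros(X.shape[1])
    lp = log_post(beta, X, y)
    samples = []
    for _ in range(n_steps):
        prop = beta + scale * rng.standard_normal(beta.shape)
        lp_prop = log_post(prop, X, y)
        if np.log(rng.uniform()) < lp_prop - lp:
            beta, lp = prop, lp_prop
        samples.append(beta.copy())
    return np.array(samples)

X = np.array([[1.0, -1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
draws = metropolis(X, y)
```

After discarding a burn-in portion, the means and quantiles of `draws` serve as posterior point estimates and credible intervals for β.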
Estimating Parameters Step by Step
Frequentist (MLE) Derivation
- Define the Likelihood
For each observation (x_i, y_i), where y_i ∈ {0, 1}, the probability of the observed outcome is:
Pr(Y_i = y_i | x_i) = [σ(x_i^⊤ β)]^{y_i} [1 − σ(x_i^⊤ β)]^{1 − y_i},
where σ(z) = 1 / (1 + e^{−z}) is the sigmoid function.
- Log-Likelihood
Taking logs turns products into sums:
ℓ(β) = ∑_{i=1}^n [ y_i (x_i^⊤ β) − ln(1 + exp(x_i^⊤ β)) ].
This expression is more convenient for differentiation and for numerical optimization.
- Gradient and Hessian
Let X be the n × (p + 1) design matrix, y the n × 1 outcome vector, and p = σ(Xβ) (applied row by row). Then:
  - Gradient:
    ∇ℓ(β) = X^⊤ (y − p).
  - Hessian:
    H = ∂²ℓ / ∂β ∂β^⊤ = −X^⊤ W X,
    where W = diag(p_i (1 − p_i)).
- Newton-Raphson Update
  - Initialization: Choose a starting guess β^{(0)}.
  - Iterate until convergence:
    β^{(t+1)} = β^{(t)} + [X^⊤ W^{(t)} X]^{−1} X^⊤ (y − p^{(t)}).
The Hessian H and gradient ∇ℓ(β) are recomputed at each iteration, and this method typically converges quickly near the optimum.
- Gradient Descent
Instead of using Newton-Raphson, one can perform gradient descent or stochastic gradient descent, which only requires the gradient (not the Hessian). The update rule is:
β^{(t+1)} = β^{(t)} + η ∇ℓ(β^{(t)}),
where η is the step size or learning rate.
Gradient descent may converge more slowly than Newton-Raphson but is often preferred for very large datasets, since each iteration avoids the cost of inverting the Hessian. In practice, stochastic or mini-batch gradient descent is popular in machine learning frameworks.
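The plain gradient-ascent variant (maximizing ℓ, hence the plus sign in the update) can be sketched as follows; the toy data, step size η, and iteration count are invented for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_gradient_ascent(X, y, eta=0.1, n_iter=5000):
    # beta <- beta + eta * X^T (y - p); ascent because we maximize l(beta)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        beta += eta * (X.T @ (y - sigmoid(X @ beta)))
    return beta

# Same invented toy data as above (intercept column plus one predictor)
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 0.0],
              [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
beta_hat = fit_gradient_ascent(X, y)
```

A mini-batch version would replace X and y in the update with a random subset of rows at each step, trading noisier updates for much cheaper iterations on large datasets.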
Interpreting Logistic Regression Results
1. Statistical Interpretation
When you estimate coefficients in a logistic regression, each coefficient β_j describes how the log-odds change with a one-unit increase in predictor x_j. Specifically:
- Log-Odds Interpretation: β_j is the change in ln(π / (1 − π)) (the log-odds of the event) for a one-unit increase in x_j, holding all other predictors constant.
- Odds Ratio: exp(β_j) is the odds ratio (OR) associated with a one-unit increase in x_j.
  - If exp(β_j) > 1, the odds of Y = 1 increase as x_j increases.
  - If exp(β_j) < 1, the odds of Y = 1 decrease as x_j increases.
- Confidence Intervals: exp(β_j) is often reported with a confidence interval. If 1 lies within that interval, you cannot reject the possibility that the odds ratio equals 1 (no effect).
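For instance, turning a coefficient and its standard error into an odds ratio with a 95% Wald interval (the numbers below are hypothetical, not from a fitted model):

```python
import math

# Hypothetical fitted coefficient and standard error for one predictor
beta_j = 0.40
se_j = 0.15

odds_ratio = math.exp(beta_j)
# 95% Wald interval on the OR scale: exp(beta_j +/- 1.96 * se_j)
ci_low = math.exp(beta_j - 1.96 * se_j)
ci_high = math.exp(beta_j + 1.96 * se_j)

print(f"OR = {odds_ratio:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
# OR = 1.49, 95% CI = (1.11, 2.00): since 1 lies outside the interval,
# the effect is statistically significant at the 5% level
```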
2. Layman Interpretation
To explain logistic regression results to a non-technical audience, focus on what an increase in a predictor means for the likelihood of the outcome, rather than log-odds or odds ratios. For instance:
- "Chance" vs. "Odds": Strictly, exp(β_j) multiplies the odds, not the probability, so avoid saying the probability itself is multiplied by exp(β_j). For a lay audience it is simplest to say that increasing x_j by one unit makes the outcome Y = 1 more likely or less likely, depending on whether exp(β_j) is greater than or less than 1.
- Example Phrase: "If you increase the number of hours studied by one hour, the odds of passing the exam increase by a factor of exp(β_j). In other words, you become roughly [exp(β_j) − 1] × 100% more likely to pass, compared to someone who studied one hour less (assuming everything else is the same)."
- Intuitive Summaries: Stress that logistic regression turns a set of input values (predictors) into a probability of the event (the "yes" or "1" outcome). Each predictor's coefficient shows how strongly it influences that probability, once all other factors are accounted for.
Model Assumptions & Diagnostics
Linearity in the Log-Odds
What It Means: Continuous predictors should relate linearly to the log-odds of the outcome, not directly to the outcome itself.
How to Check:
- Box-Tidwell Test: Add interaction terms of the form x_j ln(x_j) and see if they are significant. Significance indicates nonlinearity.
- Residual or Partial Residual Plots: Plot deviance or Pearson residuals vs. the predictor. A systematic pattern suggests nonlinearity.
- Spline Terms: Including spline (e.g., cubic) expansions of predictors can reveal or capture nonlinearity.
Mathematical Note: If logit(π_i) = β_0 + ∑_j β_j x_{ij} does not hold, consider higher-order or interaction terms.
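Building the Box-Tidwell augmentation is mechanical: append an x_j·ln(x_j) column to the design matrix and then test its coefficient in a refitted model. A minimal sketch with a made-up positive predictor (only the matrix construction is shown):

```python
import numpy as np

# Hypothetical positive continuous predictor
x = np.array([0.5, 1.0, 2.0, 4.0, 8.0])

# Box-Tidwell augmentation: intercept, x, and the x * ln(x) interaction.
# Refit the logistic model on this matrix; a significant coefficient on
# the third column suggests nonlinearity in the log-odds.
X_aug = np.column_stack([np.ones_like(x), x, x * np.log(x)])
```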
No Perfect Separation
What It Means: A scenario where a predictor (or combination of predictors) perfectly classifies the outcome, causing MLE estimates to go to ±∞.
How to Check:
- Software Warnings: Common packages (R, Python, etc.) often warn about divergence or extremely large coefficient estimates.
- Cross-Tabulations: Inspect whether specific categories of predictors contain only 0’s or only 1’s for the outcome.
Remedies: Use Firth’s bias-reduced logistic regression or add a penalty (e.g., ridge or lasso).
Mathematical Note: Under perfect separation, the likelihood can be driven arbitrarily close to 1 (ℓ(β) → 0, its supremum) by letting ‖β‖ → ∞, so no finite maximizer of ℓ(β) exists.
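This behavior is easy to demonstrate numerically: on separated toy data (invented for illustration), scaling the slope upward keeps increasing ℓ(β) toward its supremum of 0, so gradient-based fitting never settles at a finite β:

```python
import numpy as np

def log_lik(beta, X, y):
    # y*z - log(1 + exp(z)) form of the log-likelihood
    z = X @ beta
    return np.sum(y * z - np.log1p(np.exp(z)))

# Perfectly separated data: y = 1 exactly when the predictor is positive
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

# Scaling the slope up only ever increases the log-likelihood toward 0
for c in [1.0, 10.0, 100.0]:
    print(c, log_lik(np.array([0.0, c]), X, y))
```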
Independence of Observations
What It Means: Each data point (observation) is assumed to be independent from the others.
How to Check:
- Study Design: Ensure the data-collection process didn't violate independence (e.g., repeated measures, clustered data).
- Durbin-Watson Test: For time-series or sequential data, check for autocorrelation of residuals.
- Intraclass Correlation (ICC): For grouped data, a high ICC indicates dependence within groups.
Remedies: Use Generalized Estimating Equations (GEE) or Mixed-Effects Logistic Regression if observations are correlated.
Absence of High Multicollinearity
What It Means: Predictors should not be excessively correlated with each other, as it inflates variance of the parameter estimates.
How to Check:
- Variance Inflation Factor (VIF): VIF > 5 (or 10) often indicates problematic collinearity.
- Correlation Matrix: Look for predictor pairs with correlations near 1 or -1.
- Eigenvalues & Condition Indices: A condition index above 30 is suspicious.
Remedies: Remove or combine redundant predictors, or apply regularization (ridge or lasso).
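VIF can be computed directly from auxiliary regressions (VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing predictor j on the others). A minimal NumPy sketch on synthetic predictors, where x2 is deliberately built to be almost collinear with x1:

```python
import numpy as np

def vif(X):
    # One VIF per column of X (predictors only, no intercept column)
    out = []
    n, p = X.shape
    for j in range(p):
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        resid = X[:, j] - A @ coef
        tss = np.sum((X[:, j] - X[:, j].mean()) ** 2)
        r2 = 1.0 - (resid @ resid) / tss
        out.append(1.0 / (1.0 - r2))
    return out

rng = np.random.default_rng(1)
x1 = rng.standard_normal(200)
x2 = x1 + 0.05 * rng.standard_normal(200)   # nearly collinear with x1
x3 = rng.standard_normal(200)               # independent predictor
vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)  # x1 and x2 get very large VIFs; x3 stays near 1
```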
Correct Link Function
What It Means: Logistic regression assumes a logit link. If the true relationship is closer to a probit or complementary log-log, the model may be misspecified.
How to Check:
- Compare Fit: Fit alternative links (e.g., probit) and compare information criteria (AIC/BIC).
- Residual Plots: Look for systematic deviation that might indicate a different link is needed.
Mathematical Note: The logit link is ln(π / (1 − π)). Alternative links just change the transformation of π.
No Overdispersion (for Aggregated Binomial Data)
What It Means: When using grouped/binomial data, logistic regression assumes binomial variance. Overdispersion occurs if the variance exceeds what's expected under the binomial assumption.
How to Check:
- Deviance / Degrees of Freedom: If the ratio is noticeably greater than 1, consider overdispersion.
- Pearson Chi-Square: Compare to degrees of freedom for an overdispersion test.
Remedies: Use a quasi-binomial or beta-binomial approach if overdispersion is detected.
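The Pearson-based dispersion check can be sketched on made-up grouped data. Here a single pooled probability is fitted for illustration; with covariates, p would vary by group, but the statistic has the same form:

```python
import numpy as np

# Hypothetical grouped binomial data: n trials and k successes per group
n = np.array([20, 20, 20, 20, 20])
k = np.array([2, 18, 1, 19, 10])   # far more spread than binomial variation allows
p_hat = k.sum() / n.sum()          # pooled estimate: 0.5

# Pearson statistic: sum of (k_i - n_i p)^2 / (n_i p (1 - p)),
# compared against its degrees of freedom (groups minus parameters fitted)
pearson = np.sum((k - n * p_hat) ** 2 / (n * p_hat * (1 - p_hat)))
dof = len(n) - 1
dispersion = pearson / dof
print(dispersion)  # 14.5 here, far above 1, flagging overdispersion
```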
Adequate Sample Size
What It Means: Logistic regression requires enough data relative to the number of parameters. Small sample sizes can lead to instability or bias in estimates.
How to Check:
- Events per Variable (EPV): A rule of thumb is at least 10 events (and 10 non-events) per predictor.
- Simulation or Power Analysis: Evaluate how sample size affects parameter estimation confidence.
Remedies: Collect more data, reduce the number of predictors, or apply penalized methods (ridge, lasso, or Firth).
Conclusion
Logistic regression stands as a core method for predicting yes/no, success/fail, or 0/1 outcomes. By transforming a linear combination of predictors into a probability through the logit link, it offers both interpretability and solid theoretical grounding. We have seen how to derive its gradients and Hessians for estimation, and why no closed-form solution is possible—necessitating algorithms like Newton-Raphson or gradient-based methods.
On the Bayesian side, logistic regression gains flexibility by allowing you to integrate priors and quantify uncertainties naturally, though again the likelihood’s non-conjugacy prevents a direct analytical posterior. Beyond the math, success hinges on verifying the model’s assumptions: ensuring linearity in log-odds, guarding against perfect separation, checking independence, and verifying sufficient sample size and low multicollinearity. By addressing these points, you can maintain confidence that logistic regression’s straightforward probability outputs and odds-ratio interpretations will deliver meaningful insights and robust predictions in your data projects.