Logistic Regression: A Fundamental Classification Method
Logistic regression is a fundamental method for binary classification. Binary means having two outcomes, such as A or B, or 0 or 1. It models the probability of belonging to a certain class (often labeled "1") using the logistic function (i.e., σ(z) = 1 / (1 + exp(−z))). Before we dive into the details of parameter estimation, it is important to understand the two main approaches to logistic regression. This article covers the Frequentist derivation (via Maximum Likelihood Estimation), introduces the Bayesian viewpoint (placing priors on coefficients), and finally discusses the assumptions behind logistic regression and strategies to check them.
Key Logistic Regression Framework
At its core, logistic regression predicts the probability that a binary outcome Y ∈ {0, 1} is "1" given predictors x_1, x_2, …, x_p. Denote x_i = (1, x_{i1}, x_{i2}, …, x_{ip})^⊤ for observation i (including the intercept). The model is written as:
Component Form:
logit(Pr(Y_i = 1 | x_i)) = β_0 + β_1 x_{i1} + ⋯ + β_p x_{ip}
Vectorized Form:
logit(π_i) = x_i^⊤ β,  where β = (β_0, β_1, …, β_p)^⊤
where the logit function logit(p) = ln(p / (1 − p)) serves as the link function in logistic regression, mapping probabilities to the real line.
The inverse of this link is the sigmoid function, which satisfies σ(logit(p))=p and logit(σ(z))=z.
Component Form:
Pr(Y_i = 1 | x_i) = 1 / (1 + exp(−(β_0 + β_1 x_{i1} + ⋯ + β_p x_{ip})))
Vectorized Form:
π_i = σ(x_i^⊤ β) = 1 / (1 + exp(−x_i^⊤ β))
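As a quick numerical check, the logit/sigmoid relationship above can be sketched in a few lines of Python (a minimal illustration, not from the original article):

```python
import math

def sigmoid(z):
    # Logistic function 1 / (1 + exp(-z)), split by sign to avoid overflow
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def logit(p):
    # Log-odds ln(p / (1 - p)), the inverse of the sigmoid
    return math.log(p / (1.0 - p))

# The two functions undo each other: sigmoid(logit(p)) == p
p = 0.8
assert abs(sigmoid(logit(p)) - p) < 1e-12
assert abs(logit(sigmoid(2.0)) - 2.0) < 1e-9
```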
1. Frequentist Approach
In the Frequentist setting, we treat the parameters β = (β_0, β_1, …, β_p)^⊤ as fixed but unknown. We estimate them by maximizing the likelihood of the observed data under the logistic model.
Likelihood and MLE
Given n independent observations {(x_i, y_i)}_{i=1}^n, with y_i ∈ {0, 1}, the probability of y_i = 1 is π_i = σ(x_i^⊤ β). The likelihood is:
Component Form:
L(β) = ∏_{i=1}^n π_i^{y_i} (1 − π_i)^{1 − y_i}
Log-Likelihood:
ℓ(β) = ∑_{i=1}^n [ y_i ln(π_i) + (1 − y_i) ln(1 − π_i) ]
Derivative of the Log-Likelihood
To maximize ℓ(β), compute the gradient ∂ℓ/∂β:
1. Log-Likelihood Function
For a dataset of n observations, each with predictor vector x_i and binary outcome y_i ∈ {0, 1}, the log-likelihood under the logistic regression model is:
ℓ(β) = ∑_{i=1}^n [ y_i ln(σ(x_i^⊤ β)) + (1 − y_i) ln(1 − σ(x_i^⊤ β)) ]
2. Sigmoid Derivative
We first recall how to differentiate σ(z):
dσ(z)/dz = σ(z)(1 − σ(z)).
This property is crucial when applying the chain rule to the log-likelihood terms.
3. Differentiating Each Term
Consider the i-th term of the sum:
L_i(β) = y_i ln(σ(x_i^⊤ β)) + (1 − y_i) ln(1 − σ(x_i^⊤ β)).
Let z_i = x_i^⊤ β, so that σ(x_i^⊤ β) = σ(z_i). Using the chain rule:
1. Derivative of ln(σ(z_i)):
∂/∂β [ln(σ(z_i))] = (1 / σ(z_i)) · σ(z_i)(1 − σ(z_i)) · x_i = (1 − σ(z_i)) x_i.
2. Derivative of ln(1 − σ(z_i)):
∂/∂β [ln(1 − σ(z_i))] = (1 / (1 − σ(z_i))) · [−σ(z_i)(1 − σ(z_i))] · x_i = −σ(z_i) x_i.
Combining both parts for the i-th term:
∂L_i/∂β = y_i (1 − σ(z_i)) x_i − (1 − y_i) σ(z_i) x_i.
Rearranging terms:
∂L_i/∂β = x_i [ y_i − σ(z_i) ].
4. Summing Over All Observations
The total gradient is then:
∇_β ℓ(β) = ∑_{i=1}^n ∂L_i/∂β = ∑_{i=1}^n x_i [ y_i − σ(x_i^⊤ β) ].
Stacking the x_i^⊤ as the rows of a matrix X and defining the vector of predicted probabilities p = σ(Xβ) (applied elementwise), we obtain:
∇_β ℓ(β) = X^⊤ (y − p).
5. Final Expressions
- Component Form (coefficient β_j):
  ∂ℓ/∂β_j = ∑_{i=1}^n [ y_i − σ(x_i^⊤ β) ] x_{ij}.
- Vectorized Form (gradient vector):
  ∇ℓ(β) = X^⊤ (y − σ(Xβ)) = X^⊤ (y − p).
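The gradient derived above is easy to verify numerically. Below is a minimal NumPy sketch (the toy dataset is invented for illustration) that checks the analytic gradient X^⊤(y − p) against a finite-difference approximation of ℓ(β):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(beta, X, y):
    p = sigmoid(X @ beta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(beta, X, y):
    # Vectorized form: X^T (y - p)
    return X.T @ (y - sigmoid(X @ beta))

# Tiny toy dataset (first column is the intercept)
X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, 0.1]])
y = np.array([1.0, 0.0, 1.0, 0.0])
beta = np.array([0.1, -0.3])

# Compare each partial derivative against a central finite difference
eps = 1e-6
for j in range(len(beta)):
    e = np.zeros_like(beta); e[j] = eps
    fd = (log_likelihood(beta + e, X, y) - log_likelihood(beta - e, X, y)) / (2 * eps)
    assert abs(fd - gradient(beta, X, y)[j]) < 1e-6
```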
Hessian Matrix
To proceed beyond the gradient and perform methods like Newton-Raphson, we need the Hessian H, which is the matrix of second derivatives of the log-likelihood.
- Component Form: for coefficients β_j and β_k, the second partial derivative is
  ∂²ℓ / ∂β_j ∂β_k = −∑_{i=1}^n x_{ij} x_{ik} π_i (1 − π_i),
  where π_i = σ(x_i^⊤ β).
- Vectorized Form: define the diagonal matrix W = diag(π_i (1 − π_i)). Then the Hessian matrix is
  H = ∂²ℓ(β) / ∂β ∂β^⊤ = −X^⊤ W X.
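A short NumPy sketch of the vectorized Hessian (the toy design matrix is made up). At β = 0 every π_i = 0.5, so W = 0.25·I and H reduces to −0.25·X^⊤X, which gives a convenient sanity check:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hessian(beta, X):
    # H = -X^T W X with W = diag(pi * (1 - pi))
    p = sigmoid(X @ beta)
    W = np.diag(p * (1 - p))
    return -X.T @ W @ X

X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]])
H = hessian(np.zeros(2), X)
# At beta = 0, every pi = 0.5, so H = -0.25 * X^T X
assert np.allclose(H, -0.25 * X.T @ X)
```

Because W has nonnegative entries, H is negative semidefinite everywhere, which is why ℓ(β) is concave and Newton-type methods behave well.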
Why No Closed-Form Solution?
Setting ∇ℓ(β)=0 gives:
X^⊤ (y − p) = 0
This is a system of nonlinear equations because p depends nonlinearly on β via σ(⋅). No analytical solution exists, necessitating numerical methods like:
- Newton-Raphson: Iteratively update β using:
β^{(t+1)} = β^{(t)} − H^{−1} ∇ℓ(β^{(t)})
- Gradient Ascent/Descent: Update via β^{(t+1)} = β^{(t)} + η ∇ℓ(β^{(t)}) with step size η (the plus sign because we ascend on ℓ; equivalently, gradient descent on −ℓ).
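The Newton-Raphson iteration can be sketched end to end. This is a minimal, illustrative implementation on an invented toy dataset, not production code (no step-halving or other safeguards):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_newton(X, y, tol=1e-10, max_iter=50):
    # Newton-Raphson: beta <- beta + (X^T W X)^{-1} X^T (y - p)
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = sigmoid(X @ beta)
        W = np.diag(p * (1 - p))
        step = np.linalg.solve(X.T @ W @ X, X.T @ (y - p))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Toy data: the outcome loosely increases with the predictor (no separation)
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 0.0],
              [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
beta_hat = fit_newton(X, y)

# At the MLE the gradient should be (numerically) zero
assert np.max(np.abs(X.T @ (y - sigmoid(X @ beta_hat)))) < 1e-8
```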
2. Bayesian Approach
Bayesian logistic regression treats β as random variables, with a specified prior p(β). A common choice is a normal prior:
β ∼ N(0, α^{−1} I)
Posterior Inference
The posterior is proportional to the likelihood times the prior:
p(β∣y,X)∝p(y∣X,β)p(β)
Log-Posterior:
ln p(β | y, X) = ℓ(β) − (α/2) β^⊤ β + constant
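The log-posterior and its gradient are straightforward to code. A minimal NumPy sketch (toy data invented for illustration); note the gradient is the MLE gradient X^⊤(y − p) minus the prior's shrinkage term αβ:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_posterior(beta, X, y, alpha):
    # l(beta) - (alpha/2) * beta^T beta, dropping the additive constant
    p = sigmoid(X @ beta)
    ll = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return ll - 0.5 * alpha * beta @ beta

def log_posterior_grad(beta, X, y, alpha):
    # Gradient: X^T (y - p) - alpha * beta
    return X.T @ (y - sigmoid(X @ beta)) - alpha * beta

X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])

# Finite-difference check of the gradient at an arbitrary point
beta, eps = np.array([0.2, -0.1]), 1e-6
for j in range(2):
    e = np.zeros(2); e[j] = eps
    fd = (log_posterior(beta + e, X, y, 1.0)
          - log_posterior(beta - e, X, y, 1.0)) / (2 * eps)
    assert abs(fd - log_posterior_grad(beta, X, y, 1.0)[j]) < 1e-6
```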
Why No Closed-Form Posterior?
The logistic likelihood is not conjugate to the Gaussian prior, making the posterior intractable. Approximation methods are required:
- Markov Chain Monte Carlo (MCMC): Sample from the posterior using algorithms like Metropolis-Hastings or Hamiltonian Monte Carlo.
- Variational Inference: Approximate the posterior with a tractable distribution (e.g., Gaussian) by maximizing the Evidence Lower Bound (ELBO).
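As one concrete example of the MCMC route, here is a minimal random-walk Metropolis sampler for this log-posterior (toy data, the prior precision α = 1, and the proposal scale are all invented for illustration; real analyses would use a mature sampler such as HMC in Stan or PyMC):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_post(beta, X, y, alpha=1.0):
    # log p(y|X,beta) + log prior, written via y*z - log(1 + exp(z))
    z = X @ beta
    ll = np.sum(y * z - np.log1p(np.exp(z)))
    return ll - 0.5 * alpha * beta @ beta

def metropolis(X, y, n_steps=2000, scale=0.5):
    # Random-walk Metropolis: propose beta' ~ N(beta, scale^2 I),
    # accept with probability min(1, post(beta') / post(beta))
    beta = np.zeros(X.shape[1])
    lp = log_post(beta, X, y)
    samples = []
    for _ in range(n_steps):
        prop = beta + scale * rng.standard_normal(beta.shape)
        lp_prop = log_post(prop, X, y)
        if np.log(rng.uniform()) < lp_prop - lp:
            beta, lp = prop, lp_prop
        samples.append(beta.copy())
    return np.array(samples)

X = np.array([[1.0, -1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
draws = metropolis(X, y)
```

After discarding a burn-in portion, the means and quantiles of `draws` serve as posterior point estimates and credible intervals for β.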
Estimating Parameters Step by Step
Frequentist (MLE) Derivation
- Define the Likelihood
For each observation (x_i, y_i), where y_i ∈ {0, 1}, the probability of the observed outcome is:
Pr(Y_i = y_i | x_i) = [σ(x_i^⊤ β)]^{y_i} [1 − σ(x_i^⊤ β)]^{1 − y_i},
where σ(z) = 1 / (1 + e^{−z}) is the sigmoid function.
- Log-Likelihood
Taking logs turns products into sums:
ℓ(β) = ∑_{i=1}^n [ y_i (x_i^⊤ β) − ln(1 + exp(x_i^⊤ β)) ].
This expression is more convenient for differentiation and for numerical optimization.
- Gradient and Hessian
Let X be the n × (p + 1) design matrix, y the n × 1 outcome vector, and p = σ(Xβ) (applied row by row). Then:
  - Gradient:
    ∇ℓ(β) = X^⊤ (y − p).
  - Hessian:
    H = ∂²ℓ / ∂β ∂β^⊤ = −X^⊤ W X,
    where W = diag(p_i (1 − p_i)).
- Newton-Raphson Update
  - Initialization: Choose a starting guess β^{(0)}.
  - Iterate until convergence:
    β^{(t+1)} = β^{(t)} + [X^⊤ W^{(t)} X]^{−1} X^⊤ (y − p^{(t)}).
The Hessian H and gradient ∇ℓ(β) are recomputed at each iteration, and this method typically converges quickly near the optimum.
- Gradient Descent
Instead of using Newton-Raphson, one can perform gradient descent or stochastic gradient descent, which only requires the gradient (not the Hessian). The update rule is:
β^{(t+1)} = β^{(t)} + η ∇ℓ(β^{(t)}),
where η is the step size or learning rate.
Gradient descent may converge more slowly than Newton-Raphson but is often preferred for very large datasets, since each iteration avoids the cost of inverting the Hessian. In practice, stochastic or mini-batch gradient descent is popular in machine learning frameworks.
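The plain gradient-ascent variant (maximizing ℓ, hence the plus sign in the update) can be sketched as follows; the toy data, step size η, and iteration count are invented for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_gradient_ascent(X, y, eta=0.1, n_iter=5000):
    # beta <- beta + eta * X^T (y - p); ascent because we maximize l(beta)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        beta += eta * (X.T @ (y - sigmoid(X @ beta)))
    return beta

# Same invented toy data as above (intercept column plus one predictor)
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 0.0],
              [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
beta_hat = fit_gradient_ascent(X, y)
```

A mini-batch version would replace X and y in the update with a random subset of rows at each step, trading noisier updates for much cheaper iterations on large datasets.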
Interpreting Logistic Regression Results
1. Statistical Interpretation
When you estimate coefficients in a logistic regression, each coefficient β_j describes how the log-odds change with a one-unit increase in predictor x_j. Specifically:
- Log-Odds Interpretation: β_j is the change in ln(π / (1 − π)) (the log-odds of the event) for a one-unit increase in x_j, holding all other predictors constant.
- Odds Ratio: exp(β_j) is the odds ratio (OR) associated with a one-unit increase in x_j.
  - If exp(β_j) > 1, the odds of Y = 1 increase as x_j increases.
  - If exp(β_j) < 1, the odds of Y = 1 decrease as x_j increases.
- Confidence Intervals: exp(β_j) is often reported with a confidence interval. If 1 lies within that interval, you cannot reject the possibility that the odds ratio equals 1 (no effect).
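For instance, turning a coefficient and its standard error into an odds ratio with a 95% Wald interval (the numbers below are hypothetical, not from a fitted model):

```python
import math

# Hypothetical fitted coefficient and standard error for one predictor
beta_j = 0.40
se_j = 0.15

odds_ratio = math.exp(beta_j)
# 95% Wald interval on the OR scale: exp(beta_j +/- 1.96 * se_j)
ci_low = math.exp(beta_j - 1.96 * se_j)
ci_high = math.exp(beta_j + 1.96 * se_j)

print(f"OR = {odds_ratio:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
# OR = 1.49, 95% CI = (1.11, 2.00): since 1 lies outside the interval,
# the effect is statistically significant at the 5% level
```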
2. Layman Interpretation
To explain logistic regression results to a non-technical audience, focus on what an increase in a predictor means for the likelihood of the outcome, rather than log-odds or odds ratios. For instance:
- "Chance" vs. "Odds": Strictly, exp(β_j) multiplies the odds, not the probability, so avoid saying the probability itself is multiplied by exp(β_j). For a lay audience it is simplest to say that increasing x_j by one unit makes the outcome Y = 1 more likely or less likely, depending on whether exp(β_j) is greater than or less than 1.
- Example Phrase: "If you increase the number of hours studied by one hour, the odds of passing the exam increase by a factor of exp(β_j). In other words, you become roughly [exp(β_j) − 1] × 100% more likely to pass, compared to someone who studied one hour less (assuming everything else is the same)."
- Intuitive Summaries: Stress that logistic regression turns a set of input values (predictors) into a probability of the event (the "yes" or "1" outcome). Each predictor's coefficient shows how strongly it influences that probability, once all other factors are accounted for.
Model Assumptions & Diagnostics
Linearity in the Log-Odds
What It Means: Continuous predictors should relate linearly to the log-odds of the outcome, not directly to the outcome itself.
How to Check:
- Box-Tidwell Test: Add interaction terms of the form x_j ln(x_j) and see if they are significant. Significance indicates nonlinearity.
- Residual or Partial Residual Plots: Plot deviance or Pearson residuals vs. the predictor. A systematic pattern suggests nonlinearity.
- Spline Terms: Including spline (e.g., cubic) expansions of predictors can reveal or capture nonlinearity.
Mathematical Note: If logit(π_i) = β_0 + ∑_j β_j x_{ij} does not hold, consider higher-order or interaction terms.
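Building the Box-Tidwell augmentation is mechanical: append an x_j·ln(x_j) column to the design matrix and then test its coefficient in a refitted model. A minimal sketch with a made-up positive predictor (only the matrix construction is shown):

```python
import numpy as np

# Hypothetical positive continuous predictor
x = np.array([0.5, 1.0, 2.0, 4.0, 8.0])

# Box-Tidwell augmentation: intercept, x, and the x * ln(x) interaction.
# Refit the logistic model on this matrix; a significant coefficient on
# the third column suggests nonlinearity in the log-odds.
X_aug = np.column_stack([np.ones_like(x), x, x * np.log(x)])
```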
No Perfect Separation
What It Means: A scenario where a predictor (or combination of predictors) perfectly classifies the outcome, causing MLE estimates to go to ±∞.
How to Check:
- Software Warnings: Common packages (R, Python, etc.) often warn about divergence or extremely large coefficient estimates.
- Cross-Tabulations: Inspect whether specific categories of predictors contain only 0’s or only 1’s for the outcome.
Remedies: Use Firth’s bias-reduced logistic regression or add a penalty (e.g., ridge or lasso).
Mathematical Note: Under perfect separation, the likelihood can be driven arbitrarily close to 1 (ℓ(β) → 0, its supremum) by letting ‖β‖ → ∞, so no finite maximizer of ℓ(β) exists.
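This behavior is easy to demonstrate numerically: on separated toy data (invented for illustration), scaling the slope upward keeps increasing ℓ(β) toward its supremum of 0, so gradient-based fitting never settles at a finite β:

```python
import numpy as np

def log_lik(beta, X, y):
    # y*z - log(1 + exp(z)) form of the log-likelihood
    z = X @ beta
    return np.sum(y * z - np.log1p(np.exp(z)))

# Perfectly separated data: y = 1 exactly when the predictor is positive
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

# Scaling the slope up only ever increases the log-likelihood toward 0
for c in [1.0, 10.0, 100.0]:
    print(c, log_lik(np.array([0.0, c]), X, y))
```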
Independence of Observations
What It Means: Each data point (observation) is assumed to be independent from the others.
How to Check:
- Study Design: Ensure the data-collection process didn't violate independence (e.g., repeated measures, clustered data).
- Durbin-Watson Test: For time-series or sequential data, check for autocorrelation of residuals.
- Intraclass Correlation (ICC): For grouped data, a high ICC indicates dependence within groups.
Remedies: Use Generalized Estimating Equations (GEE) or Mixed-Effects Logistic Regression if observations are correlated.
Absence of High Multicollinearity
What It Means: Predictors should not be excessively correlated with each other, as it inflates variance of the parameter estimates.
How to Check:
- Variance Inflation Factor (VIF): VIF > 5 (or 10) often indicates problematic collinearity.
- Correlation Matrix: Look for predictor pairs with correlations near 1 or -1.
- Eigenvalues & Condition Indices: A condition index above 30 is suspicious.
Remedies: Remove or combine redundant predictors, or apply regularization (ridge or lasso).
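VIF can be computed directly from auxiliary regressions (VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing predictor j on the others). A minimal NumPy sketch on synthetic predictors, where x2 is deliberately built to be almost collinear with x1:

```python
import numpy as np

def vif(X):
    # One VIF per column of X (predictors only, no intercept column)
    out = []
    n, p = X.shape
    for j in range(p):
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        resid = X[:, j] - A @ coef
        tss = np.sum((X[:, j] - X[:, j].mean()) ** 2)
        r2 = 1.0 - (resid @ resid) / tss
        out.append(1.0 / (1.0 - r2))
    return out

rng = np.random.default_rng(1)
x1 = rng.standard_normal(200)
x2 = x1 + 0.05 * rng.standard_normal(200)   # nearly collinear with x1
x3 = rng.standard_normal(200)               # independent predictor
vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)  # x1 and x2 get very large VIFs; x3 stays near 1
```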
Correct Link Function
What It Means: Logistic regression assumes a logit link. If the true relationship is closer to a probit or complementary log-log, the model may be misspecified.
How to Check:
- Compare Fit: Fit alternative links (e.g., probit) and compare information criteria (AIC/BIC).
- Residual Plots: Look for systematic deviation that might indicate a different link is needed.
Mathematical Note: The logit link is ln(π / (1 − π)). Alternative links just change the transformation of π.
No Overdispersion (for Aggregated Binomial Data)
What It Means: When using grouped/binomial data, logistic regression assumes binomial variance. Overdispersion occurs if the variance exceeds what's expected under the binomial assumption.
How to Check:
- Deviance / Degrees of Freedom: If the ratio is noticeably greater than 1, consider overdispersion.
- Pearson Chi-Square: Compare to degrees of freedom for an overdispersion test.
Remedies: Use a quasi-binomial or beta-binomial approach if overdispersion is detected.
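The Pearson-based dispersion check can be sketched on made-up grouped data. Here a single pooled probability is fitted for illustration; with covariates, p would vary by group, but the statistic has the same form:

```python
import numpy as np

# Hypothetical grouped binomial data: n trials and k successes per group
n = np.array([20, 20, 20, 20, 20])
k = np.array([2, 18, 1, 19, 10])   # far more spread than binomial variation allows
p_hat = k.sum() / n.sum()          # pooled estimate: 0.5

# Pearson statistic: sum of (k_i - n_i p)^2 / (n_i p (1 - p)),
# compared against its degrees of freedom (groups minus parameters fitted)
pearson = np.sum((k - n * p_hat) ** 2 / (n * p_hat * (1 - p_hat)))
dof = len(n) - 1
dispersion = pearson / dof
print(dispersion)  # 14.5 here, far above 1, flagging overdispersion
```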
Adequate Sample Size
What It Means: Logistic regression requires enough data relative to the number of parameters. Small sample sizes can lead to instability or bias in estimates.
How to Check:
- Events per Variable (EPV): A rule of thumb is at least 10 events (and 10 non-events) per predictor.
- Simulation or Power Analysis: Evaluate how sample size affects parameter estimation confidence.
Remedies: Collect more data, reduce the number of predictors, or apply penalized methods (ridge, lasso, or Firth).
Conclusion
Logistic regression stands as a core method for predicting yes/no, success/fail, or 0/1 outcomes. By transforming a linear combination of predictors into a probability through the logit link, it offers both interpretability and solid theoretical grounding. We have seen how to derive its gradients and Hessians for estimation, and why no closed-form solution is possible—necessitating algorithms like Newton-Raphson or gradient-based methods.
On the Bayesian side, logistic regression gains flexibility by allowing you to integrate priors and quantify uncertainties naturally, though again the likelihood’s non-conjugacy prevents a direct analytical posterior. Beyond the math, success hinges on verifying the model’s assumptions: ensuring linearity in log-odds, guarding against perfect separation, checking independence, and verifying sufficient sample size and low multicollinearity. By addressing these points, you can maintain confidence that logistic regression’s straightforward probability outputs and odds-ratio interpretations will deliver meaningful insights and robust predictions in your data projects.