Bayesian Linear Regression


Bayesian Linear Regression: Fundamentals, Ridge, and Lasso Regularization

Bayesian linear regression provides a probabilistic framework for modeling the relationship between predictors and a response variable. In this framework, model parameters are treated as random variables, and we derive a full posterior distribution for them rather than just point estimates. In the sections below, we first detail the standard Bayesian linear regression model and then extend the discussion to two popular regularization methods: Bayesian Ridge and Bayesian Lasso.


Key Bayesian Regression Framework

Model Setup and Notation

We begin with the familiar linear model:

\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon},

where:

  • y is an n×1 response vector,
  • X is an n×(p+1) design matrix (including a column of ones for the intercept),
  • β is a (p+1)×1 vector of unknown coefficients, and
  • the errors ε are assumed to be independent and normally distributed:

    \boldsymbol{\epsilon} \sim N(\mathbf{0}, \sigma^2 \mathbf{I}).
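To make the notation concrete, here is a small NumPy sketch that simulates data from exactly this model (the sample size, coefficient values, and noise level are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 200, 3                     # n observations, p predictors
X = np.column_stack([np.ones(n),  # intercept column
                     rng.normal(size=(n, p))])
beta_true = np.array([1.0, 2.0, -1.0, 0.0])  # (p+1)-vector; last effect is null
sigma = 0.5

# y = X beta + eps, with eps ~ N(0, sigma^2 I)
y = X @ beta_true + rng.normal(scale=sigma, size=n)

print(X.shape, y.shape)           # (200, 4) (200,)
```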

Standard Bayesian Linear Regression

Likelihood

In the Bayesian framework, we combine the data likelihood with a prior distribution on the parameters. The likelihood function is the probability of observing the data y given the parameters β and σ². For our linear model with independent, normally distributed errors, it is:

L(\boldsymbol{\beta}, \sigma^2) = p(\mathbf{y} \mid \boldsymbol{\beta}, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2}{2\sigma^2}\right),

where x_i^T is the i-th row of X. Up to a normalizing constant, this simplifies to:

p(\mathbf{y} \mid \boldsymbol{\beta}, \sigma^2) \propto \exp\!\Bigl(-\frac{1}{2\sigma^2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2\Bigr).

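As a quick numerical illustration, the log of this likelihood is easy to evaluate directly. The function below is a minimal sketch (names are my own), and the toy check confirms that parameters matching the data score higher than a bad guess:

```python
import numpy as np

def log_likelihood(beta, sigma2, X, y):
    """Gaussian log-likelihood log p(y | beta, sigma^2)."""
    n = len(y)
    resid = y - X @ beta
    return -0.5 * n * np.log(2 * np.pi * sigma2) - resid @ resid / (2 * sigma2)

# Tiny check: moving beta away from the data lowers the log-likelihood.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 1.0, 2.0])            # exactly y = 0 + 1*x
ll_good = log_likelihood(np.array([0.0, 1.0]), 1.0, X, y)
ll_bad = log_likelihood(np.array([5.0, -3.0]), 1.0, X, y)
print(ll_good > ll_bad)                  # True
```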
Prior

In the Bayesian setting, we treat β as a random variable. A common weakly informative choice is a Gaussian prior on β, which is conjugate to the Gaussian likelihood. This conjugacy leads to an analytic posterior and simplifies computation:

\boldsymbol{\beta} \sim N\bigl(\mathbf{0},\, \tau^{-1} \mathbf{I}\bigr),

where τ is the prior precision: a larger τ means a tighter prior around zero (more shrinkage), while a smaller τ allows greater variability in the coefficients.

However, the Gaussian prior is not the only choice. Depending on the application and prior beliefs, different priors can be used:

  • Noninformative Prior: When little prior knowledge is available, one might use a flat (uniform) prior over β, i.e. p(β) ∝ 1. This does not favor any particular values and lets the data speak for themselves.

  • Weakly Informative Gaussian Prior: In many cases, a Gaussian prior with a large variance (small τ) provides mild regularization without imposing strong beliefs. This is particularly useful for stabilizing estimates in the presence of multicollinearity.

  • Laplace Prior (Double-Exponential): When sparsity is desired, such as in variable selection, a Laplace prior can be placed on each coefficient. This is the basis for the Bayesian Lasso:

    p(\beta_j) = \frac{\lambda}{2} \exp\!\left(-\lambda |\beta_j|\right).
  • Heavy-Tailed Priors: In some applications, heavy-tailed distributions such as Student's t can serve as priors for β. These are more robust to outliers in the data and allow for occasional large coefficient values.

The choice of prior should reflect any prior knowledge or assumptions about the parameters. For example, if prior studies suggest that most predictors have a negligible effect on the response, a Laplace prior might be preferred to encourage sparsity. Conversely, if one has reason to believe that the effects are normally distributed around zero with moderate variance, a Gaussian prior is appropriate.

Posterior: Combine Likelihood and the Prior

Bayesian inference uses Bayes’ rule in the form:

p(\boldsymbol{\beta} \mid \mathbf{y}) \;=\; \frac{p(\mathbf{y}, \boldsymbol{\beta})}{p(\mathbf{y})} \;=\; \frac{p(\mathbf{y} \mid \boldsymbol{\beta})\, p(\boldsymbol{\beta})}{p(\mathbf{y})}.

Next, we write the marginal likelihood p(y) by integrating over all possible values of β:

p(\mathbf{y}) \;=\; \int p(\mathbf{y}, \boldsymbol{\beta}) \,\mathrm{d}\boldsymbol{\beta} \;=\; \int p(\mathbf{y} \mid \boldsymbol{\beta})\, p(\boldsymbol{\beta}) \,\mathrm{d}\boldsymbol{\beta}.

However, computing the marginal likelihood p(y) can be challenging because it integrates over all possible parameter values. For this reason, we often write the posterior only up to a proportionality constant: the posterior for β is proportional to the product of the likelihood and the prior:

p(\boldsymbol{\beta} \mid \mathbf{y}) \;\propto\; p(\mathbf{y} \mid \boldsymbol{\beta})\, p(\boldsymbol{\beta}).

Therefore, for our setup the posterior distribution is:

p(\boldsymbol{\beta} \mid \mathbf{y}, \sigma^2) \propto p(\mathbf{y} \mid \boldsymbol{\beta}, \sigma^2)\, p(\boldsymbol{\beta}).

Because both the likelihood and the prior are Gaussian, the posterior for β is also Gaussian. This conjugacy yields an analytic posterior, allowing us to quantify the uncertainty in the coefficients.
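Under a Gaussian prior with precision τ (i.e., β ∼ N(0, τ⁻¹I)), the posterior mean and covariance have closed forms. The sketch below computes them with plain NumPy (function name and test values are illustrative):

```python
import numpy as np

def posterior(X, y, sigma2, tau):
    """Gaussian posterior N(mean, cov) for beta under the prior beta ~ N(0, tau^{-1} I)."""
    p = X.shape[1]
    precision = X.T @ X / sigma2 + tau * np.eye(p)  # posterior precision matrix
    cov = np.linalg.inv(precision)
    mean = cov @ X.T @ y / sigma2                   # posterior mean (= MAP estimate)
    return mean, cov

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
beta_true = np.array([1.0, 2.0, -1.0])
y = X @ beta_true + rng.normal(scale=0.5, size=50)

mean, cov = posterior(X, y, sigma2=0.25, tau=1.0)
print(mean)   # close to beta_true; cov quantifies the remaining uncertainty
```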


Posterior Distributions: Analytic and Approximate Forms

The posterior distribution arises when we update our beliefs about parameters after observing data, capturing both our prior assumptions and the information contained in the likelihood. In some situations—especially when the chosen prior and likelihood form a convenient pairing—the necessary integrals can simplify, leading to an exact, closed-form expression for the posterior.

In certain cases, the likelihood and prior are chosen to be conjugate: the resulting posterior then has the same functional form as the prior, so a closed-form (analytic) solution can be derived easily. Conjugacy is not the only route to an analytic posterior, but outside of conjugate pairings the marginal likelihood integral is usually too complex to solve directly.

Non-Analytic Example: Bayesian Logistic Regression
  1. Model Setup
  • Likelihood

    Suppose y ∈ {0,1}^n is a vector of binary outcomes and X is the design matrix. For logistic regression, each y_i is modeled with

P\bigl(y_i = 1 \mid \mathbf{x}_i, \boldsymbol{\beta}\bigr) \;=\; \sigma\!\bigl(\mathbf{x}_i^\mathsf{T}\boldsymbol{\beta}\bigr),

where σ(z) = 1/(1 + e^{-z}) is the logistic function. The likelihood for all n data points is:

p(\mathbf{y} \mid \boldsymbol{\beta}) \;=\; \prod_{i=1}^n \sigma\!\bigl(\mathbf{x}_i^\mathsf{T}\boldsymbol{\beta}\bigr)^{y_i} \Bigl(1 - \sigma\!\bigl(\mathbf{x}_i^\mathsf{T}\boldsymbol{\beta}\bigr)\Bigr)^{1 - y_i}.

  • Prior

    We can again choose a Gaussian prior for β:

p(\boldsymbol{\beta}) \;=\; \mathcal{N}\bigl(\boldsymbol{\mu}_0, \Sigma_0\bigr).

  2. Posterior Challenge
  • Form of the Posterior

    Using Bayes’ rule, the posterior is:

p(\boldsymbol{\beta} \mid \mathbf{y}) \;\propto\; \underbrace{\prod_{i=1}^n \sigma\!\bigl(\mathbf{x}_i^\mathsf{T}\boldsymbol{\beta}\bigr)^{y_i} \bigl[1 - \sigma\!\bigl(\mathbf{x}_i^\mathsf{T}\boldsymbol{\beta}\bigr)\bigr]^{1 - y_i}}_{\text{Logistic likelihood}} \;\times\; \underbrace{\exp\!\Bigl(-\tfrac12(\boldsymbol{\beta}-\boldsymbol{\mu}_0)^\mathsf{T}\Sigma_0^{-1}(\boldsymbol{\beta}-\boldsymbol{\mu}_0)\Bigr)}_{\text{Gaussian prior}}.

The presence of the logistic function σ(·) in the product creates a non-conjugate relationship with the Gaussian prior.

  • No Closed-Form Integral

    Unlike the Gaussian–Gaussian case, the integral

p(\mathbf{y}) \;=\; \int p(\mathbf{y},\boldsymbol{\beta})\,\mathrm{d}\boldsymbol{\beta} \;=\; \int p(\mathbf{y} \mid \boldsymbol{\beta})\,p(\boldsymbol{\beta})\,\mathrm{d}\boldsymbol{\beta}

does not simplify into any known closed-form solution. This makes the posterior analytically intractable.

As a result, we often rely on numerical approaches such as Markov Chain Monte Carlo (MCMC), variational inference, or other approximation methods to explore or estimate the posterior distribution. These techniques ensure that even when an exact solution is not feasible, we can still capture the uncertainty about our parameters and make informed inferences from the data.
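As a sketch of what such a numerical approach looks like, here is a minimal random-walk Metropolis sampler for the Bayesian logistic posterior above. The step size, prior variance, and iteration counts are arbitrary choices; a real analysis would add convergence diagnostics and tuning:

```python
import numpy as np

def log_post(beta, X, y, prior_var=10.0):
    """Unnormalized log posterior: logistic likelihood + N(0, prior_var I) prior."""
    z = X @ beta
    # Stable form: sum_i [ y_i * z_i - log(1 + e^{z_i}) ]
    loglik = np.sum(y * z - np.logaddexp(0.0, z))
    logprior = -0.5 * beta @ beta / prior_var
    return loglik + logprior

def metropolis(X, y, n_iter=5000, step=0.2, seed=0):
    rng = np.random.default_rng(seed)
    beta = np.zeros(X.shape[1])
    lp = log_post(beta, X, y)
    samples = []
    for _ in range(n_iter):
        prop = beta + step * rng.normal(size=beta.size)  # random-walk proposal
        lp_prop = log_post(prop, X, y)
        if np.log(rng.uniform()) < lp_prop - lp:         # accept/reject
            beta, lp = prop, lp_prop
        samples.append(beta.copy())
    return np.array(samples)

# Simulated logistic data with known coefficients (-0.5, 1.5)
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(300), rng.normal(size=300)])
p_true = 1.0 / (1.0 + np.exp(-(X @ np.array([-0.5, 1.5]))))
y = (rng.uniform(size=300) < p_true).astype(float)

draws = metropolis(X, y)
print(draws[2500:].mean(axis=0))  # posterior means, roughly near (-0.5, 1.5)
```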

By working with a full posterior distribution, Bayesian linear regression enables the incorporation of prior information and provides a principled measure of uncertainty. Instead of a single point estimate for each β_j, you get a distribution reflecting all plausible values and their corresponding probabilities.


Bayesian Ridge Regression

In Bayesian Ridge Regression, we use the same Gaussian prior:

\boldsymbol{\beta} \sim N\bigl(\mathbf{0},\, \tau^{-1} \mathbf{I}\bigr).

Likelihood:
The likelihood function remains:

p(\mathbf{y} \mid \boldsymbol{\beta}, \sigma^2) \propto \exp\!\Bigl(-\frac{1}{2\sigma^2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2\Bigr).

Prior:
The Gaussian prior on the coefficients is:

p(\boldsymbol{\beta}) \propto \exp\!\Bigl(-\frac{\tau}{2}\boldsymbol{\beta}^T \boldsymbol{\beta}\Bigr).

Posterior:
By Bayes’ rule, the posterior is:

p(\boldsymbol{\beta} \mid \mathbf{y}, \sigma^2) \propto p(\mathbf{y} \mid \boldsymbol{\beta}, \sigma^2)\, p(\boldsymbol{\beta}) \propto \exp\!\Bigl(-\frac{1}{2\sigma^2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 - \frac{\tau}{2}\boldsymbol{\beta}^T \boldsymbol{\beta}\Bigr).

Taking the negative logarithm (and ignoring constants) gives the following objective for Maximum A Posteriori (MAP) estimation:

\min_{\boldsymbol{\beta}} \; \frac{1}{2\sigma^2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \frac{\tau}{2}\boldsymbol{\beta}^T \boldsymbol{\beta}.

Multiplying through by 2σ² (which does not affect the minimizer) leads to:

\min_{\boldsymbol{\beta}} \; \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \sigma^2 \tau \, \|\boldsymbol{\beta}\|^2.

Derivation of the MAP Estimate:
Taking the derivative with respect to β\boldsymbol{\beta} and setting it to zero:

-2\mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + 2\sigma^2 \tau \, \boldsymbol{\beta} = \mathbf{0}.

Rearrange to obtain:

\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} + \sigma^2 \tau \, \boldsymbol{\beta} = \mathbf{X}^T\mathbf{y}.

Thus, the MAP estimate for Bayesian Ridge Regression is:

\hat{\boldsymbol{\beta}}_{\text{BayesRidge}} = \bigl(\mathbf{X}^T\mathbf{X} + \sigma^2 \tau \, \mathbf{I}\bigr)^{-1}\mathbf{X}^T\mathbf{y}.
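A quick NumPy check of this estimator: solve the regularized normal equations and verify that the result scores at least as well as ordinary least squares on the penalized objective (the data and hyperparameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 3))])
beta_true = np.array([0.5, 1.0, -2.0, 0.0])
sigma2, tau = 0.25, 2.0
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=100)

# MAP estimate: (X^T X + sigma^2 tau I)^{-1} X^T y
p = X.shape[1]
beta_map = np.linalg.solve(X.T @ X + sigma2 * tau * np.eye(p), X.T @ y)

# Sanity check: beta_map should score no worse than OLS on the
# penalized objective ||y - X b||^2 + sigma^2 tau ||b||^2.
def objective(b):
    return np.sum((y - X @ b) ** 2) + sigma2 * tau * np.sum(b ** 2)

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(objective(beta_map) <= objective(beta_ols))  # True
```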

Bayesian Lasso Regression

Bayesian Lasso introduces sparsity by using a Laplace (double-exponential) prior on the regression coefficients.

Model, Likelihood, and Prior for Bayesian Lasso

Model:
We retain the linear model:

\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}, \quad \text{with } \boldsymbol{\epsilon} \sim N(\mathbf{0}, \sigma^2 \mathbf{I}).

Likelihood:
The likelihood function is:

p(\mathbf{y} \mid \boldsymbol{\beta}, \sigma^2) \propto \exp\!\Bigl(-\frac{1}{2\sigma^2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2\Bigr).

Prior:
For Bayesian Lasso, we place a Laplace prior on each coefficient β_j:

p(\beta_j) = \frac{\lambda}{2} \exp\!\left(-\lambda |\beta_j|\right).

For the entire coefficient vector:

p(\boldsymbol{\beta}) = \prod_{j=1}^{p} \frac{\lambda}{2} \exp\!\left(-\lambda |\beta_j|\right) \propto \exp\!\bigl(-\lambda \|\boldsymbol{\beta}\|_1\bigr).
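A two-line numerical check that the product of Laplace densities collapses, on the log scale, to the L1 form (the values of λ and β are arbitrary):

```python
import numpy as np

lam = 1.5
beta = np.array([0.7, -2.0, 0.0, 0.3])

# Log of the product of Laplace densities, term by term
log_prior = np.sum(np.log(lam / 2.0) - lam * np.abs(beta))

# Same quantity via the L1 norm plus the normalizing constant
log_prior_l1 = len(beta) * np.log(lam / 2.0) - lam * np.sum(np.abs(beta))
print(np.isclose(log_prior, log_prior_l1))  # True
```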

Posterior and MAP Estimation for Bayesian Lasso

Posterior:
Using Bayes’ rule, the posterior distribution is given by:

p(\boldsymbol{\beta} \mid \mathbf{y}, \sigma^2) \propto p(\mathbf{y} \mid \boldsymbol{\beta}, \sigma^2)\, p(\boldsymbol{\beta}) \propto \exp\!\Bigl(-\frac{1}{2\sigma^2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 - \lambda \|\boldsymbol{\beta}\|_1\Bigr).

Taking the negative logarithm (ignoring constants) yields the MAP objective:

\min_{\boldsymbol{\beta}} \; \frac{1}{2\sigma^2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda \|\boldsymbol{\beta}\|_1.

Multiplying through by 2σ² for clarity, we have:

\min_{\boldsymbol{\beta}} \; \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + 2\sigma^2 \lambda \, \|\boldsymbol{\beta}\|_1.

Note:
Because the L1 norm is non-differentiable at zero, there is no closed-form solution for the Bayesian Lasso MAP estimator. In practice, numerical methods such as Gibbs sampling (via the scale-mixture representation of the Laplace prior), variational inference, or coordinate descent adapted to the Bayesian framework are used to estimate β or to sample from the full posterior.
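One standard numerical approach to the MAP objective above is proximal gradient descent (ISTA), where the L1 term enters through coordinate-wise soft-thresholding. The sketch below is a minimal illustration rather than a production solver; `lam_eff` stands in for 2σ²λ, and the data are simulated with a sparse ground truth:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (coordinate-wise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_map(X, y, lam_eff, n_iter=2000):
    """ISTA for min ||y - X b||^2 + lam_eff * ||b||_1 (lam_eff = 2 sigma^2 lambda)."""
    L = 2 * np.linalg.norm(X, 2) ** 2      # Lipschitz constant of the gradient
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ beta - y)    # gradient of the squared-error term
        beta = soft_threshold(beta - grad / L, lam_eff / L)
    return beta

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
beta_true = np.array([3.0, 0.0, -2.0, 0.0, 0.0])  # sparse truth
y = X @ beta_true + rng.normal(scale=0.5, size=100)

beta_hat = lasso_map(X, y, lam_eff=20.0)
print(np.round(beta_hat, 2))  # large entries survive; null entries shrink to zero
```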


Hyperparameter Interpretation

An important aspect of Bayesian linear regression is understanding the role of the hyperparameters. The parameter τ in the Gaussian prior (used in standard Bayesian regression and Bayesian Ridge) controls the spread of the prior on β: a larger τ implies a tighter prior (more shrinkage), while a smaller τ allows greater variability in the coefficient estimates. Similarly, in Bayesian Lasso the hyperparameter λ governs the strength of the L1 penalty: a larger λ encourages more coefficients to be shrunk exactly to zero, promoting sparsity and aiding variable selection. Tuning these hyperparameters is critical, as they balance model complexity against regularization and ultimately influence both the predictive performance and the interpretability of the model.
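The shrinkage effect of τ is easy to see numerically: recomputing the ridge MAP estimate for increasing τ drives the coefficient norm toward zero (the data and the τ grid are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
X = np.column_stack([np.ones(60), rng.normal(size=(60, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=60)
sigma2 = 0.25

def ridge_map(tau):
    """Ridge MAP estimate (X^T X + sigma^2 tau I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + sigma2 * tau * np.eye(p), X.T @ y)

# Larger tau -> tighter prior around zero -> smaller coefficient norm
for tau in [0.01, 1.0, 100.0, 10000.0]:
    print(tau, np.linalg.norm(ridge_map(tau)))
```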


Summary of Bayesian Linear Regression Methods

  • Standard Bayesian Linear Regression:

    • Model: y = Xβ + ε, with ε ∼ N(0, σ²I).
    • Notation:
      y is an n×1 response vector, X is an n×(p+1) design matrix, and β is a (p+1)×1 coefficient vector.
    • Prior:
      β ∼ N(0, τ⁻¹I).
    • Posterior:
      p(β | y, σ²) ∝ exp(−(1/(2σ²))‖y − Xβ‖² − (τ/2)βᵀβ), leading to a closed-form MAP solution akin to ridge regression.
  • Bayesian Ridge Regression:

    • Prior:
      β ∼ N(0, τ⁻¹I).
    • MAP Estimator: β̂_BayesRidge = (XᵀX + σ²τI)⁻¹Xᵀy.
    • Benefits:
      Provides shrinkage of coefficients and quantifies uncertainty via the full posterior distribution.
  • Bayesian Lasso Regression:

    • Prior:
      Laplace prior p(β) ∝ exp(−λ‖β‖₁).
    • MAP Objective: min_β ‖y − Xβ‖² + 2σ²λ‖β‖₁.
    • Benefits:
      Encourages sparsity by driving some coefficients to exactly zero, aiding in variable selection. No closed-form solution exists; numerical methods are employed.


Practical Considerations

In practice, Bayesian linear regression models are often implemented using computational techniques such as Markov Chain Monte Carlo (MCMC) methods, including Gibbs sampling or the Metropolis-Hastings algorithm, to sample from the posterior distribution. Software packages such as Stan, JAGS, or PyMC3 provide powerful tools for performing these computations, enabling the estimation of complex Bayesian models. However, practitioners should be aware of the computational challenges associated with MCMC, such as convergence diagnostics and potential slow mixing in high-dimensional parameter spaces. Variational inference is another alternative that approximates the posterior more quickly, although sometimes at the cost of reduced accuracy. These practical considerations are crucial when applying Bayesian methods to real-world datasets.

Bayesian linear regression not only yields point estimates for the coefficients but also provides a full probabilistic description of uncertainty. By incorporating regularization through Bayesian Ridge and Bayesian Lasso, we obtain models that are robust to overfitting and multicollinearity, while also offering valuable insights through the posterior distributions.


Further Reading & Citations

  • Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis. Chapman & Hall/CRC.
  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
  • Park, T., & Casella, G. (2008). The Bayesian Lasso. Journal of the American Statistical Association, 103(482), 681-686.