Why Conditional Expectation is the Best Predictor: Key Results and Applications
Tags: Linear Regression · OLS · Supervised Learning (Regression/Classification)
Conditional Expectation: Optimality, Proof, and Applications
In predictive modeling—whether using traditional linear regression or modern deep learning architectures like Transformers—a fundamental goal is to minimize the prediction error. One of the key theoretical results is that the best predictor of a random variable Y given features X, when measured by mean squared error (MSE), is the conditional expectation E[Y∣X]. This article presents the theorem, its proof, and explores its connection to the bias–variance tradeoff along with practical use cases.
1. Theorem: Optimality of the Conditional Expectation
Statement:
Let X and Y be random variables defined on a probability space, and let f be any measurable function of X. The MSE loss is defined as:
L(f) = E[(Y − f(X))²].
Then the unique minimizer in the L2 sense is:
f∗(X)=E[Y∣X].
In other words,
argmin_f E[(Y − f(X))²] = E[Y∣X].
2. Proof of the Theorem
Error Decomposition
For any function f, we write:
Y − f(X) = (Y − E[Y∣X]) + (E[Y∣X] − f(X)),
where the first term is the error of Y around its conditional mean and the second is the deviation of f(X) from that conditional mean. Squaring and taking expectations, the cross term vanishes: conditioning on X makes E[Y∣X] − f(X) a constant, while Y − E[Y∣X] has conditional mean zero. This leaves:
L(f) = E[(Y − E[Y∣X])²] + E[(E[Y∣X] − f(X))²].
The first term is independent of f and represents the irreducible error. The second term is nonnegative and equals zero if and only if f(X) = E[Y∣X] (almost surely). Therefore, the unique minimizer is:
f∗(X)=E[Y∣X].
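The theorem can be checked numerically. Below is a minimal sketch (with a made-up discrete distribution for X so the conditional mean can be computed exactly from the data): the empirical conditional mean beats both a constant predictor and a shifted version of itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete X so E[Y | X] can be estimated exactly per level.
n = 200_000
X = rng.integers(0, 5, size=n)            # X ∈ {0, 1, 2, 3, 4}
Y = X**2 + rng.normal(0.0, 1.0, size=n)   # true E[Y | X = x] = x², Var(Y | X) = 1

# Empirical conditional mean for each level of X.
cond_mean = np.array([Y[X == x].mean() for x in range(5)])

def mse(pred):
    return np.mean((Y - pred) ** 2)

mse_cond = mse(cond_mean[X])              # f*(X) = E[Y | X]
mse_global = mse(np.full(n, Y.mean()))    # constant predictor E[Y]
mse_shifted = mse(cond_mean[X] + 0.5)     # conditional mean plus a bias

print(mse_cond, mse_global, mse_shifted)  # mse_cond is smallest, ≈ Var(Y | X) = 1
```

Any deviation from the conditional mean — here a constant shift of 0.5 — adds exactly its squared size to the MSE, in line with the decomposition above.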
3. Connection to the Bias–Variance Tradeoff
The bias–variance tradeoff provides a framework for understanding the expected prediction error of an estimator. For an estimator f̂(X) of f(X) = E[Y∣X], the expected squared error at a point x decomposes as:
E[(f̂(x) − Y)²] = Bias(f̂(x))² + Var(f̂(x)) + Var(Y∣X=x),
that is, squared bias, variance, and irreducible error.
For the ideal predictor f∗(x) = E[Y∣X=x]:
Bias: zero, since f∗(x) equals the quantity being estimated.
Variance: zero for a perfect estimator (ignoring finite-sample estimation error).
Total Error: E[(f∗(x)−Y)²] = Var(Y∣X=x), the irreducible error.
This is the minimum possible error achievable, showing that f∗(x)=E[Y∣X=x] is optimal in the L2 sense.
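The decomposition can be verified by simulation. The sketch below uses a deliberately simple (hypothetical) estimator of E[Y∣X=x] at a fixed point x — the sample mean of a small dataset — and checks that its expected squared error matches bias² + variance + irreducible error.

```python
import numpy as np

rng = np.random.default_rng(1)

# True model at a fixed point x: Y = f(x) + ε, ε ~ N(0, σ²).
f_x, sigma = 2.0, 1.0
n_per_dataset, n_datasets = 20, 20_000

# Estimator f̂(x): the sample mean of 20 observations of Y at x,
# recomputed across many independent datasets.
samples = f_x + sigma * rng.normal(size=(n_datasets, n_per_dataset))
f_hat = samples.mean(axis=1)

# One fresh test outcome Y at x per dataset.
y_test = f_x + sigma * rng.normal(size=n_datasets)

expected_sq_err = np.mean((f_hat - y_test) ** 2)
bias_sq = (f_hat.mean() - f_x) ** 2
variance = f_hat.var()
irreducible = sigma ** 2

# Check: E[(f̂(x) − Y)²] = Bias² + Var(f̂) + σ²
print(expected_sq_err, bias_sq + variance + irreducible)
```

Here the bias is essentially zero and the variance shrinks as 1/n; only the irreducible σ² = Var(Y∣X=x) remains as n grows, which is exactly the floor attained by f∗.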
4. Use Cases and Practical Implications
A. Linear Regression
In linear regression, we restrict the predictor to be linear:
fβ(X)=β0+β1X1+⋯+βpXp.
The Ordinary Least Squares (OLS) estimator finds β that minimizes the empirical MSE, effectively approximating E[Y∣X] within the class of linear functions. Under the Gauss–Markov conditions, the OLS estimator is the Best Linear Unbiased Estimator (BLUE) of the coefficients; more generally, even when the true E[Y∣X] is nonlinear, OLS converges to the best linear approximation of it in the L² sense.
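As a quick illustration (with synthetic data whose conditional mean happens to be exactly linear), OLS via the normal equations recovers E[Y∣X] = 2 + 3X:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data with a linear conditional mean: E[Y | X] = 2 + 3·X.
n = 100_000
X = rng.uniform(-1.0, 1.0, size=n)
Y = 2.0 + 3.0 * X + rng.normal(0.0, 0.5, size=n)

# OLS: minimize the empirical MSE over the linear class.
A = np.column_stack([np.ones(n), X])          # design matrix with intercept
beta_hat, *_ = np.linalg.lstsq(A, Y, rcond=None)

print(beta_hat)  # ≈ [2.0, 3.0]
```

Because the true conditional mean lies inside the linear class here, OLS attains the global optimum f∗; when it does not, OLS still returns the L²-closest linear function.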
B. Deep Learning and Transformer Models
When training deep models like Transformers with MSE loss, the network minimizes:
min_θ ∑i (Li − fθ(Xi))²,
so that with enough data and capacity, the output fθ(X) approximates E[L∣X].
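The same principle can be demonstrated with a tiny one-hidden-layer network in NumPy — a stand-in for a Transformer, since the argument depends only on the MSE loss and sufficient model capacity, not the architecture. Trained on noisy targets, the network converges toward the conditional mean E[Y∣X] = X², not the noisy observations:

```python
import numpy as np

rng = np.random.default_rng(3)

# Noisy data with a nonlinear conditional mean: E[Y | X] = X².
n = 2000
X = rng.uniform(-1.0, 1.0, size=(n, 1))
Y = X**2 + 0.1 * rng.normal(size=(n, 1))

# One hidden tanh layer, trained by full-batch gradient descent on MSE.
h = 32
W1 = 0.5 * rng.normal(size=(1, h)); b1 = np.zeros(h)
W2 = 0.5 * rng.normal(size=(h, 1)); b2 = np.zeros(1)

lr = 0.1
for _ in range(5000):
    H = np.tanh(X @ W1 + b1)                     # forward pass
    pred = H @ W2 + b2
    err = pred - Y
    gW2 = H.T @ err * (2 / n); gb2 = err.mean(0) * 2
    dH = err @ W2.T * (1 - H**2)                 # backprop through tanh
    gW1 = X.T @ dH * (2 / n); gb1 = dH.mean(0) * 2
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

# The fitted function tracks E[Y | X] = X², averaging out the noise.
grid = np.linspace(-1, 1, 101).reshape(-1, 1)
fit = np.tanh(grid @ W1 + b1) @ W2 + b2
gap = float(np.mean(np.abs(fit - grid**2)))
print(gap)
```

The learned curve sits close to X² even though no individual training target equals X²: minimizing MSE pulls the model toward the conditional mean of the noisy labels.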
Uncertainty Modeling:
To simulate realistic outcomes, one can add noise from an empirical residual distribution:
Lt(sample) = L̂t + εt,
where L̂t ≈ E[Lt∣history] is the predicted conditional mean and εt is sampled from the observed residuals.
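A minimal sketch of this residual-sampling scheme, with made-up arrays standing in for validation-set predictions and outcomes:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical validation data: model predictions L̂ and observed outcomes L.
actual = rng.normal(5.0, 2.0, size=5000)
pred = actual - rng.standard_t(df=5, size=5000)   # imperfect, heavy-tailed errors

residuals = actual - pred                          # empirical residual pool

# Simulate realistic outcomes for new predictions: add a bootstrap-sampled
# residual instead of assuming a parametric (e.g. Gaussian) noise model.
new_pred = np.array([4.0, 5.5, 6.1])
eps = rng.choice(residuals, size=new_pred.shape[0], replace=True)
simulated = new_pred + eps
print(simulated)
```

Sampling from the empirical residuals preserves skew and heavy tails that a fitted Gaussian would miss, while the conditional-mean prediction stays untouched.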
C. Counterfactual Outcome Prediction
In counterfactual prediction under dynamic treatment regimes (e.g., in healthcare), models like the G‑Transformer:
Train the continuous covariate predictions with MSE, so the output approximates E[Lt∣history].
Use empirical residuals to simulate realistic variability without imposing parametric assumptions.
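Putting the two ideas together, here is a sketch of a g-computation-style Monte Carlo rollout. Everything below is hypothetical: `predict_next` stands in for a trained conditional-mean model (e.g. a G-Transformer head), and the residual pool stands in for validation residuals.

```python
import numpy as np

rng = np.random.default_rng(5)

# Stand-in one-step model: E[L_{t+1} | L_t, A_t] for treatment a ∈ {0, 1}.
def predict_next(l_t, a_t):
    return 0.8 * l_t - 0.5 * a_t

residual_pool = rng.normal(0.0, 0.3, size=1000)   # empirical residuals (assumed)

def rollout(l0, treatment_seq, n_sims=2000):
    """Monte Carlo trajectories under a fixed treatment regime."""
    traj = np.full(n_sims, l0, dtype=float)
    for a in treatment_seq:
        eps = rng.choice(residual_pool, size=n_sims)
        traj = predict_next(traj, a) + eps        # simulate, not just predict
    return traj

treated = rollout(2.0, [1, 1, 1])
untreated = rollout(2.0, [0, 0, 0])
print(treated.mean() - untreated.mean())          # estimated counterfactual contrast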
Conclusion
The conditional expectation E[Y∣X] uniquely minimizes the MSE, making it the optimal predictor in the L2 sense. This foundational result not only supports the optimality of linear regression (e.g., OLS, BLUE, BLUP) but also informs modern deep learning practices.
When deep networks, such as Transformers, are trained with MSE loss, they effectively learn to approximate the conditional expectation. This insight is deeply connected to the bias–variance tradeoff, as using the conditional expectation eliminates bias and leaves only the irreducible error.