Why Conditional Expectation is the Best Predictor: Key Results and Applications

Tags: Linear Regression · OLS · Supervised Learning (Regression/Classification)


In predictive modeling, whether using traditional linear regression or modern deep learning architectures such as Transformers, a fundamental goal is to minimize prediction error. A key theoretical result is that the best predictor of a random variable $Y$ given features $X$, as measured by mean squared error (MSE), is the conditional expectation $\mathbb{E}[Y \mid X]$. This article states the theorem, proves it, and explores its connection to the bias–variance tradeoff along with practical use cases.


1. Theorem: Optimality of the Conditional Expectation

Statement:
Let $X$ and $Y$ be random variables defined on a common probability space, and let $f:\mathcal{X}\to\mathbb{R}$ be any measurable function. The MSE loss is defined as:

$$\mathcal{L}(f) = \mathbb{E}\bigl[(Y - f(X))^2\bigr].$$

Then the unique minimizer in the $L^2$ sense is:

$$f^*(X) = \mathbb{E}[Y \mid X].$$

In other words,

$$\underset{f}{\arg\min}\;\mathbb{E}\bigl[(Y - f(X))^2\bigr] = \mathbb{E}[Y \mid X].$$
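Before proving this, a quick numerical sanity check may help. The toy simulation below (synthetic data, not part of the original argument) assumes $X$ uniform on $\{0,1\}$ and $Y = X + \varepsilon$ with standard normal noise, so $\mathbb{E}[Y \mid X] = X$, and compares the empirical MSE of the conditional mean against another measurable predictor:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: X uniform on {0, 1}, Y = X + standard normal noise,
# so E[Y | X] = X and Var(Y | X) = 1.
n = 200_000
X = rng.integers(0, 2, size=n).astype(float)
Y = X + rng.normal(0.0, 1.0, size=n)

mse_cond = np.mean((Y - X) ** 2)            # conditional-mean predictor
mse_other = np.mean((Y - (X + 0.5)) ** 2)   # any other predictor does worse

print(mse_cond, mse_other)                  # roughly 1.0 vs 1.25
```

The conditional-mean predictor attains (up to Monte Carlo error) the irreducible error $\operatorname{Var}(Y \mid X) = 1$, while the shifted predictor pays an extra $0.5^2 = 0.25$, matching the decomposition proved next.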


2. Proof of the Theorem

Error Decomposition

For any function $f$, we write:

$$Y - f(X) = \underbrace{\bigl(Y - \mathbb{E}[Y \mid X]\bigr)}_{\text{error around the conditional mean}} + \underbrace{\bigl(\mathbb{E}[Y \mid X] - f(X)\bigr)}_{\text{deviation of } f(X) \text{ from the conditional mean}}.$$

Squaring both sides gives:

$$(Y - f(X))^2 = \bigl(Y - \mathbb{E}[Y \mid X]\bigr)^2 + \bigl(\mathbb{E}[Y \mid X] - f(X)\bigr)^2 + 2\bigl(Y - \mathbb{E}[Y \mid X]\bigr)\bigl(\mathbb{E}[Y \mid X] - f(X)\bigr).$$

Taking Expectations and Eliminating the Cross Term

Taking expectations:

$$\mathbb{E}\bigl[(Y - f(X))^2\bigr] = \mathbb{E}\bigl[(Y - \mathbb{E}[Y \mid X])^2\bigr] + \mathbb{E}\bigl[(\mathbb{E}[Y \mid X] - f(X))^2\bigr] + 2\,\mathbb{E}\bigl[(Y - \mathbb{E}[Y \mid X])(\mathbb{E}[Y \mid X] - f(X))\bigr].$$

Now, by the linearity of conditional expectation,

$$\mathbb{E}\bigl[Y - \mathbb{E}[Y \mid X] \,\big|\, X\bigr] = \mathbb{E}[Y \mid X] - \mathbb{E}[Y \mid X] = 0.$$

Since $\mathbb{E}[Y \mid X] - f(X)$ is a function of $X$, it can be pulled out of the conditional expectation; conditioning on $X$ and applying the tower property then shows that the cross term vanishes:

$$\mathbb{E}\bigl[(Y - \mathbb{E}[Y \mid X])(\mathbb{E}[Y \mid X] - f(X))\bigr] = \mathbb{E}\Bigl[(\mathbb{E}[Y \mid X] - f(X))\,\underbrace{\mathbb{E}\bigl[Y - \mathbb{E}[Y \mid X] \,\big|\, X\bigr]}_{=\,0}\Bigr] = 0.$$

Thus,

$$\mathbb{E}\bigl[(Y - f(X))^2\bigr] = \mathbb{E}\bigl[(Y - \mathbb{E}[Y \mid X])^2\bigr] + \mathbb{E}\bigl[(\mathbb{E}[Y \mid X] - f(X))^2\bigr].$$

Minimization

The term $\mathbb{E}\bigl[(Y - \mathbb{E}[Y \mid X])^2\bigr]$ does not depend on $f$ and represents the irreducible error. The second term is nonnegative and equals zero if and only if $f(X) = \mathbb{E}[Y \mid X]$ almost surely. Therefore, the unique minimizer (up to almost-sure equality) is:

$$f^*(X) = \mathbb{E}[Y \mid X].$$
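The decomposition above can be verified numerically. The sketch below uses an assumed synthetic model $Y = \sin(X) + \varepsilon$, so that $\mathbb{E}[Y \mid X] = \sin(X)$, and checks that the MSE of an arbitrary predictor splits into the irreducible error plus the squared deviation from the conditional mean:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

# Synthetic model: E[Y | X] = sin(X), Var(Y | X) = 0.25
X = rng.normal(size=n)
Y = np.sin(X) + rng.normal(0.0, 0.5, size=n)

cond_mean = np.sin(X)
f = 0.8 * X                      # an arbitrary competing predictor f(X)

lhs = np.mean((Y - f) ** 2)      # E[(Y - f(X))^2]
rhs = np.mean((Y - cond_mean) ** 2) + np.mean((cond_mean - f) ** 2)
print(lhs, rhs)                  # the two sides agree up to Monte Carlo error
```

The cross term averages to zero over the sample, exactly as the proof predicts, so the two sides coincide up to Monte Carlo noise.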


3. Connection to the Bias–Variance Tradeoff

The bias–variance tradeoff provides a framework for understanding the expected prediction error of an estimator. For an estimator $\hat{f}(X)$ of $f(X) = \mathbb{E}[Y \mid X]$ fit on a random training set, the expected squared error at a point $x$ decomposes into squared bias, variance, and irreducible error:

$$\mathbb{E}_{\text{train}}\bigl[(\hat{f}(x)-Y)^2 \mid X=x\bigr] = \underbrace{\bigl(\mathbb{E}_{\text{train}}[\hat{f}(x)] - \mathbb{E}[Y \mid X=x]\bigr)^2}_{\text{Bias}^2} + \underbrace{\operatorname{Var}(\hat{f}(x))}_{\text{Variance}} + \underbrace{\operatorname{Var}(Y \mid X=x)}_{\text{Irreducible Error}}.$$


Optimality of the Conditional Expectation

Using $f^*(x)=\mathbb{E}[Y \mid X=x]$:

  • Bias:
    $$\operatorname{Bias}\bigl(f^*(x)\bigr) = \mathbb{E}[f^*(x)] - \mathbb{E}[Y \mid X=x] = 0.$$
  • Variance:
    For the ideal predictor (ignoring finite-sample estimation error), the variance of $f^*(x)$ is zero.
  • Total Error:
    $$\mathbb{E}\bigl[(f^*(x)-Y)^2\bigr] = \operatorname{Var}(Y \mid X=x).$$
    This is the minimum achievable error, showing that $f^*(x)=\mathbb{E}[Y \mid X=x]$ is optimal in the $L^2$ sense.
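A small simulation illustrates how the three terms add up to the total prediction error. All values here are assumptions chosen for illustration: the conditional mean is $\mu = 2$ with noise $\sigma = 1$, and the estimator is a deliberately biased shrinkage of the training sample mean:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 2.0, 1.0            # E[Y | X=x] = mu, Var(Y | X=x) = sigma^2
n_train, n_rep = 20, 200_000

# Deliberately biased estimator: shrink the training sample mean by 0.8.
# The sample mean of n_train i.i.d. draws is N(mu, sigma^2 / n_train).
ybar = rng.normal(mu, sigma / np.sqrt(n_train), size=n_rep)
fhat = 0.8 * ybar

bias_sq = (fhat.mean() - mu) ** 2           # (E[fhat] - E[Y|X=x])^2
variance = fhat.var()                        # Var(fhat)
irreducible = sigma ** 2                     # Var(Y | X=x)

# Direct Monte Carlo estimate of the total error E[(fhat - Y)^2]
y_new = rng.normal(mu, sigma, size=n_rep)
total = np.mean((fhat - y_new) ** 2)
print(total, bias_sq + variance + irreducible)
```

The directly simulated total error matches the sum of the three decomposition terms, with the squared bias $(0.8\mu - \mu)^2 = 0.16$ dominating the variance contribution here.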

4. Use Cases and Practical Implications

A. Linear Regression

In linear regression, we restrict the predictor to be linear:

$$f_\beta(X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p.$$

The Ordinary Least Squares (OLS) estimator finds the $\beta$ that minimizes the empirical MSE, effectively approximating $\mathbb{E}[Y \mid X]$ within the class of linear functions. Under the Gauss–Markov conditions, the OLS estimator is the Best Linear Unbiased Estimator (BLUE): among all linear unbiased estimators of the coefficients, it has the smallest variance, making it the optimal linear approximation to the true conditional expectation.
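As a sketch (synthetic data with assumed coefficients, not taken from the text), OLS recovers the conditional mean exactly when the true model is linear:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
X = rng.normal(size=(n, 2))
beta_true = np.array([1.0, 2.0, -0.5])      # intercept, beta_1, beta_2
Y = beta_true[0] + X @ beta_true[1:] + rng.normal(0.0, 1.0, size=n)

# OLS: minimize the empirical MSE over the class of linear functions
A = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(A, Y, rcond=None)
print(beta_hat)                             # close to beta_true
```

Because here $\mathbb{E}[Y \mid X]$ is itself linear, the empirical MSE minimizer converges to the true coefficients; with a nonlinear conditional mean, OLS would instead converge to the best linear approximation of it.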

B. Deep Learning and Transformer Models

When training deep models such as Transformers with MSE loss, the network minimizes:

$$\min_\theta\, \sum_{i}\bigl(L_i - f_\theta(X_i)\bigr)^2,$$

so that, with enough data and capacity, the output $f_\theta(X)$ approximates $\mathbb{E}[L \mid X]$.

  • Uncertainty Modeling:
    To simulate realistic outcomes, one can add noise drawn from an empirical residual distribution:
    $$L_t^{(\text{sample})} = \widehat{L}_t + \varepsilon_t,$$
    where $\widehat{L}_t \approx \mathbb{E}[L_t \mid \text{history}]$ is the predicted conditional mean and $\varepsilon_t$ is sampled from the observed residuals.
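A minimal sketch of this residual-sampling idea follows. All data here are synthetic stand-ins: `preds` plays the role of a model's conditional-mean predictions and `actuals` the observed targets:

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic stand-ins: `actuals` are observed targets, `preds` are the
# model's approximations of the conditional mean E[L_t | history].
actuals = rng.normal(5.0, 2.0, size=10_000)
preds = np.full_like(actuals, 5.0)          # an idealized conditional-mean model

residuals = actuals - preds                 # empirical residual distribution

# Simulated outcome = predicted mean + bootstrap-sampled residual
samples = preds + rng.choice(residuals, size=preds.shape, replace=True)
print(samples.mean(), samples.std())        # mean ~5.0, spread ~2.0
```

Sampling residuals with replacement reproduces the observed spread around the conditional mean without assuming any parametric noise distribution.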

C. Counterfactual Outcome Prediction

In counterfactual prediction under dynamic treatment regimes (e.g., in healthcare), models like the G‑Transformer:

  • Train the continuous covariate predictions with MSE loss, so the output approximates $\mathbb{E}[L_t \mid \text{history}]$.
  • Use empirical residuals to simulate realistic variability without imposing parametric assumptions.

Conclusion

The conditional expectation $\mathbb{E}[Y \mid X]$ uniquely minimizes the MSE, making it the optimal predictor in the $L^2$ sense. This foundational result not only underlies the optimality of linear regression (e.g., OLS, BLUE, BLUP) but also informs modern deep learning practice. When deep networks such as Transformers are trained with MSE loss, they effectively learn to approximate the conditional expectation. The insight is deeply connected to the bias–variance tradeoff: using the conditional expectation eliminates bias and leaves only the irreducible error.