Why Conditional Expectation is the Best Predictor: Key Results and Applications

Tags: Linear Regression · OLS · Supervised Learning (Regression/Classification)


In predictive modeling, whether using traditional linear regression or modern deep learning architectures such as Transformers, a fundamental goal is to minimize prediction error. A key theoretical result is that the best predictor of a random variable $Y$ given features $X$, as measured by mean squared error (MSE), is the conditional expectation $\mathbb{E}[Y \mid X]$. This article states the theorem, proves it, and explores its connection to the bias–variance tradeoff along with practical use cases.


1. Theorem: Optimality of the Conditional Expectation

Statement:
Let $X$ and $Y$ be random variables defined on a common probability space, and let $f:\mathcal{X}\to\mathbb{R}$ be any measurable function. The MSE loss is defined as:

$$\mathcal{L}(f) = \mathbb{E}\bigl[(Y - f(X))^2\bigr].$$

Then the unique minimizer in the $L^2$ sense is:

$$f^*(X) = \mathbb{E}[Y \mid X].$$

In other words,

$$\underset{f}{\arg\min}\;\mathbb{E}\bigl[(Y - f(X))^2\bigr] = \mathbb{E}[Y \mid X].$$
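Before proving this, a quick numerical sanity check may help. The toy simulation below (synthetic data, not part of the original argument) assumes $X$ uniform on $\{0,1\}$ and $Y = X + \varepsilon$ with standard normal noise, so $\mathbb{E}[Y \mid X] = X$, and compares the empirical MSE of the conditional mean against another measurable predictor:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: X uniform on {0, 1}, Y = X + standard normal noise,
# so E[Y | X] = X and Var(Y | X) = 1.
n = 200_000
X = rng.integers(0, 2, size=n).astype(float)
Y = X + rng.normal(0.0, 1.0, size=n)

mse_cond = np.mean((Y - X) ** 2)            # conditional-mean predictor
mse_other = np.mean((Y - (X + 0.5)) ** 2)   # any other predictor does worse

print(mse_cond, mse_other)                  # roughly 1.0 vs 1.25
```

The conditional-mean predictor attains (up to Monte Carlo error) the irreducible error $\operatorname{Var}(Y \mid X) = 1$, while the shifted predictor pays an extra $0.5^2 = 0.25$, matching the decomposition proved next.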


2. Proof of the Theorem

Error Decomposition

For any function $f$, we write:

$$Y - f(X) = \underbrace{\bigl(Y - \mathbb{E}[Y \mid X]\bigr)}_{\text{error around the conditional mean}} + \underbrace{\bigl(\mathbb{E}[Y \mid X] - f(X)\bigr)}_{\text{deviation of } f(X) \text{ from the conditional mean}}.$$

Squaring both sides gives:

$$(Y - f(X))^2 = \bigl(Y - \mathbb{E}[Y \mid X]\bigr)^2 + \bigl(\mathbb{E}[Y \mid X] - f(X)\bigr)^2 + 2\bigl(Y - \mathbb{E}[Y \mid X]\bigr)\bigl(\mathbb{E}[Y \mid X] - f(X)\bigr).$$

Taking Expectations and Eliminating the Cross Term

Taking expectations:

$$\mathbb{E}\bigl[(Y - f(X))^2\bigr] = \mathbb{E}\bigl[(Y - \mathbb{E}[Y \mid X])^2\bigr] + \mathbb{E}\bigl[(\mathbb{E}[Y \mid X] - f(X))^2\bigr] + 2\,\mathbb{E}\bigl[(Y - \mathbb{E}[Y \mid X])(\mathbb{E}[Y \mid X] - f(X))\bigr].$$

Now, by the linearity of conditional expectation,

$$\mathbb{E}\bigl[Y - \mathbb{E}[Y \mid X] \,\big|\, X\bigr] = \mathbb{E}[Y \mid X] - \mathbb{E}[Y \mid X] = 0.$$

Since $\mathbb{E}[Y \mid X] - f(X)$ is a function of $X$, it can be pulled out of the conditional expectation; conditioning on $X$ and applying the tower property then shows that the cross term vanishes:

$$\mathbb{E}\bigl[(Y - \mathbb{E}[Y \mid X])(\mathbb{E}[Y \mid X] - f(X))\bigr] = \mathbb{E}\Bigl[(\mathbb{E}[Y \mid X] - f(X))\,\underbrace{\mathbb{E}\bigl[Y - \mathbb{E}[Y \mid X] \,\big|\, X\bigr]}_{=\,0}\Bigr] = 0.$$

Thus,

$$\mathbb{E}\bigl[(Y - f(X))^2\bigr] = \mathbb{E}\bigl[(Y - \mathbb{E}[Y \mid X])^2\bigr] + \mathbb{E}\bigl[(\mathbb{E}[Y \mid X] - f(X))^2\bigr].$$

Minimization

The term $\mathbb{E}\bigl[(Y - \mathbb{E}[Y \mid X])^2\bigr]$ does not depend on $f$ and represents the irreducible error. The second term is nonnegative and equals zero if and only if $f(X) = \mathbb{E}[Y \mid X]$ almost surely. Therefore, the unique minimizer (up to almost-sure equality) is:

$$f^*(X) = \mathbb{E}[Y \mid X].$$
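The decomposition above can be verified numerically. The sketch below uses an assumed synthetic model $Y = \sin(X) + \varepsilon$, so that $\mathbb{E}[Y \mid X] = \sin(X)$, and checks that the MSE of an arbitrary predictor splits into the irreducible error plus the squared deviation from the conditional mean:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

# Synthetic model: E[Y | X] = sin(X), Var(Y | X) = 0.25
X = rng.normal(size=n)
Y = np.sin(X) + rng.normal(0.0, 0.5, size=n)

cond_mean = np.sin(X)
f = 0.8 * X                      # an arbitrary competing predictor f(X)

lhs = np.mean((Y - f) ** 2)      # E[(Y - f(X))^2]
rhs = np.mean((Y - cond_mean) ** 2) + np.mean((cond_mean - f) ** 2)
print(lhs, rhs)                  # the two sides agree up to Monte Carlo error
```

The cross term averages to zero over the sample, exactly as the proof predicts, so the two sides coincide up to Monte Carlo noise.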


3. Connection to the Bias–Variance Tradeoff

The bias–variance tradeoff provides a framework for understanding the expected prediction error of an estimator. For an estimator $\hat{f}(X)$ of $f(X) = \mathbb{E}[Y \mid X]$ fit on a random training set, the expected squared error at a point $x$ decomposes into squared bias, variance, and irreducible error:

$$\mathbb{E}_{\text{train}}\bigl[(\hat{f}(x)-Y)^2 \mid X=x\bigr] = \underbrace{\bigl(\mathbb{E}_{\text{train}}[\hat{f}(x)] - \mathbb{E}[Y \mid X=x]\bigr)^2}_{\text{Bias}^2} + \underbrace{\operatorname{Var}(\hat{f}(x))}_{\text{Variance}} + \underbrace{\operatorname{Var}(Y \mid X=x)}_{\text{Irreducible Error}}.$$


Optimality of the Conditional Expectation

Using $f^*(x)=\mathbb{E}[Y \mid X=x]$:

  • Bias:
    $$\operatorname{Bias}\bigl(f^*(x)\bigr) = \mathbb{E}[f^*(x)] - \mathbb{E}[Y \mid X=x] = 0.$$
  • Variance:
    For the ideal predictor (ignoring finite-sample estimation error), the variance of $f^*(x)$ is zero.
  • Total Error:
    $$\mathbb{E}\bigl[(f^*(x)-Y)^2\bigr] = \operatorname{Var}(Y \mid X=x).$$
    This is the minimum achievable error, showing that $f^*(x)=\mathbb{E}[Y \mid X=x]$ is optimal in the $L^2$ sense.
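A small simulation illustrates how the three terms add up to the total prediction error. All values here are assumptions chosen for illustration: the conditional mean is $\mu = 2$ with noise $\sigma = 1$, and the estimator is a deliberately biased shrinkage of the training sample mean:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 2.0, 1.0            # E[Y | X=x] = mu, Var(Y | X=x) = sigma^2
n_train, n_rep = 20, 200_000

# Deliberately biased estimator: shrink the training sample mean by 0.8.
# The sample mean of n_train i.i.d. draws is N(mu, sigma^2 / n_train).
ybar = rng.normal(mu, sigma / np.sqrt(n_train), size=n_rep)
fhat = 0.8 * ybar

bias_sq = (fhat.mean() - mu) ** 2           # (E[fhat] - E[Y|X=x])^2
variance = fhat.var()                        # Var(fhat)
irreducible = sigma ** 2                     # Var(Y | X=x)

# Direct Monte Carlo estimate of the total error E[(fhat - Y)^2]
y_new = rng.normal(mu, sigma, size=n_rep)
total = np.mean((fhat - y_new) ** 2)
print(total, bias_sq + variance + irreducible)
```

The directly simulated total error matches the sum of the three decomposition terms, with the squared bias $(0.8\mu - \mu)^2 = 0.16$ dominating the variance contribution here.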

4. Use Cases and Practical Implications

A. Linear Regression

In linear regression, we restrict the predictor to be linear:

$$f_\beta(X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p.$$

The Ordinary Least Squares (OLS) estimator finds the $\beta$ that minimizes the empirical MSE, effectively approximating $\mathbb{E}[Y \mid X]$ within the class of linear functions. Under the Gauss–Markov conditions, the OLS estimator is the Best Linear Unbiased Estimator (BLUE): among all linear unbiased estimators of the coefficients, it has the smallest variance, making it the optimal linear approximation to the true conditional expectation.
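As a sketch (synthetic data with assumed coefficients, not taken from the text), OLS recovers the conditional mean exactly when the true model is linear:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
X = rng.normal(size=(n, 2))
beta_true = np.array([1.0, 2.0, -0.5])      # intercept, beta_1, beta_2
Y = beta_true[0] + X @ beta_true[1:] + rng.normal(0.0, 1.0, size=n)

# OLS: minimize the empirical MSE over the class of linear functions
A = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(A, Y, rcond=None)
print(beta_hat)                             # close to beta_true
```

Because here $\mathbb{E}[Y \mid X]$ is itself linear, the empirical MSE minimizer converges to the true coefficients; with a nonlinear conditional mean, OLS would instead converge to the best linear approximation of it.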

B. Deep Learning and Transformer Models

When training deep models such as Transformers with MSE loss, the network minimizes:

$$\min_\theta\, \sum_{i}\bigl(L_i - f_\theta(X_i)\bigr)^2,$$

so that, with enough data and capacity, the output $f_\theta(X)$ approximates $\mathbb{E}[L \mid X]$.

  • Uncertainty Modeling:
    To simulate realistic outcomes, one can add noise drawn from an empirical residual distribution:
    $$L_t^{(\text{sample})} = \widehat{L}_t + \varepsilon_t,$$
    where $\widehat{L}_t \approx \mathbb{E}[L_t \mid \text{history}]$ is the predicted conditional mean and $\varepsilon_t$ is sampled from the observed residuals.
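A minimal sketch of this residual-sampling idea follows. All data here are synthetic stand-ins: `preds` plays the role of a model's conditional-mean predictions and `actuals` the observed targets:

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic stand-ins: `actuals` are observed targets, `preds` are the
# model's approximations of the conditional mean E[L_t | history].
actuals = rng.normal(5.0, 2.0, size=10_000)
preds = np.full_like(actuals, 5.0)          # an idealized conditional-mean model

residuals = actuals - preds                 # empirical residual distribution

# Simulated outcome = predicted mean + bootstrap-sampled residual
samples = preds + rng.choice(residuals, size=preds.shape, replace=True)
print(samples.mean(), samples.std())        # mean ~5.0, spread ~2.0
```

Sampling residuals with replacement reproduces the observed spread around the conditional mean without assuming any parametric noise distribution.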

C. Counterfactual Outcome Prediction

In counterfactual prediction under dynamic treatment regimes (e.g., in healthcare), models like the G‑Transformer:

  • Train the continuous covariate predictions with MSE loss, so the output approximates $\mathbb{E}[L_t \mid \text{history}]$.
  • Use empirical residuals to simulate realistic variability without imposing parametric assumptions.

Conclusion

The conditional expectation $\mathbb{E}[Y \mid X]$ uniquely minimizes the MSE, making it the optimal predictor in the $L^2$ sense. This foundational result not only underlies the optimality of linear regression (e.g., OLS, BLUE, BLUP) but also informs modern deep learning practice. When deep networks such as Transformers are trained with MSE loss, they effectively learn to approximate the conditional expectation. The insight is deeply connected to the bias–variance tradeoff: using the conditional expectation eliminates bias and leaves only the irreducible error.