$\textbf{Gauss–Markov Theorem}$
$\textbf{STATEMENT}$ Suppose we have in matrix notation,
${\underline {y}}=X{\underline {\beta }}+{\underline {\varepsilon }},\quad ({\underline {y}},{\underline {\varepsilon }}\in \mathbb {R} ^{n},{\underline {\beta }}\in \mathbb {R} ^{K}{\text{ and }}X\in \mathbb {R} ^{n\times K})$
expanding to,
$\displaystyle y_{i}=\sum _{j=1}^{K}\beta _{j}X_{ij}+\varepsilon _{i}\quad \forall i=1,2,\ldots ,n $
where $\beta _{j}$ are non-random but $\textbf{unobservable}$ parameters, $X_{ij}$ are non-random and observable (called the "explanatory variables"), $\varepsilon _{i}$ are random, and so $y_{i}$ are random. The random variables $\varepsilon _{i}$ are called the "disturbance", "noise" or simply "error" (to be contrasted with "residual" later in the article). To include a constant in the model above, one can introduce it as an additional coefficient $\beta _{K+1}$ whose corresponding column of $X$, appended as the last column, consists entirely of ones, i.e., $X_{i(K+1)}=1$ for all $i$.
The Gauss–Markov assumptions concern the set of error random variables, $\varepsilon _{i}$:
They have mean zero: $\mathbb {E} [\varepsilon _{i}]=0.$
They are homoscedastic, that is, they all have the same finite variance: ${\text{Var}}(\varepsilon _{i})=\sigma ^{2}<\infty ,$ and
Distinct error terms are uncorrelated: ${\text{Cov}}(\varepsilon _{i},\varepsilon _{j})=0,\forall i\neq j.$
A linear estimator of $\beta _{j}$ is a linear combination
${\widehat {\beta }}_{j}=c_{1j}y_{1}+\cdots +c_{nj}y_{n}$
in which the coefficients $c_{ij}$ are not allowed to depend on the underlying coefficients $\beta _{j}$, since those are not observable, but are allowed to depend on the values $X_{ij}$, since these data are observable. (The dependence of the coefficients on each $X_{ij}$ is typically nonlinear; the estimator is linear in each $y_{i}$ and hence in each random ${\displaystyle \varepsilon ,}$ which is why this is "linear" regression.) The estimator is said to be $\textbf{unbiased}$ if and only if
$\mathbb {E} \left[{\widehat {\beta }}_{j}\right]=\beta _{j}$
regardless of the values of $X_{ij}$. Now, let $\displaystyle \sum \nolimits _{j=1}^{K}\lambda _{j}\beta _{j}$ be some linear combination of the coefficients. Then the mean squared error of the corresponding estimation is
$\mathbb {E} \left[\left(\sum _{j=1}^{K}\lambda _{j}\left({\widehat {\beta }}_{j}-\beta _{j}\right)\right)^{2}\right],$
in other words it is the expectation of the square of the weighted sum (across parameters) of the differences between the estimators and the corresponding parameters to be estimated. (Since we are considering the case in which all the parameter estimates are unbiased, this mean squared error is the same as the variance of the linear combination.) The $\textbf{best linear unbiased estimator (BLUE)} $ of the vector $\beta$ of parameters $\beta _{j}$ is one with the smallest mean squared error for every vector $\lambda$ of linear combination parameters. This is equivalent to the condition that
${{\text{Var}}({\widetilde {\beta }})-{\text{Var}}({\widehat {\beta }})}$
is a positive semi-definite matrix for every other linear unbiased estimator ${\widetilde {\beta }}$.
The $\textbf{ordinary least squares estimator (OLS)}$ is the function
${\widehat {\beta }}=(X'X)^{-1}X'y$
of $y$ and $X$ (where $X'$ denotes the transpose of $X$) that minimizes the sum of squares of residuals (misprediction amounts):
$\displaystyle \sum _{i=1}^{n}\left(y_{i}-{\widehat {y}}_{i}\right)^{2}=\sum _{i=1}^{n}\left(y_{i}-\sum _{j=1}^{K}{\widehat {\beta }}_{j}X_{ij}\right)^{2}.$
The theorem now states that the OLS estimator is a BLUE.
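As a rough numerical illustration of the estimator defined above, the following sketch computes ${\widehat {\beta }}=(X'X)^{-1}X'y$ with NumPy. The data, the seed, the "true" coefficients and the helper name `ols` are all hypothetical choices made for this example and are not part of the theorem.

```python
import numpy as np

def ols(X, y):
    """Return the OLS estimate (X'X)^{-1} X'y by solving the normal equations."""
    # Solving X'X b = X'y avoids forming the explicit inverse.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical data: n = 50 observations, K = 3 regressors,
# with the last column of X set to one to model a constant term.
rng = np.random.default_rng(0)
n, K = 50, 3
X = np.column_stack([rng.normal(size=(n, K - 1)), np.ones(n)])
beta = np.array([2.0, -1.0, 0.5])             # arbitrary "true" coefficients
y = X @ beta + rng.normal(scale=0.3, size=n)  # errors: mean zero, equal variance

beta_hat = ols(X, y)
residuals = y - X @ beta_hat                  # observed counterpart of the errors
```

At the minimizer the residuals are orthogonal to every column of $X$, so `X.T @ residuals` should vanish up to rounding error. Solving the normal equations is used here only because it mirrors the formula; for ill-conditioned $X$ a QR-based or SVD-based routine such as `np.linalg.lstsq` is usually the safer choice.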
$\textbf{PROOF}$ The main idea of the proof is that the least-squares estimator is uncorrelated with every linear unbiased estimator of zero, i.e., with every linear combination $a_{1}y_{1}+\cdots +a_{n}y_{n}$ whose coefficients do not depend upon the unobservable $\beta$ but whose expected value is always zero.
Let ${\displaystyle {\tilde {\beta }}=Cy}$ be another linear estimator of ${\displaystyle \beta }$ with ${\displaystyle C=(X'X)^{-1}X'+D}$ where ${\displaystyle D}$ is a ${\displaystyle K\times n}$ non-zero matrix. As we're restricting to unbiased estimators, minimum mean squared error implies minimum variance. The goal is therefore to show that such an estimator has a variance no smaller than that of ${\displaystyle {\widehat {\beta }},}$ the OLS estimator. We calculate:
${\displaystyle {\begin{aligned}\mathbb {E} [{\tilde {\beta }}]&=\mathbb {E} [Cy]\\&=\mathbb {E} \left[\left((X'X)^{-1}X'+D\right)(X\beta +\varepsilon )\right]\\&=\left((X'X)^{-1}X'+D\right)X\beta +\left((X'X)^{-1}X'+D\right)\mathbb {E} [\varepsilon ]\\&=\left((X'X)^{-1}X'+D\right)X\beta &&\mathbb {E} [\varepsilon ]=0\\&=(X'X)^{-1}X'X\beta +DX\beta \\&=(I_{K}+DX)\beta .\\\end{aligned}}}$
Therefore, ${\displaystyle {\tilde {\beta }}}$ is unbiased if and only if ${\displaystyle DX=0}$. Then:
${\displaystyle {\begin{aligned}{\text{Var}}({\tilde {\beta }})&={\text{Var}}(Cy)\\&=C{\text{ Var}}(y)C'\\&=\sigma ^{2}CC'&&{\text{Var}}(y)={\text{Var}}(\varepsilon )=\sigma ^{2}I_{n}\\&=\sigma ^{2}\left((X'X)^{-1}X'+D\right)\left(X(X'X)^{-1}+D'\right)\\&=\sigma ^{2}\left((X'X)^{-1}X'X(X'X)^{-1}+(X'X)^{-1}X'D'+DX(X'X)^{-1}+DD'\right)\\&=\sigma ^{2}(X'X)^{-1}+\sigma ^{2}(X'X)^{-1}(DX)'+\sigma ^{2}DX(X'X)^{-1}+\sigma ^{2}DD'\\&=\sigma ^{2}(X'X)^{-1}+\sigma ^{2}DD'&&DX=0\\&={\text{Var}}({\widehat {\beta }})+\sigma ^{2}DD'&&\sigma ^{2}(X'X)^{-1}={\text{Var}}({\widehat {\beta }})\end{aligned}}}$
Since $DD'$ is a positive semidefinite matrix, ${\displaystyle {\text{Var}}({\tilde {\beta }})}$ exceeds ${\displaystyle {\text{Var}}({\widehat {\beta }})}$ by a positive semidefinite matrix.
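The matrix identity above can be checked numerically. The sketch below is an illustration only: it builds one particular $D$ satisfying $DX=0$ by multiplying an arbitrary matrix `A` with the projector $I_{n}-X(X'X)^{-1}X'$. The dimensions, the seed, the value of $\sigma ^{2}$ and the names `A` and `M` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K, sigma2 = 20, 3, 0.5            # hypothetical sizes and error variance
X = rng.normal(size=(n, K))

XtX_inv = np.linalg.inv(X.T @ X)
ols_weights = XtX_inv @ X.T          # (X'X)^{-1} X'
M = np.eye(n) - X @ ols_weights      # projector onto the orthogonal complement of col(X); M X = 0

A = rng.normal(size=(K, n))          # arbitrary K x n matrix
D = A @ M                            # guarantees D X = 0
C = ols_weights + D                  # weights of the competing linear estimator

var_ols = sigma2 * XtX_inv           # Var(beta_hat)
var_alt = sigma2 * (C @ C.T)         # Var(beta_tilde) = sigma^2 C C'
gap = var_alt - var_ols              # should equal sigma^2 D D'
eigenvalues = np.linalg.eigvalsh(gap)
```

Here `C @ X` reproduces the identity matrix, which is the unbiasedness condition, and the eigenvalues of `gap` are non-negative up to rounding, in line with the positive semidefiniteness of $DD'$.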
Let ${\displaystyle l^{t}{\tilde {\beta }}}$ be another linear unbiased estimator of ${\displaystyle l^{t}\beta }$:
${\displaystyle {\begin{aligned}{\text{Var}}(l^{t}{\tilde {\beta }})&=l^{t}{\text{Var}}({\tilde {\beta }})l\\&=\sigma ^{2}l^{t}(X'X)^{-1}l+\sigma ^{2}l^{t}DD^{t}l\\&={\text{Var}}(l^{t}{\widehat {\beta }})+\sigma ^{2}(D^{t}l)^{t}(D^{t}l)&&\sigma ^{2}l^{t}(X'X)^{-1}l={\text{Var}}(l^{t}{\widehat {\beta }})\\&={\text{Var}}(l^{t}{\widehat {\beta }})+\sigma ^{2}\|D^{t}l\|^{2}\\&\geqslant {\text{Var}}(l^{t}{\widehat {\beta }})\\\end{aligned}}}$
Moreover equality holds if and only if ${\displaystyle D^{t}l=0}$. We calculate
${\displaystyle {\begin{aligned}l^{t}{\tilde {\beta }}&=l^{t}\left(((X'X)^{-1}X'+D)y\right)&&{\text{ from above}}\\&=l^{t}(X'X)^{-1}X'y+l^{t}Dy\\&=l^{t}{\widehat {\beta }}+(D^{t}l)^{t}y\\&=l^{t}{\widehat {\beta }}&&D^{t}l=0\end{aligned}}}$
This proves that the equality holds if and only if ${\displaystyle l^{t}{\tilde {\beta }}=l^{t}{\widehat {\beta }}}$ which gives the uniqueness of the OLS estimator as a BLUE. Q.E.D.
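The equality case can also be illustrated numerically. The sketch below assumes a rank-one $D=uv^{t}M$ with $M=I_{n}-X(X'X)^{-1}X'$, so that $D^{t}l=0$ exactly when $l$ is orthogonal to $u$. The vectors `u`, `v`, `l_strict` and `l_equal`, the seed, and the helper `var_gap` are hypothetical choices for this example.

```python
import numpy as np

rng = np.random.default_rng(2)
n, K, sigma2 = 15, 3, 2.0                 # hypothetical sizes and error variance
X = rng.normal(size=(n, K))
XtX_inv = np.linalg.inv(X.T @ X)
M = np.eye(n) - X @ XtX_inv @ X.T         # M X = 0

u = np.array([1.0, 0.0, 0.0])
v = rng.normal(size=n)
D = np.outer(u, v) @ M                    # rank one, and D X = 0
C = XtX_inv @ X.T + D                     # competing unbiased linear estimator

def var_gap(l):
    """Var(l' beta_tilde) - Var(l' beta_hat) under Var(y) = sigma^2 I."""
    var_alt = sigma2 * (l @ C @ C.T @ l)
    var_ols = sigma2 * (l @ XtX_inv @ l)
    return var_alt - var_ols

l_strict = np.array([1.0, 1.0, 0.0])      # not orthogonal to u, so D'l != 0
l_equal = np.array([0.0, 1.0, -2.0])      # orthogonal to u, so D'l = 0

y = rng.normal(size=n)                    # any response vector
beta_hat, beta_tilde = XtX_inv @ X.T @ y, C @ y
```

For `l_strict` the gap equals $\sigma ^{2}\|D^{t}l\|^{2}>0$, while for `l_equal` the gap is zero and `l_equal @ beta_tilde` coincides with `l_equal @ beta_hat` for any `y`.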