Theorem: Let $Y=X\beta+\varepsilon$ where $$Y\in\mathcal M_{n\times 1}(\mathbb R),$$ $$X\in \mathcal M_{n\times p}(\mathbb R),$$ $$\beta\in\mathcal M_{p\times 1}(\mathbb R ),$$ and $$\varepsilon\in\mathcal M_{n\times 1}(\mathbb R ).$$
We suppose that $X$ has full rank $p$ and that $$\mathbb E[\varepsilon]=0\quad\text{and}\quad \text{Var}(\varepsilon)=\sigma ^2I.$$ Then the least squares estimator (i.e. $\hat\beta=(X^TX)^{-1}X^TY$) is the best linear unbiased estimator of $\beta$; that is, for any linear unbiased estimator $\tilde\beta$ of $\beta$, it holds that $$\text{Var}(\tilde\beta)-\text{Var}(\hat\beta)\geq 0,$$ i.e. the difference of the covariance matrices is positive semidefinite.
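(Just to make the formula concrete, here is a minimal numerical sketch, not part of the statement itself: it simulates the model with arbitrary illustrative choices of $n$, $p$, $\beta$ and $\sigma$, and computes $\hat\beta=(X^TX)^{-1}X^TY$.)

```python
# Minimal sketch: simulate Y = X beta + eps with E[eps] = 0, Var(eps) = sigma^2 I,
# then compute the least squares estimator beta_hat = (X^T X)^{-1} X^T Y.
# The values of n, p, beta and sigma below are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 100, 3, 0.5
X = rng.normal(size=(n, p))            # has full column rank with probability 1
beta = np.array([1.0, -2.0, 0.5])      # "true" coefficient vector (p x 1)
eps = rng.normal(scale=sigma, size=n)  # errors with mean 0 and variance sigma^2
Y = X @ beta + eps

# beta_hat = (X^T X)^{-1} X^T Y, computed via a linear solve for numerical stability
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_hat)  # should be close to the true beta
```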
Proof
Let $\tilde\beta$ be a linear unbiased estimator, i.e. $$\tilde\beta=AY\ \ \text{for some }A\in\mathcal M_{p\times n}(\mathbb R)\quad\text{and}\quad\mathbb E[\tilde\beta]=\beta\text{ for all }\beta\in\mathbb R ^p.$$
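(For reference, the standard next step of this proof spells out what the unbiasedness condition means: $$\mathbb E[\tilde\beta]=\mathbb E[AY]=A\,\mathbb E[Y]=AX\beta,$$ and requiring this to equal $\beta$ for every $\beta\in\mathbb R^p$ forces $AX=I_p$.)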
Questions:
1) Why do we require $\mathbb E[\tilde\beta]=\beta$ for all $\beta$? I don't really understand this point. To me $\beta$ is fixed, so saying $\mathbb E[\tilde\beta]=\beta$ for all $\beta$ doesn't really make sense.
2) Also, what is the difference between the least squares estimator and the maximum likelihood estimator? They are both $\hat\beta=(X^TX)^{-1}X^TY$, so if they are the same, I don't really see why we give them two different names.