I am looking into a question about updating the variance incrementally as a dataset grows.
To begin with, the dataset $D_{n-1}$ contains the elements $\{x_1, \dots, x_{n-1}\}$, and we already know its:
- mean $\bar{x}_{n-1}$
- variance $\sigma^2_{n-1}$
Suppose we add a new element $x_n$ to get a new dataset $D_n$ containing $\{x_1, \dots, x_{n-1}, x_n\}$, and assume we have already computed its:
- mean $\bar{x}_n$ (e.g. by the formula $\bar{x}_n = \frac{n-1}{n}\bar{x}_{n-1} + \frac{1}{n}x_n$)
Which of the given options is then the variance $\sigma^2_n$? ...
Using a Python test script, I have ruled out all the other options and confirmed numerically that the correct answer is:
$\sigma^2_n = \frac{n-1}{n}\sigma^2_{n-1} + \frac{1}{n}(x_n-\bar{x}_{n-1})(x_n - \bar{x}_n)$
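A minimal sketch of such a numerical check is shown here. It is not the original script: the function names `update_mean` and `update_variance` are made up for illustration, and it assumes $\sigma^2$ denotes the population variance (dividing by the number of elements, not by one less), which is the convention under which the formula matches.

```python
import random
import statistics


def update_mean(mean_prev, x_new, n):
    """Mean of n elements from the mean of the first n-1 and the new element."""
    return (n - 1) / n * mean_prev + x_new / n


def update_variance(var_prev, mean_prev, mean_new, x_new, n):
    """Population variance of n elements from the variance of the first n-1."""
    return (n - 1) / n * var_prev + (x_new - mean_prev) * (x_new - mean_new) / n


random.seed(0)
data = [random.uniform(-10, 10) for _ in range(50)]

# Start from a single element: its mean is itself and its variance is 0.
mean, var = data[0], 0.0
for n in range(2, len(data) + 1):
    x_new = data[n - 1]
    mean_new = update_mean(mean, x_new, n)
    var = update_variance(var, mean, mean_new, x_new, n)
    mean = mean_new
    # Compare against a direct computation over the first n elements.
    assert abs(mean - statistics.mean(data[:n])) < 1e-9
    assert abs(var - statistics.pvariance(data[:n])) < 1e-9
```

If `statistics.pvariance` is swapped for `statistics.variance` (the sample variance, dividing by $n-1$), the assertion fails, so the check only supports the formula for the population variance.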
However, I need some help proving it analytically.
Let me know if you need more details. I would highly appreciate your help.