Let's say I have a data set of $10, 20, 30$. Its mean is $20$ and its variance is $66.667$. Is there a formula that lets me calculate the new variance if I were to remove $10$ and add $50$, turning the data set into $20, 30, 50$?
-
You may find this relevant: https://math.stackexchange.com/questions/102978/incremental-computation-of-standard-deviation – Sean Lee Feb 14 '19 at 13:30
-
@SeanLee That link focuses on if we were just adding data to the dataset, but what about removing? – dude8998 Feb 14 '19 at 13:42
-
I haven't worked it out in detail, but I believe if you know how to calculate the incremental SD by adding data, this formulation should also allow you to calculate the incremental SD by removing data. – Sean Lee Feb 14 '19 at 13:44
-
Yeah, I tried to derive but I just can't wrap my head around it, would help if someone else worked it out – dude8998 Feb 14 '19 at 13:52
2 Answers
Suppose there are $n$ values in the data set and we replace a value $x$ with a new value $x'$.
First calculate the new mean $M'$:
$M' = M + \frac{x'-x}{n}$
where $M$ is the old mean. Then calculate the new variance:
$V' = V + (M'-M)^2 + \frac{(x'-M')^2-(x-M')^2}{n}$
where $V$ is the old variance (the population variance, dividing by $n$, which is what gives $66.667$ in your example). The term $(M'-M)^2$ is the change due to the movement of the mean, and $\frac{(x'-M')^2-(x-M')^2}{n}$ is the change due to the replacement of $x$ by $x'$.
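A short sketch of where this comes from, assuming $V$ is the population variance and writing $x_1,\dots,x_n$ for the old values (notation introduced here only for illustration). Measuring the old values against the new mean gives

$$ \sum_{i=1}^{n}(x_i-M')^2 = nV + n(M'-M)^2, $$

because the cross term $2(M-M')\sum_{i=1}^{n}(x_i-M)$ vanishes. Swapping $x$ for $x'$ changes this sum by $(x'-M')^2-(x-M')^2$, so

$$ nV' = nV + n(M'-M)^2 + (x'-M')^2-(x-M')^2, $$

and dividing by $n$ gives the expression for $V'$ above.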
In your example, $n=3$, $x=10$, $x'=50$ so:
$M' = 20 +\frac{50-10}{3}=\frac{100}{3}$
$V' = \frac{200}{3} + \frac{40^2}{9} + \frac{50^2-70^2}{27} = \frac{1400}{9}$
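As a quick numerical check of the worked example, here is a minimal Python sketch of the replacement update. The function name `replace_update` and its argument names are made up for illustration, and it assumes `var` is the population variance (dividing by $n$).

```python
import statistics


def replace_update(mean, var, n, old, new):
    """Return (new_mean, new_var) after swapping `old` for `new` among n values.

    Assumes `var` is the population variance (divides by n).
    """
    new_mean = mean + (new - old) / n
    new_var = (
        var
        + (new_mean - mean) ** 2  # shift caused by the mean moving
        + ((new - new_mean) ** 2 - (old - new_mean) ** 2) / n  # swap of old for new
    )
    return new_mean, new_var


data = [10, 20, 30]
m, v = replace_update(
    statistics.mean(data), statistics.pvariance(data), len(data), old=10, new=50
)
print(m, v)  # should agree with 100/3 and 1400/9 up to floating-point rounding
```

Comparing `v` against `statistics.pvariance([20, 30, 50])` is an easy way to confirm the update matches a full recomputation.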
Denote the running SD (of window length $n$) at the $k$-th time step as $s_{k:n+k-1}$, and the corresponding running mean as $\bar{X}_{k:n+k-1}$. (The subscript specifies which data points enter the calculation, which will be relevant later.)
What you're asking is essentially, at every time step, given $s_{k:n+k-1}$, to:
- Calculate a temporary SD $s_{k+1:n+k-1}$ first by removing the "old" data point
- Use $s_{k+1:n+k-1}$ to calculate the new SD $s_{k+1:n+k}$
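The rest follows directly from the incremental computation of standard deviation. Expanding $\sum_{i=1}^{n}(X_i-\bar X_n)^2$ around $\bar X_{n-1}$ leaves a cross term $2(\bar X_{n-1}-\bar X_n)\sum_{i=1}^{n-1}(X_i-\bar X_{n-1})$, and it is easy to show that this summation is equal to $0$, which gives
$$ s^2_n = \frac{(n - 2)s^2_{n - 1} + (n - 1)(\bar X_{n - 1} - \bar X_n)^2 + (X_n - \bar X_{n})^2}{n - 1}. $$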
Or if I were to write it in the notation that I have introduced, where I treat $X_k$ as the "new" datapoint (although it's the datapoint we want to remove):
$$ s^2_{k:n+k-1} = \frac{(n - 2)s^2_{k+1:n+k-1} + (n - 1)(\bar X_{k+1:n+k-1} - \bar X_{k:n+k-1})^2 + (X_k - \bar X_{k:n+k-1})^2}{n - 1}. $$
Solving this for $s^2_{k+1:n+k-1}$ is just simple algebra:
$$ s^2_{k+1:n+k-1} = \frac{(n-1) s^2_{k:n+k-1} - (n - 1)(\bar X_{k+1:n+k-1} - \bar X_{k:n+k-1})^2 - (X_k - \bar X_{k:n+k-1})^2}{n-2} $$
Now, since we have $s^2_{k+1:n+k-1}$, we can calculate $s^2_{k+1:n+k}$, which is what we want. Of course, we just apply the formula that we were given again:
$$ s^2_{k+1:n+k} = \frac{(n - 2)s^2_{k+1:n+k-1} + (n - 1)(\bar X_{k+1:n+k-1} - \bar X_{k+1:n+k})^2 + (X_{k+n} - \bar X_{k+1:n+k})^2}{n - 1}. $$
This gives the running SD (or variance) that you want. I believe you've already figured out how to calculate the running means, so I won't go through that.
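The two steps above can be sketched in Python as follows. The helper names `remove_point` and `add_point` are hypothetical, the variances are sample variances (dividing by $n-1$, as in the formulas in this answer), and the running-mean updates are filled in as an assumption since the answer leaves them out. The removal step needs $n > 2$.

```python
import statistics


def remove_point(mean, var, n, x_old):
    """Sample mean/variance of the n-1 points left after dropping x_old (needs n > 2)."""
    mean_tmp = (n * mean - x_old) / (n - 1)
    var_tmp = (
        (n - 1) * var
        - (n - 1) * (mean_tmp - mean) ** 2
        - (x_old - mean) ** 2
    ) / (n - 2)
    return mean_tmp, var_tmp


def add_point(mean_tmp, var_tmp, n, x_new):
    """Sample mean/variance of n points after adding x_new to the n-1 remaining points."""
    mean_new = ((n - 1) * mean_tmp + x_new) / n
    var_new = (
        (n - 2) * var_tmp
        + (n - 1) * (mean_tmp - mean_new) ** 2
        + (x_new - mean_new) ** 2
    ) / (n - 1)
    return mean_new, var_new


window = [10, 20, 30]
n = len(window)
mean_tmp, var_tmp = remove_point(
    statistics.mean(window), statistics.variance(window), n, x_old=10
)
mean_new, var_new = add_point(mean_tmp, var_tmp, n, x_new=50)
print(mean_new, var_new)  # compare with statistics.variance([20, 30, 50])
```

Note that this uses the sample variance, so for the data in the question it starts from $100$ rather than $66.667$. Over a long stream, repeated subtraction can accumulate floating-point error, so an occasional full recomputation may be worth considering.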