Say I have a set S of values and want to store some summary information about it in a database, so that later, when I acquire a new value v, I can make a reasonable estimate of the summary information for the set S ∪ {v}, even though by then I no longer have access to the original members of S. I'd like the summary information to include the mean and variance of these sets, plus as little additional information as needed. A natural idea for the additional information would be S's cardinality, but I'm willing to save more complicated information about S if needed. My main constraint is to minimize the size of the retained information.

If I only cared about the mean, then storing the mean plus the cardinality of S would obviously be enough: I could update with a new value by taking a weighted average of the old mean (weighted by the old cardinality) and the new value (weighted by 1). But I'd like to keep track of the variance too. A good estimate is enough; I don't need to be able to reconstruct the exact mean and variance of S ∪ {v}.
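To make the mean-only update concrete, here is a minimal sketch in Python. The function name `update_mean` and its argument names are made up for illustration; it only assumes that the old mean and the old cardinality were saved.

```python
def update_mean(old_mean, old_count, new_value):
    """Fold one new value into a saved (mean, count) pair.

    The old mean is weighted by the old cardinality and the
    new value is weighted by 1.
    """
    new_count = old_count + 1
    new_mean = (old_mean * old_count + new_value) / new_count
    return new_mean, new_count


# Three values with mean 2.0, then a new value 6.0 arrives.
mean, count = update_mean(2.0, 3, 6.0)
```

With a saved mean of 2.0 over 3 values, adding 6.0 gives (2.0*3 + 6.0)/4 = 3.0 over 4 values.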

I expect that even asking this displays how naive I am about statistics, but I'd appreciate any help. I don't know where to look for answers.

dubiousjim

2 Answers

Following that link about moving variance in my comment, I came upon this: Welford's online algorithm for calculating variance, which seems to supply what I'm looking for.

Here's the algorithm:

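new_count = old_count + 1
d1 = new_value - old_mean                    # deviation from the old mean
new_mean = old_mean + d1/new_count
d2 = new_value - new_mean                    # deviation from the updated mean
new_sum_squares = old_sum_squares + d1*d2    # running sum of squared deviations from the mean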

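From a saved (count, mean, sum_squares), where sum_squares is the running sum of squared deviations from the mean, the population variance can be computed as sum_squares/count (and the sample variance as sum_squares/(count - 1) when count > 1).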

Given an initial value v, one can start with:

count = 1
mean = v
sum_squares = 0
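As a rough sketch, the update and initialization steps above could be written in Python as follows. The function names (`welford_init`, `welford_update`, `population_variance`) and the choice to keep the state as a plain tuple are my own assumptions for illustration, not part of the algorithm's description.

```python
def welford_init(v):
    """Start the saved state (count, mean, sum_squares) from a first value."""
    return (1, float(v), 0.0)


def welford_update(state, new_value):
    """Fold one new value into the saved state, without the original data."""
    count, mean, sum_squares = state
    count += 1
    d1 = new_value - mean        # deviation from the old mean
    mean += d1 / count
    d2 = new_value - mean        # deviation from the updated mean
    sum_squares += d1 * d2       # sum of squared deviations from the mean
    return (count, mean, sum_squares)


def population_variance(state):
    """Population variance from the saved state."""
    count, _, sum_squares = state
    return sum_squares / count


# Usage: only the three-number state is kept between updates.
state = welford_init(2.0)
for v in [4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    state = welford_update(state, v)
```

For the eight values 2, 4, 4, 4, 5, 5, 7, 9 this ends with a count of 8, a mean of 5 and a population variance of 4, up to floating-point rounding.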

If you want a weighted mean and variance, you can modify the algorithm like this:

new_count = old_count + new_weight
d1 = (new_value - old_mean)*new_weight

The other lines stay the same. (Here the mean of values a,b with weights x,y, respectively, is (ax+by)/(x+y); and the weighted sum_squares is x(a-mean)^2 + y(b-mean)^2.)
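A sketch of the weighted variant in Python, under the same caveats: the function names are hypothetical, and starting the state with count equal to the first value's weight is my assumption (the unweighted initialization above corresponds to a weight of 1).

```python
def weighted_init(v, weight):
    """Start the saved state (total_weight, mean, sum_squares).

    Assumption: the initial count is the first value's weight.
    """
    return (float(weight), float(v), 0.0)


def weighted_update(state, new_value, new_weight):
    """Fold one weighted value into the saved state."""
    total_weight, mean, sum_squares = state
    total_weight += new_weight
    d1 = (new_value - mean) * new_weight   # weighted deviation from the old mean
    mean += d1 / total_weight
    d2 = new_value - mean                  # deviation from the updated mean
    sum_squares += d1 * d2                 # weighted sum of squared deviations
    return (total_weight, mean, sum_squares)


# Values a=1, b=3 with weights x=1, y=3.
state = weighted_init(1.0, 1.0)
state = weighted_update(state, 3.0, 3.0)
```

For these two values the state matches the formulas in the parenthetical: the mean is (1*1 + 3*3)/(1+3) = 2.5 and the weighted sum_squares is 1*(1-2.5)^2 + 3*(3-2.5)^2 = 3.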

dubiousjim

This problem was discussed, with a proof and some alternative methods, over on math.stackexchange.