Could someone please recommend a paper or blog post that describes the online k-means algorithm?
1 Answer
The original MacQueen k-means publication (the first to use the name "k-means") describes an online algorithm.
MacQueen, J. B. (1967). "Some Methods for Classification and Analysis of Multivariate Observations". Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1. University of California Press. pp. 281–297.
After assigning each point, the mean is incrementally updated.
As far as I can tell, it was also meant to be a single pass over the data only, although it can be trivially repeated multiple times to reassign points until convergence.
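To make the incremental update concrete, here is a minimal single-pass sketch in Python with NumPy. The function name `macqueen_kmeans` and the choice to seed the centers with the first k points are assumptions made for illustration; this is my reading of the idea, not code from the paper.

```python
import numpy as np

def macqueen_kmeans(X, k):
    """Single sequential pass of a MacQueen-style online k-means (sketch)."""
    X = np.asarray(X, dtype=float)
    # Assumption: the first k points seed the k cluster means.
    centers = X[:k].copy()
    counts = np.ones(k, dtype=int)
    for x in X[k:]:
        # Assign the point to the nearest current mean (squared Euclidean).
        j = int(np.argmin(((centers - x) ** 2).sum(axis=1)))
        # Incremental mean update: new_mean = old_mean + (x - old_mean) / n
        counts[j] += 1
        centers[j] += (x - centers[j]) / counts[j]
    return centers, counts

# Hypothetical toy data: two well-separated groups.
X = [[0, 0], [10, 10], [0, 2], [10, 12], [2, 0]]
centers, counts = macqueen_kmeans(X, k=2)
```

Because each mean is updated immediately after an assignment, the very next point is already compared against the moved center, which is what distinguishes this from a batch (Lloyd-style) iteration.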
MacQueen's method usually takes fewer iterations than Lloyd's to converge if your data is shuffled. On ordered data, it can have problems. On the downside, it requires more computation per object, so each iteration takes slightly longer.
When you implement a parallel version of k-means, make sure to study the update formulas in MacQueen's publication. They're useful.
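One way this can help in a parallel setting: the same incremental formula generalizes to merging two partial means with their counts, so each worker can keep a local (mean, count) pair per cluster and combine them afterwards. The sketch below is that weighted-mean generalization, not a formula quoted from the paper, and `merge_means` is a hypothetical helper name.

```python
import numpy as np

def merge_means(mean_a, n_a, mean_b, n_b):
    """Combine two partial cluster means computed on disjoint chunks (sketch)."""
    mean_a = np.asarray(mean_a, dtype=float)
    mean_b = np.asarray(mean_b, dtype=float)
    n = n_a + n_b
    if n == 0:
        # Nothing assigned on either side; keep the first mean unchanged.
        return mean_a, 0
    # Same shape as the single-point update, weighted by the second chunk's count.
    return mean_a + (mean_b - mean_a) * (n_b / n), n

# Hypothetical example: one point at the origin merged with two points averaging (3, 3).
merged, n = merge_means([0.0, 0.0], 1, [3.0, 3.0], 2)
```

With `n_b = 1` this reduces to the per-point update, so the sequential and merged versions stay consistent.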