I'm trying to understand K-Means||, a scalable version of K-Means++, which itself is an "improved" version of the clustering algorithm K-Means.
Please find here the link to K-Means|| paper http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
Without detail What does "sampling with an oversampling factor mean? If $P(x)$ defines a probability, and $l$ is an integer $\ge 2$ say, what does sampling from $l\cdot P(x)$ mean?
Within context
You need to know:
- K-Means++ samples from a distribution $P(x)=\frac{d^2(x,C)}{\phi_X(C)}$ where $d^2(x,C)$ is the distance between the point $x$ and its nearest cluster in clustering $C$, and $\phi_X(C)$ is the cost function of the given clustering $C$, corresponding to the square distance between every point x and its nearest cluster $C$. At the end of the day, $\phi_X(C) = \sum_{x \in X} d^2(x,C)$ and $p_x$ defines a correct probability distribution.
- K-Means|| says it samples "each point $x \in X$ independently with probability $P(x) = l \cdot \frac{d^2(x,C)}{\phi_X(C)}$ where $l$ is said to be an oversampling factor.
I would love to understand what "an oversampling factor" genuinely means. To my mind, it seems weird to sample from something that is not a proper probability since it does not sum to one, but to $l$. Another concept I don't know must be implied here. (e.g. if $P(x_0) = 0.6$ then with $l=2$ what would sampling from $P(x)$ mean?
Ready to explain again if I'm not being clear,
Thanks a lot