
I'm trying to understand K-Means||, a scalable version of K-Means++, which itself is an "improved" version of the clustering algorithm K-Means.

Here is a link to the K-Means|| paper: http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf

Without detail: What does "sampling with an oversampling factor" mean? If $P(x)$ defines a probability, and $l$ is an integer, say $l \ge 2$, what does sampling from $l \cdot P(x)$ mean?

With context:

You need to know:

  1. K-Means++ samples from a distribution $P(x)=\frac{d^2(x,C)}{\phi_X(C)}$, where $d^2(x,C)$ is the squared distance between the point $x$ and its nearest center in the current set of centers $C$, and $\phi_X(C)$ is the cost of the clustering $C$, i.e. the sum of squared distances between every point $x \in X$ and its nearest center. At the end of the day, $\phi_X(C) = \sum_{x \in X} d^2(x,C)$, so $P(x)$ defines a proper probability distribution.
  2. K-Means|| says it samples "each point $x \in X$ independently with probability $P(x) = l \cdot \frac{d^2(x,C)}{\phi_X(C)}$", where $l$ is said to be an oversampling factor.

I would love to understand what "an oversampling factor" genuinely means. To my mind, it seems weird to sample from something that is not a proper probability distribution, since it sums to $l$ rather than to one. Another concept I don't know must be implied here. (E.g., if $P(x_0) = 0.6$, then with $l = 2$ we get $l \cdot P(x_0) = 1.2$; what would sampling with that "probability" mean?)
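For concreteness, here is how I would write the K-Means++ step from point 1 (a minimal sketch; `X`, `centers`, and `rng` are names I made up, with `X` an $(n, d)$ NumPy array of points and `centers` the current centers). What I cannot picture is the analogous code for sampling from $l \cdot P(x)$:

```python
import numpy as np

def kmeanspp_step(X, centers, rng):
    """Draw ONE new center from P(x) = d^2(x, C) / phi_X(C)."""
    # Squared distance from each point to its nearest current center.
    d2 = np.min(((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2), axis=1)
    p = d2 / d2.sum()            # a proper distribution: sums to 1
    idx = rng.choice(len(X), p=p)
    return X[idx]
```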

I'm happy to explain further if I'm not being clear.

Thanks a lot

2 Answers


I think that "oversampling factor" is easier to understand if you read Section 3.4 of the paper. Consider $a_1$, the first point of one of the optimal clusters. Its probability, $p_1$, is "oversampled" by a factor $l$. The consequence is that the probability of sampling $a_2$ (denoted $p_2$) is lower than under a uniform distribution (since the probabilities sum to 1). The same holds for $a_3, a_4, \ldots$ incrementally. The motivation of this algorithm is that, once you have picked $a_1$, you want to lower the probability of picking the next centroid $a_2$ from the same cluster as $a_1$.
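As a sketch of what this means operationally (my own code and naming, not the paper's reference implementation): in one pass, every point is kept or dropped by an independent coin flip, so roughly $l$ points are drawn instead of exactly one.

```python
import numpy as np

def kmeans_parallel_round(X, centers, l, rng):
    """One K-Means|| oversampling pass: keep each point independently
    with probability l * d^2(x, C) / phi_X(C)."""
    # Squared distance of each point to its nearest current center.
    d2 = np.min(((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2), axis=1)
    p = l * d2 / d2.sum()        # sums to l, not 1 -- not a distribution
    # Independent Bernoulli trial per point; p >= 1 means "always keep".
    keep = rng.random(len(X)) < p
    return X[keep]               # about l points in expectation
```

So nothing is "sampled from a distribution" here; $l \cdot P(x)$ is just the success probability of each point's own independent coin flip.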

Cheers, Kimchitsigai


I have recently been learning about the K-Means|| implementation, and I think it's worth noting a clarification made in the paper Good Seedings for k-Means, where the probability of sampling a data point $x \in X$ is capped at $\min\left(1, \frac{\ell \, d^2(x, C)}{\phi_X(C)}\right)$.

By the way, just to verify something I have been thinking about: is the expected number of points sampled from $X$ equal to $\ell$, and if so, why is it equal to the sum of the probabilities? I believe the reason is that the count of sampled points is a sum of independent indicator (Bernoulli) variables, one per point, so by linearity of expectation its mean is the sum of the individual probabilities. Writing $C'$ for the set of points sampled in one round:

$$ E\left[\,|C'|\,\right] = \sum_{i=1}^n P(x_i) = \sum_{i=1}^n \ell \, \frac{d^2(x_i, C)}{\sum_{j=1}^n d^2(x_j, C)} = \frac{\ell}{\phi_X(C)} \sum_{i=1}^n d^2(x_i, C) = \ell \, \frac{\phi_X(C)}{\phi_X(C)} = \ell $$

(This ignores the $\min(1, \cdot)$ cap, which can only lower the expectation.)
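A quick simulation on synthetic data (everything below is my own sketch) agrees: the average number of points kept per round stays close to $\ell$ whenever the cap rarely binds.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 2))           # synthetic data (assumption)
centers = X[rng.choice(len(X), size=3)]    # some current centers
l = 8                                      # oversampling factor

d2 = np.min(((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2), axis=1)
p = np.minimum(1.0, l * d2 / d2.sum())     # capped sampling probabilities

# Average count over many independent rounds; should be close to l.
counts = [(rng.random(len(X)) < p).sum() for _ in range(1000)]
print(np.mean(counts), l)
```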

Regards
