
I'm working on a document clustering application and decided to use Normalized Mutual Information as one of the measures of effectiveness. But I don't really understand how to implement it in this situation. In http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html the formula is transformed to (185), while in this publication (www.shi-zhong.com/papers/comptext2.pdf, page 8, formula 17) it looks slightly different: n(h,l) is not divided by the total number of documents N. So, which formula is correct? I would be very grateful for a reasonably simple explanation.

Mahdi Khosravi

1 Answer


Each clustering $\mathcal C=\{C_1,\dots,C_k\}$ (the output of a clustering algorithm) defines a probability distribution $P_{\mathcal C}$,

$$P_{\mathcal C}(i)=\frac{n_i}{N},$$

where $n_i$ is the number of points in the $i$-th cluster $C_i$ and $N$ is the total number of points in the data set. Different clustering algorithms can, of course, produce different numbers of clusters.
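A minimal sketch in Python of this empirical distribution, assuming a clustering is given as a list of cluster labels, one per data point (the function name is just illustrative):

```python
from collections import Counter

def cluster_distribution(labels):
    """Empirical distribution P_C(i) = n_i / N over the cluster labels."""
    N = len(labels)
    return {cluster: count / N for cluster, count in Counter(labels).items()}
```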

For two such distributions $P_{\mathcal C_1}=(p_1,\dots,p_k)$ and $P_{\mathcal C_2}=(q_1,\dots,q_l)$, coming from clusterings $\mathcal C_1$ and $\mathcal C_2$ with $k$ and $l$ clusters respectively, the mutual information $I(p,q)$ is just

$$I(p,q)=\sum_{i,j}R(i,j)\log\frac{R(i,j)}{P_{\mathcal C_1}(i)P_{\mathcal C_2}(j)},$$

where $R(i,j)$ denotes the joint probability distribution. Writing $C_i$ for the clusters of $\mathcal C_1$ (with sizes $n_i$) and $D_j$ for the clusters of $\mathcal C_2$ (with sizes $m_j$), and using the definitions

$$P_{\mathcal C_1}(i)=\frac{n_i}{N}, \qquad P_{\mathcal C_2}(j)=\frac{m_j}{N}, \qquad R(i,j)=\frac{n_{i,j}}{N}:=\frac{|C_i\cap D_j|}{N},$$

we arrive at

$$I(P_{\mathcal C_1},P_{\mathcal C_2})=\sum_{i,j}\frac{n_{i,j}}{N}\log\frac{\frac{n_{i,j}}{N}}{\frac{n_i}{N}\frac{m_j}{N}}=\sum_{i,j}\frac{n_{i,j}}{N} \log\frac{N n_{i,j}}{n_i m_j},$$

as in http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html, formula (185).
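As a sanity check, here is a small Python sketch of this last expression, assuming both clusterings are given as equal-length lists of labels (the name mutual_information is just illustrative):

```python
import math
from collections import Counter

def mutual_information(labels1, labels2):
    """I = sum_{i,j} (n_ij / N) * log(N * n_ij / (n_i * m_j)), in nats."""
    N = len(labels1)
    n = Counter(labels1)                    # n_i: cluster sizes in the first clustering
    m = Counter(labels2)                    # m_j: cluster sizes in the second clustering
    n_ij = Counter(zip(labels1, labels2))   # n_{i,j} = |C_i ∩ D_j|
    return sum((c / N) * math.log(N * c / (n[i] * m[j]))
               for (i, j), c in n_ij.items())
```

Only pairs with $n_{i,j}>0$ contribute to the sum, consistent with the usual convention $0\log 0=0$.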

The corresponding formula for the mutual information in http://machinelearning.wustl.edu/mlpapers/paper_files/StrehlG02.pdf, cited in http://www.shi-zhong.com/papers/comptext2.pdf, omits the factor $\frac{1}{N}$, as you correctly remark.

I would use the above formulation for the mutual information, as it implements the correct probabilistic view of the marginal and joint distributions.
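If you want to double-check the $\frac{1}{N}$ convention numerically, scikit-learn's mutual_info_score uses this probabilistic formulation (in nats), so it should agree with the sketch above; a quick hypothetical check:

```python
from sklearn.metrics import mutual_info_score

labels1 = [0, 0, 1, 1, 2, 2]
labels2 = ['a', 'a', 'a', 'b', 'b', 'b']
print(mutual_information(labels1, labels2))  # formula (185), with the 1/N factor
print(mutual_info_score(labels1, labels2))   # scikit-learn, same convention
```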

Remark Note that in http://machinelearning.wustl.edu/mlpapers/paper_files/StrehlG02.pdf the assertion on page 589 that "I(X,Y) is a metric" is wrong. The mutual information $I(X,Y)$ is equivalent to a Kullback–Leibler divergence and is not a metric (or distance). The variation of information, in contrast, is a metric. See http://en.wikipedia.org/wiki/Variation_of_information for more details.
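To make that last point concrete, a short sketch of the variation of information, $VI(X,Y)=H(X)+H(Y)-2I(X,Y)$, reusing the mutual_information sketch from above (again, purely illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H of the empirical cluster distribution, in nats."""
    N = len(labels)
    return -sum((c / N) * math.log(c / N) for c in Counter(labels).values())

def variation_of_information(labels1, labels2):
    """VI(X, Y) = H(X) + H(Y) - 2 I(X, Y); a genuine metric on clusterings."""
    return entropy(labels1) + entropy(labels2) - 2 * mutual_information(labels1, labels2)
```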

Avitus