I'm working on a document clustering application and decided to use Normalized Mutual Information as one of the measures of effectiveness. But I don't really understand how to implement it in that situation. In http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html the formula is transformed into (185), while in this publication (www.shi-zhong.com/papers/comptext2.pdf, page 8, formula 17) it looks slightly different: n(h,l) is not divided by the total number of documents N. So, which formula is correct? I would be very grateful for a reasonably simple explanation.
have you tried http://stats.stackexchange.com/ ? – cactus314 Jul 07 '13 at 13:06
yes, but no answer so far, so I thought to post it here as well. – user1315305 Jul 07 '13 at 13:12
1 Answer
Each clustering $\mathcal C=\{C_1,\dots,C_k\}$ of the data defines a probability distribution $P_{\mathcal C}$,
$$P_{\mathcal C}(i)=\frac{n_i}{N},$$
where $n_i$ is the number of points in the $i$-th cluster $C_i$ and $N$ is the total number of points in the data cloud. Different clustering algorithms can, of course, produce different numbers of clusters.
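For instance (a small made-up example, just to fix ideas): a clustering of $N=5$ points into two clusters of sizes $3$ and $2$ induces the distribution
$$P_{\mathcal C}=\left(\tfrac{3}{5},\tfrac{2}{5}\right).$$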
For any distributions $P_{\mathcal C_1}=(p_1,\cdots,p_n)$ and $P_{\mathcal C_2}=(q_1,\cdots,q_m)$ the mutual information $I(p,q)$ is just
$$I(p,q)=\sum_{i,j}R(i,j)\log\frac{R(i,j)}{P_{\mathcal C_1}(i)P_{\mathcal C_2}(j)},$$
denoting by $R(i,j)$ the joint probability distribution. Using the definitions of $P_{\mathcal C_1}$, $P_{\mathcal C_2}$ and $R$, i.e.
$$P_{\mathcal C_1}(i)=\frac{n_i}{N},\qquad P_{\mathcal C_2}(j)=\frac{m_j}{N},\qquad R(i,j)=\frac{n_{i,j}}{N},$$
where $n_{i,j}$ is the number of points that belong to the $i$-th cluster of $\mathcal C_1$ and at the same time to the $j$-th cluster of $\mathcal C_2$,
we arrive at
$$I(P_{\mathcal C_1},P_{\mathcal C_2})=\sum_{i,j}\frac{n_{i,j}}{N}\log\frac{\frac{n_{i,j}}{N}}{\frac{n_i}{N}\frac{m_j}{N}}=\sum_{i,j}\frac{n_{i,j}}{N} \log\frac{N n_{i,j}}{n_i m_j},$$
as in http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html, formula (185).
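As a concrete illustration, here is a minimal sketch of formula (185) in code, assuming the two clusterings are given as equal-length label sequences; the function and variable names are my own, not taken from either paper:

```python
from collections import Counter
from math import log

def mutual_information(labels1, labels2):
    """Mutual information of two flat clusterings, given as label sequences."""
    N = len(labels1)                        # total number of points
    n = Counter(labels1)                    # cluster sizes n_i of the first clustering
    m = Counter(labels2)                    # cluster sizes m_j of the second clustering
    joint = Counter(zip(labels1, labels2))  # joint counts n_{i,j}
    mi = 0.0
    for (i, j), n_ij in joint.items():      # only nonzero n_{i,j} contribute to the sum
        mi += (n_ij / N) * log(N * n_ij / (n[i] * m[j]))
    return mi
```

For example, `mutual_information([0, 0, 1, 1], ['a', 'a', 'b', 'b'])` returns $\log 2\approx 0.69$ (natural logarithm), since the two clusterings agree up to relabelling.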
The formula for the mutual information in http://machinelearning.wustl.edu/mlpapers/paper_files/StrehlG02.pdf, which is cited in http://www.shi-zhong.com/papers/comptext2.pdf, does not contain the factor $\frac{1}{N}$, as you correctly remark.
I would use the above formulation for the mutual information, as it implements the correct probabilistic view of the univariate and joint distributions.
Remark Note that in http://machinelearning.wustl.edu/mlpapers/paper_files/StrehlG02.pdf the assertion on page 589 that "I(X,Y) is a metric" is wrong. The mutual information $I(X,Y)$ is a Kullback–Leibler divergence (of the joint distribution from the product of the marginals) and is not a metric (or distance). The variation of information, on the other hand, is a metric. Please look at http://en.wikipedia.org/wiki/Variation_of_information for more details.
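For completeness (a standard identity, not part of the original answer): the variation of information can be written in terms of the entropies and the mutual information as
$$\operatorname{VI}(X,Y)=H(X)+H(Y)-2\,I(X,Y),$$
and this quantity, unlike $I(X,Y)$ itself, satisfies the triangle inequality.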
"I would use the above formulation for the mutual information": but which? Both of the formulae are "above". – Sibbs Gambling Dec 23 '14 at 01:40
Formula 185 in http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html – Avitus Dec 23 '14 at 11:29