
I am currently working with a set of biological data and, at this time, have no dependent variable that I am trying to predict with these variables. However, I am wondering if there is a way to quantify the significance of these variables within the data set (i.e., how much information/variance they contribute to the data set) even though I do not have a variable I am trying to predict. I found methodology using a principal component analysis to determine a sub-group of variables that contribute the most information to the data set (Jolliffe, 1972 - Method B4; King, 1999 - Method B4). However, I have no way to quantify their significance. Any suggestions? Thank you in advance!

1 Answer


Just some random thoughts:

Thought 1

Well, since you mentioned information or variance contributed to the data set $D=\{x_1,\ldots,x_n\}$ with features $F=(y_1,\ldots,y_m)$, you can literally just calculate it. For instance: $$ I_{1,D}(k) = E(D) - E(D_k) $$ where $x_i\in\mathbb{R}^m$, $E(D)$ is the joint entropy estimated from the full dataset, and $E(D_k)$ is the entropy of the dataset with feature $k$ removed. So this is the information entropy of the data with feature $k$ minus the entropy of the data without it.
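As a minimal sketch in Python, assuming a multivariate Gaussian approximation so that the joint entropy reduces to a log-determinant of the covariance matrix (the helper names `gaussian_entropy` and `I1` are mine, not standard):

```python
import numpy as np

def gaussian_entropy(X):
    """Differential entropy of X (n samples x d features) under a Gaussian
    assumption: 0.5 * log((2*pi*e)^d * det(Cov(X)))."""
    X = np.asarray(X, dtype=float)
    d = X.shape[1]
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (d * np.log(2 * np.pi * np.e) + logdet)

def I1(X, k):
    """I_{1,D}(k): entropy of the full data minus entropy with feature k removed."""
    return gaussian_entropy(X) - gaussian_entropy(np.delete(X, k, axis=1))
```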

However, high-dimensional entropies can be hard to compute. Another idea is to use the variance, as you mentioned. Let $C(D)$ be the covariance matrix of $D$. Then, similarly to the above: $$ I_{2,D}(k) = ||C(D)||_p - ||C(D_k)||_p $$ where $p\in\{1,2,\infty,F\}$ (for instance) defines a matrix norm. This measures the difference between the covariance structure of the whole dataset and that of the dataset without the $k$th feature. One could also use the determinant instead of a matrix norm (which may give it greater connections to the entropy!).
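A quick numpy sketch of this (the function name `I2` is just for illustration; `ord='fro'` plays the role of $p=F$):

```python
import numpy as np

def I2(X, k, ord='fro'):
    """I_{2,D}(k): covariance-matrix norm with vs. without feature k.
    ord may be 1, 2, np.inf, or 'fro', matching p in {1, 2, inf, F}."""
    C = np.atleast_2d(np.cov(X, rowvar=False))
    C_k = np.atleast_2d(np.cov(np.delete(X, k, axis=1), rowvar=False))
    return np.linalg.norm(C, ord=ord) - np.linalg.norm(C_k, ord=ord)
```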

Thought 2

Another idea is just to measure the total mutual information between each variable and each of the others, and invert it. The idea would be that the most informative variables are the ones with the least information shared with all the other variables. For instance, $$ I_{3,D}(k) = \left[ \sum_{i\ne k} \mathcal{I}(y_k,y_i) \right]^{-1} $$ where $\mathcal{I}(y_k,y_i)$ is the mutual information.

Or you could take the total information in the variable and remove all the mutual information between it and all the other variables: $$ I_{4,D}(k) = \exp\left(mH(y_k) - \sum_{i\ne k} \mathcal{I}(y_k,y_i)\right) $$ so you take the information entropy of feature $k$, $H(y_k)$, and subtract all the information shared between $y_k$ and the other features. Using $\mathcal{I}(y_k,y_i)=H(y_k)+H(y_i)-H(y_k,y_i)$, it can be rewritten as $$ I_{4,D}(k) = \exp\left( H(y_k) - \sum_{i\ne k} \left[ H(y_i) - H(y_k,y_i) \right] \right) $$ so we take the information inside the feature, minus (1) the information in the other features, plus (2) the joint information held by the feature pairs; equivalently, the exponent is $H(y_k) + \sum_{i\ne k} H(y_k \mid y_i)$. The exp is just to make the measure positive.
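As one possible sketch, you could quantile-bin each feature and estimate the pairwise MI terms with scikit-learn's discrete `mutual_info_score` (the bin count and the helper names here are my own choices; other MI estimators would work too):

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def discretize(x, bins=10):
    """Quantile-bin a continuous feature so discrete MI estimates apply."""
    edges = np.quantile(x, np.linspace(0, 1, bins + 1)[1:-1])
    return np.digitize(x, edges)

def I3(X, k, bins=10):
    """I_{3,D}(k): inverse of the total pairwise MI with the other features."""
    cols = [discretize(X[:, j], bins) for j in range(X.shape[1])]
    total = sum(mutual_info_score(cols[k], cols[i])
                for i in range(X.shape[1]) if i != k)
    return 1.0 / total
```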

These are all measuring the "importance" of a variable by checking how non-redundant it is with the other variables.

If you want a "variance" measure rather than an information-theoretic one, you could use: $$ I_{5,D}(k) = \exp\left( 2\mathbb{V}[y_k] - \sum_{i} \text{Cov}[y_i,y_k] \right) $$
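A sketch of this one as well (note that $\text{Cov}[y_k,y_k]=\mathbb{V}[y_k]$, so the $i=k$ term in the sum absorbs one of the two variance terms; standardizing the features first is probably wise, since the exp can overflow otherwise):

```python
import numpy as np

def I5(X, k):
    """I_{5,D}(k) = exp(2*Var[y_k] - sum_i Cov[y_i, y_k])."""
    C = np.cov(X, rowvar=False)
    return np.exp(2.0 * C[k, k] - C[:, k].sum())
```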

Thought 3

Also, here's something you could consider doing with PCA. Let $v_i$ be the principal components with eigenvalues $\tilde{\lambda}_i$. I'll assume each $v_i$ is normalized, and let $\lambda_j=\tilde{\lambda}_j\left[\sum_{l} \tilde{\lambda}_{l}\right]^{-1}$. Then $v_i(k)$ is the $i$th eigenvector's $k$th component, i.e. the "importance" of feature $k$ in the PCA axis $i$. Notice that $\lambda_j$ is the fraction of variance explained by $v_j$.

Then, define $$ I_{6,D}(k) = \sum_i |v_i(k)| \lambda_i^p $$ for e.g. $p=1$ or $2$. Essentially, each PCA axis has an importance given by $\lambda_i$. So we take the contribution of feature $k$ to axis $i$, and weight it by the importance of that axis.
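A sketch computing this for all features at once from the eigendecomposition of the covariance matrix (the name `I6` is mine):

```python
import numpy as np

def I6(X, p=2):
    """I_{6,D}(k) = sum_i |v_i(k)| * lambda_i^p, for every feature k."""
    C = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(C)   # columns of evecs are unit eigenvectors
    lam = evals / evals.sum()          # normalized eigenvalues = variance ratios
    return np.abs(evecs) @ (lam ** p)  # vector of importances, one entry per feature
```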

user3658307