Suppose that I have two sparse matrices $H_{1}$ and $H_{2}$ which contain only $0$ and $1$ values, so for example can be of the form

$$H_{1} = \begin{bmatrix} 1 & 0 & 1\\ 0 & 0 &1 \\ 1& 1 &1 \\ 0& 0 & 1\\ 1& 0 &0 \\ 0& 1 & 0 \end{bmatrix}$$

$$H_{2} = \begin{bmatrix} 0 & 0 & 1\\ 1 & 1 &1 \\ 1& 0 &0 \\ 0& 0 & 0\\ 0& 1 &0 \\ 0& 1 & 0 \end{bmatrix}$$

My goal is to measure the distance between those two matrices as accurately as possible, using a distance measure (probabilistic or any other kind).

I was wondering if there exists a distance (it does not need to satisfy all the properties of a metric) that is suitable for comparing sparse matrices whose elements are only $0$ and $1$.

There are a dozen or so distances between matrices, as can be seen here https://www.itl.nist.gov/div898/software/dataplot/refman2/auxillar/matrdist.htm and here: Distance/Similarity between two matrices. I also thought that we might be able to construct a distance using the correlations between corresponding columns of the two matrices, i.e.

$$dist_{cor} = \frac{1}{3}\sum_{i=1}^{3}cor(H_{1}[,i],H_{2}[,i])$$
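As a rough illustration, the quantity above could be computed as follows in Python with NumPy. This is only a sketch: the function name `dist_cor` is taken from the formula, and the use of the Pearson correlation via `np.corrcoef` is an assumption about what $cor$ means here.

```python
import numpy as np

# The two example matrices from above.
H1 = np.array([[1, 0, 1], [0, 0, 1], [1, 1, 1],
               [0, 0, 1], [1, 0, 0], [0, 1, 0]])
H2 = np.array([[0, 0, 1], [1, 1, 1], [1, 0, 0],
               [0, 0, 0], [0, 1, 0], [0, 1, 0]])


def dist_cor(a, b):
    """Average Pearson correlation between corresponding columns.

    Assumption: 'cor' is the Pearson correlation. A constant column
    (all 0s or all 1s) has zero variance and yields NaN.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("matrices must have the same shape")
    cors = [np.corrcoef(a[:, i], b[:, i])[0, 1] for i in range(a.shape[1])]
    return float(np.mean(cors))


print(dist_cor(H1, H2))  # a value in [-1, 1]
print(dist_cor(H1, H1))  # identical matrices give the maximum, not 0
```

Note that this is a similarity rather than a distance: it is largest for identical matrices, so something like $1 - dist_{cor}$ would be needed to turn it into a dissimilarity.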

So, my question is whether there exist measures (distance measures) for calculating such distances, literature that I can look into, or ideas for constructing such a measure.

Jonathan1234
  • (1) I'm afraid your question is kinda unclear. As you noted, there are plenty of different metrics on the space of matrices. Without knowing what you want to use it for, it is impossible to know which one will be useful to you. (2) Your "correlation distance" is not really a metric at all. At the very least, a metric should satisfy $\text{dist}(H,H)=0$, i.e. the distance of a matrix to itself should be zero. Your metric does not satisfy that property, but it can certainly be modified to do so. – Simon Jul 09 '21 at 11:16
  • @Simon my goal is to use it in an Approximate Bayesian Computation framework, where, say, $H_{1}$ is my true data set and $H_{2}$ is a simulated data set, and in both data sets each entry comes from a Bernoulli simulation. I want to quantify how close the true data set $H_{1}$ and the simulated data set $H_{2}$ are. It does not necessarily have to be a metric; I just seek a way to quantify their distance/dissimilarity. – Jonathan1234 Jul 09 '21 at 18:24

1 Answer

As @Simon suggests, the "accuracy" of a metric depends on what you're interested in.

If you are interested in simply measuring the distance between sparse binary matrices, the Hamming distance (the number of entries at which the two matrices differ) would be a natural choice.
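A minimal sketch of the Hamming distance for the two example matrices from the question, in Python with NumPy. The function name `hamming_distance` is just an illustrative choice, and dividing by the number of entries is one possible normalization, not the only one.

```python
import numpy as np

# The two example matrices from the question.
H1 = np.array([[1, 0, 1], [0, 0, 1], [1, 1, 1],
               [0, 0, 1], [1, 0, 0], [0, 1, 0]])
H2 = np.array([[0, 0, 1], [1, 1, 1], [1, 0, 0],
               [0, 0, 0], [0, 1, 0], [0, 1, 0]])


def hamming_distance(a, b):
    """Count the entries at which two equally shaped matrices differ."""
    a = np.asarray(a)
    b = np.asarray(b)
    if a.shape != b.shape:
        raise ValueError("matrices must have the same shape")
    return int(np.count_nonzero(a != b))


d = hamming_distance(H1, H2)
print(d)            # number of differing entries
print(d / H1.size)  # fraction of differing entries, in [0, 1]
```

Unlike the correlation-based quantity in the question, this is zero exactly when the two matrices are identical.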

However, from your comment it seems like you are looking for a distance measure between two distributions: a true data distribution (Bernoulli) and a generated data distribution.

If your true data distribution is a Bernoulli distribution with known mean $\mu_{ij}$, where $i,j$ index the entries of $H$ (assuming they are sampled i.i.d.), then you could obtain the empirical estimates

$$\hat{\mu}_{ij} = \frac{1}{N}\sum_{n=1}^{N} H_{ij}^{(n)}$$

from $N$ simulated data sets $H^{(1)}, \dots, H^{(N)}$ and estimate the cross entropy per entry

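$$S_{ij}(p,q) = -\sum_{h \in \{0,1\}} p(H_{ij}=h) \log q(H_{ij}=h)$$

where $p(H_{ij}=h) = \mu_{ij}^{h}(1 - \mu_{ij})^{1-h}$ and $q(H_{ij}=h) = \hat{\mu}_{ij}^{h}(1 - \hat{\mu}_{ij})^{1-h}$ are the Bernoulli probability mass functions, so that $S_{ij} = -\mu_{ij}\log\hat{\mu}_{ij} - (1-\mu_{ij})\log(1-\hat{\mu}_{ij})$. Averaging $S_{ij}$ over the entries should then be a good indicator of how close your simulated distribution is to the true one.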
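The procedure could be sketched roughly as follows in Python with NumPy. Everything specific here is an assumption made for illustration: the "true" means `mu` (all set to 0.3), the number of simulations `N`, the helper name `bernoulli_cross_entropy`, and the clipping constant `eps`, which only guards against $\log 0$ when an estimated mean is exactly 0 or 1. The cross entropy is taken in its Bernoulli form $-[\mu \log \hat{\mu} + (1-\mu)\log(1-\hat{\mu})]$.

```python
import numpy as np


def bernoulli_cross_entropy(mu, mu_hat, eps=1e-12):
    """Entry-wise cross entropy between Bernoulli(mu) and Bernoulli(mu_hat)."""
    mu = np.asarray(mu, dtype=float)
    # Clip the estimates so that log(0) cannot occur.
    mu_hat = np.clip(np.asarray(mu_hat, dtype=float), eps, 1.0 - eps)
    return -(mu * np.log(mu_hat) + (1.0 - mu) * np.log(1.0 - mu_hat))


rng = np.random.default_rng(0)

# Hypothetical "true" Bernoulli means, one per matrix entry (6 x 3).
mu = np.full((6, 3), 0.3)

# Simulate N binary matrices and estimate the mean of every entry.
N = 1000
sims = rng.binomial(1, mu, size=(N, 6, 3))
mu_hat = sims.mean(axis=0)

# Average the per-entry cross entropy into a single score.
score = bernoulli_cross_entropy(mu, mu_hat).mean()

# Lower bound: the entropy of the true distribution (reached when mu_hat == mu).
entropy = bernoulli_cross_entropy(mu, mu).mean()
print(score, entropy)
```

Because the cross entropy is bounded below by the entropy of the true distribution, the gap `score - entropy` (the average KL divergence) may be easier to interpret than the raw score: it is zero only when the estimated means match the true ones.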

Rooler