3

Given two bit vectors A and B of length $N$ the Jaccard similarity between them is $\|A \land B\|/\|A \lor B\|$.

Select $M$ bit vectors from the set of $2^N-1$ possible values, excluding all-zeroes. What is the largest $M$ for a given $N$ such that the $M(M-1)/2$ pairwise Jaccard similarities are all distinct? What is a concrete example of a subset with that property?

The number of distinct Jaccard similarity values for $N$ is the number of fractions in Farey series of order $N$, which sets the upper bound.

Experimental results

A brute-force search starting with $N=2$ finds $M = 2, 3, 4, 5, 5, 6, 7, \ldots$. I found no likely matches in OEIS. An example of 7 bit vectors for $N=8$ is:

00000001
00001110
00010111
00110111
01110111
01111111
11111111

I have lower bounds for a few larger cases. For $N=10$, $M \ge 8$ with decimal values {1, 31, 63, 99, 191, 447, 511, 1022}. For $N=64$, $M \ge 27$ with {75616855644387687, 2459624554201425171, 1521930251796536, 17293822019346890751, 18157352612976647935, 2404968724985570464, 9007705569197305865, 16717132001596030797, 16134634947158732799, 1291115287526644499, 3849546270639050489, 13168489964796115711, 144115188075864354, 16131190038204382975, 1316340271877180188, 7489449894117883466, 15037035512890333456, 8638537081332189361, 6036230857943676155, 13051322593571675920, 72092782704984705, 17867483455257025887, 17985060916847511162, 17221764975064768383, 1073193546549267, 10470171935520675435, 11344355769778337692}. The last was found using a random-based first-fit search.

Driving Problem

In chemistry, similar molecules tend to have similar properties. There are many ways to characterize "similar". One way is to use a bit vector (typically with $N=1024$ or $N=2048$) called a "fingerprint", where a given bit it set due to the presence of one or more chemical features in a molecule. The Jaccard similarity between two corresponding fingerprints - called the "Tanimoto" in my field - is used as a proxy for chemical similarity. It is easy to understand, simple to use, and gives reasonably good predictive ability.

There are many ways to analyze a set of fingerprints, for example, to cluster them, or find a dissimilar subset. The large number of ties in the NxN comparison means that the final result is not exactly determined. Different implementations of the same algorithm will likely give different results. Even multiple runs of the same program might give different result if, for example, an RNG is used to break ties and different seeds are used.

This indeterminacy makes it more difficult to validate basic correctness.

On the other hand, if the data set has no ties then the results should be well-defined, which lead me on this path.

The $M=27$ case above is good enough for my current purposes. I still have an intellectual itch to find out the exact value for $M$ and to find larger data sets, especially for $N=1024$.

  • Interesting. You may be interested by information contained in this question of mine here – Jean Marie Jan 28 '22 at 10:25
  • 1
    Thanks for the pointer. I looked at it. My last matrix algebra course was 32 years ago - at this point I'm only able to nod sagely when I read "positive semi-definite." Meaning, I didn't know how it could apply. :S – Andrew Dalke Jan 28 '22 at 18:56
  • Reminder: "positive semi-definite" is a property of a (symmetric) matrix which means "all its eigenvalues are real and $\ge 0$". – Jean Marie Jan 28 '22 at 19:12
  • Is there a connection with this "safe opening with multiple locks/keys" puzzle https://math.stackexchange.com/q/581461 ? or with this question: https://math.stackexchange.com/q/190767 ? – Jean Marie Jan 30 '22 at 08:44
  • 1
    If there is, I don't see it. At the very least, the growth curves in those two links don't match what I see experimentally, and I don't see how either have an upper-bound related to the Farey sequence. – Andrew Dalke Jan 30 '22 at 21:32

1 Answers1

1

Thank you for the interesting problem! I would like to show some analyses I have done. First, I relax the problem to find $M$ binary vectors that have distinct $(\|A \land B\|, \|A \lor B\|)$ pairs. Let us call this pair "Jaccard pair". Clearly, the solution to this relaxed version will give us an upper bound, which is expected to be a tighter one than the "# fractions in Farey series" mentioned by OP.

Let us fix one vector $A$ first, then we have equivalence classes w.r.t the Jaccard pair we just defined. Specifically, we have the equivalence class $C_{ij; A} = \{B: (\|A \land B\|, \|A \lor B\|) = (i, j)\}$ for all possible $(i, j)$ pairs. We have the size of each equivalence class, $|C_{ij; A}| = \binom{\|A\|}{i} \binom{N - \|A\|}{j}$. In each equivalence class, we may choose at most one vector.

If we go back to the original problem, then I think a very important first step is to notice the case $\|A \land B\| = 0$, since the Jaccard similarity is always $0$ when $\|A \land B\| = 0$. This implies that among all the $M$ vectors, at most one pair can be non-intersecting, which may significantly reduce the search space.

Vezen BU
  • 2,320
  • You're welcome! I had never considered the equivalence class approach - nice! My more recent attempt to reduce search space was to keep track of the symmetry classes of each bit position. Eg, given 0011, 0010 the symmetry classes are AABC so the only new terms to consider are in {00,01,11}x{0,1}x{0,1} minus the all-zero term. It never quite worked. I've also considered ordering the rows by $|A|$, from largest to smallest. – Andrew Dalke Jan 28 '22 at 18:52