
I have a corpus of over a million documents.

For a given document I want to find similar documents using cosine similarity, as in the vector space model:

$d_1 \cdot d_2 / ( ||d_1|| ||d_2|| )$

All tf values have been normalized using augmented frequency, to prevent a bias towards longer documents, as in this tf-idf:

$tf(t,d)=0.5+0.5\frac{f(t,d)}{\mathrm{max}\{f(t,d): t\in d\}}$
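To make the normalization concrete, here is a minimal sketch in Python (the function name and example counts are mine, purely illustrative):

```python
from collections import Counter

def augmented_tf(term_counts):
    """Augmented frequency: tf(t,d) = 0.5 + 0.5 * f(t,d) / max f(t',d)."""
    max_f = max(term_counts.values())
    return {t: 0.5 + 0.5 * f / max_f for t, f in term_counts.items()}

# Made-up raw counts for a short document
counts = Counter({"apple": 10, "banana": 4, "cherry": 1})
print(augmented_tf(counts))
# apple -> 1.0, banana -> 0.7, cherry -> 0.55 (up to float rounding)
```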

I have pre-calculated all $||d||$, so the denominator values are available.
So for a given $d_1$ I need to score over 1 million $d_2$.
I have a cosine threshold of 0.6 for similarity.

I can observe that for a given $||d_1||$ there is a fairly narrow range of $||d_2||$ with cosine $\ge$ 0.6.
For example, in one search for similar documents with cosine $\ge$ 0.6 and $||d_1|| = 7.7631$, the matching $||d_2||$ ranged from 7.0867 to 8.8339,
while outside the cosine 0.6 threshold $||d_2||$ ranged from 0.7223 to 89.3395.
This was with standard tf document normalization.
The search is looking at a LOT of $d_2$ that don't have a chance of being a cosine 0.6 match.

Finally the question:
For a given $||d_1||$ and cosine $\ge$ 0.6, how can I determine the range of $||d_2||$ that have a chance?
Which $||d_2||$ can I safely eliminate?

I also know the number of terms in $d_1$ and $d_2$, in case a term count range helps.

Via experimentation,
$||d_2|| > 0.8\,||d_1||$ and $||d_2|| < ||d_1|| / 0.8$ seems to be safe, but hopefully there is a range that is proven to be safe.

I created some test cases with a few very unique terms, some not so unique, and some common. Sure enough, you can take the most unique term and increase its frequency in the compare document. The numerator (dot product) will go up, and so will $||compare||$, and you still get a cosine very close to 1.
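For illustration, a tiny numeric sketch of that effect with made-up raw-tf weights: boosting the dominant high-weight term inflates both the dot product and $||compare||$, yet the cosine stays close to 1.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

d1      = [10.0, 1.0, 0.5]   # one very unique (heavily weighted) term, two ordinary ones
compare = [100.0, 1.0, 0.5]  # same document with the unique term's frequency boosted 10x

print(cosine(d1, compare))                      # ~0.995 despite very different norms
print(math.sqrt(sum(x * x for x in d1)))        # ||d1||      ~10.06
print(math.sqrt(sum(x * x for x in compare)))   # ||compare|| ~100.01
```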

Kind of related, and NOT the question:
I am also using the tf-idf to group documents into groups. The customer base I am selling into is used to near-dup groups. There I take a related approach: I look at the smallest term count and evaluate it against term counts up to 3x. So a term count of 10 looks at 10 through 30 (4-9 already had their shot at 10). Here I can afford to miss a document and have it picked up in another group. I am 10% done and the biggest ratio so far is 1.8.
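As a rough sketch of that grouping pass (the structure and names are my guess at the described approach, not actual code): sort by term count and only pair each document with documents whose term count is at most 3x its own.

```python
def candidate_pairs(docs):
    """docs: list of (doc_id, term_count) tuples.
    Yield pairs whose larger term count is at most 3x the smaller one."""
    docs = sorted(docs, key=lambda d: d[1])        # ascending by term count
    for i, (id_a, n_a) in enumerate(docs):
        for id_b, n_b in docs[i + 1:]:
            if n_b > 3 * n_a:                      # e.g. term count 10 looks at 10..30
                break
            yield id_a, id_b
```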

Please identify the flaws in this analysis.
As pointed out by AN6U5, there is a flaw in this analysis:
it is no longer a cosine if the document is normalized on the weighted tf.
And as pointed out by Matthew, I also cannot conclude $d_1 \cdot d_2 \le d_1 \cdot d_1$.
I am still hoping for something to give me a hard bound, but the people who seem to know this stuff are telling me no.
I don't want to change the question, so just ignore this.
I will do some analysis and maybe post a separate question on document normalization.
For the purpose of this question, assume the document is normalized on raw tf.
Sorry, but I am just not good with whatever markup is used to make the equations.
So in my notation:
||d1|| = sqrt(sum(w1 x w1))
d1 dot d2 = sum(w1 x w2)
Assume d1 is the shorter document.
The very best d1 dot d2 that can be achieved is d1 dot d1.
If d1 is marry 100, paul 20
and d2 is marry 100, paul 20, peter 1,
then normalized:
d1 is marry 1, paul 1/5
d2 is marry 1, paul 1/5, peter 1/100
Clearly marry and paul have the same idf in both documents.
The best possible d1 dot d2 is d1 dot d1.
The maximum possible match to d1 is d1 itself.
cos = d1 dot d1 / (||d1|| ||d2||)
Square both sides:
cos x cos = (d1 dot d1) x (d1 dot d1) / ( (d1 dot d1) x (d2 dot d2) )
cos x cos = (d1 dot d1) / (d2 dot d2)
Take the square root of both sides:
cos = ||d1|| / ||d2||
Is ||d2|| not bounded by the cos?
If I just use ||d2|| >= cos x ||d1|| and ||d2|| <= ||d1|| / cos, I get the computational speed I need.
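In code, the pre-filter I am hoping to justify looks like this (a minimal sketch; the data structures are illustrative, and per the answers below the window is a heuristic rather than a proven bound for augmented tf):

```python
import math

def norm(weights):
    return math.sqrt(sum(w * w for w in weights.values()))

def cosine(w1, w2, n1, n2):
    dot = sum(w * w2[t] for t, w in w1.items() if t in w2)
    return dot / (n1 * n2)

def similar_docs(d1_id, weights, norms, threshold=0.6):
    """weights: {doc_id: {term: tf-idf weight}}, norms: {doc_id: precomputed ||d||}."""
    n1 = norms[d1_id]
    lo, hi = threshold * n1, n1 / threshold        # the hoped-for ||d2|| window
    for d2_id, n2 in norms.items():
        if d2_id == d1_id or not (lo <= n2 <= hi):
            continue                               # pruned without computing a dot product
        if cosine(weights[d1_id], weights[d2_id], n1, n2) >= threshold:
            yield d2_id
```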

paparazzo

3 Answers


Unfortunately, the math simplifies to show that you can't rigorously justify restricting the cosine similarity comparison of the vectors based on their lengths.

The key point is that the cosine similarity metric normalizes based on length, so that only the unit vectors are considered. I know this isn't necessarily the answer that you wanted, but the math clearly shows that the cosine similarity metric is agnostic to vector length.

Let's look at the math in more detail:

You are applying a cosine similarity metric and requiring that this metric be at least 0.6:

$$similarity=\cos{(\theta)}=\frac{\mathbf{A}\cdot\mathbf{B}}{||A|| ||B||}\geq0.6$$.

But the scalar lengths on the bottom can be distributed into the dot product above (distributive property):

$$\frac{\mathbf{A}\cdot\mathbf{B}}{||A|| ||B||} = \frac{\mathbf{A}}{||A||}\cdot\frac{\mathbf{B}}{||B||}=\hat{\mathbf{A}}\cdot\hat{\mathbf{B}}$$.

Now $\hat{\mathbf{A}}$ and $\hat{\mathbf{B}}$ are vectors that point in the same direction as $\mathbf{A}$ and $\mathbf{B}$ but have been normalized to length one. So the definition of the cosine similarity metric is to take the original vectors, normalize them to length one, and then measure the dot product of the unit vectors.

Therefore:

$$similarity=\cos{(\theta)}=\frac{\mathbf{d1}\cdot\mathbf{d2}}{||d1|| ||d2||}=\hat{\mathbf{d1}}\cdot\hat{\mathbf{d2}}\geq0.6$$

depends only on the orientation of the vectors and not on their magnitude (i.e. length).
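A two-line numerical check of that scale invariance (the vectors are arbitrary): scaling one vector by any positive constant leaves the cosine unchanged.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.5, 1.0])

cos = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cos(a, b), cos(a, 1000 * b))   # identical: the metric ignores length
```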

Reconciling this with what you are doing:

Despite what the results of the linear algebra show, you might still be seeing a statistically significant result. Practically speaking, you may be finding that length restrictions are valid for your data. For instance, you might be finding that tweets never share a cosine similarity $\geq0.6$ when compared with Tolstoy's "War and Peace". If your statistics look good for using $||d2|| > .8 ||d1||$ and $||d2|| < ||d1|| / .8$, then I suggest you go with it, as this type of canopy restriction is very useful in saving computing time.

You can perhaps reconcile what you have been doing with distance metrics by also considering the Euclidean distance. Whereas the cosine similarity only returns a value between -1 and 1 based on the angle between the two vectors, the Euclidean distance will return values that depend on the lengths of the two vectors. In some sense, you are combining aspects of the Euclidean distance with cosine similarity.

It makes fairly good sense to require that the relative lengths be within 25% of one another: combining an aspect of Euclidean distance in this way creates group-by canopies, which cuts computation time, and the length-agnostic cosine similarity can then be used as the final determinant.

Note that $1/0.8 = 1.25$, so $||d_2|| \geq 0.8\,||d_1||$ is a tighter restriction than $||d_2|| \leq ||d_1||/0.8$. I suggest using $||d_2|| \geq 0.75\,||d_1||$ and $||d_2|| \leq 1.25\,||d_1||$ as this is symmetric.

Hope this helps!

AN6U5

First, let's try to get some intuition why this would work. $||d_i||$ seems to serve as a word rarity measure, which seems plausible as something to filter on. If documents use dissimilar numbers of rare words, it'll be difficult for them to line up on the cosine similarity measure. But it seems unlikely to me that this cutoff will only depend on $||d_i||$, rather than also on the structure in the tf or idf weights that go into $||d_i||$.

To work through some algebra, let me introduce a few more terms (and rename some to shorter ones):

Let $d_1$ be a vector of tf weights $[t_1, t_2, ...]$ element-wise multiplied by a vector of idf weights $[w_1, w_2, ...]$ to get the final weights $[d_1, d_2, ...]$. We know that $0.5\le t_i\le 1$ and $0\le w_i\le 6$ (because of the corpus size and assuming we're using base 10, it doesn't matter if we're not). Let $D_1=||d_1||$.

Knowing $d_1$, we want to construct a delta vector $x$, such that $d_1+x$ has the minimal (or maximal) $X$ subject to the constraints that:

$X=\sqrt{\sum_i w_i^2 (t_i+x_i)^2}$

$0.6D_1X\le \sum_i w_i^2t_i(t_i+x_i)$ (1)

$0.5\le t_i+x_i \le 1$

Because we didn't use the raw tf weight for $x$, $x_i=0\ \forall i$ is in the solution space. I'm also ignoring the more complicated constraint that at least one $t_i+x_i=1$, because we can't express that linearly. We'll leave it as is and hope that the optimizer ends up setting one of them to 1.

Intuitively, it seems like the set of possible $x$ should be convex, but even if so we're already in the realm of quadratically constrained programming. Note that we can solve for minimal $X^2$ instead of minimal $X$, because $X>0$, but we probably can't use this methodology to maximize $X$ (i.e. minimize $-X$). But thankfully this'll be easily solvable if $P$ is positive semidefinite. So what is $P$? We need to rewrite (1) in the correct form, which starts by squaring both sides:

$0\ge 0.36D_1^2\sum_i w_i^2 (t_i+x_i)^2-\sum_{i,j}w_i^2w_j^2t_it_j(t_i+x_i)(t_j+x_j)$

We can rewrite this as $0\ge x^TPx+q^Tx+r$ where $P_{i,j}=0.36D_1^2w_i^2-w_i^2w_j^2t_it_j$ if $i=j$ and $P_{i,j}=-w_i^2w_j^2t_it_j$ otherwise.

It's non-obvious to me that $P$ has to be positive semidefinite, but that'll be easy to check for any individual $d_1$. If so, pop this into a QP solver and you'll get a lower bound on $X$. If not, we're in trouble.
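Not a rigorous treatment of the QP form, but a quick way to experiment with this formulation is to hand the raw constraints to a general nonlinear solver. A minimal sketch with scipy, where the idf vector w and the tf vector t are made up, and SLSQP only finds a local optimum, so this suggests a bound rather than proving one:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical idf weights w and augmented tf weights t for d1 (0.5 <= t_i <= 1)
w = np.array([6.0, 3.0, 1.0, 0.5])
t = np.array([1.0, 0.7, 0.6, 0.5])
D1 = np.sqrt(np.sum(w**2 * t**2))              # ||d1||

def X(x):                                      # ||d2|| for the tf vector t + x
    return np.sqrt(np.sum(w**2 * (t + x)**2))

def cosine_constraint(x):                      # d1.d2 - 0.6 * ||d1|| * ||d2|| >= 0
    return np.sum(w**2 * t * (t + x)) - 0.6 * D1 * X(x)

res = minimize(
    X,                                         # minimize ||d2||
    x0=np.zeros_like(t),                       # x = 0 (i.e. d2 = d1) is feasible
    method="SLSQP",
    bounds=[(0.5 - ti, 1.0 - ti) for ti in t], # keep 0.5 <= t_i + x_i <= 1
    constraints=[{"type": "ineq", "fun": cosine_constraint}],
)
print(res.x, X(res.x))                         # local estimate of the minimal ||d2||
```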

Can we also get a practical upper bound? I'm not sure. Obviously there's some finite upper bound, since we can calculate the maximum possible $X$ from the idf vector $w$ easily. But the fact that the minimum tf weight is 0.5 instead of 0 is throwing off my intuitions about how to create an adversarial $x$ with maximum $X$, and so the best approach that I'm coming up with is gradient descent, which may or may not find the actual global maximum but will probably be close.

Matthew Gray

I am posting an answer, but clearly I will award the bounty to someone else.

I think there is a maximum numerator if the document tf is normalized:

d1⋅d2/(||d1||||d2||)

Assume d1 has the same number of terms or fewer (or just take the d with fewer terms).
The maximum possible normalized tf is 1.
So the maximum possible numerator is sum(tf1,i * idf,i * 1 * idf,i).

So for a cosine of 0.6, ||d2|| <= sum(tf1,i * idf,i * 1 * idf,i) / (||d1|| * 0.6)
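In code, that bound looks like this (an illustrative sketch; it assumes every term of d1 appears in d2 with the maximum possible normalized tf of 1):

```python
def max_d2_norm(d1_tf, idf, threshold=0.6):
    """Upper bound on ||d2|| for a cosine >= threshold match against d1,
    assuming each shared term reaches the maximum normalized tf of 1."""
    d1_norm = sum((tf * idf[t]) ** 2 for t, tf in d1_tf.items()) ** 0.5
    max_numerator = sum(tf * idf[t] * 1.0 * idf[t] for t, tf in d1_tf.items())
    return max_numerator / (d1_norm * threshold)
```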

As for a minimum, I am working on that, but clearly there is a minimum.
If you are going to match, you are going to need some minimum ||d2||.

paparazzo