Questions tagged [similarity]
278 questions
40
votes
5 answers
What are some standard ways of computing the distance between documents?
When I say "document", I have in mind web pages like Wikipedia articles and news stories. I prefer answers giving either vanilla lexical distance metrics or state-of-the-art semantic distance metrics, with stronger preference for the latter.
Matt
- 821
- 1
- 8
- 12
40
votes
4 answers
Applications and differences for Jaccard similarity and Cosine Similarity
Jaccard similarity and cosine similarity are two very common measures for comparing item similarity. However, I am not clear on which one is preferable in which situation.
Can somebody help clarify the differences of…
shihpeng
- 563
- 1
- 4
- 8
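As a rough illustration of the two measures named in the question above, here is a minimal sketch that treats each item as a list of tokens. The function names `jaccard` and `cosine` and the toy token lists are assumptions made for this example and do not come from the question. Jaccard compares the token sets and ignores how often a token occurs, while cosine compares the count vectors and ignores overall length.

```python
import math
from collections import Counter

def jaccard(a_tokens, b_tokens):
    """Jaccard similarity: |A intersect B| / |A union B| on the token sets."""
    a, b = set(a_tokens), set(b_tokens)
    if not a and not b:
        return 1.0  # convention chosen for this sketch: two empty sets are identical
    return len(a & b) / len(a | b)

def cosine(a_tokens, b_tokens):
    """Cosine similarity between the term-count vectors of two token lists."""
    ca, cb = Counter(a_tokens), Counter(b_tokens)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: empty input has no similarity
    return dot / (norm_a * norm_b)

# Hypothetical toy data.
x = "the cat sat on the mat".split()
y = "the cat sat".split()
z = ["the"] * 5 + ["cat", "sat"]  # same vocabulary as y, different counts

print(jaccard(x, y), cosine(x, y))
print(jaccard(y, z), cosine(y, z))
```

In this sketch `y` and `z` share the same vocabulary, so their Jaccard similarity is 1, while their cosine similarity is lower because the counts differ. That is one way to see why Jaccard tends to suit set-like or binary data and cosine tends to suit weighted vectors.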
38
votes
4 answers
When to use cosine similarity over Euclidean similarity
In NLP, people tend to use cosine similarity to measure document/text distances. I'd like to hear what people think about which to pick, cosine similarity or Euclidean, in the following two scenarios.
Overview of the task set: The task is to compute…
Logan
- 503
- 1
- 4
- 8
37
votes
6 answers
Sentence similarity prediction
I'm looking to solve the following problem: I have a set of sentences as my dataset, and I want to be able to type a new sentence and find the sentence in the dataset that the new one is most similar to. An example would look like:
New…
lte__
- 1,379
- 5
- 19
- 29
35
votes
8 answers
Best practical algorithm for sentence similarity
I have two sentences, S1 and S2, both of which have a word count (usually) below 15.
What are the most practically useful and successful (machine learning) algorithms, which are possibly easy to implement (neural network is ok, unless the architecture…
DaveTheAl
- 533
- 1
- 5
- 12
30
votes
1 answer
AdaBoost vs Gradient Boosting
How is AdaBoost different from a Gradient Boosting algorithm since both of them use a Boosting technique?
I could not figure out the actual difference between these two algorithms from a theory point of view.
CodeMaster GoGo
- 808
- 1
- 7
- 15
25
votes
5 answers
Clustering based on similarity scores
Assume that we have a set of elements E and a similarity (not distance) function sim(e_i, e_j) between two elements e_i, e_j ∈ E.
How could we (efficiently) cluster the elements of E, using sim?
k-means, for example, requires a given k, Canopy…
vefthym
- 503
- 1
- 6
- 13
17
votes
4 answers
Alternatives to TF-IDF and Cosine Similarity when comparing documents of differing formats
I've been working on a small, personal project which takes a user's job skills and suggests the most ideal career for them based on those skills. I use a database of job listings to achieve this. At the moment, the code works as follows:
1) Process…
Richard Knoche
- 171
- 1
- 1
- 3
14
votes
1 answer
How to measure the similarity between two images?
I have two groups of images, one of cats and one of dogs, and each group contains 2000 images.
My goal is to cluster the images using k-means.
Assume image1 is x and image2 is y. Here we need to measure the similarity between any…
jason
- 329
- 2
- 4
- 9
12
votes
1 answer
MinHashing vs SimHashing
Suppose I have five sets I'd like to cluster. I understand that the SimHashing technique described here:
https://moultano.wordpress.com/2010/01/21/simple-simhashing-3kbzhsxyg4467-6/
could yield three clusters ({A}, {B,C,D} and {E}), for instance, if…
cjauvin
- 451
- 4
- 7
12
votes
3 answers
Which supervised learning algorithms are available for matching?
I'm working at a non-profit where we try to help potential university applicants by matching them with alumni who want to share their experience/wisdom and, at the moment, it is happening manually. So I'll have two tables, one with students and one…
k1nd3rm4x1
- 123
- 1
- 1
- 5
12
votes
3 answers
Why use cosine similarity instead of scaling the vectors when calculating the similarity of vectors?
I'm watching an NLP video on Coursera. It's discussing how to calculate the similarity of two vectors. First it discusses calculating the Euclidean distance, then it discusses the cosine similarity. It says that cosine similarity makes more sense…
Allure
- 285
- 2
- 7
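One way to look at the question above is that scaling the vectors to unit length and then comparing them is closely tied to cosine similarity. The sketch below checks the identity ‖û − v̂‖² = 2 − 2·cos(u, v) for unit-normalized vectors. The helper names and the sample vectors are assumptions made for this example and are not taken from the course or the question.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def cosine_similarity(u, v):
    """Cosine similarity, computed as the dot product of the unit vectors."""
    return sum(a * b for a, b in zip(l2_normalize(u), l2_normalize(v)))

def euclidean_distance(u, v):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical vectors: v points in almost the same direction as u but is much longer.
u = [1.0, 2.0, 3.0]
v = [10.0, 20.0, 31.0]

cos_uv = cosine_similarity(u, v)
raw_dist = euclidean_distance(u, v)
unit_dist = euclidean_distance(l2_normalize(u), l2_normalize(v))

print(cos_uv, raw_dist, unit_dist)
```

Under these assumptions the raw Euclidean distance is large because of the length difference, while the distance between the normalized vectors is small and carries the same ranking information as the cosine.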
11
votes
5 answers
Cosine similarity vs The Levenshtein distance
I want to know what the difference between them is and in which situations each works best.
As per my understanding:
Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the…
Pluviophile
- 4,203
- 14
- 32
- 56
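The excerpt above defines cosine similarity but is cut off before it describes the other measure in the title. For reference, here is a minimal sketch of Levenshtein (edit) distance using the standard dynamic-programming recurrence over insertions, deletions, and substitutions. The function name `levenshtein` is an assumption made for this example.

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn s into t."""
    if len(s) < len(t):
        s, t = t, s  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        current = [i]  # distance from s[:i] to the empty prefix of t
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(
                previous[j] + 1,         # delete cs
                current[j - 1] + 1,      # insert ct
                previous[j - 1] + cost,  # substitute cs with ct, or match
            ))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))
```

Unlike cosine similarity, which compares vectors and ignores order, this measure works directly on character sequences, so it tends to be used for things like spelling variants rather than for comparing the topics of documents.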
10
votes
3 answers
Vector space model cosine tf-idf for finding similar documents
I have a corpus of over a million documents.
For a given document, I want to find similar documents using cosine similarity, as in the vector space model:
$d_1 \cdot d_2 / ( ||d_1|| ||d_2|| )$
All tf have been normalized using augmented frequency, to prevent a bias…
paparazzo
- 188
- 14
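The cosine formula in the question above can be sketched for sparse tf-idf vectors stored as term-to-weight dictionaries. Everything here is an assumption made for illustration: the dictionary representation, the names `sparse_cosine` and `most_similar`, and the toy weights. The ranking is a brute-force scan, which is only meant to show the computation and would likely be too slow as-is for a corpus of a million documents.

```python
import heapq
import math

def sparse_cosine(d1, d2):
    """d1 . d2 / (||d1|| ||d2||) for sparse vectors given as {term: weight} dicts."""
    if len(d1) > len(d2):
        d1, d2 = d2, d1  # iterate over the smaller dict
    dot = sum(w * d2.get(term, 0.0) for term, w in d1.items())
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm1 * norm2)

def most_similar(query, corpus, k=3):
    """Return the k highest-scoring (score, doc_id) pairs by brute force."""
    scored = ((sparse_cosine(query, doc), doc_id) for doc_id, doc in corpus.items())
    return heapq.nlargest(k, scored)

# Hypothetical tf-idf weights.
corpus = {
    "a": {"cosine": 0.5, "vector": 0.8},
    "b": {"boosting": 0.9, "tree": 0.4},
    "c": {"cosine": 0.4, "tfidf": 0.7},
}
query = {"cosine": 1.0, "vector": 0.5}

top = most_similar(query, corpus, k=2)
print(top)
```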
10
votes
3 answers
Create most "average" cosine similarity observation
For a recommendation system I'm using cosine similarity to compute similarities between items. However, for items with small amounts of data I'd like to bin them under a general "average" category (in the general not mathematical sense). To…
eric chiang
- 233
- 2
- 7