Highest Voted 'similarity' Questions - Data Science Stack Exchange

40

votes

5 answers

What are some standard ways of computing the distance between documents?

When I say "document", I have in mind web pages like Wikipedia articles and news stories. I prefer answers giving either vanilla lexical distance metrics or state-of-the-art semantic distance metrics, with stronger preference for the latter.

asked Jul 05 '14 at 16:10

Matt

821
1
8
12

40

votes

4 answers

Applications and differences for Jaccard similarity and Cosine Similarity

Jaccard similarity and cosine similarity are two very common measures for comparing item similarity. However, I am not clear about which one is preferable in which situation. Can somebody help clarify the differences of…

similarity

asked Feb 12 '15 at 07:08

shihpeng

563
1
4
8
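
For readers skimming this entry, here is a minimal sketch of the two measures named in the question above. It is not taken from the question or its answers. The function names `jaccard_similarity` and `cosine_similarity`, and the convention that two empty sets have similarity 1.0, are assumptions made for illustration. Jaccard compares sets (presence or absence only), while cosine compares numeric vectors (so counts or weights matter).

```python
import math


def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention for two empty sets
    return len(a & b) / len(a | b)


def cosine_similarity(u, v):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


# Sets of tokens: repeated words are ignored by Jaccard.
print(jaccard_similarity({"a", "b", "c"}, {"b", "c", "d"}))
# Count vectors: scaling one vector does not change the cosine.
print(cosine_similarity([1, 2], [2, 4]))
```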

38

votes

4 answers

When to use cosine similarity over Euclidean similarity

In NLP, people tend to use cosine similarity to measure document/text distances. I want to hear what people think of the following two scenarios: which to pick, cosine similarity or Euclidean? Overview of the task set: The task is to compute…

machine-learning nlp clustering similarity

asked Feb 12 '18 at 13:31

Logan

503
1
4
8

37

votes

6 answers

Sentence similarity prediction

I'm looking to solve the following problem: I have a set of sentences as my dataset, and I want to be able to type a new sentence and find the sentence in the dataset that the new one is most similar to. An example would look like: New…

python nlp scikit-learn similarity text

asked Oct 22 '17 at 07:36

lte__

1,379
5
19
29

35

votes

8 answers

Best practical algorithm for sentence similarity

I have two sentences, S1 and S2, both of which have a word count (usually) below 15. What are the most practically useful and successful (machine learning) algorithms that are also easy to implement (a neural network is OK, unless the architecture…

nlp clustering word2vec similarity

asked Nov 23 '17 at 14:40

DaveTheAl

533
1
5
12

30

votes

1 answer

AdaBoost vs Gradient Boosting

How is AdaBoost different from a Gradient Boosting algorithm, since both of them use a boosting technique? I could not figure out the actual difference between these two algorithms from a theoretical point of view.

algorithms similarity ensemble-modeling boosting

asked Oct 04 '18 at 14:25

CodeMaster GoGo

808
1
7
15

25

votes

5 answers

Clustering based on similarity scores

Assume that we have a set of elements E and a similarity (not distance) function sim(ei, ej) between two elements ei,ej ∈ E. How could we (efficiently) cluster the elements of E, using sim? k-means, for example, requires a given k, Canopy…

clustering algorithms similarity

asked May 16 '14 at 14:26

vefthym

503
1
6
13

17

votes

4 answers

Alternatives to TF-IDF and Cosine Similarity when comparing documents of differing formats

I've been working on a small, personal project which takes a user's job skills and suggests the most ideal career for them based on those skills. I use a database of job listings to achieve this. At the moment, the code works as follows: 1) Process…

nlp text-mining similarity cosine-distance

asked Jan 02 '17 at 20:41

Richard Knoche

171
1
1
3

14

votes

1 answer

How to measure the similarity between two images?

I have two groups of images, one for cats and one for dogs, and each group contains 2000 images. My goal is to cluster the images using k-means. Assume image1 is x and image2 is y. Here we need to measure the similarity between any…

machine-learning k-means similarity image

asked Apr 05 '19 at 00:36

jason

329
2
4
9

12

votes

1 answer

MinHashing vs SimHashing

Suppose I have five sets I'd like to cluster. I understand that the SimHashing technique described here: https://moultano.wordpress.com/2010/01/21/simple-simhashing-3kbzhsxyg4467-6/ could yield three clusters ({A}, {B,C,D} and {E}), for instance, if…

clustering similarity

asked Jun 11 '15 at 21:21

cjauvin

451
4
7

12

votes

3 answers

Which supervised learning algorithms are available for matching?

I'm working on a non-profit where we try to help potential university applicants by matching them with alumni that want to share their experience/wisdom and, at the moment, it is happening manually. So I'll have two tables, one with students and one…

machine-learning beginner similarity supervised-learning recommender-system

asked Jun 21 '16 at 15:43

k1nd3rm4x1

123
1
1
5

12

votes

3 answers

Why use cosine similarity instead of scaling the vectors when calculating the similarity of vectors?

I'm watching an NLP video on Coursera. It's discussing how to calculate the similarity of two vectors. First it discusses calculating the Euclidean distance, then it discusses the cosine similarity. It says that cosine similarity makes more sense…

machine-learning nlp clustering similarity

asked Sep 13 '22 at 09:31

Allure

285
2
7
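
As a small numerical sketch related to the question above (not taken from the question or its answers): once two vectors are scaled to unit length, their squared Euclidean distance equals 2 - 2·cos, so the two measures rank pairs of vectors identically. The helper names `normalize`, `cosine_similarity`, and `squared_euclidean`, and the example vectors, are assumptions made for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in v]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (dot product of unit vectors)."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


def squared_euclidean(u, v):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


u = [3.0, 4.0]
v = [4.0, 3.0]

cos_uv = cosine_similarity(u, v)
d2_unit = squared_euclidean(normalize(u), normalize(v))

# For unit vectors: ||u - v||^2 = 2 - 2 * cos(u, v)
print(cos_uv, d2_unit, 2 - 2 * cos_uv)
```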

11

votes

5 answers

Cosine similarity vs The Levenshtein distance

I wanted to know what the difference between them is and in which situations each works best. As per my understanding: Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the…

similarity metric cosine-distance

asked Nov 18 '19 at 08:52

Pluviophile

4,203
14
32
56
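
The excerpt above defines cosine similarity but is cut off before it reaches the Levenshtein distance. As a hedged sketch of the standard definition (the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another), the dynamic-programming version below may help. The function name `levenshtein` is an assumption for illustration, and this is not taken from the question or its answers.

```python
def levenshtein(s, t):
    """Edit distance between strings s and t, using two rows of the DP table."""
    if len(s) < len(t):
        s, t = t, s  # keep the shorter string as the row to save memory
    previous = list(range(len(t) + 1))  # distance from "" to each prefix of t
    for i, char_s in enumerate(s, start=1):
        current = [i]  # distance from s[:i] to ""
        for j, char_t in enumerate(t, start=1):
            cost = 0 if char_s == char_t else 1
            current.append(min(
                previous[j] + 1,         # delete char_s
                current[j - 1] + 1,      # insert char_t
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
```

Unlike cosine similarity, which works on vectors and ignores order, this distance works directly on character sequences and is sensitive to order.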

10

votes

3 answers

Vector space model cosine tf-idf for finding similar documents

I have a corpus of over a million documents. For a given document, I want to find similar documents using cosine similarity, as in the vector space model: $d_1 \cdot d_2 / (\|d_1\| \|d_2\|)$. All tf values have been normalized using augmented frequency, to prevent a bias…

text-mining similarity

asked Oct 09 '15 at 16:31

paparazzo

188
14

10

votes

3 answers

Create most "average" cosine similarity observation

For a recommendation system I'm using cosine similarity to compute similarities between items. However, for items with small amounts of data I'd like to bin them under a general "average" category (in the general not mathematical sense). To…

recommender-system similarity

asked Jul 01 '14 at 13:44

eric chiang

233
2
7
