Questions tagged [distance]

For questions regarding distance between distributions or variables, such as the Euclidean distance between points in n-space.

145 questions
46
votes
6 answers

When would one use Manhattan distance as opposed to Euclidean distance?

I am trying to find a good argument for why one would use the Manhattan distance over the Euclidean distance in machine learning. The closest thing I have found to a good argument so far is in this MIT lecture. At 36:15 you can see on the slides the…
29
votes
1 answer

What is Hellinger Distance and when to use it?

I would like to understand, in simple terms, what the Hellinger distance actually computes. I would also like to know what types of problems it can be used for. What are the benefits of using the Hellinger distance?
Smith Volka
  • 685
  • 2
  • 6
  • 13
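
For context on the question above: for two discrete distributions $P$ and $Q$, the Hellinger distance is commonly defined as $H(P,Q) = \frac{1}{\sqrt{2}}\sqrt{\sum_i (\sqrt{p_i}-\sqrt{q_i})^2}$, which is bounded between 0 and 1. The sketch below is only an illustration under that definition; the function name `hellinger` and the example vectors are assumptions, not taken from the question.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions.

    Assumes p and q are equal-length sequences of non-negative
    probabilities that each sum to 1 (not checked here).
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    squared = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2)

# Illustrative inputs (hypothetical): identical vs. non-overlapping support.
same = hellinger([0.5, 0.5], [0.5, 0.5])
disjoint = hellinger([1.0, 0.0], [0.0, 1.0])
```

Identical distributions give the minimum value and distributions with no shared support give the maximum, which is one reason the measure is convenient when a bounded, symmetric comparison is wanted.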
12
votes
1 answer

Finding linear transformation under which distance matrices are similar

I have $n$ sets of vectors, where each set $S_i$ contains $k$ vectors in $\mathbb{R}^d$. I know there is some unknown linear transformation $W$ under which the distance matrix $D_i$ (a $k\times k$ matrix) is approximately "the same" (i.e. has a low…
9
votes
1 answer

Why is the cosine distance used to measure the similarity between word embeddings?

When computing the similarity between words, cosine similarity or distance is computed on word vectors. Why aren't other distance metrics, such as Euclidean distance, suitable for this task? Let us consider 2 vectors a and b, where a = [-1,2,-3]…
Ashwin Geet D'Sa
  • 1,217
  • 2
  • 11
  • 20
8
votes
1 answer

Cosine Distance > 1 in scipy

I am working on a recommendation engine, and I have chosen to use SciPy's cosine distance as a way of comparing items. I have two vectors: a = [2.7654870801855078, 0.35995355443076027, 0.016221679989074141, -0.012664358453398751,…
redgem
  • 183
  • 1
  • 1
  • 4
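
A note on the range in question: SciPy's `scipy.spatial.distance.cosine` is defined as one minus the cosine similarity, so it can take any value from 0 to 2, and values above 1 simply indicate vectors pointing in broadly opposite directions. The sketch below mirrors that definition in plain Python; the helper name `cosine_distance` and the short vectors are illustrative assumptions, not the vectors from the question.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity, the same definition SciPy documents."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical 2-D vectors chosen to hit the ends and middle of the range.
d_same = cosine_distance([1.0, 0.0], [2.0, 0.0])       # same direction
d_orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])  # right angle
d_opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])   # opposite direction
```

If a similarity bounded in [0, 1] is needed, one option is to rescale (for example, `1 - d / 2`), though whether that suits a given recommender is a design choice.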
8
votes
2 answers

Fixing data inconsistencies

I'm trying to analyze some data I have, but there are a lot of inconsistencies in it. I have a SQL table that I'm trying to analyze. The table is a table of universities with the following structure: name:string, city:string, state:string,…
bl0b
  • 183
  • 4
8
votes
3 answers

Levenshtein distance vs simple for loop

I have recently begun studying different data science principles, and have had a particular interest as of late in fuzzy matching. For preface, I'd like to include smarter fuzzy searching in a proprietary language named "4D" in my workplace, so…
7
votes
2 answers

Coordinate System's influence on $L$ distances (Manhattan and Euclidean)

I don't understand this picture, which says that if we change the coordinate system, we get the same result for the $L_2$ distance, whereas the result differs for the $L_1$ distance. What is meant by coordinate system? $(0,0)$ if yes, the…
Fatemeh Asgarinejad
  • 1,198
  • 1
  • 10
  • 18
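
One common reading of "changing the coordinate system" here is rotating the axes (an assumption, since the excerpt is truncated). Under a rotation, the $L_2$ distance between two points is preserved while the $L_1$ distance generally is not. A minimal sketch with hypothetical points and a 45-degree rotation:

```python
import math

def rotate(point, angle):
    """Rotate a 2-D point counter-clockwise about the origin."""
    x, y = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)

def l1(p, q):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def l2(p, q):
    """Euclidean distance: square root of summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical example points, not taken from the question's picture.
p, q = (0.0, 0.0), (1.0, 1.0)
angle = math.pi / 4
rp, rq = rotate(p, angle), rotate(q, angle)

l1_before, l1_after = l1(p, q), l1(rp, rq)  # about 2 vs. about 1.414
l2_before, l2_after = l2(p, q), l2(rp, rq)  # both about 1.414
```

Shifting the origin (a pure translation) leaves both distances unchanged, so it is the orientation of the axes that matters for $L_1$.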
7
votes
5 answers

Is there a way to measure correlation between two similar datasets?

Let's say that I have two similar datasets with the same number of elements, for example 3D points: Dataset A: { (1,2,3), (2,3,4), (4,2,1) }, Dataset B: { (2,1,3), (2,4,6), (8,2,3) }. The question is whether there is a way to measure the…
xtluo
  • 233
  • 1
  • 3
  • 11
7
votes
4 answers

What methods exist for distance calculation in clustering? When should we use each of them?

What methods exist for distance calculation in clustering, such as Manhattan, Euclidean, etc.? Also, I don't know when I should use each of them. I always use Euclidean distance.
parvij
  • 791
  • 5
  • 18
6
votes
2 answers

How do I test a difference between two proportions representing the fatality rate for COVID-19 in the Philippines and the world (except the Philippines)?

I'm trying to analyse whether the fatality rate in my country (a third-world country) varies significantly from the world's fatality rate. So I'd basically have two samples, labeled (Philippines) and (World excluding the Philippines); then I can compute…
6
votes
3 answers

Clustering algorithm for a distance matrix

I have a similarity matrix between N objects. For each of the N objects, I have a measure of how similar it is to each of the others: 0 means identical (the main diagonal), and values increase as the objects get less and less similar. Something like that…
6
votes
2 answers

Alternative distance to Dynamic Time Warping

I am comparing time series using Dynamic Time Warping (DTW). However, it is not a true distance but a distance-like quantity, since it does not guarantee that the triangle inequality holds. Reminder: $d: M \times M \to \mathbb{R}$ is a distance if for all…
Ripstein
  • 208
  • 2
  • 12
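
To make the triangle-inequality point concrete, here is a small sketch of the textbook DTW recurrence with absolute difference as the local cost (an assumed choice; the question does not specify one). The three short series are hypothetical and were picked so that the inequality fails.

```python
def dtw(a, b):
    """Classic dynamic-programming DTW with |x - y| as the local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] holds the best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # advance in a only
                                    cost[i][j - 1],      # advance in b only
                                    cost[i - 1][j - 1])  # advance in both
    return cost[n][m]

# Hypothetical series chosen to violate d(x, z) <= d(x, y) + d(y, z).
x, y, z = [0], [0, 1], [1, 1, 1]
d_xy, d_yz, d_xz = dtw(x, y), dtw(y, z), dtw(x, z)
violates_triangle = d_xz > d_xy + d_yz
```

Here the direct cost from `x` to `z` exceeds the sum of the two detour costs through `y`, so methods that rely on metric properties (some indexing structures, for instance) cannot assume them for this variant of DTW.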
6
votes
1 answer

Can I use Euclidean distance for Latent Dirichlet Allocation document similarity?

I have a Latent Dirichlet Allocation (LDA) model with $K$ topics trained on a corpus with $M$ documents. Due to my hyperparameter configuration, the output topic distribution for each document is heavily concentrated on only 3-6 topics and all the…
PyRsquared
  • 1,666
  • 1
  • 12
  • 18
6
votes
1 answer

Improve k-means accuracy

Our weapons: I am experimenting with k-means and Hadoop, where I am chained to these options for various reasons (e.g. Help me win this war!). The battlefield: I have articles, which belong to c categories, where c is fixed. I am vectorizing the…
gsamaras
  • 291
  • 6
  • 15