
Let's take a random sample of points in Euclidean $n$-space: assume an i.i.d. sample from a standard normal distribution.

To each point $p$, I assign the number $N(p)$, defined as "how many times does $p$ occur among the $10$ nearest neighbors of some other point?"

I would like to understand the distribution of $N(p)$ and how it depends on dimension.

Some experimental results from Python are quite surprising to me. In low dimensions, $n=2,3,4$, this distribution is close to normal, while in higher dimensions it is very skewed, with most points not being nearest neighbours to anything else, and a few points being nearest neighbours to a lot of other points.

I always generate 1000 samples here (not that it matters much); here are some histograms:
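For concreteness, the experiment is roughly the following (a minimal sketch, not my exact script; the function name `nn_counts`, the seed, and the list of dimensions are just illustrative):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

def nn_counts(n_points=1000, dim=2, k=10):
    """For each point, count how often it appears among the k nearest
    neighbours of the other points (the point itself is excluded)."""
    X = rng.standard_normal((n_points, dim))
    # Ask for k+1 neighbours because each point's own neighbour list starts with itself.
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    return np.bincount(idx[:, 1:].ravel(), minlength=n_points)

for dim in (2, 4, 16, 64, 256):
    c = nn_counts(dim=dim)
    print(f"dim={dim:4d}  mean N(p)={c.mean():.1f}  max N(p)={c.max():3d}  "
          f"fraction with N(p)=0: {np.mean(c == 0):.2f}")
```

Note that the mean of $N(p)$ is exactly $10$ in every dimension (each point hands out exactly $10$ nearest-neighbour slots), so only the spread and the tail of the distribution change with dimension.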

I have an intuition that the dependence on dimension may boil down to something like this:

  • in dim $1$, a point can be the nearest neighbor of at most $2$ points, one on the left and one on the right
  • in dim $2$, a point can be the nearest neighbor of at most $5$ or $6$ points, forming a regular hexagon around it, and so on (a quick empirical check of this maximum in-degree is sketched below).
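
To check this numerically, one can look at the in-degree of the single-nearest-neighbour relation as the dimension grows; a sketch, with arbitrarily chosen parameters:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)

# In-degree of the 1-NN graph: how many points share the same nearest neighbour?
for dim in (1, 2, 4, 16, 64):
    X = rng.standard_normal((1000, dim))
    _, idx = NearestNeighbors(n_neighbors=2).fit(X).kneighbors(X)
    nn = idx[:, 1]                               # each point's single nearest neighbour
    in_degree = np.bincount(nn, minlength=len(X))
    print(f"dim={dim:3d}  max in-degree={in_degree.max()}")
```

In dimension $1$ the in-degree indeed can never exceed $2$; the interesting question is how quickly the maximum grows with the dimension.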

I also checked whether the points that are NN of many other points are closer to the cluster center -- which is intuitively expected. The answer is yes; here is an example in dimension 32:
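Since the sample is a single Gaussian blob, "cluster center" here just means the origin. The check was roughly of the following form (again a sketch, not my exact code):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)

# Do the frequently retrieved points sit closer to the origin
# (the centre of the single Gaussian blob)?
dim, n_points, k = 32, 1000, 10
X = rng.standard_normal((n_points, dim))
_, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
counts = np.bincount(idx[:, 1:].ravel(), minlength=n_points)
norms = np.linalg.norm(X, axis=1)

top = counts >= np.quantile(counts, 0.95)        # the 5% most retrieved points
print("corr(N(p), ||p||):", np.corrcoef(counts, norms)[0, 1])
print("mean ||p||, top 5% by N(p):", norms[top].mean())
print("mean ||p||, the rest:      ", norms[~top].mean())
```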

However, I'm not sure why the dependence on dimension is so strong.

The motivation comes from nearest-neighbour search, where the data represent some real objects and a nearest-neighbour query represents a user's search.

Should I rather use a smaller dimension if I want to make sure that each element gets a chance of being found?

Thanks for any insight.

(Related: https://ai.stackexchange.com/questions/40525/nearest-neighbour-search-in-high-dimension-retrieves-certain-points-too-often)

Peter Franek
  • An exploratory idea: try to see what happens if the entries of the diagonal covariance matrix are all $1/n$ instead of $1$ for all dimensions. In this way you force the average squared distance from the origin to be $1$; indeed $E[|X|_2^2]=E[X_1^2]+...+E[X_n^2]=n\cdot 1/n=1$. – Snoop May 17 '23 at 23:29
  • Nearest-neighbour search has counter-intuitive behaviour in high dimensions. As an example with $n=5000$, the typical distance between any two points is about $100$ and all but a vanishingly small proportion of distances will be between $96$ and $104$. So the nearest neighbours will not be much closer than the furthest neighbours. (This is caused by the expected distance being close to $\sqrt{2n}$ and the standard deviation being about $1$.) – Henry May 18 '23 at 00:01 (a quick numerical check of these figures is sketched after the comments)
  • @Snoop You mean I just multiply the data by $1/\sqrt{n}$? That has no impact on the nearest-neighbour distribution. – Peter Franek May 18 '23 at 07:10
  • @Henry Yes, thanks for the comment. But this by itself still doesn't explain why some of the points are so "dominant" and most others are not. Do you think that the points that are NN of many others are near the origin? – Peter Franek May 18 '23 at 07:11
  • @PeterFranek with my $n=5000$ example, all but a vanishingly small proportion of points will be between $67$ and $74$ from the origin, so you are unlikely to find any "near the origin" in a sample: the expected distance from the origin is close to $\sqrt{n}$ and the standard deviation a little less than $\frac1{\sqrt{2}}$ – Henry May 18 '23 at 08:09
  • @Henry Yes, but it still seems (see the newly added image) that the points that are at least somewhat nearer to the origin are the ones that appear many times among the NN. – Peter Franek May 18 '23 at 08:10
  • You may be right: some points will be slightly nearer to the origin than others, and that might slightly reduce their distances to the other points, enough to make them more likely to be nearest neighbours of more other points without being near them. – Henry May 18 '23 at 08:15
  • Could you comment on the downvote and close-vote, please? Can I improve the question somehow? – Peter Franek May 18 '23 at 10:46
  • I asked this, with some extra details, on AI Stack Exchange, which may be more suitable for this question: https://ai.stackexchange.com/questions/40525/nearest-neighbour-search-in-high-dimension-retrieves-certain-points-too-often – Peter Franek May 21 '23 at 10:03
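
To make the concentration figures from Henry's comments concrete, here is a quick numerical check (a sketch; the sample size is kept small so the pairwise distance computation stays cheap):

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(3)

# Concentration of distances in dimension n = 5000: pairwise distances should
# cluster around sqrt(2n) ~ 100, distances to the origin around sqrt(n) ~ 70.7.
n, m = 5000, 400
X = rng.standard_normal((m, n))

origin_dist = np.linalg.norm(X, axis=1)
pair_dist = cdist(X[:200], X[200:]).ravel()      # distances between disjoint halves

print(f"to origin: mean {origin_dist.mean():.2f}, std {origin_dist.std():.3f}")
print(f"pairwise:  mean {pair_dist.mean():.2f}, std {pair_dist.std():.3f}, "
      f"range [{pair_dist.min():.2f}, {pair_dist.max():.2f}]")
```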

0 Answers