
This post and this post indicate that as the dimensionality grows, a KDTree gets slower.

Per the scikit-learn documentation, the KDTree becomes inefficient as the number of dimensions $D$ grows very large (say, $D > 20$): this is one manifestation of the so-called "curse of dimensionality".

Is there a theoretical verification or explanation of why a KDTree gets slower as the dimensionality grows?


1 Answer


The problem arises from the curse of dimensionality. If the dimensionality is $k$, the number of points in the data, $N$, should satisfy $N \gg 2^k$. Otherwise, when $k$-d trees are used with high-dimensional data, most of the points in the tree will be evaluated and the efficiency is no better than brute-force search. So for high values of $k$, approximate nearest-neighbour methods should be used instead.
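To get a feel for how punishing the $N \gg 2^k$ requirement is, here is a rough illustration (my own numbers, not from any benchmark) of the dataset size needed just to match $2^k$ as $k$ grows:

```python
# Illustration: the 2^k data requirement doubles with every added dimension,
# so even modest k demands enormous datasets before a k-d tree can prune well.
for k in (3, 10, 20, 30):
    print(f"k = {k:2d}: need N >> 2^k = {2**k:,}")
# k = 20 already demands N well beyond a million points;
# k = 30 demands over a billion.
```

In practice $N$ rarely keeps pace with $2^k$, which is why the tree degenerates toward scanning nearly every point.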

> most of the points in the tree will be evaluated

When the dimensionality increases to a high value, the volume of the space increases so fast that the available data become sparse. Organizing and searching data often relies on detecting areas where objects form groups with similar properties; in high-dimensional data, however, all objects appear sparse and dissimilar in many ways, which prevents common data-organization strategies from being efficient. Hence $k$-d trees become inefficient in high dimensions.
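This "everything looks dissimilar" effect can be sketched numerically. The snippet below (a hypothetical illustration using uniform random points, not tied to any particular dataset) measures the relative contrast between the nearest and farthest distance from a query point; as $k$ grows this contrast collapses, so a $k$-d tree's pruning rule can almost never discard a subtree:

```python
import numpy as np

# Sketch: distance concentration in high dimensions.
# As k grows, distances from a query to random points bunch together,
# so "near" and "far" become nearly indistinguishable.
rng = np.random.default_rng(0)
n = 1000
for k in (2, 20, 200):
    pts = rng.random((n, k))          # n uniform random points in [0,1]^k
    q = rng.random(k)                 # a random query point
    d = np.linalg.norm(pts - q, axis=1)
    contrast = (d.max() - d.min()) / d.min()   # relative spread of distances
    print(f"k = {k:3d}: relative distance contrast = {contrast:.2f}")
```

Running this shows the contrast shrinking by orders of magnitude from $k = 2$ to $k = 200$: with no contrast left, every branch of the tree looks potentially "close enough" and must be visited.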