I suggest that you tweak the distance metric slightly, and then tweak the split function in your k-d tree to match it. You can define any distance metric that you want, to capture/measure similarity between two points. And you can modify a standard k-d tree so it works well with the revised distance function.
When computing the distance between two angles $r,s$, a helpful definition is
$$d(r,s) = \min(|r-s|,|r+360-s|,|r-360-s|),$$
or equivalently, $d(r,s) = \min(|r-s| \bmod 360,\; 360 - (|r-s| \bmod 360))$. Here I am assuming $r,s$ are measured in degrees; if they are measured in radians, replace $360$ by $2\pi$.
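As a quick sketch of this angular metric in Python (the function name `angular_distance` is mine):

```python
def angular_distance(r, s):
    """Shortest separation between angles r and s in degrees, in [0, 180]."""
    diff = abs(r - s) % 360.0
    return min(diff, 360.0 - diff)  # fold back across the wraparound
```

For example, `angular_distance(350, 10)` gives `20`, not `340`.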
The total distance can be obtained by computing the distance for each coordinate/dimension, squaring those distances, summing the squares, and taking the square root of the sum. In other words,
$$D(x,y) = \bigg( \sum_i d_i(x_i,y_i)^2 \bigg)^{1/2},$$
where $d_i(\cdot,\cdot)$ is a distance metric that is appropriate for the $i$th dimension: $d_i(x_i,y_i) = |x_i-y_i|$ if the $i$th dimension contains ordinary data, or the angular distance metric listed above if the $i$th dimension contains angular data.
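As a sketch in Python (the helper name `mixed_distance` and the `angular_dims` argument, a set of indices of angular dimensions, are my own):

```python
import math

def mixed_distance(x, y, angular_dims):
    """The combined distance D(x, y): per-dimension distances, Euclidean-style.

    angular_dims is the set of indices i whose coordinates are angles in
    degrees and should wrap around at 360.
    """
    total = 0.0
    for i, (xi, yi) in enumerate(zip(x, y)):
        if i in angular_dims:
            diff = abs(xi - yi) % 360.0
            di = min(diff, 360.0 - diff)  # shortest arc
        else:
            di = abs(xi - yi)  # ordinary absolute difference
        total += di * di
    return math.sqrt(total)
```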
Then, I suggest you define the "nearest neighbor" using this distance measure $D(\cdot,\cdot)$.
Next, you need to adapt the k-d tree so it will be useful for finding the nearest neighbor.
My thought would be to build a k-d tree on all 12 dimensions, but change how the "split" works for the 3 angular dimensions, to accommodate the wrap-around semantics.
Normally, we split based on a threshold $\tau$. All data points $x$ with $x_i \le \tau$ go to one group, and all data points with $x_i > \tau$ go to the other group. This splits the range $(-\infty,+\infty)$ of all possible values into two subranges: $(-\infty,\tau]$ and $(\tau,+\infty)$.
With angular data, one plausible approach is to split based on a threshold $\tau$, but adapt the comparison to take wraparound into account. Specifically, all data points $x$ with $\tau-180 < x_i \le \tau$ or $x_i > \tau+180$ go to one group, and all data points with $\tau < x_i \le \tau+180$ or $x_i < \tau-180$ go to the other group. (Here I assume $x_i$ is an angle measured in degrees. If it is measured in radians, replace $180$ with $\pi$.) In other words, the threshold $\tau$ basically splits the range $[0,360)$ of all possible angles into two subranges: $(\tau-180 \bmod 360, \tau]$ and $(\tau, \tau+180 \bmod 360]$, with all values taken modulo 360 to take into account wraparound.
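The wraparound comparison is easier to implement by normalizing the signed offset from $\tau$; a Python sketch (the name `wrap_side` is mine):

```python
def wrap_side(x_i, tau):
    """Side of the wraparound split at threshold tau (angles in degrees).

    Returns 0 if x_i lies in the half-circle (tau - 180, tau] (mod 360),
    and 1 if it lies in (tau, tau + 180] (mod 360).
    """
    offset = (x_i - tau) % 360.0  # in [0, 360)
    if offset > 180.0:
        offset -= 360.0           # now in (-180, 180]
    return 0 if offset <= 0.0 else 1
```

For example, with $\tau = 10$, the angle $350$ lands on side 0 (it is within $180°$ counterclockwise of $\tau$), while $30$ lands on side 1.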
Then, for each dimension, the k-d tree would use either a normal split or a wraparound split, according to whether that dimension refers to normal data or angular data.
Finally, you can adapt standard algorithms for building a k-d tree to work with this modification, and you can adapt standard algorithms for finding the nearest neighbor by traversing the k-d tree to work with this modified distance metric and modified data structure.
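To make this concrete, here is a minimal Python sketch of both adaptations, building and searching (all names here, `Node`, `build`, `nearest`, `wrap_side`, are mine; choosing $\tau$ as the median of the raw values on an angular axis is just one plausible choice, not a claim about the best split). The pruning step is the one genuinely new idea: the boundary of a wraparound split is the pair of angles $\{\tau, \tau+180\}$, so the far half-circle can be pruned only if neither boundary angle is within the current best distance.

```python
import math

def ang_dist(a, b):
    """Shortest arc between two angles in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def dist(x, y, angular_dims):
    """The combined distance D from above (angular dims wrap at 360)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(x, y)):
        di = ang_dist(xi, yi) if i in angular_dims else abs(xi - yi)
        total += di * di
    return math.sqrt(total)

def wrap_side(v, tau):
    """0 if v lies in the half-circle (tau-180, tau] mod 360, else 1."""
    offset = (v - tau) % 360.0
    return 0 if offset == 0.0 or offset > 180.0 else 1

class Node:
    def __init__(self, point, axis, tau, left, right):
        self.point, self.axis, self.tau = point, axis, tau
        self.left, self.right = left, right

def build(points, angular_dims, depth=0):
    """Build a k-d tree, cycling through dimensions; angular dimensions
    use the wraparound split instead of the ordinary one."""
    if not points:
        return None
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    tau = pts[mid][axis]  # median of raw values, even on angular axes
    rest = pts[:mid] + pts[mid + 1:]
    if axis in angular_dims:
        left = [p for p in rest if wrap_side(p[axis], tau) == 0]
        right = [p for p in rest if wrap_side(p[axis], tau) == 1]
    else:
        left = [p for p in rest if p[axis] <= tau]
        right = [p for p in rest if p[axis] > tau]
    return Node(pts[mid], axis, tau,
                build(left, angular_dims, depth + 1),
                build(right, angular_dims, depth + 1))

def nearest(node, query, angular_dims, best=None):
    """Standard NN descent; only the near-child choice and the
    distance-to-splitting-boundary used for pruning are changed."""
    if node is None:
        return best
    d = dist(query, node.point, angular_dims)
    if best is None or d < best[0]:
        best = (d, node.point)
    axis, tau = node.axis, node.tau
    if axis in angular_dims:
        on_left = wrap_side(query[axis], tau) == 0
        # Split boundary is the angle pair {tau, tau+180}: prune the far
        # half-circle only if neither boundary angle is reachable.
        plane = min(ang_dist(query[axis], tau),
                    ang_dist(query[axis], (tau + 180.0) % 360.0))
    else:
        on_left = query[axis] <= tau
        plane = abs(query[axis] - tau)
    near, far = (node.left, node.right) if on_left else (node.right, node.left)
    best = nearest(near, query, angular_dims, best)
    if plane < best[0]:
        best = nearest(far, query, angular_dims, best)
    return best
```

Usage would look like `tree = build(points, angular_dims={1}); nearest(tree, query, {1})`, where index 1 marks an angular dimension.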
A minor side note: If you are finding nearest neighbors for nearest-neighbor classification or a similar purpose, it is often helpful to standardize all of the coordinates (subtract off the mean for that coordinate, divide by the standard deviation), before computing the distance. If you do that, you'll need to divide $360$ by the standard deviation in the above distance metric for angles.