From what I know, the classic way to construct a KD-tree is to alternate the splitting dimension and find the median at each level. My dataset contains many duplicated points, and I would like to incorporate the duplicate filtering into the construction of the KD-tree itself, but I am not sure how to proceed.
- Obvious option: delete the duplicates first and then build the static tree. This is my backup option, but I am looking for something smarter/faster.
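  For concreteness, a minimal sketch of this backup option, assuming the points are rows of a NumPy array (`np.unique` with `axis=0` sorts the rows and drops exact duplicates in $O(N \log N)$):

  ```python
  import numpy as np

  # Example point set with exact duplicates (3 unique points).
  points = np.array([
      [2.0, 3.0],
      [5.0, 4.0],
      [2.0, 3.0],  # duplicate
      [9.0, 6.0],
      [5.0, 4.0],  # duplicate
  ])

  # np.unique with axis=0 sorts the rows lexicographically and removes
  # exact duplicates; the result is ready for a static KD-tree build.
  unique_points = np.unique(points, axis=0)
  print(unique_points)
  ```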
- If such preprocessing is done to filter duplicates anyway, would it not be a good idea to also pre-sort the arrays along every dimension, instead of finding medians during the build? The complexity reported everywhere for pre-sorting is $O(kN \log N)$, so I do not see how it beats the classic $O(N \log N)$ median-finding build. However, I have seen some people claim that pre-sorting is faster in practice. Is that true?
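  For reference, the median-finding build I mean looks roughly like this hypothetical sketch: it uses `np.argpartition` (expected linear-time selection) at every level instead of a full sort, which is where the expected $O(N \log N)$ total comes from.

  ```python
  import numpy as np

  def build_kdtree(points, depth=0):
      """Classic build: split on the median of the cycling dimension.

      np.argpartition places the element at index `mid` in its sorted
      position in expected linear time, so the whole build is
      O(N log N) on average (no per-level full sort needed).
      """
      n = len(points)
      if n == 0:
          return None
      axis = depth % points.shape[1]
      mid = n // 2
      order = np.argpartition(points[:, axis], mid)
      points = points[order]
      return {
          "point": points[mid],
          "left": build_kdtree(points[:mid], depth + 1),
          "right": build_kdtree(points[mid + 1:], depth + 1),
      }

  pts = np.array([[7, 2], [5, 4], [9, 6], [4, 7], [8, 1], [2, 3]])
  tree = build_kdtree(pts)
  ```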
- Or is there some other efficient way to remove duplicates directly while constructing the KD-tree? Maybe with iterative insertion into a KD-tree instead of the classic bulk-build procedure? But then the preprocessing would have to include not only deletion of the duplicate points but also some ordering of the remaining points that keeps the tree balanced.
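  To illustrate the ordering I have in mind, here is a hypothetical 1-D sketch: inserting a sorted, deduplicated list into a plain binary search tree in "middle-first" order yields a perfectly balanced tree. For a KD-tree, though, the median would have to be re-selected on the cycling dimension at every level, which amounts to the classic bulk build again.

  ```python
  def median_insertion_order(sorted_vals):
      """Return elements of a sorted list in 'middle-first' order.

      Inserting into a 1-D binary search tree in this order produces a
      balanced tree. Note: for a KD-tree the split dimension changes at
      each level, so this simple 1-D ordering does not carry over.
      """
      order = []
      stack = [(0, len(sorted_vals))]
      while stack:
          lo, hi = stack.pop()
          if lo >= hi:
              continue
          mid = (lo + hi) // 2
          order.append(sorted_vals[mid])  # insert the midpoint first
          stack.append((lo, mid))         # then recurse into both halves
          stack.append((mid + 1, hi))
      return order

  print(median_insertion_order([1, 2, 3, 4, 5, 6, 7]))
  ```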
Any advice is appreciated.