From what I know, the classic way to construct a KD-tree is to alternate the splitting dimension and find the median at each level. My dataset contains many duplicated points, and I would like to incorporate the duplicate filtering into the construction of the KD-tree itself, but I am not sure how to proceed.
- Obvious option: delete the duplicates first and then build the static tree. This is my backup option, but I am looking for something smarter/faster.
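  For concreteness, a minimal sketch of this backup option, assuming the points are rows of a NumPy array (`np.unique` with `axis=0` sorts the rows and drops exact duplicates in $O(N \log N)$):

  ```python
  import numpy as np

  # Example point set with exact duplicates (3 unique points).
  points = np.array([
      [2.0, 3.0],
      [5.0, 4.0],
      [2.0, 3.0],  # duplicate
      [9.0, 6.0],
      [5.0, 4.0],  # duplicate
  ])

  # np.unique with axis=0 sorts the rows lexicographically and removes
  # exact duplicates; the result is ready for a static KD-tree build.
  unique_points = np.unique(points, axis=0)
  print(unique_points)
  ```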
- If such preprocessing is done to filter duplicates anyway, would it not be a good idea to also pre-sort the arrays along every dimension, instead of finding medians during the build? The complexity reported everywhere for pre-sorting is $O(kN \log N)$, so I do not see how it beats the classic $O(N \log N)$ median-finding build. However, I have seen some people claim that pre-sorting is faster in practice. Is that true?
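  For reference, the median-finding build I mean looks roughly like this hypothetical sketch: it uses `np.argpartition` (expected linear-time selection) at every level instead of a full sort, which is where the expected $O(N \log N)$ total comes from.

  ```python
  import numpy as np

  def build_kdtree(points, depth=0):
      """Classic build: split on the median of the cycling dimension.

      np.argpartition places the element at index `mid` in its sorted
      position in expected linear time, so the whole build is
      O(N log N) on average (no per-level full sort needed).
      """
      n = len(points)
      if n == 0:
          return None
      axis = depth % points.shape[1]
      mid = n // 2
      order = np.argpartition(points[:, axis], mid)
      points = points[order]
      return {
          "point": points[mid],
          "left": build_kdtree(points[:mid], depth + 1),
          "right": build_kdtree(points[mid + 1:], depth + 1),
      }

  pts = np.array([[7, 2], [5, 4], [9, 6], [4, 7], [8, 1], [2, 3]])
  tree = build_kdtree(pts)
  ```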
- Or is there some other efficient way to remove duplicates directly while constructing the KD-tree? Maybe with iterative insertion into a KD-tree instead of the classic bulk-build procedure? But then the preprocessing would have to include not only deletion of the duplicate points but also some ordering of the remaining points that keeps the tree balanced.
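  To illustrate the ordering I have in mind, here is a hypothetical 1-D sketch: inserting a sorted, deduplicated list into a plain binary search tree in "middle-first" order yields a perfectly balanced tree. For a KD-tree, though, the median would have to be re-selected on the cycling dimension at every level, which amounts to the classic bulk build again.

  ```python
  def median_insertion_order(sorted_vals):
      """Return elements of a sorted list in 'middle-first' order.

      Inserting into a 1-D binary search tree in this order produces a
      balanced tree. Note: for a KD-tree the split dimension changes at
      each level, so this simple 1-D ordering does not carry over.
      """
      order = []
      stack = [(0, len(sorted_vals))]
      while stack:
          lo, hi = stack.pop()
          if lo >= hi:
              continue
          mid = (lo + hi) // 2
          order.append(sorted_vals[mid])  # insert the midpoint first
          stack.append((lo, mid))         # then recurse into both halves
          stack.append((mid + 1, hi))
      return order

  print(median_insertion_order([1, 2, 3, 4, 5, 6, 7]))
  ```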
Any advice is appreciated.