KNN is typically applied to 2-dimensional data, so how do we handle higher dimensions? The KD-Tree algorithm is better suited to this case, but note that, due to the curse of dimensionality, it is not especially efficient either. While working on my graduation project I found that the KD Tree was also a good approach; unfortunately it was left out because of length constraints, but since the summary was already written I want to keep it here as a record.
However, the KNN method is not without limitations. When applied to large datasets or high-dimensional data, it becomes computationally expensive to calculate the distances between training data and test data, and it can lead to overfitting. To address these issues, we further introduce the K-D Tree algorithm. A K-Dimensional Tree (KD Tree) is a space-partitioning data structure for organizing points in a k-dimensional space. Each node in the tree is a k-dimensional point. Every non-leaf node acts as a hyperplane, perpendicular to a chosen axis, that divides the space into two half-spaces. The splitting dimension (axis) for each node is typically chosen by cycling through the dimensions as one moves down the tree.
Given a k-dimensional dataset $\mathbb{T}=\{x_1,x_2,\cdots,x_N\}$, where $x_i = (x_i^{1},x_i^{2},\cdots, x_i^{k})^{T}$, $i = 1,2,\cdots,N$, the nodes at a given depth split the space along one dimension, cycling through the $k$ dimensions as the depth increases. In order to construct a balanced K-D Tree, each node should split the space so that the left and right subspaces contain an equal number of points. Therefore, we pick the median of the points along the current dimension and make it the subroot. Although other splitting methods exist, the median-based approach is efficient and produces a well-balanced tree.
This process is applied recursively: at a node of depth $j$, the splitting dimension is chosen as $l = (j \bmod k) + 1$, and the data are divided into two subsets based on the median value in that dimension. The recursion continues until no further points remain in either subset, at which point the construction of the K-D tree is complete.
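A minimal sketch of this construction in Python follows; the `Node` class and `build_kdtree` function are illustrative names chosen here, not part of any particular library.

```python
import numpy as np

class Node:
    """One node of the K-D tree: a point, its splitting dimension, and two subtrees."""
    def __init__(self, point, split_dim, left=None, right=None):
        self.point = point          # the k-dimensional point stored at this node
        self.split_dim = split_dim  # axis of the splitting hyperplane
        self.left = left            # points whose coordinate on this axis is <= the median's
        self.right = right          # points whose coordinate on this axis is > the median's

def build_kdtree(points, depth=0):
    """Recursively build a balanced K-D tree by splitting at the median."""
    if len(points) == 0:
        return None
    k = points.shape[1]
    split_dim = depth % k                             # cycle through the dimensions with depth
    points = points[points[:, split_dim].argsort()]   # sort along the splitting dimension
    median = len(points) // 2                         # the median point becomes the subroot
    return Node(
        point=points[median],
        split_dim=split_dim,
        left=build_kdtree(points[:median], depth + 1),
        right=build_kdtree(points[median + 1:], depth + 1),
    )

# Example: six 2-D points; the root is the median along dimension 0.
data = np.array([[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]])
tree = build_kdtree(data)
print(tree.point)  # [7 2]
```

This straightforward version re-sorts the points at every level; more careful implementations select the median in linear time so that the whole tree is built in $O(N \log N)$.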
When using a K-D tree to find the nearest neighbors of a target point, the search begins at the root and proceeds down the tree to locate the region in which the target point lies. At each level, the algorithm compares the target point’s coordinate in the current splitting dimension with the node’s coordinate, moving left or right accordingly until reaching a leaf node. The point in this leaf node is initially recorded as the best (nearest) candidate. The algorithm then backtracks through the tree, updating the list of nearest neighbors whenever a closer point is found. At each step, it checks whether there could be even closer points in the opposite subtree by constructing a hypersphere centered at the target point with a radius equal to the distance to the farthest point currently in the best list. If this hypersphere intersects the splitting hyperplane, the algorithm explores the opposite branch; otherwise, it continues moving upward. This recursive backtracking process continues until all relevant nodes have been examined, ensuring that the final points in the best list are the true nearest neighbors.
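Continuing the sketch above, a k-nearest-neighbor search with this backtracking and hypersphere check might look like the following; it assumes the `Node` class and the `tree` built in the previous block.

```python
import heapq
import numpy as np

def knn_search(node, target, k, best=None):
    """Return a max-heap of the k nearest (negative distance, point) pairs."""
    if best is None:
        best = []
    if node is None:
        return best

    # Update the best list with the point stored at this node.
    dist = np.linalg.norm(target - node.point)
    if len(best) < k:
        heapq.heappush(best, (-dist, node.point.tolist()))
    elif dist < -best[0][0]:                      # closer than the current farthest candidate
        heapq.heapreplace(best, (-dist, node.point.tolist()))

    # Descend first into the half-space that contains the target.
    diff = target[node.split_dim] - node.point[node.split_dim]
    near, far = (node.left, node.right) if diff <= 0 else (node.right, node.left)
    knn_search(near, target, k, best)

    # Backtrack: visit the far subtree only if the hypersphere centered at the target
    # (radius = distance to the farthest current candidate) crosses the splitting hyperplane.
    if len(best) < k or abs(diff) < -best[0][0]:
        knn_search(far, target, k, best)
    return best

# Example: the 2 nearest neighbors of (6, 5) in the tree built above.
neighbors = knn_search(tree, np.array([6, 5]), k=2)
print(sorted((-d, p) for d, p in neighbors))  # [(1.414..., [5, 4]), (2.828..., [4, 7])]
```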
Compared to the brute-force KNN method, the K-D Tree is more efficient because it partitions the feature space, so whole partitions that are farther away than the current k closest neighbors can be ruled out.
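For illustration, optimized K-D tree implementations are readily available; a rough sketch using `scipy.spatial.cKDTree` on synthetic data (assuming SciPy is installed) could look like this:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
train = rng.random((10_000, 3))   # 10,000 training points in 3 dimensions
queries = rng.random((5, 3))      # 5 test points

tree = cKDTree(train)                    # build the space-partitioning tree once
dist, idx = tree.query(queries, k=5)     # 5 nearest training points per query
print(dist.shape, idx.shape)             # (5, 5) (5, 5)
```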
Due to the curse of dimensionality, however, neither method can avoid long query times or perform well on high-dimensional data. There are further refinements of the K-D tree, such as the Ball Tree and the Best-Bin-First (BBF) algorithm, but we do not include them in our research.