
Recently I have been studying nonparametric regression methods for high-dimensional data sets, including continuous, categorical, and functional data. Nonparametric approaches include the kernel method and the spline method; within kernel methods, I have focused on the K-Nearest-Neighbors (KNN) method. It is a classic supervised learning method for classification and regression, accurate but not especially efficient, and it can also be viewed as a form of pattern recognition. These notes are only a summary of material found online, for reference only.

The K-Nearest-Neighbors (KNN) method works by identifying the “k” most similar data points (the nearest neighbors) to a given input and basing its prediction on them. For classification, it predicts the majority class among these neighbors; for regression, it predicts the average of their target values.
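To make the idea concrete, here is a minimal from-scratch sketch in NumPy. The function name `knn_predict`, the arrays `X_train`, `y_train`, `x_new`, and the toy data are illustrative assumptions for this post, not part of any particular library.

```python
# A minimal sketch of the KNN prediction rule (not an optimized implementation).
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    # Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    neighbor_labels = y_train[nearest]
    if task == "classification":
        # Majority vote among the k neighbors
        values, counts = np.unique(neighbor_labels, return_counts=True)
        return values[np.argmax(counts)]
    # Regression: average of the neighbors' target values
    return neighbor_labels.mean()

# Toy data, purely for illustration
X_train = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [6.0, 7.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([2.5, 3.0]), k=3))
```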

The classification of a new point can change dramatically based on the choice of K. For example, a point may be classified as Category A when $K=1$ but as Category B when $K=3$. This happens because the set of nearest neighbors changes with K, and the decision is based on a majority vote among these neighbors. The parameter K controls the number of neighboring points considered in the decision, which is why it is a critical hyperparameter in the KNN method.
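As a hedged illustration of this flip, the following toy example uses scikit-learn's `KNeighborsClassifier`; the coordinates and the Category A / Category B labels are made up so that the single nearest neighbor is A while two of the three nearest are B.

```python
# Toy demonstration: the predicted class changes between K=1 and K=3.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[2.1, 2.1], [1.7, 1.7], [2.3, 1.7], [4.0, 4.0], [0.0, 0.0]])
y = np.array(["A", "B", "B", "A", "B"])  # hypothetical labels
x_new = np.array([[2.0, 2.0]])

for k in (1, 3):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(f"K={k}: predicted {clf.predict(x_new)[0]}")  # K=1 -> A, K=3 -> B
```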

To identify the nearest neighbors, the KNN algorithm is based on a distance metric. The most common family of distances is the Minkowski distance, defined by the general formula:
$$d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$
The Manhattan Distance sums the absolute differences along each dimension, analogous to navigating a city grid. It is the special case of the Minkowski distance with $p=1$:
$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$

The Euclidean Distance is the most widely used metric (and the default in Python's Scikit-learn library). It measures the "straight-line" distance between points in space and corresponds to $p=2$ in the Minkowski metric:
$$d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^2 \right)^{1/2}$$

What if $p$ tends to infinity? Then we obtain the Chebyshev Distance, which considers only the maximum difference across any single dimension, making it useful for chess-like movements:
$$d(x, y) = \max_{i} |x_i - y_i|$$

In our experiment, we will use the Euclidean distance metric to compute the distances to the neighbors.
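A quick numerical check of the three special cases, computed with NumPy on two arbitrary example points (the values of `x` and `y` are chosen only for illustration):

```python
# Minkowski special cases: p=1 (Manhattan), p=2 (Euclidean), p -> infinity (Chebyshev)
import numpy as np

x = np.array([1.0, 4.0, 2.0])
y = np.array([3.0, 1.0, 2.0])

manhattan = np.sum(np.abs(x - y))          # p = 1  -> 5.0
euclidean = np.sqrt(np.sum((x - y) ** 2))  # p = 2  -> sqrt(13) ~ 3.606
chebyshev = np.max(np.abs(x - y))          # p -> infinity -> 3.0

print(manhattan, euclidean, chebyshev)
```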

Selecting an optimal value for K is crucial for the performance of the KNN algorithm. A common approach is to test a range of values, starting from K=1, and evaluate their performance by estimating the uncertainty of each; the K with the lowest estimated uncertainty (for example, the lowest validation error) is the one we keep.

To determine the best K systematically, a standard statistical method is k-fold cross-validation (here the lowercase k refers to the number of folds, not the number of neighbors). This technique partitions the dataset into k subsets (folds), iteratively trains the model on k-1 folds while using the remaining fold for validation, and averages the performance across all iterations. The K value that produces the highest average accuracy is selected for the final model. While there is no universal rule for the maximum K, values beyond 20 often become computationally expensive while offering diminishing returns, especially for smaller datasets. A sketch of this procedure is shown below.
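One possible sketch of this selection procedure with scikit-learn, assuming the Iris dataset and candidate K values from 1 to 20 purely for illustration:

```python
# Choose K by 5-fold cross-validation: keep the K with the highest mean accuracy.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

scores = []
for k in range(1, 21):
    clf = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over the 5 folds for this value of K
    scores.append(cross_val_score(clf, X, y, cv=5).mean())

best_k = int(np.argmax(scores)) + 1  # +1 because K starts at 1
print("best K:", best_k, "mean accuracy:", scores[best_k - 1])
```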

To summarize the K-Nearest-Neighbors algorithm for regression, we begin by selecting an optimal value for K. For a given target point, the algorithm calculates the distance to all other data points, identifying the K closest neighbors. The model then makes its prediction by averaging the values of these closest points. This local averaging is what allows it to be so flexible and data-driven.
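To close, here is a minimal regression sketch with scikit-learn's `KNeighborsRegressor` on synthetic one-dimensional data; the sample size, noise level, and choice of $K=5$ are arbitrary assumptions for illustration.

```python
# KNN regression: the prediction at a new point is the average of its
# 5 nearest neighbors' target values, using the Euclidean metric.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(40, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=40)

reg = KNeighborsRegressor(n_neighbors=5, metric="euclidean").fit(X, y)
print(reg.predict([[2.5]]))  # local average of the 5 closest training targets
```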