K Nearest Neighbor or KNN Algorithm And It's Essence in ML

August 9, 2023

Algorithms drive the machine learning world.

They're often praised for their predictive capabilities and spoken of as hard workers that consume huge amounts of data to produce instant results.

Among them, there's an algorithm often labeled as lazy. But it's quite a performer when it comes to classifying data points. It's called the k-nearest neighbors algorithm and is often quoted as one of the most important machine learning algorithms.

The k-nearest neighbor algorithm is a supervised machine learning algorithm used to solve classification and regression problems. However, it's mainly used for classification problems. A simple KNN example would be feeding the neural network or NN model a training dataset of cats and dogs and testing it on an input image. Based on the similarity between the two animal groups, the KNN classifier would predict whether the object in the image is a dog or a cat. 

KNN is a lazy learning and non-parametric algorithm.

It's called a lazy learning algorithm or lazy learner because it doesn't perform any training when you supply the training data. Instead, it just stores the data during the training time and doesn't perform any calculations. It doesn't build a model until a query is performed on the dataset. This makes KNN ideal for data mining.

Did you know? The "K" in KNN is a parameter that determines the number of nearest neighbors to include in the voting process.

It's considered a non-parametric method because it doesn’t make any assumptions about the underlying data distribution. Simply put, KNN tries to determine what group a data point belongs to by looking at the data points around it.

Consider there are two groups, A and B.

To determine whether a data point is in group A or group B, the algorithm looks at the states of the data points near it. If the majority of data points are in group A, it's very likely that the data point in question is in group A and vice versa.

In short, KNN involves classifying a data point by looking at the nearest annotated data point, also known as the nearest neighbor.

Don't confuse K-NN classification with K-means clustering. KNN is a supervised classification algorithm that classifies new data points based on the nearest data points. On the other hand, K-means clustering is an unsupervised clustering algorithm that groups data into a K number of clusters.

How does KNN work?

As mentioned above, the KNN algorithm is predominantly used as a classifier. Let's take a look at how KNN works to classify unseen input data points.

Unlike classification using artificial neural networks, the k-nearest neighbors algorithm is easy to understand and implement. It's ideal in situations where the data points are well-defined or non-linear.

In essence, KNN performs a voting mechanism to determine the class of an unseen observation. This means that the class with the majority vote will become the class of the data point in question.

If the value of K is equal to one, then we'll use only the nearest neighbor to determine the class of a data point. If the value of K is equal to ten, then we'll use the ten nearest neighbors, and so on. To put that into perspective, consider an unclassified data point X. There are several data points with known categories, A and B, in a scatter plot.

Suppose the data point X is placed near group A.

As you know, we classify a data point by looking at the nearest annotated points. If the value of K is equal to one, then we'll use only one nearest neighbor to determine the group of the data point.

In this case, the data point X belongs to group A as its nearest neighbor is in the same group. If group A has more than ten data points and the value of K is equal to 10, then the data point X will still belong to group A as all its nearest neighbors are in the same group.

Suppose another unclassified data point, Y, is placed between group A and group B. If K is equal to 10, we pick the group that gets the most votes, meaning that we classify Y as the group which it has the most number of neighbors. For example, if Y has seven neighbors in group B and three neighbors in group A, it belongs to group B.

The fact that the classifier assigns the category with the highest number of votes is true regardless of the number of categories present.

You might be wondering how the distance metric is calculated to determine whether a data point is a neighbor.

There are four ways to calculate the distance between the data point and its nearest neighbor: Euclidean distance, Manhattan distance, Hamming distance, and Minkowski distance. Out of the three, Euclidean distance is the most commonly used distance function or metric.

K-nearest neighbor algorithm pseudocode

Programming languages like Python and R are used to implement the KNN algorithm. The following is the pseudocode for KNN:

  1. Load the data
  2. Choose K value
  3. For each data point in the data:
    • Find the Euclidean distance to all training data samples
    • Store the distances on an ordered list and sort it
    • Choose the top K entries from the sorted list
    • Label the test point based on the majority of classes present in the selected points
  4. End

To validate the accuracy of the KNN classification, a confusion matrix is used. Statistical methods, such as the likelihood-ratio test, are also used for validation.

In the case of KNN regression, the majority of steps are the same. Instead of assigning the class with the highest votes, the average of the neighbors’ values is calculated and assigned to the unknown data point.

Geometrical distances used in the KNN algorithm 

KNN model uses a standardized geometrical approach to identify the category of the input. 

  • Euclidean distance: This is the distance between the input variable and the feature dataset that has been predetermined. The superimposition of these data points in a hyperplane gives us an idea of Euclidean distance. In other words, you can imagine a 3D plane on top of the original dataset where the distance between variables lies in a straight line.
  •  Manhattan distance: Manhattan distance is a metric that tells you the distance traveled by a particular object rather than calculating the difference between two points.
  • Minkowski distance: It is a common data-analysis distance metric that can be a combination of the terms mentioned above.
  • Hamming distance: Hamming distance is used to compare two binary arrays of data, by calculating the difference between the bits positions of two strings. It is used to calculate distance between two new words, that are fixed in length.

Why use the KNN algorithm?

Classification is a critical problem in data science and machine learning. The KNN is one of the oldest yet accurate algorithms for pattern classification and text recognition.

Here are some of the areas where the k-nearest neighbor algorithm can be used:

  • Credit rating: The KNN algorithm helps determine an individual's credit rating by comparing them with the ones with similar characteristics.
  • Loan approval: Similar to credit rating, the k-nearest neighbor algorithm is beneficial in identifying individuals who are more likely to default on loans by comparing their traits with similar individuals.
  • Data preprocessing: Datasets can have many missing values. The KNN algorithm is used for a process called missing data imputation that estimates the missing values.
  • Pattern recognition: The ability of the KNN algorithm to identify patterns creates a wide range of applications. For example, it helps detect patterns in credit card usage and spot unusual patterns. Pattern detection is also useful in identifying patterns in customer purchase behavior.
  • Stock price prediction: Since the KNN algorithm has a flair for predicting the values of unknown entities, it's useful in predicting the future value of stocks based on historical data.
  • Recommendation systems: Since KNN can help find users of similar characteristics, it can be used in recommendation systems. For example, it can be used in an online video streaming platform to suggest content a user is more likely to watch by analyzing what similar users watch.
  • Computer vision: The KNN algorithm is used for image classification. Since it’s capable of grouping similar data points, for example, grouping cats together and dogs in a different class, it’s useful in several computer vision applications.
  • KNN in data mining: The KNN classifier is used to identify what cluster a particular data point belongs to by calculating the value of nearby data vectors. Based on the similarities between the two vectors, it classifies the input vector to some value or some predefined variable.

How to choose the optimal value of K

There isn't a specific way to determine the best K value – in other words – the number of neighbors in KNN. This means you might have to experiment with a few values before deciding which one to go forward with.

One way to do this is by considering (or pretending) that a part of the training samples is "unknown". Then, you can categorize the unknown data in the test set by using the k-nearest neighbors algorithm and analyze how good the new categorization is by comparing it with the information you already have in the training data.

When dealing with a two-class problem, it's better to choose an odd value for K. Otherwise, a scenario can arise where the number of neighbors in each class is the same. Also, the value of K must not be a multiple of the number of classes present.

Another way to choose the optimal value of K is by calculating the sqrt(N), where N denotes the number of samples in the training data set.

However, K with lower values, such as K=1 or K=2, can be noisy and subjected to the effects of outliers. The chance of overfitting is also high in such cases.

On the other hand, K with larger values, in most cases, will give rise to smoother decision boundaries, but it shouldn't be too large. Otherwise, groups with a fewer number of data points will always be outvoted by other groups. Plus, a larger K will be computationally expensive.

Advantages and disadvantages of KNN

One of the most significant advantages of using the KNN algorithm is that there's no need to build a model or tune several parameters. Since it's a lazy learning algorithm and not an eager learner, there's no need to train the model; instead, all data points are used at the time of prediction.

Of course, that's computationally expensive and time-consuming. But if you've got the needed computational resources, you can use KNN for solving regression and classification problems. Albeit, there are several faster algorithms out there that can produce accurate predictions.

Here are some of the advantages of using the k-nearest neighbors algorithm:

  • It's easy to understand and simple to implement
  • It can be used for both classification and regression problems
  • It's ideal for non-linear data since there's no assumption about underlying data
  • It can naturally handle multi-class cases
  • It can perform well with enough representative data

Of course, KNN isn't a perfect machine learning algorithm. Since the KNN predictor calculates everything from the ground up, it might not be ideal for large data sets.

Here are some of the disadvantages of using the k-nearest neighbors algorithm:

  • Associated computation cost is high as it stores all the training data
  • Requires high memory storage
  • Need to determine the value of K
  • Prediction is slow if the value of N is high
  • Sensitive to irrelevant features

KNN and the curse of dimensionality

When you have massive amounts of data at hand, it can be quite challenging to extract quick and straightforward information from it. For that, we can use dimensionality reduction algorithms that, in essence, make the data "get directly to the point".

The term "curse of dimensionality" might give off the impression that it's straight out from a sci-fi movie. But what it means is that the data has too many features.

If data has too many features, then there's a high risk of overfitting the model, leading to inaccurate models. Too many dimensions also make it harder to group data as every data sample in the dataset will appear equidistant from each other.

The k-nearest neighbors algorithm is highly susceptible to overfitting due to the curse of dimensionality. However, this problem can be resolved with the brute force implementation of the KNN algorithm. But it isn't practical for large datasets.

KNN doesn't work well if there are too many features. Hence, dimensionality reduction techniques like principal component analysis (PCA) and feature selection must be performed during the data preparation phase.

KNN: the lazy algorithm that won hearts

Despite being the laziest among algorithms, KNN has built an impressive reputation and is a go-to algorithm for several classification and regression problems. Of course, due to its laziness, it might not be the best choice for cases involving large data sets. But it's one of the oldest, simplest, and most accurate algorithms.

Training and validating an algorithm with a limited amount of data can be a Herculean task. But there's a way to do it efficiently. It's called cross-validation and involves reserving a part of the training data as the test data set.

machine learning Machines are learning faster

Unlike humans, machines don’t get tired of learning and are quick at it. Learn more about machine learning and how it works.

K Nearest Neighbor or KNN Algorithm And It's Essence in ML K-nearest neighbor is a supervised machine learning algorithm, also known as a lazy algorithm, used for classification and regression problems. Learn more. https://learn.g2.com/hubfs/k%20nearest%20neighbor.png
Amal Joby Amal is a Research Analyst at G2 researching the cybersecurity, blockchain, and machine learning space. He's fascinated by the human mind and hopes to decipher it in its entirety one day. In his free time, you can find him reading books, obsessing over sci-fi movies, or fighting the urge to have a slice of pizza. https://learn.g2.com/hubfs/_Logos/Amal%20JUpdated.jpeg https://www.linkedin.com/in/amal-joby/