Algorithms drive the machine learning world.
They're often praised for their predictive capabilities and spoken of as hard workers that consume huge amounts of data to produce instant results.
Among them, there's an algorithm often labeled as lazy. But it's quite a performer when it comes to classifying data points. It's called the k-nearest neighbors algorithm, and it's often cited as one of the most important machine learning algorithms.
The k-nearest neighbors (KNN) algorithm is a data classification method that estimates how likely a data point is to belong to one group or another based on which groups the data points nearest to it belong to.
The k-nearest neighbor algorithm is a supervised machine learning algorithm used to solve classification and regression problems, though it's mainly used for classification. A simple KNN example would be feeding the classifier a training dataset of cat and dog images and then testing it on an unseen input image. Based on the input's similarity to the two animal groups in the training data, the KNN classifier would predict whether the object in the image is a dog or a cat.
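To make that concrete, here's a minimal sketch using scikit-learn's KNeighborsClassifier. The two-number feature vectors standing in for cat and dog images are made up for illustration; real images would first be converted into numeric features.

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy feature vectors standing in for processed cat/dog images (illustrative values).
X_train = [[4.0, 9.1], [3.5, 8.7], [9.2, 2.1], [8.8, 2.5]]
y_train = ["cat", "cat", "dog", "dog"]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)            # "training" just stores the data

print(knn.predict([[3.9, 8.9]]))     # -> ['cat'], since its nearest neighbors are cats
```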
KNN is a lazy learning and non-parametric algorithm.
It's called a lazy learning algorithm, or lazy learner, because it doesn't perform any training when you supply the training data. Instead, it just stores the data during training and performs no calculations; it doesn't build a model until a query is made against the dataset. This makes KNN well suited to data mining.
Did you know? The "K" in KNN is a parameter that determines the number of nearest neighbors to include in the voting process.
It's considered a non-parametric method because it doesn’t make any assumptions about the underlying data distribution. Simply put, KNN tries to determine what group a data point belongs to by looking at the data points around it.
Consider there are two groups, A and B.
To determine whether a data point is in group A or group B, the algorithm looks at the states of the data points near it. If the majority of data points are in group A, it's very likely that the data point in question is in group A and vice versa.
In short, KNN involves classifying a data point by looking at the nearest annotated data point, also known as the nearest neighbor.
Don't confuse KNN classification with k-means clustering. KNN is a supervised classification algorithm that classifies new data points based on the nearest data points, whereas k-means clustering is an unsupervised clustering algorithm that groups data into K clusters.
As mentioned above, the KNN algorithm is predominantly used as a classifier. Let's take a look at how KNN works to classify unseen input data points.
Unlike classification using artificial neural networks, the k-nearest neighbors algorithm is easy to understand and implement. It's well suited to situations where the data points are well-defined or the class boundaries are non-linear.
In essence, KNN performs a voting mechanism to determine the class of an unseen observation. This means that the class with the majority vote will become the class of the data point in question.
If the value of K is equal to one, then we'll use only the nearest neighbor to determine the class of a data point. If the value of K is equal to ten, then we'll use the ten nearest neighbors, and so on. To put that into perspective, consider an unclassified data point X. There are several data points with known categories, A and B, in a scatter plot.
Suppose the data point X is placed near group A.
As you know, we classify a data point by looking at the nearest annotated points. If the value of K is equal to one, then we'll use only one nearest neighbor to determine the group of the data point.
In this case, the data point X belongs to group A as its nearest neighbor is in the same group. If group A has more than ten data points and the value of K is equal to 10, then the data point X will still belong to group A as all its nearest neighbors are in the same group.
Suppose another unclassified data point, Y, is placed between group A and group B. If K is equal to 10, we pick the group that gets the most votes, meaning we assign Y to the group in which it has the most neighbors. For example, if Y has seven neighbors in group B and three neighbors in group A, it belongs to group B.
The fact that the classifier assigns the category with the highest number of votes is true regardless of the number of categories present.
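The voting step itself is just a frequency count. Here's a tiny sketch of the Y example above in Python, with the neighbor labels hard-coded to the illustrative seven-versus-three split:

```python
from collections import Counter

# Hypothetical labels of Y's ten nearest neighbors: seven in group B, three in group A.
neighbor_labels = ["B"] * 7 + ["A"] * 3

votes = Counter(neighbor_labels)
print(votes.most_common(1)[0][0])  # -> 'B', the class with the most votes
```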
You might be wondering how the distance metric is calculated to determine whether a data point is a neighbor.
There are four common ways to calculate the distance between a data point and its potential neighbors: Euclidean distance, Manhattan distance, Hamming distance, and Minkowski distance. Of the four, Euclidean distance is the most commonly used distance function or metric.
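As a rough sketch, here's how those distance functions look for plain Python lists of numbers. The function names are our own, and libraries like SciPy and scikit-learn provide tested implementations you'd normally use instead.

```python
import math

def euclidean(a, b):
    # Straight-line distance: square root of the sum of squared differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # "City block" distance: sum of absolute differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p=3):
    # Generalizes the two above: p=1 gives Manhattan, p=2 gives Euclidean.
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def hamming(a, b):
    # For categorical or binary features: the number of mismatched positions.
    return sum(x != y for x, y in zip(a, b))

print(euclidean([1, 2], [4, 6]))  # -> 5.0
```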
Programming languages like Python and R are used to implement the KNN algorithm. In pseudocode, KNN boils down to the following steps: load the training data, choose a value for K, compute the distance from the query point to every training point, select the K points with the smallest distances, and assign the query the majority class among those neighbors.
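Here's a minimal from-scratch sketch of that procedure in Python. The function and variable names are our own, and the toy data is purely illustrative.

```python
import math
from collections import Counter

def knn_classify(query, X_train, y_train, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # 1. Compute the distance from the query to every training point.
    distances = [(math.dist(query, x), label) for x, label in zip(X_train, y_train)]
    # 2. Keep the k training points with the smallest distances.
    k_nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # 3. Return the label that appears most often among those neighbors.
    votes = Counter(label for _, label in k_nearest)
    return votes.most_common(1)[0][0]

X_train = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y_train = ["A", "A", "A", "B", "B", "B"]
print(knn_classify([2, 2], X_train, y_train, k=3))  # -> 'A'
```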
To validate the accuracy of the KNN classification, a confusion matrix is used. Statistical methods, such as the likelihood-ratio test, are also used for validation.
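With scikit-learn, for example, computing a confusion matrix from the true and predicted labels takes one call. The label vectors below are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

y_true = ["cat", "cat", "dog", "dog", "dog"]  # actual labels (illustrative)
y_pred = ["cat", "dog", "dog", "dog", "cat"]  # labels predicted by the classifier

# Rows correspond to the true classes, columns to the predicted classes.
print(confusion_matrix(y_true, y_pred, labels=["cat", "dog"]))
# [[1 1]
#  [1 2]]
```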
In the case of KNN regression, the majority of steps are the same. Instead of assigning the class with the highest votes, the average of the neighbors’ values is calculated and assigned to the unknown data point.
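A regression variant of the earlier sketch only changes that final step. Again, the names and toy data are illustrative.

```python
import math

def knn_regress(query, X_train, y_train, k=3):
    """Predict a numeric value as the average of the k nearest neighbors' values."""
    distances = sorted((math.dist(query, x), y) for x, y in zip(X_train, y_train))
    k_nearest_values = [y for _, y in distances[:k]]
    return sum(k_nearest_values) / k

X_train = [[1], [2], [3], [10], [11]]
y_train = [1.0, 2.1, 2.9, 10.2, 11.1]
print(knn_regress([2.5], X_train, y_train, k=3))  # -> 2.0, the mean of 1.0, 2.1, and 2.9
```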
The KNN model uses a straightforward geometric approach to identify the category of the input: it measures how close the input sits to labeled examples in feature space.
Classification is a critical problem in data science and machine learning. KNN is one of the oldest yet still accurate algorithms for pattern classification and text recognition.
Here are some of the areas where the k-nearest neighbor algorithm can be used:
There isn't a specific way to determine the best K value – in other words – the number of neighbors in KNN. This means you might have to experiment with a few values before deciding which one to go forward with.
One way to do this is by considering (or pretending) that a part of the training samples is "unknown". Then, you can categorize the unknown data in the test set by using the k-nearest neighbors algorithm and analyze how good the new categorization is by comparing it with the information you already have in the training data.
When dealing with a two-class problem, it's better to choose an odd value for K. Otherwise, a scenario can arise where the number of neighbors in each class is the same. Also, the value of K must not be a multiple of the number of classes present.
Another way to choose the optimal value of K is by calculating the sqrt(N), where N denotes the number of samples in the training data set.
However, lower values of K, such as K=1 or K=2, can be noisy and subject to the effects of outliers. The chance of overfitting is also high in such cases.
On the other hand, larger values of K will in most cases give rise to smoother decision boundaries, but K shouldn't be too large. Otherwise, groups with fewer data points will always be outvoted by larger groups. Plus, a larger K is more computationally expensive.
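One common way to search for K is to score a range of candidate values on held-out data, sketched here with scikit-learn's cross-validation utilities on the built-in Iris dataset. The candidate range is an arbitrary choice for illustration.

```python
import math
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Rule-of-thumb starting point: K near sqrt(N), nudged to an odd number.
k_start = int(math.sqrt(len(X))) | 1
print("sqrt(N) heuristic suggests K of about", k_start)

# Try a range of odd K values and keep the one with the best validation score.
scores = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    for k in range(1, 32, 2)
}
best_k = max(scores, key=scores.get)
print("best K by 5-fold cross-validation:", best_k)
```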
One of the most significant advantages of using the KNN algorithm is that there's no need to build a model or tune several parameters. Since it's a lazy learning algorithm and not an eager learner, there's no need to train the model; instead, all data points are used at the time of prediction.
Of course, that's computationally expensive and time-consuming. But if you've got the needed computational resources, you can use KNN for solving regression and classification problems. That said, there are several faster algorithms out there that can produce accurate predictions.
Here are some of the advantages of using the k-nearest neighbors algorithm:
Of course, KNN isn't a perfect machine learning algorithm. Since the KNN predictor computes everything from scratch at prediction time, it might not be ideal for large datasets.
Here are some of the disadvantages of using the k-nearest neighbors algorithm:
When you have massive amounts of data at hand, it can be quite challenging to extract quick and straightforward information from it. For that, we can use dimensionality reduction algorithms that, in essence, make the data "get directly to the point".
The term "curse of dimensionality" might give off the impression that it's straight out from a sci-fi movie. But what it means is that the data has too many features.
If data has too many features, then there's a high risk of overfitting the model, leading to inaccurate predictions. Too many dimensions also make it harder to group data, as every data sample in the dataset appears roughly equidistant from every other.
The k-nearest neighbors algorithm is highly susceptible to overfitting due to the curse of dimensionality. A brute-force implementation of KNN, which compares the query against every training point rather than relying on index structures that degrade in high dimensions, can help with the lookup side of the problem, but it isn't practical for large datasets.
KNN doesn't work well if there are too many features. Hence, dimensionality reduction techniques like principal component analysis (PCA) and feature selection should generally be performed during the data preparation phase.
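One practical way to do that is to chain the reduction step and the classifier together, sketched here with a scikit-learn pipeline. The number of retained components is an illustrative choice, not a recommendation.

```python
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

model = make_pipeline(
    StandardScaler(),               # distances are scale-sensitive, so standardize first
    PCA(n_components=10),           # keep only the 10 strongest directions of variance
    KNeighborsClassifier(n_neighbors=5),
)
# model.fit(X_train, y_train) and model.predict(X_test) then work as usual.
```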
Despite being the laziest among algorithms, KNN has built an impressive reputation and is a go-to algorithm for several classification and regression problems. Of course, due to its laziness, it might not be the best choice for cases involving large data sets. But it's one of the oldest, simplest, and most accurate algorithms.
Training and validating an algorithm with a limited amount of data can be a Herculean task. But there's a way to do it efficiently. It's called cross-validation and involves reserving a part of the training data as the test data set.