Supervised and Unsupervised Learning Explained

Sean Eugene Chua
Cantor’s Paradise
Sep 13, 2021

Image taken from: https://hurwitz.com/dev/wp-content/uploads/2018/02/Machine-Learning-Application-Studying.png

In machine learning, there are two things to consider when creating a model. The first is the type of data you feed the model, and the second is the kind of model you use. Without good, appropriate data, your model will not produce useful output or insights, and without the right model, the data is ultimately rendered useless. This raises the question: “How do I know what kind of data I have? With my data, what model should I choose?” The answer lies in the fundamental concepts of supervised and unsupervised learning.

What is Supervised Learning?

As the name suggests, supervised learning is when an ML model learns under “supervision”: it is fed training data in which every input comes paired with the correct output, or label. The algorithm “learns” the relationship between inputs and labels so that it can make predictions or estimates for inputs it has not seen before.

Supervised learning has one of two possible goals: classification or regression. In a classification problem, the machine learning algorithm has to assign data points to the correct category. For example, one common classification problem is distinguishing between pictures of cats and dogs. To solve this, the machine must first be fed pictures of cats and dogs with their corresponding “labels.” Using these labels, the algorithm can eventually determine, with relatively high accuracy, whether a picture it hasn’t seen before contains a cat or a dog.
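To make the idea of learning from labels concrete, here is a minimal sketch of a classifier. Real image classifiers work on pixels and are far more complex, so I am assuming each animal has already been reduced to two made-up measurements (weight and height). The data, the function names, and the choice of a nearest-centroid rule are all illustrative assumptions, not a standard recipe.

```python
from math import dist

# Hypothetical labeled training data: (weight in kg, height in cm) -> label.
training_data = [
    ((4.0, 25.0), "cat"),
    ((3.5, 23.0), "cat"),
    ((5.0, 26.0), "cat"),
    ((20.0, 55.0), "dog"),
    ((30.0, 60.0), "dog"),
    ((25.0, 58.0), "dog"),
]

def fit_centroids(samples):
    """'Training': average the feature vectors of each label into one centroid."""
    groups = {}
    for features, label in samples:
        groups.setdefault(label, []).append(features)
    return {
        label: tuple(sum(axis) / len(rows) for axis in zip(*rows))
        for label, rows in groups.items()
    }

def predict(centroids, features):
    """Assign the label whose centroid is closest to the new data point."""
    return min(centroids, key=lambda label: dist(centroids[label], features))

centroids = fit_centroids(training_data)
small_animal = predict(centroids, (4.2, 24.0))   # an unseen, cat-sized animal
large_animal = predict(centroids, (28.0, 57.0))  # an unseen, dog-sized animal
print(small_animal, large_animal)
```

The labels are what make this supervised: without them, `fit_centroids` would have no way of knowing which points belong together.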

On the other hand, in a regression problem, the machine learning algorithm has to predict a continuous numerical value based on the trends in existing data, such as estimating a house’s price from its size. Common methods include linear regression and multiple regression. Logistic regression, despite its name, predicts the probability of a category and is therefore normally used for classification. I cover these methods in more detail in my previous articles.
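As a rough illustration of regression, the sketch below fits a straight line to a handful of points using ordinary least squares with a single input. The data (hours studied versus exam score) is invented and deliberately noise-free so the arithmetic is easy to follow, and the function name `fit_line` is my own.

```python
# Hypothetical, noise-free data: hours studied (x) and exam score (y).
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 54.0, 56.0, 58.0, 60.0]

def fit_line(xs, ys):
    """Ordinary least squares for one feature: y is approximated by slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(hours, scores)

# Predict a continuous value for an input the model has not seen.
predicted_score = slope * 6.0 + intercept
print(slope, intercept, predicted_score)
```

Unlike the cat-or-dog case, the output here is a number on a continuous scale rather than one of a fixed set of categories.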

What is Unsupervised Learning?

In contrast to supervised learning, unsupervised learning is when an ML model is fed training data that does not contain labels, so that it can find patterns or associations within the dataset. In short, we do not have full control over the ML model’s output because we rely on it to deduce such patterns in the data on its own.

Examples of unsupervised learning techniques include cluster analysis and anomaly detection. In cluster analysis, the algorithm “groups” together data points that are close to one another. One of the most common clustering algorithms is k-means clustering, which divides the data into k groups, each represented by a “center” (the mean of its points). The algorithm repeatedly assigns every point to its nearest center and then moves each center to the mean of its assigned points. The resulting regions form a Voronoi diagram, which partitions a plane into the regions closest to each center. In anomaly detection, the algorithm aims to detect outliers within a set of data points. This is extremely helpful when analyzing large quantities of data, as humans could not perform this task efficiently enough at that scale.
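The k-means grouping described above can be sketched in a few lines. This is a bare-bones version of the usual assign-then-update loop, under some simplifying assumptions of mine: the points are invented, and the starting centers are picked by hand (real implementations usually choose them randomly or with a smarter seeding scheme).

```python
from math import dist

def k_means(points, initial_centers, max_iterations=10):
    """Alternate between assigning points to centers and moving the centers."""
    centers = list(initial_centers)
    clusters = []
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for point in points:
            nearest = min(range(len(centers)), key=lambda i: dist(point, centers[i]))
            clusters[nearest].append(point)
        # Update step: move each center to the mean of its cluster.
        # An empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else center
            for center, cluster in zip(centers, clusters)
        ]
        if new_centers == centers:
            break  # nothing moved, so the grouping has settled
        centers = new_centers
    return centers, clusters

# Hypothetical unlabeled data with two visually obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, clusters = k_means(points, initial_centers=[(1.0, 1.0), (9.0, 8.5)])
print(centers)
print(clusters)
```

Notice that no labels appear anywhere: the algorithm is only told how many groups (k = 2 here) to look for.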

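A common technique in anomaly detection is based on k-nearest neighbors (k-NN). k-NN is best known as a supervised classifier that labels a data point according to its closest neighbors, optionally applying “weights” that quantify how similar the point is to each of them. For anomaly detection, the same distances are used without labels: a point that lies far from its k nearest neighbors is flagged as an outlier.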
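One simple way to use nearest neighbors for anomaly detection is to score every point by its average distance to its k nearest neighbors and treat the highest scores as outliers. The sketch below assumes that particular variant; the sensor-style readings and the function name `knn_outlier_scores` are hypothetical.

```python
from math import dist

def knn_outlier_scores(points, k=2):
    """Score each point by its mean distance to its k nearest neighbors."""
    scores = []
    for i, point in enumerate(points):
        # Distances to every other point, smallest first.
        distances = sorted(dist(point, other) for j, other in enumerate(points) if j != i)
        scores.append(sum(distances[:k]) / k)
    return scores

# Hypothetical readings: four similar points and one that sits far away.
readings = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (6.0, 6.0)]
scores = knn_outlier_scores(readings, k=2)

# The point with the largest score is the most isolated one.
most_anomalous = readings[scores.index(max(scores))]
print(most_anomalous)
```

In practice you would still have to decide how large a score counts as an anomaly, which depends on the data at hand.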

Difficulties of Supervised and Unsupervised Learning

Models used for supervised learning are often time-consuming to train, as the machine needs to analyze every single data point it is fed. In addition, the label for each data point in the training set must be correct; otherwise, the machine might make undesired predictions due to inaccurate data!

The output of unsupervised learning must be carefully checked for inconsistencies, both in the results and in the data itself, because there are no labels to compare against. If problems do arise, resolving them can be extremely costly and tedious. In addition, because of the nature of the data, many algorithms used in unsupervised learning can be memory-intensive and much harder to scale to larger and larger datasets.

The Middle Ground: Semi-Supervised Learning

In between supervised and unsupervised learning is semi-supervised learning. In semi-supervised learning, only some data points are labeled while the rest are unlabeled. This is often more favorable than fully supervised or unsupervised learning, because labels are needed for only part of the data. One common approach is to first train an ML model only on the labeled data and then use it to predict labels for the unlabeled data. These predictions form a “pseudo-labeled” data set, which can then be used to train and improve the ML model further.
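Here is a minimal sketch of one round of pseudo-labeling. It makes several assumptions of my own: the “model” is just one average point (centroid) per label, there are at least two labels, and a prediction only counts as confident when the second-closest centroid is at least `margin` times farther away than the closest one. The data and function names are made up for illustration.

```python
from math import dist

def centroids_of(labeled):
    """Mean feature vector per label (a very simple stand-in for training a model)."""
    groups = {}
    for features, label in labeled:
        groups.setdefault(label, []).append(features)
    return {
        label: tuple(sum(axis) / len(rows) for axis in zip(*rows))
        for label, rows in groups.items()
    }

def pseudo_label(labeled, unlabeled, margin=2.0):
    """Label only those unlabeled points the current model is confident about."""
    centroids = centroids_of(labeled)
    newly_labeled, still_unlabeled = [], []
    for features in unlabeled:
        ranked = sorted(centroids, key=lambda label: dist(centroids[label], features))
        best, runner_up = ranked[0], ranked[1]
        # Confident if the runner-up centroid is at least `margin` times farther away.
        if dist(centroids[runner_up], features) >= margin * dist(centroids[best], features):
            newly_labeled.append((features, best))
        else:
            still_unlabeled.append(features)
    return labeled + newly_labeled, still_unlabeled

# Hypothetical data: two labeled points and three unlabeled ones.
labeled = [((1.0, 1.0), "a"), ((9.0, 9.0), "b")]
unlabeled = [(1.5, 1.2), (8.5, 9.2), (5.0, 5.0)]

expanded, remaining = pseudo_label(labeled, unlabeled)
print(expanded)
print(remaining)
```

The expanded set can then be fed back in for another round, while points the model is unsure about (such as the one halfway between both groups) stay unlabeled.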

Epilogue

In this article, we discussed supervised, unsupervised, and semi-supervised learning. Real-world machine learning problems entail very specific types of data for a specific purpose. It is up to data scientists and machine learning experts to discern whether supervised, unsupervised, or semi-supervised learning is most appropriate.

I am in no way, shape, or form a machine learning expert. However, as fundamental concepts of machine learning, supervised and unsupervised learning are vital in choosing the right model to use and the type of data to gather. That being said, I hope that this brief explanation helps introduce you to the ML world. Thanks for reading!
