There’s a high chance that you have faced situations where your Machine Learning model fits the existing training data well but fails to predict accurately when it comes to testing data. This situation is quite common and can be effectively combatted using regularisation in machine learning.
Machine Learning professionals are familiar with something called overfitting. When an ML model learns the specific patterns and the noise in its training data to the point that it loses the ability to generalise to new data, it is called overfitting. In the IT industry and Machine Learning domain, better performance of a machine learning model is directly connected to the prevention of overfitting.
Before getting into what regularisation in Machine Learning is, we need to know what Underfitting and Overfitting are.
Machine Learning models learn and train through data that is fed to them. This is when a process known as data fitting comes into play. Data fitting is the process where the ML professional plots several data points and draws out the most relevant line that helps them understand the existing relationship between multiple variables. An ML model works best when it identifies all relevant patterns that exist and avoids the random data points or unnecessary patterns that are termed noise.
If a Machine Learning model is allowed to view the training data too many times, it picks up multiple data patterns, including several irrelevant or unnecessary ones. The model learns quickly from the training dataset and fits it very well. It learns the critical data patterns while also picking up the noise within the data. As a result, it fails to predict or identify patterns within other datasets.
In this situation, the model attempts to learn every detail and the noise in the data while trying to fit all data points on the curve. This is known as Overfitting. In the opposite situation, where an ML model is not allowed to view the data sufficiently, the model will not accurately identify the patterns in the training dataset. As a result, it won’t fit the training dataset properly and will fail with new data as well. A situation where the Machine Learning model neither captures the existing relationship between the variables in the training data nor predicts or classifies new data is known as Underfitting.
Instances of Bias occur when algorithms have limited flexibility when it comes to learning from data. Models built with these algorithms don’t pay much attention to the training data and are oversimplified, so the prediction (validation) error and the training error follow a similar pattern. Such a Machine Learning model leads to a higher error rate for both test and training data. A higher bias results in underfitting within the ML model.
Variance defines the sensitivity of an algorithm to specific data sets. An ML model with higher variance focuses on the training data without generalising. Thus, the training error and the prediction (validation) error are far apart from one another. Such ML models generally perform well on training data, but show high rates of error on test data. A higher variance results in overfitting within the ML model.
An optimised Machine Learning model is sensitive to existing data patterns while still generalising to new data. This can only happen when Variance and Bias are balanced. This is known as the Bias-Variance Tradeoff, which Regularisation helps you achieve by controlling how complex the model is allowed to become.
When Bias remains high, both training and testing dataset errors are also high. In a high Variance situation, the model reports lower errors on training datasets, but high errors in testing data.
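The error patterns described above can be reproduced with a small experiment. The following is only a sketch under stated assumptions: the data is synthetic (a hypothetical ground truth of y = 2x + 1 plus noise), and the three models (`fit_constant`, `fit_line`, `fit_interpolant`) are illustrative names chosen here, not part of any library. The constant model stands in for high bias, and the polynomial that passes through every training point stands in for high variance.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

def make_data(xs):
    # Hypothetical ground truth: y = 2x + 1 plus Gaussian noise
    return [(x, 2 * x + 1 + random.gauss(0, 0.5)) for x in xs]

train = make_data([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
test = make_data([0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5])

def mse(model, data):
    # Mean squared error of a model (a function of x) on a dataset
    return sum((y - model(x)) ** 2 for x, y in data) / len(data)

def fit_constant(data):
    # High bias: ignores x and always predicts the mean of y
    mean_y = sum(y for _, y in data) / len(data)
    return lambda x: mean_y

def fit_line(data):
    # Ordinary least-squares straight line
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in data)
             / sum((x - mean_x) ** 2 for x, _ in data))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def fit_interpolant(data):
    # High variance: Lagrange polynomial through every training point,
    # so it reproduces the noise in the training data exactly
    def model(x):
        total = 0.0
        for i, (xi, yi) in enumerate(data):
            term = yi
            for k, (xk, _) in enumerate(data):
                if k != i:
                    term *= (x - xk) / (xi - xk)
            total += term
        return total
    return model

results = {}
for name, fit in [("constant", fit_constant),
                  ("line", fit_line),
                  ("interpolant", fit_interpolant)]:
    model = fit(train)
    results[name] = (mse(model, train), mse(model, test))
    print(f"{name:12s} train MSE = {results[name][0]:.4f}   "
          f"test MSE = {results[name][1]:.4f}")
```

The constant model should report a high error on both sets, while the interpolating polynomial reports a training error of zero and a test error that is usually noticeably larger. The exact figures depend on the random noise, so treat them as indicative only.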
Typically, regularisation means making something acceptable or ‘regular’. In the context of Overfitting, it refers to the process of regularising, or shrinking, the model’s coefficients towards zero. In other words, regularisation prevents overfitting by discouraging the learning of highly complex models.
The fundamental idea of regularisation is penalising complex ML models or adding terms for complexity that result in larger losses for complex ML models. Consider the following relationship for linear regression for a better understanding:
$$Y \approx W_0 + W_1 X_1 + W_2 X_2 + \dots + W_P X_P$$
Here,
$Y$ = the value that should be predicted
$X_1, \dots, X_P$ = the features that decide the value of $Y$
$W_1, W_2, \dots, W_P$ = weights attached to all the respective $X$ features
$W_0$ = the bias
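As a minimal sketch of the relationship above, the prediction for one sample is the bias plus the weighted sum of its features. The function name `predict` and the numbers used are hypothetical, chosen only for illustration.

```python
def predict(weights, bias, features):
    """Return W_0 + W_1*X_1 + ... + W_P*X_P for a single sample."""
    if len(weights) != len(features):
        raise ValueError("need exactly one weight per feature")
    return bias + sum(w * x for w, x in zip(weights, features))

# Hypothetical model with P = 3 features
y_hat = predict(weights=[2.0, -1.0, 0.5], bias=1.0, features=[3.0, 4.0, 2.0])
print(y_hat)  # 1.0 + 6.0 - 4.0 + 1.0 = 4.0
```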
If you wish to fit an ML model which predicts Y accurately, you need a loss function, and you need to find the parameters (the weights and the bias) that minimise it.
Generally, the loss function used for linear regression is known as the RSS (Residual Sum of Squares). For the linear regression stated above, with $m$ training samples and $n$ features ($n$ plays the role of $P$ in the relationship above), it is:
$$\mathrm{RSS} = \sum_{j=1}^{m} \Big( Y_j - W_0 - \sum_{i=1}^{n} W_i X_{ji} \Big)^2$$
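To make the formula concrete, here is a small sketch that computes the RSS over a toy dataset. The function name `rss` and the data are assumptions made for illustration; each row of `X` is one sample and each column one feature.

```python
def rss(samples, targets, weights, bias):
    """Residual Sum of Squares of a linear model over a dataset."""
    total = 0.0
    for features, y in zip(samples, targets):
        prediction = bias + sum(w * x for w, x in zip(weights, features))
        total += (y - prediction) ** 2  # squared residual for this sample
    return total

# Hypothetical data: m = 3 samples, n = 2 features
X = [[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]]
Y = [5.0, 5.0, 9.0]

print(rss(X, Y, weights=[2.0, 1.0], bias=1.0))  # residuals 0, 0, 1 -> 1.0
print(rss(X, Y, weights=[0.0, 0.0], bias=0.0))  # 25 + 25 + 81 -> 131.0
```

The closer the weights and bias bring the predictions to the targets, the smaller the RSS, which is why linear regression treats it as the quantity to minimise.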
The RSS can also be known as the objective of linear regression without regularisation. The model learns using this function: based on the training data, the coefficients or weights get adjusted. If there is too much noise in your dataset, you will face problems like overfitting, and the estimated coefficients won’t generalise to new data. This is where ML professionals need Regularisation. It shrinks these estimates towards zero by imposing penalties on the coefficient magnitudes. This is done using two fundamental Regularisation Techniques.
For the regularisation of ML models, there are two popular techniques. These are known as Ridge Regression and Lasso Regression.
Let’s take a look at both of these in more detail.
Ridge Regression is a regularisation technique that performs L2 regularisation. This means Ridge Regression modifies the RSS by adding a shrinkage quantity, or penalty, equal to the sum of the squares of the coefficients. This looks like:
$$\sum_{j=1}^{m} \Big( Y_j - W_0 - \sum_{i=1}^{n} W_i X_{ji} \Big)^2 + \alpha \sum_{i=1}^{n} W_i^2 = \mathrm{RSS} + \alpha \sum_{i=1}^{n} W_i^2$$
This modified loss function is used to estimate the coefficients. Here, $\alpha$ (alpha) is a parameter that multiplies the shrinkage quantity. Also known as a tuning parameter, it decides how much you wish to penalise your model. In other words, the tuning parameter balances how much emphasis is given to minimising the RSS versus minimising the sum of the squares of the coefficients.
The value of $\alpha$ affects the estimates produced by Ridge Regression. Here’s how it works.
If $\alpha = 0$, the penalty term has no effect. The loss function reduces to the residual sum of squares that we initially chose, which means we’ll get the same coefficients as a simple linear regression.
If $\alpha \to \infty$, the Ridge Regression coefficients shrink to zero. This is because the penalty term dominates the original loss function, so minimising the new loss function amounts to minimising the squares of the coefficients, which drives them to 0.
If $0 < \alpha < \infty$, the Ridge Regression coefficients lie, in overall magnitude, somewhere between 0 and those of a simple linear regression.
For this reason, it is critical to select a good value for alpha. The penalty used by Ridge Regression is the squared L2 norm of the coefficient vector, which is why the technique is called L2 regularisation.
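The three cases for alpha can be checked numerically. The sketch below makes a simplifying assumption: there is only one feature, so the Ridge solution has a simple closed form, w1 = S_xy / (S_xx + alpha), where S_xy and S_xx are the centred sums of products and squares. The function name `fit_ridge_1d` and the data are hypothetical, and the bias W_0 is left unpenalised, matching the loss function above.

```python
def fit_ridge_1d(xs, ys, alpha):
    """Minimise sum((y - w0 - w1*x)^2) + alpha * w1^2 for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    w1 = s_xy / (s_xx + alpha)   # the penalty inflates the denominator
    w0 = mean_y - w1 * mean_x    # the bias W_0 is not penalised
    return w0, w1

# Hypothetical data that roughly follows y = 2x
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

for alpha in [0.0, 1.0, 10.0, 100.0, 1e9]:
    w0, w1 = fit_ridge_1d(xs, ys, alpha)
    print(f"alpha = {alpha:<12g} w0 = {w0:8.4f}   w1 = {w1:.6f}")
```

With alpha = 0 the slope equals the ordinary least-squares slope (about 1.97 for this data). As alpha grows the slope shrinks steadily towards zero, but it never becomes exactly zero for any finite alpha.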
The Lasso Regression technique performs L1 regularisation. It modifies the RSS by adding a shrinkage quantity, or penalty, equal to the sum of the absolute values of the coefficients.
$$\sum_{j=1}^{m} \Big( Y_j - W_0 - \sum_{i=1}^{n} W_i X_{ji} \Big)^2 + \alpha \sum_{i=1}^{n} |W_i| = \mathrm{RSS} + \alpha \sum_{i=1}^{n} |W_i|$$
Under the Lasso Regression method, coefficients are estimated using this new loss function. This method differs from the Ridge Regression method in that it regularises using the absolute values of the coefficients rather than their squares. With the loss function considering the absolute coefficients, the optimisation algorithm penalises high coefficients. This penalty is the L1 norm of the coefficient vector.
In this method, $\alpha$ (alpha) is again used as a tuning parameter. It functions as it does in the Ridge Regression method and controls the tradeoff between minimising the RSS and minimising the coefficient magnitudes.
Similar to the Ridge Regression technique, alpha can take several values depending on the situation.
If $\alpha = 0$, the same coefficients as the simple linear regression will be obtained.
If $\alpha \to \infty$, the Lasso Regression coefficients will all be zero.
If $0 < \alpha < \infty$, the Lasso Regression coefficients lie, in overall magnitude, somewhere between 0 and those of a simple linear regression.
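The same single-feature simplification gives a closed form for Lasso as well, which makes the three cases easy to verify. This is a sketch under that assumption: with one feature, minimising the RSS plus alpha times |w1| gives w1 = soft_threshold(S_xy, alpha / 2) / S_xx. The names `soft_threshold` and `fit_lasso_1d` and the data are hypothetical. With several features there is generally no closed form, and an iterative method such as coordinate descent is typically used instead.

```python
def soft_threshold(value, threshold):
    """Move value towards zero by threshold, clipping at exactly zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def fit_lasso_1d(xs, ys, alpha):
    """Minimise sum((y - w0 - w1*x)^2) + alpha * |w1| for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    w1 = soft_threshold(s_xy, alpha / 2) / s_xx
    w0 = mean_y - w1 * mean_x    # the bias W_0 is not penalised
    return w0, w1

# Hypothetical data that roughly follows y = 2x
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

for alpha in [0.0, 10.0, 30.0, 50.0]:
    w0, w1 = fit_lasso_1d(xs, ys, alpha)
    print(f"alpha = {alpha:<6g} w0 = {w0:8.4f}   w1 = {w1:.6f}")
```

Unlike the Ridge penalty, which only scales a coefficient down, the L1 penalty subtracts a fixed amount and clips at zero. For this data the slope is therefore exactly 0 once alpha is large enough (alpha = 50 here), rather than merely close to 0.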
At a glance, both the Ridge Regression and Lasso Regression techniques are quite similar. However, by diving a little deeper, their fundamental differences are quite obvious.
You can imagine Ridge Regression as minimising the RSS under the constraint that the sum of the coefficients’ squares is equal to or less than a constant $s$. Hence, if there are 2 parameters given within a problem, you can express the Ridge Regression constraint as:
$$W_1^2 + W_2^2 \le s$$
With the same perspective in mind, think of the Lasso Regression method as minimising the RSS under the constraint that the sum of the coefficients’ moduli is equal to or less than $s$. Hence, if there are 2 parameters given within a problem, you can express the Lasso Regression constraint as:
$$|W_1| + |W_2| \le s$$
There you have it. This is the fundamental information you should know about regularisation in Machine Learning. If you wish to know more, enrol in a training course on Koenig today and improve the accuracy of your ML regression models.
Archer Charles has top education industry knowledge with 4 years of experience and, as a passionate blogger, also blogs on the technology niche.