Understanding Overfitting in Machine Learning
Overfitting in Machine Learning
In the real world, there is never a flawless dataset. Every dataset contains impurities such as noisy data, outliers, missing values, or imbalanced classes. These impurities lead to various issues that affect the model's performance and accuracy. Overfitting is one of these issues: a statistical model is considered overfitted if it generalizes poorly to unseen data.
The following fundamental terms must be understood before we can comprehend overfitting:
Noise: The presence of irrelevant or meaningless data in a dataset. If it is not removed, it degrades the model's performance.
Bias: A prediction error introduced into the model when the machine learning algorithm oversimplifies the problem. It can also be defined as the difference between the predicted values and the actual values.
Variance: The model's sensitivity to fluctuations in the training data. A high-variance model performs well on the training dataset but poorly on the test dataset.
Generalization: How well a trained model predicts on unseen data.
What is Overfitting?
Overfitting occurs when a machine learning model learns too much from training data and performs poorly on new data. Since the model is overly complex, it “memorizes” the training data instead of learning generalizable patterns.
- Overfitting and underfitting are the most common machine learning model mistakes that cause poor performance.
- Overfitting occurs when the model fits the training data more closely than necessary, attempting to capture every data point provided to it. As a result, it begins to capture noise and inaccurate values from the dataset, which reduces the model's overall performance.
- An overfitted model performs poorly on the test/unseen dataset and cannot generalize successfully.
- An overfitted model has low bias and high variance; the polynomial-fit sketch below illustrates this pattern.
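To make this concrete, here is a minimal sketch (assuming NumPy) that fits polynomials of two different degrees to a small noisy sample; the degrees, noise level, and sample sizes are arbitrary choices for illustration. The high-degree fit typically drives the training error close to zero while the test error grows, which is exactly the low-bias/high-variance behaviour described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# The underlying relationship is quadratic; noise is added to both splits
x_train = np.linspace(0, 1, 15)
y_train = x_train ** 2 + rng.normal(scale=0.05, size=x_train.shape)
x_test = np.linspace(0, 1, 50)
y_test = x_test ** 2 + rng.normal(scale=0.05, size=x_test.shape)

for degree in (2, 10):
    # Fit a polynomial of the given degree to the 15 training points
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree={degree:2d}  train MSE={train_mse:.4f}  test MSE={test_mse:.4f}")
```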
Example to Understand Overfitting
We can explain overfitting with a general example. Consider three students, X, Y, and Z, who are preparing for an examination. X has studied only three sections of the book and left out the rest. Y has a good memory, so he has memorized the entire book. The third student, Z, has studied and practiced every question. In the exam, X will only be able to answer questions drawn from the three sections he studied. Y will only be able to answer questions that appear exactly as they do in the book. Z will be able to answer all of the exam questions correctly.
The same is true for machine learning: if the algorithm learns from a tiny portion of the data, it will be unable to capture all of the required data points and hence will be underfit.
Now assume the model learns the training dataset the way student Y did: it performs well on the data it has seen but poorly on unseen data or unknown instances. In such cases, the model is said to be overfitting.
And if the model performs well on both the training and the test/unseen datasets, as student Z did, it is considered a good fit.
How to detect Overfitting?
- Overfitting in a model can only be detected by testing it on data it has not seen. To identify the problem, we can use a train/test split.
- The train-test split lets us randomly separate the dataset into training and test subsets. About 80% of the whole dataset is used to train the model; after training, we evaluate the model on the test dataset, which accounts for the remaining 20%.
- Now, the model is probably experiencing overfitting if it works well with the training dataset but not with the test dataset.
- For instance, if the model shows 85% accuracy on the training data but only 50% on the test data, it is not generalizing correctly; a minimal version of this check is sketched below.
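As a rough sketch of this check, the snippet below uses scikit-learn's train_test_split on a synthetic dataset; the dataset and the deliberately unconstrained decision tree are arbitrary choices for illustration, not part of any particular recipe.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic dataset purely for illustration
X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# 80% of the data for training, 20% held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# An unconstrained tree tends to memorize the training set
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
# A large gap between the two numbers is a sign of overfitting
```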
Ways to Prevent Overfitting:
Even though overfitting is a machine learning issue that lowers the model's performance, there are a number of techniques to avoid it. Although a linear model helps avoid overfitting, many real-world problems are nonlinear, so preventing more flexible models from overfitting is crucial.
Below are several ways that can be used to prevent overfitting:
- Early Stopping
- Train with more data
- Feature Selection
- Cross-Validation
- Data Augmentation
- Regularization
Early Stopping:
- This strategy stops training before the model begins to learn noise. The model is trained iteratively, and its performance is evaluated on held-out data at the end of each iteration; training continues as long as performance keeps improving, up to a predetermined number of iterations.
- After that point, the model starts to overfit the training data, so we have to halt the process before the learner reaches it.
- Early stopping is the practice of ending the training process before the model begins to extract noise from the data, as sketched below.
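Below is a minimal sketch of the patience idea, using scikit-learn's SGDClassifier so that each partial_fit call acts as one training pass over the data; the patience of 5 and the limit of 100 epochs are arbitrary illustrative choices rather than a prescription from any library.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = SGDClassifier(random_state=0)
classes = np.unique(y_train)

best_score, best_epoch, patience, wait = -np.inf, 0, 5, 0
for epoch in range(100):
    # One pass over the training data
    model.partial_fit(X_train, y_train, classes=classes)
    score = model.score(X_val, y_val)  # validation accuracy after this pass
    if score > best_score:
        best_score, best_epoch, wait = score, epoch, 0
    else:
        wait += 1
        if wait >= patience:  # no improvement for `patience` consecutive passes
            print(f"stopping early at epoch {epoch}; best was epoch {best_epoch}")
            break
```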
Train with More Data:
- More data in the training set increases the likelihood of discovering the input-output relationship, improving model accuracy.
- It helps the algorithm find the signal and reduce errors, but it may not avoid overfitting.
- However, extra data may contribute noise to the model, so we must feed it clean, consistent data.
- With more training data, it becomes harder for the model to memorize every individual sample, so it is pushed toward learning patterns that generalize.
Feature Selection:
While building an ML model, we use several features or parameters to predict the outcome. Some of these features are redundant or contribute little to the prediction, and feature selection addresses this: we keep only the most important features of the training data and drop the rest. This also simplifies the model and reduces noise in the data. Some algorithms perform feature selection automatically, but we can also do it manually.
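As a rough illustration, scikit-learn's SelectKBest keeps only the k features with the strongest statistical relationship to the target; the choice of k=10 and the synthetic dataset below are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic dataset: 25 features, only a handful of them actually informative
X, y = make_classification(
    n_samples=500, n_features=25, n_informative=5, random_state=0
)

# Keep the 10 features most strongly related to the target
selector = SelectKBest(score_func=f_classif, k=10)
X_reduced = selector.fit_transform(X, y)

print("original shape:", X.shape)         # (500, 25)
print("reduced shape:", X_reduced.shape)  # (500, 10)
```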
Cross-Validation:
- Cross-validation is an effective approach for avoiding overfitting.
- In the general k-fold cross-validation technique, the dataset is divided into k folds (subsets) of equal size. In each round, one fold is held out as the validation set while the model is trained on the remaining k-1 folds; the process is repeated k times so every fold is used for validation once, and the scores are averaged, as sketched below.
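A brief sketch of this with scikit-learn's cross_val_score, using 5 folds; the logistic-regression model, the fold count, and the synthetic data are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# 5-fold cross-validation: each fold serves once as the validation set
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())
```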
Data Augmentation:
- Data augmentation is an alternative to collecting more data for avoiding overfitting. This method adds slightly modified copies of existing samples to the training data.
- Augmentation makes each data sample the model processes look somewhat different, so every sample appears new to the model, reducing the possibility of overfitting; a small image example is sketched below.
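For image data, a common way to do this is a random-transform pipeline. Below is a small sketch assuming torchvision and Pillow are available; the specific transforms, their parameters, and the stand-in gray image are arbitrary choices for illustration.

```python
from PIL import Image
from torchvision import transforms

# Each time this pipeline runs, it produces a slightly different version of the image
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
])

image = Image.new("RGB", (64, 64), color="gray")  # stand-in image for demonstration
augmented = augment(image)  # a randomly flipped, rotated, and recolored copy
```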
Regularization:
- If overfitting occurs because the model is too complex, we can reduce the number of features. However, overfitting can also occur with simpler models, such as the linear model, and in those cases regularization techniques are especially helpful.
- Regularization is the most common method for preventing overfitting. It is a family of techniques that force the learning algorithm to build a simpler model: a penalty term is added to the objective function, and the penalty grows larger as the model becomes more complex. Regularization slightly increases bias but noticeably reduces variance, as sketched below.
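As a brief sketch of the effect, the snippet below compares plain linear regression with Ridge (L2-penalized) regression in scikit-learn on noisy synthetic data; the alpha value and dataset shape are arbitrary choices, and the penalized model merely tends to show a smaller gap between training and test scores here.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Noisy regression data with more features than are really needed
X, y = make_regression(n_samples=100, n_features=50, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for name, model in [("plain linear", LinearRegression()),
                    ("ridge (L2 penalty)", Ridge(alpha=1.0))]:
    model.fit(X_train, y_train)
    # R^2 on training vs. held-out data; the penalized model tends to generalize better
    print(f"{name}: train R2 = {model.score(X_train, y_train):.2f}, "
          f"test R2 = {model.score(X_test, y_test):.2f}")
```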