Cross Validation:
Cross-validation is a machine learning technique for assessing how well a model performs on data it was not trained on.
It is especially helpful when only a small amount of data is available for training. Cross-validation aims to estimate how well a model will perform on new, unseen data.
Cross-validation operates as follows:
The original dataset is split into a training set and a validation (or testing) set.
The training set is used to fit the model, while the validation set is used to assess it.
This procedure is repeated several times, using a different partition of the data each time. In k-fold cross-validation, for instance, the data is divided into k equal-sized folds, and the model is trained and assessed k times, with exactly one fold serving as the validation set each time.
To get a final estimate of the model's performance, the results of the validation runs are combined, typically by averaging them.
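The splitting step described above can be sketched in plain Python. This is only an illustration: the helper name k_fold_indices and the choice of contiguous, unshuffled folds are assumptions made for the example, not something taken from the notebooks.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) for each of k contiguous folds."""
    # Spread the remainder so that fold sizes differ by at most one sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        val = list(range(start, stop))
        # Everything outside the validation fold is used for training.
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, val
        start = stop


folds = list(k_fold_indices(n_samples=10, k=3))
for fold_number, (train, val) in enumerate(folds, start=1):
    print(f"Fold {fold_number}: train={train} val={val}")
```

Every sample index appears in exactly one validation fold, and never in the training and validation sets of the same run.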
Cross-validation is beneficial for a number of reasons:
By assessing the model on data it was not trained on, it helps detect overfitting to the training set.
Compared to a single split of the data into training and testing sets, it gives a more reliable estimate of the model's performance.
It supports hyperparameter tuning, such as choosing the regularization strength of a regularized linear model, to improve the model's performance.
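As a rough sketch of the hyperparameter-tuning use case, scikit-learn's GridSearchCV scores each candidate value with cross-validation and keeps the best one. The example swaps in logistic regression (the model used later in this article) and tunes its C parameter, the inverse regularization strength; the candidate values and cv=5 are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Candidate values for C, the inverse regularization strength (illustrative only).
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=5,  # each candidate is scored with 5-fold cross-validation
)
search.fit(X, y)

print("Best C:", search.best_params_["C"])
print("Mean cross-validated accuracy for the best C:", search.best_score_)
```

Note that the best cross-validated score found this way tends to be slightly optimistic, because the same folds were used to pick the hyperparameter.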
In general, cross-validation is a key machine learning technique for assessing a model's performance and its capacity to generalize to new, unseen data.
It is widely used in practice and helps establish how dependable and accurate a model is.
Codeblock E.1. Cross Validation demonstration.
In this illustration, a logistic regression model is trained on the iris dataset. We create a KFold object with k folds and loop over each fold.
In each iteration of the loop, we divide the data into training and testing sets, train the model on the training set, and assess the model on the testing set.
We store the score and the fold index in separate lists and print the score for each fold.
After every fold has been evaluated, we calculate the overall average score and plot the fold scores as a bar chart.
We also add a horizontal line to show the average score and display a legend.
This visualization lets us examine how the model performs on different subsets of the data; a large spread between folds, or uniformly low scores, can hint at over- or underfitting problems.
The cross-validation itself takes place inside the KFold loop. In each iteration, the KFold object divides the data into training and testing sets according to the chosen number of folds (k).
The split method returns the training and testing set indices for the current fold, which are then used to select the corresponding data subsets.
Once the model has been trained on the training set and assessed on the testing set, the score is recorded in the scores list.
After this process has been carried out for each fold, the final performance metric is calculated as the average of the individual scores.
This method ensures that every data point is used for both training and testing (though never both in the same iteration), so the performance metric is a more reliable reflection of the model's capacity to generalize to new data.
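The notebook code itself is not reproduced in this text, so the following is only a sketch reconstructed from the description above. The values k=5, shuffle=True, random_state=0, and max_iter=1000 are assumptions, and the bar chart is left out to keep the example short.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)

k = 5  # assumed number of folds
# The iris samples are ordered by class, so shuffle before splitting (assumed choice).
kf = KFold(n_splits=k, shuffle=True, random_state=0)

fold_indices = []
scores = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    # Use the index arrays returned by split() to select the data subsets.
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)  # accuracy on the held-out fold

    fold_indices.append(fold)
    scores.append(score)
    print(f"Fold {fold}: score = {score:.3f}")

average_score = np.mean(scores)
print(f"Average score = {average_score:.3f}")
```

The fold_indices and scores lists hold what the bar chart described above would plot, with average_score as the horizontal line.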
Download. Download the Cross Validation 1.ipynb file used here.
Another example is listed below:
Codeblock E.2. Cross Validation demonstration.
In this code, the features in the iris dataset are first standardized using StandardScaler.
After that, we build our logistic regression model with the max_iter parameter raised to 1000.
Finally, we run 5-fold cross-validation on the standardized data using cross_val_score, and we print the score for each fold as well as the overall average score.
>>> import numpy as np
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import cross_val_score
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.preprocessing import StandardScaler
We begin by importing NumPy and the required scikit-learn functions and classes:
numpy for numerical operations
load_iris to load the iris dataset
cross_val_score to run cross-validation
LogisticRegression to build our logistic regression model
StandardScaler to standardize the data.
>>> data = load_iris()
>>> X, y = data.data, data.target
Using load_iris, we load the iris dataset and then assign the target data to y and the feature data to X.
>>> scaler = StandardScaler()
>>> X_std = scaler.fit_transform(X)
To standardize the feature data in X, we create a StandardScaler instance and call its fit_transform method.
This matters because scikit-learn's logistic regression is regularized by default and is fit with an iterative solver, so features on very different scales can affect both the fitted coefficients and how quickly the solver converges.
Standardizing rescales each feature to a mean of 0 and a variance of 1, which can improve the performance of our model.
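One caveat: fitting the scaler on all of X before cross-validation lets each validation fold influence the mean and variance used for scaling. On a small, clean dataset like iris the effect is usually negligible, but a common way to avoid it is to wrap the scaler and the model in a pipeline so the scaler is refit on the training folds only. The following is a minimal sketch of that alternative, not the approach used in the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# The pipeline is cloned and refit for every fold, so StandardScaler only
# ever sees the training portion of that fold.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline_scores = cross_val_score(pipeline, X, y, cv=5)

print("Scores per fold:", pipeline_scores)
print("Average score:", pipeline_scores.mean())
```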
>>> model = LogisticRegression(max_iter=1000)
We create a LogisticRegression instance with max_iter set to 1000, up from the default of 100.
This is because the ConvergenceWarning we received in the original code suggested that the default setting of max_iter (the maximum number of iterations the solver may take to converge) may not be adequate for our data.
>>> k = 5
>>> scores = cross_val_score(model, X_std, y, cv=k)
We use cross_val_score to carry out k-fold cross-validation on our model, with the number of folds set to k=5.
cross_val_score divides the data into k folds, trains the model on k-1 of them, and assesses it on the remaining fold.
This process is repeated k times, so that each fold acts as the test set exactly once.
We assign the resulting array of k scores to scores.
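One detail worth knowing: when the estimator is a classifier and cv is an integer, cross_val_score uses stratified folds (StratifiedKFold), so each fold keeps roughly the same class proportions as the full dataset. The sketch below checks this by passing the splitter explicitly; the variable names scores_int and scores_explicit are made up for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)
model = LogisticRegression(max_iter=1000)

# Integer cv with a classifier: stratified folds are chosen automatically.
scores_int = cross_val_score(model, X_std, y, cv=5)

# The same splitter, passed explicitly.
splitter = StratifiedKFold(n_splits=5)
scores_explicit = cross_val_score(model, X_std, y, cv=splitter)

print("Same scores:", np.allclose(scores_int, scores_explicit))

# Class counts in each test fold (iris has 50 samples per class).
for train_idx, test_idx in splitter.split(X_std, y):
    print(np.bincount(y[test_idx]))
```

By default, cross_val_score reports the estimator's own score method, which for LogisticRegression is classification accuracy.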
>>> for i, score in enumerate(scores):
...     print(f'Fold {i}: score = {score}')
A for loop over enumerate(scores) prints the score for each cross-validation fold; the fold index i starts at 0.
>>> avg_score = np.mean(scores)
>>> print(f'Average score = {avg_score}')
Finally, we use np.mean to calculate the average score across all folds and print the result.
This gives a more reliable estimate of the model's performance on new, unseen data than a single train-test split would.
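To see why a single split is less reliable, one can compare the scores from several single train-test splits with the cross-validated mean. This is a rough sketch; the 80/20 split ratio, the seeds 0 to 4, and the use of unscaled features are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Accuracy from single 80/20 splits made with different (arbitrary) seeds.
single_split_scores = []
for seed in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    single_split_scores.append(model.fit(X_train, y_train).score(X_test, y_test))

# Cross-validation averages over folds and also exposes the spread.
cv_scores = cross_val_score(model, X, y, cv=5)

print("Single-split scores:", single_split_scores)
print(f"5-fold CV: mean = {cv_scores.mean():.3f}, std = {cv_scores.std():.3f}")
```

The single-split score depends on which samples happen to land in the test set, whereas the cross-validated mean uses every sample for testing exactly once, and the standard deviation across folds indicates how much the estimate varies.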
Download. Download the Cross Validation 2.ipynb file used here.
---- Summary ----
You now know the basics of cross-validation.
Cross-validation is a technique for assessing how well a machine learning model performs on data it was not trained on.
The data is divided into k subsets, or "folds."
The model is trained on k-1 folds and then tested on the remaining fold.
This operation is repeated k times, so that each fold acts as the testing set once.
The model's overall performance is then estimated by averaging the scores across all folds.
Cross-validation is used to detect whether the model has been overfit to the training data.
It offers a more trustworthy assessment of the model's performance on new, unseen data.
Copyright © 2022-2023. Anoop Johny. All Rights Reserved.