Pipeline:
A pipeline in machine learning is a series of data processing operations that typically includes preprocessing the input, choosing and extracting features, and building the model itself.
By making it easy to experiment with various combinations of preprocessing methods and models, pipelines speed up the process of developing and testing machine learning models.
Building a pipeline entails putting together a series of actions, each of which transforms the data.
The Pipeline class in Scikit-Learn is used to design pipelines.
A Pipeline is constructed from a list of tuples, each of which represents one step in the pipeline.
Each tuple has two elements: a string that names the step as its first element, and a scikit-learn transformer or estimator object as its second.
Transformers are objects that alter the data in some way, such as scaling it or converting categorical variables to numerical values.
Estimators are objects that learn from the data, such as a decision tree or a support vector machine.
A pipeline that combines transformers and estimators can therefore transform the data and then learn from it.
Once the pipeline has been built, it can be fitted to the data with the fit method, which applies each transformation in turn and then fits the final estimator to the transformed data.
After fitting, the pipeline can transform new data with the transform method or make predictions with the predict method.
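As a minimal sketch of this pattern (the Iris data and the SVM here are illustrative choices, not part of the demonstration further below):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each tuple is (step name, transformer or estimator); the final step is the estimator
pipe = Pipeline([
    ("scaler", StandardScaler()),  # transformer: scales each feature
    ("svm", SVC()),                # estimator: learns from the scaled data
])

pipe.fit(X_train, y_train)         # scales the training data, then fits the SVM
print(pipe.predict(X_test[:5]))    # new data passes through the same steps first
```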
Pipelines are especially helpful for tuning hyperparameters with grid search cross-validation and for applying the same preprocessing steps to both training and testing data, as the sketch after this paragraph shows.
By building a pipeline that includes both the preprocessing steps and the model itself, it is simple to test different combinations of preprocessing methods and models, enabling quicker experimentation and more accurate models.
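A minimal sketch of grid search over a pipeline, continuing with the same illustrative Iris/SVM setup (the parameter grid values are arbitrary examples):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("svm", SVC())])

# Parameter names follow the "<step name>__<parameter>" convention
param_grid = {"svm__C": [0.1, 1, 10], "svm__gamma": ["scale", 0.01]}

# Each cross-validation fold refits the scaler on that fold's training split only,
# so no information leaks from the validation data into the preprocessing
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```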
A demonstration is shown below:
Codeblock E.1. Pipeline demonstration.
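Below is a sketch of such a pipeline, reconstructed from the step-by-step description that follows; the dataset URL, the lowercase column renaming, the feature lists, and the split and model settings are assumptions rather than the notebook's exact choices.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the Titanic dataset (this URL is an assumption; any copy of the dataset works)
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)

# Rename columns; lowercasing everything is an assumed stand-in for the original renaming
df = df.rename(columns=str.lower)

# Basic exploration: first rows, shape, dtypes, descriptive statistics
print(df.head())
print(df.shape)
print(df.dtypes)
print(df.describe())

# Correlation matrix for the numerical columns (Figure E.1)
sns.heatmap(df.select_dtypes("number").corr(), annot=True, cmap="coolwarm")
plt.title("Correlation map for the Titanic dataset")
plt.show()

# Split into features/target and then into training and testing sets
X = df.drop(columns=["survived"])
y = df["survived"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Numerical and categorical features (the exact lists are assumptions)
num_features = ["age", "fare", "sibsp", "parch"]
cat_features = ["sex", "pclass", "embarked"]

# Numerical preprocessing: median imputation, then scaling
num_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Categorical preprocessing: most-frequent imputation, then one-hot encoding
cat_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Combine the numerical and categorical transformers
preprocessor = ColumnTransformer(transformers=[
    ("num", num_transformer, num_features),
    ("cat", cat_transformer, cat_features),
])

# Full pipeline: preprocessing followed by a random forest classifier
pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(random_state=42)),
])

# Fit on the training data, predict on the testing data, report accuracy
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

# Bar chart of the feature importances (Figure E.2)
feature_names = pipeline.named_steps["preprocessor"].get_feature_names_out()
importances = pipeline.named_steps["classifier"].feature_importances_
plt.bar(feature_names, importances)
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
```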
This code uses the Titanic dataset to build a machine learning model that predicts whether a passenger survived. The code performs the following steps:
Imports the essential libraries, including seaborn and matplotlib for data visualization, sklearn for developing the machine learning model, and pandas for reading and manipulating data.
Loads the Titanic dataset from a URL using pd.read_csv(). The dataset includes details on Titanic passengers, such as their age, gender, class, and whether or not they survived.
Renames a few of the dataset's columns to correspond with the supplied column names.
Displays the first few rows, the shape of the dataset, the data types of the columns, descriptive statistics for the numerical columns, and a correlation matrix for the numerical columns, along with some basic information about the dataset.
Figure E.1. The correlation map for the Titanic dataset.
Splits the dataset into training and testing sets using train_test_split().
Defines the dataset's numerical and categorical features.
Defines num_transformer and cat_transformer to preprocess the numerical and categorical features, respectively.
The num_transformer utilizes StandardScaler() to scale the features and SimpleImputer() to fill in missing values with the median.
The cat_transformer utilizes OneHotEncoder() to encode categorical features as binary columns and SimpleImputer() to replace missing values with the most frequent value.
Combines the numerical and categorical transformers using ColumnTransformer().
Defines a pipeline (pipeline) that preprocesses the data and fits a RandomForestClassifier().
Fits the pipeline to the training data using pipeline.fit().
Makes predictions on the testing data using predict() and computes the accuracy score with accuracy_score().
Displays the accuracy score and a bar chart of the feature importances using plt.bar().
Figure E.2. The feature importance chart for the Titanic dataset features.
---- Summary ----
You now know the basics of scikit-learn Pipelines.
Load the Titanic dataset from a URL and rename columns to match the supplied column names.
Create training and testing sets from the dataset.
Define the numerical and categorical features.
Define the numerical preprocessor using SimpleImputer and StandardScaler to handle missing values and scale the numerical features.
Define the categorical preprocessor using SimpleImputer and OneHotEncoder to encode the categorical features.
Using ColumnTransformer, combine the numerical and categorical preprocessors.
Set up a pipeline using Pipeline to preprocess the data and fit a Random Forest Classifier.
Fit the pipeline to the training data.
Make predictions on the testing data and evaluate the accuracy.
Display the accuracy score and a bar chart of the feature importances.
Copyright © 2022-2023. Anoop Johny. All Rights Reserved.