
Machine learning

 

Demonstration:

Project IMLT 2022, Wisconsin Breast Cancer Dataset

You have received a breast cancer dataset and the relevant information about the features in this dataset.

Solve the following tasks with regard to this dataset:

1. Which of the ML algorithms that we have worked on in this course works best on this dataset?

  • Use the performance metrics F1 and AUROC.

  • Algorithms to be compared: kNN, Logistic regression, SVM, Decision Tree, Random Forest, AdaBoost, GradientBoost.

  • Which of these ML algorithms works best with the given breast cancer dataset?

  • Look at the performance metrics.

2. Calculate and plot feature importances for the ML algorithms where this is possible, and compare them to the scientific literature.

  • Are your calculated feature importances in line with the relevant literature or not? (Discuss this in your Jupyter Notebook.)

  • If the calculated feature importances do not correspond to the relevant scientific literature, the algorithm might not be the best for the given dataset.

3. Check the data:

  • Does it need some pre-processing before you can run the ML algorithms on it?

  • Are there correlations to be mitigated?

4. Present all your results, including the literature search, in a Jupyter Notebook.

Breast Cancer Dataset Information

Taken from the Wisconsin Breast Cancer Database. The task is to predict whether a person has breast cancer: benign or malignant.

Dataset Information:

1. Sample code number: id number

2. Clump Thickness: 1 - 10

3. Uniformity of Cell Size: 1 - 10

4. Uniformity of Cell Shape: 1 - 10

5. Marginal Adhesion: 1 - 10

6. Single Epithelial Cell Size: 1 - 10

7. Bare Nuclei: 1 - 10

8. Bland Chromatin: 1 - 10

9. Normal Nucleoli: 1 - 10

10. Mitoses: 1 - 10

11. Class: (2 for benign, 4 for malignant)

 

Literature review: https://pages.cs.wisc.edu/~olvi/uwmp/cancer.html

  • This method has so far correctly diagnosed 176 new cases in a row (119 benign, 57 malignant).

  • Only eight of those cases received a "suspicious" designation from Xcyt (that is, an estimated probability of malignancy between 0.3 and 0.7).

 

Literature Overview:

This work was conducted by Dr. Wolberg to detect breast masses based on Fine Needle Aspiration (FNA), a diagnostic procedure used to investigate lumps or masses.

Figure E.1. How FNA is conducted.

 

Here, 9 characteristics that were considered relevant were assessed from the FNA sample:

  1. Clump Thickness

  2. Uniformity of Cell Size

  3. Uniformity of Cell Shape

  4. Marginal Adhesion

  5. Single Epithelial Cell Size

  6. Bare Nuclei

  7. Bland Chromatin

  8. Normal Nucleoli

  9. Mitoses

 

The classifier was produced using the multisurface method (MSM) of pattern separation on the listed features and correctly diagnosed 97% of new cases; this dataset was termed the Wisconsin Breast Cancer Data.

The goal was to diagnose a sample based on a digital image of a small section of the FNA slide. This led to the creation of the software 'Xcyt'.

 

The diagnosis procedure is as follows:

  • An FNA sample is taken, stained, and examined under a microscope to detect cell nuclei.

  • The Xcyt software determines the individual nuclei, with the nucleus boundaries traced manually using the computer-vision 'snakes' approach (2-5 minutes).

  • This is followed by a computation of the values of the 9 corresponding characteristics; taking the mean, standard deviation, and extreme value of each results in ~30 nuclear features per sample.

  • A linear classifier then differentiates between benign and malignant samples, with the extreme values of 'Area' and 'Smoothness' and the mean value of 'Texture' as the features of concern (a minimal sketch follows this list).

  • The diagnosis is revealed to the patient, allowing the patient to self-assess any doubts that may arise about the software's assessment in comparison to previous samples.

  • The system correctly diagnosed 176 new patients (119 benign and 57 malignant), with 'suspicious' values found for only 8 cases, i.e. an estimated probability of malignancy between 0.3 and 0.7.
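To make the classification step concrete, here is a minimal sketch of a linear classifier of the kind described, using logistic regression as a stand-in for the MSM separating planes. The three features (mean texture, extreme area, extreme smoothness) follow the description above, but the values and variable names are invented for illustration and do not come from Xcyt:

# Minimal sketch: a linear classifier on three nuclear features.
# Logistic regression stands in for the MSM separating planes; the
# feature values below are invented and do NOT come from Xcyt.
import numpy as np
from sklearn.linear_model import LogisticRegression

# columns: mean texture, extreme area, extreme smoothness (hypothetical)
X = np.array([
    [12.0,  500.0, 0.08],   # benign-like sample
    [13.5,  520.0, 0.09],   # benign-like sample
    [25.0, 1200.0, 0.16],   # malignant-like sample
    [27.5, 1400.0, 0.18],   # malignant-like sample
])
y = np.array([0, 0, 1, 1])  # 0 = benign, 1 = malignant

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict([[14.0, 550.0, 0.10]]))        # predicted class
print(clf.predict_proba([[14.0, 550.0, 0.10]]))  # probabilities for [benign, malignant]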

 

From the literature study we understand that the nature of the cell nuclei and other nuclear features have been given priority as feature parameters.

 

A demonstration is shown below:

 

 

Codeblock E.1. Demonstration of the whole project.

 


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Read the dataset and drop the ID column, which carries no predictive information
dataset = pd.read_csv("Breast Cancer.csv")

dataset = dataset.drop(['Sample code number'], axis=1)
dataset


 

The code reads the breast cancer dataset from a CSV file and removes the 'Sample code number' column using the drop() method of the pandas DataFrame.

The resulting modified dataset is then stored in the dataset variable.

 


# Recode the labels: Class 2 (benign) -> 0, Class 4 (malignant) -> 1
dataset['Class'] = dataset['Class'].replace(2,0)

dataset['Class'] = dataset['Class'].replace(4,1)

 

The first line of code replaces every occurrence of the value 2 in the 'Class' column with the value 0. The value 2 in the dataset denotes benign breast cancer cases.

The second line of code replaces every 4 in the 'Class' column with 1. The value 4 denotes malignant breast cancer cases.

These two lines of code effectively turn the benign vs. malignant problem into a standard binary classification problem in which the positive class is represented by 1 (malignant) and the negative class by 0 (benign).
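The same recoding can be written in one step with pandas' map(); this is equivalent here under the assumption that the Class column only ever contains the values 2 and 4 (map() would turn any other value into NaN):

# One-step alternative; assumes 'Class' contains only the values 2 and 4
dataset['Class'] = dataset['Class'].map({2: 0, 4: 1})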

 


dataset.shape

dataset.describe()

dataset.info()

# checking for missing values
dataset.isna().any()

dataset.isna().sum()

 

The dataset's shape, or the number of rows and columns, is returned by the first line of code.

The dataset's descriptive statistics, including count, mean, standard deviation, minimum, and maximum, are returned by the second line of code.

The dataset's information, including the number of non-null entries and the data type of each column, is returned in the third line of code.

The dataset's missing values are checked in the fourth and fifth lines of code.

A boolean value indicating whether or not each value in the dataset is missing is returned by the isna() method.

The any() method then reports, for each column, whether it contains any missing values.

The total number of missing values for each column is returned by the sixth line of code.
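Had these checks revealed missing values, a typical next step would be imputation before modelling. Below is a minimal sketch using scikit-learn's SimpleImputer, shown for completeness only, since this dataset turns out to need no imputation:

# Sketch only: median imputation of missing values (not needed here)
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')
dataset_imputed = pd.DataFrame(imputer.fit_transform(dataset), columns=dataset.columns)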

 


# visually checking for missing values
import missingno as msno

msno.bar(dataset)

msno.matrix(dataset)

# The dendrogram clusters columns by nullity correlation; since this dataset
# contains no null values, the resulting dendrogram conveys no information.
msno.dendrogram(dataset)

 

We check for missing values in the dataset using a variety of techniques: checking for null values with isna() and sum(), and inspecting visually with the missingno library functions bar, matrix, and dendrogram.

Since none of the checks revealed any null values, it appears that the dataset contains no missing values.

Along with shape, describe(), and info(), these calls give an overview of the dataset.

 


Figure E.2. The bar chart generated by msno.bar(dataset).

 


Figure E.3. The matrix plot generated by msno.matrix(dataset).

 

Figure E.4. The dendrogram is generated on the basis of nullity; since no null values exist, it conveys no information.

 


dataset.hist(bins=50, figsize = (20,15))
plt.show()

 

Figure E.5. Histograms produced for each numerical column of the dataset with dataset.hist(bins=50, figsize=(20,15)).

 


import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame(dataset, columns = ['Clump Thickness','Uniformity of Cell Size','Uniformity of Cell Shape','Marginal Adhesion','Single Epithelial Cell Size','Bare Nuclei','Bland Chromatin','Normal Nucleoli','Mitoses','Class'])

# Set the figure size before plotting so that it takes effect on this figure
plt.rcParams["figure.figsize"] = (8,6)
sns.boxplot(x="variable", y="value", data=pd.melt(df))
plt.show()

 

This code generates a boxplot to display the distribution of each numerical variable in the breast cancer dataset.

The dataset contains details on several characteristics of the cell nuclei, such as their uniformity in size and shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses.

 

Figure E.6. A boxplot displaying the distribution of each numerical variable in the breast cancer dataset.

 

A new DataFrame object with the given columns of interest is built from the dataset using the pd.DataFrame() function.

The boxplot is created with sns.boxplot(), where the melted 'variable' column supplies the x-axis and the 'value' column supplies the y-axis.

The pd.melt() function reshapes the DataFrame into a long format, where each row corresponds to a single observation and each column to a variable.

The line plt.rcParams["figure.figsize"] = (8,6) sets the size of the plot (placed before the plotting call so that it takes effect), and plt.show() displays it.
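To make the reshaping concrete, here is a small illustration of what pd.melt() does on a toy frame (the values are invented):

# Toy illustration of pd.melt(): wide format -> long format
toy = pd.DataFrame({'Mitoses': [1, 2], 'Class': [0, 1]})
print(pd.melt(toy))
#   variable  value
# 0  Mitoses      1
# 1  Mitoses      2
# 2    Class      0
# 3    Class      1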

 


fig = plt.figure(figsize=(19, 15))
corr = dataset.corr()

# 111 stands for a 1x1 grid, first subplot
ax = fig.add_subplot(111)

# Colours the rectangles by correlation value
cax = ax.matshow(corr, vmin=-1, vmax=1)

# Plots the colourbar
fig.colorbar(cax)

# Plots x-tick labels (rotated so the long feature names do not overlap)
plt.xticks(range(len(corr.columns)), corr.columns, fontsize=10, rotation=90)

# Plots y-tick labels
plt.yticks(range(len(corr.columns)), corr.columns, fontsize=10)
plt.title('Correlation Matrix', fontsize=28)

# Plots the correlation matrix
plt.show()

 

This code renders the correlation matrix for the given dataset as a heatmap. It creates a figure sized 19 by 15 inches. The correlation matrix for the dataset is then calculated with the corr() function and stored in the corr variable.

Each square in the correlation matrix represents the correlation between two variables, and a color-coded representation of the matrix is produced using the matshow() function. The color scale's range is determined by the vmin and vmax parameters.

The colorbar() function adds a color bar to the plot's side to display how colors and correlation values correspond.

The range() method specifies the positions of the ticks, and the x-tick and y-tick labels are added using the xticks() and yticks() functions, respectively. The plot is given a title by the title() method, and the plot is shown by show().

 

Figure E.7. The correlation matrix of the dataset rendered as a heatmap.

 


corr = dataset.corr()

corr.style.background_gradient(cmap='coolwarm')

 

The following code creates a correlation matrix for the breast cancer dataset using the .corr() function from the pandas library.

Then, using the .style.background_gradient() function, it styles the correlation matrix with a gradient from the coolwarm color map.

With negative correlations shown in shades of blue and positive correlations in shades of red, this makes it simpler to visually judge the strength of the correlations between pairs of variables.
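Since Task 3 asks about correlations to be mitigated, the strongly correlated feature pairs can also be listed programmatically. A sketch; the 0.8 cutoff is an arbitrary choice, not a value from the literature:

# Print feature pairs whose absolute correlation exceeds a threshold;
# the 0.8 cutoff is an arbitrary choice
threshold = 0.8
corr = dataset.corr()
for i, col_i in enumerate(corr.columns):
    for col_j in corr.columns[i + 1:]:
        r = corr.loc[col_i, col_j]
        if abs(r) > threshold:
            print(f"{col_i} <-> {col_j}: r = {r:.2f}")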

 


Figure E.8. The styled correlation matrix for the breast cancer dataset.

 


# Create features and labels
features = dataset.drop(['Class'], axis=1)
labels = dataset['Class']

# Create training and test set
from sklearn.model_selection import train_test_split

features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=0.25)

 

This code creates two variables: features, which contains every column of the dataset except the Class column, and labels, which contains the Class column. This is done to separate the input features from the target variable.

The data is then divided into training and testing sets using the train_test_split() function from the sklearn.model_selection package.

75% of the original data are in the training set, while 25% are in the testing set. Four variables—features_train, features_test, labels_train, and labels_test—are given the split data.

The machine learning model is trained and evaluated using these variables.
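Note that this split is random and unseeded, so the metric values reported further below will vary from run to run. A variant with a fixed random_state (the seed 42 is an arbitrary choice) and stratification on the labels would make the results reproducible and preserve the class ratio in both sets:

# Reproducible, stratified variant of the split; the seed is arbitrary
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.25, random_state=42, stratify=labels)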

 


# Importing kNN, the ML algorithm
from sklearn.neighbors import KNeighborsClassifier

#Setting k=5, common starting point for k
classifier = KNeighborsClassifier(n_neighbors=5)

# Fit data
classifier.fit(features_train, labels_train)

# Predicting with classifier
pred = classifier.predict(features_test)

# Check accuracy
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(labels_test, pred)
print('Accuracy: {:.2f}'.format(accuracy))

# Check precision
from sklearn.metrics import precision_score
precision = precision_score(labels_test, pred)
print ('Precision: {:.2f}'.format(precision))

# Check recall
from sklearn.metrics import recall_score
recall = recall_score(labels_test, pred)
print ('Recall: {:.2f}'.format(recall))

# Check F1 score
from sklearn.metrics import f1_score
F1 = f1_score(labels_test, pred)
print ('F1 score: {:.2f}'.format(F1))

# Check with AUCROC
from sklearn.metrics import roc_auc_score
auroc = roc_auc_score(labels_test, pred)
print ('AUROC score: {:.2f}'.format(auroc))

#AUROC curve
# Import ROC curve from library
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(labels_test, pred)

# Defining a function to plot the AUROC curve
def plot_roc_curve(fpr, tpr):
    plt.plot(fpr, tpr, color='orange', label='ROC')
    plt.plot([0, 1], [0, 1], color='darkblue', linestyle='--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend()
    plt.show()

# Calling the function to finally plot the curve
plot_roc_curve(fpr, tpr)

 

The k-Nearest Neighbors (kNN) technique is implemented in the code above for the breast cancer dataset.

The program imports the required libraries, divides the dataset into training and test sets, fits the kNN classifier on the training set, and predicts the labels for the test set.

The code then computes a variety of evaluation metrics: accuracy, precision, recall, F1 score, and AUROC score.

Finally, the code plots the ROC curve using sklearn's roc_curve function and the user-defined plot_roc_curve function.
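One caveat: roc_curve is fed the hard 0/1 predictions here, so the plotted curve has only a single interior point. A smoother and more informative curve is usually obtained from predicted probabilities; a sketch reusing the classifier and the plotting function defined above:

# Use predicted probabilities instead of hard labels for a full ROC curve
probs = classifier.predict_proba(features_test)[:, 1]  # column 1 = P(malignant)
fpr, tpr, thresholds = roc_curve(labels_test, probs)
print('AUROC (from probabilities): {:.2f}'.format(roc_auc_score(labels_test, probs)))
plot_roc_curve(fpr, tpr)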

 

Figure E.9. The ROC curve of the k-Nearest Neighbors (kNN) classifier implemented in the code above.

 


Accuracy: 0.99
Precision: 1.00
Recall: 0.98
F1 score: 0.99
AUROC score: 0.99

 

Using the train_test_split() function from scikit-learn, the following code builds a training and test set.

It then applies standardization by scaling the features in the training and test sets with the StandardScaler() class.

The features are standardized to bring them all to the same scale, ensuring that each feature contributes equally to the analysis and modeling process.

This is significant because the magnitude of the features can affect the performance of several machine learning algorithms, such as k-Nearest Neighbors (kNN).

 


# Create the test and training set first
from sklearn.model_selection import train_test_split

features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=0.25)

# Then implement the StandardScaler: fit on the training set only,
# and transform both sets with the training-set statistics
from sklearn.preprocessing import StandardScaler

x_sc = StandardScaler()
features_train_scaled = x_sc.fit_transform(features_train)
features_test_scaled = x_sc.transform(features_test)

print("\033[1m"+" features_train_scaled : "+"\033[0m")
print("\n")
print(features_train_scaled)
print("\n")
print("\033[1m"+" features_test_scaled "+"\033[0m")
print(features_test_scaled)

 

The features_train_scaled and features_test_scaled variables contain the standardized features. The print() function is used to print these variables.
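A quick sanity check on the output: after standardization, each column of the training set should have a mean of approximately 0 and a standard deviation of approximately 1 (the test set only approximately so, since it is transformed with the training-set statistics):

# Each training-set column should now have mean ~0 and std ~1
print(features_train_scaled.mean(axis=0).round(2))
print(features_train_scaled.std(axis=0).round(2))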

This code depicts the distribution of the two classes in the dataset as a bar graph. The two classes are shown on the x-axis, and the frequency of each class on the y-axis.

The pd.value_counts() function counts the number of occurrences of each class in the dataset's 'Class' column, and the plot() method draws the bar graph.

The sort=True option sorts the classes in decreasing order of frequency, while kind='bar' selects a bar graph.

 


count_classes = pd.value_counts(dataset['Class'], sort = True)
count_classes.plot(kind = 'bar', rot=0)
plt.title("Class Distribution")
plt.xlabel("Class")
plt.ylabel("Frequency")

 

Figure E.10. The distribution of the two classes in the dataset, depicted as a bar graph.

 

The rot=0 option keeps the x-axis labels horizontal rather than rotated. The title(), xlabel(), and ylabel() functions set the title and axis labels of the graph.

 

 


Figure E.11. ROC curves of all the algorithms used in the cancer dataset analysis.

 

 

You can download the .ipynb file here:

 

Download

Download. The demonstration .ipynb file for the cancer dataset.

 

 

---- Summary ----

Some frequently employed loss functions in machine learning are listed below:

  • Mean squared error (MSE) measures the average of the squared differences between the predicted and actual values. MSE is frequently used for regression problems (see the short example below).
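As a short worked example, MSE computed by hand with NumPy (the values are invented for illustration):

# MSE = (1/n) * sum((y_true - y_pred)^2); values invented for illustration
import numpy as np

y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.5, 5.0, 4.0])
print(np.mean((y_true - y_pred) ** 2))  # (0.25 + 0.0 + 4.0) / 3 ≈ 1.42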



Copyright © 2022-2023. Anoop Johny. All Rights Reserved.