# Parallel Bootstrap Tutorial

Note

These examples were run on the ESI HPC cluster, which is why we use `esi_cluster_setup()` to set up a parallel computing client. They can be reproduced on any other cluster or on a local machine by using `slurm_cluster_setup()` or `local_cluster_setup()` instead, respectively.

The following Python code demonstrates how to use ACME to perform a parallel bootstrap of the classification accuracy of three different scikit-learn classifiers.

We start by loading the wine dataset from scikit-learn and splitting it into training and testing sets using the `train_test_split()` function from `sklearn.model_selection`.

```
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Load the wine dataset and hold out half of it for testing
data = datasets.load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.5, random_state=42
)
```

## Training and Evaluating The Classifiers

Next, we define three scikit-learn classifiers: K nearest neighbors (`KNeighborsClassifier`), a neural network model (`MLPClassifier`), and a support vector machine (`SVC`). We train each of them on the training set and evaluate their accuracy on the test set using their respective `score` methods.

```
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Define the three classifiers
neighbors = KNeighborsClassifier(n_neighbors=3)
NeuralNet = MLPClassifier(random_state=1, max_iter=300)
SVM = SVC(gamma='auto')

# Fit each classifier on the training set
neighbors.fit(X_train, y_train)
NeuralNet.fit(X_train, y_train)
SVM.fit(X_train, y_train)

# Score each classifier once on the held-out test set
print(f"K nearest neighbors accuracy: {neighbors.score(X_test, y_test):.3f}")
print(f"Neural network accuracy: {NeuralNet.score(X_test, y_test):.3f}")
print(f"Support Vector Machine accuracy: {SVM.score(X_test, y_test):.3f}")
```

Because each classifier was scored once on a single test set, we get only one accuracy value per classifier. To compare the classifiers, however, we would like a distribution of accuracies from which we can compute confidence intervals. Bootstrapping gives us such a distribution.
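To make the idea concrete before bringing in ACME, here is a minimal NumPy-only sketch of a bootstrap. The array `correct` is made-up illustration data (1 where a hypothetical classifier predicted a test sample correctly, 0 otherwise), not output from the classifiers above. Resampling it with replacement and averaging yields one bootstrapped accuracy per draw.

```python
import numpy as np

# Hypothetical per-sample outcomes: 1 = correct prediction, 0 = wrong (made-up data)
correct = np.array([1, 1, 0, 1, 1, 1, 0, 1, 1, 1])

rng = np.random.default_rng(0)  # fixed seed so the draws are reproducible
n_boot = 1000

# Each row holds the indices of one bootstrap sample, drawn with replacement
idx = rng.integers(0, len(correct), size=(n_boot, len(correct)))
boot_acc = correct[idx].mean(axis=1)  # one accuracy per bootstrap sample

print(f"accuracy on the original sample: {correct.mean():.2f}")
print(f"spread of bootstrapped accuracies (std): {boot_acc.std():.3f}")
```

The single accuracy value turns into a distribution whose spread reflects how much the estimate depends on which test samples happened to be drawn.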

## Bootstrapping and Confidence Intervals

We define a function `bootstrap_model_accuracy` that resamples the test set *with* replacement and calculates the accuracy of each classifier on the resampled data. We will use ACME to parallelize the bootstrapping process for efficiency.

```
from sklearn.utils import resample
from acme import cluster_cleanup, esi_cluster_setup, ParallelMap
import numpy as np

def bootstrap_model_accuracy(X_test, y_test, seed):
    # Resample the test set with replacement; the seed makes each draw reproducible
    X_resamp, y_resamp = resample(X_test, y_test, replace=True, random_state=seed)
    return (SVM.score(X_resamp, y_resamp),
            NeuralNet.score(X_resamp, y_resamp),
            neighbors.score(X_resamp, y_resamp))

client = esi_cluster_setup(partition="8GBXS", n_workers=10)
nboot = 100
seeds = np.arange(nboot)  # one distinct integer seed per bootstrap sample
with ParallelMap(bootstrap_model_accuracy, X_test, y_test, seeds, n_inputs=nboot,
                 write_worker_results=False, result_shape=(None, 3)) as pmap:
    results = pmap.compute()
cluster_cleanup(client)  # close the cluster if you don't need it anymore
```

We now have a distribution of accuracies for each classifier. This means we can calculate a confidence interval for each classifier.

```
from scipy.stats import sem, t

def CInt(data, confidence=0.95):
    # t-based confidence interval for the mean of the bootstrapped accuracies
    n = len(data)
    m = np.mean(data)
    std_err = sem(data)
    h = std_err * t.ppf((1 + confidence) / 2, n - 1)
    return m - h, m + h

# Columns of `results`: 0 = SVM, 1 = NeuralNet, 2 = K nearest neighbors
print(f"K nearest neighbors CI: {CInt(results[:, 2])[0]:.3f} to {CInt(results[:, 2])[1]:.3f}")
print(f"NeuralNet CI: {CInt(results[:, 1])[0]:.3f} to {CInt(results[:, 1])[1]:.3f}")
print(f"Support Vector Machine CI: {CInt(results[:, 0])[0]:.3f} to {CInt(results[:, 0])[1]:.3f}")
```
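The interval above is a t-based interval around the mean of the bootstrapped accuracies. A common alternative, not used in the original example, is the percentile bootstrap interval, which reads the bounds directly off the empirical distribution. The following is a minimal sketch: `percentile_interval` is a hypothetical helper name, and `demo` is stand-in data used in place of a column of `results`.

```python
import numpy as np

def percentile_interval(samples, confidence=0.95):
    """Percentile bootstrap interval: the central `confidence` mass of the samples."""
    alpha = (1.0 - confidence) / 2.0
    lower, upper = np.percentile(samples, [100 * alpha, 100 * (1 - alpha)])
    return lower, upper

# Stand-in data (assumption): evenly spaced "accuracies" between 0 and 1
demo = np.arange(101) / 100
low, high = percentile_interval(demo)
print(f"{low:.3f} to {high:.3f}")
```

With the tutorial's data one would call, for example, `percentile_interval(results[:, 0])` for the SVM column. This interval is typically wider than the t-based one, because it describes the spread of the individual bootstrapped accuracies rather than the uncertainty about their mean.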

We can now go ahead and also plot our bootstrapped results as histograms.

```
import matplotlib.pyplot as plt

# define bin edges for the histogram
bins = np.linspace(0, 1, 90)
# plot the distribution of the scores
plt.hist(results[:, 0], bins=bins, alpha=0.5, label="SVM", density=True)
plt.hist(results[:, 1], bins=bins, alpha=0.5, label="NeuralNet", density=True)
plt.hist(results[:, 2], bins=bins, alpha=0.5, label="K nearest neighbors", density=True)
plt.xlabel("Accuracy")
plt.title("Distributions of model accuracy for different models")
plt.legend()
plt.show()
```

This is a simple procedure for comparing the performance of different classifiers, and we could have achieved the same result with a plain for loop. The advantage of ACME becomes apparent with larger data sets and more complex models: there, bootstrapping can take a long time, and parallelization is needed to keep the runtime manageable. ACME lets us parallelize the bootstrapping process with just a few lines of code.
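For comparison, the serial for-loop version mentioned above might look like the following sketch. It is self-contained, so it reloads the data and refits the classifiers; `bootstrap_serial` is a hypothetical helper name that is not part of ACME or scikit-learn.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.utils import resample

# Same data split and classifier settings as in the tutorial
data = datasets.load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.5, random_state=42
)
models = [
    SVC(gamma="auto"),
    MLPClassifier(random_state=1, max_iter=300),
    KNeighborsClassifier(n_neighbors=3),
]
for model in models:
    model.fit(X_train, y_train)


def bootstrap_serial(models, X, y, seeds):
    """Score every fitted model on one resampled test set per seed (hypothetical helper)."""
    results = np.empty((len(seeds), len(models)))
    for i, seed in enumerate(seeds):
        # Draw one bootstrap sample of the test set, reproducibly
        X_resamp, y_resamp = resample(X, y, replace=True, random_state=int(seed))
        results[i] = [model.score(X_resamp, y_resamp) for model in models]
    return results


serial_results = bootstrap_serial(models, X_test, y_test, np.arange(100))
print(serial_results.shape)  # one row per bootstrap sample, one column per model
```

Each loop iteration is independent of the others, which is what makes this workload straightforward to hand over to `ParallelMap`.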