Finetune DistilBERT for binary classification on the SMS Spam Collection¶
This can be useful, for instance, when you want to leverage large pre-trained models on a smaller private dataset, such as medical or financial records, while preserving the privacy of users' data.
BastionLabTorch is intended for scenarios where a data owner, for instance a hospital, wants a third party, e.g. a startup, to train models on their data, potentially on untrusted infrastructure such as the Cloud.
The strength of BastionLabTorch is that the data shared with a remote enclave hosted in the Cloud and operated by the startup remains highly protected, thanks to the memory isolation, encryption, and remote attestation provided by secure enclaves.
In this notebook, we will illustrate how BastionLab works. We will use the publicly available SMS Spam Collection dataset to fine-tune a DistilBERT model on a classification task: predicting whether a message is spam or not.
In this guide, we will cover two phases:
- The offline phase, in which the data owner prepares the dataset and the data scientist prepares the model.
- The online phase, in which the dataset and model are uploaded to the secure enclave, where the model is trained on the dataset. The data scientist can pull the trained weights once training is over.
We largely followed this tutorial to prepare the data and set up the model we'll use in this example.
Pre-requisites¶
We need to have installed:
- BastionLab
- Hugging Face's Transformers library
- Polars
- IPython kernel for Jupyter
- Jupyter Widgets to enable notebooks extensions
!pip install bastionlab
!pip install transformers polars ipykernel ipywidgets
Offline phase - Model and dataset preparation¶
In this section, data owner and data scientist will prepare their data and model so that they are ready-to-use for the training in BastionLab.
Data owner's side: preparing the dataset¶
In this example, our data owner wants a third-party data scientist to train an AI model to detect spam in text messages.
Of course, in a real-world scenario, the data owner would already possess the data, but here we need to download it! We'll get the SMS Spam Collection dataset and unzip it by running the following code block:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
!unzip smsspamcollection.zip
To upload the dataset to the BastionLab server, the data owner needs to prepare their dataset and make it available as a PyTorch Dataset object.
import polars as pl
file_path = "./SMSSpamCollection"
# Read the CSV file using Polars and name the columns `label` and `text`
df = pl.read_csv(file_path, has_header=False, sep="\t", new_columns=["label", "text"])
# Map the `spam` label to `1` and any other label to `0`
df = df.with_column(
pl.when(pl.col("label") == "spam").then(1).otherwise(0).alias("label")
)
# View the first few elements of the DataFrame
df.head()
| label | text |
|---|---|
| i64 | str |
| 0 | "Go until juron... |
| 0 | "Ok lar... Joki... |
| 1 | "Free entry in ... |
| 0 | "U dun say so e... |
| 0 | "Nah I don't th... |
The data owner also needs to preprocess the data. We'll use a DistilBertTokenizer to obtain tensors ready to be fed to the model.
from transformers import DistilBertTokenizer
import torch
# The Distilbert Tokenizer is loaded from HuggingFace's repository.
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
# These two variables store the input_ids and attention_mask for each sentence.
token_id = []
attention_masks = []
# The DataFrame is converted to a dictionary with keys (`text` and `label`).
df_dict = df.to_dict(as_series=False)
samples = df_dict["text"]
labels = df_dict["label"]
# Each row (i.e. each sentence) in the DataFrame is passed to the tokenizer,
# which returns two items per row: {input_ids: [...], attention_mask: [...]}.
# These are appended to token_id and attention_masks respectively.
for sample in samples:
    encoding_dict = tokenizer.encode_plus(
        sample,
        add_special_tokens=True,
        max_length=32,
        truncation=True,
        padding="max_length",
        return_attention_mask=True,
        return_tensors="pt",
    )
    token_id.append(encoding_dict["input_ids"])
    attention_masks.append(encoding_dict["attention_mask"])
# We create a single tensor from each List[Tensor].
token_id = torch.cat(token_id, dim=0).to(dtype=torch.int64)
attention_masks = torch.cat(attention_masks, dim=0).to(dtype=torch.int64)
# Here, we convert List[int] into a Tensor.
labels = torch.tensor(labels, dtype=torch.int64)
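As an optional sanity check (not part of the original preprocessing), you can verify that the tensors line up, with one row per message:
# Optional check: one row of 32 token ids and one attention mask per message,
# and one label per message.
print(token_id.shape)         # (num_messages, 32)
print(attention_masks.shape)  # (num_messages, 32)
print(labels.shape)           # (num_messages,)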
To make training faster in this demonstration, we'll only use a subset of the dataset, but you can use the whole dataset if you prefer.
import numpy as np
# Ratio of the dataset reserved for validation.
test_ratio = 0.2
limit = 64
nb_samples = len(token_id)
# Generate an ndarray of indexes in the range [0, `nb_samples`), i.e. `0`
# included and `nb_samples` excluded
idx = np.arange(nb_samples)
# Shuffle the generated indexes
np.random.shuffle(idx)
# The indexes after the first `test_ratio * nb_samples` become train_idx,
# capped at `limit` samples
train_idx = idx[int(test_ratio * nb_samples) :][:limit]
# The first `test_ratio * nb_samples` indexes become test_idx,
# also capped at `limit` samples
test_idx = idx[: int(test_ratio * nb_samples)][:limit]
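Optionally, you can confirm the resulting split sizes; with limit=64, both subsets contain at most 64 samples:
# Optional check on the split sizes (capped by `limit`)
print(len(train_idx), len(test_idx))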
Finally, we create our training and validation TensorDataset objects. We'll use them to wrap our Tensor objects into a PyTorch Dataset.
from bastionlab.torch.utils import TensorDataset
# The training tensor is converted to a TensorDataset because the
# BastionLab Torch API only accepts datasets.
train_set = TensorDataset(
[token_id[train_idx], attention_masks[train_idx]], labels[train_idx]
)
# The validation tensor is also converted to a TensorDataset because
# the BastionLab Torch API only accepts datasets.
validation_set = TensorDataset(
[token_id[test_idx], attention_masks[test_idx]], labels[test_idx]
)
Data scientist's side: preparing the model¶
On their side, the data scientist must prepare the DistilBERT language model.
One important thing to know about BastionLab is that it supports models with an arbitrary number of inputs, but only a single output. This is the first thing we need to address, as Hugging Face's models typically have several outputs (logits, loss, etc.).
We'll use BastionLab's utility wrapper to select only one output of the model: in our case, the one that corresponds to the logits.
import torch.nn as nn
# The MultipleOutputWrapper class is used to select one (1) of
# the model's outputs.
# In this example, we select the `logits` output (index 0). It wraps
# a torch Module and redirects calls to the forward method to the
# inner module's forward method, returning only the selected output.
class MultipleOutputWrapper(nn.Module):
    """Utility wrapper to select one output of a model with multiple outputs.

    Args:
        module: A model with more than one output.
        output: Index of the output to retain.
    """

    def __init__(self, module: nn.Module, output: int = 0) -> None:
        super().__init__()
        self.inner = module
        self.output = output

    def forward(self, *args, **kwargs) -> torch.Tensor:
        output = self.inner.forward(*args, **kwargs)
        return output[self.output]
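To make the wrapper's behaviour concrete, here is a small standalone illustration (not needed for the rest of the tutorial) with a toy module that returns two outputs; the wrapper forwards the call and keeps only the output at the requested index:
# Toy module returning a tuple of two outputs, like many Hugging Face models do.
class ToyTwoOutputs(nn.Module):
    def forward(self, x):
        return x * 2, x * 3

wrapped = MultipleOutputWrapper(ToyTwoOutputs(), output=0)
print(wrapped(torch.ones(2)))  # tensor([2., 2.]): only the first output is kept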
from transformers import DistilBertForSequenceClassification
# We load the DistilBERT classifier from Hugging Face's hub
# and enable TorchScript support. This makes it possible to trace
# the model and send it to the BastionLab Torch service, which
# only accepts TorchScript models.
model = DistilBertForSequenceClassification.from_pretrained(
"distilbert-base-uncased",
num_labels=2,
output_attentions=False,
output_hidden_states=False,
torchscript=True,
)
model = MultipleOutputWrapper(
model, 0
) # MultipleOutputWrapper() can be loaded from bastionlab.torch.utils
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_projector.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Online phase - dataset and model upload and training¶
Now that both the dataset and the model are prepared, we can upload them to the secure enclave for training.
Data owner's side: uploading the dataset¶
We will connect to the BastionLab Torch instance using our Connection library.
Once connected, we'll use the RemoteDataset() method to upload the datasets to the BastionLab Torch service. The method requires us to provide a name and set a Differential Privacy budget. Here we use 1 000 000, an arbitrary number; as a rule of thumb, it should be much lower, such as 4 or 8.
To learn more about Differential Privacy and why it's important to use it, you can read this article.
from bastionlab import Connection
# The data owner privately uploads their dataset to the BastionLab Torch service
client = Connection("localhost").client.torch
remote_dataset = client.RemoteDataset(
train_set, validation_set, name="SMSSpamCollection"
)
Sending SMSSpamCollection: 100%|████████████████████| 35.7k/35.7k [00:00<00:00, 38.2MB/s]
Sending SMSSpamCollection (test): 100%|████████████████████| 35.7k/35.7k [00:00<00:00, 34.6MB/s]
Data scientist's side: uploading the model and triggering training¶
It's finally time to train the model!
The data scientist will use the list_remote_datasets endpoint to get a list of the datasets available on the server that they'll be able to use for training.
client = Connection("localhost").client.torch
# Fetches the list of all `RemoteDataset` on the BastionLab Torch service
remote_datasets = client.list_remote_datasets()
# Here, we print the list of the available RemoteDatasets on the BastionLab Torch service
# It will display in this form `["(Name): nb_samples=int, dtype=str"]`
[str(ds) for ds in remote_datasets]
['SMSSpamCollection (e4377764d92780aca061fb21f3afcb8b3b0d44cc30a27d9e3123d92eb63259e1): size=64, desc=N/A']
The dataset uploaded previously is available as a RemoteDataset object. It is a pointer to the remote dataset and contains only metadata. This allows the data scientist to work with remote datasets without users' data being exposed in the process.
# Here, only the first element of the dataset is printed
remote_datasets[0]
<bastionlab.torch.learner.RemoteDataset at 0x7ff880525420>
To send the model and the necessary training parameters to the server, we'll use the RemoteLearner() method.
To start training, we'll call the fit method on the RemoteLearner object with an appropriate number of epochs and Differential Privacy budget.
An epoch is one complete pass of the training procedure over the entire training dataset.
Then we'll test the model directly on the server with the test() method.
Note that behind the scenes, a DP-SGD training loop will be used. To learn more about DP-SGD, click here.
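The server handles DP-SGD for us, but to give an intuition of what it does, here is a minimal, purely illustrative sketch of one DP-SGD step in plain PyTorch (per-sample gradient clipping followed by Gaussian noise). This is not BastionLab's actual implementation, and the clipping norm and noise multiplier below are arbitrary example values:
# Illustrative DP-SGD step (not BastionLab's code): clip each per-sample
# gradient to a maximum norm, average, then add Gaussian noise before the update.
import torch

max_grad_norm = 1.0      # example clipping threshold
noise_multiplier = 1.1   # example noise level

def dp_sgd_step(params, per_sample_grads, lr=5e-5):
    for p, grads in zip(params, per_sample_grads):
        # grads has shape (batch_size, *p.shape): one gradient per sample
        flat = grads.reshape(grads.shape[0], -1)
        norms = flat.norm(dim=1, keepdim=True)
        # Clip each per-sample gradient so its norm is at most max_grad_norm
        clipped = flat * (max_grad_norm / norms).clamp(max=1.0)
        # Average the clipped gradients and add calibrated Gaussian noise
        noisy = clipped.mean(dim=0) + torch.randn_like(clipped[0]) * (
            noise_multiplier * max_grad_norm / grads.shape[0]
        )
        p.data -= lr * noisy.reshape(p.shape)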
Finally, as the last step of this tutorial, we'll retrieve a local copy of the trained model once training is complete. To do so, we'll use the get_model() method.
from bastionlab.torch.optimizer_config import Adam
# Torch is selected from the created client connection
# BastionLab has multiple services (Polars, Torch, etc)
client = Connection("localhost").client.torch
# A remote learner is created with the code below. It contains
# the DistilBERT model, loss type, the optimizer to use, as well as
# the training dataset to use.
remote_learner = client.RemoteLearner(
model,
remote_datasets[0],
max_batch_size=2,
loss="cross_entropy",
optimizer=Adam(lr=5e-5),
model_name="DistilBERT",
)
# fit() is called to remotely trigger training
remote_learner.fit(nb_epochs=2)
# The trained model is tested with the `accuracy` metric.
remote_learner.test(metric="accuracy")
# The trained model is fetched using the get_model() method
trained_model = remote_learner.get_model()
Sending DistilBERT: 100%|████████████████████| 268M/268M [00:03<00:00, 71.2MB/s]
Epoch 1/2 - train: 100%|████████████████████| 32/32 [00:13<00:00, 2.46batch/s, cross_entropy=0.5558 (+/- 0.0000)]
Epoch 2/2 - train: 100%|████████████████████| 32/32 [00:11<00:00, 2.70batch/s, cross_entropy=0.5000 (+/- 0.0000)]
Epoch 1/1 - test: 100%|████████████████████| 32/32 [00:01<00:00, 26.33batch/s, accuracy=0.8874 (+/- 0.0000)]
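As a possible next step (not shown in the original tutorial), the data scientist can try the retrieved model locally. Assuming get_model() returns a module with the same call signature as the wrapped model we uploaded (taking input_ids and attention_mask and returning logits), a quick local inference sketch could look like this:
# Hedged sketch: run the locally retrieved model on a new message.
# This assumes `trained_model` behaves like the MultipleOutputWrapper we sent.
message = "Free entry in 2 a weekly competition! Text WIN now."
enc = tokenizer.encode_plus(
    message,
    add_special_tokens=True,
    max_length=32,
    truncation=True,
    padding="max_length",
    return_attention_mask=True,
    return_tensors="pt",
)
with torch.no_grad():
    logits = trained_model(enc["input_ids"], enc["attention_mask"])
print("spam" if logits.argmax(dim=-1).item() == 1 else "ham")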