Fine-Tuning a Transformer (Hugging Face)
As presented in Tunstall et al., Chapter 2: Text Classification
Instead of freezing the weights in the body of the model and using the hidden states as fixed features, as we do in Feature Extraction with Transformers, here we retrain all the model parameters. This requires the classification head to be differentiable, so we usually use a neural network for classification. Because every parameter is updated, training is far more expensive than fitting a classifier on fixed features, so we will likely need a GPU as well.
By doing this we adapt the initial hidden states during the training process to decrease model loss and increase performance. This usually outperforms feature extraction.
To do this we can use the Trainer API from Transformers.
Pre-Training
We start by loading the pretrained model, but this time we load one with a classification head on top of the model outputs that can be trained together with the base model. We specify the number of labels we have to predict, as it dictates the number of outputs from the head:
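from transformers import AutoModelForSequenceClassification

# model_name and device are assumed to be defined already, as in Feature Extraction with Transformers
num_labels = 6
model = (AutoModelForSequenceClassification
         .from_pretrained(model_name, num_labels=num_labels)
         .to(device))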
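To make the role of num_labels concrete, here is a minimal NumPy sketch of a classification head as a single linear layer. It is a simplification and not the head that Transformers actually builds (real heads typically add further layers and dropout); the hidden size of 768 and the batch size of 4 are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

batch_size = 4    # illustrative value, not taken from the notes
hidden_dim = 768  # assumed hidden size of the base model
num_labels = 6    # same value as num_labels above

# Stand-in for the hidden state that the body of the model hands to the head,
# one vector per example in the batch.
hidden = rng.normal(size=(batch_size, hidden_dim))

# A single linear layer as a simplified classification head.
W = rng.normal(scale=0.02, size=(hidden_dim, num_labels))
b = np.zeros(num_labels)

# One logit per label for every example in the batch.
logits = hidden @ W + b
print(logits.shape)  # (4, 6)
```

Changing num_labels changes only the second dimension of the head's weight matrix, which is why it has to be set when the model is loaded.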
We need to do a bit more preprocessing beyond that seen in Feature Extraction: the dataset format is set so that the columns the model needs are returned as PyTorch tensors:
emotions_encoded.set_format("torch", columns=["input_ids", "attention_mask", "label"])
We can define the performance metrics we want to use for training. This function needs to take a prediction object (which will contain the model predictions and the correct labels) as input and return a dictionary with scalar metric values. For example:
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(pred):
    labels = pred.label_ids
    # predictions holds the logits; the predicted class is the arg max over the last axis
    preds = pred.predictions.argmax(-1)
    f1 = f1_score(labels, preds, average="weighted")
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "f1": f1}
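As a rough illustration of what compute_metrics receives and returns, the sketch below builds a stand-in prediction object with toy logits and labels and computes the same two metrics by hand. The weighted_f1 helper is a hypothetical hand-rolled version written for this example (the support-weighted mean of per-class F1 scores), not scikit-learn's implementation, and the numbers are made up.

```python
from types import SimpleNamespace

import numpy as np


def weighted_f1(labels, preds, num_classes):
    """Support-weighted mean of per-class F1 scores (hand-rolled sketch)."""
    labels = np.asarray(labels)
    preds = np.asarray(preds)
    total = 0.0
    for c in range(num_classes):
        tp = np.sum((preds == c) & (labels == c))
        fp = np.sum((preds == c) & (labels != c))
        fn = np.sum((preds != c) & (labels == c))
        denom = 2 * tp + fp + fn
        # A class that never appears in labels or predictions contributes 0.
        f1_c = 2 * tp / denom if denom > 0 else 0.0
        support = np.sum(labels == c)
        total += f1_c * support
    return total / len(labels)


# Stand-in for the prediction object: logits plus true label IDs (toy values).
pred = SimpleNamespace(
    predictions=np.array([[2.0, 0.1, 0.3],
                          [0.2, 1.5, 0.1],
                          [0.9, 0.2, 0.4],   # true class 1, predicted class 0
                          [0.1, 0.3, 2.2]]),
    label_ids=np.array([0, 1, 1, 2]),
)

preds = pred.predictions.argmax(-1)
accuracy = float(np.mean(preds == pred.label_ids))
f1 = float(weighted_f1(pred.label_ids, preds, num_classes=3))
print({"accuracy": accuracy, "f1": f1})
```

Weighting each class's F1 by its support means frequent classes dominate the score, which is worth keeping in mind when the classes are imbalanced.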
Training
Now everything is in place to instantiate the trainer. The most important piece is the TrainingArguments class, which specifies the parameters of the training run:
from transformers import Trainer, TrainingArguments
batch_size = 64
logging_steps = len(emotions_encoded["train"]) // batch_size
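training_args = TrainingArguments(output_dir="results",
                                  num_train_epochs=2,
                                  learning_rate=2e-5,
                                  per_device_train_batch_size=batch_size,
                                  per_device_eval_batch_size=batch_size,
                                  load_best_model_at_end=True,
                                  metric_for_best_model="f1",
                                  weight_decay=0.01,
                                  # recent Transformers releases rename this argument to eval_strategy
                                  evaluation_strategy="epoch",
                                  # load_best_model_at_end needs checkpoints saved on the same schedule as evaluation
                                  save_strategy="epoch",
                                  disable_tqdm=False,
                                  logging_steps=logging_steps)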
trainer = Trainer(model=model, args=training_args,
                  compute_metrics=compute_metrics,
                  train_dataset=emotions_encoded["train"],
                  eval_dataset=emotions_encoded["validation"])
trainer.train()
results = trainer.evaluate()
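The logging_steps value is chosen so that the training loss is logged roughly once per epoch. The arithmetic can be sketched as follows; the training-set size of 16,000 is an assumption used for illustration (substitute len(emotions_encoded["train"])), and the step counts assume a single device with no gradient accumulation.

```python
import math

num_train_examples = 16_000  # assumed size; use len(emotions_encoded["train"]) in practice
batch_size = 64
num_train_epochs = 2

# Same expression as in the training setup: full batches per pass over the data.
logging_steps = num_train_examples // batch_size

# Optimizer steps per epoch, counting a final partial batch if there is one.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(logging_steps, steps_per_epoch, total_steps)  # 250 250 500
```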
Analysing Performance
The trainer.predict method returns the raw predictions for each class together with the true label IDs and the evaluation metrics:
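import numpy as np
from sklearn.metrics import classification_report

preds_output = trainer.predict(emotions_encoded["validation"])
preds_output.metrics
# the predicted class is the one with the highest logit
y_preds = np.argmax(preds_output.predictions, axis=1)
# y_valid, labels and plot_confusion_matrix are assumed to come from Feature Extraction with Transformers
plot_confusion_matrix(y_preds, y_valid, labels)
print(classification_report(y_valid, y_preds, target_names=labels))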
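For readers who want to see what a confusion matrix contains before plotting it, here is a small NumPy sketch. The confusion_matrix_counts helper is hypothetical and written only for this example, the labels are toy values, and the function takes the true labels first (the plotting helper used above takes the predictions first).

```python
import numpy as np


def confusion_matrix_counts(y_true, y_pred, num_classes):
    """Count matrix with true classes as rows and predicted classes as columns."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm


# Toy labels for three classes (made-up values).
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 0, 2]

cm = confusion_matrix_counts(y_true, y_pred, num_classes=3)
print(cm)

# Normalizing each row by the number of true examples in that class puts the
# per-class recall on the diagonal.
cm_norm = cm / cm.sum(axis=1, keepdims=True)
print(cm_norm.round(2))
```

Row normalization makes classes of different sizes comparable, which matters when some labels are much rarer than others.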
Another way of evaluating is to test the model on a new piece of data. For example, we can tokenize the text, pass the tensor through the model, and extract the logits. The logits are unnormalized scores, so we pass them through a softmax function to turn them into class probabilities.
import matplotlib.pyplot as plt
import torch

custom_text = "this is my dummy text that I want to try out with my model"
# move the input to the same device as the model
input_tensor = tokenizer.encode(custom_text, return_tensors="pt").to(device)
logits = model(input_tensor).logits
softmax = torch.nn.Softmax(dim=1)
# we have a batch size of 1 so we throw away the first dimension
probs = softmax(logits)[0]
# convert to a NumPy array for processing on the CPU
probs = probs.cpu().detach().numpy()
# we might visualize the probabilities
plt.bar(labels, 100 * probs, color='C0')
plt.title(f'Prediction for: {custom_text}')
plt.ylabel("Class probability (%)")
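The softmax step can be sketched in NumPy as below. This is a stand-alone illustration with made-up logits for six classes, not the model's real output; subtracting the row maximum before exponentiating is a common trick for numerical stability and does not change the result.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Made-up logits for a batch of one example and six classes.
logits = np.array([[1.0, 2.0, 3.0, 0.5, -1.0, 0.0]])

# Batch size is 1, so keep only the first row.
probs = softmax(logits)[0]
print(probs.round(3), probs.sum())
```

The ordering of the classes is unchanged by softmax, so the predicted class is the same whether we take the arg max of the logits or of the probabilities.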
Error Analysis
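A more detailed dive into the model's predictions is usually required. One approach is to compute the loss for each validation example, so that the examples can be ranked by how badly the model gets them wrong: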
from torch.nn.functional import cross_entropy

def forward_pass_with_label(batch):
    # place the inputs and labels on the same device as the model
    input_ids = torch.tensor(batch["input_ids"], device=device)
    attention_mask = torch.tensor(batch["attention_mask"], device=device)
    labels = torch.tensor(batch["label"], device=device)

    with torch.no_grad():
        output = model(input_ids, attention_mask)
        pred_label = torch.argmax(output.logits, axis=-1)
        # reduction="none" keeps one loss value per example instead of the batch mean
        loss = cross_entropy(output.logits, labels, reduction="none")

    batch["predicted_label"] = pred_label.cpu().numpy()
    batch["loss"] = loss.cpu().numpy()
    return batch
emotions_encoded["validation"] = emotions_encoded["validation"].map(
    forward_pass_with_label, batched=True, batch_size=16)
emotions_encoded.set_format("pandas")
cols = ["text", "label", "predicted_label", "loss"]
df_test = emotions_encoded["validation"][:][cols]
df_test["label"] = df_test["label"].apply(label_int2str, split="test")
df_test["predicted_label"] = (df_test["predicted_label"]
                              .apply(label_int2str, split="test"))
display_df(df_test.sort_values("loss", ascending=False).head(10), index=None)
With the data frame sorted by loss, we can examine some of the issues that we are likely to find:
Wrong labels. Annotators make errors or disagree, and labels inferred from other features can be wrong, so the dataset is likely to contain some mislabeled examples. We can find and fix them by looking at the samples with the highest loss.
Dataset quirks. Messy data is really common with text, and special characters or strings can throw off the model. Looking at the samples that the model predicts least confidently can reveal these issues and suggest how to tidy the dataset.
Often finding these issues in the original data and fixing them can lead to better performance gains than more data or larger models.
Conversely, models are really good at finding shortcuts that might backfire in production, for example if you accidentally leave labels such as star ratings in the text of customer reviews. So it is also good to look at the samples with the lowest losses and check that the model is not exploiting this kind of mistake.
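To see why ranking by loss surfaces both kinds of problem, here is a small NumPy sketch of the per-example cross-entropy. The per_example_cross_entropy helper is hypothetical and the logits and labels are toy values: a confident wrong prediction gets the highest loss, a confident correct one the lowest.

```python
import numpy as np


def per_example_cross_entropy(logits, labels):
    """Cross-entropy for each example, with no averaging over the batch."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels]


logits = np.array([[4.0, 0.0, 0.0],   # confident and correct (label 0)
                   [0.0, 4.0, 0.0],   # confident but wrong (label 2)
                   [1.0, 1.0, 1.0]])  # uncertain (label 1)
labels = np.array([0, 2, 1])

losses = per_example_cross_entropy(logits, labels)

# Indices sorted from highest to lowest loss.
order = np.argsort(-losses)
print(losses.round(3), order)
```

The top of this ranking is where mislabeled or messy examples tend to show up, while the bottom is where to look for suspiciously easy examples that may hide a shortcut.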
Finally we can save our model with the tokenizer:
trainer.save_model("models/distilbert-emotion")
tokenizer.save_pretrained("models/distilbert-emotion")
Possible Improvements
There are more steps we could try to yield further improvements to our model:
Address class imbalance by up-sampling minority classes or down-sampling majority classes. Alternatively, weight the classes in the loss.
Add more embeddings from different models. You can concatenate the embeddings from different models to create one large input feature. For example, ALBERT, GPT-2, or ELMo.
Use traditional feature engineering, for example by adding features such as the length of the document or the presence of particular entities on top of the embeddings.
Don’t rely on default hyperparameters. You can use Optuna to systematically tune hyperparameters. This is investigated later.
Clean up ‘label noise’. Going back to the dataset and cleaning up the labels is essential in NLP development. You can also apply label smoothing.
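Two of the ideas above, class weighting and label smoothing, come down to simple arithmetic, sketched here in NumPy. Both helpers are hypothetical and written for this example: the weights use the common inverse-frequency formula n_samples / (n_classes * class_count) and assume every class appears at least once, and the smoothing uses the usual (1 - epsilon) * one_hot + epsilon / n_classes form with an assumed epsilon of 0.1.

```python
import numpy as np


def inverse_frequency_weights(labels, num_classes):
    """Larger weights for rarer classes; assumes every class occurs at least once."""
    counts = np.bincount(labels, minlength=num_classes)
    return len(labels) / (num_classes * counts)


def smooth_labels(labels, num_classes, epsilon=0.1):
    """Turn hard labels into smoothed target distributions."""
    one_hot = np.eye(num_classes)[labels]
    return one_hot * (1.0 - epsilon) + epsilon / num_classes


# Toy imbalanced label set: class 0 is twice as common as classes 1 and 2.
labels = np.array([0, 0, 0, 0, 1, 1, 2, 2])

weights = inverse_frequency_weights(labels, num_classes=3)
targets = smooth_labels(labels, num_classes=3)

print(weights.round(3))
print(targets[0].round(3), targets[0].sum())
```

In practice the weights would be passed to the loss function (PyTorch's cross-entropy loss accepts a per-class weight tensor), and TrainingArguments exposes a label_smoothing_factor setting, so neither needs to be implemented by hand.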