Feature Extraction with Transformers (Hugging Face)
As discussed in Tunstall et al: Chapter 02: Text Classification
To use the Transformer as a feature extractor, we freeze the body’s weights during training and use the hidden states as features for the classifier. This lets us quickly train a small, shallow model on top. That model could be a neural classification layer or a method that doesn’t rely on gradients, such as a Random Forest. The hidden states can be computed relatively quickly on a CPU, so this approach doesn’t require hefty GPU access.
The major assumption is that the hidden states capture all the information required for classification. But if necessary information wasn’t required for the pre-training task, it may not be captured. In that scenario you would need to fine-tune instead.
We can use the from_pretrained method to load a model:
import torch
from transformers import AutoModel

# model_name: the checkpoint name of the pretrained model to load
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained(model_name).to(device)
Here AutoModel loads only the body of the network: it converts the token IDs (conceptually one-hot vectors) into embeddings, adds positional encodings, and feeds the result through the encoder stack to return the hidden states. The model head that decodes the hidden states into masked-token predictions is excluded.
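To make the "one-hot vectors to embeddings with positional encodings" step concrete, here is a minimal NumPy sketch. The sizes and the two randomly initialised matrices are hypothetical stand-ins for learned parameters, and real models typically apply further steps (such as normalisation) before the encoder stack.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, hidden_dim = 10, 6, 4  # toy sizes, purely illustrative

# Hypothetical stand-ins for the learned embedding tables.
token_embeddings = rng.normal(size=(vocab_size, hidden_dim))
position_embeddings = rng.normal(size=(max_len, hidden_dim))

input_ids = np.array([1, 7, 3, 2])  # made-up token ids

# One-hot encode the ids and project them with a matrix product ...
one_hot = np.eye(vocab_size)[input_ids]      # (n_tokens, vocab_size)
via_matmul = one_hot @ token_embeddings      # (n_tokens, hidden_dim)

# ... which is equivalent to simply looking up rows of the embedding table.
via_lookup = token_embeddings[input_ids]

# Add a position-dependent vector so the encoder can tell token order apart.
encoder_input = via_lookup + position_embeddings[: len(input_ids)]
```

The lookup form is what libraries use in practice, since it avoids materialising the large one-hot matrix.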
Tokenization Revisited
We can extract the last hidden states by tokenizing our input string and passing it to the model:
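text = "my test input string"
# tokenizer is assumed to be the tokenizer that matches the loaded model
text_tensor = tokenizer.encode(text, return_tensors="pt").to(device)
text_tensor.shape  # (batch_size, n_tokens)
output = model(text_tensor)
output.last_hidden_state.shape  # (batch_size, n_tokens, hidden_dim)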
Points to note:
return_tensors="pt" makes sure the encoded input is returned as a PyTorch tensor (it could also be a TensorFlow tensor if you want).
The return from the model depends on how it’s configured. The output will be a class that can contain several objects, like hidden states, losses, or attentions.
For BERT, a hidden state is returned for each input token. We tend to use the hidden state associated with the [CLS] token as the input feature; it sits at the first position along the second (token) dimension of the tensor.
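As a small illustration of that last point, the sketch below picks out the [CLS] vector by indexing the token axis. The array is a random stand-in for output.last_hidden_state, and the sizes are illustrative only.

```python
import numpy as np

batch_size, n_tokens, hidden_dim = 2, 6, 768  # illustrative sizes

# Random stand-in for output.last_hidden_state (converted to NumPy).
rng = np.random.default_rng(0)
last_hidden_state = rng.normal(size=(batch_size, n_tokens, hidden_dim))

# [CLS] is the first token of every sequence, so take index 0 on the token axis.
cls_features = last_hidden_state[:, 0]  # shape: (batch_size, hidden_dim)
```

The same slice, last_hidden_state[:, 0], works unchanged on a PyTorch tensor.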
Here’s an example function that would tokenize a batch of the dataset:
def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True)
padding=True pads each example with zeros up to the length of the longest one in the batch; truncation=True truncates examples to the model’s maximum context size.
If you inspect the result you’ll see you have a dictionary, where each value is a list of lists. Each list in input_ids will start with 101, end with 102, and then be followed by zeros where it was padded, reflecting the special tokens:
Special Token | [UNK] | [SEP] | [PAD] | [CLS] | [MASK] |
---|---|---|---|---|---|
Special Token ID | 100 | 102 | 0 | 101 | 103 |
The result will also contain a list of attention_mask arrays that tell the model to ignore the padded parts of the input.
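The following pure-Python sketch mimics what padding and the attention mask look like, using the special-token ids from the table. The pad_batch helper and the word ids are hypothetical; a real tokenizer also handles word-piece splitting and vocabulary lookup.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids from the table

def pad_batch(sequences, max_length):
    """Wrap each id sequence in [CLS]/[SEP], truncate, pad, and build masks."""
    wrapped = []
    for ids in sequences:
        body = ids[: max_length - 2]  # leave room for [CLS] and [SEP]
        wrapped.append([CLS_ID] + body + [SEP_ID])

    longest = max(len(ids) for ids in wrapped)
    input_ids, attention_mask = [], []
    for ids in wrapped:
        n_pad = longest - len(ids)
        input_ids.append(ids + [PAD_ID] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up word ids; real values depend on the tokenizer's vocabulary.
batch = pad_batch([[2000, 2001, 2002], [2003]], max_length=8)
# batch["input_ids"]      -> [[101, 2000, 2001, 2002, 102], [101, 2003, 102, 0, 0]]
# batch["attention_mask"] -> [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```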
Note that input tensors are only stacked into matrices when they are passed to the model, so the batch size used for tokenization must match the one used for training, and you can’t shuffle in between. Otherwise the input tensors in a batch might have different lengths and fail to stack (remember they are padded to the longest example in their tokenization batch). When in doubt, set batch_size=None in tokenization: the whole split is then processed as a single batch, so every input is padded to the global maximum length.
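The stacking problem can be reproduced with plain NumPy. In this sketch the id sequences and the pad_to helper are hypothetical; it only illustrates why per-batch padding produces ragged rows once batches are mixed, and why padding to the global maximum avoids that.

```python
import numpy as np

def pad_to(sequences, length, pad_id=0):
    """Right-pad every sequence with pad_id up to the given length."""
    return [seq + [pad_id] * (length - len(seq)) for seq in sequences]

# Toy id sequences of different lengths (made-up values).
examples = [[101, 7, 102], [101, 7, 8, 9, 102], [101, 102], [101, 5, 6, 102]]

# Tokenization-style padding in batches of 2: each batch is padded
# only to its own longest example.
tokenized = []
for start in range(0, len(examples), 2):
    chunk = examples[start:start + 2]
    tokenized += pad_to(chunk, max(len(s) for s in chunk))

lengths_per_batch = [len(s) for s in tokenized]  # [5, 5, 4, 4]

# A later batch that mixes rows from different tokenization batches is ragged.
mixed = [tokenized[0], tokenized[2]]
try:
    np.stack([np.array(row) for row in mixed])
    stacked_ok = True
except ValueError:
    stacked_ok = False  # rows of length 5 and 4 cannot be stacked

# Padding everything to the global maximum always stacks cleanly.
global_padded = pad_to(examples, max(len(s) for s in examples))
matrix = np.stack([np.array(row) for row in global_padded])  # shape (4, 5)
```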
We can use the DatasetDict.map function to apply tokenize across all the splits in the corpus:
emotions_encoded = emotions.map(tokenize, batched=True, batch_size=None)
NB: we need to set batched=True, or map will operate on each example individually. Now we can have a look at the features and see that input_ids and attention_mask have been added: emotions_encoded['train'].features.
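Now we can go about getting the hidden states. We can define a forward pass that runs the model without gradient tracking and averages the hidden states of the non-padded tokens: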
import numpy as np
import torch

def forward_pass(batch):
    input_ids = torch.tensor(batch["input_ids"]).to(device)
    attention_mask = torch.tensor(batch["attention_mask"]).to(device)

    # Inference only: no gradients are needed because the body is frozen
    with torch.no_grad():
        last_hidden_state = model(input_ids, attention_mask=attention_mask).last_hidden_state
        last_hidden_state = last_hidden_state.cpu().numpy()

    # Use average of unmasked hidden states for classification
    lhs_shape = last_hidden_state.shape
    # True marks padded positions, which np.ma excludes from the mean
    boolean_mask = ~np.array(batch["attention_mask"]).astype(bool)
    boolean_mask = np.repeat(boolean_mask, lhs_shape[-1], axis=-1)
    boolean_mask = boolean_mask.reshape(lhs_shape)
    masked_mean = np.ma.array(last_hidden_state, mask=boolean_mask).mean(axis=1)
    batch["hidden_state"] = masked_mean.data
    return batch

emotions_encoded = emotions_encoded.map(forward_pass, batched=True,
                                        batch_size=16)
This will add a new hidden_state feature to our dataset.
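The masked average in forward_pass can also be written with broadcasting instead of np.ma. The sketch below is an alternative formulation, not the method from the book, and checks on toy data (random values, made-up mask) that both give the same result.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim = 3
last_hidden_state = rng.normal(size=(2, 4, hidden_dim))  # (batch, tokens, hidden)
attention_mask = np.array([[1, 1, 1, 0],
                           [1, 1, 0, 0]])                # 0 marks padding

# Broadcasting approach: zero out padded tokens, then divide by the
# number of real tokens in each sequence.
weights = attention_mask[:, :, None]                     # (batch, tokens, 1)
summed = (last_hidden_state * weights).sum(axis=1)       # (batch, hidden)
counts = weights.sum(axis=1)                             # (batch, 1)
mean_broadcast = summed / counts

# np.ma approach, following the same idea as forward_pass:
# True in the mask means "exclude this value from the mean".
mask3d = np.repeat(~attention_mask.astype(bool)[:, :, None], hidden_dim, axis=2)
mean_masked = np.ma.array(last_hidden_state, mask=mask3d).mean(axis=1).data
```

Both versions assume every sequence contains at least one real token; otherwise the division by counts would be a division by zero.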
Now we have all we need and can create the corresponding arrays in Scikit-Learn format:
X_train = np.array(emotions_encoded["train"]["hidden_state"])
X_test = np.array(emotions_encoded["validation"]["hidden_state"])
Y_train = np.array(emotions_encoded["train"]["label"])
Y_test = np.array(emotions_encoded["validation"]["label"])
X_train.shape, X_test.shape
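With arrays in this (n_samples, n_features) layout, any Scikit-Learn classifier can be trained on the frozen features. The sketch below uses LogisticRegression as one possible choice and runs on synthetic stand-in features rather than real hidden states, so the names ending in _demo and the resulting accuracy are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_classes, hidden_dim = 3, 16  # toy sizes; real hidden states are much wider

# Synthetic stand-in for hidden-state features: one Gaussian cluster per class.
centers = rng.normal(scale=5.0, size=(n_classes, hidden_dim))

def make_split(n_per_class):
    """Draw n_per_class noisy points around each class centre."""
    X = np.concatenate(
        [c + rng.normal(size=(n_per_class, hidden_dim)) for c in centers]
    )
    y = np.repeat(np.arange(n_classes), n_per_class)
    return X, y

X_train_demo, y_train_demo = make_split(50)
X_valid_demo, y_valid_demo = make_split(20)

# The Transformer body is not involved here: only this small model is trained.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_demo, y_train_demo)
accuracy = clf.score(X_valid_demo, y_valid_demo)
```

In the real workflow, X_train and Y_train would take the place of the synthetic training arrays, and X_test and Y_test the place of the validation arrays.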