Feature Extraction with Transformers (Hugging Face)

As discussed in Tunstall et al., Chapter 2: Text Classification

To use the Transformer as a feature extractor, we freeze the body’s weights during training and use the hidden states as features for the classifier. This lets us quickly train a small, shallow model, which could be a neural classification layer or a method like Random Forest that doesn’t rely on gradients. Because the body is frozen, the hidden states only need to be computed once, so this approach doesn’t require hefty GPU access.

The major assumption is that the hidden states capture all the information required for classification. If some of that information wasn’t needed for the pre-training task, though, it may not have been captured, and in that scenario you would need to fine-tune instead.

We can use the from_pretrained method to load a model:


import torch
from transformers import AutoModel

# model_name is the name of the pretrained checkpoint to load (defined elsewhere)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained(model_name).to(device)

Here AutoModel loads just the model body: it converts the token encodings to embeddings with positional encodings and feeds them through the encoder stack to return the hidden states. The model head that decodes them into masked token predictions is excluded.

Tokenization Revisited

We can extract the last hidden states by tokenizing our input string and passing it to the model:


# Assumes tokenizer was loaded earlier, e.g. with AutoTokenizer.from_pretrained(model_name)
text = "my test input string"
text_tensor = tokenizer.encode(text, return_tensors="pt").to(device)
text_tensor.shape

output = model(text_tensor)
output.last_hidden_state.shape

Points to note:

return_tensors="pt" makes sure the result is a PyTorch tensor (it could also be TensorFlow if you want).

The return value from the model depends on how it’s configured. The output will be an object that can contain several fields, like hidden states, losses, or attentions.

For BERT, a hidden state is returned for each input token. We tend to use the hidden state associated with the [CLS] token as the input feature, located in the first position in the second dimension of the tensor.
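
As a shape-only illustration of that indexing, here is a minimal sketch that uses a random NumPy array as a stand-in for output.last_hidden_state. The sizes (2 sequences, 6 tokens, hidden size 768) are assumptions chosen for the example, not values read from a real model:

```python
import numpy as np

# Stand-in for output.last_hidden_state with shape
# (batch_size, sequence_length, hidden_size). Values are random; only shapes matter.
rng = np.random.default_rng(0)
last_hidden_state = rng.normal(size=(2, 6, 768))

# Position 0 along the sequence dimension is where the [CLS] token sits.
cls_features = last_hidden_state[:, 0]

print(cls_features.shape)  # (2, 768)
```

The same slice, [:, 0], should work on the PyTorch tensor returned by the model, giving one feature vector per input sequence.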

Here’s an example function that would tokenize a batch of the dataset:

def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True)

padding will pad the examples with zeros to the length of the longest one in the batch; truncation will truncate them to the model’s maximum context size.

If you inspect the result you’ll see you have a dictionary, where each value is a list of lists. Each list in input_ids will start with 101, end with 102 and then zeros, reflecting the special tokens:

Special Token	[UNK]	[SEP]	[PAD]	[CLS]	[MASK]
Special Token ID	100	102	0	101	103

The result will also contain a list of attention mask arrays that will tell the model to ignore the padded parts of the input.
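
To make the layout of input_ids and attention_mask concrete, here is a toy sketch in plain Python. It is not the Hugging Face tokenizer: pad_batch is a hypothetical helper, and the inner token IDs are made up. Only the special token IDs (101, 102, 0) come from the table above:

```python
# Toy illustration only: this mimics the output layout, not real tokenization.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

def pad_batch(token_id_lists):
    """Wrap each sequence in [CLS]/[SEP], then pad to the longest one."""
    wrapped = [[CLS_ID] + ids + [SEP_ID] for ids in token_id_lists]
    max_len = max(len(ids) for ids in wrapped)
    input_ids = [ids + [PAD_ID] * (max_len - len(ids)) for ids in wrapped]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in wrapped]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# The inner token IDs below are invented for the example.
encoded = pad_batch([[2000, 2001, 2002], [2003]])
print(encoded["input_ids"])       # [[101, 2000, 2001, 2002, 102], [101, 2003, 102, 0, 0]]
print(encoded["attention_mask"])  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```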

Note that input tensors are only stacked into matrices when they are passed to the model, so the batch size used for tokenization and for training must match, and you can’t shuffle. Otherwise the input tensors might have different lengths and fail to stack (remember they are padded to the longest in their tokenization batch). When in doubt, set batch_size=None in tokenization so that all input tensors are padded to the global maximum and share the same length.
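
The effect of the tokenization batch size on padded lengths can be sketched in plain Python. padded_lengths is a hypothetical helper and the token counts are made up; it only mimics "pad to the longest in each batch", it does not call the datasets library:

```python
def padded_lengths(lengths, batch_size=None):
    """Return the padded length each example ends up with.

    batch_size=None mimics padding across the whole split at once.
    """
    size = batch_size or max(len(lengths), 1)
    result = []
    for start in range(0, len(lengths), size):
        chunk = lengths[start:start + size]
        # Every example in a chunk is padded to that chunk's longest example.
        result.extend([max(chunk)] * len(chunk))
    return result

lengths = [5, 9, 3, 4]  # hypothetical token counts per example
print(padded_lengths(lengths, batch_size=2))     # [9, 9, 4, 4]
print(padded_lengths(lengths, batch_size=None))  # [9, 9, 9, 9]
```

With batch_size=2 the examples end up with two different lengths, so a later batch that mixes them could not be stacked into one tensor; with batch_size=None they all share one length.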

We can use the DatasetDict.map function to apply tokenize across all the splits in the corpus:

emotions_encoded = emotions.map(tokenize, batched=True, batch_size=None)

NB: we need to set batched=True, or map will operate on each example individually. Now we can look at the features and see that input_ids and attention_mask have been added: emotions_encoded['train'].features.

Now we can go about getting the hidden states. We can define a forward pass:


import numpy as np

def forward_pass(batch):
  input_ids = torch.tensor(batch["input_ids"]).to(device)
  attention_mask = torch.tensor(batch["attention_mask"]).to(device)

  with torch.no_grad():
      last_hidden_state = model(input_ids, attention_mask).last_hidden_state
      last_hidden_state = last_hidden_state.cpu().numpy()

  # Use average of unmasked hidden states for classification
  lhs_shape = last_hidden_state.shape
  boolean_mask = ~np.array(batch["attention_mask"]).astype(bool)
  boolean_mask = np.repeat(boolean_mask, lhs_shape[-1], axis=-1)
  boolean_mask = boolean_mask.reshape(lhs_shape)
  masked_mean = np.ma.array(last_hidden_state, mask=boolean_mask).mean(axis=1)
  batch["hidden_state"] = masked_mean.data
  return batch

# Apply the forward pass to every split, 16 examples at a time
emotions_encoded = emotions_encoded.map(forward_pass, batched=True,
                                        batch_size=16)

This will add a new hidden_state feature to our dataset.
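
The masked averaging inside forward_pass can be checked on a small example. The sketch below uses a random array as a stand-in for last_hidden_state (the sizes are assumptions for illustration) and compares the masked-array mean with an equivalent formulation that zeroes out padded positions and divides by the number of real tokens:

```python
import numpy as np

# Random stand-in for last_hidden_state: 2 sequences, 4 tokens, hidden size 3.
rng = np.random.default_rng(0)
last_hidden_state = rng.normal(size=(2, 4, 3))
attention_mask = np.array([[1, 1, 1, 0],
                           [1, 1, 0, 0]])

# Zero out padded positions, then divide by the number of real tokens.
weights = attention_mask[:, :, None]           # (2, 4, 1), broadcasts over hidden dim
summed = (last_hidden_state * weights).sum(axis=1)
mean_pooled = summed / weights.sum(axis=1)     # (2, 3)

# Same result via a masked array, where True means "ignore this entry".
ignore = np.repeat(~attention_mask.astype(bool)[:, :, None], 3, axis=2)
masked_mean = np.ma.array(last_hidden_state, mask=ignore).mean(axis=1).data

print(np.allclose(mean_pooled, masked_mean))  # True
```

Either form gives one vector per sequence that averages only over the unpadded tokens.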

Now we have all we need and can create the corresponding arrays in Scikit-Learn format:

X_train = np.array(emotions_encoded["train"]["hidden_state"])
X_test = np.array(emotions_encoded["validation"]["hidden_state"])

Y_train = np.array(emotions_encoded["train"]["label"])
Y_test = np.array(emotions_encoded["validation"]["label"])

X_train.shape, X_test.shape
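
With arrays in this shape, any small classifier can be fitted on the frozen features. As a rough sketch of the idea, the example below swaps in a nearest-centroid classifier written in NumPy (a deliberately simple gradient-free stand-in, not the method from the book) and runs it on synthetic features. The make_split helper, the 8-dimensional features and the two classes are all assumptions made for the illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_split(n_per_class):
    """Synthetic stand-in for hidden-state features: two well-separated classes."""
    X = np.vstack([rng.normal(loc=-2.0, size=(n_per_class, 8)),
                   rng.normal(loc=2.0, size=(n_per_class, 8))])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y

X_train, Y_train = make_split(50)
X_test, Y_test = make_split(20)

# "Training" is just one mean vector per class; no gradients are involved.
classes = np.unique(Y_train)
centroids = np.stack([X_train[Y_train == c].mean(axis=0) for c in classes])

# Predict the class whose centroid is closest in Euclidean distance.
distances = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=-1)
Y_pred = classes[distances.argmin(axis=1)]

accuracy = (Y_pred == Y_test).mean()
print(accuracy)
```

On real hidden states the classes will overlap far more than in this synthetic data, so a stronger model such as logistic regression or a Random Forest would be the more realistic choice.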

Alex's Notes
