Tokenization with Hugging Face
As presented in Tunstall et al: Chapter 02: Text Classification
They use ‘tokenization’ here to refer to the whole process of going from strings to numerical vectors. The more familiar sense of the term (eg word and sentence tokenization) is one step in this process.
Character Level Tokenization
The first tokenization process they show is character-level tokenization where character types in the document are mapped to unique integers, which are used as their representation:
text = "my document that I need to tokenize."
tokenized_text = list(text)  # list of characters!
token2idx = {}
# sort the unique characters so the same character gets the same id on every run
for idx, unique_char in enumerate(sorted(set(tokenized_text))):
    token2idx[unique_char] = idx
input_ids = [token2idx[token] for token in tokenized_text]
Now every distinct character is represented as a distinct integer id. But we can’t use these directly in ML: the ids are only category labels, yet integers imply an order and a scale, so adding or subtracting two ids produces another id that stands for some unrelated token. Instead we turn them into a 2D tensor of one-hot vectors. Then when we add two of them we get a vector with a ‘1’ at each ID present and zeros everywhere else, which is easy to interpret.
This is easy with PyTorch:
import torch
input_ids = torch.tensor(input_ids)
one_hot_encodings = torch.nn.functional.one_hot(input_ids)
one_hot_encodings.shape
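To see what the one-hot step produces without PyTorch, here is a minimal plain-Python sketch. The helper name one_hot and the toy ids are made up for illustration and are not from the book. It also shows that adding two one-hot rows marks both tokens.

```python
def one_hot(ids, num_classes):
    """Return one row per id, with a single 1 at that id's position."""
    return [[1 if col == i else 0 for col in range(num_classes)] for i in ids]

toy_ids = [2, 0, 1]  # hypothetical token ids
rows = one_hot(toy_ids, num_classes=3)

# Adding the rows for ids 2 and 0 gives a vector with a 1 at both positions
combined = [a + b for a, b in zip(rows[0], rows[1])]

print(rows)
print(combined)
```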
Character tokenization has the advantage that it helps deal with misspellings and rare words, but all linguistic structures (including words) have to be learned from the data, which requires a lot of compute, memory, and training data. So it’s rarely used in practice.
Word Tokenization
Instead of a character split, we might split on words and map those to integers instead. Then the model doesn’t have to learn the words itself.
The simplest word tokenizer would be to just split on whitespace, eg by applying Python’s split function: tokens = text.split().
But punctuation isn’t handled here, so ‘NLP’ and ‘NLP.’ would be two different tokens. Vocab size can grow to the millions by this method.
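A quick check of the punctuation problem, using a made-up sentence (the sentence and variable names are illustrative only):

```python
text = "I study NLP because I like NLP."

# Whitespace splitting leaves the trailing full stop attached to the last word
tokens = text.split()

# The vocabulary is the set of distinct tokens
vocab = sorted(set(tokens))

print(tokens)
print(vocab)
print("NLP" in vocab, "NLP." in vocab)
```

Both forms end up in the vocabulary, so the same word costs two entries.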
Large vocabs are an issue because they require NNs with an enormous number of parameters. EG if we had 1 million word tokens and wanted to compress that in the first layer of our NN to 1-thousand dimensional vectors, the weight matrix would contain 1 million x 1 thousand weights (ie 1 billion). That one layer alone is comparable to the largest GPT-2 model, which has around 1.5 billion parameters in total.
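The arithmetic, spelled out as a small sketch. The 100,000-word comparison is an added illustration, not a figure from the book's calculation:

```python
vocab_size = 1_000_000   # word tokens in the vocabulary
embedding_dim = 1_000    # size of each compressed vector

# One weight per (token, dimension) pair in the first layer
num_weights = vocab_size * embedding_dim
print(f"{num_weights:,}")  # 1,000,000,000

# Illustrative comparison: a vocabulary capped at 100,000 words
reduced_weights = 100_000 * embedding_dim
print(f"{reduced_weights:,}")  # 100,000,000
```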
There are several familiar approaches to reducing vocab size. EG many word tokenizers handle punctuation more gracefully. We can also stem or lemmatize to normalise word types, at the cost of information. We can also keep only, say, the 100,000 most common words in the corpus and replace every other word with the ‘UNK’ token. Again, this loses information.
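Capping the vocabulary can be sketched in a few lines. The toy corpus and the cap of 3 are made up for illustration; a real cap would be far larger, such as the 100,000 mentioned above.

```python
from collections import Counter

corpus = "the cat sat on the mat the dog sat on the log".split()
max_vocab = 3  # illustrative cap, tiny so the effect is visible

counts = Counter(corpus)

# Keep only the most frequent words
vocab = {word for word, _ in counts.most_common(max_vocab)}

# Every word outside the kept vocabulary becomes the unknown token
reduced = [word if word in vocab else "UNK" for word in corpus]

print(reduced)
```

Note that ‘cat’, ‘mat’, ‘dog’ and ‘log’ all collapse into the same token, which is the information loss described above.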
There’s another, now very popular approach that tries to combine the best aspects of character and word tokenization without throwing away a lot of information: subword tokenization.
Subword Tokenization
The idea is to fall back on characters and small word parts where we need them, to deal with rare words and misspellings, and otherwise to keep frequent words and word parts as unique entities.
There are several subword tokenization algorithms: Byte-Pair Encoding (BPE), WordPiece, Unigram, and SentencePiece. Most adopt a similar strategy:
1. Simple Tokenization. The corpus is split into words, usually according to whitespace and punctuation rules.
2. Counting. Words in the corpus are counted and the tally is stored.
3. Splitting. Words in the tally are split into subwords, initially characters.
4. Subword pairs counting. Subword pairs are counted, using the tally.
5. Merging. Based on a rule, some subword pairs are merged in the corpus.
6. Stopping. The process stops when a predefined vocab size is reached.
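The steps above can be sketched in plain Python. This is a minimal illustration that assumes the BPE merge rule (merge the most frequent adjacent pair); the function name learn_bpe and the toy corpus are made up, and real implementations such as the HF tokenizers library differ in pre-tokenization, special tokens, and efficiency.

```python
from collections import Counter

def learn_bpe(corpus, vocab_size):
    # Simple tokenization and counting: split on whitespace, tally the words
    word_counts = Counter(corpus.split())
    # Splitting: every counted word starts out as a list of characters
    splits = {word: list(word) for word in word_counts}
    vocab = {ch for word in word_counts for ch in word}
    merges = []
    # Stopping: loop until the vocab reaches the target size
    while len(vocab) < vocab_size:
        # Subword pair counting, weighted by how often each word occurs
        pair_counts = Counter()
        for word, count in word_counts.items():
            symbols = splits[word]
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single subword
        # Merging: assumed rule is "most frequent pair wins"
        best = max(pair_counts, key=pair_counts.get)
        merged = best[0] + best[1]
        for word in word_counts:
            symbols, out, i = splits[word], [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            splits[word] = out
        merges.append(best)
        vocab.add(merged)
    return merges, splits

# Toy corpus: 7 distinct characters, so a target of 9 allows two merges
merges, splits = learn_bpe("low low low lower lowest", vocab_size=9)
print(merges)
print(splits)
```

On this corpus the frequent word ‘low’ ends up as a single subword, while the rarer endings of ‘lower’ and ‘lowest’ stay split into characters.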
For more see the Tokenizer summary in the HF docs.
Pretrained Tokenizers
Using HF’s from_pretrained method lets you experiment with a variety of Transformer models and tokenizers by changing the name in the code.
EG you can experiment with DistilBERT and then when you’re ready swap it for a larger BERT model if you need to. Start with a smaller model to build a working prototype quickly. Once it works, experiment with larger models for performance gains.
Here’s how you can load and inspect a tokenizer:
from transformers import AutoTokenizer
# you can just swap this out to play with models
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# inspect the tokenizer
tokenizer.vocab_size
tokenizer.special_tokens_map
tokenizer.model_max_length  # maximum context size
encoded_str = tokenizer.encode("this is a complextest")
encoded_str  # [101,2023,2003,1037,8552,22199,102]
# decode each id on its own to see which token it maps to
for token in encoded_str:
    print(token, tokenizer.decode([token]))
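DistilBERT's tokenizer is a WordPiece tokenizer, and as far as I understand it WordPiece splits each word at encoding time by greedy longest-match-first lookup in its vocab. The sketch below illustrates that idea only: the function name wordpiece_split and the toy vocab are hypothetical, and the real tokenizer also handles lowercasing, punctuation, and special tokens. A split along these lines would explain why ‘complextest’ comes back as two ids between the special tokens.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword tokens."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # prefix marks a word continuation
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # nothing matched, so the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocab, not DistilBERT's real one
toy_vocab = {"this", "is", "a", "complex", "test", "##test"}

print(wordpiece_split("complextest", toy_vocab))
print(wordpiece_split("xyz", toy_vocab))
```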