Hugging Face: An Introduction
As presented in Tunstall et al.: Natural Language Processing with Transformers
Hugging Face arose as a solution to the problem of applying a new machine learning architecture to a new task. Typically that would involve:
Implement the model architecture in code (PyTorch, TensorFlow).
Load the pretrained weights (if available) from wherever the authors happen to host them.
Preprocess the inputs, pass them through the model, apply some task-specific post-processing.
Implement data loaders and define loss functions and optimizers to train the model.
It might take days of engineering to adapt a new model, even if the researchers had released their code.
Hugging Face tackles this by providing a common API to a wide range of transformer models, as well as code and tools to adapt them to new use cases. It integrates with frameworks such as PyTorch and TensorFlow.
It provides task-specific ‘heads’ so you can fine-tune transformers on downstream tasks such as classification, named entity recognition (NER), and question answering (QA), together with modules to train them.
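As a sketch of what loading a model with a task-specific head looks like (the checkpoint name here is only an example, not one the notes prescribe), the library's Auto classes attach a classification head to a pretrained body:

```python
from transformers import AutoModelForSequenceClassification

# Load a pretrained body and attach a (randomly initialised)
# sequence-classification head with two output labels.
# "distilbert-base-uncased" is just an example checkpoint.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)
```

The head's weights start untrained, which is why fine-tuning on labelled data is the next step.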
Introduction to the Library
Hugging Face Transformers has a layered API which allows you to interact with the library at different levels of abstraction.
The highest level API is the pipeline API which abstracts away all the steps needed to convert raw text into a set of predictions from a fine-tuned model.
We instantiate a pipeline by providing the name of the task we are interested in. For example:
from transformers import pipeline
import pandas as pd

text = "my sample text, I love it!"

# Instantiate pipelines by task name
classifier = pipeline("sentiment-analysis")
ner_tagger = pipeline("ner", aggregation_strategy="simple")

# Run predictions and display them as data frames
sentiment = classifier(text)
sentiment_df = pd.DataFrame.from_records(sentiment)

ner_tags = ner_tagger(text)
entity_df = pd.DataFrame.from_records(ner_tags)
Each task has a default model, fine-tuned for that task; for sentiment analysis it is a model fine-tuned on the Stanford Sentiment Treebank.
Once we have a pipeline we can make a prediction. Each pipeline takes a string, or list of strings as input and returns a list of predictions. The prediction is a Python dictionary. We can use pandas, as above, to display our predictions as a nice data frame.
Pipelines are also shown for question answering, summarisation, translation, and text generation.
These initial demos all use publicly available, fine-tuned models (often the default one, though the model can easily be overridden when calling the pipeline method). Generally, though, you’ll want to fine-tune on your own data, which is what the book focuses on.
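Overriding the task default can be sketched by passing any model identifier from the Hub (the checkpoint named here is one commonly used example, not a requirement):

```python
from transformers import pipeline

# Pass a Hub model id explicitly instead of relying on the default
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
result = classifier("I love it!")
```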
The Hugging Face Ecosystem
There is now a whole ecosystem around Hugging Face with many libraries and tools. There are two main parts:
A family of libraries that provide the code.
The Hub provides the pretrained weights, datasets, evaluation metrics etc.
The Hub contains over 10,000 models, i.e. pretrained model weights. Beyond the Hub, the ecosystem includes the Tokenizers library, which lets us load a tokenizer in the same way as we load model weights.
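Loading a tokenizer mirrors loading model weights; a minimal sketch (the checkpoint name is an example):

```python
from transformers import AutoTokenizer

# The same from_pretrained idiom used for models works for tokenizers
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoding = tokenizer("my sample text, I love it!")
# encoding contains fields such as input_ids and attention_mask
```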
The Datasets library provides a standard interface for thousands of datasets from the Hub.
The Datasets library manages RAM use through memory mapping, and is interoperable with NumPy and Pandas.
The Datasets library provides scripts for many metrics too.