Tunstall et al: Chapter 02: Text Classification
Metadata
Title: Text Classification
Number: 2
Book: Tunstall et al: Natural Language Processing with Transformers
Intro
Until around 2018, the deep learning approach to text classification involved finding a suitable neural architecture for the task and training it from scratch on labeled data. This had three main drawbacks:
You needed a lot of labeled data to train accurate CNNs or RNNs
Training from scratch was time consuming and expensive
The trained model couldn’t be easily adapted to other tasks (e.g., a different set of labels)
These limitations are overcome by transfer learning. We’ll focus here on a famous Transformer model called BERT (Bidirectional Encoder Representations from Transformers), and how to use it with the Hugging Face libraries.
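As a taste of what transfer learning looks like in practice, here is a minimal sketch of loading a pretrained BERT checkpoint with the transformers library. The bert-base-uncased checkpoint is just one common choice, and the example sentence is my own:

```python
from transformers import AutoModel, AutoTokenizer

# Download (or load from cache) a pretrained checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Tokenize a sentence and run it through the encoder.
inputs = tokenizer("Transfer learning is powerful.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden_size)
```

The pretrained encoder already “knows” a lot about language; later sections attach a classification head and fine-tune it on a small labeled dataset instead of training from scratch.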
Inspecting Datasets
The Hugging Face Datasets hub stores a lot of NLP benchmark datasets and metrics. The library accompanying it is designed to load and process them efficiently and simplify interoperability between NumPy, Pandas, PyTorch, and TensorFlow.
We can get a list of them by doing: from datasets import list_datasets
and then datasets = list_datasets()
Or more involved:
from datasets import list_datasets, load_dataset
datasets = list_datasets()  # list of dataset names
metadata = list_datasets(with_details=True)[datasets.index("emotion")]
print(metadata.description)
print(metadata.citation)
emotions = load_dataset("emotion") # returns DatasetDict
train_ds = emotions["train"] # returns Dataset
len(train_ds)
train_ds[:5] # first five items
train_ds.column_names
train_ds.features # see the types and values of the column
train_ds['text'][:5] # just the first five texts
Note load_dataset can be used to load data from a CSV file: load_dataset("csv", data_files="my_texts.csv")
For more on loading local files see the docs
The object returned by load_dataset is a DatasetDict, which behaves like a Python dictionary whose keys correspond to the splits, e.g. emotions["train"]. The value at each key is a Dataset object, which behaves like an ordinary Python container: you can index it by integer, slice it, query its length, and so on. Under the hood, a datasets.Dataset is backed by Apache Arrow, a typed columnar format.
Integration with Pandas and Data Exploration
Sometimes we want to use a high-level API like Pandas to visualize or analyze the dataset. We do this with the set_format
method which changes the output format of the Dataset object. It doesn’t change the underlying data format so you can change it again later when you want a different output.
import pandas as pd
#previous code assumed
emotions.set_format(type="pandas")
df = emotions["train"][:]
# this is needed if we want to display the label name
# rather than an integer representation
def label_int2str(row, split):
    return emotions[split].features["label"].int2str(row)

df["label_name"] = df["label"].apply(label_int2str, split="train")
df.head()
We can do some simple visualizations to inspect the data:
import matplotlib.pyplot as plt
# Bar plot of the category counts
df["label_name"].value_counts(ascending=True).plot.barh()
plt.title("Category Counts")
# Box plot of the lengths of our texts
df["Words Per Document"] = df["text"].str.split().apply(len)
df.boxplot("Words Per Document", by="label_name", grid=False, showfliers=False, color="black")
plt.suptitle("")
plt.xlabel("")
Tokenization
The chapter then presents a deeper dive into tokenization with Hugging Face.
Training the Classifier
Finally, the chapter presents how to go about text classification with Hugging Face.