Alex's Notes

Tunstall et al: Chapter 02: Text Classification

Metadata

Intro

Until 2018, the DL approach to text classification was to find a suitable neural architecture for the task and train it from scratch on a labeled dataset. This had three main drawbacks:

  • You needed a lot of labeled data to train accurate CNNs or RNNs

  • Training from scratch was time consuming and expensive

  • The trained model couldn’t be easily adapted to other tasks (e.g., different labels)

These limitations are overcome by transfer learning. We’ll focus here on a famous Transformer model called BERT (Bidirectional Encoder Representations from Transformers), and its use in Hugging Face.

Inspecting Datasets

The Hugging Face Datasets hub stores a lot of NLP benchmark datasets and metrics. The library accompanying it is designed to load and process them efficiently and simplify interoperability between NumPy, Pandas, PyTorch, and TensorFlow.

We can get a list of them by doing: from datasets import list_datasets and then datasets = list_datasets()

Or more involved:

from datasets import list_datasets, load_dataset

metadata = list_datasets(with_details=True)[datasets.index("emotion")]

print(metadata.description)
print(metadata.citation)

emotions = load_dataset("emotion") # returns DatasetDict

train_ds = emotions["train"] # returns Dataset
len(train_ds)
train_ds[:5] # first five items
train_ds.column_names
train_ds.features # the types and values of the columns

train_ds['text'][:5] # just the first five texts

Note that load_dataset can also load data from a local CSV file: load_dataset("csv", data_files="my_texts.csv")

For more on loading local files, see the docs.
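As a quick sanity check of the column layout load_dataset("csv", ...) would pick up, the same file can be written and read back locally; the two-column text/label layout below is a hypothetical example, not something the library requires:

```python
import csv
import os
import tempfile

import pandas as pd

# Hypothetical two-column layout; the real column names are whatever
# the CSV header row declares.
rows = [("i feel great", 1), ("this is awful", 0)]

path = os.path.join(tempfile.mkdtemp(), "my_texts.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "label"])  # header row becomes the column names
    writer.writerows(rows)

# load_dataset("csv", data_files=path) would parse this same file;
# pandas gives a quick local check of the columns it will see.
df = pd.read_csv(path)
print(df.columns.tolist())  # ['text', 'label']
print(len(df))              # 2
```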

The dataset returned will be a DatasetDict, which behaves like a Python dictionary whose keys correspond to the splits, for example emotions["train"]. The value at each key is a Dataset object.

datasets.Dataset behaves like an ordinary Python container: you can index it by integer, query its length, and so on. It’s backed by Apache Arrow, a typed columnar format.
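To make that indexing behavior concrete without downloading anything, here is a toy stand-in (not the real Hugging Face classes) that mimics how the accesses used above behave — a column-oriented store where an integer gives a row dict, a string gives a column, and a slice gives a dict of column slices:

```python
class ToyDataset:
    """Toy column-oriented container mimicking datasets.Dataset indexing."""

    def __init__(self, columns):
        self.columns = columns  # dict: column name -> list of values

    def __len__(self):
        return len(next(iter(self.columns.values())))

    def __getitem__(self, key):
        if isinstance(key, str):
            # ds["text"] -> the whole column as a list
            return self.columns[key]
        # ds[i] -> one row as a dict; ds[:5] -> dict of column slices
        return {name: vals[key] for name, vals in self.columns.items()}

    @property
    def column_names(self):
        return list(self.columns)


# A "DatasetDict" is essentially a dict mapping split names to Datasets.
emotions_toy = {
    "train": ToyDataset({"text": ["i feel great", "so sad today"],
                         "label": [1, 0]}),
}

train = emotions_toy["train"]
print(len(train))          # 2
print(train[0])            # {'text': 'i feel great', 'label': 1}
print(train["text"])       # ['i feel great', 'so sad today']
print(train.column_names)  # ['text', 'label']
```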

Integration with Pandas and Data Exploration

Sometimes we want to use a high-level API like Pandas to visualize or analyze the dataset. We do this with the set_format method which changes the output format of the Dataset object. It doesn’t change the underlying data format so you can change it again later when you want a different output.

import pandas as pd
#previous code assumed

emotions.set_format(type="pandas")
df = emotions["train"][:]

# this is needed if we want to display the label name
# rather than an integer representation
def label_int2str(row, split):
  return emotions[split].features["label"].int2str(row)

df["label_name"] = df["label"].apply(label_int2str, split="train")
df.head()
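The int2str mapping can be exercised on a toy frame without loading the dataset. The label vocabulary below is hardcoded for illustration; the real names live in the dataset’s ClassLabel feature (emotions["train"].features["label"].names):

```python
import pandas as pd

# Assumed label vocabulary for the toy example; the dataset itself
# is the source of truth for the real names and their order.
label_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]

df = pd.DataFrame({"text": ["i feel great", "so sad"], "label": [1, 0]})

# Equivalent of int2str for this toy vocabulary: index into the name list.
df["label_name"] = df["label"].apply(lambda i: label_names[i])
print(df["label_name"].tolist())  # ['joy', 'sadness']
```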

We can do some simple visualizations to inspect the data:

import matplotlib.pyplot as plt

# Bar plot of the category counts
df["label_name"].value_counts(ascending=True).plot.barh()
plt.title("Category Counts")

# Box plot of the lengths of our texts
df["Words Per Document"] = df["text"].str.split().apply(len)
df.boxplot("Words Per Document", by="label_name", grid=False, showfliers=False, color="black")
plt.suptitle("")
plt.xlabel("")

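The quantities behind both plots can be checked on toy data without drawing anything: value_counts feeds the bar plot, and the str.split().apply(len) pattern computes the words-per-document column for the box plot:

```python
import pandas as pd

# Toy frame standing in for the emotion training split.
df = pd.DataFrame({
    "text": ["i feel great", "so very sad today", "ok"],
    "label_name": ["joy", "sadness", "joy"],
})

# Same computation as the box-plot column above.
df["Words Per Document"] = df["text"].str.split().apply(len)
print(df["Words Per Document"].tolist())  # [3, 4, 1]

# value_counts is what the category bar plot visualizes.
counts = df["label_name"].value_counts()
print(counts["joy"])  # 2
```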
Tokenization

The chapter then presents a deeper dive into tokenization with Hugging Face.

Training the Classifier

Finally, the chapter presents how to train a text classifier with Hugging Face.