Chollet: Chapter 12: Generative Deep Learning
Metadata
Title: Generative Deep Learning
Number: 12
Core Ideas
The potential of AI to emulate human thought goes beyond passive tasks like object recognition; it extends to creative activities. A lot of creation depends on pattern recognition and technical skill. ML models can learn the statistical latent space of images, music, and language, and then sample from this space, creating new works similar to those seen during training. That sampling is just a mathematical operation, not a creative act, but it can work in tandem with creative humans to generate meaning.
Text Generation
The first example is using RNNs to generate text sequences. The same idea could be applied to generate paintings stroke by stroke, or music note by note. It’s not just applicable in artistic contexts: Gmail’s Smart Reply is based on this sort of language modeling.
The universal way to generate sequence data in deep learning is to train a model (usually a Transformer or an RNN) to predict the next token or next few tokens in a sequence, using the previous tokens as input.
In the case of text, any network that can model the probability of the next token given the prior tokens is called a language model. A language model captures the statistical structure of language, its latent space.
Once you have a trained language model, you sample from it (generate new sequences). You feed in an initial string of text (the conditioning data), ask it to generate the next token or set of tokens, then feed that back in and keep going. You can generate sequences of arbitrary length.
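The loop described above can be sketched as follows. This is a minimal illustration, not the book's code: `toy_language_model` is a made-up deterministic stand-in for a trained model, and the five-word vocabulary is invented for the example.

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]

def toy_language_model(tokens):
    # Stand-in for a trained language model: deterministically maps the
    # context seen so far to a probability distribution over the next token.
    rng = np.random.default_rng(len(tokens))
    logits = rng.normal(size=len(vocab))
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def generate(conditioning_tokens, num_new_tokens):
    # The generation loop: predict a next-token distribution, pick a token,
    # append it, and feed the extended sequence back into the model.
    tokens = list(conditioning_tokens)
    for _ in range(num_new_tokens):
        probs = toy_language_model(tokens)
        next_token = vocab[int(np.argmax(probs))]  # greedy sampling, for simplicity
        tokens.append(next_token)
    return tokens

print(generate(["the", "cat"], 5))
```

A real model would replace `toy_language_model` with a forward pass over tokenized text, but the feed-back structure of the loop is the same.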
The key issue is sampling strategy. Naively we could just use greedy sampling, always picking the maximum-probability token from the model’s output distribution. But this leads to boring, repetitive sequences.
So we want some kind of stochastic sampling strategy, where words are chosen randomly on a probability basis.
The issue becomes controlling the amount of randomness in the sampling process. Too much randomness and the entropy is too high: the sequence degenerates into unstructured gibberish. Too little randomness and the entropy is too low: the sequence becomes repetitive and predictable.
One way is to just sample from the ‘real’ probability distribution seen during training (i.e. the output of a softmax layer). But there are other ways to do it, and it’s good to explore to see what produces interesting results.
In order to control this, we have the concept of softmax temperature, which characterizes the entropy of the probability distribution used for sampling. It characterizes how surprising or predictable the next token chosen will be.
Given a temperature value, we get a new probability distribution from the softmax output as follows:
import numpy as np

def reweight_distribution(original_distribution, temperature=0.5):
    # Lower temperature sharpens the distribution (more predictable);
    # higher temperature flattens it (more surprising).
    distribution = np.log(original_distribution) / temperature
    distribution = np.exp(distribution)
    # Renormalize so the probabilities sum to 1.
    return distribution / np.sum(distribution)
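A quick sanity check of what temperature does, restating the function here so the snippet runs standalone; the four-value example distribution is made up:

```python
import numpy as np

def reweight_distribution(original_distribution, temperature=0.5):
    distribution = np.log(original_distribution) / temperature
    distribution = np.exp(distribution)
    return distribution / np.sum(distribution)

original = np.array([0.5, 0.3, 0.15, 0.05])
cold = reweight_distribution(original, temperature=0.1)  # sharper: near-greedy
hot = reweight_distribution(original, temperature=2.0)   # flatter: more random

# Stochastic sampling: draw a next-token index using the reweighted probabilities.
rng = np.random.default_rng(0)
token_index = rng.choice(len(original), p=hot)
```

At low temperature the top token's probability grows toward 1; at high temperature the distribution moves toward uniform.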
pp. 370-5 walk through generating movie reviews and the effect of different temperatures; always experiment with different sampling strategies. Industrial models like GPT-3 are deeper, stacked versions of this toy model (with more decoder layers) trained on massive corpora.
Just remember that all you’re ever doing is sampling data from a statistical model of which words come after which words. It’s all form and no substance. Language models cannot communicate, cannot act in the world, cannot be social, and have no thoughts to process with words. They just capture the statistical structure of the observable artifacts - books, movie reviews, tweets - that we generate as we use language to live and work. The fact that these artifacts have a statistical structure is a side effect of how we use language.
Image Generation
The chapter uses the DeepDream algorithm to generate images. This adapts filter-visualization where you run a convnet in reverse, doing gradient ascent on the input to maximize the activation of a specific filter in the upper layer.
With DeepDream, instead of maximizing a specific filter you maximize the activation of entire layers, mixing together visualizations of large numbers of features at once.
You start from an existing image, distorting the elements of the image in artistic fashion. The input images are processed at different scales or octaves.
To build a DeepDream model, you start with a pretrained convnet, like Inception. You use that to build a feature extractor.
You define a loss: we try to maximize a weighted mean of the L2 norms of the activations of a set of high-level layers. You then run a gradient ascent process for each octave of interest.
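A minimal sketch of the gradient-ascent step, at a single scale (no octaves). To keep it self-contained, a toy differentiable activation with an analytic gradient stands in for the Inception feature extractor; a real implementation would use automatic differentiation through the convnet.

```python
import numpy as np

def layer_activation(img, weights):
    # Toy stand-in for a convnet layer's activation: mean squared response.
    return np.mean((img * weights) ** 2)

def activation_gradient(img, weights):
    # Analytic gradient of the toy activation with respect to the input image.
    return 2 * img * weights ** 2 / img.size

def gradient_ascent(img, weights, steps=50, step_size=0.1):
    # DeepDream's core move: modify the INPUT image to increase the
    # activation, rather than modifying the network's weights.
    for _ in range(steps):
        img = img + step_size * activation_gradient(img, weights)
    return img

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 8))
weights = rng.normal(size=(8, 8))
dreamed = gradient_ascent(img, weights)
```

The octave loop in the book repeats this process on successively upscaled versions of the image, re-injecting lost detail between scales.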
There is some more fiddly stuff to improve the quality of the resulting image.
Style Transfer
The next case study is neural style transfer, now a mainstay of photo apps, which consists of applying the style of a reference image while conserving the content of a target image.
Style here means textures, colours, and visual patterns in the image, while content is the higher-level macrostructure of the image. This was a longstanding preoccupation of the computer vision community, but NNs achieved really strong performance on the task.
The key notion is the loss function, which specifies that we wish to preserve the content of the original image while adopting the style of the reference. Mathematically this looks like:
loss = (distance(style(reference_image) - style(combination_image)) +
distance(content(original_image) - content(combination_image)))
distance is a function like the L2 norm; content and style are functions that take an image and produce a representation of its content and style respectively. The fundamental observation of Gatys et al. was that NNs could define those content and style functions.
The convnet activations represent information about the image at different layers of abstraction. Higher layers contain increasingly abstract, global information. The content of an image, global and abstract, is captured in the representations of the upper layers.
So a good candidate for content loss is the L2 norm between the activations of an upper layer in the convnet.
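A sketch of that content loss in NumPy, operating on activation arrays that would come from an upper convnet layer (the function name and the use of a plain sum of squared differences are this note's choices, not the book's exact code):

```python
import numpy as np

def content_loss(base_activations, combination_activations):
    # Squared L2 distance between the upper-layer activations of the
    # original (content) image and the combination image being generated.
    return np.sum((combination_activations - base_activations) ** 2)

# Example with made-up (height, width, channels) activation arrays:
a = np.ones((4, 4, 8))
b = np.zeros((4, 4, 8))
```

Identical activations give zero loss; the further the combination image's upper-layer representation drifts from the original's, the larger the penalty.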
The style loss is a bit trickier, defined over multiple layers of the convnet. You try to capture the appearance of the style reference image at all spatial scales extracted by the convnet. The paper uses the Gram matrix of a layer’s activations, which is the inner product of the feature maps of that layer. This is like a map of the correlations between the layer’s features - the patterns found at a particular scale. The patterns correspond to textures seen at that scale.
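The Gram matrix and a per-layer style loss can be sketched as below. The normalization constant follows the Gatys-style formulation; the exact scaling in a given implementation may differ, so treat this as illustrative.

```python
import numpy as np

def gram_matrix(activations):
    # activations: (height, width, channels) feature maps from one layer.
    # Flatten spatial dimensions, then take inner products between channels:
    # entry (i, j) measures how often features i and j co-occur.
    features = activations.reshape(-1, activations.shape[-1])
    return features.T @ features  # shape: (channels, channels)

def style_loss(style_activations, combination_activations):
    s = gram_matrix(style_activations)
    c = gram_matrix(combination_activations)
    channels = style_activations.shape[-1]
    size = style_activations.shape[0] * style_activations.shape[1]
    # Squared distance between feature-correlation maps, normalized by
    # layer size so losses from different layers are comparable.
    return np.sum((s - c) ** 2) / (4.0 * channels ** 2 * size ** 2)
```

The full style loss sums this quantity over several layers, capturing texture statistics at multiple spatial scales.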
pp. 385 - 390 walk through an example of setting up a style transfer algorithm in Keras.
Variational Autoencoders
The most popular application of creative AI is image generation, producing images that ‘look like the real thing’. There are two main techniques currently: variational autoencoders (VAEs) and generative adversarial networks (GANs).
You can apply VAEs and GANs to other spaces (sound, text, etc.), but the best results so far have been on images. [NB Chollet doesn’t mention that text is a discrete, categorical space, so text generation will likely need different techniques to continuous spaces like images.]
Latent Space
The key idea of image generation is to develop a low-dimensional latent space of representations where any point can be mapped to a “valid” image.
The module capable of taking a latent point and mapping it to a generated image is called a generator in GANs or decoder in VAEs.
Once you’ve learned the latent space, you can sample points from it and decode them into images. The generated images are the points in between those seen in the training set.
VAEs learn latent spaces that are well-structured, where specific directions encode a meaningful axis of variation in the data. GANs generate images that can be highly realistic, but the latent space they come from may not be so well-structured.
The latent space is not dissimilar to the space of word embeddings in NLP. Given a latent space, or an embedding space, certain directions in that space may encode interesting axes of variation in the original data.
For example, in the latent space of faces, there may be a ‘smile vector’ - such that moving a point along that vector yields the same face, but smiling.
If you find those vectors you can edit images in a meaningful way.
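Latent-space editing amounts to simple vector arithmetic before decoding. A toy sketch, where `toy_decoder` is a made-up stand-in for a trained VAE decoder or GAN generator and the ‘smile vector’ is assumed to have been found already (in practice, e.g. by averaging latent codes of smiling vs non-smiling faces):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 64))  # toy decoder weights: 2-D latent space -> 64 "pixels"

def toy_decoder(z):
    # Stand-in for a trained decoder/generator mapping a latent point to an image.
    return np.tanh(z @ W).reshape(8, 8)

z = rng.normal(size=2)                # a latent point (some face)
smile_vector = np.array([1.0, 0.0])   # hypothetical direction encoding "smiling"
original_face = toy_decoder(z)
smiling_face = toy_decoder(z + 0.5 * smile_vector)  # same point, moved along the vector
```

The edit happens entirely in latent space; the decoder is applied unchanged to the shifted point.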