Compared to discriminative models, which directly learn the conditional distribution P(Y|X=x)
(such as kNN, SVM, neural networks, etc.), generative models learn the joint distribution P(X,Y),
under the assumption that the data is generated by this underlying distribution.
Examples of generative models include
Naive Bayes, Gaussian Mixture Models, and Bayesian Networks.
One of the interesting capabilities of machine learning in AI is creative work: creating new
images, text, and even music and videos. This rests on the assumption that everything around us
has some statistical structure and is generated by some unknown, complex underlying process;
the goal is to use machine learning to learn this underlying structure and then sample from the
statistical latent space to create new data. Generative models are built for exactly these tasks:
learn the process that generated the training data and sample from the latent space to produce
new data we may have never seen before.
The following pictures were generated by an algorithm known as neural style transfer (as the name
suggests, it uses deep learning to apply the style of one picture to another and thereby generate
new pictures). As shown below, three styles (sky, beach, snow) from three different reference
images are applied to a picture of a city view (the content image). The newly generated images
still show the city view content, but the background (style) of each image is changed according
to the style of its reference image.
This is done with a simple Convolutional Neural Network (ConvNet): the upper-level activations of
the ConvNet capture the more global and abstract content of the input image, while the internal
correlations between activations capture the style of the reference image (style is more local,
relating to textures, colours, etc.).
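To make this concrete, here is a minimal sketch of the two losses typically used in neural style transfer, assuming the feature maps have already been extracted from chosen layers of a pretrained ConvNet (the layer choice and the NumPy-only formulation are simplifications for illustration):

```python
import numpy as np

def gram_matrix(features):
    # Channel-to-channel correlations of one layer's activations;
    # `features` has shape (height, width, channels).
    h, w, c = features.shape
    flat = features.reshape(h * w, c)
    return flat.T @ flat / (h * w)

def content_loss(content_features, generated_features):
    # Upper-layer activations capture the global, abstract content,
    # so content is matched on the raw activations.
    return np.mean((content_features - generated_features) ** 2)

def style_loss(style_features, generated_features):
    # Style (textures, colours) is captured by the correlations between
    # activations, so style is matched on the Gram matrices instead.
    return np.mean((gram_matrix(style_features) - gram_matrix(generated_features)) ** 2)
```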
Briefly speaking, autoencoders are data-compression algorithms used to learn an efficient
latent representation of the data (dimensionality reduction) by training the model to remove the
"noise" from the data. As in the example shown below, an autoencoder consists of an encoder that
compresses the input into a low-dimensional latent code and a decoder that reconstructs the input
from that code.
The first example uses a simple single-layer neural network for both the encoder and the decoder. The original images are of size 50*50*3 (RGB), i.e. dimension 7,500, and the encoded latent representation has dimension 256, giving a compression ratio of 7500/256 ≈ 29.3. A comparison between the original and reconstructed images is shown below. For illustration, each autoencoder is trained on two different datasets: the Anime dataset and the Fruit dataset.
Anime dataset. The loss on the test set is 0.55 and the reconstructed images do not capture the originals well, but interestingly the reconstructions display the various styles of images present in the dataset.
Fruit dataset. The loss on the test set is 0.40 and again the reconstructed images do not capture the originals well. Note also that the fruit images have less detail than the anime images, which possibly makes them easier to encode.
A single-layer autoencoder with a linear activation function and squared-error loss is nearly equivalent (it spans the same subspace) to Principal Component Analysis (PCA), which projects the data onto a subspace. Note that the autoencoder here is not exactly linear, as the activation functions are non-linear (ReLU and sigmoid).
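A minimal sketch of this single-layer autoencoder in Keras, assuming flattened 50*50*3 inputs and a 256-dimensional code as described above; the ReLU/sigmoid activations follow the note on non-linearity, while the optimizer and binary cross-entropy loss are assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

input_dim, latent_dim = 50 * 50 * 3, 256        # 7,500 -> 256, ratio ~29.3

inputs = keras.Input(shape=(input_dim,))
encoded = layers.Dense(latent_dim, activation="relu")(inputs)      # encoder
decoded = layers.Dense(input_dim, activation="sigmoid")(encoded)   # decoder

autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
# autoencoder.fit(x_train, x_train, ...) with the images flattened and scaled to [0, 1]
```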
The second example uses a 3-layer neural network for both the encoder and the decoder. Again the encoded latent representation has a dimension of 256, so the compression ratio is the same as for the simple autoencoder. A comparison between the original and reconstructed images is shown below.
Anime dataset. The loss on the test set is still 0.55 and no improvement can be seen in the reconstructed images, perhaps because the latent dimension is not large enough to encode data with such complex structure.
Fruit dataset. The loss on the test set is 0.38 and the reconstructed images are slightly better than those from the simple autoencoder.
Note that deep autoencoders project the data not onto a subspace but onto a non-linear manifold, which means they can learn much more complex encodings for the same encoding dimension than simple autoencoders and PCA. However, the bottleneck on the encoding dimension can still be a major problem when encoding images with complex structure.
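For comparison, a sketch of the deeper 3-layer version with the same 256-dimensional bottleneck; the intermediate layer widths (1024 and 512) are assumptions, as the text only fixes the number of layers and the code size:

```python
from tensorflow import keras
from tensorflow.keras import layers

input_dim, latent_dim = 50 * 50 * 3, 256

inputs = keras.Input(shape=(input_dim,))
x = layers.Dense(1024, activation="relu")(inputs)
x = layers.Dense(512, activation="relu")(x)
encoded = layers.Dense(latent_dim, activation="relu")(x)            # bottleneck

x = layers.Dense(512, activation="relu")(encoded)
x = layers.Dense(1024, activation="relu")(x)
decoded = layers.Dense(input_dim, activation="sigmoid")(x)

deep_autoencoder = keras.Model(inputs, decoded)
deep_autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
```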
The third example uses a convolutional neural network for both the encoder and the decoder. In this case the encoded latent representation has a much higher dimension than in the deep and simple autoencoders, which means it has a much higher entropic capacity. A comparison between the original and reconstructed images is shown below.
Anime dataset. Compared to the previous autoencoders, the convolutional autoencoder encodes the complex images much better, even though it still misses many of the details (especially shades of colour).
Fruit dataset. Similarly, the convolutional autoencoder gives much better reconstructed images than the previous autoencoders.
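A sketch of a convolutional autoencoder of this kind; the filter counts and the single pooling/upsampling step are assumptions for illustration, chosen so that the spatial latent code (25*25*16 = 10,000 values) is much larger than the 256-dimensional codes above:

```python
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(50, 50, 3))
x = layers.Conv2D(32, 3, activation="relu", padding="same")(inputs)
x = layers.MaxPooling2D(2, padding="same")(x)                         # 50x50 -> 25x25
encoded = layers.Conv2D(16, 3, activation="relu", padding="same")(x)  # 25x25x16 code

x = layers.Conv2D(16, 3, activation="relu", padding="same")(encoded)
x = layers.UpSampling2D(2)(x)                                         # 25x25 -> 50x50
decoded = layers.Conv2D(3, 3, activation="sigmoid", padding="same")(x)

conv_autoencoder = keras.Model(inputs, decoded)
conv_autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
```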
In practice, autoencoders are not very useful for data compression compared to algorithms such as JPEG, mainly because they are lossy (outputs are always degraded) and very specific to the data they were trained on (they cannot be generalized to a broad range of images). However, autoencoders do have many useful applications, such as data denoising, dimensionality reduction for visualization, and anomaly detection.
Compared to traditional autoencoders, variational autoencoders (VAEs) are generative models: not only can they compress data, but they can also generate new data (e.g. images) that we have never seen before. As shown in the graph below, instead of encoding the input into a fixed code, the encoder encodes the input into the parameters of a statistical distribution, which allows stochastic generation of the encoding. In other words, for the same input, the reconstructed output will differ on each pass because of the randomness introduced by sampling from the latent distribution.
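A minimal sketch of what "encoding the input into the parameters of a distribution" looks like in Keras; the latent size of 256 and the hidden-layer width are assumptions carried over from the autoencoders above:

```python
from tensorflow import keras
from tensorflow.keras import layers

input_dim, latent_dim = 50 * 50 * 3, 256   # assumed sizes

inputs = keras.Input(shape=(input_dim,))
h = layers.Dense(512, activation="relu")(inputs)
# Instead of one fixed code, the encoder outputs the mean and log-variance
# of a Gaussian over the latent space; a code is then sampled from it.
z_mean = layers.Dense(latent_dim)(h)
z_log_var = layers.Dense(latent_dim)(h)
encoder = keras.Model(inputs, [z_mean, z_log_var])
```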
The key reason traditional autoencoders cannot generate new images is that the learned latent space is discontinuous: if the decoder has never seen an encoded point before (very likely, because there are gaps in the space), the generated output will be unrealistic. By encoding the input into a probability distribution instead, the decoder learns an area or cluster of points in the latent space with a wide range of variations, and the continuous latent space leaves few gaps the decoder has never seen. This allows the VAE to generate variations on the training data and therefore gives us new outputs.
Compared to autoencoders, which use only the reconstruction loss, VAEs involve two losses: a reconstruction loss that measures how well the decoded output matches the input, and a KL-divergence regularization term that pushes the learned latent distribution towards the prior (a standard Gaussian).
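A sketch of the two loss terms and of the sampling step (the "reparameterization trick"), assuming the encoder above produces `z_mean` and `z_log_var`; the use of squared error for the reconstruction term is an assumption:

```python
import tensorflow as tf

def sample_latent(z_mean, z_log_var):
    # Reparameterization trick: z = mu + sigma * epsilon, so gradients can
    # flow through the random sampling step.
    epsilon = tf.random.normal(shape=tf.shape(z_mean))
    return z_mean + tf.exp(0.5 * z_log_var) * epsilon

def vae_loss(x, x_reconstructed, z_mean, z_log_var):
    # 1) Reconstruction loss: how well the decoder reproduces the input.
    reconstruction = tf.reduce_mean(
        tf.reduce_sum(tf.square(x - x_reconstructed), axis=-1))
    # 2) KL divergence between N(mu, sigma^2) and the standard Gaussian prior,
    #    which keeps the latent space continuous and free of large gaps.
    kl = -0.5 * tf.reduce_mean(
        tf.reduce_sum(1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=-1))
    return reconstruction + kl
```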
There are many interesting applications of VAEs, including image editing and latent space animations:
Since the prior distribution is the standard Gaussian, the inverse CDF of the standard Gaussian is used to produce values of the encodings. It can be seen that by sampling from different regions of the latent space, different samples are created, and the larger the region, the more variation the sampled outputs show (a minimal sketch of this sampling procedure is given after the examples below).
Sample from 0.05-0.95 range of the standard Gaussian CDF.
Sample from 0.5-0.95 range of the standard Gaussian CDF.
Sample from 0.05-0.5 range of the standard Gaussian CDF.
Sample from 0.1-0.9 range of the standard Gaussian CDF.
Sample from 0.1-0.5 range of the standard Gaussian CDF.
Sample from 0.2-0.6 range of the standard Gaussian CDF.
Sample from 0.8-0.9 range of the standard Gaussian CDF.
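Here is the sampling procedure sketched in code, assuming a trained VAE `decoder` model and a chosen `latent_dim` (both hypothetical names); quantiles are drawn from the stated CDF range and mapped to latent values with the inverse CDF of the standard Gaussian:

```python
import numpy as np
from scipy.stats import norm

def sample_images(decoder, latent_dim, low=0.05, high=0.95, n_samples=15):
    # Draw quantiles uniformly within the chosen CDF range, then map them back
    # to latent values with the inverse CDF (ppf) of the standard Gaussian prior.
    quantiles = np.random.uniform(low, high, size=(n_samples, latent_dim))
    z = norm.ppf(quantiles)
    # Decode the latent codes into new images.
    return decoder.predict(z)

# A narrow range such as low=0.8, high=0.9 restricts sampling to a small region
# of the latent space and therefore produces less variation in the outputs.
```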