Generative adversarial networks

In the last article we considered the simplest linear generative model, PPCA. The second generative model we will consider is the Generative Adversarial Network, abbreviated GAN. In this article we'll look at the most basic version of this model, leaving advanced versions and the comparison with other approaches to generative modeling for the following chapters.
 
 
Generative adversarial networks
 
Many generative models are trained with Markov Chain Monte Carlo, which scales poorly. A method for training generative models based on scalable techniques such as Stochastic Gradient Descent (SGD) and backpropagation was therefore needed. One such method is Generative Adversarial Networks (GAN). GANs were first proposed in this article in 2014. At a high level, the model can be described as two submodels that compete with each other, where one of them (the generator) tries to learn, in some sense, to deceive the other (the discriminator). To do this, the generator produces random objects, and the discriminator tries to distinguish these generated objects from real objects from the training sample. As training progresses, the generator produces objects that are ever more similar to the sample, and it becomes harder and harder for the discriminator to distinguish them from the real ones. Thus, the generator turns into a generative model that generates objects from some complex distribution, for example, from the distribution of photographs of human faces.
 
 

Model


 
First we introduce the necessary terminology. Let $X$ denote a certain space of objects, for example, pictures of a fixed size in pixels. On some probability space $\Omega$ a vector random variable $x \colon \Omega \to X$ is given with a probability distribution whose density $p(x)$ is such that the subset of $X$ on which $p(x)$ takes non-zero values consists of, for example, photographs of human faces. We are given a random i.i.d. sample of face photographs $x_1, \dots, x_N$ drawn from $x$. In addition, we define an auxiliary space $Z$ and a random variable $z \colon \Omega \to Z$ with a probability distribution of density $q(z)$. $D \colon X \to [0, 1]$ is the discriminator function. It accepts an object $x \in X$ (in our example, a picture of the appropriate size) and returns the probability that the input picture is a photograph of a human face. $G \colon Z \to X$ is the generator function. It takes a value $z \in Z$ and produces an object of the space $X$, that is, in our case, a picture.
 
 
Suppose that we already have an ideal discriminator $D^*$. For any example $x$ it gives the true probability that this example belongs to the subset of $X$ from which the sample $x_1, \dots, x_N$ was drawn. Reformulating the problem of deceiving the discriminator in probabilistic language, we need to maximize the probability that the ideal discriminator assigns to the generated examples. Thus, the optimal generator is found as $G^* = \arg\max_G \mathbb{E}_{z \sim q(z)} D^*(G(z))$. Since $\log(\cdot)$ is a monotonically increasing function and does not change the position of the extrema of its argument, this formula can be rewritten as $G^* = \arg\max_G \mathbb{E}_{z \sim q(z)} \log D^*(G(z))$, which will be convenient later.
 
 
In reality there is usually no ideal discriminator, and it has to be found. Since the task of the discriminator is to provide a signal for training the generator, instead of the ideal discriminator it is sufficient to take a discriminator that perfectly separates real examples from those produced by the current generator, i.e. one that is ideal only on the subset of $X$ from which the current generator draws its examples. This problem can be restated as the search for a function $D$ that maximizes the probability of correctly classifying examples as real or generated. This is the binary classification problem, and in this case we have an infinite training sample: a finite number of real examples and a potentially infinite number of generated examples. Each example has a label: real or generated. The first article described the solution of a classification problem by the maximum likelihood method. Let us write it out for our case.
 
 
So, our sample is $(x_1, 1), \dots, (x_N, 1), (G(z_1), 0), \dots, (G(z_N), 0)$. Let us define the conditional distribution over labels as $\hat p(1 \mid x) = D(x)$ and $\hat p(0 \mid x) = 1 - D(x)$; this is simply the discriminator $D$, which gives the probability of class $1$ (a real example), rewritten as a distribution over the classes $\{0, 1\}$. Since $D(x) \in [0, 1]$, this definition gives a correct probability distribution. Then the optimal discriminator can be found as:

$$D^* = \arg\max_D \prod_{i=1}^{2N} \hat p(y_i \mid \tilde x_i),$$

where $(\tilde x_i, y_i)$ runs over the labeled sample above. We group the factors for $y = 1$ and $y = 0$:

$$D^* = \arg\max_D \prod_{i=1}^{N} D(x_i) \prod_{i=1}^{N} \left(1 - D(G(z_i))\right)$$

$$= \arg\max_D \sum_{i=1}^{N} \log D(x_i) + \sum_{i=1}^{N} \log\left(1 - D(G(z_i))\right)$$

$$= \arg\max_D \frac{1}{N} \sum_{i=1}^{N} \log D(x_i) + \frac{1}{N} \sum_{i=1}^{N} \log\left(1 - D(G(z_i))\right)$$

And when the sample size tends to infinity, we get:

$$D^* = \arg\max_D \; \mathbb{E}_{x \sim p(x)} \log D(x) + \mathbb{E}_{z \sim q(z)} \log\left(1 - D(G(z))\right)$$
 
In total, we get the following iterative process:
 
 
 
1. We fix an arbitrary initial generator $G_0$.

2. The $i$-th iteration begins, $i = 1, 2, \dots$

3. We look for the optimal discriminator for the current generator:

$$D_i = \arg\max_D \; \mathbb{E}_{x \sim p(x)} \log D(x) + \mathbb{E}_{z \sim q(z)} \log\left(1 - D(G_{i-1}(z))\right).$$

4. We improve the generator using the optimal discriminator:

$$G_i = \arg\max_{G} \mathbb{E}_{z \sim q(z)} \log D_i(G(z)).$$

It is important to stay in the vicinity of the current generator $G_{i-1}$. If we move far away from it, the discriminator will cease to be optimal and the algorithm will cease to be correct.

5. The task of training the generator is considered solved when $D_i(x) \approx \frac{1}{2}$ for any $x$. If the process has not converged, we go to the next iteration at step (2).
 
 
In the original article this algorithm is summarized in a single formula that, in a sense, defines a minimax game between the discriminator and the generator:

$$\min_G \max_D \; \mathbb{E}_{x \sim p(x)} \log D(x) + \mathbb{E}_{z \sim q(z)} \log\left(1 - D(G(z))\right)$$
 
Both functions can be represented as neural networks: $D = D(x, \theta_D)$, $G = G(z, \theta_G)$, after which the problem of finding the optimal functions reduces to an optimization problem over the parameters and can be solved by traditional methods: backpropagation and SGD. Additionally, since a neural network is a universal function approximator, $G(z, \theta_G)$ can approximate an arbitrary probability distribution, which removes the question of the choice of the distribution $q(z)$: it can be any continuous distribution within some reasonable limits, for example, $U[-1, 1]$ or $N(0, I)$. The correctness of this algorithm and the convergence of the distribution of generated examples to $p(x)$ under sufficiently general assumptions are proved in the original article.
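
To build intuition for why training stops when the discriminator outputs $\frac{1}{2}$, recall the closed form of the optimal discriminator derived in the original article. If $p_G(x)$ denotes the density of the examples produced by the generator, then for a fixed generator the maximization above gives

$$D^*(x) = \frac{p(x)}{p(x) + p_G(x)},$$

so when the generator reproduces the data distribution exactly ($p_G = p$), the optimal discriminator is the constant $\frac{1}{2}$.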
 
 

Finding the parameters of the normal distribution


 
Now that we have dealt with the mathematics, let's see how this works in practice. Suppose that $X = \mathbb{R}$, i.e. we solve a one-dimensional problem, and the data come from a normal distribution $N(\mu, \sigma^2)$. Let's use a linear generator $G(z) = \sigma_G z + \mu_G$, where $z \sim N(0, 1)$. The discriminator will be a fully connected three-layer neural network with a binary classifier at the end. The solution of this problem is a generator that reproduces the data distribution, that is, $\sigma_G = \sigma$, $\mu_G = \mu$. Now let's try to program a numerical solution of this problem using TensorFlow. The full code can be found here; the article highlights only the key points.
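
As a quick sanity check (a standard property of the normal distribution, not a quote from the article), a linear transformation of a standard normal variable produces exactly the family of distributions we are fitting:

$$z \sim N(0, 1) \quad\Longrightarrow\quad G(z) = \sigma_G z + \mu_G \sim N(\mu_G, \sigma_G^2),$$

so a linear generator is expressive enough for this problem.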
 
 
The first thing to define is the input sample: samples from $N(\mu, \sigma^2)$. Since training is done on mini-batches, we generate a whole vector of numbers at a time. Additionally, the sample is parameterized by the mean and the standard deviation.
 
 
def data_batch(hparams):
    """
    The input data are just samples from N(mean, stddev).
    """
    return tf.random_normal(
        [hparams.batch_size, 1], hparams.input_mean, hparams.input_stddev)

 
Now let's specify the random inputs for the generator, $z \sim N(0, 1)$:
 
 
def generator_input(hparams):
    """
    The generator input data are just samples from N(0, 1).
    """
    return tf.random_normal([hparams.batch_size, 1], 0., 1.)

 
Let's define the generator. We take the absolute value of the second parameter so that it has the meaning of a standard deviation:
 
 
def generator(input, hparams):
    mean = tf.Variable(tf.constant(0.))
    stddev = tf.sqrt(tf.Variable(tf.constant(1.)) ** 2)
    return input * stddev + mean

 
Let's create a vector of real examples:
 
 
real_input = data_batch(hparams)

 
And the vector of generated examples:
 
 
generator_input = generator_input(hparams)
generated = generator(generator_input, hparams)

 
Now we run all the examples through the discriminator. It is important to remember that we want a single discriminator, not two different ones, so we ask TensorFlow to use the same set of parameters for both inputs:
 
 
with tf.variable_scope("discriminator"):
    real_ratings = discriminator(real_input, hparams)
with tf.variable_scope("discriminator", reuse=True):
    generated_ratings = discriminator(generated, hparams)
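
The discriminator function itself is only described in the text (a fully connected three-layer network with a binary classifier at the end) and is not quoted in the article. A minimal sketch of what it might look like, assuming TensorFlow 1.x imported as tf, a hypothetical hparams.hidden_size parameter, and a single output logit (which is what sigmoid_cross_entropy_with_logits below expects):

def discriminator(input, hparams):
    # Two hidden fully connected layers with ReLU activations.
    hidden = tf.layers.dense(
        input, hparams.hidden_size, activation=tf.nn.relu, name="hidden_1")
    hidden = tf.layers.dense(
        hidden, hparams.hidden_size, activation=tf.nn.relu, name="hidden_2")
    # A single output logit per example; the sigmoid is applied inside the loss.
    return tf.layers.dense(hidden, 1, name="output")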

 
The loss function for real examples is the cross-entropy between one (the expected response of the discriminator on real examples) and the discriminator's estimates:
 
 
loss_real = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(
        labels=tf.ones_like(real_ratings),
        logits=real_ratings))

 
The loss function for fake examples is the cross-entropy between zero (the expected response of the discriminator on fake examples) and the discriminator's estimates:
 
 
loss_generated = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(
        labels=tf.zeros_like(generated_ratings),
        logits=generated_ratings))

 
The discriminator loss function is the sum of the losses on real examples and on fake examples:
 
 
discriminator_loss = loss_generated + loss_real

 
The generator loss function is the cross-entropy between one (the desired, incorrect response of the discriminator on fake examples) and the discriminator's estimates of these fake examples:
 
 
generator_loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(
        labels=tf.ones_like(generated_ratings),
        logits=generated_ratings))

 
To the discriminator loss function, L2-regularization is optionally added.
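
One possible way to add such a penalty (a sketch; the weight hparams.l2_lambda is a hypothetical parameter, and the exact form in the full code may differ):

discriminator_variables = tf.get_collection(
    tf.GraphKeys.TRAINABLE_VARIABLES, scope="discriminator")
discriminator_loss += hparams.l2_lambda * tf.add_n(
    [tf.nn.l2_loss(v) for v in discriminator_variables])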
 
 
Training the model reduces to alternately training the discriminator and the generator in a loop until convergence:
 
 
for step in range(args.max_steps):
    session.run(model.discriminator_train)
    session.run(model.generator_train)
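
The train operations themselves are not quoted in the article. A minimal sketch of how they could be built, assuming plain SGD with a hypothetical hparams.learning_rate and assuming that in the full code the generator's variables are created under a "generator" variable scope (so that each optimizer updates only its own sub-network):

discriminator_train = tf.train.GradientDescentOptimizer(
    hparams.learning_rate).minimize(
        discriminator_loss,
        var_list=tf.get_collection(
            tf.GraphKeys.TRAINABLE_VARIABLES, scope="discriminator"))
generator_train = tf.train.GradientDescentOptimizer(
    hparams.learning_rate).minimize(
        generator_loss,
        var_list=tf.get_collection(
            tf.GraphKeys.TRAINABLE_VARIABLES, scope="generator"))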

 
Below are the graphs for the four discriminator models:
 
 
 
a three-layer neural network;

a three-layer neural network with L2-regularization;

a three-layer neural network with dropout-regularization;

a three-layer neural network with L2- and dropout-regularization.
 
 

 
Fig. 1. The probability of the discriminator classifying a real example as real.
 
 

 
Fig. 2. The probability of the discriminator classifying a generated example as real.
 
 
All four models converge quite quickly to a discriminator that outputs $\frac{1}{2}$ for all inputs. Because of the simplicity of the problem the generator solves, there is almost no difference between the models. The graphs show that the mean and the standard deviation converge quite quickly to the values of the data distribution:
 
 

 
Fig. 3. The mean of the generated distributions.
 
 

 
Fig. 4. The standard deviation of the generated distributions.
 
 
Below are the distributions of the real and the generated examples during training. It can be seen that by the end of training the generated examples are practically indistinguishable from the real ones (they look different on the charts because TensorBoard has chosen different scales, but if you look at the values, they are the same).
 
 

 
Fig. 5. Distribution of the real data. It does not change over time. The training step is plotted on the vertical axis.
 
 

 
 

 
Fig. 6. Distributions of the generated data from the four models. The training step is plotted on the vertical axis.
 
 
Let's look at the learning process of the model:
 

 
Fig. 7. Visualization of the learning process of the model. The stationary Gaussian is the distribution density of real data, the moving Gaussian is the distribution density of the generated examples, the blue curve is the result of the discriminator's work, i.e. the likelihood of an example being real.
 
 
It can be seen that at the beginning of training the discriminator separates the data very well, but the distribution of the generated examples quite quickly literally "crawls" towards the distribution of the real examples. In the end, the generator approximates the data so well that the discriminator becomes a constant $\frac{1}{2}$ and the problem converges.
 
 

Approximation of a mixture of normal distributions I


 
Let's try to replace $N(\mu, \sigma^2)$ with a mixture of two normal distributions, thereby simulating a multimodal distribution of the original data. For this model, only the code for generating real examples needs to be changed. Instead of returning a normally distributed random variable, we return a mixture of several:
 
 
def data_batch(hparams):
    count = len(hparams.input_mean)
    components = []
    for i in range(count):
        components.append(
            tf.contrib.distributions.Normal(
                loc=hparams.input_mean[i],
                scale=hparams.input_stddev[i]))
    return tf.contrib.distributions.Mixture(
        cat=tf.contrib.distributions.Categorical(
            probs=[1. / count] * count),
        components=components).sample(
            sample_shape=[hparams.batch_size, 1])

 
Below are graphs for the same models as in the previous experiment, but for data with two modes:
 
 

 
Fig. 8. The likelihood of the discriminator classifying a real example as real.
 
 

 
Fig. 9. The probability of the discriminator classifying a generated example as real.
 
 
It is interesting to note that the regularized models perform noticeably better than the unregularized ones. However, regardless of the model, it is clear that the generator can no longer deceive the discriminator as well as before. Let's figure out why this happened.
 
 

 
Fig. 10. The mean of the generated distributions.
 
 

 
Fig. 11. The standard deviation of the generated distributions.
 
 
As in the first experiment, the generator approximates the data with a normal distribution. The reason for the drop in quality is that the data can no longer be accurately approximated by a normal distribution, because they are sampled from a mixture of two normals. The modes of the mixture are symmetric about zero, and it can be seen that all four models approximate the data with a normal distribution centered near zero and with a fairly large variance. Let's look at the distributions of the real and the fake examples to understand what is happening:
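
For intuition, the first two moments of a symmetric two-component mixture are easy to compute (a generic calculation, not taken from the article; the actual mixture parameters in the experiment may differ). For $\tfrac{1}{2} N(-\mu, \sigma^2) + \tfrac{1}{2} N(\mu, \sigma^2)$:

$$\mathbb{E}[x] = 0, \qquad \operatorname{Var}[x] = \sigma^2 + \mu^2,$$

so any single Gaussian that roughly matches these moments is centered at zero and noticeably wider than either component, which is exactly what the plots show.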
 
 

 
Fig. 12. Distribution of the real data. It does not change over time. The training step is plotted on the vertical axis.
 
 

 
 

 
Figure 13. Distributions of the generated data from the four models. The training step is plotted on the vertical axis.
 
 
This is how the training of the model proceeds:
 

 
Fig. 14. Visualization of the learning process of the model. The stationary mixture of Gaussians is the distribution density of real data, the moving Gaussian is the distribution density of the generated examples, the blue curve is the result of the discriminator's work, i.e. the likelihood of an example being real.
 
 
This animation shows in detail the case discussed above. The generator, not having sufficient expressive power and only being able to approximate the data with a single Gaussian, spreads out into a broad Gaussian, trying to cover both modes of the data distribution. As a result, the generator genuinely deceives the discriminator only in the places where the areas under the generator's curve and the data curve are close, that is, in the regions where these curves intersect.
 
 
However, this is not the only possible case. Let's move the right mode a little further to the right, so that the generator's initial approximation does not capture it.
 
 

 
Fig. 15. Visualization of the learning process of the model. The stationary mixture of Gaussians is the distribution density of real data, the moving Gaussian is the distribution density of the generated examples, the blue curve is the result of the discriminator's work, i.e. the likelihood of an example being real.
 
 
It is evident that in this case it is most advantageous for the generator to approximate the left mode of the distribution. After that happens, the generator periodically tries to capture the right mode as well. This shows up as oscillations of the generator's standard deviation in the second half of the animation. But all these attempts fail, because the discriminator in a sense "locks in" the generator: to capture the right mode, it would have to pass through a barrier of high loss, which it cannot do because the learning rate is too low. This effect is called mode collapse.
 
 
In the two examples above we saw two kinds of problems that arise when the generator is not expressive enough to represent the original data distribution: mode averaging, when the generator approximates the entire distribution but fits it poorly everywhere; and mode collapse, when the generator learns only a subset of the modes, and the modes it has not learned do not affect it in any way.
 
 
Besides preventing the discriminator from converging to $\frac{1}{2}$, both of these problems also reduce the quality of the generative model. The first problem causes the generator to produce examples "between" the modes, which should not exist; the second causes the generator to produce examples only from some of the modes, thereby reducing the richness of the original data distribution.
 
 

Approximation of a mixture of normal distributions II


 
The reason the discriminator could not be deceived in the previous section was the triviality of the generator, which simply performed a linear transformation. Let's now try using a fully connected three-layer neural network as the generator:
 
 
def generator(self, input, hparams):
    # The first fully connected layer with 256 features.
    input_size = 1
    features = 256
    weights = tf.get_variable(
        "weights_1", initializer=tf.truncated_normal(
            [input_size, features], stddev=0.1))
    biases = tf.get_variable(
        "biases_1", initializer=tf.constant(0., shape=[features]))
    hidden_layer = tf.nn.relu(tf.matmul(input, weights) + biases)
    # The second fully connected layer with 256 features.
    input_size = features
    features = 256
    weights = tf.get_variable(
        "weights_2", initializer=tf.truncated_normal(
            [input_size, features], stddev=0.1))
    biases = tf.get_variable(
        "biases_2", initializer=tf.constant(0., shape=[features]))
    hidden_layer = tf.nn.relu(tf.matmul(hidden_layer, weights) + biases)
    # The last linear layer that generates the example.
    output_size = 1
    weights = tf.get_variable(
        "weights_out", initializer=tf.truncated_normal(
            [features, output_size], stddev=0.1))
    biases = tf.get_variable(
        "biases_out",
        initializer=tf.constant(0., shape=[output_size]))
    return tf.matmul(hidden_layer, weights) + biases

 
Let's look at the training curves.
 
 

 
Fig. 16. The probability of the discriminator classifying a real example as real.
 
 

 
Fig. 17. The probability of the discriminator classifying a generated example as real.
 
 
It can be seen that, due to the large number of parameters, training has become much noisier. The discriminators of all the models converge to a value of about $\frac{1}{2}$, but behave unstably around this point. Let's look at the shape of the distribution learned by the generator.
 
 

 
Figure 18. Distribution of the real data. It does not change over time. The training step is plotted on the vertical axis.
 
 

 
 

 
Figure 19. Distributions of the generated data from the four models. The training step is plotted on the vertical axis.
 
 
It can be seen that the distribution of the generator does not coincide with the data distribution exactly, but is quite similar to it. The most heavily regularized model again proved to be the best: it learned two modes that roughly coincide with the modes of the data distribution. The sizes of the peaks are also not reproduced very accurately, but they do approximate the data distribution. Thus, a neural network generator is able to learn a multimodal data distribution.
 
 
This is how the training of the model proceeds:
 
 

 
Fig. 20. Visualization of the learning process of the model with close modes. The stationary mixture of Gaussians is the distribution density of real data, the moving Gaussian is the distribution density of the generated examples, the blue curve is the result of the discriminator's work, i.e. the likelihood of an example being real.
 
 

 
Fig. 21. Visualization of the learning process of the model with distant modes. The stationary mixture of Gaussians is the distribution density of real data, the moving Gaussian is the distribution density of the generated examples, the blue curve is the result of the discriminator's work, i.e. the likelihood of an example being real.
 
 
These two animations show training on the data distributions from the previous section. They make it clear that a sufficiently large generator with many parameters is capable, albeit rather crudely, of approximating a multimodal distribution, which indirectly confirms that the problems from the previous section arise from an insufficiently complex generator. The discriminators in these animations are much noisier than in the section on finding the parameters of a normal distribution, but nevertheless, by the end of training they begin to resemble a noisy horizontal line at $\frac{1}{2}$.
 
 

Results


 
A GAN is a model for approximating an arbitrary distribution using only samples from that distribution. In this article we looked in detail at how the model works on a trivial example of finding the parameters of a normal distribution and on a more complex example of approximating a bimodal distribution with a neural network. Both problems were solved with good accuracy; all that was required was to use a sufficiently complex generator model. In the next article we will move from these toy examples to real examples of generating samples from complex distributions, using the distribution of images as an example.
 
 

Acknowledgments


 
Thanks to Olga Talanova and Ruslan Login for reviewing the text. Thanks to Ruslan Login for help in preparing the images and animations. Thanks to Andrei Tarashkevich for help with the layout of this article.