
In the last post we looked at how Generative Adversarial Networks could be used to learn representations of documents in an unsupervised manner. In evaluation, we found that although the model was able to learn useful representations, it did not perform as well as an older model called DocNADE. In this post we give a brief overview of the DocNADE model, and provide a TensorFlow implementation.

Neural Autoregressive Distribution Estimation

Recent advances in neural autoregressive generative modeling have led to impressive results at modeling images and audio, as well as in language modeling and machine translation. This post looks at a slightly older take on neural autoregressive models – the Neural Autoregressive Distribution Estimator (NADE) family of models.

An autoregressive model is based on the fact that any D-dimensional distribution can be factored into a product of conditional distributions in any order:

\(p(\mathbf{x}) = \prod_{d=1}^{D} p(x_d | \mathbf{x}_{<d})\)

where \(\mathbf{x}_{<d}\) represents the first \(d-1\) dimensions of \(\mathbf{x}\) in the current ordering. We can therefore create an autoregressive generative model by just parameterising all of the separate conditionals in this equation.

One of the simplest ways to do this is to take a sequence of binary values and assume that the output at each timestep is just a linear combination of the previous values. We can then pass this weighted sum through a sigmoid to get the output probability for each timestep. This sort of model is called a fully visible sigmoid belief network (FVSBN):

fvsbn

A fully visible sigmoid belief network. Figure taken from the NADE paper.

Here we have binary inputs \(v\) and generated binary outputs \(\hat{v}\). \(\hat{v_3}\) is produced from the inputs \(v_1\) and \(v_2\).
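
As an illustration of the idea (this is our own sketch, not code from the NADE paper), the FVSBN conditionals can be written in a few lines of NumPy, assuming a weight matrix W whose row d only connects to the preceding inputs:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fvsbn_conditionals(v, W, b):
    """Return p(v_d = 1 | v_<d) for each dimension d of a binary vector v.

    W has shape (D, D), but only the strictly lower-triangular part is used,
    so output d depends only on v_1 ... v_{d-1}; b has shape (D,).
    """
    D = len(v)
    probs = np.zeros(D)
    for d in range(D):
        probs[d] = sigmoid(b[d] + W[d, :d] @ v[:d])  # linear combination of the previous values
    return probs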

NADE can be seen as an extension of this, where instead of a linear parameterisation of each conditional, we pass the inputs through a feed-forward neural network:

nade

Neural Autoregressive Distribution Estimator. Figure taken from the NADE paper.

Specifically, each conditional is parameterised as:

\(p(x_d | \mathbf{x}_{<d}) = \text{sigm}(b_d + \mathbf{V}_{d,:} \mathbf{h}_d)\)

\(\mathbf{h}_d = \text{sigm}(c + \mathbf{W}_{:,<d} \mathbf{x}_{<d})\)

where \(\mathbf{W}\), \(\mathbf{V}\), \(b\) and \(c\) are learnable parameters of the model. This can then be trained by minimising the negative log-likelihood of the data.

When compared to the FVSBN there is also additional weight sharing in the input layer of NADE: each input element uses the same parameters when computing the various output elements. This parameter sharing was inspired by the Restricted Boltzmann Machine, but also has some computational benefits – at each timestep we only need to compute the contribution of the new sequence element (we don’t need to recompute all of the preceding elements).
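
To make the recurrence and the weight sharing concrete, here is a minimal NumPy sketch of the NADE conditionals (an illustration of the equations above, not the reference implementation). Note how the hidden pre-activation is updated incrementally, so each step only adds the contribution of the newest input element:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nade_conditionals(x, W, V, b, c):
    """Return p(x_d = 1 | x_<d) for each dimension d of a binary vector x.

    W: (H, D) input weights, V: (D, H) output weights,
    b: (D,) output biases, c: (H,) hidden biases.
    """
    D = len(x)
    probs = np.zeros(D)
    a = c.astype(float).copy()        # running pre-activation: c + W[:, <d] x_<d
    for d in range(D):
        h_d = sigmoid(a)              # hidden state for the d-th conditional
        probs[d] = sigmoid(b[d] + V[d] @ h_d)
        a += W[:, d] * x[d]           # only the new element's contribution is added
    return probs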

Modeling documents with NADE

In the standard NADE model, the input and outputs are binary variables. In order to work with sequences of text, the DocNADE model extends NADE by considering each element in the input sequence to be a multinomial observation – or in other words one of a predefined set of tokens (from a fixed vocabulary). Likewise, the output must now also be multinomial, and so a softmax layer is used at the output instead of a sigmoid. The DocNADE conditionals are then given by:

\(p(x_d | \mathbf{x}_{<d}) = \frac{\text{exp} (b_{x_d} + \mathbf{V}_{x_d,:} \mathbf{h}_d)}{\sum_w \text{exp} (b_w + \mathbf{V}_{w,:} \mathbf{h}_d)}\)

\(\mathbf{h}_d = \text{sigm}\Big(c + \sum_{k<d} \mathbf{W}_{:,x_k} \Big)\)

An additional type of parameter sharing has been introduced in the input layer – each word uses the same weights no matter where it appears in the sequence (so if the word “cat” appears at input positions 2 and 10, it uses the same weights each time).

There is another way to look at this, however. We now have a single set of parameters for each word no matter where it appears in the sequence, and there is a common name for this architectural pattern – a word embedding. So we can view DocNADE as a way of constructing word embeddings, but with a different set of constraints than we might be used to from models like Word2Vec. For each input in the sequence, DocNADE uses the sum of the embeddings from the previous timesteps (passed through a sigmoid nonlinearity) to predict the word at the next timestep. The final representation of a document is just the value of the hidden layer at the final timestep (or in other words, the sum of the word embeddings passed through a nonlinearity).
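
Once trained, this document representation can be computed directly from the embedding matrix. A rough NumPy sketch, using the notation of the equations above (the paper uses a sigmoid here; the TensorFlow code below uses tanh):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def docnade_document_vector(token_ids, W, c):
    """Hidden state after the final timestep: the document representation.

    token_ids: vocabulary indices of the words in the document,
    W: (hidden_size, vocab_size) embedding matrix, c: (hidden_size,) bias.
    """
    return sigmoid(c + W[:, token_ids].sum(axis=1))  # sum of word embeddings, then nonlinearity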

There is one more constraint that we have not yet discussed – the sequence order. Instead of training on sequences of words in the order that they appear in the document, as we do when training a language model for example, DocNADE trains on random permutations of the words in a document. We therefore get embeddings that are useful for predicting what words we expect to see appearing together in a full document, rather than focusing on patterns that arise due to syntax and word order (or focusing on smaller contexts around each word).

An Overview of the TensorFlow code

The full source code for our TensorFlow implementation of DocNADE is available on Github; here we will just highlight some of the more interesting parts.

First we do an embedding lookup for each word in our input sequence (x). We initialise the embeddings to be uniform in the range [0, 1.0 / (vocab_size * hidden_size)], which is taken from the original DocNADE source code. I don’t think that this is mentioned anywhere else, but we did notice a slight performance bump when using this instead of the default TensorFlow initialisation.


with tf.device('/cpu:0'):
    max_embed_init = 1.0 / (params.vocab_size * params.hidden_size)
    W = tf.get_variable(
        'embedding',
        [params.vocab_size, params.hidden_size],
        initializer=tf.random_uniform_initializer(maxval=max_embed_init)
    )
    self.embeddings = tf.nn.embedding_lookup(W, x)

Next we compute the pre-activation for each input element in our sequence. We transpose the embedding sequence so that the sequence length is the first dimension (instead of the batch), then use the higher-order tf.scan function to apply sum_embeddings to each sequence element in turn. This replaces each embedding with the sum of that embedding and all of the preceding embeddings.


def sum_embeddings(previous, current):
    return previous + current
 
h = tf.scan(sum_embeddings, tf.transpose(self.embeddings, [1, 2, 0]))
h = tf.transpose(h, [2, 0, 1])
 
h = tf.concat([
    tf.zeros([batch_size, 1, params.hidden_size], dtype=tf.float32), h
], axis=1)
 
h = h[:, :-1, :]

The zero vector prepended to the summed sequence above ensures that the first element is predicted from the bias term alone (and the final position, which sums over the whole document, is dropped). We then initialise the bias term and apply the nonlinearity.


bias = tf.get_variable(
    'bias',
    [params.hidden_size],
    initializer=tf.constant_initializer(0)
)
h = tf.tanh(h + bias)

Finally we compute the sequence loss, which is masked according to the length of each sequence in the batch. Note that for optimisation, we do not normalise this loss by the length of each document. This leads to slightly better results as mentioned in the paper, particularly for the document retrieval evaluation (discussed below).


h = tf.reshape(h, [-1, params.hidden_size])
logits = linear(h, params.vocab_size, 'softmax')
loss = masked_sequence_cross_entropy_loss(x, seq_lengths, logits)

Experiments

As DocNADE computes the probability of the input sequence, we can measure how well it is able to generalise by computing the probability of a held-out test set. In the paper the metric used is the average perplexity per word, which for test documents \(x_t\) (of length \(|x_t|\)) and test set size \(N\) is given by:

\(\text{exp} \big(-\frac{1}{N} \sum_{t} \frac{1}{|x_t|} \log p(x_t) \big)\)
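
A small NumPy helper makes the metric explicit (the function name and inputs here are just for illustration):

import numpy as np

def average_perplexity_per_word(log_probs, doc_lengths):
    """log_probs[t] = log p(x_t) for test document t, doc_lengths[t] = |x_t|."""
    log_probs = np.asarray(log_probs, dtype=np.float64)
    doc_lengths = np.asarray(doc_lengths, dtype=np.float64)
    return np.exp(-np.mean(log_probs / doc_lengths))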

As in the paper, we evaluate DocNADE on the same (small) 20 Newsgroups dataset that we used in our previous post, which consists of a collection of around 19000 postings to 20 different newsgroups. The published version of DocNADE uses a hierarchical softmax on this dataset, despite the fact that it uses a small vocabulary of only 2000 words. There is not much need to approximate a softmax of this size when training relatively small models on modern GPUs, so here we just use a full softmax. This makes a large difference to the reported perplexity numbers – the published implementation achieves a test perplexity of 896, but with the full softmax we can get this down to 579. To put this improvement in context, the following table shows perplexity values on this task for models that have been published much more recently:

Model/paper                    Perplexity
DocNADE (original)             896
Neural Variational Inference   836
DeepDocNADE                    835
DocNADE with full softmax      579

One additional change from the evaluation in the paper is that we evaluate the average perplexity over the full test set (in the paper they just take a random sample of 50 documents).

We were expecting to see an improvement due to the use of the full softmax, but not one of quite this magnitude. Even when using a sampled softmax on this task instead of the full softmax, we see some big improvements over the published results. This suggests that the hierarchical softmax formulation used in the original paper was a relatively poor approximation of the true softmax (though it is possible that there is a bug somewhere in our implementation; if you find any issues please let us know).

We also see an improvement on the document retrieval evaluation results with the full softmax:

20ng_results

For the retrieval evaluation, we first create vectors for every document in the dataset. We then use the held-out test set vectors as “queries”, and for each query we find the closest N documents in the training set (by cosine similarity). We then measure what percentage of these retrieved training documents have the same newsgroup label as the query document. We then plot a curve of the retrieval performance for different values of N.
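
A simplified NumPy sketch of this retrieval evaluation (the function and variable names are ours, for illustration only):

import numpy as np

def retrieval_precision(train_vecs, train_labels, test_vecs, test_labels, n):
    """Average fraction of the n closest training documents (by cosine
    similarity) that share the query document's newsgroup label."""
    train = train_vecs / np.linalg.norm(train_vecs, axis=1, keepdims=True)
    test = test_vecs / np.linalg.norm(test_vecs, axis=1, keepdims=True)
    sims = test @ train.T                       # cosine similarity of every query to every training doc
    top_n = np.argsort(-sims, axis=1)[:, :n]    # indices of the n closest training docs per query
    matches = np.asarray(train_labels)[top_n] == np.asarray(test_labels)[:, None]
    return matches.mean()

# curve = [retrieval_precision(train_h, train_y, test_h, test_y, n)
#          for n in (1, 2, 4, 8, 16, 32)]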

Note: for working with larger vocabularies, the current implementation supports approximating the softmax using the sampled softmax.

Conclusion

We took another look at DocNADE, noting that it can be viewed as another way to train word embeddings. We also highlighted the potential for large performance boosts with older models simply due to modern computational improvements – in this case because it is no longer necessary to approximate the softmax for vocabularies of this size. The full source code for the model is available on Github.


I presented some preliminary work on using Generative Adversarial Networks to learn distributed representations of documents at the recent NIPS workshop on Adversarial Training. In this post I provide a brief overview of the paper and walk through some of the code.

Learning document representations

Representation learning has been a hot topic in recent years, in part driven by the desire to apply the impressive results of Deep Learning models on supervised tasks to the areas of unsupervised learning and transfer learning. There is a wide variety of approaches to representation learning, but the basic idea is to learn some set of features from data, and then to use these features for some other task where you may only have a small number of labelled examples (such as classification). The features are typically learned by trying to predict various properties of the underlying data distribution, or by using the data to solve a separate (possibly unrelated) task for which we do have a large number of labelled examples.

The ability to do this is desirable for several reasons. In many domains there may be an abundance of unlabelled data available to us, while supervised data is difficult and/or costly to acquire. Some people also feel that we will never be able to build more generally intelligent machines using purely supervised learning (a viewpoint that is illustrated by the now infamous LeCun cake slide).

Word (and character) embeddings have become a standard component of Deep Learning models for natural language processing, but there is less consensus around how to learn representations of sentences or entire documents. One of the most established techniques for learning unsupervised document representations from the literature is Latent Dirichlet Allocation (LDA). Later neural approaches to modeling documents have been shown to outperform LDA when evaluated on a small news corpus (discussed below). The first of these is the Replicated Softmax, which is based on the Restricted Boltzmann Machine, and this was in turn later surpassed by a neural autoregressive model called DocNADE.

In addition to autoregressive models like the NADE family, there are two other popular approaches to building generative models at the moment – Variational Autoencoders and Generative Adversarial Networks (GANs). This work is an early exploration to see if GANs can be used to learn document representations in an unsupervised setting.

Modeling Documents with Generative Adversarial Networks

In the original GAN setup, a generator network learns to map samples from a (typically low-dimensional) noise distribution into the data space, and a second network called the discriminator learns to distinguish between real data samples and fake generated samples. The generator is trained to fool the discriminator, with the intended goal being a state where the generator has learned to create samples that are representative of the underlying data distribution, and the discriminator is unsure whether it is looking at real or fake samples.

There are a couple of questions to address if we want to use this sort of technique to model documents:

  • At Aylien we are primarily interested in using the learned representations for new tasks, rather than doing some sort of text generation. Therefore we need some way to map from a document to a latent space. One shortcoming with this GAN approach is that there is no explicit way to do this – you cannot go from the data space back into the low-dimensional latent space. So what is our representation?
  • As GAN training requires the whole model to be end-to-end differentiable, how do we represent collections of discrete symbols?

To answer the first question, some extensions to the standard GAN model train an additional neural network to perform this mapping (like this, this and this).

Another, simpler idea is to just use some internal part of the discriminator as the representation (as is done in the DCGAN paper). We experimented with both approaches, but so far have gotten better results with the latter. Specifically, we use a variation on the Energy-based GAN model, where our discriminator is a denoising autoencoder, and we use the autoencoder bottleneck as our representation (see the paper for more details).

As for representing discrete symbols, we take the most overly simplified approach that we can – we assume that a document is just a binary bag-of-words vector (i.e. a vector in which there is a 1 if a given word from a fixed vocabulary is present in the document, and a 0 otherwise). Although this is actually still a discrete vector, we can now just treat it as if all elements were continuous in the range [0, 1] and backpropagate through the full network.
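
Concretely, the conversion to a binary bag-of-words vector might look like the following sketch (a hypothetical helper, not part of the released code):

import numpy as np

def to_binary_bow(token_ids, vocab_size):
    """Turn a document (a list of vocabulary indices) into a binary vector."""
    x = np.zeros(vocab_size, dtype=np.float32)
    x[np.unique(token_ids)] = 1.0   # 1 if the word occurs at least once, 0 otherwise
    return x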

The full model looks like this:

adm-large

Here z is a noise vector, which passes through a generator network G and produces a vector that is the size of the vocabulary. We then pass either this generated vector or a sampled bag-of-words vector from the data (x) to our denoising autoencoder discriminator D. The vector is then corrupted with masking noise C, mapped into a lower-dimensional space by an encoder, mapped back to the data space by a decoder and then finally the loss is taken as the mean-squared error between the input to D and the reconstruction. We can also extract the encoded representation (h) for any input document.

An overview of the TensorFlow code

The full source for this model can be found at https://github.com/AYLIEN/adversarial-document-model; here we will just highlight some of the more important parts.

In TensorFlow, the generator is written as:


def generator(z, size, output_size):
    h0 = tf.nn.relu(slim.batch_norm(linear(z, size, 'h0')))
    h1 = tf.nn.relu(slim.batch_norm(linear(h0, size, 'h1')))
    return tf.nn.sigmoid(linear(h1, output_size, 'h2'))

This function takes the noise vector z, the size of the generator’s hidden dimension and the size of the final output dimension. It passes the noise vector through two fully-connected ReLU layers (with batch norm), before passing the output through a final sigmoid layer.

The discriminator is similarly straightforward:


def discriminator(x, mask, size):
    noisy_input = x * mask
    h0 = leaky_relu(linear(noisy_input, size, 'h0'))
    h1 = linear(h0, x.get_shape()[1], 'h1')
    diff = x - h1
    return tf.reduce_mean(tf.reduce_sum(diff * diff, 1)), h0

It takes a vector x, a noise mask and the size of the autoencoder bottleneck. The noise is applied to the input vector before it is passed through a single leaky ReLU layer and then mapped linearly back to the input space. It returns both the reconstruction loss and the bottleneck tensor.

The full model is:


with tf.variable_scope('generator'):
    self.generator = generator(z, params.g_dim, params.vocab_size)

with tf.variable_scope('discriminator'):
    self.d_loss, self.rep = discriminator(x, mask, params.z_dim)

with tf.variable_scope('discriminator', reuse=True):
    self.g_loss, _ = discriminator(self.generator, mask, params.z_dim)

margin = params.vocab_size // 20
self.d_loss += tf.maximum(0.0, margin - self.g_loss)

vars = tf.trainable_variables()
self.d_params = [v for v in vars if v.name.startswith('discriminator')]
self.g_params = [v for v in vars if v.name.startswith('generator')]

step = tf.Variable(0, trainable=False)

self.d_opt = tf.train.AdamOptimizer(
    learning_rate=params.learning_rate,
    beta1=0.5
)

self.g_opt = tf.train.AdamOptimizer(
    learning_rate=params.learning_rate,
    beta1=0.5
)

We first create the generator, then two copies of the discriminator network (one taking real samples as input, and one taking generated samples). We then complete the discriminator loss by adding the cost from the generated samples (with an energy margin), and create separate Adam optimisers for the discriminator and generator networks. The magic Adam beta1 value of 0.5 comes from the DCGAN paper, and similarly seems to stabilize training in our model.

This model can then be trained as follows:


def update(model, x, opt, loss, params, session):
    z = np.random.normal(0, 1, (params.batch_size, params.z_dim))
    mask = np.ones((params.batch_size, params.vocab_size)) * np.random.choice(
        2,
        params.vocab_size,
        p=[params.noise, 1.0 - params.noise]
    )
    loss, _ = session.run([loss, opt], feed_dict={
        model.x: x,
        model.z: z,
        model.mask: mask
    })
    return loss
 
# … TF training/session boilerplate …
 
for step in range(params.num_steps + 1):
    _, x = next(training_data)
 
    # update discriminator
    d_losses.append(update(
        model,
        x,
        model.d_opt,
        model.d_loss,
        params,
        session
    ))
 
    # update generator
    g_losses.append(update(
        model,
        x,
        model.g_opt,
        model.g_loss,
        params,
        session
    ))

Here we get the next batch of training data, then update the discriminator and generator separately. At each update, we generate a new noise vector to pass to the generator, and a new noise mask for the denoising autoencoder (the same noise mask is used for each input in the batch).

Experiments

To compare with previous published work in this area (LDA, Replicated Softmax, DocNADE) we ran some experiments with this adversarial model on the 20 Newsgroups dataset. It must be stressed that this is a relatively toy dataset by current standards, consisting of a collection of around 19000 postings to 20 different newsgroups.

One open question with generative models (and GANs in particular) is what metric do you actually use to evaluate how well they are doing? If the model yields a proper joint probability over the input, a popular choice is to evaluate the likelihood of a held-out test set. Unfortunately this is not an option for GAN models.

Instead, as we are only really interested in the usefulness of the learned representation, we also follow previous work and compare how likely similar documents are to have representations that are close together in vector space. Specifically, we create vectors for every document in the dataset. We then use the held-out test set vectors as “queries”, and for each query we find the closest N documents in the training set (by cosine similarity). We then measure what percentage of these retrieved training documents have the same newsgroup label as the query document. We then plot a curve of the retrieval performance for different values of N. The results are shown below.

20ng_results

Precision-recall curves for the document retrieval task on the 20 Newsgroups dataset. ADM is the adversarial document model, ADM (AE) is the adversarial document model with a standard autoencoder as the discriminator (and so is similar to the Energy-Based GAN), and DAE is a denoising autoencoder.

Here we can see a few notable points:

  • The model does learn useful representations, but is still not reaching the performance of DocNADE on this task. At lower recall values though it is better than the LDA results on the same task (not shown above, see the Replicated Softmax paper).
  • By using a denoising autoencoder as the discriminator, we get a bit of a boost versus just using a standard autoencoder.
  • We get quite a large improvement over just training a denoising autoencoder with similar parameters on this dataset.

We also looked at whether the model produced results that were easy to interpret. We note that the autoencoder bottleneck has weights connecting it to every word in the vocabulary, so we looked to see if specific hidden units were strongly connected to groups of words that could be interpreted as newsgroup topics. Interestingly, we find some evidence of this, as shown in the table below, where we present the words most strongly associated with three of these hidden units (a sketch of how such word lists can be read off the encoder weights follows the table). They do generally fit into understandable topic categories, with a few exceptions. However, we note that these are cherry-picked examples, and that overall the weights for a specific hidden unit do not tend to strongly associate with single topics.

Computing   Sports     Religion
windows     hockey     christians
pc          season     windows
modem       players    atheists
scsi        baseball   waco
quadra      rangers    batf
floppy      braves     christ
xlib        leafs      heart
vga         sale       arguments
xterm       handgun    bike
shipping    bike       rangers
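
A rough sketch of how such per-unit word lists can be extracted, assuming a (hypothetical) encoder weight matrix W_enc of shape [vocab_size, bottleneck_size] and an id_to_word vocabulary mapping:

import numpy as np

def top_words_for_unit(W_enc, unit, id_to_word, k=10):
    """Return the k vocabulary words with the largest weights into a given bottleneck unit."""
    top_ids = np.argsort(-W_enc[:, unit])[:k]   # indices of the k largest weights for this unit
    return [id_to_word[i] for i in top_ids]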

 

We can also see reasonable clustering of many of the topics in a TSNE plot of the test-set vectors (1 colour per topic), although some are clearly still being confused:

20ng_tsne

Conclusion

We showed some interesting first steps in using GANs to model documents, admittedly perhaps asking more questions than we answered. In the time since the completion of this work, there have been numerous proposals to improve GAN training (such as this, this and this), so it would be interesting to see if any of the recent advances help with this task. And of course, we still need to see if this approach can be scaled up to larger datasets and vocabularies. The full source code is now available on Github; we look forward to seeing what people do with it.






At Aylien we are using recent advances in Artificial Intelligence to try to understand natural language. Part of what we do is building products such as our Text Analysis API and News API to help people extract meaning and insight from text. We are also a research lab, conducting research that we believe will make valuable contributions to the field of Artificial Intelligence, as well as driving further product development (take a look at the research section of our blog to see some of our interests).

We are delighted to announce a call for applications from academic researchers who would like to collaborate with our talented research team as part of the Science Foundation Ireland Industry Fellowship programme. For researchers, we feel that this is a great opportunity to work in industry with a team of talented scientists and engineers, with the resources and infrastructure to support your work.

We are especially interested in collaborating on work in these areas (but we’re open to suggestions):

  • Representation Learning
  • Domain Adaptation and Transfer Learning
  • Sentiment Analysis
  • Question Answering
  • Dialogue Systems
  • Entity and Relation Extraction
  • Topic Modeling
  • Document Classification
  • Taxonomy Inference
  • Document Summarization
  • Machine Translation

Details

A notable change from the previous requirements for the SFI Industry Fellowship programme is that you are now also eligible to apply if you hold a PhD awarded by an Irish university (wherever you are currently living). You can work with us under this programme if you are:

  • a faculty or postdoctoral researcher currently working in an eligible Irish Research Body;
  • a postdoctoral researcher having held a research contract in an eligible Irish Research Body; or
  • the holder of a PhD awarded by an Irish Research Body.

The project allows for an initial duration of 12 months of full-time work, but this can be spread part-time over up to 24 months, depending on what suits you. We also hope to continue the collaboration afterwards. The total funding available from the SFI is €100,000.

The final application date is July 6th 2017, but as we will work with you on the application, we encourage you to get in touch as soon as possible if you are interested.

Must haves:

  • A faculty or postdoctoral position at an eligible Irish Research Body, a previous research contract in an eligible Irish Research Body, or a PhD awarded by an Irish Research Body (in the areas of science, engineering or mathematics).
  • Strong knowledge of at least one programming language, and general software engineering skills (experience with version control systems, debugging, testing, etc.).
  • The ability to grasp new concepts quickly.

Nice to haves:

  • An advanced knowledge of Artificial Intelligence and its subfields.
  • A solid background in Machine Learning.
  • Experience with the scientific Python stack: NumPy, SciPy, scikit-learn, TensorFlow, Theano, Keras, etc.
  • Experience with Deep Learning.
  • Good understanding of language and linguistics.
  • Experience with non-English NLP.

About the Research Team at AYLIEN

We are a team of research scientists and research engineers that includes both PhDs and PhD students, who work in an open and transparent way.

If you join us, you will get to work with a seasoned engineering team that will allow you to take successful algorithms from prototype into production. You will also have the opportunity to work on many technically challenging problems, publish at conferences, and contribute to the development of state-of-the-art Deep Learning and natural language understanding applications.

We work primarily on Deep Learning with TensorFlow and the rest of the scientific Python stack (NumPy, SciPy, scikit-learn, etc.).

How to apply

Please send your CV with a brief introduction to jobs@aylien.com, and be sure to include links to any code or publications that are publicly available.


At Aylien we are using recent advances in Artificial Intelligence to try to understand natural language. Part of what we do is building products such as our Text Analysis API and News API to help people extract meaning and insight from text. We are also a research lab, conducting research that we believe will make valuable contributions to the field of Artificial Intelligence, as well as driving further product development (see this post about a recent publication on aspect-based sentiment analysis by one of our research scientists for example).

We are excited to announce that we are currently accepting applications from researchers from academia who would like to collaborate on joint research projects, as part of the Science Foundation Ireland Industry Fellowship programme. The main aim of these projects is to conduct and publish novel research in shared areas of interest, but we are also happy to work with researchers who are interested in collaborating on building products. For researchers, we feel that this is a great opportunity to work in industry with a team of talented scientists and engineers, and with the resources and infrastructure to support your work.

We are particularly interested in collaborating on work in the following areas (but are open to other suggestions):

  • Representation Learning
  • Domain Adaptation and Transfer Learning
  • Sentiment Analysis
  • Question Answering
  • Dialogue Systems
  • Entity and Relation Extraction
  • Topic Modeling
  • Document Classification
  • Taxonomy Inference
  • Document Summarization
  • Machine Translation

Details and requirements

This work is funded by the SFI Industry Fellowship programme, and so to be eligible you must have a PhD and be currently hired as a Postdoctoral Researcher, a Research Fellow or a Lecturer by a Higher Education Institution in Ireland.

The initial duration of the project is between 12 and 24 months (12 months full-time, but you can spread it over up to 24 months if you like), but we hope to continue the collaboration afterwards. The total amount of funding available is up to €100,000.

The final application deadline is December 2nd 2016, with a tentative start date at Aylien in March 2017. We will work with you on the application, so if you are interested then you should get in touch with us as soon as possible.

You must have:

  • A PhD degree in a related field (science, engineering, or mathematics), and be currently hired as a Postdoctoral Researcher, a Research Fellow or a Lecturer by a higher education institution in Ireland.
  • Strong knowledge of at least one programming language, and general software engineering skills (experience with version control systems, debugging, testing, etc.).
  • The ability to grasp new concepts quickly.

Preferably you should also have:

  • A strong knowledge of Artificial Intelligence and its subfields.
  • A strong Machine Learning background.
  • Experience with the scientific Python stack: NumPy, SciPy, scikit-learn, TensorFlow, Theano, Keras, etc.
  • Experience with Deep Learning.
  • Good understanding of linguistics and language.
  • Experience with non-English NLP.

Research at Aylien

We are a team of research scientists (PhDs and PhD students) and research engineers, working in an open and transparent way with no bureaucracy. If you join us, you will get to work with a seasoned engineering team that will allow you to take successful algorithms from prototype into production. You will also have the opportunity to work on many technically challenging problems, publish at conferences, and contribute to the development of state-of-the-art Deep Learning and natural language understanding applications.

We work primarily on Deep Learning with TensorFlow and the rest of the scientific Python stack (NumPy, SciPy, scikit-learn, etc.).

How to apply

Please send your CV along with a brief introduction (and links to any of your publicly available code and publications) to jobs@aylien.com.


There has been a large resurgence of interest in generative models recently (see this blog post by OpenAI for example). These are models that can learn to create data that is similar to data that we give them. The intuition behind this is that if we can get a model to write high-quality news articles for example, then it must have also learned a lot about news articles in general. Or in other words, the model should also have a good internal representation of news articles. We can then hopefully use this representation to help us with other related tasks, such as classifying news articles by topic.

Actually training models to create data like this is not easy, but in recent years a number of methods have started to work quite well. One such promising approach is using Generative Adversarial Networks (GANs). The prominent deep learning researcher and director of AI research at Facebook, Yann LeCun, recently cited GANs as being one of the most important new developments in deep learning:

“There are many interesting recent development in deep learning…The most important one, in my opinion, is adversarial training (also called GAN for Generative Adversarial Networks). This, and the variations that are now being proposed is the most interesting idea in the last 10 years in ML, in my opinion.” – Yann LeCun
 

The rest of this post will describe the GAN formulation in a bit more detail, and provide a brief example (with code in TensorFlow) of using a GAN to solve a toy problem.

Discriminative vs. Generative models

Before looking at GANs, let’s briefly review the difference between generative and discriminative models:

  • A discriminative model learns a function that maps the input data (x) to some desired output class label (y). In probabilistic terms, it directly learns the conditional distribution P(y|x).
  • A generative model tries to learn the joint probability of the input data and labels simultaneously, i.e. P(x,y). This can be converted to P(y|x) for classification via Bayes rule, but the generative ability could be used for something else as well, such as creating likely new (x, y) samples.

Both types of models are useful, but generative models have one interesting advantage over discriminative models – they have the potential to understand and explain the underlying structure of the input data even when there are no labels. This is very desirable when working on data modelling problems in the real world, as unlabelled data is of course abundant, but getting labelled data is often expensive at best and impractical at worst.

Generative Adversarial Networks

GANs are an interesting idea that were first introduced in 2014 by a group of researchers at the University of Montreal led by Ian Goodfellow (now at OpenAI). The main idea behind a GAN is to have two competing neural network models. One takes noise as input and generates samples (and so is called the generator). The other model (called the discriminator) receives samples from both the generator and the training data, and has to be able to distinguish between the two sources. These two networks play a continuous game, where the generator is learning to produce more and more realistic samples, and the discriminator is learning to get better and better at distinguishing generated data from real data. These two networks are trained simultaneously, and the hope is that the competition will drive the generated samples to be indistinguishable from real data.

gan

GAN overview. Source: https://ishmaelbelghazi.github.io/ALI

The analogy that is often used here is that the generator is like a forger trying to produce some counterfeit material, and the discriminator is like the police trying to detect the forged items. This setup may also seem somewhat reminiscent of reinforcement learning, where the generator is receiving a reward signal from the discriminator letting it know whether the generated data is accurate or not. The key difference with GANs however is that we can backpropagate gradient information from the discriminator back to the generator network, so the generator knows how to adapt its parameters in order to produce output data that can fool the discriminator.

So far GANs have been primarily applied to modelling natural images. They are now producing excellent results in image generation tasks, generating images that are significantly sharper than those trained using other leading generative methods based on maximum likelihood training objectives. Here are some examples of images generated by GANs:

gan-samples-1

Generated bedrooms. Source: “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks” https://arxiv.org/abs/1511.06434v2

 

gan-samples-2

Generated CIFAR-10 samples. Source: “Improved Techniques for Training GANs” https://arxiv.org/abs/1606.03498

Approximating a 1D Gaussian distribution

To get a better understanding of how this all works, we’ll use a GAN to solve a toy problem in TensorFlow – learning to approximate a 1-dimensional Gaussian distribution. This is based on a blog post with a similar goal by Eric Jang. The full source code for our demo is available on Github (https://github.com/AYLIEN/gan-intro), here we will just focus on some of the more interesting parts of the code.

First we create the “real” data distribution, a simple Gaussian with mean 4 and standard deviation of 0.5. It has a sample function that returns a given number of samples (sorted by value) from the distribution.

class DataDistribution(object):
    def __init__(self):
        self.mu = 4
        self.sigma = 0.5

    def sample(self, N):
        samples = np.random.normal(self.mu, self.sigma, N)
        samples.sort()
        return samples

The data distribution that we will try to learn looks like this:

data

We also define the generator input noise distribution (with a similar sample function). Following Eric Jang’s example, we also go with a stratified sampling approach for the generator input noise – the samples are first generated uniformly over a specified range, and then randomly perturbed.

class GeneratorDistribution(object):
    def __init__(self, range):
        self.range = range

    def sample(self, N):
        return np.linspace(-self.range, self.range, N) + \
            np.random.random(N) * 0.01

Our generator and discriminator networks are quite simple. The generator is a linear transformation passed through a nonlinearity (a softplus function), followed by another linear transformation.

def generator(input, hidden_size):
    h0 = tf.nn.softplus(linear(input, hidden_size, 'g0'))
    h1 = linear(h0, 1, 'g1')
    return h1

In this case we found that it was important to make sure that the discriminator is more powerful than the generator, as otherwise it did not have sufficient capacity to learn to be able to distinguish accurately between generated and real samples. So we made it a deeper neural network, with a larger number of dimensions. It uses tanh nonlinearities in all layers except the final one, which is a sigmoid (the output of which we can interpret as a probability).

def discriminator(input, hidden_size):
    h0 = tf.tanh(linear(input, hidden_size * 2, 'd0'))
    h1 = tf.tanh(linear(h0, hidden_size * 2, 'd1'))
    h2 = tf.tanh(linear(h1, hidden_size * 2, 'd2'))
    h3 = tf.sigmoid(linear(h2, 1, 'd3'))
    return h3

We can then connect these pieces together in a TensorFlow graph. We also define loss functions for each network, with the goal of the generator being simply to fool the discriminator.

with tf.variable_scope('G'):
    z = tf.placeholder(tf.float32, shape=(None, 1))
    G = generator(z, hidden_size)

with tf.variable_scope('D') as scope:
    x = tf.placeholder(tf.float32, shape=(None, 1))
    D1 = discriminator(x, hidden_size)
    scope.reuse_variables()
    D2 = discriminator(G, hidden_size)

loss_d = tf.reduce_mean(-tf.log(D1) - tf.log(1 - D2))
loss_g = tf.reduce_mean(-tf.log(D2))

We create optimizers for each network using the plain GradientDescentOptimizer in TensorFlow with exponential learning rate decay. We should also note that finding good optimization parameters here did require some tuning.

def optimizer(loss, var_list):
    initial_learning_rate = 0.005
    decay = 0.95
    num_decay_steps = 150
    batch = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(
        initial_learning_rate,
        batch,
        num_decay_steps,
        decay,
        staircase=True
    )
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(
        loss,
        global_step=batch,
        var_list=var_list
    )
    return optimizer

vars = tf.trainable_variables()
d_params = [v for v in vars if v.name.startswith('D/')]
g_params = [v for v in vars if v.name.startswith('G/')]

opt_d = optimizer(loss_d, d_params)
opt_g = optimizer(loss_g, g_params)

To train the model, we draw samples from the data distribution and the noise distribution, and alternate between optimizing the parameters of the discriminator and the generator.

with tf.Session() as session:
    tf.initialize_all_variables().run()

    for step in range(num_steps):
        # update discriminator
        # (sample into new variables so we don't shadow the x and z placeholders)
        batch_x = data.sample(batch_size)
        batch_z = gen.sample(batch_size)
        session.run([loss_d, opt_d], {
            x: np.reshape(batch_x, (batch_size, 1)),
            z: np.reshape(batch_z, (batch_size, 1))
        })

        # update generator
        batch_z = gen.sample(batch_size)
        session.run([loss_g, opt_g], {
            z: np.reshape(batch_z, (batch_size, 1))
        })

The following animation shows how the generator learns to approximate the data distribution during training:

 

We can see that at the start of the training process, the generator was producing a very different distribution to the real data. It eventually learned to approximate it quite closely (somewhere around frame 750), before converging to a narrower distribution focused on the mean of the input distribution. After training, the two distributions look something like this:

gan-trained-1

 

This makes sense intuitively. The discriminator is looking at individual samples from the real data and from our generator. If the generator just produces the mean value of the real data in this simple example, then it is going to be quite likely to fool the discriminator.

There are many possible solutions to this problem. In this case we could add some sort of early-stopping criterion, to pause training when some similarity threshold between the two distributions is reached. It is not entirely clear how to generalise this to bigger problems however, and even in the simple case, it may be hard to guarantee that our generator distribution will always reach a point where early stopping makes sense. A more appealing solution is to address the problem directly by giving the discriminator the ability to examine multiple examples at once.

Improving sample diversity

The problem of the generator collapsing to a parameter setting where it outputs a very narrow distribution of points is “one of the main failure modes” of GANs according to a recent paper by Tim Salimans and collaborators at OpenAI. Thankfully they also propose a solution: allow the discriminator to look at multiple samples at once, a technique that they call minibatch discrimination.

In the paper, minibatch discrimination is defined to be any method where the discriminator is able to look at an entire batch of samples in order to decide whether they come from the generator or the real data. They also present a more specific algorithm which works by modelling the distance between a given sample and all other samples in the same batch. These distances are then combined with the original sample and passed through the discriminator, so it has the option to use the distance measures as well as the sample values during classification.

The method can be loosely summarized as follows:

  • Take the output of some intermediate layer of the discriminator.
  • Multiply it by a 3D tensor to produce a matrix (of size num_kernels x kernel_dim in the code below).
  • Compute the L1-distance between rows in this matrix across all samples in a batch, and then apply a negative exponential.
  • The minibatch features for a sample are then the sum of these exponentiated distances.
  • Concatenate the original input to the minibatch layer (the output of the previous discriminator layer) with the newly created minibatch features, and pass this as input to the next layer of the discriminator.

In TensorFlow that translates to something like:

def minibatch(input, num_kernels=5, kernel_dim=3):
    x = linear(input, num_kernels * kernel_dim)
    activation = tf.reshape(x, (-1, num_kernels, kernel_dim))
    diffs = tf.expand_dims(activation, 3) - \
        tf.expand_dims(tf.transpose(activation, [1, 2, 0]), 0)
    abs_diffs = tf.reduce_sum(tf.abs(diffs), 2)
    minibatch_features = tf.reduce_sum(tf.exp(-abs_diffs), 2)
    return tf.concat(1, [input, minibatch_features])

We implemented the proposed minibatch discrimination technique to see if it would help with the collapse of the generator output distribution in our toy example. The new behaviour of the generator network during training is shown below.

 

It’s clear in this case that adding minibatch discrimination causes the generator to maintain most of the width of the original data distribution (although it’s still not perfect). After convergence the distributions now look like this:

gan-trained-2

One final point on minibatch discrimination is that it makes the batch size even more important as a hyperparameter. In our toy example we had to keep batches quite small (less than around 16) for training to converge. Perhaps it would be sufficient to just limit the number of samples that contribute to each distance measure rather than use the full batch, but again this is another parameter to tune.

Final thoughts

Generative Adversarial Networks are an interesting development, giving us a new way to do unsupervised learning. Most of the successful applications of GANs have been in the domain of computer vision, but here at Aylien we are researching ways to apply these techniques to natural language processing. If you’re working on the same idea and would like to compare notes then please get in touch.

One big open problem in this area is how best to evaluate these sorts of models. In the image domain it is quite easy to at least look at the generated samples, although this is obviously not a satisfying solution. In the text domain this is even less useful (unless perhaps your goal is to generate prose). With generative models that are based on maximum likelihood training, we can usually produce some metric based on likelihood (or some lower bound to the likelihood) of unseen test data, but that is not applicable here. Some GAN papers have produced likelihood estimates based on kernel density estimates from generated samples, but this technique seems to break down in higher dimensional spaces. Another solution is to only evaluate on some downstream task (such as classification). If you have any other suggestions then we would love to hear from you.
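
For reference, a kernel density estimate of this kind can be sketched as follows (a generic illustration using SciPy, not the exact procedure from any particular paper):

import numpy as np
from scipy.stats import gaussian_kde

def kde_log_likelihood(generated_samples, test_data):
    """Fit a Gaussian KDE to generated samples (shape [num_samples, dim]) and
    return the mean log-density it assigns to held-out test points."""
    kde = gaussian_kde(generated_samples.T)      # gaussian_kde expects shape (dim, num_samples)
    return np.mean(np.log(kde(test_data.T) + 1e-12))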

More information

If you want to learn more about GANs we recommend starting with the following publications:

Feel free to reuse our GAN code, and of course keep an eye on our blog. Comments, corrections and feedback are welcome.




