Good Contents Are Everywhere, But Here, We Deliver The Best of The Best.Please Hold on!
Your address will show here +12 34 56 78

Programming languages are the primary tool of the software development industry. Since the 1940’s hundreds of them have been created and a huge amount of new lines of code in diverse programming languages are written and pushed to active repositories every day.

We believe that a source code classifier that can identify the programming language that a piece of code is written in would be a very useful tool for automatic syntax highlighting and label suggestion on platforms, such as StackOverflow and technical wikis. This inspired us to train a model for classifying code snippets based on their language, leveraging recent AI techniques for text classification.

We collected hundreds of thousands of source code files from GitHub repositories using the GitHub API. Before training our model, the raw data had to be processed to remove and mitigate some unwanted characteristics of code found in the wild. The final classifier performance is pretty good, and you can find the results at the end of this post, along with some explanations about the model’s decisions.


Programming languages were selected based on their prominence. Chart 1 shows the 49 most used languages on GitHub during the final quarter of 2014 [1].  This analysis only considers active repositories, i.e. repositories that had at least one code push during this period. We’ve added HTML and XML to the list, and even though one may not consider them as programming languages, they are still relevant to software development projects. SQL was added for the same reason.

Chart 1 – Active Repositories

We used the

GitHub API  to retrieve repositories for a specific query language. The chart below shows the data shape after a few days of crawling. Thousands of repositories were inspected, but the ones with a size greater than 100mb were ignored to avoid spending too much time on downloading and preprocessing.  We used the file extensions to label each sample with its programming language (i.e. file.php is a PHP source file). C# was the most abundant source code language, while Arduino was the least frequent among the repos we crawled. In order to avoid an unbalanced training set, we used 10,000 examples at most per class.

Chart 2 – Crawled Files

Mixed Source Codes

Looking carefully at the raw data, we find some challenging behaviours and characteristics, which is not a big surprise given that this data is pulled out of actual arbitrary repositories. The most frequent one is mixed languages in a single file, which happens more often in languages used for web applications, such as JavaScript, HTML, CSS, PHP and ASP. The ASP snippet below, extracted from a .asp source file, illustrates how this happens.

Figure 1 – Mixed Languages

In our case, we want to assign only one class to each document. So in case of mixed languages in a single source code file, we would like to keep only the snippets that belong to the primary language of the file (inferred from its extension), and strip everything else. To do so, we use known reserved words and expressions for each language. For instance, we know that all content between <%php and %> are php code, so if it is a .php file, we only keep this content, and remove everything else. In the same way it is easy to strip HTML tags from code using regex or built-in parsersin Python.

Another common feature found in these documents is embedded code snippets. For instance, in the JavaScript script below there is an embedded C snippet between the quotes. This is another type of mixed code that’s very common. We mitigated this issue by replacing all content between the quotes by a placeholder (in this case we used strv as a string placeholder).

Figure 2 – JavaScript snippet with a “hidden” C code embedded


After the preprocessing step, which includes also escaping of newline and tab characters, we need to tokenize all our text. It is very important to retain all the code syntax information during this step. We used the [\w']+|[""!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~""\\] regex to extract our tokens. After this step the data is ready for training.  


print "This is using getsize"
for root, dirs, files in os.walk('/tmp'):
    print root, "consumes"


print strv \n for root, dirs, files in os.walk(strv):\n\t print root, strv


['print',  'strv',  'for',  'root',  ',',  'dirs',  ',',  'files',  'in',  'os',  '.',  'walk',  '(', 'strv', ')',  ':',  'print',  'root,',  '"',  'strv']

The Model

Recently, Convolutional Neural Networks (CNNs) have gained popularity for handling various NLP tasks. For text classification in particular, deep learning models have achieved remarkable results [2, 3]. Our model uses a word embedding layer followed by a convolutional layer with multiple filters, a max-pooling layer and finally a softmax layer (Figure 3). We used a non-static and randomly initialized embedding layer, so the vectors were trained from scratch.

Figure 3 – Convolutional Neural Network model (Figure based on [2])


We performed a test over a 10% data split and calculated the accuracy, precision, recall and f1-score for each label. The overall results for accuracy, precision, recall and f1-score are 97%, 96%, 96 and 96% respectively. The per-label performance scores are also quite high (Chart 3).  



Chart 3 – Results per class
So the results look good, but let’s take a look at prediction explanations to check how the classifier is making its decisions. We used LIME to generate “explanations” which highlight words that are most correlated with each label. This way, we have an idea why the model is choosing one label instead of another.

A Scala snippet:

object HelloWorld {
  def main(args: Array[String]) {
    println("Hello, world!")

Result with explanations:


A Java snippet:

BufferedWriter out = null;
try {
   out = new BufferedWriter(new FileWriter("filename", true));
} catch (IOException e) {
// error processing code
} finally {
   if (out != null) {



An OCaml snippet:

# let average a b =
    let sum = a +. b in
    sum /. 2.0;;
val average : float -> float -> float = 



Future directions

Although the classifier performance is very high, there are still ways to improve the results. For instance, trying a character-level model which learns directly from characters without the need for a word embedding layer [4]. Also, versioned data for each programming language could be obtained to make it possible to assign a specific version to a source code snippet.


  1. Githut –
  2. Kim, Y. (2014). Convolutional Neural Networks for Sentence Classification. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), 1746–1751.
  3. Wang, P., Xu, J., Xu, B., Liu, C., Zhang, H., Wang, F., & Hao, H. (2015). Semantic Clustering and Convolutional Neural Network for Short Text Categorization. Proceedings ACL 2015, 352–357.
  4. Zhang, X., Zhao, J., & LeCun, Y. (2015). Character-level Convolutional Networks for Text Classification, 1–9.


Text Analysis API - Sign up



When it comes to aggregating news content, two of the most critical elements we consider are the quality and reputation of a media source. Together, these two elements play a significant role in deciding the level of trust and popularity afforded to websites by the people who visit and consume them.

With websites in general, a greater volume of traffic will often correlate directly with source popularity and in some cases a higher degree of quality content. Site reputation then naturally follows.  Therefore it makes sense to say that, for the most part, a website’s reputation can be benchmarked using visitor traffic volumes.

Wouldn’t it be great to be able to filter and sort your news content based on such reputation?

Well now you can with our latest News API feature – the ability to search, sort and filter by Alexa ranking.

What is Alexa ranking?

The Alexa ranking system is compiled by who analyze the frequency of visits on websites and rank them against each other according to the volume of visits they receive. Alexa’s algorithm is pretty simple – it is calculated by the amount of website traffic generated over the past 3 months.

More website traffic = better ranking

You can check out Alexa’s Top 500 ranked websites. No prizes for guessing which site is top of the pile!

Why is it useful in content sourcing?

As an aggregator and analyzer of news content, you should be able to sort your sourced content according to a trust rating of some kind. As one of the world’s most respected and trusted ranking systems, Alexa was an obvious choice for us when it came to deciding how to bring this option to you.

Put simply, by including specific Alexa ranking parameters in your content search you can significantly increase the chances that your results/stories are coming from reputable, popular and trustworthy sources. This is particularly relevant to News API users who are pushing content to apps and news feeds for their own users to consume.

Alexa ranking in the News API

While we check each and every news source for quality standards before adding it to our News API, there is naturally going to be a significant variation in the Alexa ranking of these sources.

Sure, you will find plenty of top quality content from websites that perhaps haven’t got the best Alexa ranking, but if you’re pushing stories to an app or news feed you may want to ensure you are gathering stories from the most reliable sources possible.

In saying that, you may have a completely different approach and wish to avoid the higher-ranked sources and opt for some lesser-known media outlets whose stories are perhaps performing well on social media. Creating a search that combines a specific Alexa rank range with Social Shares will allow you to do exactly this.

Whatever your goal, this new feature adds a whole new dimension to your news content sourcing powers, further adding to the already substantial search options in the News API.

How do I search and sort by Alexa ranking?

We’ve added Alexa integration to nearly all News API endpoints. For the full list of parameters available check out our documentation.

To source stories from websites with a specific range of Alexa ranks, you can use source.rankings.alexa.rank.min: and/or source.rankings.alexa.rank.max. to define your desired range.


Let’s take a look at an example where we want to only search for a specific entity mentioned on sources from a defined range of Alexa rankings.

We’re going to search for stories that mention Usain Bolt in the past 7 days from sources with an Alexa rank between 1 and 1000.

Here are the new parameters that you can include in your search query;



sort_by: source.rankings.alexa.rank


Note: You can try this search query for yourself using our live News API Demo, where we generated the following results and charts;

Here’s an example of a retrieved story from our Demo search query with the Alexa rank displayed just below the image and source to the left.

As mentioned, we’ve added Alexa integration to nearly all News API endpoints. Here’s a few examples, starting with ‘Stories Over Time’ using the /time_series endpoint and a World Cloud generated from /trends

Screen Shot 2016-08-25 at 17.18.31

Social Shares & Article Length Breakdowns using the /histograms endpoint

Screen Shot 2016-08-25 at 17.19.08

Other filtering options

– Country specific

Sometimes it’s particularly useful to narrow your search down to popular sources from particular countries. For example, when reporting on a news story from India it might be worthwhile narrowing your search down to sources which are popular in India.

You can easily define ranking ranges from specific countries using the[]: parameter. We support ISO 3166-1 alpha-2 country codes.

– Sort by links-in

Another strong indicator of the popularity of a source is the number of backlinks or links-in it receives.

You can sort_by source.links_in_count to sort your results in order of the most links-in (or backlinks as you may know them) that a site receives.


The ability to search and sort news content by Alexa ranking has added a fantastic new dimension to our News API and judging by the feedback we have been receiving from our users, it is one that will be well utilized.

Wanna try our News API for free? We’re offering a no-obligation 14 day trial. Simply head over to our News API Sign Up page or click the image below.


News API - Sign up


There has been a large resurgence of interest in generative models recently (see this blog post by OpenAI for example). These are models that can learn to create data that is similar to data that we give them. The intuition behind this is that if we can get a model to write high-quality news articles for example, then it must have also learned a lot about news articles in general. Or in other words, the model should also have a good internal representation of news articles. We can then hopefully use this representation to help us with other related tasks, such as classifying news articles by topic.

Actually training models to create data like this is not easy, but in recent years a number of methods have started to work quite well. One such promising approach is using Generative Adversarial Networks (GANs). The prominent deep learning researcher and director of AI research at Facebook, Yann LeCun, recently cited GANs as being one of the most important new developments in deep learning:

“There are many interesting recent development in deep learning…The most important one, in my opinion, is adversarial training (also called GAN for Generative Adversarial Networks). This, and the variations that are now being proposed is the most interesting idea in the last 10 years in ML, in my opinion.” – Yann LeCun

The rest of this post will describe the GAN formulation in a bit more detail, and provide a brief example (with code in TensorFlow) of using a GAN to solve a toy problem.

Discriminative vs. Generative models

Before looking at GANs, let’s briefly review the difference between generative and discriminative models:

  • A discriminative model learns a function that maps the input data (x) to some desired output class label (y). In probabilistic terms, they directly learn the conditional distribution P(y|x).
  • A generative model tries to learn the joint probability of the input data and labels simultaneously, i.e. P(x,y). This can be converted to P(y|x) for classification via Bayes rule, but the generative ability could be used for something else as well, such as creating likely new (x, y) samples.

Both types of models are useful, but generative models have one interesting advantage over discriminative models – they have the potential to understand and explain the underlying structure of the input data even when there are no labels. This is very desirable when working on data modelling problems in the real world, as unlabelled data is of course abundant, but getting labelled data is often expensive at best and impractical at worst.

Generative Adversarial Networks

GANs are an interesting idea that were first introduced in 2014 by a group of researchers at the University of Montreal lead by Ian Goodfellow (now at OpenAI). The main idea behind a GAN is to have two competing neural network models. One takes noise as input and generates samples (and so is called the generator). The other model (called the discriminator) receives samples from both the generator and the training data, and has to be able to distinguish between the two sources. These two networks play a continuous game, where the generator is learning to produce more and more realistic samples, and the discriminator is learning to get better and better at distinguishing generated data from real data. These two networks are trained simultaneously, and the hope is that the competition will drive the generated samples to be indistinguishable from real data.


GAN overview. Source:

The analogy that is often used here is that the generator is like a forger trying to produce some counterfeit material, and the discriminator is like the police trying to detect the forged items. This setup may also seem somewhat reminiscent of reinforcement learning, where the generator is receiving a reward signal from the discriminator letting it know whether the generated data is accurate or not. The key difference with GANs however is that we can backpropagate gradient information from the discriminator back to the generator network, so the generator knows how to adapt its parameters in order to produce output data that can fool the discriminator.

So far GANs have been primarily applied to modelling natural images. They are now producing excellent results in image generation tasks, generating images that are significantly sharper than those trained using other leading generative methods based on maximum likelihood training objectives. Here are some examples of images generated by GANs:


Generated bedrooms. Source: “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks”



Generated CIFAR-10 samples. Source: “Improved Techniques for Training GANs”

Approximating a 1D Gaussian distribution

To get a better understanding of how this all works, we’ll use a GAN to solve a toy problem in TensorFlow – learning to approximate a 1-dimensional Gaussian distribution. This is based on a blog post with a similar goal by Eric Jang. The full source code for our demo is available on Github (, here we will just focus on some of the more interesting parts of the code.

First we create the “real” data distribution, a simple Gaussian with mean 4 and standard deviation of 0.5. It has a sample function that returns a given number of samples (sorted by value) from the distribution.

class DataDistribution(object):
    def __init__(self): = 4
        self.sigma = 0.5

    def sample(self, N):
        samples = np.random.normal(, self.sigma, N)
        return samples

The data distribution that we will try to learn looks like this:


We also define the generator input noise distribution (with a similar sample function). Following Eric Jang’s example, we also go with a stratified sampling approach for the generator input noise – the samples are first generated uniformly over a specified range, and then randomly perturbed.

class GeneratorDistribution(object):
    def __init__(self, range):
        self.range = range

    def sample(self, N):
        return np.linspace(-self.range, self.range, N) + \
            np.random.random(N) * 0.01

Our generator and discriminator networks are quite simple. The generator is a linear transformation passed through a nonlinearity (a softplus function), followed by another linear transformation.

def generator(input, hidden_size):
    h0 = tf.nn.softplus(linear(input, hidden_size, 'g0'))
    h1 = linear(h0, 1, 'g1')
    return h1

In this case we found that it was important to make sure that the discriminator is more powerful than the generator, as otherwise it did not have sufficient capacity to learn to be able to distinguish accurately between generated and real samples. So we made it a deeper neural network, with a larger number of dimensions. It uses tanh nonlinearities in all layers except the final one, which is a sigmoid (the output of which we can interpret as a probability).

def discriminator(input, hidden_size):
    h0 = tf.tanh(linear(input, hidden_size * 2, 'd0'))
    h1 = tf.tanh(linear(h0, hidden_size * 2, 'd1'))
    h2 = tf.tanh(linear(h1, hidden_size * 2, 'd2'))
    h3 = tf.sigmoid(linear(h2, 1, 'd3'))
    return h3

We can then connect these pieces together in a TensorFlow graph. We also define loss functions for each network, with the goal of the generator being simply to fool the discriminator.

with tf.variable_scope('G'):
    z = tf.placeholder(tf.float32, shape=(None, 1))
    G = generator(z, hidden_size)

with tf.variable_scope('D') as scope:
    x = tf.placeholder(tf.float32, shape=(None, 1))
    D1 = discriminator(x, hidden_size)
    D2 = discriminator(G, hidden_size)

loss_d = tf.reduce_mean(-tf.log(D1) - tf.log(1 - D2))
loss_g = tf.reduce_mean(-tf.log(D2))

We create optimizers for each network using the plain GradientDescentOptimizer in TensorFlow with exponential learning rate decay. We should also note that finding good optimization parameters here did require some tuning.

def optimizer(loss, var_list):
    initial_learning_rate = 0.005
    decay = 0.95
    num_decay_steps = 150
    batch = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(
    optimizer = GradientDescentOptimizer(learning_rate).minimize(
    return optimizer

vars = tf.trainable_variables()
d_params = [v for v in vars if'D/')]
g_params = [v for v in vars if'G/')]

opt_d = optimizer(loss_d, d_params)
opt_g = optimizer(loss_g, g_params)

To train the model, we draw samples from the data distribution and the noise distribution, and alternate between optimizing the parameters of the discriminator and the generator.

with tf.Session() as session:

    for step in xrange(num_steps):
        # update discriminator
        x = data.sample(batch_size)
        z = gen.sample(batch_size)[loss_d, opt_d], {
            x: np.reshape(x, (batch_size, 1)),
            z: np.reshape(z, (batch_size, 1))

        # update generator
        z = gen.sample(batch_size)[loss_g, opt_g], {
            z: np.reshape(z, (batch_size, 1))

The following animation shows how the generator learns to approximate the data distribution during training:


We can see that at the start of the training process, the generator was producing a very different distribution to the real data. It eventually learned to approximate it quite closely (somewhere around frame 750), before converging to a narrower distribution focused on the mean of the input distribution. After training, the two distributions look something like this:



This makes sense intuitively. The discriminator is looking at individual samples from the real data and from our generator. If the generator just produces the mean value of the real data in this simple example, then it is going to be quite likely to fool the discriminator.

There are many possible solutions to this problem. In this case we could add some sort of early-stopping criterion, to pause training when some similarity threshold between the two distributions is reached. It is not entirely clear how to generalise this to bigger problems however, and even in the simple case, it may be hard to guarantee that our generator distribution will always reach a point where early stopping makes sense. A more appealing solution is to address the problem directly by giving the discriminator the ability to examine multiple examples at once.

Improving sample diversity

The problem of the generator collapsing to a parameter setting where it outputs a very narrow distribution of points is “one of the main failure modes” of GANs according to a recent paper by Tim Salimans and collaborators at OpenAI. Thankfully they also propose a solution: allow the discriminator to look at multiple samples at once, a technique that they call minibatch discrimination.

In the paper, minibatch discrimination is defined to be any method where the discriminator is able to look at an entire batch of samples in order to decide whether they come from the generator or the real data. They also present a more specific algorithm which works by modelling the distance between a given sample and all other samples in the same batch. These distances are then combined with the original sample and passed through the discriminator, so it has the option to use the distance measures as well as the sample values during classification.

The method can be loosely summarized as follows:

  • Take the output of some intermediate layer of the discriminator.
  • Multiply it by a 3D tensor to produce a matrix (of size num_kernels x kernel_dim in the code below).
  • Compute the L1-distance between rows in this matrix across all samples in a batch, and then apply a negative exponential.
  • The minibatch features for a sample are then the sum of these exponentiated distances.
  • Concatenate the original input to the minibatch layer (the output of the previous discriminator layer) with the newly created minibatch features, and pass this as input to the next layer of the discriminator.

In TensorFlow that translates to something like:

def minibatch(input, num_kernels=5, kernel_dim=3):
    x = linear(input, num_kernels * kernel_dim)
    activation = tf.reshape(x, (-1, num_kernels, kernel_dim))
    diffs = tf.expand_dims(activation, 3) - \
        tf.expand_dims(tf.transpose(activation, [1, 2, 0]), 0)
    abs_diffs = tf.reduce_sum(tf.abs(diffs), 2)
    minibatch_features = tf.reduce_sum(tf.exp(-abs_diffs), 2)
    return tf.concat(1, [input, minibatch_features])

We implemented the proposed minibatch discrimination technique to see if it would help with the collapse of the generator output distribution in our toy example. The new behaviour of the generator network during training is shown below.


It’s clear in this case that adding minibatch discrimination causes the generator to maintain most of the width of the original data distribution (although it’s still not perfect). After convergence the distributions now look like this:


One final point on minibatch discrimination is that it makes the batch size even more important as a hyperparameter. In our toy example we had to keep batches quite small (less than around 16) for training to converge. Perhaps it would be sufficient to just limit the number of samples that contribute to each distance measure rather than use the full batch, but again this another parameter to tune.

Final thoughts

Generative Adversarial Networks are an interesting development, giving us a new way to do unsupervised learning. Most of the successful applications of GANs have been in the domain of computer vision, but here at Aylien we are researching ways to apply these techniques to natural language processing. If you’re working on the same idea and would like to compare notes then please get in touch.

One big open problem in this area is how best to evaluate these sorts of models. In the image domain it is quite easy to at least look at the generated samples, although this is obviously not a satisfying solution. In the text domain this is even less useful (unless perhaps your goal is to generate prose). With generative models that are based on maximum likelihood training, we can usually produce some metric based on likelihood (or some lower bound to the likelihood) of unseen test data, but that is not applicable here. Some GAN papers have produced likelihood estimates based on kernel density estimates from generated samples, but this technique seems to break down in higher dimensional spaces. Another solution is to only evaluate on some downstream task (such as classification). If you have any other suggestions then we would love to hear from you.

More information

If you want to learn more about GANs we recommend starting with the following publications:

Feel free to reuse our GAN code, and of course keep an eye on our blog. Comments, corrections and feedback are welcome.

Text Analysis API - Sign up


Getting Started with the News API Part 2: Insights


Welcome to Part 2 of our Getting Started with the News API series. In Part 1 we introduced you to the API, the demo/query builder and the supporting interactive documentation. We also showed you how to perform a variety of basic searches to help you get familiar with the API.

Today we’ll begin to explore some of the more advanced capabilities that the API has to offer from an analysis and insights point of view. Whether you’re pushing data from the News API into an app, resurfacing the analysis in a news feed or building intuitive dashboards with the extracted data, the News API enables you to obtain a deep understanding for what’s happening in the news, in near real-time.

With this in mind, we’ll today be focusing on getting you up to speed with the following features;

  • Time Series – leveraging time stamped data
  • Sentiment – comparing the positive/negative polarity in author opinion
  • Histograms – working with numerical data points and metrics
  • Trends – uncovering useful insights

Before you go any further – have you created an AYLIEN account?

If not, we recommend you sign up for your Free trial and head back to Part 1 to learn the basics of search, making calls and creating specific search queries;


Basic Search News API


If you’ve created your account already and learned how to perform basic searches in Part 1, let’s get cracking with Part 2, starting with the Time Series endpoint.

Making calls:

We’ve created SDKs for some of the most popular programming languages which make using the API super easy. Just like Part 1, we’ve also included some code snippets for you to copy and build on.


News API results are returned in JSON format, making it easy for you to do as you please with the data. Throughout this post, we will be displaying charts and graphs that we generated using the JSON results returned from the code snippet examples provided.

1. Time Series – leveraging time stamped data

A Time Series is a sequence of data points plotted over a specified time period. The News API /time_series endpoint can be used to analyze a set of data points relative to this specified period. This makes it easy to analyze timestamped data, allowing you to easily visualize the data in an understandable and meaningful format.

In keeping with our analysis of the US Presidential election from Part 1, we’ll use the following example;


Search stories from sources in the USA mentioning Hillary Clinton from the past 60 Days


Note: Make sure you replace the APP_ID and APP_KEY placeholders with your own API credentials. Again, if you haven’t got an API account you can sign up here.


Visualized Results

As we mentioned earlier, because the News API returns results in JSON format, you can display the results as you please. Here’s an example of a chart we built using the results obtained from the snippet above.

Screen Shot 2016-08-10 at 16.39.39

Visit our documentation for more info on the Time Series endpoint.

Sentiment Analysis

Now let’s dive a little deeper by changing up and specifying our search to look at the levels of sentiment within the stories we retrieve. We’re narrowing our search parameters to only include stories from the past two days and from CNN and Fox News only;


Search stories from sources CNN and Fox News in the US mentioning Hillary Clinton from the past 2 Days

News API - Sign up


From July 20th to July 28th 2016, I had the opportunity of  attending the 6th Lisbon Machine Learning School. The Lisbon Machine Learning School (LxMLS) is an annual event that brings together researchers and graduate students in the fields of NLP and Computational Linguistics, computer scientists with an interest in statistics and ML, and industry practitioners with a desire for a more in-depth understanding. Participants had a chance to join workshops and labs, where they got hands-on experience with building and exploring state-of-the-art deep learning models, as well as to attend talks and speeches by prominent deep learning and NLP researchers from a variety of academic and industrial organisations. You can find the entire programme here.

In this blog post, I am going to share some of the highlights, key insights and takeaways of the summer school. I am going to skip the lectures of the first and second day as they introduce basic Python, Linear Algebra, and Probability Theory concepts and focus on the later lectures and talks. First, we are going to talk about sequence models. We will then turn to structured prediction, a type of supervised ML common to NLP. We will then summarize the lecture on Syntax and Parsing and finally provide insights with regard to Deep Learning. The accompanying slides can be found as a reference at the end of this blog post.

Disclaimer: This blog post is not meant to give a comprehensive introduction of each of the topics discussed; it should rather give you an overview of the week-long event and provide you with pointers if you want to delve deeper into any of the topics.

Sequence Models

Noah Smith of the University Washington kicked off the third day of the summer school with a compelling lecture about sequence models. To test your understanding of sequence models, try to answer – without reading further – the following question: What is the most basic sequence model depicted in Figure 1?


Figure 1: The most basic sequence model

Correct! It is the bag-of-words (notice which words have “fallen” out of the bag-of-words). The bag-of-words model makes the strongest independence assumption of all sequence models: It supposes that each word is entirely independent of its predecessors. It is obvious why models that rely on this assumption do only a poor job at modelling language: every word naturally depends on the words that have preceded it.

Somewhat more sophisticated models thus relax this naive assumption to reduce the entropy: A 1st Order Markov model makes each word dependent on the word that immediately preceded it. This way, it is already able to capture some context of the context that can help to disambiguate a new word. More generally, \(m^{\text{th}}\) Order Markov Models make each word depend on its previous \(m\) words.

In mathematical terms, in \(m^{\text{th}}\) Order Markov Models, the probability of a text sequence (we assume here that such a sequence is delimited by start and stop symbols) can be calculated using the chain rule as the product of the probabilities of the individual words:

\(p(\text{start}, w_1, w_2, …, w_n, \text{stop}) = \prod\limits_{i=1}^{n+1} \gamma (w_i \: | \: w_{i-m}, …, w_{i-1}) \)

where \(\gamma\) is the probability of the current word \(w_i\) given its \(m\) previous words, i.e. the probability to transition from the previous words to the current word.

We can view bag-of-words and \(m^{\text{th}}\) Order Markov Models as occupying the following spectrum:


Figure 2: From bag-of-words to history-based models

As we go right in Figure 2, we make weaker independence assumption and in exchange gain richer expressive power, while requiring more parameters – until we eventually obtain the most expressive – and most rigid – model, a history-based model where each word depends on its entire history, i.e. all preceding words.

As a side-note, state-of-the-art sequence modelling models such as recurrent neural networks and LSTMs can be thought of as being located on the right side of this spectrum, as they don’t require an explicit specification of context words but are – theoretically – able to take into account the entire history.

In many cases, we would not only like to model just the observed sequence of symbols, but take some additional information into account. Hidden Markov Models (HMMs) allow us to associate with each symbol \(w_i\) some missing information, its “state” \(s_i\). The probability of a word sequence in an HMM then not only depends on the transition probability \(\gamma\) but also on the so-called emission probability \(\eta\):

\(p(\text{start}, w_1, w_2, …, w_n, \text{stop}) = \prod\limits_{i=1}^{n+1} \eta (w_i \: | \: s_i) \: \gamma (s_i \: | \: s_{i-1}) \)

Consequently, the HMM is a joint model over observable symbols and hidden/latent/unknown classes. HMMs have traditionally been used in part-of-speech tagging or named entity recognition where the hidden states are POS and NER tags respectively.

If we want to determine the most probable sequence of hidden states, we face a space of potential sequences that grows exponentially with the sequence length. The classic dynamic algorithm to cope with this problem is the Viterbi algorithm, which is used in HMMs, CRFs, and other sequence models to calculate the most probable sequence of hidden states: It lays out the symbol sequence and all possible states in a grid and proceeds left-to-right to compute the maximum probability to transition in every new state given the previous states. The most probable sequence can then be found by back-tracking as in Figure 3.


Figure 3: The Viterbi algorithm

A close relative is the forward-backward algorithm, which is used to calculate the probability of a word sequence and the probabilities of each word’s states, e.g. for language modelling. Indeed, the only difference between Viterbi and the forward-backward algorithm is that Viterbi takes the maximum of the probabilities of the previous state, while forward-backward takes the sum. In this sense, they correspond to the same abstract algorithm, which is instantiated in two different semirings where a semiring informally is a set of values and some operations that obey certain properties.

Finally, if we want to learn HMMs in an unsupervised way, we use the well-known Expectation Maximisation (EM) algorithm, which consists of two steps: During the E step, we calculate the probability of each possible transition and emission at every position with forward-backward (or Viterbi for “hard” EM); for the M step, we re-estimate the parameters with MLE.

Machine Translation

On the evening of the third day, Philipp Koehn, one of the pioneers of MT and inventor of phrase-based machine translation gave a talk on Machine Translation as Sequence Modelling, including a detailed review of different MT and alignment approaches. If you are interested in a comprehensive history of MT that takes you from IBM Model 1 all the way to phrase-based, syntax-based and eventually neural MT, while delving into the details of alignment, translation, and decoding, definitely check out the slides here.

Structured Prediction

HMMs can model sequences, but as their weights are tied to the generative process, strong independence assumptions need to be made to make their computation tractable. We will now turn to a category of models that are more expressive and can be used to predict more complex structures: Structured prediction — which was introduced by Xavier Carreras of Xerox Research on the morning of the fourth day — is used to refer to ML algorithms that don’t just predict scalar or real values, but more complex structures. As complex structures are common in language, so is structured prediction; example tasks of structured prediction in NLP include POS tagging, named entity recognition, machine translation, parsing, and many others.

A successful category of structured prediction models are log-linear models, which are so-called because they model log-probabilities using a linear predictor. Such models try to estimate the parameters \(w\) by calculating the following probability:

\(\text{log} \: \text{Pr}(\mathbf{y} \: | \: \mathbf{x}; \mathbf{w}) = \text{log} \:\frac {\text{exp}\{\mathbf{w} \cdot \mathbf{f}(\mathbf{x},\mathbf{y})\}}{Z(\mathbf{x};\mathbf{w})}\)

where \(\mathbf{x} = x_1, x_2, …, x_n \in \mathcal{X}\) is the sequence of symbols, \(\mathbf{y} = y_1, y_2, …, y_n \in \mathcal{Y}\) is the corresponding sequence of labels, \(\mathbf{f}(\mathbf{x},\mathbf{y})\) is a feature representation of \(\mathbf{x}\) and \(\mathbf{y}\), and \(Z(\mathbf{x};\mathbf{w}) = \sum\limits_{\mathbf{y}’ \in \mathcal{Y}} \text{exp}(\mathbf{w} \cdot \mathbf{f}(\mathbf{x},\mathbf{y}’)) \) is also referred to as the partition function.

Two approaches that can be used to estimate the model parameters \(w\) are:

  1. Maximum Entropy Markov Models (MEMMs), which assume that \(\text{Pr}(\mathbf{y} \: | \: \mathbf{x}; \mathbf{w})\) decomposes, i.e. that we can express it as a product of the individual label probabilities that only depend on the previous label (similar to HMMs).
  2. Conditional Random Fields (CRFs), which make a weaker assumption by only assuming that \(\mathbf{f}(\mathbf{x},\mathbf{y})\) decomposes.

In MEMMs, we assume – similarly to Markov Models – that the label \(y_i\) at the \(i\) th position does not depend on all past labels, but only on the previous label \(y_{i-1}\). In contrast to Markov Models, MEMMs allow us to condition the label \(y_i\) on the entire symbol sequence \(x_{1:n}\). Both assumptions combined lead to the following probability of label \(y_i\) in MEMMs:

\(\text{Pr}(y_i \: | \: x_{1:n}, y_{1:i-1}) = \text{Pr}(y_i \: | \: x_{1:n}, y_{i-1})\)

By this formulation, the objective of MEMMs reduces sequence modelling to multi-class logistic regression.

In CRFs, we factorize on label bigrams. Instead of greedily predicting the most probable label \(y_i\) at every position \(i\), we instead aim to find the sequence of labels with the maximum probability:

\(\underset{y \in \mathcal{Y}}{\text{argmax}} \sum_i \mathbf{w} \cdot \mathbf{f}(\mathbf{x}, i, y_{i-1}, y_i)\)

We then estimate the parameters \(w\) of our model using gradient-based methods where we can use forward-backward to compute the gradient.

CRFs vs. MEMMs

By choosing between MEMMs and CRFs, we make the choice between local and global normalisation. While MEMMs aim to predict the most probable label at every position, CRFs aim to find the most probable label sequence. This, however, leads to the so-called “Label Bias Problem” in MEMMs: As MEMMs choose the label with the highest probability, the model is biased towards more frequent labels, often irrespective of the local context.

As MEMMs reduce to multi-class classification, they are cheaper to train. On the other hand, CRFs are more flexible and thus easier to extend to more complex structures.

This distinction between local and global normalisation has been a recurring topic in sequence modelling and a key criterion when choosing an algorithm. For text generation tasks, global normalisation is still too expensive, however. Many state-of-the-art approaches thus employ beam search as a compromise between local and global normalisation. In most sequence modelling tasks, local normalisation is very popular due to its ease of use, but might fall out of favour as more advanced models and implementations for global normalisation become available. To this effect, a recent outstanding paper at ACL (Andor et al., 2016) shows that globally normalised models are strictly more expressive than locally normalised ones.

HMMs vs. CRFs

Another distinction that is worth investigating is the difference between generative and discriminative models: HMMs are generative models, while CRFs are discriminative. HMMs only take into account the previous word as its features are tied to the generative process. In contrast, CRF features are very flexible. They can look at the whole input \(x\) paired with a label bigram \((y_i , y_{i+1})\). In practice, for prediction tasks, such “good” discriminative features can improve accuracy a lot.

Regarding the parameter estimation, the distinction between generative and discriminative becomes apparent: HMMs focus on explaining the data, both \(x\) and \(y\), while CRFs focus on the mapping from \(x\) to \(y\). Which model is more appropriate depends on the task: CRFs are commonly used in tasks such as POS tagging and NER, while HMMs have traditionally lain at the heart of speech recognition.

Structured Prediction in NLP with Imitation Learning

Andreas Vlachos of the University of Sheffield gave a talk on using imitation learning for structured prediction in NLP, which followed the same distinction between local normalisation (aka incremental modelling), i.e. greedily predicting one label at a time and global normalisation (aka joint modelling), i.e. scoring the complete outputs with a CRF that we discussed above. Andreas talked about how imitation learning can be used to improve incremental modelling as it allows to a) explore the search space, b) to address error-propagation, and c) to train with regard to the task-specific loss function.

There are many popular imitation learning algorithms in the literature such as SEARN (Daumé III et al., 2009), Dagger (Ross et al. 2011), or V-DAgger (Vlachos and Clark, 2014). Recently, MIXER (Ranzato et al., 2016) has been proposed to directly optimise metrics for text generation, such as BLEU or ROUGE.

An interesting perspective is that imitation learning can be seen as inverse reinforcement learning: Whereas we want to learn the best policy in reinforcement learning, we know the optimal policy in imitation learning, i.e. the labels in the training data; we then infer the per-action reward function and learn a policy, i.e. a classifier that can generalise to unseen data.

Demo Day

Figure 4: Aylien stand at Demo Day

In the evening of the fourth day, we presented – along with other NLP companies and research labs – Aylien at the LxMLS Demo Day.

We presented an overview of our research directions at Aylien, as well as a 1D generative adversarial network demo and visualization.

Syntax and Parsing

Having looked at generic models that are able to cope with sequences and more complex structures, we now briefly mention some of the techniques that are commonly used to deal with one of language’s unique characteristics: syntax. To this end, Slav Petrov of Google Research gave an in-depth lecture about syntax and parsing on the fifth day of the summer school, which discussed, among others, successful parsers such as the Charniak and the Berkeley parser, context-free grammars and phrase-based parsing, projective and non-projective dependency parsing, as well as more recent transition and graph-based parsers.

To tie this to what we’ve already discussed, Figure 5 demonstrates how the distinction between generative and discriminative models applies to parsers.


Figure 5: Generative vs. discriminative parsing models

From Dependencies to Constituents

In the evening of the fifth day, André Martins of Unbabel gave talk a on an ACL 2015 paper of his, in which he shows that constituent parsing can be reduced to dependency parsing to get the best of both worlds: the informativeness of constituent parser output and the speed of dependency parsers.

Their approach works for any out-of-the-box dependency parser, is competitive for English and morphologically rich languages, and achieves results above the state of the art for discontinuous parsing (where edges are allowed to intersect).

Deep Learning

Finally, the two last days were dedicated to Deep Learning and featured prolific researchers from academia and industry labs as speakers. On the morning of the sixth day, Wang Ling of Google DeepMind gave one of the most gentle, family-friendly intro to Deep Learning I’ve seen – titled Deep Neural Networks Are Our Friends with a theme inspired by the Muppets.

The evening talk by Oriol Vinyals of Google DeepMind detailed some of his lessons learned when working on sequence-to-sequence models at Google and gave glimpses of interesting future challenges, among them, one-shot learning for NLP (Vinyals et al., 2016) and enabling neural networks to ponder decisions (Graves, 2016).

For the lecture on the last day, Chris Dyer of CMU and Google DeepMind discussed modelling sequential data with recurrent neural networks (RNNs) and shared some insights and intuitions with regard to working with RNNs and LSTMs.

Exploding / vanishing gradients

If you’ve worked with RNNs before, then you’re most likely familiar with the exploding/vanishing gradients problem: As the length of the sequence increases, computations of the model get amplified, which leads to either exploding or vanishing gradients and thus renders the model incapable to learn. The intuition why advanced models such as LSTMs and GRUs mitigate this problem is that they use summations instead of multiplications (which lead to exponential growth).

Deep LSTMs

Figure 6: Deep LSTMs

Deep or stacked LSTMs are by now a very common sight in the literature and state-of-the-art for many sequence modelling problems. Still, descriptions of implementations often omit details, which might be perceived as self-evident. This, however, means that it is not always clear how a model looks like exactly or how it differs from similar architectures. The same applies to Deep LSTMs. The most standard convention feeds the input not only to the first but (via skip connections) also to subsequent layers as in Figure 6. Additionally, dropout is generally applied only between layers and not on the recurrent connections as this would drop out more and more value over time.


Figure 7: Dropout in Deep LSTMs

Does Depth Matter?

Generally, depth helps. However, in comparison to other applications such as audio/visual processing, depth plays a less significant role in NLP. Hypotheses for this observation are: a) More transformation is required for speech processing, image recognition etc. than for common text applications; b) Less effort has been made to find good architectures (RNNs are expensive to train; have been widely used for less long); c) Backpropagation through time and depth is hard and we need better optimisers.

Generally, 2-8 layers are standard across text applications. Input skip connections are used often but by no means universally.

Only recently have also very deep architectures been proposed for NLP (Conneau et al., 2016).


Mini-batching is generally necessary to make use of optimised matrix-matrix multiplication. In practice, however, this usually requires bucketing training instances based on similar lengths and padding with \(0\)’s, which can be a nuisance. Because of this, this is – according to Chris Dyer – “the era of assembly language programming for neural networks. Make the future an easier place to program!”

Character-based models

Character-based models have gained more popularity recently and for some tasks such as language modelling, using character-based LSTMs blows the results of word-based models out of the water, achieving a significantly lower perplexity with a lot fewer parameters particularly for morphologically rich languages.


Figure 8: CharLSTM > Word Lookup


Finally, no overview of recurrent neural networks is complete without the mention of attention, one of the most influential, recently proposed notions with regard to LSTMs. Attention is closely related to “pooling” operations in convolutional neural networks (and other architectures) as it also allows to selectively focus on particular elements of the input. The most popular attention architecture pioneered by Bahdanau et al. (2015) seems to only care about “content” in that it relies on computing the dot product, i.e. the cosine similarity between vectors. It contains no obvious bias in favor of diagonals, short jumps, fertility, or other structures that might guide actual attention from a psycho-linguistic perspective. Some work has begun to add other “structural” biases (Luong et al., 2015; Cohn et al., 2016), but there are many more opportunities for research.

Attention is similar to alignment, but there are important differences: a) alignment makes stochastic but hard decisions. Even if the alignment probability distribution is “flat”, the model picks one word or phrase at a time; b) in contrast, attention is “soft” (all words are interpolated based on their attention weights). Finally, there is a big difference between “flat” and “peaked” attention weights.

Memory Networks for Language Understanding

Antoine Bordes of Facebook AI Research gave the last talk of the summer school, in which he discussed Facebook AI Research’s two main research directions:

On the one hand, they are working on (Key-Value) Memory Networks, which can be used to jointly model symbolic and continuous systems. They can be trained end-to-end through back propagation and with SGD and provide great flexibility on how to design memories.

On the other hand, they are working on new tools for developing learning algorithms. They have created several datasets of reasonable sizes, such as bAbI, CBT, and MovieQA that are designed to ease interpretation of a model’s capabilities and to foster research.


Figure 9: LxMLS 2016 Group Picture

That was the Lisbon Machine Learning Summer School 2016! We had a blast and hope to be back next year!


Sequence Models (Noah Smith)

Machine Translation as Sequence Modelling (Philipp Koehn)

Learning Structured Predictors (Xavier Carreras)

Structured Prediction in NLP with Imitation Learning (Andreas Vlachos)

Syntax and Parsing – I, II (Slav Petrov)

Turbo Parser Redux: From Dependencies to Constituents (André Martins)

Deep Neural Networks Are Our Friends (Wang Ling)

Modeling Sequential Data with Recurrent Networks (Chris Dyer)

Memory Networks for Language Understanding (Antoine Bordes)

Text Analysis API - Sign up