Good Contents Are Everywhere, But Here, We Deliver The Best of The Best.Please Hold on!
Your address will show here +12 34 56 78

Our researchers at AYLIEN keep abreast of and contribute to the latest developments in the field of Machine Learning. Recently, two of our research scientists, John Glover and Sebastian Ruder, attended NIPS 2016 in Barcelona, Spain. In this post, Sebastian highlights some of the stand-out papers and trends from the conference.


The Conference on Neural Information Processing Systems (NIPS) is one of the two top conferences in machine learning. It took place for the first time in 1987 and is held every December, historically in close proximity to a ski resort. This year, it took place in sunny Barcelona. The conference (including tutorials and workshops) went on from Monday, December 5 to Saturday, December 10. The full conference program is available here.

Machine Learning seems to become more pervasive every month. However, it is still sometimes hard to keep track of the actual extent of this development. One of the most accurate barometers for this evolution is the growth of NIPS itself. The number of attendees skyrocketed at this year’s conference growing by over 50% year-over-year.


Image 1: The growth of the number of attendees at NIPS follows (the newly coined) Terry’s Law (named after Terrence Sejnowski, the president of the NIPS foundation; faster growth than Moore’s Law)

Unsurprisingly, Deep Learning (DL) was by far the most popular research topic, with about every fourth of more than 2,500 submitted papers (and 568 accepted papers) dealing with deep neural networks.


Image 2: Distribution of topics across all submitted papers (Source: The review process for NIPS 2016)

On the other hand, the distribution of research paper topics has quite a long tail and reflects the diversity of topics at the conference that span everything from theory to applications, from robotics to neuroscience, and from healthcare to self-driving cars.

Generative Adversarial Networks

One of the hottest developments within Deep Learning was Generative Adversarial Networks (GANs). The minimax game playing networks have by now won the favor of many luminaries in the field. Yann LeCun hails them as the most exciting development in ML in recent years. The organizers and attendees of NIPS seem to side with him: NIPS featured a tutorial by Ian Goodfellow about his brainchild, which led to a packed main conference hall.


Image 3: A full conference hall at the GAN tutorial

Though a fairly recent development, there are many cool extensions of GANs among the conference papers:

  • Reed et al. propose a model that allows you to specify not only what you want to draw (e.g. a bird) but also where to put it in an image.
  • Chen et al. disentangle factors of variation in GANs by representing them with latent codes. The resulting models allow you to adjust e.g. the type of a digit, its breadth and width, etc.

In spite of their popularity, we know alarmingly little about what makes GANs so capable of generating realistic-looking images. In addition, making them work in practice is an arduous endeavour and a lot of (undocumented) hacks are necessary to achieve the best performance. Soumith Chintala presents a collection of these hacks in his “How to train your GAN” talk at the Adversarial Training workshop.


Image 4: How to train your GAN (Source: Soumith Chintala)

Yann LeCun muses in his keynote that the development of GANs parallels the history of neural networks themselves: They were poorly understood and hard to get to work in the beginning and only took off once researchers figured out the right tricks and learned how to make them work. At this point, it seems unlikely that GANs will experience a winter anytime soon; the research community is still at the beginning in learning how to make the best use of them and it will be exciting to see what progress we can make in the coming years.

On the other hand, the success of GANs so far has been limited mostly to Computer Vision due to their difficulty in modelling discrete rather than continuous data. The Adversarial Training workshop showcased some promising work in this direction (see e.g. our own John Glover’s paper on modeling documents, this paper and this paper on generating text, and this paper on adversarial evaluation of dialogue models). It remains to be seen if 2017 will be the year in which GANs break through in NLP.

The Nuts and Bolts of Machine Learning

Andrew Ng gave one of the best tutorials of the conference with his take on building AI applications using Deep Learning. Drawing from his experience of managing the 1,300 people AI team at Baidu and hundreds of applied AI projects and equipped solely with two whiteboards, he shared many insights about how to build and deploy AI applications in production.

Besides better hardware, Ng attributes the success of Deep Learning to two factors: In contrast to traditional methods, deep NNs are able to learn more effectively from large amounts of data. Secondly, end-to-end (supervised) Deep Learning allows us to learn to map from inputs directly to outputs.

While this approach to training chatbots or self-driving cars is sufficient to write innovative research papers, Ng emphasized end-to-end DL is often not production-ready: A chatbot that maps from text directly to a response is not able to have a coherent conversation or fulfill a request, while mapping from an image directly to a steering command might have literally fatal side effects if the model has not encountered the corresponding part of the input space before. Rather, for a production model, we still want to have intermediate steps: For a chatbot, we prefer to have an inference engine that generates a response, while in a self-driving car, DL is used to identify obstacles, while the steering is performed by a traditional planning algorithm.


Image 5: Andrew Ng on end-to-end DL (right: end-to-end DL chatbot and chatbot with inference engine; left bottom: end-to-end DL self-driving car and self-driving car with intermediate steps)

Ng also shared that the most common mistakes he sees in project teams is that they track the wrong metrics: In an applied machine learning project, the only relevant metrics are the training error, the development error, and the test error. These metrics alone enable the project team to know what steps to take, as he demonstrated in the diagram below:


Image 6: Andrew Ng’s flowchart for applied ML projects

A key facilitator of the recent success of ML have been the advances in hardware that allowed faster computation and storage. Given that Moore’s Law will reach its limits sooner or later, one might reason that also the rise of ML might plateau. Ng, however, argued that the commitment by leading hardware manufacturers such as NVIDIA and Intel and the ensuing performance improvements to ML hardware would fuel further growth.

Among ML research areas, supervised learning is the undisputed driver of the recent success of ML and will likely continue to drive it for the foreseeable future. In second place, Ng saw neither unsupervised learning nor reinforcement learning, but transfer learning. We at AYLIEN are bullish on transfer learning for NLP and think that it has massive potential.

Recurrent Neural Networks

The conference also featured a symposium dedicated to Recurrent Neural Networks (RNNs). The symposium coincided with the 20 year anniversary of LSTM…


Image 7: Jürgen Schmidhuber kicking off the RNN symposium

… being rejected from NIPS 1996. The fact that papers that do not use LSTMs have been rare in the most recent NLP conferences (see our EMNLP blog post) is a testament to the perseverance of the authors of the original paper, Sepp Hochreiter and Jürgen Schmidhuber.

At NIPS, we had several papers that sought to improve RNNs in different ways:

Other improvements apply to Deep Learning in general:

  • Salimans and Kingma propose Weight Normalisation to accelerate training that can be applied in two lines of Python code.
  • Li et al. propose a multinomial variant of dropout that sets neurons to zero depending on the data distribution.

The Neural Abstract Machines & Program Induction (NAMPI) workshop also featured several speakers talking about RNNs:

  • Alex Graves focused on his recent work on Adaptive Computation Time (ACT) for RNNs that allows to decouple the processing time from the sequence length. He showed that a word-level language model with ACT could reach state-of-the-art with fewer computations.
  • Edward Grefenstette outlined several limitations and potential future research directions in the context of RNNs in his talk.

Improving classic algorithms

While Deep Learning is a fairly recent development, the conference featured also several improvements to algorithms that have been around for decades:

  • Ge et al. show in their best paper that the non-convex objective for matrix completion has no spurious local minima, i.e. every local minimum is a global minimum.
  • Bachem et al. present a method that guarantees accurate and fast seedings for large-scale k-means++ clustering. The presentation was one of the most polished ones of the conference and the code is open-source and can be installed via pip.
  • Ashtiani et al. show that we can make NP-hard k-means clustering problems solvable by allowing the model to pose queries for a few examples to a domain expert.

Reinforcement Learning

Reinforcement Learning (RL) was another much-discussed topic at NIPS with an excellent tutorial by Pieter Abbeel and John Schulman dedicated to RL. John Schulman also gave some practical advice for getting started with RL.

One of the best papers of the conference introduces Value Iteration Networks, which learn to plan by providing a differentiable approximation to a classic planning algorithm via a CNN. This paper was another cool example of one of the major benefits of deep neural networks: They allow us to learn increasingly complex behaviour as long as we can represent it in a differentiable way.

During the week of the conference, several research environments for RL were simultaneously released, among them OpenAI’s Universe, Deep Mind Lab, and FAIR’s Torchcraft. These will likely be a key driver in future RL research and should open up new research opportunities.

Learning-to-learn / Meta-learning

Another topic that came up in several discussions over the course of the conference was Learning-to-learn or Meta-learning:

  • Andrychowicz et al. learn an optimizer in a paper with the ingenious title “Learning to learn by gradient descent by gradient descent”.
  • Vinyals et al. learn how to one shot-learn in a paper that frames one-shot learning in the sequence-to-sequence framework and has inspired new approaches for one-shot learning.

Most of the existing papers on meta-learning demonstrate that wherever you are doing something that gives you gradients, you can optimize them using another algorithm via gradient descent. Prepare for a surge of “Meta-learning for X” and “(Meta-)+learning” papers in 2017. It’s LSTMs all the way down!

Meta-learning was also one of the key talking points at the RNN symposium. Jürgen Schmidhuber argued that a true meta-learner would be able to learn in the space of all programs and would have the ability to modify itself and elaborated on these ideas at his talk at the NAMPI workshop. Ilya Sutskever remarked that we currently have no good meta-learning models. However, there is hope as the plethora of new research environments should also bring progress in this area.

General Artificial Intelligence

Learning how to learn also plays a role in the pursuit of the elusive goal of attaining General Artificial Intelligence, which was a topic in several keynotes. Yann LeCun argued that in order to achieve General AI, machines need to learn common sense. While common sense is often vaguely mentioned in research papers, Yann LeCun gave a succinct explanation of what common sense is: “Predicting any part of the past, present or future percepts from whatever information is available.” He called this predictive learning, but notes that this is really unsupervised learning.

His talk also marked the appearance of a controversial and often tongue-in-cheek copied image of a cake, which he used to demonstrate that unsupervised learning is the most challenging task where we should concentrate our efforts, while RL is only the cherry on the icing of the cake.


Image 8: The Cake slide of Yann LeCun’s keynote

Drew Purves focused on the bilateral relationship between the environment and AI in what was probably the most aesthetically pleasing keynote of the conference (just look at those graphics!)


Image 9: Graphics by Max Cant of Drew Purves’ keynote (Source: Drew Purves)

He emphasized that while simulations of ecological tasks in naturalistic environments could be an important test bed for General AI, General AI is needed to maintain the biosphere in a state that will allow the continued existence of our civilization.


Image 10: Nature needs AI and AI needs Nature from Drew Purves’ keynote

While it is frequently — and incorrectly — claimed that neural networks work so well because they emulate the brain’s behaviour, Saket Navlakha argued during his keynote that we can still learn a great deal from the engineering principles of the brain. For instance, rather than pre-allocating a large number of neurons, the brain generates 1000s of synapses per minutes until its second year. Afterwards, until adolescence, the number of synapses is pruned and decreases by ~50%.


Image 11: Saket Navlakha’s keynote

It will be interesting to see how neuroscience can help us to advance our field further.

In the context of the Machine Intelligence workshop, another environment was introduced in the form of FAIR’s CommAI-env that allows to train agents through interaction with a teacher. During the panel discussion, the ability to learn hierarchical representations and to identify patterns was emphasized. However, although the field is making rapid progress on standard tasks such as object recognition, it is unclear if the focus on such specific tasks brings us indeed closer to General AI.

Natural Language Processing

While NLP is more of a niche topic at NIPS, there were a few papers with improvements relevant to NLP:

  • He et al. propose a dual learning framework for MT that has two agents translating in opposite directions teaching each other via reinforcement learning.
  • Sokolov et al. explore how to use structured prediction under bandit feedback.
  • Huang et al. extend Word Mover’s Distance, an unsupervised document similarity metric to the supervised setting.
  • Lee et al. model the helpfulness of reviews by taking into account position and presentation biases.

Finally, a workshop on learning methods for dialogue explored how end-to-end systems, linguistics and ML methods can be used to create dialogue agents.



Jürgen Schmidhuber, the father of the LSTM was not only present on several panels, but did his best to remind everyone that whatever your idea, he had had a similar idea two decades ago and you should better cite him lest he interrupt your tutorial.



Boston Robotics’ Spot proved that — even though everyone is excited by learning and learning-to-learn — traditional planning algorithms are enough to win the admiration of a hall full of learning enthusiasts.


Image 12: Boston Robotics’ Spot amid a crowd of fascinated onlookers


Apple, one of the most secretive companies in the world, has decided to be more open, to publish, and to engage with academia. This can only be good for the community. We’re looking forward to more apple research papers.


Image 13: Ruslan Salakhutdinov at the Apple lunch event


Uber announced their acquisition of Cambridge-based AI startup Geometric Intelligence and threw one of the most popular parties of NIPS.


Image 14: The Geometric Intelligence logo

Rocket AI

Talking about startups, the “launch” of Rocket AI and their patented Temporally Recurrent Optimal Learning had some people fooled (note the acronyms in the below tweets). Riva-Melissa Tez finally cleared up the confusion.


These were our impressions from NIPS 2016. We had a blast and hope to be back in 2017!


Text Analysis API - Sign up


Here at AYLIEN we have a team of researchers who like to keep abreast of, and regularly contribute to, the latest developments in the field of Natural Language Processing. Recently, one of our research scientists, Sebastian Ruder, attended EMNLP 2016 in Austin, Texas. In this post, Sebastian has highlighted some of the stand-out papers and trends from the conference.


Image: Jackie Cheung

I spent the past week in Austin, Texas at EMNLP 2016, the Conference on Empirical Methods in Natural Language Processing.

There were a lot of papers at the conference (179 long papers, 87 short papers, and 9 TACL papers all in all) — too many to read each single one. The entire program can be found here. In the following, I will highlight some trends and papers that caught my eye:

Reinforcement learning

One thing that stood out was that RL seems to be slowly finding its footing in NLP, with more and more people using it to solve complex problems:


Dialogue was a focus of the conference with all of the three keynote speakers dealing with different aspects of dialogue: Christopher Potts talked about pragmatics and how to reason about the intentions of the conversation partner; Stefanie Tellex concentrated on how to use dialogue for human-robot collaboration; finally, Andreas Stolcke focused on the problem of addressee detection in his talk.

Among the papers, a few that dealt with dialogue stood out:

  • Andreas and Klein model pragmatics in dialogue with neural speakers and listeners;
  • Liu et al. show how not to evaluate your dialogue system;
  • Ouchi and Tsuboi select addressees and responses in multi-party conversations;
  • Wen et al. study diverse architectures for dialogue modelling.


Seq2seq models were again front and center. It is not common for a method to have its own session two years after its introduction (Sutskever et al., 2014). While in the past years, many papers employed seq2seq e.g. for Neural Machine Translation, some papers this year focused on improving the seq2seq framework:

Semantic parsing

Since seq2seq’s use for dialogue modelling was popularised by Vinyals and Le, it is harder to get it to work with goal-oriented tasks that require an intermediate representation on which to act. Semantic parsing is used to convert a message into a more meaningful representation that can be used by another component of the system. As this technique is useful for sophisticated dialogue systems, it is great to see progress in this area:

X-to-text (or natural language generation)

While mapping from text-to-text with the seq2seq paradigm is still prevalent, EMNLP featured some cool papers on natural language generation from other inputs:


Parsing and syntax are a mainstay of every NLP conference and the community seems to particularly appreciate innovative models that push the state-of-the-art in parsing: The ACL ’16 outstanding paper by Andor et al. introduced a globally normalized model for parsing, while the best EMNLP ‘16 paper by Lee et al. combines a global parsing model with a local search over subtrees.

Word embeddings

There were still papers on word embeddings, but it felt less overwhelming than at the past EMNLP or ACL, with most methods trying to fix a particular flaw rather than training embeddings for embeddings’ sake. Pilevhar and Collier de-conflate senses in word embeddings, while Wieting et al. achieve state-of-the-art results for character-based embeddings.

Sentiment analysis

Sentiment analysis has been popular in recent years (as attested by the introductions of many recent papers on sentiment analysis). Sadly, many of the conference papers on sentiment analysis reduce to leveraging the latest deep neural network for the task to beat the previous state-of-the-art without providing additional insights. There are, however, some that break the mold: Teng et al. find an effective way to incorporate sentiment lexicons into a neural network, while Hu et al. incorporate structured knowledge into their sentiment analysis model.

Deep Learning

By now, it is clear to everyone: Deep Learning is here to stay. In fact, deep learning and neural networks claimed the two top spots of keywords that were used to describe the submitted papers. The majority of papers used at least an LSTM; using no neural network seems almost contrarian now and is something that needs to be justified. However, there are still many things that need to be improved — which leads us to…

Uphill Battles

While making incremental progress is important to secure grants and publish papers, we should not lose track of the long-term goals. In this spirit, one of the best workshops that I’ve attended was the Uphill Battles in Language Processing workshop, which featured 12 talks and not one, but four all-star panels on text understanding, natural language generation, dialogue and speech, and grounded language. Summaries of the panel discussions should be available soon at the workshop website.

This was my brief review of some of the trends of EMNLP 2016. I hope it was helpful.


News API - Sign up


Unsupervisedly learned word embeddings have seen tremendous success in numerous NLP tasks in recent years. So much so that in many NLP architectures, they are close to fully replacing more traditional distributional representations such as LSA features and Brown clusters.

You just have to look at last year’s EMNLP and ACL conferences, both of which had a very strong focus on word embeddings, and a recent post in Communications of the ACM in which word embeddings is hailed as the catalyst for NLP’s breakout. But are they worthy of the hype?

This post is a synopsis of two blogs written by AYLIEN Research Scientist, Sebastian Ruder. You can view Sebastian’s original posts, and more, on Machine Learning, NLP and Deep Learning on his blog.

In this overview we aim to give an in-depth understanding of word embeddings and their effectiveness. We’ll touch on where they originated, we’ll compare popular word embedding models and the challenges associated with them and we’ll and try to answer/debunk some common questions and misconceptions.

We will then look to disillusion word embeddings by relating them to literature in distributional semantics and highlighting the factors that actually account for the success of word embedding models.

A brief history of word embeddings

Vector space models have been used in distributional semantics since the 1990s. Since then, we have seen the development of a number models used for estimating continuous representations of words, Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA) being two such examples.

The term word embeddings was originally coined by Bengio et al. in 2003 who trained them in a neural language model together with the model’s parameters. However, Collobert and Weston were arguably the first to demonstrate the power of pre-trained word embeddings in their 2008 paper A unified architecture for natural language processing, in which they establish word embeddings as a highly effective tool when used in downstream tasks, while also announcing a neural network architecture that many of today’s approaches were built upon. It was Mikolov et al. (2013), however, who really brought word embedding to the fore through the creation of word2vec, a toolkit enabling the training and use of pre-trained embeddings. A year later, Pennington et al. introduced us to GloVe, a competitive set of pre-trained embeddings, suggesting that word embeddings was suddenly among the mainstream.

Word embeddings are considered to be among a small number of successful applications of unsupervised learning at present. The fact that they do not require pricey annotation is probably their main benefit. Rather, they can be derived from already available unannotated corpora.

Word embedding models

Naturally, every feed-forward neural network that takes words from a vocabulary as input and embeds them as vectors into a lower dimensional space, which it then fine-tunes through back-propagation, necessarily yields word embeddings as the weights of the first layer, which is usually referred to as Embedding Layer.

nn_language_model-1Figure 1: A neural language model (Bengio et al., 2006)

The key difference between a network like this and a method like word2vec is its computational complexity, which explains why it wasn’t until 2013 that word embeddings became so prominent in the NLP space. The recent and rapid expansion and affordability in computational power has certainly aided its emergence.

The training objectives for GloVe and word2vec are another difference, with both geared towards producing word embeddings that encode general semantic relationships and can provide benefit in many downstream tasks. Regular neural networks, in comparison, generally produce task-specific embeddings with limitations in relation to their use elsewhere.

In comparing models, we will assume the following notational standards: We assume a training corpus containing a sequence of \(T\) training words \(w_1, w_2, w_3, \cdots, w_T\) that belong to a vocabulary \(V\) whose size is \(|V|\). Our models generally consider a context of \( n \) words. We associate every word with an input embedding \( v_w \) (the eponymous word embedding in the Embedding Layer) with \(d\) dimensions and an output embedding \( v’_w \) (another word representation whose role will soon become clearer). We finally optimize an objective function \(J_\theta\) with regard to our model parameters \(\theta\) and our model outputs some score \(f_\theta(x)\) for every input \( x \).

Classic neural language model

The classic neural language model proposed by Bengio et al. [1] in 2003 consists of a one-hidden layer feed-forward neural network that predicts the next word in a sequence as in Figure 2.

bengio_language_modelFigure 2: Classic neural language model (Bengio et al., 2003)

Their model maximizes what we’ve described above as the prototypical neural language model objective (For simplicity, the regularization term has been omitted):

\(J_\theta = \frac{1}{T}\sum\limits_{t=1}^T\ \text{log} \space f(w_t , w_{t-1} , \cdots , w_{t-n+1})\).

\( f(w_t , w_{t-1} , \cdots , w_{t-n+1}) \) is the output of the model, i.e. the probability \( p(w_t \: | \: w_{t-1} , \cdots , w_{t-n+1}) \) as computed by the softmax, where \(n \) is the number of previous words fed into the model.

Bengio et al. were among the first to introduce what has become to be known as a word embedding, a real-valued word feature vector in \(\mathbb{R}\). The foundations of their model can still be found in today’s neural language and word embedding models. They are:

1. Embedding Layer: This layer generates word embeddings by multiplying an index vector with a word embedding matrix;

2. Intermediate Layer(s): One or more layers that produce an intermediate representation of the input, e.g. a fully-connected layer that applies a non-linearity to the concatenation of word embeddings of \(n\) previous words;

3. Softmax Layer: The final layer that produces a probability distribution over words in \(V\).

Bengio et al. also determine two issues with current state-of-the-art-models:

– The first is that Layer 2. is replaceable with an LSTM, which is used by state-of-the-art neural language models [6], [7].

– They also identify the final softmax layer (more precisely: the normalization term) as the network’s main bottleneck, as the cost of computing the softmax is proportional to the number of words in \(V\), which is typically on the order of hundreds of thousands or millions.

Discovering methods that alleviate the computational cost related to computing the softmax over a large vocabulary [9] is therefore one of the main challenges in both neural language and word embedding models.

C&W model

After Bengio et al.’s initial efforts in neural language models, research in word embeddings stalled as computational power and algorithms were not yet at a level that enabled the training of a large vocabulary.

In 2008, Collobert and Weston [4] (thus C&W) demonstrated that word embeddings trained on an adequately large dataset carry syntactic and semantic meaning and improve performance on downstream tasks. In their 2011 paper, they further expand on this [8].

In order to avoid computing the expensive softmax, their solution is to employ an alternative objective function: rather than the cross-entropy criterion of Bengio et al., which maximizes the probability of the next word given the previous words, Collobert and Weston train a network to output a higher score \(f_\theta\) for a correct word sequence (a probable word sequence in Bengio’s model) than for an incorrect one. For this purpose, they use a pairwise ranking criterion, which looks like this:

\(J_\theta\ = \sum\limits_{x \in X} \sum\limits_{w \in V} \text{max} \lbrace 0, 1 – f_\theta(x) + f_\theta(x^{(w)}) \rbrace \).

They sample correct windows \(x\) containing \(n\) words from the set of all possible windows \(X\) in their corpus. For each window \(x\), they then produce a corrupted, incorrect version \(x^{(w)}\) by replacing \(x\)’s centre word with another word \(w\) from \(V\). Their objective now maximises the distance between the scores output by the model for the correct and the incorrect window with a margin of \(1\). Their model architecture, depicted in Figure 3 without the ranking objective, is analogous to Bengio et al.’s model.

nlp_almost_from_scratch_window_approachFigure 3: The C&W model without ranking objective (Collobert et al., 2011)

The resulting language model produces embeddings that already possess many of the relations word embeddings have become known for, e.g. countries are clustered close together and syntactically similar words occupy similar locations in the vector space. While their ranking objective eliminates the complexity of the softmax, they keep the intermediate fully-connected hidden layer (2.) of Bengio et al. around (the HardTanh layer in Figure 3), which constitutes another source of expensive computation. Partially due to this, their full model trains for seven weeks in total with \(|V| = 130000\).


Word2Vec is arguably the most popular of the word embedding models. Because word embeddings are a key element of deep learning models for NLP, it is generally assumed to belong to the same group. However, word2vec is not technically not be considered a component of deep learning, with the reasoning being that its architecture is neither deep nor uses non-linearities (in contrast to Bengio’s model and the C&W model).

Mikolov et al. [2] recommend two architectures for learning word embeddings that, when compared with previous models, are computationally less expensive.

Here are two key benefits that these architectures have over Bengio’s and the C&W model;

– They forgo the costly hidden layer.

– They allow the language model to take additional context into account.

The success their model can not only be attributed to these differences, it importantly also comes from specific training strategies, both of which we will now look at;

Continuous bag-of-words (CBOW)

Unlike a language model that can only base its predictions on past words, as it is assessed based on its ability to predict each next word in the corpus, a model that only aims to produce accurate word embeddings is not subject to such restriction. Mikolov et al. therefore use both the \(n\) words before and after the target word \( w_t \) to predict it as shown in Figure 4. This is known as a continuous bag of words (CBOW), owing to the fact that it uses continuous representations whose order is of no importance.

cbowFigure 4: Continuous bag-of-words (Mikolov et al., 2013)

The purpose of CBOW is only marginally different than that of the the language model one:

\(J_\theta = \frac{1}{T}\sum\limits_{t=1}^T\ \text{log} \space p(w_t \: | \: w_{t-n} , \cdots , w_{t-1}, w_{t+1}, \cdots , w_{t+n})\).

Rather than feeding \( n \) previous words into the model, the model receives a window of \( n \) words around the target word \( w_t \) at each time step \( t \).


While CBOW can be seen as a precognitive language model, skip-gram turns the language model objective on its head: rather than using the surrounding words to predict the centre word as with CBOW, skip-gram uses the centre word to predict the surrounding words as can be seen in Figure 5.

skip-gramFigure 5: Skip-gram (Mikolov et al., 2013)

The skip-gram objective thus sums the log probabilities of the surrounding \( n \) words to the left and to the right of the target word \( w_t \) to produce the following objective:

\(J_\theta = \frac{1}{T}\sum\limits_{t=1}^T\ \sum\limits_{-n \leq j \leq n , \neq 0} \text{log} \space p(w_{t+j} \: | \: w_t)\).


In contrast to word2vec, GloVe [5] seeks to make explicit what word2vec does implicitly: Encoding meaning as vector offsets in an embedding space — seemingly only a serendipitous by-product of word2vec — is the specified goal of GloVe.

merge_from_ofoct--2-Figure 6: Vector relations captured by GloVe (Stanford)

To be specific, the creators of GloVe illustrate that the ratio of the co-occurrence probabilities of two words (rather than their co-occurrence probabilities themselves) is what contains information and so look to encode this information as vector differences.

For this to be accomplished, they propose a weighted least squares objective \(J\) that directly aims to reduce the difference between the dot product of the vectors of two words and the logarithm of their number of co-occurrences:

\(J = \sum\limits_{i, j=1}^V f(X_{ij}) \: (w_i^T \tilde{w}_j + b_i + \tilde{b}_j – \text{log} \: X_{ij})^2 \)

where \(w_i\) and \(b_i\) are the word vector and bias respectively of word \(i\), \(\tilde{w}_j\) and \(b_j\) are the context word vector and bias respectively of word \(j\), \(X_{ij}\) is the number of times word \(i\) occurs in the context of word \(j\), and \(f\) is a weighting function that assigns relatively lower weight to rare and frequent co-occurrences.

As co-occurrence counts can be directly encoded in a word-context co-occurrence matrix, GloVe takes such a matrix rather than the entire corpus as input.

Word embeddings vs. distributional semantics models

Word embedding models such as word2vec and GloVe gained such popularity as they appeared to regularly and substantially outperform traditional Distributional Semantic Models (DSMs). Many attributed this to the neural architecture of word2vec, or the fact that it predicts words, which seemed to have a natural edge over solely relying on co-occurrence counts.

DSMs can be seen as count models as they “count” co-occurrences among words by operating on co-occurrence matrices. Neural word embedding models, in contrast, can be viewed as predict models, as they try to predict surrounding words.

In 2014, Baroni et al. [11] demonstrated that, in nearly all tasks, predict models consistently outperform count models, and therefore provided us with a comprehensive verification for the supposed superiority of word embedding models. Is this the end? No.

With GloVe, we have already seen that the differences are not as obvious: While GloVe is considered a predict model by Levy et al. (2015) [10], it is clearly factorizing a word-context co-occurrence matrix, which brings it close to traditional methods such as PCA and LSA. Even more, Levy et al. [12] demonstrate that word2vec implicitly factorizes a word-context PMI matrix.

While on the surface DSMs and word embedding models use varying algorithms to learn word representations – the former count, the latter predict – both types of model fundamentally act on the same underlying statistics of the data, i.e. the co-occurrence counts between words.

And so the question that we will focus on for the remainder of this post still remains:

Why do word embedding models still outperform DSM with very similar information?

Comparison models

To establish the elements that attribute to the success of neural word embedding models, and illustrate how they can be transferred to traditional processes, we will compare the following models;

Positive Pointwise Mutual Information (PPMI)

PMI is a typical measure for the strength of association between two words.It is defined as the log ratio between the joint probability of two words \(w\) and \(c\) and the product of their marginal probabilities: \(PMI(w,c) = \text{log} \: \frac{P(w,c)}{P(w)\:P(c)} \). As \( PMI(w,c) = \text{log} \: 0 = – \infty \) for pairs \( (w,c) \) that were never observed, PMI is in practice often replaced with positive PMI (PPMI), which replaces negative values with \(0\), yielding \(PPMI(w,c) = \text{max}(PMI(w,c),0)\).

Singular Value Decomposition (SVD)

SVD is among the more popular methods for dimensionality reduction and came about in NLP originally via latent semantic analysis (LSA). SVD factorizes the word-context co-occurrence matrix into the product of three matrices \(U \cdot \Sigma \times V^T \) where \(U\) and \(V\) are orthonormal matrices (i.e. square matrices whose rows and columns are orthogonal unit vectors) and \(\Sigma\) is a diagonal matrix of eigenvalues in

decreasing order. In practice, SVD is often used to factorize the matrix produced by PPMI. Generally, only the top \(d\) elements of \(\Sigma\) are kept, yielding \(W^{SVD} = U_d \cdot \Sigma_d\) and \(C^{SVD} = V_d\), which are commonly used as the word and context representations respectively.

Skip-gram with Negative Sampling (SGNS)

Aka word2vec, as shown above.

Global Vectors (GloVe)

As shown earlier in this post.


We will focus on the following hyper-parameters:


Word2vec suggests three methods of pre-processing a corpus, each of which can be applied to DSMs with ease.

Dynamic context window

Normally in DSMs, the context window is unweighted and of a unchanging size. Both SGNS and GloVe, however, use a scheme that assigns more weight to closer words, as closer words are generally considered to be more important to a word’s meaning. Additionally, in SGNS, the window size is not fixed, but the actual window size is dynamic and sampled uniformly between \(1\) and the maximum window size during training.

Subsampling frequent words

SGNS dilutes very frequent words by randomly removing words whose frequency \(f\) is higher than some threshold \(t\) with a probability \(p = 1 – \sqrt{\frac{t}{f}}\). As this subsampling is done before actually creating the windows, the context windows used by SGNS in practice are larger than indicated by the context window size.

Deleting rare words

During the pre-processing of SGNS, rare words are also deleted before creating the context windows, which increases the actual size of the context windows further. The actual performance impact of this is insignificant, however, according to Levy et al. (2015)

Association metric

In relation to measuring the association between two words, PMI is seen as useful metric. Since Levy and Goldberg (2014) have shown SGNS to implicitly factorize a PMI matrix, two variations stemming from this formulation can be introduced to regular PMI.

Shifted PMI

In SGNS, the greater the volume of negative samples \(k\), the more data is being used and so the estimation of the parameters should therefore improve. \(k\) affects the shift of the PMI matrix that is implicitly factorized by word2vec, i.e. \(k\) k shifts the PMI values by log \(k\).

If we transfer this to regular PMI, we obtain Shifted PPMI (SPPMI): \(SPPMI(w,c) = \text{max}(PMI(w,c) – \text{log} \: k,0)\).

Context distribution smoothing

In SGNS, the negative samples are sampled according to a _smoothed_ unigram distribution, i.e. an unigram distribution raised to the power of \(\alpha\), which is empirically set to \(\frac{3}{4}\). This leads to frequent words being sampled relatively less often than their frequency would indicate.

We can transfer this to PMI by equally raising the frequency of the context words \(f(c)\) to the power of \(\alpha\):

\(PMI(w, c) = \text{log} \frac{p(w,c)}{p(w)p_\alpha(c)}\) where \(p_\alpha(c) = \frac{f(c)^\alpha}{\sum_c f(c)^\alpha}\) and \(f(x)\) is the frequency of word \(x\).


Just like in pre-processing, three methods can be used to modify the word vectors produced by an algorithm.

Adding context vectors

The authors of GloVe recommend the addition of word vectors and context vectors to create the final output vectors, e.g. \(\vec{v}_{\text{cat}} = \vec{w}_{\text{cat}} + \vec{c}_{\text{cat}}\). This adds first-order similarity terms, i.e \(w \cdot v\). This method, however, cannot be applied to PMI, as the vectors produced by PMI are too infrequent.

Eigenvalue weighting

SVD produces the following matrices: \(W^{SVD} = U_d \cdot \Sigma_d \) and \(C^{SVD} = V_d\). These matrices, however, have different properties: \(C^{SVD}\) is orthonormal, while \(W^{SVD}\) is not.

SGNS is more symmetric in contrast. We can thus weight the eigenvalue matrix \(\Sigma_d\) with an additional parameter \(p\), which can be tuned, to yield the following:

\(W^{SVD} = U_d \cdot \Sigma_d^p\).

Vector normalisation

Finally, we can also normalise all vectors to unit length.


Levy et al. (2015) train all models on a dump of the English wikipedia and evaluate them on the commonly used word similarity and analogy datasets. You can read more about the experimental setup and training details in their paper. We summarise the most important results and takeaways below.


Levy et al. find that SVD — and not one of the word embedding algorithms — performs best on similarity tasks, while SGNS performs best on analogy datasets. They furthermore shed light on the importance of hyperparameters compared to other choices:

  1. Hyperparameters vs. algorithms:
    Hyperparameter settings are often more important than algorithm choice.
    No single algorithm consistently outperforms the other methods.
  2. Hyperparameters vs. more data:
    Training on a larger corpus helps for some tasks.
    In 3 out of 6 cases, tuning hyperparameters is more beneficial.

Debunking prior claims

Equipped with these insights, we can now debunk some generally held claims:

  1. Are embeddings superior to distributional methods?
    With the right hyperparameters, no approach has a consistent advantage over another.
  2. Is GloVe superior to SGNS?
    SNGS outperforms GloVe on all comparison tasks of Levy et al. This should necessarily be taken with a grain of salt as GloVe might perform better on other tasks.
  3. Is CBOW a good word2vec configuration?
    CBOW does not outperform SGNS on any task.


DON’T use shifted PPMI with SVD.

DON’T use SVD “correctly”, i.e. without eigenvector weighting (performance drops 15 points compared to with eigenvalue weighting with \(p = 0.5\)).

DO use PPMI and SVD with short contexts (window size of \(2\)).

DO use many negative samples with SGNS.

DO always use context distribution smoothing (raise unigram distribution to the power of \(\alpha = 0.75\)) for all methods.

DO use SGNS as a baseline (robust, fast and cheap to train).

DO try adding context vectors in SGNS and GloVe.


These results are in contrast to the general consensus that word embeddings are superior to traditional methods. Rather, they indicate that it typically makes no difference whatsoever whether word embeddings or distributional methods are used. What really matters is that your hyperparameters are tuned and that you utilize the appropriate pre-processing and post-processing steps.

Recent studies by Jurafsky’s group [13], [14] reflect these findings and illustrate that SVD, rather than SGNS, is commonly the preferred choice accurate word representations is important.

We hope this overview of word embeddings has helped to highlight some fantastic research that sheds light on the relationship between traditional distributional semantic and in-vogue embedding models.


[1]: Bengio, Y., Ducharme, R., Vincent, P., & Janvin, C. (2003). A Neural Probabilistic Language Model. The Journal of Machine Learning Research, 3, 1137–1155.

[2]: Mikolov, T., Corrado, G., Chen, K., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. Proceedings of the International Conference on Learning Representations (ICLR 2013), 1–12.

[3]: Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. NIPS, 1–9.

[4]: Collobert, R., & Weston, J. (2008). A unified architecture for natural language processing. Proceedings of the 25th International Conference on Machine Learning – ICML ’08, 20(1), 160–167.

[5]: Pennington, J., Socher, R., & Manning, C. D. (2014). Glove: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 1532–1543.

[6]: Kim, Y., Jernite, Y., Sontag, D., & Rush, A. M. (2016). Character-Aware Neural Language Models. AAAI. Retrieved from

[7]: Jozefowicz, R., Vinyals, O., Schuster, M., Shazeer, N., & Wu, Y. (2016). Exploring the Limits of Language Modeling. Retrieved from

[8]: Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural Language Processing (almost) from Scratch. Journal of Machine Learning Research, 12 (Aug), 2493–2537. Retrieved from

[9]: Chen, W., Grangier, D., & Auli, M. (2015). Strategies for Training Large Vocabulary Neural Language Models, 12. Retrieved from

[10]: Levy, O., Goldberg, Y., & Dagan, I. (2015). Improving Distributional Similarity with Lessons Learned from Word Embeddings. Transactions of the Association for Computational Linguistics, 3, 211–225. Retrieved from

[11]: Baroni, M., Dinu, G., & Kruszewski, G. (2014). Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. ACL, 238–247.

[12]: Levy, O., & Goldberg, Y. (2014). Neural Word Embedding as Implicit Matrix Factorization. Advances in Neural Information Processing Systems (NIPS), 2177–2185. Retrieved from

[13]: Hamilton, W. L., Clark, K., Leskovec, J., & Jurafsky, D. (2016). Inducing Domain-Specific Sentiment Lexicons from Unlabeled Corpora. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Retrieved from

[14]: Hamilton, W. L., Leskovec, J., & Jurafsky, D. (2016). Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change. arXiv Preprint arXiv:1605.09096.


Text Analysis API - Sign up


Sentiment analysis is widely used to gauge public opinion towards products, to analyze customer satisfaction, and to detect trends. With the proliferation of customer reviews, more fine-grained aspect-based sentiment analysis (ABSA) has gained in popularity, as it allows aspects of a product or service to be examined in more detail. To this end, we have launched an ABSA service a while ago and demonstrated how the service can be used to gain insights into the strengths and weaknesses of a product.

For performing sentiment analysis on customer reviews (as well as with many other text classification tasks), we face the problem that there are many different categories of reviews such as books, electronics, restaurants, etc. (You only need to have a look at the Departments tab on Amazon to get a feeling for the diversity of these categories.) In Machine Learning and Natural Language Processing, we refer to these different categories as domains; every domain has their unique characteristics.

In practice, this means that a model that is trained on one domain will do worse on another domain depending on how dissimilar the two domains are. For instance, a model trained on the restaurants domain will do a lot worse on the books domain than on the comparatively more similar hotels domain. For aspect-based sentiment analysis, this problem is amplified as not only the domains but also the aspects belonging to those domains differ.

In addition, as the world becomes more globalized, customer reviews need to be analyzed in other languages besides English. We thus require large amounts of training data for a large number of language-domain pairs, which is infeasible in practice as annotation — particularly the meticulous annotation required for ABSA — is expensive.

There are two directions that complement each other, which we can take to address this deficit:

  1. We can create models that allow us to transfer existing knowledge and adapt trained models to new domains without incurring a large performance hit. This area is called domain adaptation and we will talk about this in future blog posts.
  2. We can create models that are more generic and that are able to generalize well even when trained on relatively few data by leveraging information inherent in the training data. In the rest of this blog post, we will talk about the approach we took in our EMNLP 2016 paper towards this goal.

Even though Deep Learning-based models constitute the state-of-the-art in many NLP tasks, they traditionally do well only with large amounts of data. Finding ways to help them generalize with only few data samples is thus an important research problem in its own right. Taken to the extreme, we would like to emulate the way humans learn from only few examples, also known as One-Shot Learning.

Reviews — just with any coherent text — have an underlying structure. In the discourse structure of the review, sentences are connected via different rhetorical relations. Intuitively, knowledge about the relations and the sentiment of surrounding sentences should inform the sentiment of the current sentence. If a reviewer of a restaurant has shown a positive sentiment towards the quality of the food, it is likely that his opinion will not change drastically over the course of the review. Additionally, overwhelmingly positive or negative sentences in the review help to disambiguate sentences whose sentiment is equivocal.

Existing Deep Learning models for sentiment analysis act only on the sentence level; while they are able to consider intra-sentence relations, they fail to capture inter-sentence relations that rely on discourse structure and provide valuable clues for sentiment prediction.

We propose a hierarchical bidirectional long short-term memory (H-LSTM) that is able to leverage both intra– and inter-sentence relations. Because our model only relies on sentences and their structure within a review, it is fully language-independent.


hierarchical_lstm-page-001 (1)

Figure 1: The hierarchical bidirectional LSTM (H-LSTM) for aspect-based sentiment analysis. Word embeddings are fed into a sentence-level bidirectional LSTM. Final states of forward and backward LSTM are concatenated together with the aspect embedding and fed into a bidirectional review-level LSTM. At every time step, the output of the forward and backward LSTM is concatenated and fed into a final layer, which outputs a probability distribution over sentiments.

You can view the architecture of our model in the image above. The model consists of the following components:


We use a Long Short-Term Memory (LSTM), which adds input, output, and forget gates to a recurrent cell, which allow it to model long-range dependencies that are essential for capturing sentiment.

For the \(t\)th word in a sentence, the LSTM takes as input the word embedding \(x_t\), the previous output \(h_{t-1}\) and cell state \(c_{t-1}\) and computes the next output \(h_t\) and cell state \(c_t\). Both \(h\) and \(c\) are initialized with zeros.

Bidirectional LSTM

Both on the review and on the sentence level, sentiment is dependent not only on preceding but also successive words and sentences. A Bidirectional LSTM (Bi-LSTM) allows us to look ahead by employing a forward LSTM, which processes the sequence in chronological order, and a backward LSTM, which processes the sequence in reverse order. The output \(h_t\) at a given time step is then the concatenation of the corresponding states of the forward and backward LSTM.

Hierarchical Bidirectional LSTM

Stacking a Bi-LSTM on the review level on top of sentence-level Bi-LSTMs yields the hierarchical bidirectional LSTM (H-LSTM) in Figure 1.

The sentence-level forward and backward LSTMs receive the sentence starting with the first and last word embedding \(x\_{1}\) and \(x\_l\) respectively. The final output \(h\_l\) of both LSTMs is then concatenated with the aspect vector \(a\) and fed as input into the review-level forward and backward LSTMs. The outputs of both LSTMs are concatenated and fed into a final softmax layer, which outputs a probability distribution over sentiments for each sentence.


The most popular benchmark for ABSA is the SemEval Aspect-based Sentiment Analysis task. We evaluate on the most recent edition of this task, the SemEval-2016 ABSA task. To demonstrate the language and domain independence of our model, we evaluate on datasets in five domains (restaurants, hotels, laptops, phones, cameras) and eight languages (English, Spanish, French, Russian, Dutch, Turkish, Arabic, Chinese) from the competition.

We compare our model using random (H-LSTM) and pre-trained word embeddings (HP-LSTM)  against the best model of the SemEval-2016 Aspect-based Sentiment Analysis task for each domain-language pair (Best) as well as against the two best single models of the competition: IIT-TUDA (Kumar et al., 2016), which uses large sentiment lexicons for every language, and XRCE (Brun et al., 2016), which uses a parser augmented with hand-crafted, domain-specific rules.

English Restaurants 88.1 88.1 86.7 82.1 81.4 83.0 85.3
Spanish Restaurants 83.6 83.6 79.6 75.7 79.5 81.8
French Restaurants 78.8 78.8 72.2 73.2 69.8 73.6 75.4
Russian Restaurants 77.9 73.6 75.1 73.9 78.1 77.4
Dutch Restaurants 77.8 77.0 75.0 73.6 82.2 84.8
Turkish Restaurants 84.3 84.3 74.2 73.6 76.7 79.2
Arabic Hotels 82.7 81.7 82.7 80.5 82.8 82.9
English Laptops 82.8 82.8 78.4 76.0 77.4 80.1
Dutch Phones 83.3 82.6 83.3 81.8 81.3 83.6
Chinese Cameras 80.5 78.2 77.6 78.6 78.8
Chinese Phones 73.3 72.4 70.3 74.1 73.3

Table 1: Results of our system with randomly initialized word embeddings (H-LSTM) and with pre-trained embeddings (HP-LSTM) for ABSA for each language and domain in comparison to the best system for each pair (Best), the best two single systems (XRCE, IIT-TUDA), a sentence-level CNN (CNN), and our sentence-level LSTM (LSTM).

As you can see in the table above, our hierarchical model achieves results superior to the sentence-level CNN and the sentence-level Bi-LSTM baselines for almost all domain-language pairs by taking the structure of the review into account.

In addition, our model shows results competitive with the best single models of the competition, while requiring no expensive hand-crafted features or external resources, thereby demonstrating its language and domain independence. Overall, our model compares favorably to the state-of-the-art, particularly for low-resource languages, where few hand-engineered features are available. It outperforms the state-of-the-art on four and five datasets using randomly initialized and pre-trained embeddings respectively. For more details, refer to our paper.


Text Analysis API - Sign up


From July 20th to July 28th 2016, I had the opportunity of  attending the 6th Lisbon Machine Learning School. The Lisbon Machine Learning School (LxMLS) is an annual event that brings together researchers and graduate students in the fields of NLP and Computational Linguistics, computer scientists with an interest in statistics and ML, and industry practitioners with a desire for a more in-depth understanding. Participants had a chance to join workshops and labs, where they got hands-on experience with building and exploring state-of-the-art deep learning models, as well as to attend talks and speeches by prominent deep learning and NLP researchers from a variety of academic and industrial organisations. You can find the entire programme here.

In this blog post, I am going to share some of the highlights, key insights and takeaways of the summer school. I am going to skip the lectures of the first and second day as they introduce basic Python, Linear Algebra, and Probability Theory concepts and focus on the later lectures and talks. First, we are going to talk about sequence models. We will then turn to structured prediction, a type of supervised ML common to NLP. We will then summarize the lecture on Syntax and Parsing and finally provide insights with regard to Deep Learning. The accompanying slides can be found as a reference at the end of this blog post.

Disclaimer: This blog post is not meant to give a comprehensive introduction of each of the topics discussed; it should rather give you an overview of the week-long event and provide you with pointers if you want to delve deeper into any of the topics.

Sequence Models

Noah Smith of the University Washington kicked off the third day of the summer school with a compelling lecture about sequence models. To test your understanding of sequence models, try to answer – without reading further – the following question: What is the most basic sequence model depicted in Figure 1?


Figure 1: The most basic sequence model

Correct! It is the bag-of-words (notice which words have “fallen” out of the bag-of-words). The bag-of-words model makes the strongest independence assumption of all sequence models: It supposes that each word is entirely independent of its predecessors. It is obvious why models that rely on this assumption do only a poor job at modelling language: every word naturally depends on the words that have preceded it.

Somewhat more sophisticated models thus relax this naive assumption to reduce the entropy: A 1st Order Markov model makes each word dependent on the word that immediately preceded it. This way, it is already able to capture some context of the context that can help to disambiguate a new word. More generally, \(m^{\text{th}}\) Order Markov Models make each word depend on its previous \(m\) words.

In mathematical terms, in \(m^{\text{th}}\) Order Markov Models, the probability of a text sequence (we assume here that such a sequence is delimited by start and stop symbols) can be calculated using the chain rule as the product of the probabilities of the individual words:

\(p(\text{start}, w_1, w_2, …, w_n, \text{stop}) = \prod\limits_{i=1}^{n+1} \gamma (w_i \: | \: w_{i-m}, …, w_{i-1}) \)

where \(\gamma\) is the probability of the current word \(w_i\) given its \(m\) previous words, i.e. the probability to transition from the previous words to the current word.

We can view bag-of-words and \(m^{\text{th}}\) Order Markov Models as occupying the following spectrum:


Figure 2: From bag-of-words to history-based models

As we go right in Figure 2, we make weaker independence assumption and in exchange gain richer expressive power, while requiring more parameters – until we eventually obtain the most expressive – and most rigid – model, a history-based model where each word depends on its entire history, i.e. all preceding words.

As a side-note, state-of-the-art sequence modelling models such as recurrent neural networks and LSTMs can be thought of as being located on the right side of this spectrum, as they don’t require an explicit specification of context words but are – theoretically – able to take into account the entire history.

In many cases, we would not only like to model just the observed sequence of symbols, but take some additional information into account. Hidden Markov Models (HMMs) allow us to associate with each symbol \(w_i\) some missing information, its “state” \(s_i\). The probability of a word sequence in an HMM then not only depends on the transition probability \(\gamma\) but also on the so-called emission probability \(\eta\):

\(p(\text{start}, w_1, w_2, …, w_n, \text{stop}) = \prod\limits_{i=1}^{n+1} \eta (w_i \: | \: s_i) \: \gamma (s_i \: | \: s_{i-1}) \)

Consequently, the HMM is a joint model over observable symbols and hidden/latent/unknown classes. HMMs have traditionally been used in part-of-speech tagging or named entity recognition where the hidden states are POS and NER tags respectively.

If we want to determine the most probable sequence of hidden states, we face a space of potential sequences that grows exponentially with the sequence length. The classic dynamic algorithm to cope with this problem is the Viterbi algorithm, which is used in HMMs, CRFs, and other sequence models to calculate the most probable sequence of hidden states: It lays out the symbol sequence and all possible states in a grid and proceeds left-to-right to compute the maximum probability to transition in every new state given the previous states. The most probable sequence can then be found by back-tracking as in Figure 3.


Figure 3: The Viterbi algorithm

A close relative is the forward-backward algorithm, which is used to calculate the probability of a word sequence and the probabilities of each word’s states, e.g. for language modelling. Indeed, the only difference between Viterbi and the forward-backward algorithm is that Viterbi takes the maximum of the probabilities of the previous state, while forward-backward takes the sum. In this sense, they correspond to the same abstract algorithm, which is instantiated in two different semirings where a semiring informally is a set of values and some operations that obey certain properties.

Finally, if we want to learn HMMs in an unsupervised way, we use the well-known Expectation Maximisation (EM) algorithm, which consists of two steps: During the E step, we calculate the probability of each possible transition and emission at every position with forward-backward (or Viterbi for “hard” EM); for the M step, we re-estimate the parameters with MLE.

Machine Translation

On the evening of the third day, Philipp Koehn, one of the pioneers of MT and inventor of phrase-based machine translation gave a talk on Machine Translation as Sequence Modelling, including a detailed review of different MT and alignment approaches. If you are interested in a comprehensive history of MT that takes you from IBM Model 1 all the way to phrase-based, syntax-based and eventually neural MT, while delving into the details of alignment, translation, and decoding, definitely check out the slides here.

Structured Prediction

HMMs can model sequences, but as their weights are tied to the generative process, strong independence assumptions need to be made to make their computation tractable. We will now turn to a category of models that are more expressive and can be used to predict more complex structures: Structured prediction — which was introduced by Xavier Carreras of Xerox Research on the morning of the fourth day — is used to refer to ML algorithms that don’t just predict scalar or real values, but more complex structures. As complex structures are common in language, so is structured prediction; example tasks of structured prediction in NLP include POS tagging, named entity recognition, machine translation, parsing, and many others.

A successful category of structured prediction models are log-linear models, which are so-called because they model log-probabilities using a linear predictor. Such models try to estimate the parameters \(w\) by calculating the following probability:

\(\text{log} \: \text{Pr}(\mathbf{y} \: | \: \mathbf{x}; \mathbf{w}) = \text{log} \:\frac {\text{exp}\{\mathbf{w} \cdot \mathbf{f}(\mathbf{x},\mathbf{y})\}}{Z(\mathbf{x};\mathbf{w})}\)

where \(\mathbf{x} = x_1, x_2, …, x_n \in \mathcal{X}\) is the sequence of symbols, \(\mathbf{y} = y_1, y_2, …, y_n \in \mathcal{Y}\) is the corresponding sequence of labels, \(\mathbf{f}(\mathbf{x},\mathbf{y})\) is a feature representation of \(\mathbf{x}\) and \(\mathbf{y}\), and \(Z(\mathbf{x};\mathbf{w}) = \sum\limits_{\mathbf{y}’ \in \mathcal{Y}} \text{exp}(\mathbf{w} \cdot \mathbf{f}(\mathbf{x},\mathbf{y}’)) \) is also referred to as the partition function.

Two approaches that can be used to estimate the model parameters \(w\) are:

  1. Maximum Entropy Markov Models (MEMMs), which assume that \(\text{Pr}(\mathbf{y} \: | \: \mathbf{x}; \mathbf{w})\) decomposes, i.e. that we can express it as a product of the individual label probabilities that only depend on the previous label (similar to HMMs).
  2. Conditional Random Fields (CRFs), which make a weaker assumption by only assuming that \(\mathbf{f}(\mathbf{x},\mathbf{y})\) decomposes.

In MEMMs, we assume – similarly to Markov Models – that the label \(y_i\) at the \(i\) th position does not depend on all past labels, but only on the previous label \(y_{i-1}\). In contrast to Markov Models, MEMMs allow us to condition the label \(y_i\) on the entire symbol sequence \(x_{1:n}\). Both assumptions combined lead to the following probability of label \(y_i\) in MEMMs:

\(\text{Pr}(y_i \: | \: x_{1:n}, y_{1:i-1}) = \text{Pr}(y_i \: | \: x_{1:n}, y_{i-1})\)

By this formulation, the objective of MEMMs reduces sequence modelling to multi-class logistic regression.

In CRFs, we factorize on label bigrams. Instead of greedily predicting the most probable label \(y_i\) at every position \(i\), we instead aim to find the sequence of labels with the maximum probability:

\(\underset{y \in \mathcal{Y}}{\text{argmax}} \sum_i \mathbf{w} \cdot \mathbf{f}(\mathbf{x}, i, y_{i-1}, y_i)\)

We then estimate the parameters \(w\) of our model using gradient-based methods where we can use forward-backward to compute the gradient.

CRFs vs. MEMMs

By choosing between MEMMs and CRFs, we make the choice between local and global normalisation. While MEMMs aim to predict the most probable label at every position, CRFs aim to find the most probable label sequence. This, however, leads to the so-called “Label Bias Problem” in MEMMs: As MEMMs choose the label with the highest probability, the model is biased towards more frequent labels, often irrespective of the local context.

As MEMMs reduce to multi-class classification, they are cheaper to train. On the other hand, CRFs are more flexible and thus easier to extend to more complex structures.

This distinction between local and global normalisation has been a recurring topic in sequence modelling and a key criterion when choosing an algorithm. For text generation tasks, global normalisation is still too expensive, however. Many state-of-the-art approaches thus employ beam search as a compromise between local and global normalisation. In most sequence modelling tasks, local normalisation is very popular due to its ease of use, but might fall out of favour as more advanced models and implementations for global normalisation become available. To this effect, a recent outstanding paper at ACL (Andor et al., 2016) shows that globally normalised models are strictly more expressive than locally normalised ones.

HMMs vs. CRFs

Another distinction that is worth investigating is the difference between generative and discriminative models: HMMs are generative models, while CRFs are discriminative. HMMs only take into account the previous word as its features are tied to the generative process. In contrast, CRF features are very flexible. They can look at the whole input \(x\) paired with a label bigram \((y_i , y_{i+1})\). In practice, for prediction tasks, such “good” discriminative features can improve accuracy a lot.

Regarding the parameter estimation, the distinction between generative and discriminative becomes apparent: HMMs focus on explaining the data, both \(x\) and \(y\), while CRFs focus on the mapping from \(x\) to \(y\). Which model is more appropriate depends on the task: CRFs are commonly used in tasks such as POS tagging and NER, while HMMs have traditionally lain at the heart of speech recognition.

Structured Prediction in NLP with Imitation Learning

Andreas Vlachos of the University of Sheffield gave a talk on using imitation learning for structured prediction in NLP, which followed the same distinction between local normalisation (aka incremental modelling), i.e. greedily predicting one label at a time and global normalisation (aka joint modelling), i.e. scoring the complete outputs with a CRF that we discussed above. Andreas talked about how imitation learning can be used to improve incremental modelling as it allows to a) explore the search space, b) to address error-propagation, and c) to train with regard to the task-specific loss function.

There are many popular imitation learning algorithms in the literature such as SEARN (Daumé III et al., 2009), Dagger (Ross et al. 2011), or V-DAgger (Vlachos and Clark, 2014). Recently, MIXER (Ranzato et al., 2016) has been proposed to directly optimise metrics for text generation, such as BLEU or ROUGE.

An interesting perspective is that imitation learning can be seen as inverse reinforcement learning: Whereas we want to learn the best policy in reinforcement learning, we know the optimal policy in imitation learning, i.e. the labels in the training data; we then infer the per-action reward function and learn a policy, i.e. a classifier that can generalise to unseen data.

Demo Day

Figure 4: Aylien stand at Demo Day

In the evening of the fourth day, we presented – along with other NLP companies and research labs – Aylien at the LxMLS Demo Day.

We presented an overview of our research directions at Aylien, as well as a 1D generative adversarial network demo and visualization.

Syntax and Parsing

Having looked at generic models that are able to cope with sequences and more complex structures, we now briefly mention some of the techniques that are commonly used to deal with one of language’s unique characteristics: syntax. To this end, Slav Petrov of Google Research gave an in-depth lecture about syntax and parsing on the fifth day of the summer school, which discussed, among others, successful parsers such as the Charniak and the Berkeley parser, context-free grammars and phrase-based parsing, projective and non-projective dependency parsing, as well as more recent transition and graph-based parsers.

To tie this to what we’ve already discussed, Figure 5 demonstrates how the distinction between generative and discriminative models applies to parsers.


Figure 5: Generative vs. discriminative parsing models

From Dependencies to Constituents

In the evening of the fifth day, André Martins of Unbabel gave talk a on an ACL 2015 paper of his, in which he shows that constituent parsing can be reduced to dependency parsing to get the best of both worlds: the informativeness of constituent parser output and the speed of dependency parsers.

Their approach works for any out-of-the-box dependency parser, is competitive for English and morphologically rich languages, and achieves results above the state of the art for discontinuous parsing (where edges are allowed to intersect).

Deep Learning

Finally, the two last days were dedicated to Deep Learning and featured prolific researchers from academia and industry labs as speakers. On the morning of the sixth day, Wang Ling of Google DeepMind gave one of the most gentle, family-friendly intro to Deep Learning I’ve seen – titled Deep Neural Networks Are Our Friends with a theme inspired by the Muppets.

The evening talk by Oriol Vinyals of Google DeepMind detailed some of his lessons learned when working on sequence-to-sequence models at Google and gave glimpses of interesting future challenges, among them, one-shot learning for NLP (Vinyals et al., 2016) and enabling neural networks to ponder decisions (Graves, 2016).

For the lecture on the last day, Chris Dyer of CMU and Google DeepMind discussed modelling sequential data with recurrent neural networks (RNNs) and shared some insights and intuitions with regard to working with RNNs and LSTMs.

Exploding / vanishing gradients

If you’ve worked with RNNs before, then you’re most likely familiar with the exploding/vanishing gradients problem: As the length of the sequence increases, computations of the model get amplified, which leads to either exploding or vanishing gradients and thus renders the model incapable to learn. The intuition why advanced models such as LSTMs and GRUs mitigate this problem is that they use summations instead of multiplications (which lead to exponential growth).

Deep LSTMs

Figure 6: Deep LSTMs

Deep or stacked LSTMs are by now a very common sight in the literature and state-of-the-art for many sequence modelling problems. Still, descriptions of implementations often omit details, which might be perceived as self-evident. This, however, means that it is not always clear how a model looks like exactly or how it differs from similar architectures. The same applies to Deep LSTMs. The most standard convention feeds the input not only to the first but (via skip connections) also to subsequent layers as in Figure 6. Additionally, dropout is generally applied only between layers and not on the recurrent connections as this would drop out more and more value over time.


Figure 7: Dropout in Deep LSTMs

Does Depth Matter?

Generally, depth helps. However, in comparison to other applications such as audio/visual processing, depth plays a less significant role in NLP. Hypotheses for this observation are: a) More transformation is required for speech processing, image recognition etc. than for common text applications; b) Less effort has been made to find good architectures (RNNs are expensive to train; have been widely used for less long); c) Backpropagation through time and depth is hard and we need better optimisers.

Generally, 2-8 layers are standard across text applications. Input skip connections are used often but by no means universally.

Only recently have also very deep architectures been proposed for NLP (Conneau et al., 2016).


Mini-batching is generally necessary to make use of optimised matrix-matrix multiplication. In practice, however, this usually requires bucketing training instances based on similar lengths and padding with \(0\)’s, which can be a nuisance. Because of this, this is – according to Chris Dyer – “the era of assembly language programming for neural networks. Make the future an easier place to program!”

Character-based models

Character-based models have gained more popularity recently and for some tasks such as language modelling, using character-based LSTMs blows the results of word-based models out of the water, achieving a significantly lower perplexity with a lot fewer parameters particularly for morphologically rich languages.


Figure 8: CharLSTM > Word Lookup


Finally, no overview of recurrent neural networks is complete without the mention of attention, one of the most influential, recently proposed notions with regard to LSTMs. Attention is closely related to “pooling” operations in convolutional neural networks (and other architectures) as it also allows to selectively focus on particular elements of the input. The most popular attention architecture pioneered by Bahdanau et al. (2015) seems to only care about “content” in that it relies on computing the dot product, i.e. the cosine similarity between vectors. It contains no obvious bias in favor of diagonals, short jumps, fertility, or other structures that might guide actual attention from a psycho-linguistic perspective. Some work has begun to add other “structural” biases (Luong et al., 2015; Cohn et al., 2016), but there are many more opportunities for research.

Attention is similar to alignment, but there are important differences: a) alignment makes stochastic but hard decisions. Even if the alignment probability distribution is “flat”, the model picks one word or phrase at a time; b) in contrast, attention is “soft” (all words are interpolated based on their attention weights). Finally, there is a big difference between “flat” and “peaked” attention weights.

Memory Networks for Language Understanding

Antoine Bordes of Facebook AI Research gave the last talk of the summer school, in which he discussed Facebook AI Research’s two main research directions:

On the one hand, they are working on (Key-Value) Memory Networks, which can be used to jointly model symbolic and continuous systems. They can be trained end-to-end through back propagation and with SGD and provide great flexibility on how to design memories.

On the other hand, they are working on new tools for developing learning algorithms. They have created several datasets of reasonable sizes, such as bAbI, CBT, and MovieQA that are designed to ease interpretation of a model’s capabilities and to foster research.


Figure 9: LxMLS 2016 Group Picture

That was the Lisbon Machine Learning Summer School 2016! We had a blast and hope to be back next year!


Sequence Models (Noah Smith)

Machine Translation as Sequence Modelling (Philipp Koehn)

Learning Structured Predictors (Xavier Carreras)

Structured Prediction in NLP with Imitation Learning (Andreas Vlachos)

Syntax and Parsing – I, II (Slav Petrov)

Turbo Parser Redux: From Dependencies to Constituents (André Martins)

Deep Neural Networks Are Our Friends (Wang Ling)

Modeling Sequential Data with Recurrent Networks (Chris Dyer)

Memory Networks for Language Understanding (Antoine Bordes)

Text Analysis API - Sign up



Interest in Natural Language Processing (NLP) has risen rapidly in recent years; nowadays, NLP forms a key component of the roadmap of almost every major tech company. All share the goals of making advanced NLP capabilities accessible to developers and bringing them into the hands of consumers. Simultaneously, most existing areas in NLP are under active research, with new ideas being developed and tested at a breakneck pace. However, discussion of the potential and applications of such ideas is usually restricted to small interest groups, e.g. collaborating research teams and infrequent venues, e.g. NLP conferences.

We at Aylien are acutely aware of this dynamic: We seek to equip developers and businesses with the NLP tools they need to improve their businesses. At the same time, we are conducting cutting-edge research and continuously looking for ways to leverage this research to improve our services and help our clients.

In order to provide a regular, common forum for students, researchers, and industry professionals to discuss state-of-the-art NLP research and cutting-edge industry applications, we are thrilled to announce the NLP Dublin meetup. We hope that this group will facilitate the exchange of ideas within the Irish NLP community and bring people into contact with ideas and applications they might have otherwise not heard about. Every event will feature presentations about interesting areas of NLP, QA sessions, and ample time for discussions and networking.

The first meetup will take place on August 3. If you are interested in speaking at or sponsoring future meetups, please contact or click the banner below.