Good Contents Are Everywhere, But Here, We Deliver The Best of The Best.Please Hold on!
Your address will show here +12 34 56 78

In a recent blog post I outlined some interesting research directions for people who are just getting into NLP and ML (you can read the original post here). These ideas can also serve as a starting point and source of inspiration if you are considering starting a PhD in this field, which is why I’m highlighting them again here. At AYLIEN, we offer the opportunity for people to join our research team and complete a PhD, while also working on the application of that research directly as part of our products. If this sounds interesting, you should consider applying to work with us on these (or related) ideas. See this blog post for more information.

Table of contents:

It can be hard to find compelling topics to work on and know what questions are interesting to ask when you are just starting as a researcher in a new field. Machine learning research in particular moves so fast these days that it is difficult to find an opening.

This post aims to provide inspiration and ideas for research directions to junior researchers and those trying to get into research. It gathers a collection of research topics that are interesting to me, with a focus on NLP and transfer learning. As such, they might obviously not be of interest to everyone. If you are interested in Reinforcement Learning, OpenAI provides a selection of interesting RL-focused research topics. In case you’d like to collaborate with others or are interested in a broader range of topics, have a look at the Artificial Intelligence Open Network.

Most of these topics are not thoroughly thought out yet; in many cases, the general description is quite vague and subjective and many directions are possible. In addition, most of these are not low-hanging fruit, so serious effort is necessary to come up with a solution. I am happy to provide feedback with regard to any of these, but will not have time to provide more detailed guidance unless you have a working proof-of-concept. I will update this post periodically with new research directions and advances in already listed ones. Note that this collection does not attempt to review the extensive literature but only aims to give a glimpse of a topic; consequently, the references won’t be comprehensive.

I hope that this collection will peak your interest and serve as inspiration for your own research agenda.

Task-independent data augmentation for NLP

Data augmentation aims to create additional training data by producing variations of existing training examples through transformations, which can mirror those encountered in the real world. In Computer Vision (CV), common augmentation techniques are mirroring, random cropping, shearing, etc. Data augmentation is super useful in CV. For instance, it has been used to great effect in AlexNet (Krizhevsky et al., 2012) [1] to combat overfitting and in most state-of-the-art models since. In addition, data augmentation makes intuitive sense as it makes the training data more diverse and should thus increase a model’s generalization ability.

However, in NLP, data augmentation is not widely used. In my mind, this is for two reasons:

  1. Data in NLP is discrete. This prevents us from applying simple transformations directly to the input data. Most recently proposed augmentation methods in CV focus on such transformations, e.g. domain randomization (Tobin et al., 2017) [2].
  2. Small perturbations may change the meaning. Deleting a negation may change a sentence’s sentiment, while modifying a word in a paragraph might inadvertently change the answer to a question about that paragraph. This is not the case in CV where perturbing individual pixels does not change whether an image is a cat or dog and even stark changes such as interpolation of different images can be useful (Zhang et al., 2017) [3].

Existing approaches that I am aware of are either rule-based (Li et al., 2017) [5] or task-specific, e.g. for parsing (Wang and Eisner, 2016) [6] or zero-pronoun resolution (Liu et al., 2017) [7]. Xie et al. (2017) [39] replace words with samples from different distributions for language modelling and Machine Translation. Recent work focuses on creating adversarial examples either by replacing words or characters (Samanta and Mehta, 2017; Ebrahimi et al., 2017) [8, 9], concatenation (Jia and Liang, 2017) [11], or adding adversarial perturbations (Yasunaga et al., 2017) [10]. An adversarial setup is also used by Li et al. (2017) [16] who train a system to produce sequences that are indistinguishable from human-generated dialogue utterances.

Back-translation (Sennrich et al., 2015; Sennrich et al., 2016) [12, 13] is a common data augmentation method in Machine Translation (MT) that allows us to incorporate monolingual training data. For instance, when training a EN\(\rightarrow\)FR system, monolingual French text is translated to English using an FR\(\rightarrow\)EN system; the synthetic parallel data can then be used for training. Back-translation can also be used for paraphrasing (Mallinson et al., 2017) [14]. Paraphrasing has been used for data augmentation for QA (Dong et al., 2017) [15], but I am not aware of its use for other tasks.

Another method that is close to paraphrasing is generating sentences from a continuous space using a variational autoencoder (Bowman et al., 2016; Guu et al., 2017) [17, 19]. If the representations are disentangled as in (Hu et al., 2017) [18], then we are also not too far from style transfer (Shen et al., 2017) [20].

There are a few research directions that would be interesting to pursue:

  1. Evaluation study: Evaluate a range of existing data augmentation methods as well as techniques that have not been widely used for augmentation such as paraphrasing and style transfer on a diverse range of tasks including text classification and sequence labelling. Identify what types of data augmentation are robust across task and which are task-specific. This could be packaged as a software library to make future benchmarking easier (think CleverHans for NLP).
  2. Data augmentation with style transfer: Investigate if style transfer can be used to modify various attributes of training examples for more robust learning.
  3. Learn the augmentation: Similar to Dong et al. (2017) we could learn either to paraphrase or to generate transformations for a particular task.
  4. Learn a word embedding space for data augmentation: A typical word embedding space clusters synonyms and antonyms together; using nearest neighbours in this space for replacement is thus infeasible. Inspired by recent work (Mrkšić et al., 2017) [21], we could specialize the word embedding space to make it more suitable for data augmentation.
  5. Adversarial data augmentation: Related to recent work in interpretability (Ribeiro et al., 2016) [22], we could change the most salient words in an example, i.e. those that a model depends on for a prediction. This still requires a semantics-preserving replacement method, however.

Few-shot learning for NLP

Zero-shot, one-shot and few-shot learning are one of the most interesting recent research directions IMO. Following the key insight from Vinyals et al. (2016) [4] that a few-shot learning model should be explicitly trained to perform few-shot learning, we have seen several recent advances (Ravi and Larochelle, 2017; Snell et al., 2017) [23, 24].

Learning from few labeled samples is one of the hardest problems IMO and one of the core capabilities that separates the current generation of ML models from more generally applicable systems. Zero-shot learning has only been investigated in the context of learning word embeddings for unknown words AFAIK. Dataless classification (Song and Roth, 2014; Song et al., 2016) [25, 26] is an interesting related direction that embeds labels and documents in a joint space, but requires interpretable labels with good descriptions.

Potential research directions are the following:

  1. Standardized benchmarks: Create standardized benchmarks for few-shot learning for NLP. Vinyals et al. (2016) introduce a one-shot language modelling task for the Penn Treebank. The task, while useful, is dwarfed by the extensive evaluation on CV benchmarks and has not seen much use AFAIK. A few-shot learning benchmark for NLP should contain a large number of classes and provide a standardized split for reproducibility. Good candidate tasks would be topic classification or fine-grained entity recognition.
  2. Evaluation study: After creating such a benchmark, the next step would be to evaluate how well existing few-shot learning models from CV perform for NLP.
  3. Novel methods for NLP: Given a dataset for benchmarking and an empirical evaluation study, we could then start developing novel methods that can perform few-shot learning for NLP.

Transfer learning for NLP

Transfer learning has had a large impact on computer vision (CV) and has greatly lowered the entry threshold for people wanting to apply CV algorithms to their own problems. CV practicioners are no longer required to perform extensive feature-engineering for every new task, but can simply fine-tune a model pretrained on a large dataset with a small number of examples.

In NLP, however, we have so far only been pretraining the first layer of our models via pretrained embeddings. Recent approaches (Peters et al., 2017, 2018) [31, 32] add pretrained language model embedddings, but these still require custom architectures for every task. In my opinion, in order to unlock the true potential of transfer learning for NLP, we need to pretrain the entire model and fine-tune it on the target task, akin to fine-tuning ImageNet models. Language modelling, for instance, is a great task for pretraining and could be to NLP what ImageNet classification is to CV (Howard and Ruder, 2018) [33].

Here are some potential research directions in this context:

  1. Identify useful pretraining tasks: The choice of the pretraining task is very important as even fine-tuning a model on a related task might only provide limited success (Mou et al., 2016) [38]. Other tasks such as those explored in recent work on learning general-purpose sentence embeddings (Conneau et al., 2017; Subramanian et al., 2018; Nie et al., 2017) [34, 35, 40] might be complementary to language model pretraining or suitable for other target tasks.
  2. Fine-tuning of complex architectures: Pretraining is most useful when a model can be applied to many target tasks. However, it is still unclear how to pretrain more complex architectures, such as those used for pairwise classification tasks (Augenstein et al., 2018) or reasoning tasks such as QA or reading comprehension.

Multi-task learning

Multi-task learning (MTL) has become more commonly used in NLP. See here for a general overview of multi-task learning and here for MTL objectives for NLP. However, there is still much we don’t understand about multi-task learning in general.

The main questions regarding MTL give rise to many interesting research directions:

  1. Identify effective auxiliary tasks: One of the main questions is which tasks are useful for multi-task learning. Label entropy has been shown to be a predictor of MTL success (Alonso and Plank, 2017) [28], but this does not tell the whole story. In recent work (Augenstein et al., 2018) [27], we have found that auxiliary tasks with more data and more fine-grained labels are more useful. It would be useful if future MTL papers would not only propose a new model or auxiliary task, but also try to understand why a certain auxiliary task might be better than another closely related one.
  2. Alternatives to hard parameter sharing: Hard parameter sharing is still the default modus operandi for MTL, but places a strong constraint on the model to compress knowledge pertaining to different tasks with the same parameters, which often makes learning difficult. We need better ways of doing MTL that are easy to use and work reliably across many tasks. Recently proposed methods such as cross-stitch units (Misra et al., 2017; Ruder et al., 2017) [29, 30] and a label embedding layer (Augenstein et al., 2018) are promising steps in this direction.
  3. Artificial auxiliary tasks: The best auxiliary tasks are those, which are tailored to the target task and do not require any additional data. I have outlined a list of potential artificial auxiliary tasks here. However, it is not clear which of these work reliably across a number of diverse tasks or what variations or task-specific modifications are useful.

Cross-lingual learning

Creating models that perform well across languages and that can transfer knowledge from resource-rich to resource-poor languages is one of the most important research directions IMO. There has been much progress in learning cross-lingual representations that project different languages into a shared embedding space. Refer to Ruder et al. (2017) [36] for a survey.

Cross-lingual representations are commonly evaluated either intrinsically on similarity benchmarks or extrinsically on downstream tasks, such as text classification. While recent methods have advanced the state-of-the-art for many of these settings, we do not have a good understanding of the tasks or languages for which these methods fail and how to mitigate these failures in a task-independent manner, e.g. by injecting task-specific constraints (Mrkšić et al., 2017).

Task-independent architecture improvements

Novel architectures that outperform the current state-of-the-art and are tailored to specific tasks are regularly introduced, superseding the previous architecture. I have outlined best practices for different NLP tasks before, but without comparing such architectures on different tasks, it is often hard to gain insights from specialized architectures and tell which components would also be useful in other settings.

A particularly promising recent model is the Transformer (Vaswani et al., 2017) [37]. While the complete model might not be appropriate for every task, components such as multi-head attention or position-based encoding could be building blocks that are generally useful for many NLP tasks.


I hope you’ve found this collection of research directions useful. If you have suggestions on how to tackle some of these problems or ideas for related research topics, feel free to comment below.


  1. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105).

  2. Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., & Abbeel, P. (2017). Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. arXiv Preprint arXiv:1703.06907. Retrieved from

  3. Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D. (2017). mixup: Beyond Empirical Risk Minimization, 1–11. Retrieved from

  4. Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., & Wierstra, D. (2016). Matching Networks for One Shot Learning. NIPS 2016. Retrieved from

  5. Li, Y., Cohn, T., & Baldwin, T. (2017). Robust Training under Linguistic Adversity. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (Vol. 2, pp. 21–27).

  6. Wang, D., & Eisner, J. (2016). The Galactic Dependencies Treebanks: Getting More Data by Synthesizing New Languages. Tacl, 4, 491–505. Retrieved from

  7. Liu, T., Cui, Y., Yin, Q., Zhang, W., Wang, S., & Hu, G. (2017). Generating and Exploiting Large-scale Pseudo Training Data for Zero Pronoun Resolution. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (pp. 102–111).

  8. Samanta, S., & Mehta, S. (2017). Towards Crafting Text Adversarial Samples. arXiv preprint arXiv:1707.02812.

  9. Ebrahimi, J., Rao, A., Lowd, D., & Dou, D. (2017). HotFlip: White-Box Adversarial Examples for NLP. Retrieved from

  10. Yasunaga, M., Kasai, J., & Radev, D. (2017). Robust Multilingual Part-of-Speech Tagging via Adversarial Training. In Proceedings of NAACL 2018. Retrieved from

  11. Jia, R., & Liang, P. (2017). Adversarial Examples for Evaluating Reading Comprehension Systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.

  12. Sennrich, R., Haddow, B., & Birch, A. (2015). Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709.

  13. Sennrich, R., Haddow, B., & Birch, A. (2016). Edinburgh neural machine translation systems for wmt 16. arXiv preprint arXiv:1606.02891.

  14. Mallinson, J., Sennrich, R., & Lapata, M. (2017). Paraphrasing revisited with neural machine translation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers (Vol. 1, pp. 881-893).

  15. Dong, L., Mallinson, J., Reddy, S., & Lapata, M. (2017). Learning to Paraphrase for Question Answering. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.

  16. Li, J., Monroe, W., Shi, T., Ritter, A., & Jurafsky, D. (2017). Adversarial Learning for Neural Dialogue Generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Retrieved from

  17. Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A. M., Jozefowicz, R., & Bengio, S. (2016). Generating Sentences from a Continuous Space. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning (CoNLL). Retrieved from

  18. Hu, Z., Yang, Z., Liang, X., Salakhutdinov, R., & Xing, E. P. (2017). Toward Controlled Generation of Text. In Proceedings of the 34th International Conference on Machine Learning.

  19. Guu, K., Hashimoto, T. B., Oren, Y., & Liang, P. (2017). Generating Sentences by Editing Prototypes.

  20. Shen, T., Lei, T., Barzilay, R., & Jaakkola, T. (2017). Style Transfer from Non-Parallel Text by Cross-Alignment. In Advances in Neural Information Processing Systems. Retrieved from

  21. Mrkšić, N., Vulić, I., Séaghdha, D. Ó., Leviant, I., Reichart, R., Gašić, M., … Young, S. (2017). Semantic Specialisation of Distributional Word Vector Spaces using Monolingual and Cross-Lingual Constraints. TACL. Retrieved from

  22. Ribeiro, M. T., Singh, S., & Guestrin, C. (2016, August). Why should i trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1135-1144). ACM.

  23. Ravi, S., & Larochelle, H. (2017). Optimization as a Model for Few-Shot Learning. In ICLR 2017.

  24. Snell, J., Swersky, K., & Zemel, R. S. (2017). Prototypical Networks for Few-shot Learning. In Advances in Neural Information Processing Systems.

  25. Song, Y., & Roth, D. (2014). On dataless hierarchical text classification. Proceedings of AAAI, 1579–1585. Retrieved from

  26. Song, Y., Upadhyay, S., Peng, H., & Roth, D. (2016). Cross-Lingual Dataless Classification for Many Languages. Ijcai, 2901–2907.

  27. Augenstein, I., Ruder, S., & Søgaard, A. (2018). Multi-task Learning of Pairwise Sequence Classification Tasks Over Disparate Label Spaces. In Proceedings of NAACL 2018.

  28. Alonso, H. M., & Plank, B. (2017). When is multitask learning effective? Multitask learning for semantic sequence prediction under varying data conditions. In EACL. Retrieved from

  29. Misra, I., Shrivastava, A., Gupta, A., & Hebert, M. (2016). Cross-stitch Networks for Multi-task Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

  30. Ruder, S., Bingel, J., Augenstein, I., & Søgaard, A. (2017). Sluice networks: Learning what to share between loosely related tasks. arXiv preprint arXiv:1705.08142.

  31. Peters, M. E., Ammar, W., Bhagavatula, C., & Power, R. (2017). Semi-supervised sequence tagging with bidirectional language models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017).

  32. Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. Proceedings of NAACL.

  33. Howard, J., & Ruder, S. (2018). Fine-tuned Language Models for Text Classification. arXiv preprint arXiv:1801.06146.

  34. Conneau, A., Kiela, D., Schwenk, H., Barrault, L., & Bordes, A. (2017). Supervised Learning of Universal Sentence Representations from Natural Language Inference Data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.

  35. Subramanian, S., Trischler, A., Bengio, Y., & Pal, C. J. (2018). Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning. In Proceedings of ICLR 2018.

  36. Ruder, S., Vulić, I., & Søgaard, A. (2017). A Survey of Cross-lingual Word Embedding Models. arXiv Preprint arXiv:1706.04902. Retrieved from

  37. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … Polosukhin, I. (2017). Attention Is All You Need. In Advances in Neural Information Processing Systems.

  38. Mou, L., Meng, Z., Yan, R., Li, G., Xu, Y., Zhang, L., & Jin, Z. (2016). How Transferable are Neural Networks in NLP Applications? Proceedings of 2016 Conference on Empirical Methods in Natural Language Processing.

  39. Xie, Z., Wang, S. I., Li, J., Levy, D., Nie, A., Jurafsky, D., & Ng, A. Y. (2017). Data Noising as Smoothing in Neural Network Language Models. In Proceedings of ICLR 2017.

  40. Nie, A., Bennett, E. D., & Goodman, N. D. (2017). DisSent: Sentence Representation Learning from Explicit Discourse Relations. arXiv Preprint arXiv:1710.04334. Retrieved from


Four members of our research team spent the past week at the Conference on Empirical Methods in Natural Language Processing (EMNLP 2017) in Copenhagen, Denmark. The conference handbook can be found here and the proceedings can be found here.

The program consisted of two days of workshops and tutorials and three days of main conference. Videos of the conference talks and presentations can be found here.The conference was superbly organized, had a great venue, and a social event with fireworks.


Figure 1: Fireworks at the social event

With 225 long papers, 107 papers, and 9 TACL papers accepted, there was a clear uptick of submissions compared to last year. The number of long and short paper submissions to EMNLP this year was even higher than those at ACL for the first time within the last 13 years, as can be seen in Figure 2.


Figure 2: Long and short paper submissions at ACL and EMNLP from 2004-2017

In the following, we will outline our highlights and list some research papers that caught our eye. We will first list overall themes and will then touch upon specific research topics that are in line with our areas of focus. Also, we’re proud to say that we had four papers accepted to the conference and workshops this year! If you want to see the AYLIEN team’s research, check out the research sections of our website and our blog. With that said, let’s jump in!


Exciting Datasets

Evaluating your approach on CoNLL-2003 or PTB is appropriate for comparing against previous state-of-the-art, but kind of boring. The two following papers introduce datasets that allow you to test your model in more exciting settings:

  • Durrett et al. release a new domain adaptation dataset. The dataset evaluates models on their ability to identify products being bought and sold in online cybercrime forums.  
  • Kutuzov et al. evaluate their word embedding model on a new dataset that focuses on predicting insurgent armed groups based on geographical locations.
  • While he did not introduce a new dataset, Nando de Freitas made the point during his keynote that the best environment for learning and evaluating language is simulation.


Figure 3: Nando de Freitas’ vision for AI research

Return of the Clusters

Brown clusters, an agglomerative, hierarchical clustering of word types based on contexts that was introduced in 1992 seem to come in vogue again. They were found to be particularly helpful for cross-lingual applications, while clusters were key features in several approaches:

  • Mayhew et al. found that Brown cluster features were an important signal for cross-lingual NER.
  • Botha et al. use word clusters as a key feature in their small, efficient feed-forward neural networks.
  • Mekala et al.’s new document representations cluster word embeddings, which give it an edge for text classification.
  • In his talk at the SCLeM workshop, Noah Smith cites the benefits of using Brown clusters as features for tasks such as POS tagging and sentiment analysis.


Figure 4: Noah Smith on the benefits of clustering in his invited talk at the SCLeM workshop

Distant Supervision

Distant supervision can be leveraged to collect large amounts of noisy training data, which can be useful in many applications. Some papers used novel forms of distant supervision to create new corpora or to train a model more effectively:

  • Lan et al. use urls in tweets to collect a large corpus of paraphrase data. Paraphrase data is usually hard to create, so this approach facilitates the process significantly and enables a continuously expanding collection of paraphrases.
  • Felbo et al. show that training on fine-grained emoji detection is more effective for pre-training sentiment and emotion models. Previous approaches primarily pre-trained on positive and negative emoticons or emotion hashtags.

Data Selection

The current generation of deep learning models is excellent at learning from data. However, we often do not pay much attention to the actual data our model is using. In many settings, we can improve upon the model by selecting the most relevant data:

  • Fang et al. reframe active learning as reinforcement learning and explicitly learn a data selection policy. Active learning is one of the best ways to create a model with as few annotations as possible; any improvement to this process is beneficial.
  • Van der Wees et al. introduce dynamic data selection for NMT, which varies the selected subset of the training data between different training epochs. This approach has the potential to reduce the training time of NMT models at comparable or better performance.
  • Ruder and Plank use Bayesian Optimization to learn data selection policies for transfer learning and investigate how well these transfer across models, domains, and tasks. This approach brings us a step closer towards gaining a better understanding of what constitutes similarity between different tasks and domains.

Character-level Models

Characters are nowadays used as standard features in most sequence models. The Subword and Character-level Models in NLP workshop discussed approaches in more detail, with invited talks on subword language models and character-level NMT.

  • Schmaltz et al. find that character-based sequence-to-sequence models outperform word-based models and models with character convolutions for sentence correction.
  • Ryan Cotterell gave a great, movie-inspired tutorial on combining the best of FSTs (cowboys) and sequence-to-sequence models (aliens) for string-to-string transduction. While evaluated on morphological segmentation, the tutorial raised awareness in an entertaining way that often the best of both worlds, i.e. a combination of traditional and neural approaches performs best.


Figure 5: Ryan Cotterell on combining FSTs and seq2seq models for string-to-string transduction

Word Embeddings

Research in word embeddings has matured and now mainly tries to 1) address deficits of word2vec, such as its ability of dealing with OOV words; 2) extend it to new settings, e.g. modelling the relations of words over time; and 3) understand the induced representations better:

  • Pinter et al. propose an approach for generating OOV word embeddings by training a character-based BiLSTM to generate embeddings that are close to pre-trained ones. This approach is promising as it provides us with a more sophisticated way to deal with out-of-vocabulary words than replacing them with an <UNK> token.
  • Herbelot and Baroni slightly modify word2vec to allow it to learn embeddings for OOV words from few data.
  • Rosin et al. propose a model for analyzing when two words relate to each other.
  • Kutuzov et al. propose another model that analyzes how two words relate to each other over time.
  • Hasan and Curry improve the performance of word embeddings on word similarity tasks by re-embedding them in a manifold.
  • Yang et al. introduce a simple approach to learning cross-domain word embeddings. Creating embeddings tuned on a small, in-domain corpus is still a challenge, so it is nice to see more approaches addressing this pain point.
  • Mimno and Thompson try to understand the geometry of word2vec better. They show that the learned word embeddings are positioned diametrically opposite of their context vectors in the embedding space.

Cross-lingual transfer

An increasing number of papers evaluate their methods on multiple languages. In addition, there was an excellent tutorial on cross-lingual word representations, which summarized and tried to unify much of the existing literature. Slides of the tutorial are available here.

  • Malaviya et al. train a many-to-one NMT to translate 1017 languages into English and use this model to predict information missing from typological databases.
  • Mayhew et al. introduce a cheap translation method for cross-lingual NER that only requires a bilingual dictionary. They even perform a case study on Uyghur, a truly low-resource language.
  • Kim et al. present a cross-lingual transfer learning model for POS tagging without parallel data. Parallel data is expensive to create and rarely available for low-resource languages, so this approach fills an important need.
  • Vulic et al. propose a new cross-lingual transfer method for inducing VerbNets for different languages. The method leverages vector space specialisation, an effective word embedding post-processing technique similar to retro-fitting.
  • Braud et al. propose a robust, cross-lingual discourse segmentation model that only relies on POS tags. They show that dependency information is less useful than expected; it is important to evaluate our models on multiple languages, so we do not overfit to features that are specific to analytic languages, such as English.


Figure 6: Anders Søgaard demonstrating the similarities between different cross-lingual embedding models at the cross-lingual representations tutorial


The Workshop on New Frontiers of Summarization brought researchers together to discuss key issues related to automatic summarization. Much of the research on summarization sought to develop new datasets and tasks:

  • Katja Filippova (Google Research, Switzerland) gave an interesting talk on sentence compression and passage summarization for Q&A. She described how they went from syntax-based methods to Deep Learning.
  • Volkse et al. created a new summarization corpus by looking for ‘TL;DR’ on Reddit. This is another example of a creative use of distant supervision, leveraging information that is already contained in the data in order to create a new corpus.
  • Falke and Gurevych won the best resource paper award for creating a new summary corpus that is based on concept maps rather than textual summaries. The concept map can be explored using a graph-based document exploration system, which is available as a demo here.
  • Pasunuru et al. use multi-task learning to improve abstractive summarization by leveraging entailment generation.
  • Isonuma et al. also use multi-task learning with document classification in conjunction with curriculum learning.
  • Li et al. propose a new task, reader-aware multi-document summarization, which uses comments of articles, along with a dataset for this task.
  • Naranyan et al. propose another new task, split and rephrase, which aims to split a complex sentence into a sequence of shorter sentences with the same meaning, and also release a new dataset.
  • Ghalandari revisits the traditional centroid-based method and proposes a new strong baseline for multi-document summarization.


Data and model-inherent bias is an issue that is receiving more attention in the community. Some papers investigate and propose methods to address the bias in certain datasets and evaluations:

  • Chaganty et al. investigate bias in the evaluation of knowledge base population models and propose an importance sampling-based evaluation to mitigate the bias.
  • Dan Jurasky gave a truly insightful keynote about his three year-long study analyzing the body camera recordings his team obtained from the Oakland police department for racial bias. Besides describing the first contemporary linguistic study of officer-community member interaction, he also provided entertaining insights on the language of food (cheaper restaurants use terms related to addiction, more expensive venues use language related to indulgence) and the challenges of interdisciplinary publishing.
  • Dubossarsky et al. analyze the bias in word representation models and propose that recently proposed laws of semantic change must be revised.
  • Zhao et al. won the best paper award for an approach using Lagrangian relaxation to inject constraints based on corpus-level label statistics. An important finding of their work is bias amplification: While some bias is inherent in all datasets, they observed that models trained on the data amplified its bias. While a gendered dataset might only contain women in 30% of examples, the situation at prediction time might thus be even more dire.


Figure 7: Zhao et al.’s proposed method for reducing bias amplification

Argument mining & debate analysis

Argument mining is closely related to summarization. In order to summarize argumentative texts, we have to understand claims and their justifications. This research area had the 4th Workshop on Argument Mining dedicated to it:

  • Hidey et al. analyse the semantic types of claims (e.g. agreement, interpretation) and premises (ethos, logos, pathos) in the Subreddit Change My View. This is another creative use of reddit to create a dataset and analyze linguistic patterns.
  • Wachsmut et al. presented an argument web search engine, which can be queried here.
  • Potash and Rumshinsky predict the winner of debates, based on audience favorability.
  • Swamy et al. also forecast winners for the Oscars, the US presidential primaries, and many other contests based on user predictions on Twitter. They create a dataset to test their approach.
  • Zhang et al. analyze the rhetorical role of questions in discourse.
  • Liu et al. show that argument-based features are also helpful for predicting review helpfulness.

Multi-agent communication

Multi-agent communication is a niche topic, which has nevertheless received some recent interest, notably in the representation learning community. Most papers deal with a scenario where two agents play a communicative referential game. The task is interesting, as the agents are required to cooperate and have been observed to develop a common pseudo-language in the process.

  • Andreas and Klein investigate the structure encoded by RNN representations for messages in a communication game. They find that the mistakes are similar to the ones made by humans. In addition, they find that negation is encoded as a linear relationship in the vector space.
  • Kottur et al. show in their best short paper that language does not emerge naturally when two agents are cooperating, but that they can be coerced to develop compositional expressions.


Figure 8: The multi-agent setup in the paper of Kottur et al.

Relation extraction

Extracting relations from documents is more compelling than simply extracting entities or concepts. Some papers improve upon existing approaches using better distant supervision or adversarial training:

  • Liu et al. reduce the noise in distantly supervised relation extraction with a soft-label method.
  • Zhang et al. publish TACRED, a large supervised dataset knowledge base population, as well as a new model.
  • Wu et al. improve the precision of relation extraction with adversarial training.

Document and sentence representations

Learning better sentence representations is closely related to learning more general word representations. While word embeddings still have to be contextualized, sentence representations are promising as they can be directly applied to many different tasks:

  • Mekala et al. propose a novel technique for building document vectors from word embeddings, with good results for text classification. They use a combination of adding and concatenating word embeddings to represent multiple topics of a document, based on word clusters.
  • Conneau et al. learn sentence representations from the SNLI dataset and evaluate them on 12 different tasks.

These were our highlights. Naturally, we weren’t able to attend every session and see every paper. What were your highlights from the conference or which papers from the proceedings did you like most? Let us know in the comments below.

Text Analysis API - Sign up


In Machine Learning, the traditional assumption is that the data our model is applied to is the same as the data we used for training. This assumption is proven false as soon as we move into the real world: many of the data sources we encounter will be very different than our original training data (same meaning here that it comes from the same distribution). In practice, this causes the performance of our model to deteriorate significantly.

Domain adaptation is a prominent approach to transfer learning that can help to bridge this discrepancy between the training and test data. Domain adaptation methods typically seek to identify features that are shared between the domains or learn representations that are general enough to be useful for both domains. In this blog post, I will discuss the motivation for, and the findings of the recent paper that I published with Barbara Planck. In it, we outline a complementary approach to domain adaptation – rather than learning a model that can adapt between the domains, we will learn to select data that is useful for training our model.

Preventing Negative Transfer

The main motivation behind selecting data for transfer learning is to prevent negative transfer. Negative transfer occurs if the information from our source training data is not only unhelpful but actually counter-productive for doing well on our target domain. The classic example for negative transfer comes from sentiment analysis: if we train a model to predict the sentiment of book reviews, we can expect the model to do well on domains that are similar to book reviews. Transferring a model trained on book reviews to reviews of electronics, however, results in negative transfer, as many of the terms our model learned to associate with a certain sentiment for books, e.g. “page-turner”, “gripping”, or — worse — “dangerous” and “electrifying”, will be meaningless or have different connotations for electronics reviews.

In the classic scenario of adapting from one source to one target domain, the only thing we can do about this is to create a model that is capable of disentangling these shifts in meaning. However, adapting between two very dissimilar domains still fails often or leads to painfully poor performance.

In the real world, we typically have access to multiple data sources. In this case, we can only train our model on the data that is most helpful for our target domain. It is unclear, however, what the best way to determine the helpfulness of source data with respect to a target domain is. Existing work generally relies on measures of similarity between the source and the target domain.

Bayesian Optimization for Data Selection

Our hypothesis is that the best way to select training data for transfer learning depends on the task and the target domain. In addition, while existing measures only consider data in relation to the target domain, we also argue that some training examples are inherently more helpful than others.

For these reasons, we propose to learn a data selection measure for transfer learning. We do this using Bayesian Optimization, a framework that has been used successfully to optimize hyperparameters in neural networks and which can be used to optimize any black-box function. We learn this function by defining several features relating to the similarity of the training data to the target domain as well as to its diversity. Over the course of several iterations, the data selection model then learns the importance of each of those features for the relevant task.

Evaluation & Conclusion

We evaluate our approach on three tasks, sentiment analysis, part-of-speech tagging, and dependency parsing and compare our approach to random selection as well as existing methods that select either the most similar source domain or the most similar training examples.

For sentiment analysis on reviews, training on the most similar domain is a strong baseline as review categories are clearly delimited. We significantly improve upon this baseline and demonstrate that diversity complements similarity. We even achieve performance competitive with a state-of-the-art domain adaptation approach, despite not performing any adaptation.

We observe smaller, but consistent improvements for part-of-speech tagging and dependency parsing. Lastly, we evaluate how well learned measures transfer across models, tasks, and domains. We find that learning a data selection measure can be learned with a simpler model, which is used as a proxy for a state-of-the-art model. Transfer across domains is robust, while transfer across tasks holds — as one would expect — for related tasks such as POS tagging and parsing, but fails for dissimilar tasks, e.g. parsing and sentiment analysis.

In the paper, we demonstrate the importance of selecting relevant data for transfer learning. We show that taking into account task and domain-specific characteristics and learning an appropriate data selection measure outperforms off-the-shelf metrics. We find that diversity complements similarity in selecting appropriate training data and that learned measures can be transferred robustly across models, domains, and tasks.

This work will be presented at the 2017 Conference on Empirical Methods in Natural Language Processing. More details can be found in the paper here.

Text Analysis API - Sign up


Our researchers at AYLIEN keep abreast of and contribute to the latest developments in the field of Machine Learning. Recently, two of our research scientists, John Glover and Sebastian Ruder, attended NIPS 2016 in Barcelona, Spain. In this post, Sebastian highlights some of the stand-out papers and trends from the conference.


The Conference on Neural Information Processing Systems (NIPS) is one of the two top conferences in machine learning. It took place for the first time in 1987 and is held every December, historically in close proximity to a ski resort. This year, it took place in sunny Barcelona. The conference (including tutorials and workshops) went on from Monday, December 5 to Saturday, December 10. The full conference program is available here.

Machine Learning seems to become more pervasive every month. However, it is still sometimes hard to keep track of the actual extent of this development. One of the most accurate barometers for this evolution is the growth of NIPS itself. The number of attendees skyrocketed at this year’s conference growing by over 50% year-over-year.


Image 1: The growth of the number of attendees at NIPS follows (the newly coined) Terry’s Law (named after Terrence Sejnowski, the president of the NIPS foundation; faster growth than Moore’s Law)

Unsurprisingly, Deep Learning (DL) was by far the most popular research topic, with about every fourth of more than 2,500 submitted papers (and 568 accepted papers) dealing with deep neural networks.


Image 2: Distribution of topics across all submitted papers (Source: The review process for NIPS 2016)

On the other hand, the distribution of research paper topics has quite a long tail and reflects the diversity of topics at the conference that span everything from theory to applications, from robotics to neuroscience, and from healthcare to self-driving cars.

Generative Adversarial Networks

One of the hottest developments within Deep Learning was Generative Adversarial Networks (GANs). The minimax game playing networks have by now won the favor of many luminaries in the field. Yann LeCun hails them as the most exciting development in ML in recent years. The organizers and attendees of NIPS seem to side with him: NIPS featured a tutorial by Ian Goodfellow about his brainchild, which led to a packed main conference hall.


Image 3: A full conference hall at the GAN tutorial

Though a fairly recent development, there are many cool extensions of GANs among the conference papers:

  • Reed et al. propose a model that allows you to specify not only what you want to draw (e.g. a bird) but also where to put it in an image.
  • Chen et al. disentangle factors of variation in GANs by representing them with latent codes. The resulting models allow you to adjust e.g. the type of a digit, its breadth and width, etc.

In spite of their popularity, we know alarmingly little about what makes GANs so capable of generating realistic-looking images. In addition, making them work in practice is an arduous endeavour and a lot of (undocumented) hacks are necessary to achieve the best performance. Soumith Chintala presents a collection of these hacks in his “How to train your GAN” talk at the Adversarial Training workshop.


Image 4: How to train your GAN (Source: Soumith Chintala)

Yann LeCun muses in his keynote that the development of GANs parallels the history of neural networks themselves: They were poorly understood and hard to get to work in the beginning and only took off once researchers figured out the right tricks and learned how to make them work. At this point, it seems unlikely that GANs will experience a winter anytime soon; the research community is still at the beginning in learning how to make the best use of them and it will be exciting to see what progress we can make in the coming years.

On the other hand, the success of GANs so far has been limited mostly to Computer Vision due to their difficulty in modelling discrete rather than continuous data. The Adversarial Training workshop showcased some promising work in this direction (see e.g. our own John Glover’s paper on modeling documents, this paper and this paper on generating text, and this paper on adversarial evaluation of dialogue models). It remains to be seen if 2017 will be the year in which GANs break through in NLP.

The Nuts and Bolts of Machine Learning

Andrew Ng gave one of the best tutorials of the conference with his take on building AI applications using Deep Learning. Drawing from his experience of managing the 1,300 people AI team at Baidu and hundreds of applied AI projects and equipped solely with two whiteboards, he shared many insights about how to build and deploy AI applications in production.

Besides better hardware, Ng attributes the success of Deep Learning to two factors: In contrast to traditional methods, deep NNs are able to learn more effectively from large amounts of data. Secondly, end-to-end (supervised) Deep Learning allows us to learn to map from inputs directly to outputs.

While this approach to training chatbots or self-driving cars is sufficient to write innovative research papers, Ng emphasized end-to-end DL is often not production-ready: A chatbot that maps from text directly to a response is not able to have a coherent conversation or fulfill a request, while mapping from an image directly to a steering command might have literally fatal side effects if the model has not encountered the corresponding part of the input space before. Rather, for a production model, we still want to have intermediate steps: For a chatbot, we prefer to have an inference engine that generates a response, while in a self-driving car, DL is used to identify obstacles, while the steering is performed by a traditional planning algorithm.


Image 5: Andrew Ng on end-to-end DL (right: end-to-end DL chatbot and chatbot with inference engine; left bottom: end-to-end DL self-driving car and self-driving car with intermediate steps)

Ng also shared that the most common mistakes he sees in project teams is that they track the wrong metrics: In an applied machine learning project, the only relevant metrics are the training error, the development error, and the test error. These metrics alone enable the project team to know what steps to take, as he demonstrated in the diagram below:


Image 6: Andrew Ng’s flowchart for applied ML projects

A key facilitator of the recent success of ML have been the advances in hardware that allowed faster computation and storage. Given that Moore’s Law will reach its limits sooner or later, one might reason that also the rise of ML might plateau. Ng, however, argued that the commitment by leading hardware manufacturers such as NVIDIA and Intel and the ensuing performance improvements to ML hardware would fuel further growth.

Among ML research areas, supervised learning is the undisputed driver of the recent success of ML and will likely continue to drive it for the foreseeable future. In second place, Ng saw neither unsupervised learning nor reinforcement learning, but transfer learning. We at AYLIEN are bullish on transfer learning for NLP and think that it has massive potential.

Recurrent Neural Networks

The conference also featured a symposium dedicated to Recurrent Neural Networks (RNNs). The symposium coincided with the 20 year anniversary of LSTM…


Image 7: Jürgen Schmidhuber kicking off the RNN symposium

… being rejected from NIPS 1996. The fact that papers that do not use LSTMs have been rare in the most recent NLP conferences (see our EMNLP blog post) is a testament to the perseverance of the authors of the original paper, Sepp Hochreiter and Jürgen Schmidhuber.

At NIPS, we had several papers that sought to improve RNNs in different ways:

Other improvements apply to Deep Learning in general:

  • Salimans and Kingma propose Weight Normalisation to accelerate training that can be applied in two lines of Python code.
  • Li et al. propose a multinomial variant of dropout that sets neurons to zero depending on the data distribution.

The Neural Abstract Machines & Program Induction (NAMPI) workshop also featured several speakers talking about RNNs:

  • Alex Graves focused on his recent work on Adaptive Computation Time (ACT) for RNNs that allows to decouple the processing time from the sequence length. He showed that a word-level language model with ACT could reach state-of-the-art with fewer computations.
  • Edward Grefenstette outlined several limitations and potential future research directions in the context of RNNs in his talk.

Improving classic algorithms

While Deep Learning is a fairly recent development, the conference featured also several improvements to algorithms that have been around for decades:

  • Ge et al. show in their best paper that the non-convex objective for matrix completion has no spurious local minima, i.e. every local minimum is a global minimum.
  • Bachem et al. present a method that guarantees accurate and fast seedings for large-scale k-means++ clustering. The presentation was one of the most polished ones of the conference and the code is open-source and can be installed via pip.
  • Ashtiani et al. show that we can make NP-hard k-means clustering problems solvable by allowing the model to pose queries for a few examples to a domain expert.

Reinforcement Learning

Reinforcement Learning (RL) was another much-discussed topic at NIPS with an excellent tutorial by Pieter Abbeel and John Schulman dedicated to RL. John Schulman also gave some practical advice for getting started with RL.

One of the best papers of the conference introduces Value Iteration Networks, which learn to plan by providing a differentiable approximation to a classic planning algorithm via a CNN. This paper was another cool example of one of the major benefits of deep neural networks: They allow us to learn increasingly complex behaviour as long as we can represent it in a differentiable way.

During the week of the conference, several research environments for RL were simultaneously released, among them OpenAI’s Universe, Deep Mind Lab, and FAIR’s Torchcraft. These will likely be a key driver in future RL research and should open up new research opportunities.

Learning-to-learn / Meta-learning

Another topic that came up in several discussions over the course of the conference was Learning-to-learn or Meta-learning:

  • Andrychowicz et al. learn an optimizer in a paper with the ingenious title “Learning to learn by gradient descent by gradient descent”.
  • Vinyals et al. learn how to one shot-learn in a paper that frames one-shot learning in the sequence-to-sequence framework and has inspired new approaches for one-shot learning.

Most of the existing papers on meta-learning demonstrate that wherever you are doing something that gives you gradients, you can optimize them using another algorithm via gradient descent. Prepare for a surge of “Meta-learning for X” and “(Meta-)+learning” papers in 2017. It’s LSTMs all the way down!

Meta-learning was also one of the key talking points at the RNN symposium. Jürgen Schmidhuber argued that a true meta-learner would be able to learn in the space of all programs and would have the ability to modify itself and elaborated on these ideas at his talk at the NAMPI workshop. Ilya Sutskever remarked that we currently have no good meta-learning models. However, there is hope as the plethora of new research environments should also bring progress in this area.

General Artificial Intelligence

Learning how to learn also plays a role in the pursuit of the elusive goal of attaining General Artificial Intelligence, which was a topic in several keynotes. Yann LeCun argued that in order to achieve General AI, machines need to learn common sense. While common sense is often vaguely mentioned in research papers, Yann LeCun gave a succinct explanation of what common sense is: “Predicting any part of the past, present or future percepts from whatever information is available.” He called this predictive learning, but notes that this is really unsupervised learning.

His talk also marked the appearance of a controversial and often tongue-in-cheek copied image of a cake, which he used to demonstrate that unsupervised learning is the most challenging task where we should concentrate our efforts, while RL is only the cherry on the icing of the cake.


Image 8: The Cake slide of Yann LeCun’s keynote

Drew Purves focused on the bilateral relationship between the environment and AI in what was probably the most aesthetically pleasing keynote of the conference (just look at those graphics!)


Image 9: Graphics by Max Cant of Drew Purves’ keynote (Source: Drew Purves)

He emphasized that while simulations of ecological tasks in naturalistic environments could be an important test bed for General AI, General AI is needed to maintain the biosphere in a state that will allow the continued existence of our civilization.


Image 10: Nature needs AI and AI needs Nature from Drew Purves’ keynote

While it is frequently — and incorrectly — claimed that neural networks work so well because they emulate the brain’s behaviour, Saket Navlakha argued during his keynote that we can still learn a great deal from the engineering principles of the brain. For instance, rather than pre-allocating a large number of neurons, the brain generates 1000s of synapses per minutes until its second year. Afterwards, until adolescence, the number of synapses is pruned and decreases by ~50%.


Image 11: Saket Navlakha’s keynote

It will be interesting to see how neuroscience can help us to advance our field further.

In the context of the Machine Intelligence workshop, another environment was introduced in the form of FAIR’s CommAI-env that allows to train agents through interaction with a teacher. During the panel discussion, the ability to learn hierarchical representations and to identify patterns was emphasized. However, although the field is making rapid progress on standard tasks such as object recognition, it is unclear if the focus on such specific tasks brings us indeed closer to General AI.

Natural Language Processing

While NLP is more of a niche topic at NIPS, there were a few papers with improvements relevant to NLP:

  • He et al. propose a dual learning framework for MT that has two agents translating in opposite directions teaching each other via reinforcement learning.
  • Sokolov et al. explore how to use structured prediction under bandit feedback.
  • Huang et al. extend Word Mover’s Distance, an unsupervised document similarity metric to the supervised setting.
  • Lee et al. model the helpfulness of reviews by taking into account position and presentation biases.

Finally, a workshop on learning methods for dialogue explored how end-to-end systems, linguistics and ML methods can be used to create dialogue agents.



Jürgen Schmidhuber, the father of the LSTM was not only present on several panels, but did his best to remind everyone that whatever your idea, he had had a similar idea two decades ago and you should better cite him lest he interrupt your tutorial.



Boston Robotics’ Spot proved that — even though everyone is excited by learning and learning-to-learn — traditional planning algorithms are enough to win the admiration of a hall full of learning enthusiasts.


Image 12: Boston Robotics’ Spot amid a crowd of fascinated onlookers


Apple, one of the most secretive companies in the world, has decided to be more open, to publish, and to engage with academia. This can only be good for the community. We’re looking forward to more apple research papers.


Image 13: Ruslan Salakhutdinov at the Apple lunch event


Uber announced their acquisition of Cambridge-based AI startup Geometric Intelligence and threw one of the most popular parties of NIPS.


Image 14: The Geometric Intelligence logo

Rocket AI

Talking about startups, the “launch” of Rocket AI and their patented Temporally Recurrent Optimal Learning had some people fooled (note the acronyms in the below tweets). Riva-Melissa Tez finally cleared up the confusion.


These were our impressions from NIPS 2016. We had a blast and hope to be back in 2017!


Text Analysis API - Sign up


Here at AYLIEN we have a team of researchers who like to keep abreast of, and regularly contribute to, the latest developments in the field of Natural Language Processing. Recently, one of our research scientists, Sebastian Ruder, attended EMNLP 2016 in Austin, Texas. In this post, Sebastian has highlighted some of the stand-out papers and trends from the conference.


Image: Jackie Cheung

I spent the past week in Austin, Texas at EMNLP 2016, the Conference on Empirical Methods in Natural Language Processing.

There were a lot of papers at the conference (179 long papers, 87 short papers, and 9 TACL papers all in all) — too many to read each single one. The entire program can be found here. In the following, I will highlight some trends and papers that caught my eye:

Reinforcement learning

One thing that stood out was that RL seems to be slowly finding its footing in NLP, with more and more people using it to solve complex problems:


Dialogue was a focus of the conference with all of the three keynote speakers dealing with different aspects of dialogue: Christopher Potts talked about pragmatics and how to reason about the intentions of the conversation partner; Stefanie Tellex concentrated on how to use dialogue for human-robot collaboration; finally, Andreas Stolcke focused on the problem of addressee detection in his talk.

Among the papers, a few that dealt with dialogue stood out:

  • Andreas and Klein model pragmatics in dialogue with neural speakers and listeners;
  • Liu et al. show how not to evaluate your dialogue system;
  • Ouchi and Tsuboi select addressees and responses in multi-party conversations;
  • Wen et al. study diverse architectures for dialogue modelling.


Seq2seq models were again front and center. It is not common for a method to have its own session two years after its introduction (Sutskever et al., 2014). While in the past years, many papers employed seq2seq e.g. for Neural Machine Translation, some papers this year focused on improving the seq2seq framework:

Semantic parsing

Since seq2seq’s use for dialogue modelling was popularised by Vinyals and Le, it is harder to get it to work with goal-oriented tasks that require an intermediate representation on which to act. Semantic parsing is used to convert a message into a more meaningful representation that can be used by another component of the system. As this technique is useful for sophisticated dialogue systems, it is great to see progress in this area:

X-to-text (or natural language generation)

While mapping from text-to-text with the seq2seq paradigm is still prevalent, EMNLP featured some cool papers on natural language generation from other inputs:


Parsing and syntax are a mainstay of every NLP conference and the community seems to particularly appreciate innovative models that push the state-of-the-art in parsing: The ACL ’16 outstanding paper by Andor et al. introduced a globally normalized model for parsing, while the best EMNLP ‘16 paper by Lee et al. combines a global parsing model with a local search over subtrees.

Word embeddings

There were still papers on word embeddings, but it felt less overwhelming than at the past EMNLP or ACL, with most methods trying to fix a particular flaw rather than training embeddings for embeddings’ sake. Pilevhar and Collier de-conflate senses in word embeddings, while Wieting et al. achieve state-of-the-art results for character-based embeddings.

Sentiment analysis

Sentiment analysis has been popular in recent years (as attested by the introductions of many recent papers on sentiment analysis). Sadly, many of the conference papers on sentiment analysis reduce to leveraging the latest deep neural network for the task to beat the previous state-of-the-art without providing additional insights. There are, however, some that break the mold: Teng et al. find an effective way to incorporate sentiment lexicons into a neural network, while Hu et al. incorporate structured knowledge into their sentiment analysis model.

Deep Learning

By now, it is clear to everyone: Deep Learning is here to stay. In fact, deep learning and neural networks claimed the two top spots of keywords that were used to describe the submitted papers. The majority of papers used at least an LSTM; using no neural network seems almost contrarian now and is something that needs to be justified. However, there are still many things that need to be improved — which leads us to…

Uphill Battles

While making incremental progress is important to secure grants and publish papers, we should not lose track of the long-term goals. In this spirit, one of the best workshops that I’ve attended was the Uphill Battles in Language Processing workshop, which featured 12 talks and not one, but four all-star panels on text understanding, natural language generation, dialogue and speech, and grounded language. Summaries of the panel discussions should be available soon at the workshop website.

This was my brief review of some of the trends of EMNLP 2016. I hope it was helpful.


News API - Sign up


Unsupervisedly learned word embeddings have seen tremendous success in numerous NLP tasks in recent years. So much so that in many NLP architectures, they are close to fully replacing more traditional distributional representations such as LSA features and Brown clusters.

You just have to look at last year’s EMNLP and ACL conferences, both of which had a very strong focus on word embeddings, and a recent post in Communications of the ACM in which word embeddings is hailed as the catalyst for NLP’s breakout. But are they worthy of the hype?

This post is a synopsis of two blogs written by AYLIEN Research Scientist, Sebastian Ruder. You can view Sebastian’s original posts, and more, on Machine Learning, NLP and Deep Learning on his blog.

In this overview we aim to give an in-depth understanding of word embeddings and their effectiveness. We’ll touch on where they originated, we’ll compare popular word embedding models and the challenges associated with them and we’ll and try to answer/debunk some common questions and misconceptions.

We will then look to disillusion word embeddings by relating them to literature in distributional semantics and highlighting the factors that actually account for the success of word embedding models.

A brief history of word embeddings

Vector space models have been used in distributional semantics since the 1990s. Since then, we have seen the development of a number models used for estimating continuous representations of words, Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA) being two such examples.

The term word embeddings was originally coined by Bengio et al. in 2003 who trained them in a neural language model together with the model’s parameters. However, Collobert and Weston were arguably the first to demonstrate the power of pre-trained word embeddings in their 2008 paper A unified architecture for natural language processing, in which they establish word embeddings as a highly effective tool when used in downstream tasks, while also announcing a neural network architecture that many of today’s approaches were built upon. It was Mikolov et al. (2013), however, who really brought word embedding to the fore through the creation of word2vec, a toolkit enabling the training and use of pre-trained embeddings. A year later, Pennington et al. introduced us to GloVe, a competitive set of pre-trained embeddings, suggesting that word embeddings was suddenly among the mainstream.

Word embeddings are considered to be among a small number of successful applications of unsupervised learning at present. The fact that they do not require pricey annotation is probably their main benefit. Rather, they can be derived from already available unannotated corpora.

Word embedding models

Naturally, every feed-forward neural network that takes words from a vocabulary as input and embeds them as vectors into a lower dimensional space, which it then fine-tunes through back-propagation, necessarily yields word embeddings as the weights of the first layer, which is usually referred to as Embedding Layer.

nn_language_model-1Figure 1: A neural language model (Bengio et al., 2006)

The key difference between a network like this and a method like word2vec is its computational complexity, which explains why it wasn’t until 2013 that word embeddings became so prominent in the NLP space. The recent and rapid expansion and affordability in computational power has certainly aided its emergence.

The training objectives for GloVe and word2vec are another difference, with both geared towards producing word embeddings that encode general semantic relationships and can provide benefit in many downstream tasks. Regular neural networks, in comparison, generally produce task-specific embeddings with limitations in relation to their use elsewhere.

In comparing models, we will assume the following notational standards: We assume a training corpus containing a sequence of \(T\) training words \(w_1, w_2, w_3, \cdots, w_T\) that belong to a vocabulary \(V\) whose size is \(|V|\). Our models generally consider a context of \( n \) words. We associate every word with an input embedding \( v_w \) (the eponymous word embedding in the Embedding Layer) with \(d\) dimensions and an output embedding \( v’_w \) (another word representation whose role will soon become clearer). We finally optimize an objective function \(J_\theta\) with regard to our model parameters \(\theta\) and our model outputs some score \(f_\theta(x)\) for every input \( x \).

Classic neural language model

The classic neural language model proposed by Bengio et al. [1] in 2003 consists of a one-hidden layer feed-forward neural network that predicts the next word in a sequence as in Figure 2.

bengio_language_modelFigure 2: Classic neural language model (Bengio et al., 2003)

Their model maximizes what we’ve described above as the prototypical neural language model objective (For simplicity, the regularization term has been omitted):

\(J_\theta = \frac{1}{T}\sum\limits_{t=1}^T\ \text{log} \space f(w_t , w_{t-1} , \cdots , w_{t-n+1})\).

\( f(w_t , w_{t-1} , \cdots , w_{t-n+1}) \) is the output of the model, i.e. the probability \( p(w_t \: | \: w_{t-1} , \cdots , w_{t-n+1}) \) as computed by the softmax, where \(n \) is the number of previous words fed into the model.

Bengio et al. were among the first to introduce what has become to be known as a word embedding, a real-valued word feature vector in \(\mathbb{R}\). The foundations of their model can still be found in today’s neural language and word embedding models. They are:

1. Embedding Layer: This layer generates word embeddings by multiplying an index vector with a word embedding matrix;

2. Intermediate Layer(s): One or more layers that produce an intermediate representation of the input, e.g. a fully-connected layer that applies a non-linearity to the concatenation of word embeddings of \(n\) previous words;

3. Softmax Layer: The final layer that produces a probability distribution over words in \(V\).

Bengio et al. also determine two issues with current state-of-the-art-models:

– The first is that Layer 2. is replaceable with an LSTM, which is used by state-of-the-art neural language models [6], [7].

– They also identify the final softmax layer (more precisely: the normalization term) as the network’s main bottleneck, as the cost of computing the softmax is proportional to the number of words in \(V\), which is typically on the order of hundreds of thousands or millions.

Discovering methods that alleviate the computational cost related to computing the softmax over a large vocabulary [9] is therefore one of the main challenges in both neural language and word embedding models.

C&W model

After Bengio et al.’s initial efforts in neural language models, research in word embeddings stalled as computational power and algorithms were not yet at a level that enabled the training of a large vocabulary.

In 2008, Collobert and Weston [4] (thus C&W) demonstrated that word embeddings trained on an adequately large dataset carry syntactic and semantic meaning and improve performance on downstream tasks. In their 2011 paper, they further expand on this [8].

In order to avoid computing the expensive softmax, their solution is to employ an alternative objective function: rather than the cross-entropy criterion of Bengio et al., which maximizes the probability of the next word given the previous words, Collobert and Weston train a network to output a higher score \(f_\theta\) for a correct word sequence (a probable word sequence in Bengio’s model) than for an incorrect one. For this purpose, they use a pairwise ranking criterion, which looks like this:

\(J_\theta\ = \sum\limits_{x \in X} \sum\limits_{w \in V} \text{max} \lbrace 0, 1 – f_\theta(x) + f_\theta(x^{(w)}) \rbrace \).

They sample correct windows \(x\) containing \(n\) words from the set of all possible windows \(X\) in their corpus. For each window \(x\), they then produce a corrupted, incorrect version \(x^{(w)}\) by replacing \(x\)’s centre word with another word \(w\) from \(V\). Their objective now maximises the distance between the scores output by the model for the correct and the incorrect window with a margin of \(1\). Their model architecture, depicted in Figure 3 without the ranking objective, is analogous to Bengio et al.’s model.

nlp_almost_from_scratch_window_approachFigure 3: The C&W model without ranking objective (Collobert et al., 2011)

The resulting language model produces embeddings that already possess many of the relations word embeddings have become known for, e.g. countries are clustered close together and syntactically similar words occupy similar locations in the vector space. While their ranking objective eliminates the complexity of the softmax, they keep the intermediate fully-connected hidden layer (2.) of Bengio et al. around (the HardTanh layer in Figure 3), which constitutes another source of expensive computation. Partially due to this, their full model trains for seven weeks in total with \(|V| = 130000\).


Word2Vec is arguably the most popular of the word embedding models. Because word embeddings are a key element of deep learning models for NLP, it is generally assumed to belong to the same group. However, word2vec is not technically not be considered a component of deep learning, with the reasoning being that its architecture is neither deep nor uses non-linearities (in contrast to Bengio’s model and the C&W model).

Mikolov et al. [2] recommend two architectures for learning word embeddings that, when compared with previous models, are computationally less expensive.

Here are two key benefits that these architectures have over Bengio’s and the C&W model;

– They forgo the costly hidden layer.

– They allow the language model to take additional context into account.

The success their model can not only be attributed to these differences, it importantly also comes from specific training strategies, both of which we will now look at;

Continuous bag-of-words (CBOW)

Unlike a language model that can only base its predictions on past words, as it is assessed based on its ability to predict each next word in the corpus, a model that only aims to produce accurate word embeddings is not subject to such restriction. Mikolov et al. therefore use both the \(n\) words before and after the target word \( w_t \) to predict it as shown in Figure 4. This is known as a continuous bag of words (CBOW), owing to the fact that it uses continuous representations whose order is of no importance.

cbowFigure 4: Continuous bag-of-words (Mikolov et al., 2013)

The purpose of CBOW is only marginally different than that of the the language model one:

\(J_\theta = \frac{1}{T}\sum\limits_{t=1}^T\ \text{log} \space p(w_t \: | \: w_{t-n} , \cdots , w_{t-1}, w_{t+1}, \cdots , w_{t+n})\).

Rather than feeding \( n \) previous words into the model, the model receives a window of \( n \) words around the target word \( w_t \) at each time step \( t \).


While CBOW can be seen as a precognitive language model, skip-gram turns the language model objective on its head: rather than using the surrounding words to predict the centre word as with CBOW, skip-gram uses the centre word to predict the surrounding words as can be seen in Figure 5.

skip-gramFigure 5: Skip-gram (Mikolov et al., 2013)

The skip-gram objective thus sums the log probabilities of the surrounding \( n \) words to the left and to the right of the target word \( w_t \) to produce the following objective:

\(J_\theta = \frac{1}{T}\sum\limits_{t=1}^T\ \sum\limits_{-n \leq j \leq n , \neq 0} \text{log} \space p(w_{t+j} \: | \: w_t)\).


In contrast to word2vec, GloVe [5] seeks to make explicit what word2vec does implicitly: Encoding meaning as vector offsets in an embedding space — seemingly only a serendipitous by-product of word2vec — is the specified goal of GloVe.

merge_from_ofoct--2-Figure 6: Vector relations captured by GloVe (Stanford)

To be specific, the creators of GloVe illustrate that the ratio of the co-occurrence probabilities of two words (rather than their co-occurrence probabilities themselves) is what contains information and so look to encode this information as vector differences.

For this to be accomplished, they propose a weighted least squares objective \(J\) that directly aims to reduce the difference between the dot product of the vectors of two words and the logarithm of their number of co-occurrences:

\(J = \sum\limits_{i, j=1}^V f(X_{ij}) \: (w_i^T \tilde{w}_j + b_i + \tilde{b}_j – \text{log} \: X_{ij})^2 \)

where \(w_i\) and \(b_i\) are the word vector and bias respectively of word \(i\), \(\tilde{w}_j\) and \(b_j\) are the context word vector and bias respectively of word \(j\), \(X_{ij}\) is the number of times word \(i\) occurs in the context of word \(j\), and \(f\) is a weighting function that assigns relatively lower weight to rare and frequent co-occurrences.

As co-occurrence counts can be directly encoded in a word-context co-occurrence matrix, GloVe takes such a matrix rather than the entire corpus as input.

Word embeddings vs. distributional semantics models

Word embedding models such as word2vec and GloVe gained such popularity as they appeared to regularly and substantially outperform traditional Distributional Semantic Models (DSMs). Many attributed this to the neural architecture of word2vec, or the fact that it predicts words, which seemed to have a natural edge over solely relying on co-occurrence counts.

DSMs can be seen as count models as they “count” co-occurrences among words by operating on co-occurrence matrices. Neural word embedding models, in contrast, can be viewed as predict models, as they try to predict surrounding words.

In 2014, Baroni et al. [11] demonstrated that, in nearly all tasks, predict models consistently outperform count models, and therefore provided us with a comprehensive verification for the supposed superiority of word embedding models. Is this the end? No.

With GloVe, we have already seen that the differences are not as obvious: While GloVe is considered a predict model by Levy et al. (2015) [10], it is clearly factorizing a word-context co-occurrence matrix, which brings it close to traditional methods such as PCA and LSA. Even more, Levy et al. [12] demonstrate that word2vec implicitly factorizes a word-context PMI matrix.

While on the surface DSMs and word embedding models use varying algorithms to learn word representations – the former count, the latter predict – both types of model fundamentally act on the same underlying statistics of the data, i.e. the co-occurrence counts between words.

And so the question that we will focus on for the remainder of this post still remains:

Why do word embedding models still outperform DSM with very similar information?

Comparison models

To establish the elements that attribute to the success of neural word embedding models, and illustrate how they can be transferred to traditional processes, we will compare the following models;

Positive Pointwise Mutual Information (PPMI)

PMI is a typical measure for the strength of association between two words.It is defined as the log ratio between the joint probability of two words \(w\) and \(c\) and the product of their marginal probabilities: \(PMI(w,c) = \text{log} \: \frac{P(w,c)}{P(w)\:P(c)} \). As \( PMI(w,c) = \text{log} \: 0 = – \infty \) for pairs \( (w,c) \) that were never observed, PMI is in practice often replaced with positive PMI (PPMI), which replaces negative values with \(0\), yielding \(PPMI(w,c) = \text{max}(PMI(w,c),0)\).

Singular Value Decomposition (SVD)

SVD is among the more popular methods for dimensionality reduction and came about in NLP originally via latent semantic analysis (LSA). SVD factorizes the word-context co-occurrence matrix into the product of three matrices \(U \cdot \Sigma \times V^T \) where \(U\) and \(V\) are orthonormal matrices (i.e. square matrices whose rows and columns are orthogonal unit vectors) and \(\Sigma\) is a diagonal matrix of eigenvalues in

decreasing order. In practice, SVD is often used to factorize the matrix produced by PPMI. Generally, only the top \(d\) elements of \(\Sigma\) are kept, yielding \(W^{SVD} = U_d \cdot \Sigma_d\) and \(C^{SVD} = V_d\), which are commonly used as the word and context representations respectively.

Skip-gram with Negative Sampling (SGNS)

Aka word2vec, as shown above.

Global Vectors (GloVe)

As shown earlier in this post.


We will focus on the following hyper-parameters:


Word2vec suggests three methods of pre-processing a corpus, each of which can be applied to DSMs with ease.

Dynamic context window

Normally in DSMs, the context window is unweighted and of a unchanging size. Both SGNS and GloVe, however, use a scheme that assigns more weight to closer words, as closer words are generally considered to be more important to a word’s meaning. Additionally, in SGNS, the window size is not fixed, but the actual window size is dynamic and sampled uniformly between \(1\) and the maximum window size during training.

Subsampling frequent words

SGNS dilutes very frequent words by randomly removing words whose frequency \(f\) is higher than some threshold \(t\) with a probability \(p = 1 – \sqrt{\frac{t}{f}}\). As this subsampling is done before actually creating the windows, the context windows used by SGNS in practice are larger than indicated by the context window size.

Deleting rare words

During the pre-processing of SGNS, rare words are also deleted before creating the context windows, which increases the actual size of the context windows further. The actual performance impact of this is insignificant, however, according to Levy et al. (2015)

Association metric

In relation to measuring the association between two words, PMI is seen as useful metric. Since Levy and Goldberg (2014) have shown SGNS to implicitly factorize a PMI matrix, two variations stemming from this formulation can be introduced to regular PMI.

Shifted PMI

In SGNS, the greater the volume of negative samples \(k\), the more data is being used and so the estimation of the parameters should therefore improve. \(k\) affects the shift of the PMI matrix that is implicitly factorized by word2vec, i.e. \(k\) k shifts the PMI values by log \(k\).

If we transfer this to regular PMI, we obtain Shifted PPMI (SPPMI): \(SPPMI(w,c) = \text{max}(PMI(w,c) – \text{log} \: k,0)\).

Context distribution smoothing

In SGNS, the negative samples are sampled according to a _smoothed_ unigram distribution, i.e. an unigram distribution raised to the power of \(\alpha\), which is empirically set to \(\frac{3}{4}\). This leads to frequent words being sampled relatively less often than their frequency would indicate.

We can transfer this to PMI by equally raising the frequency of the context words \(f(c)\) to the power of \(\alpha\):

\(PMI(w, c) = \text{log} \frac{p(w,c)}{p(w)p_\alpha(c)}\) where \(p_\alpha(c) = \frac{f(c)^\alpha}{\sum_c f(c)^\alpha}\) and \(f(x)\) is the frequency of word \(x\).


Just like in pre-processing, three methods can be used to modify the word vectors produced by an algorithm.

Adding context vectors

The authors of GloVe recommend the addition of word vectors and context vectors to create the final output vectors, e.g. \(\vec{v}_{\text{cat}} = \vec{w}_{\text{cat}} + \vec{c}_{\text{cat}}\). This adds first-order similarity terms, i.e \(w \cdot v\). This method, however, cannot be applied to PMI, as the vectors produced by PMI are too infrequent.

Eigenvalue weighting

SVD produces the following matrices: \(W^{SVD} = U_d \cdot \Sigma_d \) and \(C^{SVD} = V_d\). These matrices, however, have different properties: \(C^{SVD}\) is orthonormal, while \(W^{SVD}\) is not.

SGNS is more symmetric in contrast. We can thus weight the eigenvalue matrix \(\Sigma_d\) with an additional parameter \(p\), which can be tuned, to yield the following:

\(W^{SVD} = U_d \cdot \Sigma_d^p\).

Vector normalisation

Finally, we can also normalise all vectors to unit length.


Levy et al. (2015) train all models on a dump of the English wikipedia and evaluate them on the commonly used word similarity and analogy datasets. You can read more about the experimental setup and training details in their paper. We summarise the most important results and takeaways below.


Levy et al. find that SVD — and not one of the word embedding algorithms — performs best on similarity tasks, while SGNS performs best on analogy datasets. They furthermore shed light on the importance of hyperparameters compared to other choices:

  1. Hyperparameters vs. algorithms:
    Hyperparameter settings are often more important than algorithm choice.
    No single algorithm consistently outperforms the other methods.
  2. Hyperparameters vs. more data:
    Training on a larger corpus helps for some tasks.
    In 3 out of 6 cases, tuning hyperparameters is more beneficial.

Debunking prior claims

Equipped with these insights, we can now debunk some generally held claims:

  1. Are embeddings superior to distributional methods?
    With the right hyperparameters, no approach has a consistent advantage over another.
  2. Is GloVe superior to SGNS?
    SNGS outperforms GloVe on all comparison tasks of Levy et al. This should necessarily be taken with a grain of salt as GloVe might perform better on other tasks.
  3. Is CBOW a good word2vec configuration?
    CBOW does not outperform SGNS on any task.


DON’T use shifted PPMI with SVD.

DON’T use SVD “correctly”, i.e. without eigenvector weighting (performance drops 15 points compared to with eigenvalue weighting with \(p = 0.5\)).

DO use PPMI and SVD with short contexts (window size of \(2\)).

DO use many negative samples with SGNS.

DO always use context distribution smoothing (raise unigram distribution to the power of \(\alpha = 0.75\)) for all methods.

DO use SGNS as a baseline (robust, fast and cheap to train).

DO try adding context vectors in SGNS and GloVe.


These results are in contrast to the general consensus that word embeddings are superior to traditional methods. Rather, they indicate that it typically makes no difference whatsoever whether word embeddings or distributional methods are used. What really matters is that your hyperparameters are tuned and that you utilize the appropriate pre-processing and post-processing steps.

Recent studies by Jurafsky’s group [13], [14] reflect these findings and illustrate that SVD, rather than SGNS, is commonly the preferred choice accurate word representations is important.

We hope this overview of word embeddings has helped to highlight some fantastic research that sheds light on the relationship between traditional distributional semantic and in-vogue embedding models.


[1]: Bengio, Y., Ducharme, R., Vincent, P., & Janvin, C. (2003). A Neural Probabilistic Language Model. The Journal of Machine Learning Research, 3, 1137–1155.

[2]: Mikolov, T., Corrado, G., Chen, K., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. Proceedings of the International Conference on Learning Representations (ICLR 2013), 1–12.

[3]: Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. NIPS, 1–9.

[4]: Collobert, R., & Weston, J. (2008). A unified architecture for natural language processing. Proceedings of the 25th International Conference on Machine Learning – ICML ’08, 20(1), 160–167.

[5]: Pennington, J., Socher, R., & Manning, C. D. (2014). Glove: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 1532–1543.

[6]: Kim, Y., Jernite, Y., Sontag, D., & Rush, A. M. (2016). Character-Aware Neural Language Models. AAAI. Retrieved from

[7]: Jozefowicz, R., Vinyals, O., Schuster, M., Shazeer, N., & Wu, Y. (2016). Exploring the Limits of Language Modeling. Retrieved from

[8]: Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural Language Processing (almost) from Scratch. Journal of Machine Learning Research, 12 (Aug), 2493–2537. Retrieved from

[9]: Chen, W., Grangier, D., & Auli, M. (2015). Strategies for Training Large Vocabulary Neural Language Models, 12. Retrieved from

[10]: Levy, O., Goldberg, Y., & Dagan, I. (2015). Improving Distributional Similarity with Lessons Learned from Word Embeddings. Transactions of the Association for Computational Linguistics, 3, 211–225. Retrieved from

[11]: Baroni, M., Dinu, G., & Kruszewski, G. (2014). Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. ACL, 238–247.

[12]: Levy, O., & Goldberg, Y. (2014). Neural Word Embedding as Implicit Matrix Factorization. Advances in Neural Information Processing Systems (NIPS), 2177–2185. Retrieved from

[13]: Hamilton, W. L., Clark, K., Leskovec, J., & Jurafsky, D. (2016). Inducing Domain-Specific Sentiment Lexicons from Unlabeled Corpora. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Retrieved from

[14]: Hamilton, W. L., Leskovec, J., & Jurafsky, D. (2016). Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change. arXiv Preprint arXiv:1605.09096.


Text Analysis API - Sign up


Sentiment analysis is widely used to gauge public opinion towards products, to analyze customer satisfaction, and to detect trends. With the proliferation of customer reviews, more fine-grained aspect-based sentiment analysis (ABSA) has gained in popularity, as it allows aspects of a product or service to be examined in more detail. To this end, we have launched an ABSA service a while ago and demonstrated how the service can be used to gain insights into the strengths and weaknesses of a product.

For performing sentiment analysis on customer reviews (as well as with many other text classification tasks), we face the problem that there are many different categories of reviews such as books, electronics, restaurants, etc. (You only need to have a look at the Departments tab on Amazon to get a feeling for the diversity of these categories.) In Machine Learning and Natural Language Processing, we refer to these different categories as domains; every domain has their unique characteristics.

In practice, this means that a model that is trained on one domain will do worse on another domain depending on how dissimilar the two domains are. For instance, a model trained on the restaurants domain will do a lot worse on the books domain than on the comparatively more similar hotels domain. For aspect-based sentiment analysis, this problem is amplified as not only the domains but also the aspects belonging to those domains differ.

In addition, as the world becomes more globalized, customer reviews need to be analyzed in other languages besides English. We thus require large amounts of training data for a large number of language-domain pairs, which is infeasible in practice as annotation — particularly the meticulous annotation required for ABSA — is expensive.

There are two directions that complement each other, which we can take to address this deficit:

  1. We can create models that allow us to transfer existing knowledge and adapt trained models to new domains without incurring a large performance hit. This area is called domain adaptation and we will talk about this in future blog posts.
  2. We can create models that are more generic and that are able to generalize well even when trained on relatively few data by leveraging information inherent in the training data. In the rest of this blog post, we will talk about the approach we took in our EMNLP 2016 paper towards this goal.

Even though Deep Learning-based models constitute the state-of-the-art in many NLP tasks, they traditionally do well only with large amounts of data. Finding ways to help them generalize with only few data samples is thus an important research problem in its own right. Taken to the extreme, we would like to emulate the way humans learn from only few examples, also known as One-Shot Learning.

Reviews — just with any coherent text — have an underlying structure. In the discourse structure of the review, sentences are connected via different rhetorical relations. Intuitively, knowledge about the relations and the sentiment of surrounding sentences should inform the sentiment of the current sentence. If a reviewer of a restaurant has shown a positive sentiment towards the quality of the food, it is likely that his opinion will not change drastically over the course of the review. Additionally, overwhelmingly positive or negative sentences in the review help to disambiguate sentences whose sentiment is equivocal.

Existing Deep Learning models for sentiment analysis act only on the sentence level; while they are able to consider intra-sentence relations, they fail to capture inter-sentence relations that rely on discourse structure and provide valuable clues for sentiment prediction.

We propose a hierarchical bidirectional long short-term memory (H-LSTM) that is able to leverage both intra– and inter-sentence relations. Because our model only relies on sentences and their structure within a review, it is fully language-independent.


hierarchical_lstm-page-001 (1)

Figure 1: The hierarchical bidirectional LSTM (H-LSTM) for aspect-based sentiment analysis. Word embeddings are fed into a sentence-level bidirectional LSTM. Final states of forward and backward LSTM are concatenated together with the aspect embedding and fed into a bidirectional review-level LSTM. At every time step, the output of the forward and backward LSTM is concatenated and fed into a final layer, which outputs a probability distribution over sentiments.

You can view the architecture of our model in the image above. The model consists of the following components:


We use a Long Short-Term Memory (LSTM), which adds input, output, and forget gates to a recurrent cell, which allow it to model long-range dependencies that are essential for capturing sentiment.

For the \(t\)th word in a sentence, the LSTM takes as input the word embedding \(x_t\), the previous output \(h_{t-1}\) and cell state \(c_{t-1}\) and computes the next output \(h_t\) and cell state \(c_t\). Both \(h\) and \(c\) are initialized with zeros.

Bidirectional LSTM

Both on the review and on the sentence level, sentiment is dependent not only on preceding but also successive words and sentences. A Bidirectional LSTM (Bi-LSTM) allows us to look ahead by employing a forward LSTM, which processes the sequence in chronological order, and a backward LSTM, which processes the sequence in reverse order. The output \(h_t\) at a given time step is then the concatenation of the corresponding states of the forward and backward LSTM.

Hierarchical Bidirectional LSTM

Stacking a Bi-LSTM on the review level on top of sentence-level Bi-LSTMs yields the hierarchical bidirectional LSTM (H-LSTM) in Figure 1.

The sentence-level forward and backward LSTMs receive the sentence starting with the first and last word embedding \(x\_{1}\) and \(x\_l\) respectively. The final output \(h\_l\) of both LSTMs is then concatenated with the aspect vector \(a\) and fed as input into the review-level forward and backward LSTMs. The outputs of both LSTMs are concatenated and fed into a final softmax layer, which outputs a probability distribution over sentiments for each sentence.


The most popular benchmark for ABSA is the SemEval Aspect-based Sentiment Analysis task. We evaluate on the most recent edition of this task, the SemEval-2016 ABSA task. To demonstrate the language and domain independence of our model, we evaluate on datasets in five domains (restaurants, hotels, laptops, phones, cameras) and eight languages (English, Spanish, French, Russian, Dutch, Turkish, Arabic, Chinese) from the competition.

We compare our model using random (H-LSTM) and pre-trained word embeddings (HP-LSTM)  against the best model of the SemEval-2016 Aspect-based Sentiment Analysis task for each domain-language pair (Best) as well as against the two best single models of the competition: IIT-TUDA (Kumar et al., 2016), which uses large sentiment lexicons for every language, and XRCE (Brun et al., 2016), which uses a parser augmented with hand-crafted, domain-specific rules.

English Restaurants 88.1 88.1 86.7 82.1 81.4 83.0 85.3
Spanish Restaurants 83.6 83.6 79.6 75.7 79.5 81.8
French Restaurants 78.8 78.8 72.2 73.2 69.8 73.6 75.4
Russian Restaurants 77.9 73.6 75.1 73.9 78.1 77.4
Dutch Restaurants 77.8 77.0 75.0 73.6 82.2 84.8
Turkish Restaurants 84.3 84.3 74.2 73.6 76.7 79.2
Arabic Hotels 82.7 81.7 82.7 80.5 82.8 82.9
English Laptops 82.8 82.8 78.4 76.0 77.4 80.1
Dutch Phones 83.3 82.6 83.3 81.8 81.3 83.6
Chinese Cameras 80.5 78.2 77.6 78.6 78.8
Chinese Phones 73.3 72.4 70.3 74.1 73.3

Table 1: Results of our system with randomly initialized word embeddings (H-LSTM) and with pre-trained embeddings (HP-LSTM) for ABSA for each language and domain in comparison to the best system for each pair (Best), the best two single systems (XRCE, IIT-TUDA), a sentence-level CNN (CNN), and our sentence-level LSTM (LSTM).

As you can see in the table above, our hierarchical model achieves results superior to the sentence-level CNN and the sentence-level Bi-LSTM baselines for almost all domain-language pairs by taking the structure of the review into account.

In addition, our model shows results competitive with the best single models of the competition, while requiring no expensive hand-crafted features or external resources, thereby demonstrating its language and domain independence. Overall, our model compares favorably to the state-of-the-art, particularly for low-resource languages, where few hand-engineered features are available. It outperforms the state-of-the-art on four and five datasets using randomly initialized and pre-trained embeddings respectively. For more details, refer to our paper.


Text Analysis API - Sign up


From July 20th to July 28th 2016, I had the opportunity of  attending the 6th Lisbon Machine Learning School. The Lisbon Machine Learning School (LxMLS) is an annual event that brings together researchers and graduate students in the fields of NLP and Computational Linguistics, computer scientists with an interest in statistics and ML, and industry practitioners with a desire for a more in-depth understanding. Participants had a chance to join workshops and labs, where they got hands-on experience with building and exploring state-of-the-art deep learning models, as well as to attend talks and speeches by prominent deep learning and NLP researchers from a variety of academic and industrial organisations. You can find the entire programme here.

In this blog post, I am going to share some of the highlights, key insights and takeaways of the summer school. I am going to skip the lectures of the first and second day as they introduce basic Python, Linear Algebra, and Probability Theory concepts and focus on the later lectures and talks. First, we are going to talk about sequence models. We will then turn to structured prediction, a type of supervised ML common to NLP. We will then summarize the lecture on Syntax and Parsing and finally provide insights with regard to Deep Learning. The accompanying slides can be found as a reference at the end of this blog post.

Disclaimer: This blog post is not meant to give a comprehensive introduction of each of the topics discussed; it should rather give you an overview of the week-long event and provide you with pointers if you want to delve deeper into any of the topics.

Sequence Models

Noah Smith of the University Washington kicked off the third day of the summer school with a compelling lecture about sequence models. To test your understanding of sequence models, try to answer – without reading further – the following question: What is the most basic sequence model depicted in Figure 1?


Figure 1: The most basic sequence model

Correct! It is the bag-of-words (notice which words have “fallen” out of the bag-of-words). The bag-of-words model makes the strongest independence assumption of all sequence models: It supposes that each word is entirely independent of its predecessors. It is obvious why models that rely on this assumption do only a poor job at modelling language: every word naturally depends on the words that have preceded it.

Somewhat more sophisticated models thus relax this naive assumption to reduce the entropy: A 1st Order Markov model makes each word dependent on the word that immediately preceded it. This way, it is already able to capture some context of the context that can help to disambiguate a new word. More generally, \(m^{\text{th}}\) Order Markov Models make each word depend on its previous \(m\) words.

In mathematical terms, in \(m^{\text{th}}\) Order Markov Models, the probability of a text sequence (we assume here that such a sequence is delimited by start and stop symbols) can be calculated using the chain rule as the product of the probabilities of the individual words:

\(p(\text{start}, w_1, w_2, …, w_n, \text{stop}) = \prod\limits_{i=1}^{n+1} \gamma (w_i \: | \: w_{i-m}, …, w_{i-1}) \)

where \(\gamma\) is the probability of the current word \(w_i\) given its \(m\) previous words, i.e. the probability to transition from the previous words to the current word.

We can view bag-of-words and \(m^{\text{th}}\) Order Markov Models as occupying the following spectrum:


Figure 2: From bag-of-words to history-based models

As we go right in Figure 2, we make weaker independence assumption and in exchange gain richer expressive power, while requiring more parameters – until we eventually obtain the most expressive – and most rigid – model, a history-based model where each word depends on its entire history, i.e. all preceding words.

As a side-note, state-of-the-art sequence modelling models such as recurrent neural networks and LSTMs can be thought of as being located on the right side of this spectrum, as they don’t require an explicit specification of context words but are – theoretically – able to take into account the entire history.

In many cases, we would not only like to model just the observed sequence of symbols, but take some additional information into account. Hidden Markov Models (HMMs) allow us to associate with each symbol \(w_i\) some missing information, its “state” \(s_i\). The probability of a word sequence in an HMM then not only depends on the transition probability \(\gamma\) but also on the so-called emission probability \(\eta\):

\(p(\text{start}, w_1, w_2, …, w_n, \text{stop}) = \prod\limits_{i=1}^{n+1} \eta (w_i \: | \: s_i) \: \gamma (s_i \: | \: s_{i-1}) \)

Consequently, the HMM is a joint model over observable symbols and hidden/latent/unknown classes. HMMs have traditionally been used in part-of-speech tagging or named entity recognition where the hidden states are POS and NER tags respectively.

If we want to determine the most probable sequence of hidden states, we face a space of potential sequences that grows exponentially with the sequence length. The classic dynamic algorithm to cope with this problem is the Viterbi algorithm, which is used in HMMs, CRFs, and other sequence models to calculate the most probable sequence of hidden states: It lays out the symbol sequence and all possible states in a grid and proceeds left-to-right to compute the maximum probability to transition in every new state given the previous states. The most probable sequence can then be found by back-tracking as in Figure 3.


Figure 3: The Viterbi algorithm

A close relative is the forward-backward algorithm, which is used to calculate the probability of a word sequence and the probabilities of each word’s states, e.g. for language modelling. Indeed, the only difference between Viterbi and the forward-backward algorithm is that Viterbi takes the maximum of the probabilities of the previous state, while forward-backward takes the sum. In this sense, they correspond to the same abstract algorithm, which is instantiated in two different semirings where a semiring informally is a set of values and some operations that obey certain properties.

Finally, if we want to learn HMMs in an unsupervised way, we use the well-known Expectation Maximisation (EM) algorithm, which consists of two steps: During the E step, we calculate the probability of each possible transition and emission at every position with forward-backward (or Viterbi for “hard” EM); for the M step, we re-estimate the parameters with MLE.

Machine Translation

On the evening of the third day, Philipp Koehn, one of the pioneers of MT and inventor of phrase-based machine translation gave a talk on Machine Translation as Sequence Modelling, including a detailed review of different MT and alignment approaches. If you are interested in a comprehensive history of MT that takes you from IBM Model 1 all the way to phrase-based, syntax-based and eventually neural MT, while delving into the details of alignment, translation, and decoding, definitely check out the slides here.

Structured Prediction

HMMs can model sequences, but as their weights are tied to the generative process, strong independence assumptions need to be made to make their computation tractable. We will now turn to a category of models that are more expressive and can be used to predict more complex structures: Structured prediction — which was introduced by Xavier Carreras of Xerox Research on the morning of the fourth day — is used to refer to ML algorithms that don’t just predict scalar or real values, but more complex structures. As complex structures are common in language, so is structured prediction; example tasks of structured prediction in NLP include POS tagging, named entity recognition, machine translation, parsing, and many others.

A successful category of structured prediction models are log-linear models, which are so-called because they model log-probabilities using a linear predictor. Such models try to estimate the parameters \(w\) by calculating the following probability:

\(\text{log} \: \text{Pr}(\mathbf{y} \: | \: \mathbf{x}; \mathbf{w}) = \text{log} \:\frac {\text{exp}\{\mathbf{w} \cdot \mathbf{f}(\mathbf{x},\mathbf{y})\}}{Z(\mathbf{x};\mathbf{w})}\)

where \(\mathbf{x} = x_1, x_2, …, x_n \in \mathcal{X}\) is the sequence of symbols, \(\mathbf{y} = y_1, y_2, …, y_n \in \mathcal{Y}\) is the corresponding sequence of labels, \(\mathbf{f}(\mathbf{x},\mathbf{y})\) is a feature representation of \(\mathbf{x}\) and \(\mathbf{y}\), and \(Z(\mathbf{x};\mathbf{w}) = \sum\limits_{\mathbf{y}’ \in \mathcal{Y}} \text{exp}(\mathbf{w} \cdot \mathbf{f}(\mathbf{x},\mathbf{y}’)) \) is also referred to as the partition function.

Two approaches that can be used to estimate the model parameters \(w\) are:

  1. Maximum Entropy Markov Models (MEMMs), which assume that \(\text{Pr}(\mathbf{y} \: | \: \mathbf{x}; \mathbf{w})\) decomposes, i.e. that we can express it as a product of the individual label probabilities that only depend on the previous label (similar to HMMs).
  2. Conditional Random Fields (CRFs), which make a weaker assumption by only assuming that \(\mathbf{f}(\mathbf{x},\mathbf{y})\) decomposes.

In MEMMs, we assume – similarly to Markov Models – that the label \(y_i\) at the \(i\) th position does not depend on all past labels, but only on the previous label \(y_{i-1}\). In contrast to Markov Models, MEMMs allow us to condition the label \(y_i\) on the entire symbol sequence \(x_{1:n}\). Both assumptions combined lead to the following probability of label \(y_i\) in MEMMs:

\(\text{Pr}(y_i \: | \: x_{1:n}, y_{1:i-1}) = \text{Pr}(y_i \: | \: x_{1:n}, y_{i-1})\)

By this formulation, the objective of MEMMs reduces sequence modelling to multi-class logistic regression.

In CRFs, we factorize on label bigrams. Instead of greedily predicting the most probable label \(y_i\) at every position \(i\), we instead aim to find the sequence of labels with the maximum probability:

\(\underset{y \in \mathcal{Y}}{\text{argmax}} \sum_i \mathbf{w} \cdot \mathbf{f}(\mathbf{x}, i, y_{i-1}, y_i)\)

We then estimate the parameters \(w\) of our model using gradient-based methods where we can use forward-backward to compute the gradient.

CRFs vs. MEMMs

By choosing between MEMMs and CRFs, we make the choice between local and global normalisation. While MEMMs aim to predict the most probable label at every position, CRFs aim to find the most probable label sequence. This, however, leads to the so-called “Label Bias Problem” in MEMMs: As MEMMs choose the label with the highest probability, the model is biased towards more frequent labels, often irrespective of the local context.

As MEMMs reduce to multi-class classification, they are cheaper to train. On the other hand, CRFs are more flexible and thus easier to extend to more complex structures.

This distinction between local and global normalisation has been a recurring topic in sequence modelling and a key criterion when choosing an algorithm. For text generation tasks, global normalisation is still too expensive, however. Many state-of-the-art approaches thus employ beam search as a compromise between local and global normalisation. In most sequence modelling tasks, local normalisation is very popular due to its ease of use, but might fall out of favour as more advanced models and implementations for global normalisation become available. To this effect, a recent outstanding paper at ACL (Andor et al., 2016) shows that globally normalised models are strictly more expressive than locally normalised ones.

HMMs vs. CRFs

Another distinction that is worth investigating is the difference between generative and discriminative models: HMMs are generative models, while CRFs are discriminative. HMMs only take into account the previous word as its features are tied to the generative process. In contrast, CRF features are very flexible. They can look at the whole input \(x\) paired with a label bigram \((y_i , y_{i+1})\). In practice, for prediction tasks, such “good” discriminative features can improve accuracy a lot.

Regarding the parameter estimation, the distinction between generative and discriminative becomes apparent: HMMs focus on explaining the data, both \(x\) and \(y\), while CRFs focus on the mapping from \(x\) to \(y\). Which model is more appropriate depends on the task: CRFs are commonly used in tasks such as POS tagging and NER, while HMMs have traditionally lain at the heart of speech recognition.

Structured Prediction in NLP with Imitation Learning

Andreas Vlachos of the University of Sheffield gave a talk on using imitation learning for structured prediction in NLP, which followed the same distinction between local normalisation (aka incremental modelling), i.e. greedily predicting one label at a time and global normalisation (aka joint modelling), i.e. scoring the complete outputs with a CRF that we discussed above. Andreas talked about how imitation learning can be used to improve incremental modelling as it allows to a) explore the search space, b) to address error-propagation, and c) to train with regard to the task-specific loss function.

There are many popular imitation learning algorithms in the literature such as SEARN (Daumé III et al., 2009), Dagger (Ross et al. 2011), or V-DAgger (Vlachos and Clark, 2014). Recently, MIXER (Ranzato et al., 2016) has been proposed to directly optimise metrics for text generation, such as BLEU or ROUGE.

An interesting perspective is that imitation learning can be seen as inverse reinforcement learning: Whereas we want to learn the best policy in reinforcement learning, we know the optimal policy in imitation learning, i.e. the labels in the training data; we then infer the per-action reward function and learn a policy, i.e. a classifier that can generalise to unseen data.

Demo Day

Figure 4: Aylien stand at Demo Day

In the evening of the fourth day, we presented – along with other NLP companies and research labs – Aylien at the LxMLS Demo Day.

We presented an overview of our research directions at Aylien, as well as a 1D generative adversarial network demo and visualization.

Syntax and Parsing

Having looked at generic models that are able to cope with sequences and more complex structures, we now briefly mention some of the techniques that are commonly used to deal with one of language’s unique characteristics: syntax. To this end, Slav Petrov of Google Research gave an in-depth lecture about syntax and parsing on the fifth day of the summer school, which discussed, among others, successful parsers such as the Charniak and the Berkeley parser, context-free grammars and phrase-based parsing, projective and non-projective dependency parsing, as well as more recent transition and graph-based parsers.

To tie this to what we’ve already discussed, Figure 5 demonstrates how the distinction between generative and discriminative models applies to parsers.


Figure 5: Generative vs. discriminative parsing models

From Dependencies to Constituents

In the evening of the fifth day, André Martins of Unbabel gave talk a on an ACL 2015 paper of his, in which he shows that constituent parsing can be reduced to dependency parsing to get the best of both worlds: the informativeness of constituent parser output and the speed of dependency parsers.

Their approach works for any out-of-the-box dependency parser, is competitive for English and morphologically rich languages, and achieves results above the state of the art for discontinuous parsing (where edges are allowed to intersect).

Deep Learning

Finally, the two last days were dedicated to Deep Learning and featured prolific researchers from academia and industry labs as speakers. On the morning of the sixth day, Wang Ling of Google DeepMind gave one of the most gentle, family-friendly intro to Deep Learning I’ve seen – titled Deep Neural Networks Are Our Friends with a theme inspired by the Muppets.

The evening talk by Oriol Vinyals of Google DeepMind detailed some of his lessons learned when working on sequence-to-sequence models at Google and gave glimpses of interesting future challenges, among them, one-shot learning for NLP (Vinyals et al., 2016) and enabling neural networks to ponder decisions (Graves, 2016).

For the lecture on the last day, Chris Dyer of CMU and Google DeepMind discussed modelling sequential data with recurrent neural networks (RNNs) and shared some insights and intuitions with regard to working with RNNs and LSTMs.

Exploding / vanishing gradients

If you’ve worked with RNNs before, then you’re most likely familiar with the exploding/vanishing gradients problem: As the length of the sequence increases, computations of the model get amplified, which leads to either exploding or vanishing gradients and thus renders the model incapable to learn. The intuition why advanced models such as LSTMs and GRUs mitigate this problem is that they use summations instead of multiplications (which lead to exponential growth).

Deep LSTMs

Figure 6: Deep LSTMs

Deep or stacked LSTMs are by now a very common sight in the literature and state-of-the-art for many sequence modelling problems. Still, descriptions of implementations often omit details, which might be perceived as self-evident. This, however, means that it is not always clear how a model looks like exactly or how it differs from similar architectures. The same applies to Deep LSTMs. The most standard convention feeds the input not only to the first but (via skip connections) also to subsequent layers as in Figure 6. Additionally, dropout is generally applied only between layers and not on the recurrent connections as this would drop out more and more value over time.


Figure 7: Dropout in Deep LSTMs

Does Depth Matter?

Generally, depth helps. However, in comparison to other applications such as audio/visual processing, depth plays a less significant role in NLP. Hypotheses for this observation are: a) More transformation is required for speech processing, image recognition etc. than for common text applications; b) Less effort has been made to find good architectures (RNNs are expensive to train; have been widely used for less long); c) Backpropagation through time and depth is hard and we need better optimisers.

Generally, 2-8 layers are standard across text applications. Input skip connections are used often but by no means universally.

Only recently have also very deep architectures been proposed for NLP (Conneau et al., 2016).


Mini-batching is generally necessary to make use of optimised matrix-matrix multiplication. In practice, however, this usually requires bucketing training instances based on similar lengths and padding with \(0\)’s, which can be a nuisance. Because of this, this is – according to Chris Dyer – “the era of assembly language programming for neural networks. Make the future an easier place to program!”

Character-based models

Character-based models have gained more popularity recently and for some tasks such as language modelling, using character-based LSTMs blows the results of word-based models out of the water, achieving a significantly lower perplexity with a lot fewer parameters particularly for morphologically rich languages.


Figure 8: CharLSTM > Word Lookup


Finally, no overview of recurrent neural networks is complete without the mention of attention, one of the most influential, recently proposed notions with regard to LSTMs. Attention is closely related to “pooling” operations in convolutional neural networks (and other architectures) as it also allows to selectively focus on particular elements of the input. The most popular attention architecture pioneered by Bahdanau et al. (2015) seems to only care about “content” in that it relies on computing the dot product, i.e. the cosine similarity between vectors. It contains no obvious bias in favor of diagonals, short jumps, fertility, or other structures that might guide actual attention from a psycho-linguistic perspective. Some work has begun to add other “structural” biases (Luong et al., 2015; Cohn et al., 2016), but there are many more opportunities for research.

Attention is similar to alignment, but there are important differences: a) alignment makes stochastic but hard decisions. Even if the alignment probability distribution is “flat”, the model picks one word or phrase at a time; b) in contrast, attention is “soft” (all words are interpolated based on their attention weights). Finally, there is a big difference between “flat” and “peaked” attention weights.

Memory Networks for Language Understanding

Antoine Bordes of Facebook AI Research gave the last talk of the summer school, in which he discussed Facebook AI Research’s two main research directions:

On the one hand, they are working on (Key-Value) Memory Networks, which can be used to jointly model symbolic and continuous systems. They can be trained end-to-end through back propagation and with SGD and provide great flexibility on how to design memories.

On the other hand, they are working on new tools for developing learning algorithms. They have created several datasets of reasonable sizes, such as bAbI, CBT, and MovieQA that are designed to ease interpretation of a model’s capabilities and to foster research.


Figure 9: LxMLS 2016 Group Picture

That was the Lisbon Machine Learning Summer School 2016! We had a blast and hope to be back next year!


Sequence Models (Noah Smith)

Machine Translation as Sequence Modelling (Philipp Koehn)

Learning Structured Predictors (Xavier Carreras)

Structured Prediction in NLP with Imitation Learning (Andreas Vlachos)

Syntax and Parsing – I, II (Slav Petrov)

Turbo Parser Redux: From Dependencies to Constituents (André Martins)

Deep Neural Networks Are Our Friends (Wang Ling)

Modeling Sequential Data with Recurrent Networks (Chris Dyer)

Memory Networks for Language Understanding (Antoine Bordes)

Text Analysis API - Sign up



Interest in Natural Language Processing (NLP) has risen rapidly in recent years; nowadays, NLP forms a key component of the roadmap of almost every major tech company. All share the goals of making advanced NLP capabilities accessible to developers and bringing them into the hands of consumers. Simultaneously, most existing areas in NLP are under active research, with new ideas being developed and tested at a breakneck pace. However, discussion of the potential and applications of such ideas is usually restricted to small interest groups, e.g. collaborating research teams and infrequent venues, e.g. NLP conferences.

We at Aylien are acutely aware of this dynamic: We seek to equip developers and businesses with the NLP tools they need to improve their businesses. At the same time, we are conducting cutting-edge research and continuously looking for ways to leverage this research to improve our services and help our clients.

In order to provide a regular, common forum for students, researchers, and industry professionals to discuss state-of-the-art NLP research and cutting-edge industry applications, we are thrilled to announce the NLP Dublin meetup. We hope that this group will facilitate the exchange of ideas within the Irish NLP community and bring people into contact with ideas and applications they might have otherwise not heard about. Every event will feature presentations about interesting areas of NLP, QA sessions, and ample time for discussions and networking.

The first meetup will take place on August 3. If you are interested in speaking at or sponsoring future meetups, please contact or click the banner below.