
The global health and wellness market is valued at almost $4 trillion annually, and the media plays a central role in consumer buying decisions. Spotting trends in this coverage can therefore generate hugely valuable insights, as well as telling us some interesting things about the global health and wellness industry. With this in mind, we decided to look into what was published in the Health category last month and see what we could find.

In February, our News API gathered, analyzed, and indexed over 2.5 million new stories as they were published, giving us a huge enriched dataset of news content to look into. About 70,000 of these articles were categorized as belonging to the Health category (seems low, right? If you’re wondering why, take a look at this previous blog, where we looked into the most popular categories and saw that Sports, News, and a few others have vastly greater publishing volumes than the rest).

The News API is an extremely powerful tool for looking into enriched news content at scale and generating intelligent insights about what the world is talking about. In this blog, we’re going to look into these almost 70,000 stories to ask the following questions:

  1. What publishing patterns could we identify over the course of the month?
  2. What consumers were these stories aimed at?
  3. What was the sentiment in these stories?
  4. Who and what was being talked about in these stories?

If you want to dive into this content for yourself, grab an API key to start your free trial, and get up and running in minutes.

How many stories are published in the Health category every day?

To find out what patterns of publication the stories in the Health category followed, we used the Time Series endpoint to see the daily count of new stories from the past 8 weeks and plotted the results on the chart below.
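
For reference, the request behind this chart looks roughly like the sketch below. The endpoint path, header names, parameter names, and category ID shown are assumptions about how the News API is called, so check them against the documentation before running it.

```javascript
// Rough sketch (Node 18+ for global fetch): daily volumes of new Health stories
// over the past 8 weeks via the Time Series endpoint.
// The endpoint path, headers, parameters, and category ID below are assumptions.
const HEADERS = {
  'X-AYLIEN-NewsAPI-Application-ID': process.env.NEWSAPI_APP_ID,
  'X-AYLIEN-NewsAPI-Application-Key': process.env.NEWSAPI_APP_KEY,
};

async function healthTimeSeries() {
  const params = new URLSearchParams({
    'categories.taxonomy': 'iab-qag',
    'categories.id[]': 'IAB7',          // assumed ID for the Health category
    'published_at.start': 'NOW-8WEEKS',
    'published_at.end': 'NOW',
    period: '+1d',                      // one data point per day
  });
  const res = await fetch(
    `https://api.newsapi.aylien.com/api/v1/time_series?${params}`,
    { headers: HEADERS }
  );
  const { time_series } = await res.json();
  time_series.forEach(({ published_at, count }) =>
    console.log(published_at, count)
  );
}

healthTimeSeries().catch(console.error);
```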

You can see a clear pattern in publication volumes in the chart below, with the average weekday volume of new Health stories being around 3,000 stories. On the weekends (when most journalists/writers are off work), this dips by about two thirds to 1,000 new stories per day.

We can also see two noticeable spikes that break this trend in the first and last weeks of February. By narrowing down the time period of our search, we can see that this increased coverage in the first week was due to the announcement of a new health industry venture between Amazon and Berkshire Hathaway, while the second spike was caused by the release of a study that found a correlation between alcohol consumption and early-onset dementia.


Does the Health & Wellness coverage talk about men or women more?

With the huge spend on health and wellness that we mentioned in the introduction, we thought it would be interesting to see who the health coverage was aimed at – men or women. The difference we found is interesting – in stories in the Health category, the word “women” appeared in the title almost four times more often than “men”:


Was the Sentiment of Health & Wellness stories different for Men and Women?

Knowing that there were many more Health & Wellness stories aimed at women than at men is a valuable insight in itself, but using the News API we can go further and analyze the sentiment of each of these stories.

The News API’s Trends endpoint allows you to do exactly this – every time the News API collects a new story, it analyzes its sentiment. Using the Trends endpoint, you can then search for the volume of stories with a positive, negative, or neutral tone.

You can see below that stories in the Health category with “women” in the title were generally evenly balanced between positive, negative, and neutral. On the other hand, stories in the same category with “men” in the title tended to be more negative than positive, with 47% of these stories carrying a negative tone.
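
The counts above come from the Trends endpoint; as a simpler illustration, the sketch below gets the same per-polarity volumes by summing a daily Time Series for each polarity. The parameter names (in particular the sentiment filter and the category ID) are assumptions, so verify them against the News API documentation.

```javascript
// Rough sketch (Node 18+): volume of Health stories with "women" or "men" in
// the title, broken down by body sentiment. Parameter names are assumptions.
const HEADERS = {
  'X-AYLIEN-NewsAPI-Application-ID': process.env.NEWSAPI_APP_ID,
  'X-AYLIEN-NewsAPI-Application-Key': process.env.NEWSAPI_APP_KEY,
};

async function sentimentVolume(titleTerm, polarity) {
  const params = new URLSearchParams({
    title: titleTerm,
    'categories.taxonomy': 'iab-qag',
    'categories.id[]': 'IAB7',                 // assumed ID for the Health category
    'sentiment.body.polarity[]': polarity,
    'published_at.start': 'NOW-1MONTH',
    'published_at.end': 'NOW',
    period: '+1d',
  });
  const res = await fetch(
    `https://api.newsapi.aylien.com/api/v1/time_series?${params}`,
    { headers: HEADERS }
  );
  const { time_series } = await res.json();
  // Sum the daily counts to get the total for the month.
  return time_series.reduce((total, point) => total + point.count, 0);
}

(async () => {
  for (const term of ['women', 'men']) {
    for (const polarity of ['positive', 'neutral', 'negative']) {
      console.log(term, polarity, await sentimentVolume(term, polarity));
    }
  }
})().catch(console.error);
```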


What entities did the Health stories about women mention?

So far, we have seen that of the roughly 70,000 news stories published about health last month, many more were about women’s health than men’s health, and that the stories about women’s health tended to have a more positive tone than those about men’s health.

With the News API, we can dive further into these stories and see exactly what people, things, and organizations are mentioned the most. Using the Trends endpoint, we gathered all of the entities mentioned in Health stories with “women” in the title and plotted them on the chart below. You can see that ‘study’ is frequently mentioned in these stories.


We were curious to see why the word ‘study’ was mentioned so much, so we checked the results in the Time Series endpoint, which shows us the volume of stories over time. Specifically, we searched for stories in the Health category that mentioned both “women” and “study” in the body. This search shows when Health stories about women cited studies. You can see two big spikes:


With the Stories endpoint, we can go further into the trends from last month and see the actual news stories that made up these spikes. We took a random selection of three stories from each of the two spikes.

On February 15th, you can see that a WHO report on women’s choices during labor dominated the news:

On the 20th, a University of Bergen study on the health impact of cleaning products prompted another spike in stories:

That concludes our high-level look into February’s publishing trends in the Health category. If you have an interest in this category, use the link below to grab an API key and start digging into the data yourself. Our News API gathers, analyzes, and indexes over 2.5 million new stories each month in near-real time, so whatever industry you’re in, the News API can help you generate timely, actionable insights.

News API - Sign up


In a recent blog post I outlined some interesting research directions for people who are just getting into NLP and ML (you can read the original post here). These ideas can also serve as a starting point and source of inspiration if you are considering starting a PhD in this field, which is why I’m highlighting them again here. At AYLIEN, we offer the opportunity for people to join our research team and complete a PhD, while also working on the application of that research directly as part of our products. If this sounds interesting, you should consider applying to work with us on these (or related) ideas. See this blog post for more information.


It can be hard to find compelling topics to work on and know what questions are interesting to ask when you are just starting as a researcher in a new field. Machine learning research in particular moves so fast these days that it is difficult to find an opening.

This post aims to provide inspiration and ideas for research directions to junior researchers and those trying to get into research. It gathers a collection of research topics that are interesting to me, with a focus on NLP and transfer learning. As such, they might obviously not be of interest to everyone. If you are interested in Reinforcement Learning, OpenAI provides a selection of interesting RL-focused research topics. In case you’d like to collaborate with others or are interested in a broader range of topics, have a look at the Artificial Intelligence Open Network.

Most of these topics are not thoroughly thought out yet; in many cases, the general description is quite vague and subjective and many directions are possible. In addition, most of these are not low-hanging fruit, so serious effort is necessary to come up with a solution. I am happy to provide feedback with regard to any of these, but will not have time to provide more detailed guidance unless you have a working proof-of-concept. I will update this post periodically with new research directions and advances in already listed ones. Note that this collection does not attempt to review the extensive literature but only aims to give a glimpse of a topic; consequently, the references won’t be comprehensive.

I hope that this collection will pique your interest and serve as inspiration for your own research agenda.

Task-independent data augmentation for NLP

Data augmentation aims to create additional training data by producing variations of existing training examples through transformations, which can mirror those encountered in the real world. In Computer Vision (CV), common augmentation techniques are mirroring, random cropping, shearing, etc. Data augmentation is super useful in CV. For instance, it has been used to great effect in AlexNet (Krizhevsky et al., 2012) [1] to combat overfitting and in most state-of-the-art models since. In addition, data augmentation makes intuitive sense as it makes the training data more diverse and should thus increase a model’s generalization ability.

However, in NLP, data augmentation is not widely used. In my mind, this is for two reasons:

  1. Data in NLP is discrete. This prevents us from applying simple transformations directly to the input data. Most recently proposed augmentation methods in CV focus on such transformations, e.g. domain randomization (Tobin et al., 2017) [2].
  2. Small perturbations may change the meaning. Deleting a negation may change a sentence’s sentiment, while modifying a word in a paragraph might inadvertently change the answer to a question about that paragraph. This is not the case in CV where perturbing individual pixels does not change whether an image is a cat or dog and even stark changes such as interpolation of different images can be useful (Zhang et al., 2017) [3].

Existing approaches that I am aware of are either rule-based (Li et al., 2017) [5] or task-specific, e.g. for parsing (Wang and Eisner, 2016) [6] or zero-pronoun resolution (Liu et al., 2017) [7]. Xie et al. (2017) [39] replace words with samples from different distributions for language modelling and Machine Translation. Recent work focuses on creating adversarial examples either by replacing words or characters (Samanta and Mehta, 2017; Ebrahimi et al., 2017) [8, 9], concatenation (Jia and Liang, 2017) [11], or adding adversarial perturbations (Yasunaga et al., 2017) [10]. An adversarial setup is also used by Li et al. (2017) [16] who train a system to produce sequences that are indistinguishable from human-generated dialogue utterances.

Back-translation (Sennrich et al., 2015; Sennrich et al., 2016) [12, 13] is a common data augmentation method in Machine Translation (MT) that allows us to incorporate monolingual training data. For instance, when training an EN\(\rightarrow\)FR system, monolingual French text is translated to English using an FR\(\rightarrow\)EN system; the synthetic parallel data can then be used for training. Back-translation can also be used for paraphrasing (Mallinson et al., 2017) [14]. Paraphrasing has been used for data augmentation for QA (Dong et al., 2017) [15], but I am not aware of its use for other tasks.

Another method that is close to paraphrasing is generating sentences from a continuous space using a variational autoencoder (Bowman et al., 2016; Guu et al., 2017) [17, 19]. If the representations are disentangled as in (Hu et al., 2017) [18], then we are also not too far from style transfer (Shen et al., 2017) [20].

There are a few research directions that would be interesting to pursue:

  1. Evaluation study: Evaluate a range of existing data augmentation methods, as well as techniques that have not been widely used for augmentation such as paraphrasing and style transfer, on a diverse range of tasks including text classification and sequence labelling. Identify which types of data augmentation are robust across tasks and which are task-specific. This could be packaged as a software library to make future benchmarking easier (think CleverHans for NLP).
  2. Data augmentation with style transfer: Investigate if style transfer can be used to modify various attributes of training examples for more robust learning.
  3. Learn the augmentation: Similar to Dong et al. (2017) we could learn either to paraphrase or to generate transformations for a particular task.
  4. Learn a word embedding space for data augmentation: A typical word embedding space clusters synonyms and antonyms together; using nearest neighbours in this space for replacement is thus infeasible. Inspired by recent work (Mrkšić et al., 2017) [21], we could specialize the word embedding space to make it more suitable for data augmentation.
  5. Adversarial data augmentation: Related to recent work in interpretability (Ribeiro et al., 2016) [22], we could change the most salient words in an example, i.e. those that a model depends on for a prediction. This still requires a semantics-preserving replacement method, however.

Few-shot learning for NLP

Zero-shot, one-shot and few-shot learning are among the most interesting recent research directions IMO. Following the key insight from Vinyals et al. (2016) [4] that a few-shot learning model should be explicitly trained to perform few-shot learning, we have seen several recent advances (Ravi and Larochelle, 2017; Snell et al., 2017) [23, 24].

Learning from few labeled samples is one of the hardest problems IMO and one of the core capabilities that separates the current generation of ML models from more generally applicable systems. Zero-shot learning has only been investigated in the context of learning word embeddings for unknown words AFAIK. Dataless classification (Song and Roth, 2014; Song et al., 2016) [25, 26] is an interesting related direction that embeds labels and documents in a joint space, but requires interpretable labels with good descriptions.

Potential research directions are the following:

  1. Standardized benchmarks: Create standardized benchmarks for few-shot learning for NLP. Vinyals et al. (2016) introduce a one-shot language modelling task for the Penn Treebank. The task, while useful, is dwarfed by the extensive evaluation on CV benchmarks and has not seen much use AFAIK. A few-shot learning benchmark for NLP should contain a large number of classes and provide a standardized split for reproducibility. Good candidate tasks would be topic classification or fine-grained entity recognition.
  2. Evaluation study: After creating such a benchmark, the next step would be to evaluate how well existing few-shot learning models from CV perform for NLP.
  3. Novel methods for NLP: Given a dataset for benchmarking and an empirical evaluation study, we could then start developing novel methods that can perform few-shot learning for NLP.

Transfer learning for NLP

Transfer learning has had a large impact on computer vision (CV) and has greatly lowered the entry threshold for people wanting to apply CV algorithms to their own problems. CV practitioners are no longer required to perform extensive feature engineering for every new task, but can simply fine-tune a model pretrained on a large dataset with a small number of examples.

In NLP, however, we have so far only been pretraining the first layer of our models via pretrained embeddings. Recent approaches (Peters et al., 2017, 2018) [31, 32] add pretrained language model embeddings, but these still require custom architectures for every task. In my opinion, in order to unlock the true potential of transfer learning for NLP, we need to pretrain the entire model and fine-tune it on the target task, akin to fine-tuning ImageNet models. Language modelling, for instance, is a great task for pretraining and could be to NLP what ImageNet classification is to CV (Howard and Ruder, 2018) [33].

Here are some potential research directions in this context:

  1. Identify useful pretraining tasks: The choice of the pretraining task is very important as even fine-tuning a model on a related task might only provide limited success (Mou et al., 2016) [38]. Other tasks such as those explored in recent work on learning general-purpose sentence embeddings (Conneau et al., 2017; Subramanian et al., 2018; Nie et al., 2017) [34, 35, 40] might be complementary to language model pretraining or suitable for other target tasks.
  2. Fine-tuning of complex architectures: Pretraining is most useful when a model can be applied to many target tasks. However, it is still unclear how to pretrain more complex architectures, such as those used for pairwise classification tasks (Augenstein et al., 2018) or reasoning tasks such as QA or reading comprehension.

Multi-task learning

Multi-task learning (MTL) has become more commonly used in NLP. See here for a general overview of multi-task learning and here for MTL objectives for NLP. However, there is still much we don’t understand about multi-task learning in general.

The main questions regarding MTL give rise to many interesting research directions:

  1. Identify effective auxiliary tasks: One of the main questions is which tasks are useful for multi-task learning. Label entropy has been shown to be a predictor of MTL success (Alonso and Plank, 2017) [28], but this does not tell the whole story. In recent work (Augenstein et al., 2018) [27], we have found that auxiliary tasks with more data and more fine-grained labels are more useful. It would be useful if future MTL papers would not only propose a new model or auxiliary task, but also try to understand why a certain auxiliary task might be better than another closely related one.
  2. Alternatives to hard parameter sharing: Hard parameter sharing is still the default modus operandi for MTL, but places a strong constraint on the model to compress knowledge pertaining to different tasks with the same parameters, which often makes learning difficult. We need better ways of doing MTL that are easy to use and work reliably across many tasks. Recently proposed methods such as cross-stitch units (Misra et al., 2017; Ruder et al., 2017) [29, 30] and a label embedding layer (Augenstein et al., 2018) are promising steps in this direction.
  3. Artificial auxiliary tasks: The best auxiliary tasks are those that are tailored to the target task and do not require any additional data. I have outlined a list of potential artificial auxiliary tasks here. However, it is not clear which of these work reliably across a number of diverse tasks or what variations or task-specific modifications are useful.

Cross-lingual learning

Creating models that perform well across languages and that can transfer knowledge from resource-rich to resource-poor languages is one of the most important research directions IMO. There has been much progress in learning cross-lingual representations that project different languages into a shared embedding space. Refer to Ruder et al. (2017) [36] for a survey.

Cross-lingual representations are commonly evaluated either intrinsically on similarity benchmarks or extrinsically on downstream tasks, such as text classification. While recent methods have advanced the state-of-the-art for many of these settings, we do not have a good understanding of the tasks or languages for which these methods fail and how to mitigate these failures in a task-independent manner, e.g. by injecting task-specific constraints (Mrkšić et al., 2017).

Task-independent architecture improvements

Novel architectures that outperform the current state-of-the-art and are tailored to specific tasks are regularly introduced, superseding the previous architecture. I have outlined best practices for different NLP tasks before, but without comparing such architectures on different tasks, it is often hard to gain insights from specialized architectures and tell which components would also be useful in other settings.

A particularly promising recent model is the Transformer (Vaswani et al., 2017) [37]. While the complete model might not be appropriate for every task, components such as multi-head attention or positional encoding could be building blocks that are generally useful for many NLP tasks.


I hope you’ve found this collection of research directions useful. If you have suggestions on how to tackle some of these problems or ideas for related research topics, feel free to comment below.


  1. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105).

  2. Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., & Abbeel, P. (2017). Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. arXiv preprint arXiv:1703.06907.

  3. Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D. (2017). mixup: Beyond Empirical Risk Minimization.

  4. Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., & Wierstra, D. (2016). Matching Networks for One Shot Learning. NIPS 2016.

  5. Li, Y., Cohn, T., & Baldwin, T. (2017). Robust Training under Linguistic Adversity. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (Vol. 2, pp. 21–27).

  6. Wang, D., & Eisner, J. (2016). The Galactic Dependencies Treebanks: Getting More Data by Synthesizing New Languages. TACL, 4, 491–505.

  7. Liu, T., Cui, Y., Yin, Q., Zhang, W., Wang, S., & Hu, G. (2017). Generating and Exploiting Large-scale Pseudo Training Data for Zero Pronoun Resolution. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (pp. 102–111).

  8. Samanta, S., & Mehta, S. (2017). Towards Crafting Text Adversarial Samples. arXiv preprint arXiv:1707.02812.

  9. Ebrahimi, J., Rao, A., Lowd, D., & Dou, D. (2017). HotFlip: White-Box Adversarial Examples for NLP.

  10. Yasunaga, M., Kasai, J., & Radev, D. (2017). Robust Multilingual Part-of-Speech Tagging via Adversarial Training. In Proceedings of NAACL 2018.

  11. Jia, R., & Liang, P. (2017). Adversarial Examples for Evaluating Reading Comprehension Systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.

  12. Sennrich, R., Haddow, B., & Birch, A. (2015). Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709.

  13. Sennrich, R., Haddow, B., & Birch, A. (2016). Edinburgh neural machine translation systems for WMT 16. arXiv preprint arXiv:1606.02891.

  14. Mallinson, J., Sennrich, R., & Lapata, M. (2017). Paraphrasing revisited with neural machine translation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers (Vol. 1, pp. 881-893).

  15. Dong, L., Mallinson, J., Reddy, S., & Lapata, M. (2017). Learning to Paraphrase for Question Answering. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.

  16. Li, J., Monroe, W., Shi, T., Ritter, A., & Jurafsky, D. (2017). Adversarial Learning for Neural Dialogue Generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.

  17. Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A. M., Jozefowicz, R., & Bengio, S. (2016). Generating Sentences from a Continuous Space. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning (CoNLL).

  18. Hu, Z., Yang, Z., Liang, X., Salakhutdinov, R., & Xing, E. P. (2017). Toward Controlled Generation of Text. In Proceedings of the 34th International Conference on Machine Learning.

  19. Guu, K., Hashimoto, T. B., Oren, Y., & Liang, P. (2017). Generating Sentences by Editing Prototypes.

  20. Shen, T., Lei, T., Barzilay, R., & Jaakkola, T. (2017). Style Transfer from Non-Parallel Text by Cross-Alignment. In Advances in Neural Information Processing Systems.

  21. Mrkšić, N., Vulić, I., Séaghdha, D. Ó., Leviant, I., Reichart, R., Gašić, M., … Young, S. (2017). Semantic Specialisation of Distributional Word Vector Spaces using Monolingual and Cross-Lingual Constraints. TACL.

  22. Ribeiro, M. T., Singh, S., & Guestrin, C. (2016, August). “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1135-1144). ACM.

  23. Ravi, S., & Larochelle, H. (2017). Optimization as a Model for Few-Shot Learning. In ICLR 2017.

  24. Snell, J., Swersky, K., & Zemel, R. S. (2017). Prototypical Networks for Few-shot Learning. In Advances in Neural Information Processing Systems.

  25. Song, Y., & Roth, D. (2014). On dataless hierarchical text classification. Proceedings of AAAI, 1579–1585.

  26. Song, Y., Upadhyay, S., Peng, H., & Roth, D. (2016). Cross-Lingual Dataless Classification for Many Languages. IJCAI, 2901–2907.

  27. Augenstein, I., Ruder, S., & Søgaard, A. (2018). Multi-task Learning of Pairwise Sequence Classification Tasks Over Disparate Label Spaces. In Proceedings of NAACL 2018.

  28. Alonso, H. M., & Plank, B. (2017). When is multitask learning effective? Multitask learning for semantic sequence prediction under varying data conditions. In EACL.

  29. Misra, I., Shrivastava, A., Gupta, A., & Hebert, M. (2016). Cross-stitch Networks for Multi-task Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

  30. Ruder, S., Bingel, J., Augenstein, I., & Søgaard, A. (2017). Sluice networks: Learning what to share between loosely related tasks. arXiv preprint arXiv:1705.08142.

  31. Peters, M. E., Ammar, W., Bhagavatula, C., & Power, R. (2017). Semi-supervised sequence tagging with bidirectional language models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017).

  32. Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. Proceedings of NAACL.

  33. Howard, J., & Ruder, S. (2018). Fine-tuned Language Models for Text Classification. arXiv preprint arXiv:1801.06146.

  34. Conneau, A., Kiela, D., Schwenk, H., Barrault, L., & Bordes, A. (2017). Supervised Learning of Universal Sentence Representations from Natural Language Inference Data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.

  35. Subramanian, S., Trischler, A., Bengio, Y., & Pal, C. J. (2018). Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning. In Proceedings of ICLR 2018.

  36. Ruder, S., Vulić, I., & Søgaard, A. (2017). A Survey of Cross-lingual Word Embedding Models. arXiv preprint arXiv:1706.04902.

  37. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … Polosukhin, I. (2017). Attention Is All You Need. In Advances in Neural Information Processing Systems.

  38. Mou, L., Meng, Z., Yan, R., Li, G., Xu, Y., Zhang, L., & Jin, Z. (2016). How Transferable are Neural Networks in NLP Applications? Proceedings of 2016 Conference on Empirical Methods in Natural Language Processing.

  39. Xie, Z., Wang, S. I., Li, J., Levy, D., Nie, A., Jurafsky, D., & Ng, A. Y. (2017). Data Noising as Smoothing in Neural Network Language Models. In Proceedings of ICLR 2017.

  40. Nie, A., Bennett, E. D., & Goodman, N. D. (2017). DisSent: Sentence Representation Learning from Explicit Discourse Relations. arXiv preprint arXiv:1710.04334.


At AYLIEN we are using recent advances in Artificial Intelligence to try to understand natural language. Part of what we do is building products such as our Text Analysis Platform, Text Analysis API and News API to help people extract meaning and insight from text. We are also a research lab, conducting research that we believe will make valuable contributions to the field of Artificial Intelligence, as well as driving further product development (have a look at our research page for more details).

We are excited to announce that we are currently accepting applications from students and researchers for funded PhD and Masters opportunities, as part of the Irish Research Council Employment Based Programme.

The Employment Based Programme (EBP) enables students to complete their PhD or Masters degree while working with us here at AYLIEN.

For students and researchers, we feel that this is a great opportunity to work in industry with a team of talented scientists and engineers, and with the resources and infrastructure to support your work.

About us

We’re an award-winning, VC-backed text analysis company. We apply cutting-edge AI, deep learning and natural language processing research to offer developers and solution builders a package of APIs that bring intelligent analysis to a wide range of apps and processes, helping them make sense of large volumes of unstructured data and content.

With thousands of users worldwide and a growing customer base that includes great companies such as Microsoft, Storyful, and Sony, we’re growing fast and enjoy working as part of a diverse and super smart team here at our office in Dublin, Ireland.

About the IRC Employment Based Programme

The Irish Research Council’s Employment Based Programme (EBP) is a unique national initiative, providing students with an opportunity to work in a co-educational environment involving a higher education institution and an employment partner.

The EBP provides a co-educational opportunity for researchers: you will be employed directly by AYLIEN while also being a full-time student working on your research degree. One of the key benefits of such an arrangement is that you will be given the chance to see your academic outputs being transferred into a practical setting. This immersive aspect of the programme will enable you to work with some really bright minds who can help you generate research ideas and bring benefits to your work that may otherwise not have come to light under a traditional academic Masters or PhD route.


The Scholarship funding consists of €24,000pa towards salary and a maximum of €8,000pa for tuition, travel and equipment expenses. Depending on candidates’ level of seniority and expertise, the salary amount may be increased.

Details & requirements

First and foremost, your thesis topic must be something you are passionate about. While prior experience with the topic is important, it is not crucial. We can work with you to establish a suitable topic that overlaps with both the supervisor’s general area of interest/research and our own research and product directions. For a detailed list of research areas which we find particularly interesting, take a look at this blog post.

We are particularly interested in applicants with interests in the following areas (but are open to other suggestions):

  • Representation Learning
  • Domain Adaptation and Transfer Learning
  • Sentiment Analysis
  • Question Answering
  • Dialogue Systems
  • Entity and Relation Extraction
  • Topic Modeling
  • Document Classification
  • Taxonomy Inference
  • Document Summarization
  • Machine Translation

Suggested read: Survival Guide to a PhD by Andrej Karpathy

You have the option to complete a Masters (1 year, or 2 years if structured) or a PhD (3 years, or 4 years if structured) degree.

AYLIEN will co-fund your scholarship and provide you with professional guidance and mentoring throughout the programme. It is a prerequisite that you spend 50-70% of your time based on site with us and the remainder of the time at your higher educational institute (HEI).

The programme is open to students with a bachelor’s degree or higher (worldwide), and you will ideally be based within commutable distance of our office in Dublin City Centre.


It would be ideal if you have already identified or engaged with a potential supervisor at a university in Ireland. However, if not, we will help you with finding a suitable supervisor.

Important dates and deadlines

Please note: all times stated are Ireland time.

Call open: 1 February 2018

Application deadline (for getting in touch with us): 23 March 2018 (18:00)

Final application deadline (with Irish Research Council): 12 April 2018 (16:00)

Supervisor, Employment Mentor and Referee Deadline: 19 April 2018 (16:00)

Research Office Endorsement Deadline: 26 April 2018 (16:00)

Outcome of Scheme: 25 June 2018

Scholarship Start Date: 1 October 2018

How to apply

To express your interest, please forward your CV and accompanying note with topic suggestions to


Each month on the AYLIEN blog, we look into the previous month’s news to see what trends and insights we can extract using our News API. 2018 started with the news of a serious comedown from the highs of Bitcoin mania in December, and Donald Trump continuing to blaze a trail through American politics with new controversies. So last month was not exactly short of subject matter to dive into.

Using the News API, we looked into 2.6 million of the stories published last month and analyzed the coverage of two topics that caught our attention in January:

  • the media’s coverage of the rapid rise and fall in the value of Bitcoin
  • the reaction to President Trump’s description of African countries as “shitholes”


Bitcoin’s rise and fall in the media

Previously on the AYLIEN blog, we looked at the rising number of stories being published about Bitcoin in November, and saw that media coverage of Bitcoin was closely entwined with the cryptocurrency’s fortunes. This helps explain how so many people bought in and drove the price up, despite the relatively tiny number of people who understand cryptocurrencies.

So with the recent crash in the price of Bitcoin, we decided to look into whether the media played a role in these same people’s plummeting confidence in Bitcoin’s prospects.

How did the media’s publishing patterns relate to the price of Bitcoin?

To look into the relationship between the media and Bitcoin, we compared the volume of stories published about Bitcoin with the cryptocurrency’s daily closing price (which we downloaded in CSV format). We did this using the Time Series endpoint, requesting the daily volume of stories published with ‘Bitcoin’ in the title.

You can see that the number of stories about Bitcoin skyrocketed on the day that the cryptocurrency crossed the $10,000 milestone. As its value continued to rise, the press interest remained (even rising a little), but after this initial infatuation, you can see that the only spikes in story volume occurred when Bitcoin lost value.

When you look closely at the patterns of publication volume, you can see that after three news cycles of Bitcoin price rallies, the media were only interested in Bitcoin losing value. For example, notice how the price of Bitcoin increased by $2,000 in one day on January 5th, but this did not prompt any spike in press coverage. On the other hand, when Bitcoin lost value the following week, you can see two spikes in media interest. Using the Time Series endpoint like this lets us begin to extract insights into the press coverage without even looking at a single story.

What were these stories talking about?

Looking at the relationship between the Bitcoin price and these publishing spikes, we can see a definite correlation and guess that the media were most interested in stories about the misfortune of Bitcoin investors. But using the News API, we can actually look into every one of these stories and analyze what was being written about, at scale.

To do this, we used the keywords parameter of the Trends endpoint to retrieve the most-mentioned people, places, and things from stories published in January with ‘Bitcoin’ in the title. After that, we converted the JSON returned by the News API into CSV format using this free, easy-to-use tool and visualized the file in Tableau.
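
A rough sketch of that Trends request, along with writing the counts straight out to a CSV file, is shown below; the field value, parameter names, and response shape are assumptions, so check them against the News API documentation.

```javascript
// Rough sketch (Node 18+): most-mentioned keywords in January stories with
// "Bitcoin" in the title, written to a CSV file for charting.
// The field value and response shape are assumptions.
const fs = require('fs');

const HEADERS = {
  'X-AYLIEN-NewsAPI-Application-ID': process.env.NEWSAPI_APP_ID,
  'X-AYLIEN-NewsAPI-Application-Key': process.env.NEWSAPI_APP_KEY,
};

async function bitcoinKeywordsToCsv() {
  const params = new URLSearchParams({
    title: 'Bitcoin',
    field: 'keywords',                            // assumed field name
    'published_at.start': '2018-01-01T00:00:00Z',
    'published_at.end': '2018-02-01T00:00:00Z',
  });
  const res = await fetch(
    `https://api.newsapi.aylien.com/api/v1/trends?${params}`,
    { headers: HEADERS }
  );
  const { trends } = await res.json();            // assumed shape: [{ value, count }, ...]
  const csv = ['keyword,count', ...trends.map(t => `"${t.value}",${t.count}`)].join('\n');
  fs.writeFileSync('bitcoin_keywords.csv', csv);
}

bitcoinKeywordsToCsv().catch(console.error);
```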

You can see from the chart that besides obvious keywords like ‘blockchain’ and ‘currency,’ ‘South Korea’ and ‘volatility’ are two highly prevalent keywords that can’t be explained simply by a conceptual relation to Bitcoin. This suggests the media focused heavily on South Korea’s clampdown on the cryptocurrency and on its volatility in general.

What content were people sharing about Bitcoin last month?

Knowing that publishers were focused on the falling value of Bitcoin is great, as it gives us a hint about what narrative the media was pushing around it. But we can also take a look at what people were sharing on social media, which can give us an insight into what aspects of the saga the public were most interested in.

Looking at the most-shared stories on Facebook, you can see that people were most interested in stories about government clampdowns on the cryptocurrency and its bad fortune in general. This shows that there was a popular appetite for the negative coverage.

  1. “Bitcoin prices fall as South Korea says ban still an option,” The Associated Press. 55,046 shares.
  2. “France wants tougher rules on bitcoin to avoid criminal use,” The Associated Press. 54,963 shares.
  3. “Facebook is banning all ads promoting cryptocurrencies — including bitcoin and ICOs,” Recode. 18,384 shares.
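
For reference, a Stories query along the lines of the sketch below returns the most-shared articles; the sort_by value and response fields are assumptions, so verify them against the News API documentation.

```javascript
// Rough sketch (Node 18+): the January Bitcoin stories shared most on Facebook.
// The sort_by value and response fields are assumptions.
const HEADERS = {
  'X-AYLIEN-NewsAPI-Application-ID': process.env.NEWSAPI_APP_ID,
  'X-AYLIEN-NewsAPI-Application-Key': process.env.NEWSAPI_APP_KEY,
};

async function mostSharedBitcoinStories() {
  const params = new URLSearchParams({
    title: 'Bitcoin',
    'published_at.start': '2018-01-01T00:00:00Z',
    'published_at.end': '2018-02-01T00:00:00Z',
    sort_by: 'social_shares_count.facebook',   // assumed sort key
    per_page: '3',
  });
  const res = await fetch(
    `https://api.newsapi.aylien.com/api/v1/stories?${params}`,
    { headers: HEADERS }
  );
  const { stories } = await res.json();
  stories.forEach(s => console.log(s.title, '|', s.source && s.source.name));
}

mostSharedBitcoinStories().catch(console.error);
```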

Donald Trump’s description of African countries as “sh*tholes”

In terms of news coverage, there’s really no getting away from Donald Trump. For better or worse, the media is fixated on the President and January was no different. Last month, the President of the United States referred to Haiti and the countries of Africa as “shitholes” during talks on immigration reform, took a test used to diagnose dementia, and gave a speech to world leaders at Davos.

We decided to see which stories the press were most interested in by using the Stories endpoint again. From the peaks in publishing volume about Trump, you can see that interest in the President peaked on the day after the “shitholes” comment, followed by his speeches at Davos and at the State of the Union on the last day of the month.


What sentiment did the media show towards each event?

So we can see the events that prompted media coverage of the President in January. But with the News API, we can look into the sentiment of the coverage of each event. This gives us an insight into how people felt about the events influencing the news cycle.
To do this, we used the Time Series endpoint and searched for the daily volume of stories with ‘Trump’ in the title, with queries for positive, negative, and neutral sentiment polarities in the story body.

Below you can see the difference between the overwhelmingly negative sentiment in the coverage around the time of the “shithole” controversy and the more balanced coverage on January 31st, the day Trump made his State of the Union speech.

Most-shared stories on Facebook about Trump

Knowing what events the media published most about is great, but we can go further and see exactly what stories people read and shared the most.

  1. “Trump attacks protections for immigrants from ‘shithole’ countries in Oval Office meeting,” Washington Post. 872,383 shares.
  2. “Trump referred to Haiti and African countries as ‘shithole’ nations,” NBC News. 396,978 shares.
  3. “Trump, Defending His Mental Fitness, Says He’s a ‘Very Stable Genius’,” The New York Times. 277,664 shares.

These are just two topics that interested us last month, but with the vast amount of enriched content that the News API gives you access to, you can dive into any topic, popular or niche, and start generating insights. Sign up for your free trial below and dive in!

News API - Sign up


Advances in Natural Language Processing and Machine Learning are broadening the scope of what technology can do in people’s everyday lives, and because of this, an unprecedented number of people are developing a curiosity about these fields. With the availability of educational content online, it has never been easier to go from curiosity to proficiency.

We gathered some of our favorite resources together so you will have a jumping-off point for studying these fields on your own. Some of the resources here are suitable for absolute beginners in either Natural Language Processing or Machine Learning, and others are suitable for those with an understanding of one who wish to learn more about the other.

We’ve split these resources into two categories:

  • Online courses and textbooks for structured learning experiences and reference material
  • NLP and Machine Learning blogs to benefit from the work of some researchers and students who distill current advances in research into interesting and readable posts.

The resources in this post are 12 of the best, not the 12 best, and as such should be taken as suggestions on where to start learning without spending a cent, nothing more!

6 free Natural Language Processing & Machine Learning courses & educational resources:

  1. Speech and Language Processing by Dan Jurafsky and James Martin was first printed in 1999 and its third edition was printed last year. It’s a comprehensive and highly readable introduction to NLP that progresses through the concepts quickly.
  2. Andrew Ng’s course on Machine Learning is probably the best standalone introduction to the topic, because of both the content (delivered by Ng himself) and the structure (weekly readings, videos, and assignments). After this, you can proceed to Ng’s Deep Learning class with a solid foundation.

  3. The Deep Learning book by Goodfellow, Bengio, and Courville is an authoritative textbook on the subject. Some (minor) criticism leveled at it focused on verbose definitions, but if you’re new to the subject, you’ll appreciate this greatly as it gives a bit of context to new concepts.
  4. The video lectures and resources for Stanford’s Natural Language Processing with Deep Learning are great for those who have completed an introduction to Machine Learning/Deep Learning and want to apply what they’ve learned to Natural Language Processing. The programming assignments are in Python.
  5. Sentdex’s YouTube channel is an extensive collection of in-depth educational content, with tutorial series on topics from introductory Machine Learning and Natural Language Processing to training a self-driving car in Grand Theft Auto with Deep Learning. While the Stanford series offers a glimpse into a university class on Deep Learning, these videos (by YouTuber Sentdex) cover the same topics in a much more informal setting. If you’re interested in the “how” of Machine Learning rather than the “why,” you should start here instead of Ng’s class or the Stanford videos.
  6. scikit-learn, a popular Python library for Machine Learning, has a number of hands-on tutorials, including some on text data.


6 Natural Language Processing & Machine Learning blogs to follow

  1. Sebastian Ruder, a research scientist focusing on Transfer Learning and Natural Language Processing here at AYLIEN, is the author of this great blog.
  2. Vered Shwartz authors the cautiously titled Probably Approximately a Scientific Blog, which explains Natural Language Processing concepts and research in accurate and interesting ways (like this explanation of one of the challenges of NLP – ambiguity – via Dad jokes).
  3. Sujit Pal is a developer who frequently updates his blog, Salmon Run. Since Sujit comes from a programming rather than a scientific background, his blog is great for programmers who want to learn from someone proficient in Machine Learning.
  4. Ben Frederickson writes posts on his blog about technical and NLP-related subjects, like this post on Unicode, as well as lots of other topics, like this great post on recommending music with different algorithms.
  5. Although it hasn’t been updated in a while, Kavita Ganesan’s Text Analytics 101 has a list of useful explainers for NLP concepts such as N-grams, as well as not-strictly-NLP things you might find useful, such as comparisons of CrowdFlower and Amazon’s Mechanical Turk.
  6. Finally (this isn’t a blog), to keep up with current developments in NLP and Machine Learning research, interesting articles about the subject, and the newest software libraries, sign up for NLP News, a fortnightly newsletter thoughtfully curated by Sebastian.

Looking into some of these educational resources and keeping an eye on these blogs is a great way to become more proficient in Natural Language Processing and Machine Learning. Be sure to also check the research section of our website and our blog to read about new research from the Science team!

While these are great resources for starting a journey to become proficient in Natural Language Processing and Machine Learning, you can leverage these technologies in minutes with our three NLP solutions.

Text Analysis API - Sign up


Frequent and ongoing communication with customers and users is key to the success of any business. That’s where tools like Intercom and Zendesk excel by helping companies listen and talk to their customers in a seamless and channel-agnostic manner.

Last year we decided to move all our customer communication to Intercom, and both we and our customers have been extremely happy with the experience and the results so far. However, every now and then our support channels get abused by internet trolls (or extremely-angry-for-no-apparent-reason visitors) who have too much time on their hands and come all the way to our website to try and harass us:



This is not cool, and it’s exactly what our support team doesn’t need on a busy Monday morning (or at any other time)!

So like any responsible CEO, when I saw this, I decided to take action! Here was my plan:

  • Build an offensive speech detector that, given a message, determines whether it’s offensive or not; and
  • Use this detector to analyze all incoming messages on Intercom and identify the offensive ones, and then respond to the offenders with a funny random GIF meme.

I spent a few hours during my Christmas break building this. Here’s a glimpse of what the end result looks like:


EDIT: we deployed this detector only for the first few weeks after we posted this blog, so if you’re reading this message and decide to try it out on our Intercom, your message will reach us now. Instead of a personalized meme, your insults will be met with the hurt feelings of our sales team. 

Pretty cool, huh? For the rest of this blog I will explain how I went about building this, and how you can build your own Intercom troll police bot in 3 steps:

  • Step 1: Train an offensive speech detector model using AYLIEN’s Text Analysis Platform (TAP)
  • Step 2: Set up a callback mechanism using Intercom’s webhooks functionality and AWS Lambda to monitor and analyze incoming messages
  • Step 3: Connect all the pieces together and launch the bot

Before proceeding, please make sure you have the following requirements satisfied:

  • An active TAP account (TAP is currently in private beta, but you can get 2 months free as a beta user by signing up here)
  • An Amazon Web Services (AWS) account
  • An Intercom account

Step 1. Training an offensive speech detector model

First, we need to find a way to identify offensive messages. This can be framed as a text classification task, where we train a model that, for any given message, predicts a label such as “offensive” or “not offensive” based on the contents of the message. It’s pretty similar to a spam detector that, for each incoming email, tries to determine whether it’s spam or not and classifies it accordingly.

We need to train our offensive speech detector model on labeled data that contains examples of both offensive and non-offensive messages. Luckily, there’s a great dataset available for this purpose that we can obtain from here and train our model on.

Great, so now we have the data to train the model. But how do we actually do it?

We’re going to use our newest product offering, AYLIEN Text Analysis Platform (TAP) for building this model. TAP allows users to upload their datasets as CSV files, and train custom NLP models for text classification and sentiment analysis tasks from within their browser. These models can then be exposed as APIs, and called from anywhere.

Our steps to follow are:

  • Uploading the dataset
  • Creating training and test data
  • Training the model
  • Deploying the model

Uploading the dataset

Let’s download the labeled_data.csv file from the “hate speech and offensive language” repository linked above, and once downloaded, head to the My Datasets section in TAP to create a new dataset.

Create a new dataset by clicking on Create Dataset and then click on Upload CSV:



Select and upload labeled_data.csv, and click on Convert which will take you to the Preview screen:



Assign the Document role to the tweet column, and the Label role to the class column, and click on Convert to Dataset in the bottom right corner, to convert the CSV file to a dataset:



Please note that the original dataset uses numerical values for labels (0-2), which have the following meaning:

  • 0 – Hate speech
  • 1 – Offensive language
  • 2 – Neither

For clarity we have renamed the labels in our dataset to match the above.

Creating training and test data

In the dataset view click on the Split button to split the dataset into training and test collections:



Set the split ratio to 95% and hit Split Dataset to split this dataset. Once the split job is finished, click on the Train button to proceed to training.

Training the model

In the first step of the training process, click to expand the Parameters section, and click to enable “Whether or not to remove default stopwords” to ask the model to ignore common words such as “the” or “it”.



Afterwards click on Next Step to start the training process.

Evaluating the model

Once the model is trained, we can have a quick look at the evaluation stats to see how well our model is performing on the test data:



Additionally, we can use Live Evaluation to interact with the newly trained model and get a better sense of how it works:



Deploying the model

Now click on the Deploy tab and then click on Deploy Model:



Once the deployment process is finished, you will be provided with the details for the API:



We have now trained an offensive speech detection model that we can access from AWS Lambda. Now let’s build the pipeline for retrieving and processing incoming Intercom messages.

Step 2. Monitoring and processing incoming Intercom messages

Now that we have built our offensive speech detection model, we need a way to run each incoming message on Intercom through the model to determine whether it’s offensive or not, and respond with a funny meme if it is.

We will use Intercom’s handy webhooks capability to achieve this. Webhooks act as web-wide callback functions, allowing one web service to notify another upon an event by emitting an HTTP request each time that event occurs.

To make this work, we need to point Intercom to a web service that it can ping upon every new message that is posted on Intercom. You can implement the web service in pretty much any programming language and host it somewhere on the internet. Given that our web service in this case is fairly minimal and light, we’re going to use AWS Lambda which makes it very easy to build, host and expose small microservices such as this one without managing any backend infrastructure.

The overall workflow is as follows:

  • User submits a message on Intercom
  • Intercom notifies our AWS Lambda web service by sending a webhook
  • Our Lambda service analyzes the incoming message using TAP, and if it’s deemed to be offensive, sends back a random funny meme to the Intercom chat (courtesy of Giphy!)

Building the AWS Lambda microservice

To recap from above, we need to build a service that is accessible as an API for Intercom’s webhook to hit and notify us about new messages. Luckily, Lambda makes building services like this extremely easy.

Navigate to AWS Lambda’s dashboard and hit Create function or click here to create a new Lambda function.



In the Create function form, choose the “Author from scratch” option to build your function from scratch:



Next, enter a name and create a new role for the function. We’re going to use Node.js for implementing the service in this instance, but you can choose from any of the available Runtimes:



Now that our function is created, we need to implement the logic for our service. Replace the contents of index.js with the following script:
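
A sketch of what the handler can look like is shown below: it parses the webhook payload, strips the HTML from the message, asks the TAP model whether the message is offensive, and replies with a GIF if it is. The TAP endpoint URL and response fields in this sketch are assumptions (take the real values from your Deploy screen), the GIF list stands in for the Giphy lookup, and the Intercom call uses the standard conversation-reply endpoint.

```javascript
// index.js: a minimal sketch of the handler, not the original script.
// The TAP endpoint URL and response fields are assumptions; adjust them to the
// details shown on your TAP Deploy screen.
'use strict';

const rp = require('request-promise');
const striptags = require('striptags');

// The four placeholders described below.
const TAP_MODEL_ID = 'TAP_MODEL_ID';
const TAP_API_KEY = 'TAP_API_KEY';
const INTERCOM_ACCESS_TOKEN = 'INTERCOM_ACCESS_TOKEN';
const INTERCOM_ADMIN_ID = 'INTERCOM_ADMIN_ID';

// Minimum confidence before we treat a prediction as offensive (see "Things to try next").
const CONFIDENCE_THRESHOLD = 0.5;

// Stand-in for the random Giphy meme: replace with GIF URLs of your choice.
const MEMES = [
  'https://media.giphy.com/media/REPLACE_ME_1/giphy.gif',
  'https://media.giphy.com/media/REPLACE_ME_2/giphy.gif',
];

exports.handler = async (event) => {
  // API Gateway delivers the Intercom webhook notification in the request body.
  const notification = typeof event.body === 'string' ? JSON.parse(event.body) : event.body;
  const item = notification.data.item;
  const message = striptags((item.conversation_message && item.conversation_message.body) || '');

  // Ask the TAP model to classify the message.
  // NOTE: hypothetical endpoint and response format.
  const prediction = await rp({
    method: 'POST',
    uri: `https://api.aylien.com/tap/v1/models/${TAP_MODEL_ID}/classify`,
    headers: { 'X-AYLIEN-TAP-Application-Key': TAP_API_KEY },
    body: { text: message },
    json: true,
  });

  const offensive =
    prediction.label !== 'Neither' && prediction.confidence >= CONFIDENCE_THRESHOLD;

  if (offensive) {
    // Reply to the conversation as an admin with a random GIF.
    const meme = MEMES[Math.floor(Math.random() * MEMES.length)];
    await rp({
      method: 'POST',
      uri: `https://api.intercom.io/conversations/${item.id}/reply`,
      headers: { Authorization: `Bearer ${INTERCOM_ACCESS_TOKEN}`, Accept: 'application/json' },
      body: {
        type: 'admin',
        admin_id: INTERCOM_ADMIN_ID,
        message_type: 'comment',
        body: `<img src="${meme}"/>`,
      },
      json: true,
    });
  }

  return { statusCode: 200, body: 'ok' };
};
```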

Be sure to replace the four placeholders with real values:

  • You can retrieve your TAP_MODEL_ID and TAP_API_KEY from the Deploy screen in TAP
  • You can retrieve your INTERCOM_ACCESS_TOKEN by going to Authorization > Access token
  • Finally, you can retrieve your INTERCOM_ADMIN_ID either from the webhook payload (it’s located in–see Step 3) or by calling the List Admins endpoint in the Intercom API

Note that we must provide the two packages required by our script, “request-promise” and “striptags”, in a node_modules folder. Lambda allows us to upload a ZIP bundle that contains the scripts and their dependencies using the dropdown on the top left corner of the code editor view:



You can download the entire ZIP bundle including the dependencies from here. Simply upload this as a ZIP bundle in Lambda, and you should have both the script and its dependencies ready to go. To create your own ZIP bundle you can create a new folder on your computer, put index.js there and install the two packages using npm, then zip the entire folder and upload it.

Finally, we need to expose this Lambda function as an API that is accessible by Intercom for sending a webhook. We can achieve this in Lambda using API Gateway. Let’s add a new trigger of type “API Gateway” to our Lambda function:



You will notice that the API Gateway must be configured. Let’s click on it to view and set the configuration parameters:



Note that for simplicity we have set the Security policy to open, which means the API gateway is openly accessible by anyone that knows the URI. For a production application you will most likely need to secure this endpoint, for example by choosing Open with access key which will require an API key for sending requests to the service.

Make sure you hit the Save button at the top right corner after each change:



Now that we have our API gateway created and configured, we need to retrieve its endpoint URI and provide it to the Intercom webhook. Copy the “Invoke URL” value from the API Gateway section in the Lambda function view:



Creating the Intercom webhook

Our Lambda microservice is created and exposed as an API. The next step is to instruct Intercom to hit the web service for every new message by sending a webhook.

In order to do this, head to the Intercom Developer Hub located here. From your dashboard navigate to Webhooks:



Then click on Create webhook in the top right corner to create a new webhook, and paste the Lambda service’s URI from the previous step into the “Webhook URL” field.


The key events that we would like Intercom to notify our service upon are “New message from a user or lead” and “Reply from a user or lead”, both of which indicate a new message from a user has been posted.

Please note that for simplicity we are not encrypting the outgoing notifications in this example. In a real-world scenario you will most likely want to leverage this facility, since otherwise the recipient of the webhook will receive all your Intercom messages unencrypted.

Step 3. Connecting the pieces and launching the bot

We have created our AWS Lambda service, and instructed Intercom to ping it every time a new message is posted by a user. Now each time a user posts a message on Intercom, a notification similar to the one below will be sent to our web service:
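
The exact payload depends on the webhook topic, but, abridged and with placeholder values, it looks roughly like this (see Intercom’s webhook documentation for the full format):

```json
{
  "type": "notification_event",
  "topic": "conversation.user.created",
  "data": {
    "item": {
      "type": "conversation",
      "id": "1234567890",
      "assignee": { "type": "admin", "id": "814860" },
      "conversation_message": {
        "type": "conversation_message",
        "author": { "type": "lead", "id": "5a4a0a1b2c3d4e5f" },
        "body": "<p>your API is garbage</p>"
      }
    }
  }
}
```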

The Lambda service parses these notifications, invokes TAP to see if they are offensive, and if it finds a message offensive, hits the Intercom API to respond to the offender with a random funny meme.

With the webhook active and the service exposed and configured, we are ready to test our bot.

Head to your Intercom widget and, well, send an offensive message – and be prepared to get busted by the troll bot!

[Screenshot: the troll bot replying in the Intercom messenger]


And that’s it. We can now sit back, relax, and enjoy roasting trolls! 🙂

Note: To stop the bot, all you need to do is disable the webhook from the Intercom developer hub dashboard, to prevent it from invoking the Lambda script.

Things to try next:

  • Adjust the minimum threshold for the confidence score (currently set to 0.5 in index.js) based on your preferences. A lower value will result in a higher number of meme responses and potentially more false positives, whereas a higher value will only trigger a meme response if the classifier is confident about a message being offensive.
  • Download a dump of your previous (non-offensive) messages from Intercom as explained here, add the cleaned-up messages to the “Not offensive” label in your TAP dataset, and train a new model. This should improve the model’s accuracy and enable it to better distinguish between offensive and non-offensive messages.


Text Analysis Platform


Last week, Snapchat unveiled a major redesign of their app that received quite a bit of negative feedback. As a video-sharing platform that has integrated itself into users’ daily lives, Snapchat relies on simplicity and ease of use. So when large numbers of these users begin to express pretty serious frustration about the app’s new design, it’s a big threat to their business.


You can bet that right now Snapchat are analyzing exactly how big a threat this backlash is by monitoring the conversation online. This is a perfect example of businesses leveraging the Voice of their Customer with tools like Natural Language Processing. Businesses that track their product’s reputation online can quantify how serious events like this are and make informed decisions on their next steps. In this blog, we’ll give a couple of examples of how you can dive into online chatter and extract important insights on customer opinion.

This TechCrunch article pointed out that 83% of Google Play Store reviews in the immediate aftermath of the update gave the app one or two stars. But as we mentioned in a blog last week, star rating systems aren’t enough – they don’t tell you why people feel the way they do and most of the time people base their star rating on a lot more than how they felt about a product or service.

To get accurate and in-depth insights, you need to understand exactly what a reviewer is positive or negative about, and to what degree they feel this way. This can only be done effectively with text mining.

So in this short blog, we’re going to use text mining to:

  1. Analyze a sample of the Play Store reviews to see what Snapchat users mentioned in reviews posted since the update.
  2. Gather and analyze a sample of 1,000 tweets mentioning “Snapchat update” to see if the reaction was similar on social media.

In each of these analyses, we’ll use the AYLIEN Text Analysis API, which comes with a free plan that’s ideal for testing it out on small datasets like the ones we’ll use in this post.


What did the app reviewers talk about?

As TechCrunch pointed out, 83% of reviews posted since the update shipped gave one or two stars, which gives us a high-level overview of the sentiment shown towards the redesign. But to dig deeper, we need to look into the text and see what people were actually talking about in all of these reviews.

As a sample, we gathered the 40 reviews readily available on the Google Play Store and saved them in a spreadsheet. We can analyze what people were talking about in them by using our Text Analysis API’s Entities feature. This feature analyzes a piece of text and extracts the people, places, organizations and things mentioned in it.

One of the types of entities returned to us is a list of keywords. To get a quick look into what the reviewers were talking about in a positive and negative light, we visualized the keywords extracted along with the average sentiment of the reviews they appeared in.

From the 40 reviews, our Text Analysis API extracted 498 unique keywords. Below you can see a visualization of the keywords extracted and the average sentiment of the reviews they appeared in from most positive (1) to most negative (-1).
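If you’d like to reproduce this kind of keyword-level sentiment aggregation yourself, here’s a minimal Python sketch of the idea using plain HTTP calls. The endpoint paths, header names, and response fields follow the classic Text Analysis API documentation and should be treated as assumptions, so check the current docs before running it.

```python
# A minimal sketch (not the exact script we used): for each review, call the
# Text Analysis API's /entities and /sentiment endpoints, then average each
# keyword's sentiment across the reviews it appears in. Endpoint paths, header
# names, and response fields are assumptions based on the classic API docs.
from collections import defaultdict
import requests

API_ROOT = "https://api.aylien.com/api/v1"
HEADERS = {
    "X-AYLIEN-TextAPI-Application-ID": "YOUR_APP_ID",
    "X-AYLIEN-TextAPI-Application-Key": "YOUR_APP_KEY",
}

POLARITY_SCORE = {"positive": 1, "neutral": 0, "negative": -1}

def keyword_sentiment(reviews):
    """Map each extracted keyword to the average sentiment of its reviews."""
    scores = defaultdict(list)
    for text in reviews:
        sentiment = requests.get(f"{API_ROOT}/sentiment",
                                 headers=HEADERS, params={"text": text}).json()
        entities = requests.get(f"{API_ROOT}/entities",
                                headers=HEADERS, params={"text": text}).json()
        score = POLARITY_SCORE.get(sentiment.get("polarity"), 0)
        for keyword in entities.get("entities", {}).get("keyword", []):
            scores[keyword.lower()].append(score)
    return {k: sum(v) / len(v) for k, v in scores.items()}

reviews = [
    "Love the new Bitmoji features!",
    "The new layout is unintuitive and frustrating.",
]
for keyword, avg in sorted(keyword_sentiment(reviews).items(), key=lambda kv: kv[1]):
    print(f"{keyword}: {avg:+.2f}")
```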

First of all, you’ll notice that keywords like “love” and “great” are high on the chart, while “frustrating” and “terrible” are low on the scale – which is what you’d expect. But if you look at keywords that refer to Snapchat, you’ll see that “Bitmoji” appears high on the chart, while “stories,” “layout,” and “unintuitive” all appear low on the chart, giving an insight into what Snapchat’s users were angry about.


How did Twitter react to the Snapchat update?

Twitter is such an accurate gauge of what the general public is talking about that the US Geological Survey uses it to monitor for earthquakes – because the speed at which people react to earthquakes on Twitter outpaces even their own seismic data feeds! So if people Tweet about earthquakes during the actual earthquakes, they are absolutely going to Tweet their opinions of Snapchat updates.

To get a snapshot of the Twitter conversation, we gathered 1,000 Tweets that mentioned the update. To gather them, we ran a search using the Twitter Search API (this is really easy – take a look at our beginners’ guide to doing this in Python).

After we gathered our Tweets, we analyzed them with our Sentiment Analysis feature, and as you can see, the Tweets were overwhelmingly negative:
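For anyone who wants to reproduce this tally, here’s a minimal Python sketch that classifies each Tweet and counts the polarities. The endpoint, header names, and the “tweet” mode parameter follow the classic Text Analysis API docs and should be treated as assumptions.

```python
# A rough sketch of the tally described above: classify each tweet's polarity
# with the Text Analysis API's /sentiment endpoint and count the results.
# Header names and the "tweet" mode parameter are assumptions per the docs.
from collections import Counter
import requests

API_ROOT = "https://api.aylien.com/api/v1"
HEADERS = {
    "X-AYLIEN-TextAPI-Application-ID": "YOUR_APP_ID",
    "X-AYLIEN-TextAPI-Application-Key": "YOUR_APP_KEY",
}

def sentiment_breakdown(tweets):
    counts = Counter()
    for tweet in tweets:
        response = requests.get(
            f"{API_ROOT}/sentiment",
            headers=HEADERS,
            params={"text": tweet, "mode": "tweet"},  # "tweet" mode: assumption
        ).json()
        counts[response.get("polarity", "unknown")] += 1
    return counts

print(sentiment_breakdown(["I hate the new Snapchat update", "Actually I like it"]))
```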

Quantifying the positive, negative, and neutral sentiment shown towards the update on Twitter is useful, but using text mining we can go one step further and extract the keywords mentioned in every one of these Tweets. To do this, we use the Text Analysis API’s Entities feature.

Disclaimer: this being Twitter, there was quite a bit of opinion expressed in a NSFW manner 😉


The number of expletives we identified as keywords reinforces the severity of the opinion expressed towards the update. You can see that “stories” and “story” are two of the few prominently-featured keywords that referred to feature updates, while keywords like “awful” and “stupid” are good examples of the most-mentioned keywords in reaction to the update as a whole.

It’s clear that text mining processes like sentiment analysis and entity extraction can provide a detailed overview of public reaction to an event by extracting granular information from product reviews and social media chatter.

If you can think of insights you could extract with text mining about topics that matter to you, our Text Analysis API allows you to analyze 1,000 documents per day free of charge and getting started with our tools couldn’t be easier – click on the image below to sign up.

Text Analysis API - Sign up


Online review sites are the world’s repository of customer opinion – every day, hundreds of thousands of customers give publicly available feedback on their experiences with businesses. With customer opinion available on a scale like this, anyone can generate insights about their business, their competitors, and potential opportunities.

But to leverage these sites, you need to understand what is being talked about positively and negatively in the text of hundreds or thousands of reviews. Since analyzing that many reviews manually would be far too time consuming, most people don’t consider any kind of quantitative analysis beyond looking at the star ratings, which are vague and frequently misleading.

So in this blog, we’re going to show you how to use text mining to quickly generate accurate insights from thousands of reviews. We’re going to scrape and analyze restaurant reviews from TripAdvisor and show you how easy it is to build a robust sentiment analysis workflow without writing any code, using a point-and-click web scraper and the AYLIEN Text Analysis Add-on for Google Sheets.

We’ll break the process down into three easy-to-follow steps:

  1. We’ll show you how to scrape reviews from TripAdvisor with a point-and-click scraping tool
  2. We’ll use the AYLIEN Text API Google Sheets Add-on to analyze the sentiment expressed in each review toward 13 aspects of the dining experience.
  3. We’ll show you the results of our sample analysis

As we mentioned, neither of the tools we’ll use requires coding skills, and you can use both of them for free.


Why are star reviews not enough on their own?

Take a look at the difference between these three-star reviews (which are for the same branch of the same restaurant chain):


[Screenshots: two three-star reviews of the same restaurant branch]


From looking at these reviews, you can spot two important things about the star ratings and the review texts:

  1. Even though the star rating is the same, one of the reviews is positive while the other is negative. This gap between the star rating and what the reviewer really thought is part of the reason Netflix recently ditched the star review system.
  2. The text review allows you to see why the review is positive or negative – the specific aspects that made their dining experience positive or negative.

So to get an accurate analysis of customer opinion from reviews, you need to read the text of every review. The problem is that doing this at any real scale is prohibitively time consuming. But we can solve this problem using text analytics and machine learning.


How to scrape reviews from TripAdvisor with a point-and-click tool

In order to find out what people are saying about businesses, we first need to gather the reviews. For this blog, we decided to analyze customer reviews of Texas Roadhouse, ranked by Business Insider as America’s best restaurant chain.

We chose to compare reviews of their branch in Gatlinburg, Tennessee with those of the branch in Dubai, as this might let us see how customers in very different regions are responding to the Texas Roadhouse offering. Each of these branches had more than 1,000 reviews, which gives us a generous amount of data to analyze.




Usually, gathering data like this would involve writing code to scrape the review sites, but a point-and-click scraping tool makes this task a lot easier – it lets you scrape sites by simply pointing and clicking on the data you want. You can sign up for a free trial here and see a handy introductory video here (but we’ll walk you through the process below).

Once you’ve picked which restaurant you want to analyze and signed up for a trial, open the restaurant’s TripAdvisor page in the scraper by entering the URL in the New Extractor input box. If you point and click on the text of a review, the extractor will scrape all of the reviews on the page and save them for you.


[Screenshots: creating a new extractor from the TripAdvisor URL and selecting the review text to extract]


You’ve now scraped the reviews from a single page. But since you’ll probably want a lot more than the 10 reviews shown on each TripAdvisor page, we’ll show you how to scrape a few hundred in one go.

Scraping hundreds of reviews at once

You may notice that when you are browsing reviews of a restaurant on TripAdvisor, the page URL changes every time you move to the next 10 reviews – it adds “-or10” for the next ten results, “-or20” for the following ten, and so on. You can see this segment in the URL right before the restaurant name.

In our Texas Roadhouse example, the URL of the second page of reviews ends with the “-or10” segment, the third page with “-or20”, and so on.

The scraping tool allows us to scrape numerous webpages at once if we upload a list of URLs in a spreadsheet. So to gather 1,000 restaurant reviews, we need to upload a spreadsheet with 100 of these URLs, with the “-or10” offset increasing by 10 each time.

To make your life a little easier, here’s the simple six-step workaround we used:

Step 1: Select the URL of the second page of reviews for the restaurant you want to analyze (the page whose URL contains “-or10”).

Step 2: Open up a spreadsheet and fill the first three cells of column A (A1, A2, and A3) with the URL, but only up to “-or10” – cut the remainder of the URL and paste it somewhere else for now (in our case we’ll cut “-Texas_Roadhouse-Dubai_Emirate_of_Dubai.html” and paste it into another cell).

Step 3: Edit cells A2 and A3 so they end with “-or20” and “-or30”, respectively. Then select all three cells and drag the selection down until you have 100 rows covered. Excel or Google Sheets will follow the pattern you set in the first three cells, incrementing the offset by 10 each time.


Step 4: Since these are not yet complete URLs, you need to append the remainder of the URL you cut in Step 2 to the end of each cell. You can do this by selecting a cell in column B (B1, say), typing =A1&"[the rest of your URL]", and extending that formula downwards over all 100 rows.

[Screenshot: the spreadsheet formula that appends the rest of the URL]

Step 5: Copy the new column and paste it into column A as values only, then delete the helper column and save your spreadsheet. Your spreadsheet should now contain a single column of 100 complete URLs.

Step 6: Open the scraping tool, create a new extractor, and open its settings. Click Import URLs, select the spreadsheet with your URLs, and save them. Once you click Run URLs, the extractor will start scraping the 1,000 reviews from the URLs you’ve given it. When it’s done, download the results and open the file in Google Sheets.

[Screenshot: importing the list of URLs into the extractor]
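If you’d rather skip the spreadsheet steps entirely, a few lines of Python can generate the same list of URLs. The prefix and suffix below are placeholders: split your own restaurant’s TripAdvisor URL around the “-orN” segment.

```python
# A minimal sketch: build the 100 paginated TripAdvisor URLs (offsets 10-1000
# in steps of 10) and write them to a CSV you can import into the scraper.
# PREFIX and SUFFIX are placeholders -- split your own restaurant URL around
# the "-orN" segment.
import csv

PREFIX = "https://www.tripadvisor.com/Restaurant_Review-XXXX-Reviews"  # placeholder
SUFFIX = "-Texas_Roadhouse-Dubai_Emirate_of_Dubai.html"

urls = [f"{PREFIX}-or{offset}{SUFFIX}" for offset in range(10, 1010, 10)]

with open("review_urls.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for url in urls:
        writer.writerow([url])

print(f"Wrote {len(urls)} URLs")
```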


Analyzing the Sentiment of Reviews

So at this point, we’ve gathered 1,000 reviews of each Texas Roadhouse branch, with each review containing a customer’s feedback about their experience in the restaurant. In every review, customers express positive, negative, and neutral sentiment toward the various aspects of their experience.

AYLIEN’s Aspect-Based Sentiment Analysis feature detects the aspects mentioned in a piece of text, and then analyzes the sentiment shown toward each of these aspects. In this blog we’re analyzing restaurants, but you can also use this feature to analyze reviews of hotels, cars, and airlines. In the restaurants domain, the Aspect-Based Sentiment Analysis feature detects mentions of seven aspects.
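If you want to call the feature directly rather than through the Add-on, here’s a minimal Python sketch for the restaurants domain. The /absa/restaurants path, header names, and response fields follow the classic Text Analysis API docs and should be treated as assumptions.

```python
# A minimal sketch of an Aspect-Based Sentiment Analysis call for the
# restaurants domain. Endpoint path, header names, and response fields are
# assumptions based on the classic Text Analysis API docs.
import requests

API_ROOT = "https://api.aylien.com/api/v1"
HEADERS = {
    "X-AYLIEN-TextAPI-Application-ID": "YOUR_APP_ID",
    "X-AYLIEN-TextAPI-Application-Key": "YOUR_APP_KEY",
}

review = "The steaks were fantastic but we waited 45 minutes for a table."
response = requests.get(f"{API_ROOT}/absa/restaurants",
                        headers=HEADERS, params={"text": review}).json()

# Print each detected aspect and the polarity shown towards it.
for aspect in response.get("aspects", []):
    print(aspect.get("aspect"), "->", aspect.get("polarity"))
```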



Using our Text Analysis API is easy with the Google Sheets Add-on, which you can download for free here (the Add-on comes with 1,000 credits free so you can test it out). You can complete the analysis by following these three easy steps:

Step 1: Once you’ve downloaded the Add-on, it will be available in the Add-ons menu in your Google Sheets toolbar. Open it up by selecting it and clicking Start.

Step 2: Before you begin the analysis of your reviews, first select Aspect-Based Sentiment Analysis from the Analysis Type menu, then select all of the cells that contain your reviews.

Step 3: To begin the sentiment analysis, click Analyze. The Text API will then extract the aspects mentioned in each review one by one, and print them in three columns next to the review – Positive, Negative, and Neutral. These results will be returned to you at a rate of about three per second, so our 2,000 reviews should take around ten minutes to analyze.



Results of the Aspect-Based Sentiment Analysis

At this point, each line of your spreadsheet will contain a column of the reviews you gathered, a column of the aspects mentioned in a positive tone, one with the aspects mentioned in a negative tone, and one with aspects mentioned in a neutral tone. To get a quick look into our data, we put together the following visualizations by simply using the spreadsheet’s word counting function.

First off, let’s take a look at the most-mentioned aspects in all of the reviews we gathered. To do this, all you need to do is separate every aspect listed in Google Sheets into its own cell using a simple function, and then use a formula to count them.

[Chart: the most-mentioned aspects across all of the reviews]


To put each aspect into its own cell, we’ll use the Split text to columns function in the Data menu. This function moves every word in a cell into a cell of its own by splitting the cell horizontally – that is, if a cell in column A contains three words, the Split text function will move the second and third words into the adjacent cells in columns B and C.

From the pie chart, we can see that food and staff alone accounted for almost two thirds of the total mentions, and after that there’s a bit of a drop off. After these, the aspects customers mentioned most were how busy the restaurant was and the value of the meal.

Knowing which aspects of the dining experience people were most likely to leave reviews about is useful, but we can go further and analyze the sentiment attached to each aspect. Let’s take a look at the sentiment attached to each aspect in each of the Texas Roadhouse branches.

To do this, use Google Sheets’ COUNTIF formula to count every time the Text API listed an aspect in the positive, negative, and neutral columns. Do this by creating a table with each aspect as a row and Positive, Negative, and Neutral as columns, and use the following formula: =COUNTIF(range of cells containing the aspects for that sentiment, "*aspect*").

After you’ve entered the formula, fill it out as in the example below, which counts the number of times food is mentioned positively: =COUNTIF(B1:B988,"*food*").

[Screenshot: the COUNTIF formula in the results spreadsheet]
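If you’d prefer to do this tally in a script rather than in Sheets, here’s a minimal Python sketch that does the same counting with collections.Counter. It assumes you’ve exported your results to a CSV with hypothetical “positive”, “negative”, and “neutral” columns, each holding the aspects listed for that review.

```python
# A minimal sketch of the same count done in Python instead of COUNTIF.
# Assumes a hypothetical reviews.csv export with columns named "positive",
# "negative", and "neutral", each holding a space-separated list of aspects.
import csv
from collections import Counter

counts = {"positive": Counter(), "negative": Counter(), "neutral": Counter()}

with open("reviews.csv", newline="") as f:
    for row in csv.DictReader(f):
        for sentiment in counts:
            for aspect in (row.get(sentiment) or "").split():
                counts[sentiment][aspect.lower()] += 1

for sentiment, counter in counts.items():
    print(sentiment, counter.most_common(5))
```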


Once you’ve done this, fill in the results on a table like the one below, and then insert a chart from the Insert tab.

[Screenshot: the table of aspect counts by sentiment]

We chose a stacked bar chart, as it allows us to get a quick grasp of what aspects people were interested in and how they felt about each aspect. First off, take a look at the sentiment shown to each aspect by the reviewers of the Dubai branch. You can see that the reviews are very positive:

When we compare the reviews of the Dubai branch above with the Tennessee reviews, we can see immediately that the American branch received more positive reviews than its Dubai counterpart:

Interestingly, we can also see from the volume of mentions of each aspect that customers in Dubai were more concerned with value than their American counterparts, while reviewers in Tennessee paid more attention to the restaurant staff (with most of this extra attention being negative).


These are just a few things that jumped out at us after a sample analysis of a couple of restaurants. If you want to start leveraging TripAdvisor (or another review site) for your own research using the steps in this blog, sign up for a free trial of the scraping tool here, and download our Google Sheets Add-on here (there’s no sign-up required for the Add-on and it comes with free credits so you can test it out).

Text Analysis API - Sign up


It’s now the end of an eventful year that saw the UK begin negotiations to leave the EU, the fight of the century between a boxer and a mixed martial artist, and the discovery of alternative facts. The world’s news publishers reported all of this and the countless other events that shaped 2017, leaving a vast record of what the media was talking about right through the year.

Using Natural Language Processing, we can dive into this record to generate insights about topics that interest us. Our News API has been hard at work gathering, analyzing, and indexing over 25 million news stories in near-real time over 2017. The News API extracts and stores dozens of data points on every story, from classifying the subject matter to analyzing the sentiment, to listing the people, places, and things mentioned in every one.

This enriched content provides us with a vast dataset of structured data about what the world was talking about throughout the year, allowing us to take a quantitative look at the news trends of 2017.

Using the News API, we’re going to dive into two questions on topics that dominated last year’s news coverage:

  1. What was the coverage of Donald Trump’s first year in office like?
  2. What trends affected sports coverage – consistently the most popular category – in 2017?


Trump’s first year in office

How much did the media publish?

Any review of 2017’s news has to begin with Donald Trump and his first year in office as President. To begin with, we wanted to see how the US President was covered over the course of the year, to see which events the media covered the most. To do this, we used the Time Series endpoint to analyze the daily volume of stories that mentioned Trump in the title.
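For anyone who wants to run this query themselves, here’s a minimal Python sketch of the Time Series call. The endpoint path, header names, and parameter names follow the News API documentation of the period and should be treated as assumptions.

```python
# A minimal sketch of a Time Series query for stories with "Trump" in the
# title, bucketed by day. Endpoint path, header names, and parameter names
# are assumptions based on the News API docs of the period.
import requests

API_ROOT = "https://api.newsapi.aylien.com/api/v1"
HEADERS = {
    "X-AYLIEN-NewsAPI-Application-ID": "YOUR_APP_ID",
    "X-AYLIEN-NewsAPI-Application-Key": "YOUR_API_KEY",
}

params = {
    "title": "Trump",
    "published_at.start": "2017-01-01T00:00:00Z",
    "published_at.end": "2017-12-31T23:59:59Z",
    "period": "+1DAY",
}

response = requests.get(f"{API_ROOT}/time_series", headers=HEADERS, params=params).json()
for point in response.get("time_series", []):
    print(point.get("published_at"), point.get("count"))
```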

Take a look at what the News API found:


From this chart, you can see that the media are generally less interested in Trump now than they were during the first month or two of his presidency. Despite the coverage of the Charlottesville protests, the media fixation on Trump is slowly tapering off.


How did sentiment in the coverage of Trump vary over the year?

Knowing what the media was most interested in about the President is useful information, but we can also track the sentiment expressed in each of these stories and see how the overall sentiment polarity changed over time.

Again, we can do this using the Time Series endpoint. Take a look at what the News API found:

You can see that the News API detected the most negative sentiment in stories about Trump around the time of his call with the widow of a fallen US soldier, to whom he reportedly said, “he knew what he was signing up for”. The most positive sentiment was detected around the time of Trump’s speech in Riyadh, and as the NFL kneeling controversy began to expand.

You will also notice spikes in positive sentiment in stories about Trump around both his administration’s repeal of DACA and the point at which more and more NFL players joined the kneeling protests. Since both of these spikes follow shortly after the events themselves, we think the coverage is most likely about the reactions and backlash towards these developments.


What other things were mentioned in stories about Trump?

So we know how both the volume of stories about Trump and their sentiment varied over time. But knowing exactly what other people, organizations, and things were mentioned in these stories across the year would let us see what all of these stories were about.

The News API extracts the entities mentioned in every story it analyzes. Using the Trends endpoint, we can search for the 100 entities that were most frequently mentioned in stories about Trump in 2017. These entities are visualized below.
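Here’s a minimal Python sketch of the Trends query behind this visualization. The “field” value and the other parameter names follow the News API documentation of the period and should be treated as assumptions.

```python
# A minimal sketch of a Trends query: the entities most frequently mentioned
# in the body of stories with "Trump" in the title. The "field" value and
# other parameter names are assumptions based on the News API docs.
import requests

API_ROOT = "https://api.newsapi.aylien.com/api/v1"
HEADERS = {
    "X-AYLIEN-NewsAPI-Application-ID": "YOUR_APP_ID",
    "X-AYLIEN-NewsAPI-Application-Key": "YOUR_API_KEY",
}

params = {
    "field": "entities.body.text",  # assumption: aggregate on body entities
    "title": "Trump",
    "published_at.start": "2017-01-01T00:00:00Z",
    "published_at.end": "2017-12-31T23:59:59Z",
}

response = requests.get(f"{API_ROOT}/trends", headers=HEADERS, params=params).json()
for entity in response.get("trends", [])[:100]:
    print(entity.get("value"), entity.get("count"))
```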

Perhaps unsurprisingly, we can see that Trump coverage was dominated by his campaign’s and administration’s involvement with Russia. What is quite remarkable is the scale of this dominance: Russia was mentioned in more stories with ‘Trump’ in the title than the US itself.


What were the most-shared stories about Trump in 2017?

Seeing which stories were shared the most on social networking sites can be very interesting. It can also yield some important business insights, as the more a story is shared, the more value it generates for advertisers and publishers.

We can do this with the News API by using the Stories endpoint. Since Facebook consistently garners the most shares of news stories of all the social networks, we returned the three stories with the most Facebook shares (the query is sketched after the list):

  1. “Trump Removes Anthony Scaramucci From Communications Director Role,” The New York Times – 1,061,494 shares.
  2. “Trump announces ban on transgender people in U.S. military,” The Washington Post – 696,341 shares.
  3. “Trump admin. to reverse ban on elephant trophies from Africa,” ABC News – 638,917 shares.
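Here’s a minimal Python sketch of the Stories query behind this list. The sort_by value and the other parameter names follow the News API documentation of the period and should be treated as assumptions.

```python
# A minimal sketch of a Stories query: stories with "Trump" in the title,
# sorted by Facebook share count. The sort_by value and other parameter
# names are assumptions based on the News API docs of the period.
import requests

API_ROOT = "https://api.newsapi.aylien.com/api/v1"
HEADERS = {
    "X-AYLIEN-NewsAPI-Application-ID": "YOUR_APP_ID",
    "X-AYLIEN-NewsAPI-Application-Key": "YOUR_API_KEY",
}

params = {
    "title": "Trump",
    "published_at.start": "2017-01-01T00:00:00Z",
    "published_at.end": "2017-12-31T23:59:59Z",
    "sort_by": "social_shares_count.facebook",  # assumption: per the docs
    "per_page": 3,
}

response = requests.get(f"{API_ROOT}/stories", headers=HEADERS, params=params).json()
for story in response.get("stories", []):
    print(story.get("title"), "-", story.get("source", {}).get("name"))
```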


2017 in Sports Coverage

Sports is the subject that the media writes the most about, by quite a bit. This is reflected in the fact that the News API gathered over five million stories about sports in 2017, more than any other single subject category.

To make sense of this content at this scale, we need to first understand the subject matter of each story. To enable us to do this, the News API classifies each story according to two taxonomies.

To analyze the most popular sports, we used the Time Series endpoint to see how the daily volume of stories about the four most popular sports varied over time. We searched for stories that the News API classified as belonging to the categories Soccer, American Football, Baseball, and Basketball in the advertising industry’s IAB-QAG taxonomy. To narrow our search down a bit, we decided to look into autumn, the busiest time of year for sports.

Take a look at what the News API returned:

We can see that the biggest event that caused a spike in stories was Mike Pence’s out-of-the-ordinary appearance at an NFL game as the kneeling protests expanded, a game he left after the players kneeled during the national anthem.

Other than this, the biggest spike in stories was clearly caused by the closing of the English transfer window on the last day of August, showing the dominant presence of soccer in the world’s media outlets.


Who and what were the media talking about?

Being able to see the spikes in the volume of sports stories around certain events is a useful resource to have, but we can use the News API to see exactly what people, places, and organizations were talked about in every one of the over 25 million stories it gathered in 2017.

To do this, we again used the Trends endpoint to find the most-mentioned entities in sports stories from 2017. Take a look at what the News API found:

You can immediately see the dominance of popular soccer clubs in the media coverage, but locations that host popular NFL and NBA teams also feature prominently. However, soccer has a clear lead over its American competitors in terms of media attention, probably due to its global reach.


What were the most-shared sports stories on Facebook in 2017?

The Time Series endpoint showed us that the NFL kneeling protests were the most-covered sports event of 2017. Using the News API, we can also see how many times each one of the over 25 million stories was shared across social media.

Looking at the top three most-shared sports stories on Facebook, we can see that the kneeling protests were the subject of two of them. This shows us that the huge spike in story volume about these protests was responding to genuine public demand – people were sharing these stories with their friends and followers online.

  1. “Wife of ‘American Sniper’ Chris Kyle Just Issued Major Challenge to NFL – Every Player Should Read This,” Independent Journal-Review – 830,383 shares.
  2. “Vice President Mike Pence leaves Colts-49ers game after players kneel during anthem,” Fox News – 829,466 shares.
  3. “UFC: Dana White admits Mark Hunt’s UFC career could be over,” New Zealand Herald – 772,926 shares.


Use the News API for yourself

Well, that concludes our brief look back at a couple of the biggest media trends of 2017. If there are any subjects of interest to you, try out our free two-week trial of the News API and see what insights you can extract. With our easy-to-use SDKs and extensive documentation, you can make your first query in minutes.


News API - Sign up


Being able to leverage news content at scale is an extremely useful resource for anyone analyzing business, social, or economic trends. But in order to extract valuable insights from this content, we sometimes need to build analysis tools that help us understand it.

For everyone who needs a simple, end-to-end solution for this complex task, we’ve put together a fully functional example of a RapidMiner process that sources data from the AYLIEN News API and analyzes it using some of RapidMiner’s operators.


What can you do with the News API in RapidMiner?

With news content now accessible at web scale, data scientists are constantly creating new ways to generate value with insights from news content that were previously almost impossible to extract. Every month, our News API gathers millions of stories in near-real time, analyzes every news story as it is published, and stores each of them along with dozens of extracted data points and metadata.

Equipped with this structured data about what the world’s media is talking about, RapidMiner users can leverage the extensive range of tools the studio has to offer, including:

  • 1,500+ built-in operations & Extensions to dive into your data
  • 100+ data modelling & machine learning operators
  • Advanced visualization tools.

[Image: using a 3-D scatter plot to visualize news data with four variables in RapidMiner Studio]

How do I get started with the News API process?

In this blog, we’re going to showcase an example of how you can use our News API within RapidMiner to build content analysis processes that aggregate and analyze news content with ease. We’ve picked a fun little example that analyzes articles from TechCrunch and builds a classification model to predict which reporter wrote any new article it is shown (pro tip: you can use the same model to pick which TechCrunch journalist you should target with your pitch!). We hope this blog sparks some creative ideas and use cases for combining RapidMiner and our News API.

This sample process consists of two main steps:

  1. Gathering enriched news content from the News API using the Web Mining extension
  2. Building a classification model by using RapidMiner’s Naive Bayes operator.

If you are unfamiliar with RapidMiner, there are some great introductory videos and walkthroughs for beginners on their YouTube channel.

So let’s get started!

We’ve made it really easy to get started with the News API and RapidMiner: download this pre-built process and open it in RapidMiner. Next, grab your credentials for the News API by signing up for our free two-week trial.

Once you’ve downloaded the process and opened it in RapidMiner, you’ll see the main operators outlined in the Process tab. There are seven operators in total: the first three gather data from the News API, while the last four train the classifier.

[Screenshot: the seven operators in the Process tab]

To make your first calls to the News API, the first thing you need to do is build your search criteria. To build your News API query, click on the Set Macros operator in the top left of your console. Once you’ve selected the operator, clicking the Edit List button in the Parameters tab will show you the list of parameters for your News API query. Enter the API credentials (your API key and application ID) that you obtained from the News API developer portal when you signed up, and configure your search parameters – check out the full list of query parameters in our News API documentation to build your search query.

[Screenshot: the Set Macros parameter list]

The purpose of this blog is to build a classifier that will predict which TechCrunch journalist wrote an article. In order to do this, we first need to teach the model by gathering relevant training data from TechCrunch. To get this data, we built a query that searched for every article published on the site in the past 30 days and returned the author of each one, along with the contents of the articles they wrote. The News API can return up to 100 results at a time, but since we wanted more than 100 articles, we used pagination to iterate over the search results for five pages, giving us 500 results. You can see the query we used in the screenshot above.
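If you’d like to see the same query outside of RapidMiner, here’s a rough Python sketch of the paginated search. The parameter names, including the cursor-based pagination and the source.domains[] filter, follow the News API documentation of the period and should be treated as assumptions.

```python
# A rough sketch of the equivalent query outside RapidMiner: pull up to five
# pages of 100 TechCrunch stories from the past 30 days and collect each
# article's author and body. The "cursor" pagination and "source.domains[]"
# parameter names are assumptions based on the News API docs.
import requests

API_ROOT = "https://api.newsapi.aylien.com/api/v1"
HEADERS = {
    "X-AYLIEN-NewsAPI-Application-ID": "YOUR_APP_ID",
    "X-AYLIEN-NewsAPI-Application-Key": "YOUR_API_KEY",
}

params = {
    "source.domains[]": "techcrunch.com",
    "published_at.start": "NOW-30DAYS",
    "published_at.end": "NOW",
    "per_page": 100,
    "cursor": "*",
}

articles = []
for _ in range(5):  # five pages of 100 results = up to 500 articles
    page = requests.get(f"{API_ROOT}/stories", headers=HEADERS, params=params).json()
    for story in page.get("stories", []):
        articles.append((story.get("author", {}).get("name"), story.get("body")))
    cursor = page.get("next_page_cursor")
    if not cursor:
        break
    params["cursor"] = cursor

print(f"Collected {len(articles)} articles")
```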

Importantly, after you have defined these parameters in the Set Macros operator, you’ll need to make the same changes by editing the query list in the Get Page operator within the Process Loop. To do this, double-click on the Loop icon in the Process tab, then double-click the Get Page icon, and select the Edit List button next to Query Parameters.

[Screenshot: the Get Page query parameters]

When you’re entering the parameters, be sure to enter every parameter you entered in the previous window and follow the convention already set in the list (entering the parameter in the “%{___}” format).

News API Results

Once you have defined your parameters in both lists, hit the Run (play) button at the top of the console and let RapidMiner run your News API query. Once it has finished running, you can view the results in the Results window. Below you can see a screenshot of the enriched results with the dozens of data points that the News API returns.

[Screenshot: the enriched results returned by the News API]

Having access to this enriched news content in RapidMiner allows you to extract useful insights from otherwise unstructured data. After running the analysis, you can browse the results of the search using simple visualizations to show data points like sentiment or, as in the graph below, authorship, which shows us which authors published the most articles in the time period we set.

[Chart: number of articles published per author]


Training a Classifier

For the sample analysis in this blog, we’re building a classifier using RapidMiner’s Naive Bayes operator.

Naive Bayes is a common algorithm used in machine learning for data classification. You can read more about it in the novice-friendly explainer blog we wrote, which talks you through how the algorithm works. Essentially, this classifier will guess which author a new article belongs to by learning from features in the training data – the news content we retrieved from our News API results. By analyzing the most common features in the articles from each author, the model learns which words and phrases are more likely to appear in each author’s articles.
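To make the idea concrete outside of RapidMiner, here’s a rough scikit-learn equivalent of what the Naive Bayes operator does. This is not the RapidMiner process itself, and the tiny dataset is purely illustrative.

```python
# A rough scikit-learn equivalent of the setup described above: bag-of-words
# features fed into a multinomial Naive Bayes classifier that predicts an
# article's author from its body. The tiny dataset is purely illustrative --
# in practice you would use the ~500 articles pulled from the News API.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

bodies = [
    "Bitcoin and cryptocurrency startups raised new funding rounds this week...",
    "Another token sale highlights the ongoing cryptocurrency boom...",
    "The new flagship phone ships with an updated camera system...",
    "Hands-on with the latest laptop hardware refresh...",
]
authors = ["author_a", "author_a", "author_b", "author_b"]

# Vectorize the article bodies and fit the Naive Bayes model in one pipeline.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(bodies, authors)

print(model.predict(["A cryptocurrency exchange announced a hack today"]))
```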

For example, take a look below at how our classifier has learned which writers are most likely to talk about ‘cryptocurrency’. You can inspect what your classifier has learned by selecting the Attribute button in the top left corner.

[Screenshot: the attribute view showing the ‘cryptocurrency’ feature by author]


Once the process has fully run, it will retrieve and process the news content and train a Naive Bayes classifier that, given the body of an article, tries to predict its likely author from among all the TechCrunch journalists in the data.

Additionally, RapidMiner will evaluate this classifier for us on a held-out subset of the data we retrieved from the News API, comparing the true labels (known authors) to the model’s predictions (predicted authors) on the test set and providing us with an accuracy score and a confusion matrix:

[Screenshot: the accuracy score and confusion matrix]

There are many ways to improve the performance of this classifier, for example by using a more advanced classification algorithm like SVM instead of Naive Bayes. In this post, our goal was to show you how easy it is to retrieve news content from our News API and load it into RapidMiner for further analysis and processing.

Things to try next:

  • Try changing your News API query to repeat this process for journalists from a different news outlet
  • Try using a more powerful algorithm such as SVM or Logistic Regression (RapidMiner includes implementations for many different classifiers, and you can easily replace them with one another)
  • Try applying a minimum threshold on the number of articles an author must have before the model is trained on them

This process is just one simple example of what RapidMiner’s analytic capabilities can do with enriched news content. By running the first three operators on their own, you can take a look at the enriched content that the News API generates and begin to leverage RapidMiner’s advanced capabilities on an ever-growing dataset of structured news data.

To get started with a free trial of the News API, click on the link below, and be sure to check back on our blog over the coming weeks to see some sample analyses and walkthroughs.

News API - Sign up