
Four members of our research team spent the past week at the Conference on Empirical Methods in Natural Language Processing (EMNLP 2017) in Copenhagen, Denmark. The conference handbook can be found here and the proceedings can be found here.

The program consisted of two days of workshops and tutorials and three days of the main conference. Videos of the conference talks and presentations can be found here. The conference was superbly organized, had a great venue, and featured a social event with fireworks.


Figure 1: Fireworks at the social event

With 225 long papers, 107 short papers, and 9 TACL papers accepted, there was a clear uptick in submissions compared to last year. For the first time in the last 13 years, the number of long and short paper submissions to EMNLP was even higher than the number submitted to ACL, as can be seen in Figure 2.


Figure 2: Long and short paper submissions at ACL and EMNLP from 2004-2017

In the following, we will outline our highlights and list some research papers that caught our eye. We will first list overall themes and will then touch upon specific research topics that are in line with our areas of focus. Also, we’re proud to say that we had four papers accepted to the conference and workshops this year! If you want to see the AYLIEN team’s research, check out the research sections of our website and our blog. With that said, let’s jump in!


Exciting Datasets

Evaluating your approach on CoNLL-2003 or PTB is appropriate for comparing against previous state-of-the-art, but kind of boring. The two following papers introduce datasets that allow you to test your model in more exciting settings:

  • Durrett et al. release a new domain adaptation dataset. The dataset evaluates models on their ability to identify products being bought and sold in online cybercrime forums.  
  • Kutuzov et al. evaluate their word embedding model on a new dataset that focuses on predicting insurgent armed groups based on geographical locations.
  • While he did not introduce a new dataset, Nando de Freitas made the point during his keynote that the best environment for learning and evaluating language is simulation.


Figure 3: Nando de Freitas’ vision for AI research

Return of the Clusters

Brown clusters, an agglomerative, hierarchical clustering of word types based on their contexts that was introduced in 1992, seem to be back in vogue. They were found to be particularly helpful for cross-lingual applications, and clusters were key features in several approaches:

  • Mayhew et al. found that Brown cluster features were an important signal for cross-lingual NER.
  • Botha et al. use word clusters as a key feature in their small, efficient feed-forward neural networks.
  • Mekala et al.’s new document representations cluster word embeddings, which gives them an edge for text classification.
  • In his talk at the SCLeM workshop, Noah Smith cites the benefits of using Brown clusters as features for tasks such as POS tagging and sentiment analysis.
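
As a rough illustration of how Brown clusters are commonly turned into features: each word type gets a bit string that encodes its path in the cluster hierarchy, and prefixes of that bit string give cluster memberships at different granularities. The sketch below assumes a hypothetical `paths` mapping with made-up bit strings and arbitrarily chosen prefix lengths; it is not taken from any of the papers above.

```python
# Hypothetical Brown cluster paths (word -> bit string). Real paths come
# from running Brown clustering on a large unlabeled corpus.
paths = {
    "monday": "0110100",
    "tuesday": "0110101",
    "london": "1011001",
}

def cluster_features(word, prefix_lengths=(2, 4, 6)):
    """Return one feature per prefix length, or a fallback for unseen words."""
    path = paths.get(word.lower())
    if path is None:
        return ["brown=UNK"]
    # Shorter prefixes correspond to coarser clusters higher up the hierarchy.
    return [f"brown{n}={path[:n]}" for n in prefix_lengths]

print(cluster_features("Monday"))
print(cluster_features("London"))
```

Words that share a long prefix (here, the two weekdays) end up with identical features, which is what lets a tagger generalize from a word seen in training to a similar word it has never seen.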


Figure 4: Noah Smith on the benefits of clustering in his invited talk at the SCLeM workshop

Distant Supervision

Distant supervision can be leveraged to collect large amounts of noisy training data, which can be useful in many applications. Some papers used novel forms of distant supervision to create new corpora or to train a model more effectively:

  • Lan et al. use URLs in tweets to collect a large corpus of paraphrase data. Paraphrase data is usually hard to create, so this approach facilitates the process significantly and enables a continuously expanding collection of paraphrases.
  • Felbo et al. show that training on fine-grained emoji detection is more effective for pre-training sentiment and emotion models. Previous approaches primarily pre-trained on positive and negative emoticons or emotion hashtags.
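
To make the idea of emoji-based distant supervision concrete, here is a minimal sketch of how noisy labels could be derived from tweets. The emoji inventory, the label names, and the example tweets are all made up for illustration; Felbo et al. work with a much larger emoji set and their own filtering rules.

```python
# Hypothetical emoji inventory used as noisy labels (toy illustration only).
EMOJI_LABELS = {"😂": "joy", "😢": "sadness", "😡": "anger"}

def distant_label(tweet):
    """Return (text_without_emoji, label), or None if the tweet is unusable."""
    found = {label for emoji, label in EMOJI_LABELS.items() if emoji in tweet}
    if len(found) != 1:
        # No emoji, or conflicting emoji: skip rather than guess a label.
        return None
    text = tweet
    for emoji in EMOJI_LABELS:
        # Remove the emoji so a model cannot simply read the label off the input.
        text = text.replace(emoji, "")
    return " ".join(text.split()), found.pop()

tweets = [
    "Missed my flight again 😢",
    "That movie 😂😂",
    "Just landed",
    "Great 😂 awful 😡",
]
training_data = [ex for ex in map(distant_label, tweets) if ex is not None]
print(training_data)
```

No human annotation is involved: the emoji the author chose acts as the label, which is why such corpora can grow very large while remaining noisy.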

Data Selection

The current generation of deep learning models is excellent at learning from data. However, we often do not pay much attention to the actual data our model is using. In many settings, we can improve upon the model by selecting the most relevant data:

  • Fang et al. reframe active learning as reinforcement learning and explicitly learn a data selection policy. Active learning is one of the best ways to create a model with as few annotations as possible; any improvement to this process is beneficial.
  • Van der Wees et al. introduce dynamic data selection for NMT, which varies the selected subset of the training data between different training epochs. This approach has the potential to reduce the training time of NMT models at comparable or better performance.
  • Ruder and Plank use Bayesian Optimization to learn data selection policies for transfer learning and investigate how well these transfer across models, domains, and tasks. This approach brings us a step closer towards gaining a better understanding of what constitutes similarity between different tasks and domains.

Character-level Models

Characters are nowadays used as standard features in most sequence models. The Subword and Character-level Models in NLP workshop discussed approaches in more detail, with invited talks on subword language models and character-level NMT.

  • Schmaltz et al. find that character-based sequence-to-sequence models outperform word-based models and models with character convolutions for sentence correction.
  • Ryan Cotterell gave a great, movie-inspired tutorial on combining the best of FSTs (cowboys) and sequence-to-sequence models (aliens) for string-to-string transduction. While evaluated on morphological segmentation, the tutorial raised awareness in an entertaining way that often the best of both worlds, i.e. a combination of traditional and neural approaches, performs best.


Figure 5: Ryan Cotterell on combining FSTs and seq2seq models for string-to-string transduction

Word Embeddings

Research in word embeddings has matured and now mainly tries to 1) address deficits of word2vec, such as its inability to deal with OOV words; 2) extend it to new settings, e.g. modelling the relations of words over time; and 3) understand the induced representations better:

  • Pinter et al. propose an approach for generating OOV word embeddings by training a character-based BiLSTM to generate embeddings that are close to pre-trained ones. This approach is promising as it provides us with a more sophisticated way to deal with out-of-vocabulary words than replacing them with an <UNK> token.
  • Herbelot and Baroni slightly modify word2vec to allow it to learn embeddings for OOV words from very little data.
  • Rosin et al. propose a model for analyzing when two words relate to each other.
  • Kutuzov et al. propose another model that analyzes how two words relate to each other over time.
  • Hasan and Curry improve the performance of word embeddings on word similarity tasks by re-embedding them in a manifold.
  • Yang et al. introduce a simple approach to learning cross-domain word embeddings. Creating embeddings tuned on a small, in-domain corpus is still a challenge, so it is nice to see more approaches addressing this pain point.
  • Mimno and Thompson try to understand the geometry of word2vec better. They show that the learned word embeddings are positioned diametrically opposite of their context vectors in the embedding space.

Cross-lingual transfer

An increasing number of papers evaluate their methods on multiple languages. In addition, there was an excellent tutorial on cross-lingual word representations, which summarized and tried to unify much of the existing literature. Slides of the tutorial are available here.

  • Malaviya et al. train a many-to-one NMT to translate 1017 languages into English and use this model to predict information missing from typological databases.
  • Mayhew et al. introduce a cheap translation method for cross-lingual NER that only requires a bilingual dictionary. They even perform a case study on Uyghur, a truly low-resource language.
  • Kim et al. present a cross-lingual transfer learning model for POS tagging without parallel data. Parallel data is expensive to create and rarely available for low-resource languages, so this approach fills an important need.
  • Vulic et al. propose a new cross-lingual transfer method for inducing VerbNets for different languages. The method leverages vector space specialisation, an effective word embedding post-processing technique similar to retro-fitting.
  • Braud et al. propose a robust, cross-lingual discourse segmentation model that only relies on POS tags. They show that dependency information is less useful than expected; it is important to evaluate our models on multiple languages, so we do not overfit to features that are specific to analytic languages, such as English.


Figure 6: Anders Søgaard demonstrating the similarities between different cross-lingual embedding models at the cross-lingual representations tutorial


Summarization

The Workshop on New Frontiers of Summarization brought researchers together to discuss key issues related to automatic summarization. Much of the research on summarization sought to develop new datasets and tasks:

  • Katja Filippova (Google Research, Switzerland) gave an interesting talk on sentence compression and passage summarization for Q&A. She described how they went from syntax-based methods to Deep Learning.
  • Völske et al. created a new summarization corpus by looking for ‘TL;DR’ on Reddit. This is another example of a creative use of distant supervision, leveraging information that is already contained in the data in order to create a new corpus.
  • Falke and Gurevych won the best resource paper award for creating a new summary corpus that is based on concept maps rather than textual summaries. The concept map can be explored using a graph-based document exploration system, which is available as a demo here.
  • Pasunuru et al. use multi-task learning to improve abstractive summarization by leveraging entailment generation.
  • Isonuma et al. also use multi-task learning with document classification in conjunction with curriculum learning.
  • Li et al. propose a new task, reader-aware multi-document summarization, which uses comments of articles, along with a dataset for this task.
  • Narayan et al. propose another new task, split and rephrase, which aims to split a complex sentence into a sequence of shorter sentences with the same meaning, and also release a new dataset.
  • Ghalandari revisits the traditional centroid-based method and proposes a new strong baseline for multi-document summarization.


Bias

Data and model-inherent bias is an issue that is receiving more attention in the community. Some papers investigate and propose methods to address the bias in certain datasets and evaluations:

  • Chaganty et al. investigate bias in the evaluation of knowledge base population models and propose an importance sampling-based evaluation to mitigate the bias.
  • Dan Jurafsky gave a truly insightful keynote about his three-year-long study analyzing the body camera recordings his team obtained from the Oakland police department for racial bias. Besides describing the first contemporary linguistic study of officer-community member interaction, he also provided entertaining insights on the language of food (cheaper restaurants use terms related to addiction, more expensive venues use language related to indulgence) and the challenges of interdisciplinary publishing.
  • Dubossarsky et al. analyze the bias in word representation models and argue that recently proposed laws of semantic change must be revised.
  • Zhao et al. won the best paper award for an approach using Lagrangian relaxation to inject constraints based on corpus-level label statistics. An important finding of their work is bias amplification: While some bias is inherent in all datasets, they observed that models trained on the data amplified its bias. While a gendered dataset might only contain women in 30% of examples, the situation at prediction time might thus be even more dire.
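
A small worked example may help to pin down what bias amplification means. The sketch below compares the share of a label in the training data with its share in a model's predictions. All numbers are hypothetical (the 30% figure simply mirrors the example above), and the simple difference computed here is an illustrative stand-in, not necessarily the exact metric used by Zhao et al.

```python
def share(labels, value):
    """Fraction of labels equal to `value`."""
    return sum(1 for label in labels if label == value) / len(labels)

# Hypothetical gender labels attached to examples of a single activity.
train_labels = ["woman"] * 30 + ["man"] * 70  # 30% women in the training data
pred_labels = ["woman"] * 20 + ["man"] * 80   # made-up model predictions

train_share = share(train_labels, "woman")
pred_share = share(pred_labels, "woman")

# A positive gap means the model under-predicts women even more strongly
# than the skew already present in the data would suggest.
amplification = train_share - pred_share
print(f"train={train_share:.2f} predicted={pred_share:.2f} gap={amplification:.2f}")
```

In this toy setting the data is already skewed, and the model makes the skew worse. Constraining predictions so that corpus-level label statistics stay close to those of the training data is one way to counteract this.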


Figure 7: Zhao et al.’s proposed method for reducing bias amplification

Argument mining & debate analysis

Argument mining is closely related to summarization. In order to summarize argumentative texts, we have to understand claims and their justifications. This research area had the 4th Workshop on Argument Mining dedicated to it:

  • Hidey et al. analyse the semantic types of claims (e.g. agreement, interpretation) and premises (ethos, logos, pathos) in the subreddit Change My View. This is another creative use of Reddit to create a dataset and analyze linguistic patterns.
  • Wachsmuth et al. presented an argument web search engine, which can be queried here.
  • Potash and Rumshisky predict the winner of debates, based on audience favorability.
  • Swamy et al. also forecast winners for the Oscars, the US presidential primaries, and many other contests based on user predictions on Twitter. They create a dataset to test their approach.
  • Zhang et al. analyze the rhetorical role of questions in discourse.
  • Liu et al. show that argument-based features are also helpful for predicting review helpfulness.

Multi-agent communication

Multi-agent communication is a niche topic, which has nevertheless received some recent interest, notably in the representation learning community. Most papers deal with a scenario where two agents play a communicative referential game. The task is interesting, as the agents are required to cooperate and have been observed to develop a common pseudo-language in the process.

  • Andreas and Klein investigate the structure encoded by RNN representations for messages in a communication game. They find that the mistakes are similar to the ones made by humans. In addition, they find that negation is encoded as a linear relationship in the vector space.
  • Kottur et al. show in their best short paper that language does not emerge naturally when two agents are cooperating, but that they can be coerced to develop compositional expressions.


Figure 8: The multi-agent setup in the paper of Kottur et al.

Relation extraction

Extracting relations from documents is more compelling than simply extracting entities or concepts. Some papers improve upon existing approaches using better distant supervision or adversarial training:

  • Liu et al. reduce the noise in distantly supervised relation extraction with a soft-label method.
  • Zhang et al. publish TACRED, a large supervised dataset for knowledge base population, as well as a new model.
  • Wu et al. improve the precision of relation extraction with adversarial training.

Document and sentence representations

Learning better sentence representations is closely related to learning more general word representations. While word embeddings still have to be contextualized, sentence representations are promising as they can be directly applied to many different tasks:

  • Mekala et al. propose a novel technique for building document vectors from word embeddings, with good results for text classification. They use a combination of adding and concatenating word embeddings to represent multiple topics of a document, based on word clusters.
  • Conneau et al. learn sentence representations from the SNLI dataset and evaluate them on 12 different tasks.
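
The idea of combining addition and concatenation of word embeddings based on word clusters can be sketched in a few lines. The embeddings, cluster assignments, and dimensions below are made up, and the hard cluster assignment is a deliberate simplification: this is an illustration of the general idea, not a reimplementation of the method of Mekala et al., which may assign and weight clusters differently.

```python
# Toy 2-d embeddings and hard cluster assignments (both made up).
embeddings = {
    "goal": [1.0, 0.0], "match": [0.8, 0.2],
    "vote": [0.0, 1.0], "party": [0.1, 0.9],
}
cluster_of = {"goal": 0, "match": 0, "vote": 1, "party": 1}
NUM_CLUSTERS, DIM = 2, 2

def document_vector(tokens):
    """Sum embeddings within each cluster, then concatenate the cluster sums."""
    sums = [[0.0] * DIM for _ in range(NUM_CLUSTERS)]
    for token in tokens:
        if token not in embeddings:
            continue  # ignore out-of-vocabulary tokens in this sketch
        cluster = cluster_of[token]
        for i, value in enumerate(embeddings[token]):
            sums[cluster][i] += value
    # Concatenation keeps each cluster's contribution in its own slice.
    return [value for cluster_sum in sums for value in cluster_sum]

doc = document_vector(["goal", "match", "vote", "unknown"])
print(doc)
```

Compared with averaging all word vectors into a single vector, the concatenated representation has `NUM_CLUSTERS * DIM` dimensions, so a document that touches on several topics does not have them blurred together.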

These were our highlights. Naturally, we weren’t able to attend every session and see every paper. What were your highlights from the conference or which papers from the proceedings did you like most? Let us know in the comments below.



Welcome to the seventh in a series of blog posts in which we use the News API to look into the previous month’s news content. The News API collected and indexed over 2.5 million stories published last month, and in this blog we’re going to use its analytic capabilities to discover trends in what the media wrote about.

We’ve picked two of the biggest stories from last month, and using just three of the News API’s endpoints (Stories, Trends, and Time Series) we’re going to cover the two following topics:

  1. The conflict brewing with a nuclear-armed North Korea and the US
  2. The ‘fight of the century’ between Conor McGregor and Floyd Mayweather

In covering both of these topics we uncovered some interesting insights. First, apparently we’re much more interested in Donald Trump’s musings on nuclear war than the threat of nuclear war itself. Also, although the McGregor fight lived up to the hype, Conor failed to capitalize on the record-breaking press coverage to launch his ‘Notorious Whiskey’.

1. North Korea

Last month, North Korea detonated a hydrogen bomb, which was over seven times larger than any of their previous tests. This created a growing worry that conflict with a nuclear-armed nation is now likely. But using the News API, we can see that in the English-speaking world, even with such a threat looming, we still just can’t get enough of Donald Trump.

Take a look below at the daily volume of stories with ‘North Korea’ in the title across every day in August, which we gathered with the News API’s Time Series endpoint. You can see that the English-speaking media were much more interested in Trump’s ‘fire and fury’ comment at the start of August than they were in North Korea actually detonating a hydrogen bomb at the start of September.

We guessed that this is largely due to publishers trying to keep up with the public’s insatiable appetite for any Donald Trump-related news. Using the News API, we can put this idea to the test, by analyzing what content about North Korea people shared the most over August.

We used the Stories endpoint of the News API to find the stories with ‘Korea’ in the title that had the highest engagement rates across social networks. The content people are most likely to recommend in their social circles gives a strong indication of readers’ opinions and interests. Take a look below at the most-shared stories across Facebook and Reddit. You can see that the popular content varies across the different networks.


Most-shared stories on Facebook


  1. ‘Trump to North Korea: U.S. Ready to Respond With “Fire and Fury”,’ The Washington Post. 118,312 shares.
  2. ‘China warns North Korea: You’re on your own if you go after the U.S.,’ The Washington Post. 94,818 shares.
  3. ‘Trump threatens “fury” against N Korea,’ BBC. 69,098 shares.


Most-upvoted stories on Reddit


  1. ‘Japanese government warns North Korea missile headed toward northern Japan,’ CNBC. 119,075 upvotes.
  2. ‘North Korea shaken by strong tremors in likely nuclear test,’ CNBC. 61,088 upvotes.
  3. ‘Japan, US look to cut off North Korea’s oil supply,’ Nikkei Asian Review. 59,725 upvotes.

Comparing coverage across these two social networks, you can see that Trump features heavily in the most popular content about Korea on Facebook, while the most-upvoted content on Reddit tended to be breaking news with a more neutral tone. This is similar to the patterns we observed with the News API in a previous media review blog, which showed that Reddit was much more focused on breaking stories on popular topics than Facebook.

So now that we know the media focused its attention on Donald Trump, we can ask ourselves, what were all of these stories about? Were these stories talking the President down, like he always claims? Or were they positive? Using the sentiment feature of the News API’s Trends endpoint, we can dive into the stories that had both ‘Trump’ and ‘Korea’ in the title, and see what sentiment is expressed in the body of the text.

From the results below, you can see that over 50% of these articles contained negative sentiment, whereas a little over 30% had a positive tone. For all of the President’s – shall we say, questionable – claims, he’s right about one thing: popular content about how he responds to issues is overwhelmingly negative.

The Superfight – how big was it?

We’re based in Ireland, so having Conor McGregor of all people taking part in the ‘fight of the century’ last month meant that we’ve heard about pretty much nothing else. We can use the News API to put some metrics on all of the hype, and see how the coverage compared to the coverage of other sporting events. Using the Time Series endpoint, we analyzed the impact of the fight on the volume of stories last month. Since it analyzes the content of every news story it gathers, the News API can show us how the volume of stories about every subject fluctuates over time.

Take a look at how the volume of stories about boxing skyrocketed in the build up to and on the weekend of the fight:

You can see that on the day of the fight itself, the volume of stories that the News API classified as being about boxing increased by almost a factor of 10.

To figure out just how big this hype was in the boxing world, we compared the volume of stories published about boxing in the time period surrounding the ‘fight of the century’ with that of another boxing match that received a lot of hype at the time, the WBA/IBF world heavyweight title bout last April between Anthony Joshua and Wladimir Klitschko. To do this, we analyzed the story volume from the two weeks before and after each fight and plotted them side by side. This allows us to easily compare the media coverage on the day of the fight as well as its build-up and aftermath. Take a look at the results below:

You can see that the McGregor-Mayweather fight totally eclipses the Joshua-Klitschko heavyweight title fight. But it’s important to give context to the data on this hype by comparing it with data from other sports.

It’s becoming almost a point of reference on these News API media review blogs to compare any trending stories to stories in the World Soccer category. This is because the daily volume of soccer stories tends to be consistently the largest of all categories, so it’s a nice baseline to use to compare story volumes. As you can see below, the hype surrounding the ‘fight of the century’ even prompted more boxing stories than soccer stories, which is quite a feat. Notice how only four days after the fight, when boxing was back to its normal level and soccer stories were increasing due to European transfer deadline day looming, there were 2,876 stories about soccer compared with 191 stories about boxing.
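
The side-by-side comparison of the two fights relies on re-indexing each daily time series by its distance from the event day, so that day 0 is fight night in both series. Here is a minimal sketch of that alignment step. The function name and the story counts are made up for illustration; in practice the counts would come from the Time Series endpoint.

```python
from datetime import date, timedelta

def align_to_event(daily_counts, event_day, window=14):
    """Map {date: count} to [(offset_in_days, count)] around `event_day`."""
    aligned = []
    for offset in range(-window, window + 1):
        day = event_day + timedelta(days=offset)
        # Days with no data are treated as zero stories.
        aligned.append((offset, daily_counts.get(day, 0)))
    return aligned

# Made-up daily story counts, for illustration only.
fight_a = {date(2017, 8, 25): 400, date(2017, 8, 26): 1500, date(2017, 8, 27): 900}
fight_b = {date(2017, 4, 28): 150, date(2017, 4, 29): 500, date(2017, 4, 30): 300}

series_a = align_to_event(fight_a, date(2017, 8, 26), window=1)
series_b = align_to_event(fight_b, date(2017, 4, 29), window=1)
print(series_a)
print(series_b)
```

Because both series now share the same offsets, they can be plotted on a single axis even though the events took place months apart.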

You might remember Conor McGregor launched his ‘Notorious Whiskey’ in the press conference following the fight. This was the perfect time for McGregor to announce a new product – right at the pinnacle of the media coverage. If you’re wondering how well he leveraged this phenomenal level of publicity for his new distilling career, we used the News API to look into that too. Take a look below at the volume of stories that mentioned the new whiskey brand. It looks like mentions of ‘Notorious Whiskey’ have disappeared totally since the weekend of the fight, leaving us with this odd-looking bar chart. But we doubt that will bother Conor at the moment, considering the $100m payday!

That covers our quick look into the News API’s data on two of last month’s stories. The News API gathers over 100,000 stories per day, and indexes them in near-real time. This gives you a stream of enriched news data that you can query. So try out the demo or click on the link below for a free trial to use our APIs for two weeks.

News API - Sign up


At AYLIEN, we provide the building blocks for our customers to create Natural Language Processing-powered solutions. To give you an idea of what some of these solutions look like, we occasionally put together use cases (check out how customers like Complex Media and Streem use our APIs).

For this blog, we’re going to show how 1043 Labs, a US-based software consultancy firm, used our Text Analysis API as part of an innovative platform they built for a client that recently closed a $5 million funding round to expand their operations.



1043 Labs is a custom software consulting firm that helps entrepreneurs get their ideas off the ground. They are a team made up of experts from a wide variety of technology fields who bring their technical expertise to the table to help their customers build products, services, and solutions. So when Chris Kraft founded Share Rocket with a vision to create a ratings system for digital media, he hired 1043 Labs to help build the solution he had envisioned.

The Challenge

In the US, 2016 saw digital ad spending surpass that of television ads, with the global annual digital ad spend hitting $72.5 billion in 2016, a figure that’s set to continue to grow into the future. But to get a slice of this digital ad spend, publishers and media organizations across the board need to understand how their content performs so that they can make data-driven decisions on their strategies.




This is where Share Rocket can help. Using a collection of digital tools, Share Rocket’s customers like NBC, Fox, and Hearst can monitor social reach and measure how successful their content is online. Information and data obtained from Share Rocket help users to assess what type of content they need to produce more of. In turn, this allows them to make data-driven decisions on what content to produce and how to promote it, maximizing potential ad revenue.




How does AYLIEN fit in?

To understand what content performs best, Share Rocket needs to understand what every piece of content is about. This allows their users to ask questions about the subject matter of their text content – what isn’t working, and what they should produce more of. Moreover, Share Rocket needs to understand this at scale. Their clients produce thousands of pieces of content per hour, which means that manually tagging all of it would be almost impossible.




So when Share Rocket hired 1043 Labs to help build their platform, the consultants at 1043 Labs assessed a number of Natural Language Processing APIs with a clear idea of what exactly they were looking for and prioritized the following requirements:

  • Accuracy
  • Speed
  • Customer support  
  • Time to value

Following some in-depth testing and discussions with Mike and some of the engineering team, 1043 Labs chose the AYLIEN Text Analysis API.

“AYLIEN saved us about 3 person months of development. This made our client happier because we delivered the project earlier than planned and under budget.”

Mike Ostman, Founding Partner, 1043 Labs


We’ve heard Mike’s sentiment echoed in calls with our customers quite a bit. People respond really well to how easy our tools are to integrate into whatever you’re building – signing up takes a couple of minutes, and you’ll be making your first calls to the API with a couple of lines of code. After deciding to go with AYLIEN, the team at 1043 Labs had their entire solution built in a couple of weeks.

How are Share Rocket using the AYLIEN Text API?

Share Rocket’s users need accurate metrics on the performance of their content online, so to provide the full picture, Share Rocket need to understand what every piece of content is about. This allows their users to ask questions like ‘how do people like our sports coverage?’ Or ‘is our coverage of weather doing better than our competitor’s?’



But the problem here is that text content is unstructured, meaning it’s particularly difficult for computers to analyze and understand. This is where Natural Language Processing comes in. Every time a new piece of content is published by their users, Share Rocket use the Classification endpoint of our Text Analysis API to understand what the content is about. Our Classification feature allows users to categorize content based on two industry-standard taxonomies: IAB-QAG and IPTC. The ability to classify content automatically means Share Rocket can utilize assigned categories as tags in their proprietary SHARE and SEI tools. These tags can then be used to track content performance across subject categories.
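
To give a feel for how category tags make performance tracking possible, here is a minimal sketch that groups stories by their assigned category and averages an engagement metric per category. The category names, share counts, and function name are all made up; this is not Share Rocket's implementation, and the classifier output is assumed to have been reduced to a single label per story.

```python
from collections import defaultdict

# Made-up records: (category label assigned by a classifier, share count).
stories = [
    ("Sports", 120),
    ("Weather", 40),
    ("Sports", 80),
    ("Weather", 60),
    ("Politics", 200),
]

def average_shares_by_category(records):
    """Group share counts by category tag and return the mean per tag."""
    totals = defaultdict(lambda: [0, 0])  # tag -> [sum of shares, story count]
    for tag, shares in records:
        totals[tag][0] += shares
        totals[tag][1] += 1
    return {tag: total / count for tag, (total, count) in totals.items()}

averages = average_shares_by_category(stories)
print(averages)
```

Once every story carries a tag, a question like ‘how do people like our sports coverage?’ becomes a simple lookup in the aggregated results.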

Share Rocket now analyze upwards of 150,000 pieces of content every day for their users with our API. This is a testament not only to the fact that they’ve built a service people need, but also to the fact that our Text Analysis API is a robust tool that scales as your demand increases.

Getting started with our APIs is simple. Open your account by clicking on the CTA below. Once you’ve created your account you can start calling the API within minutes, with a few lines of code.

Text Analysis API - Sign up


Breakthroughs in NLP research are creating huge value for people every day, supercharging technologies from search engines to chatbots. The work that makes these breakthroughs possible is done in two silos – academia and industry. Researchers in both of these silos produce work that advances the field, and frequently collaborate to generate innovative research.

Contributing to this research is why we have such a heavy R&D focus at AYLIEN, with six full-time research scientists out of a total team of 16. The research team naturally has strong ties with academia – some are completing PhDs with the work they are carrying out here, while others already hold one. Academia is also represented on our advisory board and in the great people who have become our mentors.

To further deepen these ties with academia, we’re delighted to announce our first Industry Fellowship in association with Science Foundation Ireland, with Dr. Ian Wood of NUIG. Ian will be based in our Dublin office for one year starting in September. SFI’s goal with this fellowship is to allow industry and academia to cross-pollinate by exchanging ideas and collaborating on research. This placement will allow us to contribute to and learn from the fantastic work that the Insight Centre in NUIG is doing, and we’re really excited to open up some new research windows where our team’s and Ian’s interests overlap.


Ian is a postdoctoral researcher at the Insight Centre for Data Analytics, with an incredibly interesting background – a mixture of pure Mathematics, Psychology, and Deep Learning. His research is focused on how the emotions of entire communities change over time, which he researches by creating language models that detect the emotions people express on social media. For his PhD, he analyzed Tweets produced by pro-anorexia communities over three years and tracked their emotions, and showed that an individual’s actions are much more driven by their surrounding community than is generally accepted. Continuing this research, Ian now specializes in finding new ways to build Machine Learning and Deep Learning models to analyze emotions in online communities.

Ian’s placement is mutually beneficial on two levels. First, Ian’s experience in building language models for emotion analysis is obviously beneficial to us, and we can offer Ian a cutting edge research infrastructure and the opportunity to learn from our team in turn. But we’re also really excited at the possibility of opening up new research areas based on common interests, for example by building on existing research between Ian and our PhD student, Sebastian. Ian’s research into reducing dimensionality in data sets crosses over with Sebastian’s work into Domain Adaptation in a really interesting way, and we’re excited that this could open up a new research area for us to work on.

Outside of AYLIEN, Ian also speaks four languages, he was a professional musician (but that was in a previous life, he tells us), and he’s also sailed across the Atlantic in a small boat, so he’ll hopefully have some input into the next AYLIEN team-building exercises…

Welcome to the team, Ian!

If you want to find out more about the Fellowship, check out the LinkedIn group, and if your research interests overlap with ours in any way, drop us a line. We love hearing from other researchers!


In 2017, video content is becoming ever more central to how people consume media. According to research by HighQ, this year around 30% of smartphone users will watch video content on their device at least once a day. In addition to this, people will spend on average an extra two minutes browsing sites that feature video content compared with sites that do not. For this reason, video content is an important component of driving up revenues for online news publishers, since keeping your audience on your site allows you to sell more ads.

But even though we can find great market research on consumer behavior around video content, we couldn’t find an answer to the following question — what type of video content is the news industry publishing to capitalize on this? For example, how much video content is actually being published? Are some publishers dominating video content? And are some subjects being supplemented with videos more regularly than others? Knowing this would allow us to understand what areas of the online news industry are set to flourish in the coming years with the growing emphasis on video.

We decided to use the News API to look into this question. Last month, our API crawled, analyzed, and indexed 1,344,947 stories as they were published. One of the metadata points that it analyzed was how many images and videos were embedded on the page. So for this blog, we’ll analyze the 1.3 million stories our News API gathered in July to find answers to the following questions:

  1. How many of the stories published last month featured video content?
  2. What were the stories with video content about?
  3. Which news organizations published the most video content?

1. How many stories published last month contained video content?

To get an idea of how far the video medium has spread into the online news industry, we need to find how much video content was used by news publishers last month. To do this, we used the News API’s Time Series endpoint to sort the stories published in July according to how many videos they contained. We then visualized the results to show how many stories contained no videos, how many contained one video, and how many contained more than one. Take a look below at what we found:

As you can see, 96% of stories published last month did not contain any video content, whereas just under 4% contained one video or more. We found this interesting — while HighQ found that almost 30% of smartphone users will watch video content online at least once per day, we can see here that barely 3.5% of news content published last month contained a video. This isn’t really optimal for an industry that relies on clicks for ad revenue.

But let’s focus on the news stories that contained video content. If we knew what these stories were about, we would have a good idea about what areas of online news are likely to fare well, since these areas likely account for a large proportion of ad revenue, and are therefore likely to grow. To look into this, we decided to try to understand what the stories containing video content were about.

2. What were the stories containing video about?

Knowing that only around one out of every thirty stories contained video content last month is interesting, but it raises the question of what these stories were about. To answer this question, we used the Trends endpoint to analyze the 43,134 stories that contained one video and see what subject each one was about.
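As a quick sanity check on the “one in thirty” figure, the share can be recomputed from the two counts quoted in this post. This is only a back-of-the-envelope sketch in Python, and the variable names are our own.

```python
# Counts quoted in this post
total_stories = 1_344_947      # stories indexed in July
one_video_stories = 43_134     # stories that contained one video

share = one_video_stories / total_stories
one_in_n = total_stories / one_video_stories

print(f"{share:.1%} of stories contained one video")
print(f"that is roughly one story in every {one_in_n:.0f}")
```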

One of the pieces of information our News API extracts is topics that are discussed in the story, and which categories the story fits into, based on two taxonomies. For this visualization, we’ll use the advertising industry’s IAB-QAG taxonomy. Take a look below at which categories contained the most video content:

You can see that the Entertainment category had the most stories with video content accompanying them. This isn’t surprising to us at first, as we have all seen articles about celebrities with annoying videos that play automatically. But if you remember last month’s media roundup, you’ll remember that the Sports and Law, Government, and Politics categories produced by far the highest volumes of content (the Sports category alone published over double the content of the Entertainment category). This means that not only are there more videos about entertainment, but also that stories about entertainment are much more likely to contain a video than stories about politics.

So now we know which subject categories video content appeared in the most. But with the News API, we can go one step further and see exactly what people were talking about in the stories that contained a video. To do this, we used the Trends endpoint again to extract the entities mentioned in the titles of these stories. Take a look at the chart below to see what people were talking about:

Here you can see exactly what the stories containing videos were about. The single biggest subject that was accompanied by a video was Love Island, a reality TV show. But you can also see that large soccer clubs are well represented on the chart. If you think back to last month’s roundup again, you’ll remember the huge reach and popularity of the top soccer clubs, even during their off-season. The chart above shows that these large soccer clubs are also being covered more with video content than other entities, with publishers obviously trying to leverage this reach to attract people to the stories they publish.

With large soccer clubs dominating both regular news content and video news content, and with ad revenues for video content being so valuable, these soccer clubs look like they have a bright future in terms of media content. Since the clubs benefit financially from media coverage through things like player image rights and viewership of games, large transfer fees like the $263 million PSG are going to pay for Neymar don’t look so crazy.

3. Who were the biggest publishers of video content?

As we mentioned in the introduction, we want to find out which publishers are making the quickest transition to video-based content, as this has a knock-on effect on site viewership, and therefore ad revenues. Knowing which players are leading industry trends like this is a good indicator of which ones are going to survive in an industry that is under financial pressure while transitioning to digital.

With that in mind, we used the Trends endpoint to find out which publishers were leading the way in video content. You can see pretty clearly from the graph below that the Daily Mail dominates last month’s video content. To see the rest of the publishers more clearly, you can select the Daily Mail bubble below and click “exclude”.

The Daily Mail obviously dominate the chart here, which isn’t too surprising when you consider that they feature video as a central part of the content on their site. They produce a huge number of stories every month, and feature video even when it isn’t completely related to the story it appears with. Although the mismatch can seem odd, even a loosely related video can increase click-through rate and revenues.

As you can see, many traditional news publishers are lagging behind in terms of the amount of video they’re publishing, with The Guardian, Forbes, ABC, and The Daily Mail among the few recognizable print and television giants on the graph. Instead, the field is largely made up of publishers like The Elite Daily, Uproxx, and Heavy, digital native organizations who are publishing more online video content than most traditional publishers.

Well, that concludes our brief analysis of last month’s video content in news stories. If you’re an AYLIEN subscriber, we’d like to remind you that the two endpoints we used in this post (Trends and Time Series) do not return stories, so you can hit them as much as you like and they won’t contribute towards your monthly 10,000 stories. So dig in!

If you’re not a subscriber, you can try the News API free of charge for two weeks by clicking on the image below (free means free, there’s no card required or obligation to buy).

News API - Sign up


In Machine Learning, the traditional assumption is that the data our model is applied to is the same as the data we used for training (“the same” meaning here that it comes from the same distribution). This assumption is proven false as soon as we move into the real world: many of the data sources we encounter will be very different from our original training data. In practice, this causes the performance of our model to deteriorate significantly.

Domain adaptation is a prominent approach to transfer learning that can help to bridge this discrepancy between the training and test data. Domain adaptation methods typically seek to identify features that are shared between the domains or learn representations that are general enough to be useful for both domains. In this blog post, I will discuss the motivation for, and the findings of, the recent paper that I published with Barbara Plank. In it, we outline a complementary approach to domain adaptation: rather than learning a model that can adapt between the domains, we learn to select data that is useful for training our model.

Preventing Negative Transfer

The main motivation behind selecting data for transfer learning is to prevent negative transfer. Negative transfer occurs if the information from our source training data is not only unhelpful but actually counter-productive for doing well on our target domain. The classic example for negative transfer comes from sentiment analysis: if we train a model to predict the sentiment of book reviews, we can expect the model to do well on domains that are similar to book reviews. Transferring a model trained on book reviews to reviews of electronics, however, results in negative transfer, as many of the terms our model learned to associate with a certain sentiment for books, e.g. “page-turner”, “gripping”, or — worse — “dangerous” and “electrifying”, will be meaningless or have different connotations for electronics reviews.

In the classic scenario of adapting from one source to one target domain, the only thing we can do about this is to create a model that is capable of disentangling these shifts in meaning. However, adapting between two very dissimilar domains still fails often or leads to painfully poor performance.

In the real world, we typically have access to multiple data sources. In this case, we can train our model only on the data that is most helpful for our target domain. It is unclear, however, how best to determine the helpfulness of source data with respect to a target domain. Existing work generally relies on measures of similarity between the source and the target domain.

Bayesian Optimization for Data Selection

Our hypothesis is that the best way to select training data for transfer learning depends on the task and the target domain. In addition, while existing measures only consider data in relation to the target domain, we also argue that some training examples are inherently more helpful than others.

For these reasons, we propose to learn a data selection measure for transfer learning. We do this using Bayesian Optimization, a framework that has been used successfully to optimize hyperparameters in neural networks and which can be used to optimize any black-box function. We learn this function by defining several features relating to the similarity of the training data to the target domain as well as to its diversity. Over the course of several iterations, the data selection model then learns the importance of each of those features for the relevant task.
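To make the idea of a learned data selection measure more concrete, here is a minimal Python sketch. It assumes the measure is a weighted sum of features, and it uses two simple stand-in features of our own choosing: word overlap with the target domain (Jaccard similarity) and type-token ratio as a crude diversity signal. The function names, the toy data, and the hand-picked weights are all hypothetical. In the approach described above, the weights would not be set by hand but tuned by Bayesian Optimization over several iterations.

```python
def jaccard_similarity(example_tokens, target_vocab):
    """Word overlap between one training example and the target-domain vocabulary."""
    example_vocab = set(example_tokens)
    if not example_vocab:
        return 0.0
    return len(example_vocab & target_vocab) / len(example_vocab | target_vocab)


def type_token_ratio(example_tokens):
    """A simple diversity feature: distinct words divided by total words."""
    if not example_tokens:
        return 0.0
    return len(set(example_tokens)) / len(example_tokens)


def select_examples(source_examples, target_vocab, weights, n):
    """Score each source example as a weighted sum of its features and keep the top n."""
    w_similarity, w_diversity = weights
    scored = []
    for tokens in source_examples:
        score = (w_similarity * jaccard_similarity(tokens, target_vocab)
                 + w_diversity * type_token_ratio(tokens))
        scored.append((score, tokens))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [tokens for _, tokens in scored[:n]]


# Toy data: pick the two source sentences most useful for an electronics target domain
target_vocab = {"battery", "screen", "charger", "great", "poor"}
source_examples = [
    "a gripping page-turner".split(),
    "great screen and great battery".split(),
    "poor charger".split(),
]
selected = select_examples(source_examples, target_vocab, weights=(1.0, 0.1), n=2)
print(selected)
```

With these hand-set weights the book-style sentence is dropped, because it shares no vocabulary with the target domain. An optimizer would instead try different weight vectors and keep the ones that lead to the best downstream model.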

Evaluation & Conclusion

We evaluate our approach on three tasks: sentiment analysis, part-of-speech tagging, and dependency parsing. We compare it to random selection as well as existing methods that select either the most similar source domain or the most similar training examples.

For sentiment analysis on reviews, training on the most similar domain is a strong baseline as review categories are clearly delimited. We significantly improve upon this baseline and demonstrate that diversity complements similarity. We even achieve performance competitive with a state-of-the-art domain adaptation approach, despite not performing any adaptation.

We observe smaller but consistent improvements for part-of-speech tagging and dependency parsing. Lastly, we evaluate how well learned measures transfer across models, tasks, and domains. We find that a data selection measure can be learned with a simpler model, which is used as a proxy for a state-of-the-art model. Transfer across domains is robust, while transfer across tasks holds, as one would expect, for related tasks such as POS tagging and parsing, but fails for dissimilar tasks, e.g. parsing and sentiment analysis.

In the paper, we demonstrate the importance of selecting relevant data for transfer learning. We show that taking into account task and domain-specific characteristics and learning an appropriate data selection measure outperforms off-the-shelf metrics. We find that diversity complements similarity in selecting appropriate training data and that learned measures can be transferred robustly across models, domains, and tasks.

This work will be presented at the 2017 Conference on Empirical Methods in Natural Language Processing. More details can be found in the paper here.

Text Analysis API - Sign up


Every day, we generate huge amounts of text online, creating vast quantities of data about what is happening in the world and what people think. All of this text data is an invaluable resource that can be mined in order to generate meaningful business insights for analysts and organizations. However, analyzing all of this content isn’t easy, since converting text produced by people into structured information to analyze with a machine is a complex task. In recent years though, Natural Language Processing and Text Mining have become a lot more accessible for data scientists, analysts, and developers alike.

There is a massive amount of resources, code libraries, services, and APIs out there which can all help you embark on your first NLP project. For this how-to post, we thought we’d put together a three-step, end-to-end guide to your first introductory NLP project. We’ll start from scratch by showing you how to build a corpus of language data and how to analyze this text, and then we’ll finish by visualizing the results.

We’ve split this post into 3 steps. Each of these steps will do two things: show a core task that will get you familiar with NLP basics, and also introduce you to some common APIs and code libraries for each of the tasks. The tasks we’ve selected are:

  1. Building a corpus — using Tweepy to gather sample text data from Twitter’s API.
  2. Analyzing text — analyzing the sentiment of a piece of text with our own SDK.
  3. Visualizing results — how to use Pandas and matplotlib to see the results of your work.

Please note: This guide is aimed at developers who are new to NLP and anyone with a basic knowledge of how to run a script in Python. If you don’t want to write code, take a look at the blog posts we’ve put together on how to use our RapidMiner extension or our Google Sheets Add-on to analyze text.


Step 1. Build a Corpus

You can build your corpus from anywhere — maybe you have a large collection of emails you want to analyze, a collection of customer feedback in NPS surveys that you want to dive into, or maybe you want to focus on the voice of your customers online. There are lots of options open to you, but for the purpose of this post we’re going to use Twitter as our focus for building a corpus. Twitter is a very useful source of textual content: it’s easily accessible, it’s public, and it offers an insight into a huge volume of text that contains public opinion.

Accessing the Twitter Search API using Python is pretty easy. There are lots of libraries available, but our favourite option is Tweepy. In this step, we’re going to use Tweepy to ask the Twitter API for 500 of the most recent Tweets that contain our search term, and then we’ll write the Tweets to a text file, with each Tweet on its own line. This will make it easy for us to analyze each Tweet separately in the next step.

You can install Tweepy using pip:

pip install tweepy

Once completed, open a Python shell to double-check that it’s been installed correctly:

>>> import tweepy

First, you need to get permission from Twitter to gather Tweets from the Search API, so you need to sign up as a developer to get your consumer keys and access tokens, which should take you three or four minutes. Next, you need to build your search query by adding your search term to the q = '' field. You will also need to add some further parameters like the language, the number of results you want returned, and the type of results to return. You can get very specific about what you want to search for on Twitter; to make a more complicated query, take a look at the list of operators you can use with the API in the Search API introduction.

Fill your credentials and your query into this script:

## import the libraries
import tweepy, codecs

## fill in your Twitter credentials
consumer_key = 'your consumer key here'
consumer_secret = 'your consumer secret key here'
access_token = 'your access token here'
access_token_secret = 'your access token secret here'

## let Tweepy set up an instance of the REST API
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

## fill in your search query and store your results in a variable
## (api.search is the Tweepy 3.x name; Tweepy 4 renamed it to api.search_tweets)
## note: Twitter may return fewer Tweets per request than the count you ask for
results = api.search(q = "your search term here", lang = "en", result_type = "recent", count = 1000)

## use the codecs library to write the text of the Tweets to a .txt file
file = codecs.open("your text file name here.txt", "w", "utf-8")
for result in results:
    file.write(result.text)
    file.write("\n")
file.close()

You can see in the script that we are writing result.text to a .txt file and not simply the result, which is what the API is returning to us. APIs that return language data from social media or online journalism sites usually return lots of metadata along with your results. To do this, they format their output in JSON, which is easy for machines to read.

For example, in the script above, every “result” is its own JSON object, with “text” being just one field — the one that contains the Tweet text. Other fields in the JSON file contain metadata like the location or timestamp of the Tweet, which you can extract for a more detailed analysis.

To access the rest of the metadata, we’d need to write to a JSON file, but for this project we’re just going to analyze the text of people’s Tweets. So in this case, a .txt file is fine, and our script will just forget the rest of the metadata once it finishes. If you want to take a look at the full JSON results, print everything the API returns to you instead of just result.text.
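Here is one way to look at the full payload, as a rough sketch rather than the exact snippet from the original post. It relies on the `_json` attribute that Tweepy attaches to each result it parses. The `dump_results` helper and the `fake_results` stand-in are hypothetical names we made up so the example runs without Twitter credentials; with real data you would pass in the `results` variable from the script above.

```python
import json
from types import SimpleNamespace


def dump_results(results):
    """Return each result's full JSON payload as a pretty-printed string."""
    return [json.dumps(result._json, indent=2, sort_keys=True) for result in results]


# Stand-in for what the search call returns, so the sketch runs offline
fake_results = [
    SimpleNamespace(_json={"text": "An example Tweet", "lang": "en"}),
]

for payload in dump_results(fake_results):
    print(payload)
```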

This is also why we used the codecs module: to avoid any formatting issues when the script reads the JSON results and writes utf-8 text.
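One caveat with the one-Tweet-per-line file: a Tweet can itself contain line breaks, which would split it across several lines and confuse the next step. A small, optional safeguard is to collapse whitespace before writing. The helper name `to_single_line` is our own, and this is just a sketch of one way to do it.

```python
def to_single_line(text):
    """Collapse newlines, tabs and repeated spaces so the text fits on one line."""
    return " ".join(text.split())


example = "Breaking news:\n\nsomething happened\ttoday  "
print(to_single_line(example))
```

In the script from this step, that would mean writing `to_single_line(result.text)` to the file rather than `result.text`.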

Step 2. Analyze Sentiment

Once we’ve collected the text of the Tweets that we want to analyze, we can use more advanced NLP tools to start extracting information from it. Sentiment analysis is a great example of this, since it tells us whether people were expressing positive, negative, or neutral sentiment in the text that we have.

For sentiment analysis, we’re going to use our own AYLIEN Text API. Just like with the Twitter Search API, you’ll need to sign up for the free plan to grab your API key (don’t worry — free means free permanently. There’s no credit card required, and we don’t harass you with promotional stuff!). This plan gives you 1,000 calls to the API per month free of charge.

Again, you can install using pip:

pip install aylien-apiclient

Then make sure the SDK has installed correctly from your Python shell:

>>>from aylienapiclient import textapi

Once you’ve got your App key and Application ID, insert them into the code below to get started with your first call to the API from the Python shell (we also have extensive documentation in 7 popular languages). Our API lets you make your first call to the API with just four lines of code:

>>>from aylienapiclient import textapi
>>>client = textapi.Client('Your_app_ID', 'Your_application_key')
>>>sentiment = client.Sentiment({'text': 'enter some of your own text here'})
>>>print(sentiment)

This will return JSON results to you with metadata, just like our results from the Twitter API.

So now we need to analyze our corpus from step 1. To do this, we need to analyze every Tweet separately. The script below uses the io module to open up a new .csv file and write the column headers “Tweet” and “Sentiment”, and then it opens and reads the .txt file containing our Tweets. Then, for each Tweet in the .txt file it sends the text to the AYLIEN API, extracts the sentiment prediction from the JSON that the AYLIEN API returns, and writes this to the .csv file beside the Tweet itself.

This will give us a .csv file with two columns — the text of a Tweet and the sentiment of the Tweet, as predicted by the AYLIEN API. We can look through this file to verify the results, and also visualize our results to see some metrics on how people felt about whatever our search query was.

from aylienapiclient import textapi
import csv, io

## Initialize a new client of AYLIEN Text API
client = textapi.Client("your_app_ID", "your_app_key")

with io.open('Trump_Tweets.csv', 'w', encoding='utf8', newline='') as csvfile:
    csv_writer = csv.writer(csvfile)
    ## Write the column headers
    csv_writer.writerow(["Tweet", "Sentiment"])
    with io.open("Trump.txt", 'r', encoding='utf8') as f:
        for tweet in f.readlines():
            ## Remove extra spaces or newlines around the text
            tweet = tweet.strip()

            ## Reject tweets which are empty so you don't waste your API credits
            if len(tweet) == 0:
                continue

            ## Make call to AYLIEN Text API
            sentiment = client.Sentiment({'text': tweet})

            ## Write the sentiment result into csv file
            csv_writer.writerow([sentiment['text'], sentiment['polarity']])

You might notice on the final line of the script that when the script goes to write the Tweet text to the file, we’re actually writing the Tweet as it is returned by the AYLIEN API, rather than the Tweet from the .txt file. They are both identical pieces of text, but we’ve chosen to write the text from the API just to make sure we’re reading the exact text that the API analyzed. This is just to make it clearer if we’ve made an error somehow.

Step 3. Visualize your Results

So far we’ve used an API to gather text from Twitter, and used our Text Analysis API to analyze whether people were speaking positively or negatively in their Tweets. At this point, you have a couple of options for what to do with the results. You can feed this structured information about sentiment into whatever solution you’re building, which could be anything from a simple social listening app to an automated report on the public reaction to a campaign. You could also use the data to build informative visualizations, which is what we’ll do in this final step.

For this step, we’re going to use matplotlib to visualize our data and Pandas to read the .csv file, two Python libraries that are easy to get up and running. You’ll be able to create a visualization from the command line or save it as a .png file.

Install both using pip:

pip install matplotlib
pip install pandas

The script below opens up our .csv file, and then uses Pandas to read the column titled “Sentiment”. It uses Counter to count how many times each sentiment appears, and then matplotlib plots Counter’s results to a color-coded pie chart (you’ll need to enter your search query to the “yourtext” variable for presentation reasons).

## import the libraries
import matplotlib.pyplot as plt
import pandas as pd
from collections import Counter

## open up your csv file with the sentiment results
with open('your_csv_file_from_step_2', 'r', encoding = 'utf8') as csvfile:
    ## use Pandas to read the "Sentiment" column
    df = pd.read_csv(csvfile)
    sent = df["Sentiment"]

    ## use Counter to count how many times each sentiment appears
    ## and save each as a variable
    counter = Counter(sent)
    positive = counter['positive']
    negative = counter['negative']
    neutral = counter['neutral']

## declare the variables for the pie chart, using the Counter variables for "sizes"
labels = 'Positive', 'Negative', 'Neutral'
sizes = [positive, negative, neutral]
colors = ['green', 'red', 'grey']
yourtext = "Your Search Query from Step 1"

## use matplotlib to plot the chart, with the number of Tweets in the title
plt.pie(sizes, labels = labels, colors = colors, shadow = True, startangle = 90)
plt.title("Sentiment of " + str(len(sent)) + " Tweets about " + yourtext)
plt.show()

If you want to save your chart to a .png file instead of just showing it, replace plt.show() on the last line with plt.savefig('your chart name.png'). Below is the visualization we ended up with (we searched “Trump” in step 1).

[Image: pie chart of the sentiment of Tweets about “Trump”]

If you run into any issues with these scripts, big or small, please leave a comment below and we’ll look into it. We always try to anticipate any problems our own users might run into, so be sure to let us know!

That concludes our introductory Text Mining project with Python. We hope it gets you up and running with the libraries and APIs, and that it gives you some ideas about subjects that would interest you. With the world producing content on such a large scale, the only obstacle holding you back from an interesting project is your own imagination!

Happy coding!



Chatbots are a hot topic in tech at the moment. They’re at the center of a shift in how we communicate, so much so that they are central to the strategy and direction of major tech companies like Microsoft and Facebook. According to Satya Nadella, CEO of Microsoft, “Chatbots are the new apps”.

So why exactly have chatbots become so popular?

Their rise in popularity is partly connected to the resurgence of AI and its applications in industry, but it’s also down to our insatiable appetite for on-demand service and our shift to messaging apps over email and phone. A recent study found that 44% of US consumers would prefer to use chatbots over humans for customer relations, and 61% of those surveyed said they interact with a chatbot at least once a month. This is because they suit today’s consumers’ needs – they can respond to customer queries instantly, day or night. 

Large brands and tech companies have recognised this shift in customer needs and now rely on messenger and intelligent assistants to provide a better experience for their customers. This is especially true since Facebook opened up its Messenger platform to third-party bots last year.

So while the adoption of intelligent assistants and chatbots is growing at a colossal rate, contrary to popular belief and media hype, they’re actually nothing new. We’ve had them for over fifty years in the Natural Language Processing community and they’re a great example of the core mission of NLP  – programming computers to understand how humans communicate.

In this blog, we’re going to show three different chatbots and let you interact with each bot so you can see how they have advanced. We’ll give some slightly technical explanations of how each chatbot works so you can see how NLP works under the hood.

The Chatbots

The three chatbots we’ve gathered on this page are:

  1. ELIZA – a chatbot from 1966 that was the first well-known chatbot in the NLP community
  2. ALICE – a chatbot from the late 1990s that inspired the movie Her
  3. Neuralconvo – a Deep Learning chatbot from 2016 that learned to speak from movie scripts


We should mention here that these three bots are all “chit-chat” bots, as opposed to “task-oriented” bots. Whereas task-oriented bots are built for a specific use like checking if an item is in stock or ordering a pizza, a chit-chat bot has no function other than imitating a real person for you to chat with. By seeing how chit-chat bots have advanced, you’re going to see how the NLP community has used different methods to replicate human communication.

ELIZA – A psychotherapy bot

The first version of ELIZA was finished in 1966 by Joseph Weizenbaum, a brilliant, eccentric MIT professor considered one of the fathers of AI (and who is the subject of a great documentary). ELIZA emulates a psychotherapist, one that Weizenbaum’s colleagues trusted enough to divulge highly personal information to, even after they knew it was a computer program. Weizenbaum was so shocked by how readily his colleagues believed ELIZA could help them that he spent the rest of his life advocating for social responsibility in AI.


But ELIZA only emulates a psychotherapist because it uses clever ways to return your text as a question, just like a real psychotherapist would. This clever tactic means ELIZA can respond to a question that it doesn’t understand with a relatively simple process of rephrasing the input as a question, so the user is kept in conversation.

Just like any algorithm, a chatbot works from rules that tell it how to take an input and produce an output. In the case of chatbots, the input is the text you supply, and the output is the text it returns to you as a response. Looking at the responses you get from ELIZA, you’ll see two rough categories of rules:

  • on a syntactic level, it transfers personal pronouns (“my” to “your,” and vice versa).
  • to imitate semantic understanding (i.e. to appear to understand the meaning of what you are typing), it has been programmed to recognize certain keywords and return phrases that have been marked as suitable responses to the input. For instance, if you input “I want to ___” it will return “What would it mean to you if you ___”.
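The two kinds of rule above can be sketched in a few lines of Python. This is a toy illustration of the idea, not Weizenbaum’s original script: the “I want to” rule comes from the example above, while the “I feel” rule, the fallback reply, and all of the names are our own additions.

```python
import re

# Swap first- and second-person words so the input can be reflected back
PRONOUN_SWAPS = {
    "i": "you", "me": "you", "my": "your", "am": "are",
    "you": "I", "your": "my",
}

# Keyword rules: a pattern to look for and a template for the reply
KEYWORD_RULES = [
    (re.compile(r"i want to (.+)", re.IGNORECASE), "What would it mean to you if you {0}?"),
    (re.compile(r"i feel (.+)", re.IGNORECASE), "Why do you feel {0}?"),
]


def swap_pronouns(text):
    """Apply the syntactic rule: exchange "my" and "your", "me" and "you", and so on."""
    words = text.lower().split()
    return " ".join(PRONOUN_SWAPS.get(word, word) for word in words)


def respond(user_input):
    """Apply the first matching keyword rule, or fall back to a reflected question."""
    cleaned = user_input.strip().rstrip(".!?")
    for pattern, template in KEYWORD_RULES:
        match = pattern.search(cleaned)
        if match:
            return template.format(swap_pronouns(match.group(1)))
    # No keyword matched: reflect the statement back as a question
    return "Why do you say " + swap_pronouns(cleaned) + "?"


print(respond("I want to quit my job."))
print(respond("My boss ignores me"))
```

Even this tiny version keeps the conversation going without understanding anything, which is exactly the trick described above.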

Try and figure out some of ELIZA’s limits for yourself by asking it questions and trying to figure out why it’s returning each of its responses. Remember: it’s from the 1960s, when color televisions were the height of consumer technology.


This is a pure Natural Language Processing approach to building a chatbot: the bot understands human language by the rules mentioned above, which are basically grammar rules programmed into a computer. This achieves impressive results, but if you wanted to make ELIZA more human-like by pure NLP methods you would have to add more and more grammatical rules, and because grammar is complicated and contradictory, you would quickly end up with a sort of “rule spaghetti,” which wouldn’t work. This approach is in contrast with Machine Learning approaches to chatbots (and natural language in general), where an algorithm will try to guess the correct response based on observations it has made on other conversations. You can see this in action in the final chatbot, Neuralconvo. But first, ALICE.

ALICE – The star of the movie Her

Fast forward from the 1960s to the late 1990s and you meet ALICE, the first well-known chatbot that people could interact with online, and one that developed something of a cult reputation. Director Spike Jonze said that chatting with ALICE in the early 2000s first put the idea for his 2013 film Her in his mind, a movie where a man falls in love with the AI that powers his operating system.


But just like ELIZA, this is a computer program made up of rules that take an input and produce an output. Under the hood, ALICE is an advance on ELIZA in three respects:

  • it is written in a markup language called Artificial Intelligence Markup Language (AIML), an XML dialect, which allows it to choose responses on a more abstract level
  • it contains tens of thousands of possible responses
  • it stores previous conversations with users and adds them to its database.

ALICE is an open source bot, one that anyone can download and modify or contribute to. It was originally written by Dr. Richard Wallace, and over 500 volunteers have since contributed to it, creating hundreds of thousands of lines of AIML for ALICE to reproduce in conversation.

So ALICE’s improvements on ELIZA allow for more responses that are better tailored to the text you are supplying it with. This allows ALICE to impersonate a person in general, rather than a therapist specifically. The problem here is that the shortcomings are now easier to spot: without open-ended statements and questions to fall back on, it is more noticeable when no response matches your input. Explore this for yourself below.

So even though ALICE is a more advanced chatbot than ELIZA, the output responses are still written by people, and algorithms choose which output best suits the input. Essentially, people type out the responses and write the algorithms that choose which of these responses will be returned in the hope of mimicking an actual conversation.

Improving the performance and intelligence of chatbots is a popular research area and much of the recent interest in advancing chatbots has been around Deep Learning. Applying Deep Learning to chatbots seems likely to massively improve a chatbot’s ability to interact more like a human. Whereas ELIZA and ALICE reproduce text that was originally written by a person, a Deep Learning bot creates its own text from scratch, based on human speech it has analyzed.

Neuralconvo – A Deep Learning bot

One such bot is Neuralconvo, a modern chatbot created in 2016 by Julien Chaumond and Clément Delangue, co-founders of Huggingface, which was trained using Deep Learning. Deep Learning is a method of training computers to learn patterns in data by using deep neural networks. It is enabling huge breakthroughs in computer science, particularly in AI, and more recently NLP. When applied to chatbots, Deep Learning allows programs to select a response or even to generate entirely new text.

Neuralconvo can come up with its own text because it has “learned” by reading thousands of movie scripts and recognizing patterns in the text. So when Neuralconvo reads a sentence it recognizes patterns in your text, refers back to its training to look for similar patterns, and then generates a new sentence for you that it thinks would follow your sentence if it were in the movie scripts in a conversational manner. It’s basically trying to be cool based on movies it’s seen.

The fundamental difference between ELIZA and Neuralconvo is this: whereas ELIZA was programmed to respond to specific keywords in your input with specific responses, Neuralconvo is making guesses based on probabilities it has observed in movie scripts. So there are no rules telling Neuralconvo to respond to a question a certain way, for example, but the possibilities of its answers are limitless.

Considering Neuralconvo is trained on movie scripts, you’ll see that its responses are suitably dramatic.

The exact model that is working under the hood here is based on the Sequence to Sequence architecture, which was first applied to generate dialogue by Quoc Viet Le and Oriol Vinyals. This architecture consists of two parts: the first one encodes your sentence into a vector, which is basically a code that represents the text. After the entire input text has been encoded this way, the second part then decodes that vector and produces the answer word-by-word by predicting each word that is most likely to come next.

Neuralconvo isn’t going to fool you into thinking that it is a person anytime soon, since it is just a demo of a bot trained on movie scripts. But imagine how effective a bot like this could be when trained using context-specific data, like your own SMS or WhatsApp messages. That’s what’s on the horizon for chatbots, but remember – they will still be algorithms taking your text as input, referring to rules, and returning different text as an output.

Well that sums up our lightning tour of chatbots from the 1960s to today. If you’re interested in blogs about technical topics like training AI to play Flappy Bird or why you should open-source your code, take a look at the Research section of our blog, where our research scientists and engineers post about what interests them.



Extracting insights from millions of articles at once can create a lot of value, since it lets us understand what information thousands of journalists are producing about what’s happening in the world. But extracting accurate insights depends on filtering out noise and finding relevant content. To allow our users access to relevant content, our News API analyzes thousands of news articles in near real-time and categorizes them according to what content is about.

Having content at web-scale arranged into categories provides accurate information about what the media are publishing as the stories emerge. This allows us to do two things, depending on what we want to use the API for: we can either look at a broad picture of what is being covered in the press, or we can carry out a detailed analysis of the coverage about a specific industry, organization, or event.

For this month’s roundup, we decided to do both. We’ll start with a high-level look at which news categories the media covered the most, focusing on sports content because it’s what the world’s media wrote the most about. Then we’ll dive into stories about finance, to see what insights the News API can produce for us in a business field.

The 100 categories with the highest volume of stories

The range of the subject matter contained in content published every day is staggering, which makes understanding all of this content at scale particularly difficult. However, the ability to classify new content based on well known, industry-standard taxonomies means it can be easily categorized and understood.

Our News API categorizes every article it analyzes according to two taxonomies: Interactive Advertising Bureau’s QAG taxonomy and IPTC’s Newscodes. We chose to use the IAB-QAG taxonomy, which contains just under 400 categories and subcategories, and decided to look into the top 100 categories and subcategories that the media published the most about in June. This left us with just over 1.75 million of the stories that our News API has gathered and analyzed.

Take a look at the most popular ones in the visualization below.

Note: you can interact with all of the visualizations on this blog – click on each data point for more information, and exclude the larger data points if you want to see more detail on the smaller ones.

As you can see, stories about sport accounted for the most stories published in June. It might not surprise people to see that the media publish a lot about sport, but the details you can pick out here are pretty interesting – like the fact that there were more stories about soccer than food, religion, or fashion last month.

The chart below puts the volume of stories about sports into perspective – news outlets published almost 13 times more stories about sports than they did about music.

What people wrote about sports

Knowing that people wrote so much about sport is great, but we still don’t know what people were talking about in all of this content. To find this out, we decided to dive into the stories about sports and see what the content was about – take a look at the chart below showing the most-mentioned sports sub-categories last month.

In this blog we’re only looking into stories that were in the top 100 sub-categories overall, so if your favourite sport isn’t listed below, that means it wasn’t popular enough and you’ll need to query our API for yourself to look into it (sorry, shovel racers).

You can see how soccer dominates the content about sport, even though it’s off-season for every major soccer league. To put this volume in perspective, there were more stories published about soccer than about baseball and basketball combined. Bear in mind, last month saw the MLB Draft and the NBA finals, so it wasn’t exactly a quiet month for either of these sports.

We then analyzed the stories about soccer with the News API’s entities feature to see what people, countries, and organisations people were talking about.

If you check the soccer schedules for June, you’ll see that the only major tournament taking place is the Confederations Cup, which is a competition between international teams. However, you can see above that the soccer coverage was still dominated by stories about the clubs with the largest fan bases. The most-mentioned clubs above also top the table in a Forbes analysis of clubs with the greatest social media reach among fans.


So we’ve just taken a look at what people and organizations dominated the coverage in the news categories that the media published the most in. But even though the sports category is the single most popular one, online content is so wide-ranging that sports barely accounted for 10% of the 1.75 million stories our News API crawled last month.

We thought it would be interesting to show you how to use the API to look into business fields and spot a high-level trend in the news content last month. Using the same analysis that we used on sports stories above, we decided to look at stories about finance. Below is a graph of the most-mentioned entities in stories published in June that fell into the finance category.

You can see that the US and American institutions dominate the coverage of the financial news. This is hardly surprising, considering America’s role as the main financial powerhouse in the world. But what sticks out a little here is that the Yen is the only currency entity mentioned, even though Japan isn’t mentioned as much as other countries.

To find out what kind of coverage the Yen was garnering last month, we analyzed the sentiment of the stories with “Yen” in the title to see how many contained positive, negative, or neutral sentiment.
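
As a rough sketch of this kind of tally, the snippet below counts sentiment labels for stories whose title mentions a term. It does not call the News API: the `stories` records, the `title` and `polarity` field names, and the function name are hypothetical stand-ins for stories that have already been fetched and analyzed.

```python
from collections import Counter

# Hypothetical records standing in for already-fetched stories; the
# "title" and "polarity" field names are assumptions for illustration.
stories = [
    {"title": "Yen slides after weak growth data", "polarity": "negative"},
    {"title": "Yen steadies as investors wait", "polarity": "neutral"},
    {"title": "Soccer transfer window opens", "polarity": "positive"},
    {"title": "Analysts turn bearish on the yen", "polarity": "negative"},
]


def sentiment_counts(records, term):
    """Count polarity labels for records whose title mentions `term`."""
    matching = (r for r in records if term.lower() in r["title"].lower())
    return Counter(r["polarity"] for r in matching)


counts = sentiment_counts(stories, "Yen")
```

A `Counter` returns zero for labels that never occur, so the positive, negative, and neutral totals can be read off directly for plotting.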

We can see that there is much more negative coverage here than positive coverage, so we can presume that Japan’s currency had some bad news last month, but that leaves us with a big question: why was there so much negative press about the Yen last month?

To find out, we used the keywords feature. Analyzing the keywords in stories returns more detailed information than the entities endpoint we used on the soccer content above: whereas the entities feature returns accurate information about the places, people, and organisations mentioned in stories, the keywords feature also includes the most important nouns and verbs in these stories. This extra detail is most useful when you’re diving into a specific topic; for an overview of broad news content it produces a lot of noise. Here it means that we can see a more detailed picture of the things that happened.

Take a look below at the most-mentioned keywords from stories that were talking about the Yen last month.

You can see that the keywords feature returns a different kind of result than entities: words like “year,” “week,” and “investor,” for example. If we looked at the keywords from all of the news content published in June, it would be hard to get insights because the keywords would be so general. But since we’re diving into a defined topic, we can extract some detailed insights about what actually happened.

Looking at the chart above you can probably guess for yourself what the main stories about the Yen last month involved. Most-mentioned keywords like “data,” “growth,” “GDP,” and “economy” show that Japan has had some negative data about economic growth, which explains the high volume of negative stories about the Yen. You can see below how the Yen started a sustained drop in value after June 15th, the day this economic data was announced, and our News API has tracked the continued negative sentiment.

yen to usd

These are just a couple of examples of steps our users take to automatically extract insights from content on subjects that interest them, whether it is for media monitoring, content aggregation, or any of the thousands of use cases our News API facilitates.

If you can think of any categories you’d like to extract information from using the News API, sign up for a free 14-day trial by clicking on the link below (free means free – you don’t need a credit card and there’s no obligation to purchase).

News API - Sign up


In the last post we looked at how Generative Adversarial Networks could be used to learn representations of documents in an unsupervised manner. In evaluation, we found that although the model was able to learn useful representations, it did not perform as well as an older model called DocNADE. In this post we give a brief overview of the DocNADE model, and provide a TensorFlow implementation.

Neural Autoregressive Distribution Estimation

Recent advances in neural autoregressive generative modeling have led to impressive results at modeling images and audio, as well as language modeling and machine translation. This post looks at a slightly older take on neural autoregressive models: the Neural Autoregressive Distribution Estimator (NADE) family of models.

An autoregressive model is based on the fact that any D-dimensional distribution can be factored into a product of conditional distributions in any order:

\(p(\mathbf{x}) = \prod_{d=1}^{D} p(x_d | \mathbf{x}_{<d})\)

where \(\mathbf{x}_{<d}\) represents the first \(d-1\) dimensions of \(\mathbf{x}\) in the current ordering. We can therefore create an autoregressive generative model by just parameterising all of the separate conditionals in this equation.
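
As a small worked instance of this factorisation (our own illustration, not taken from the NADE paper), for \(D = 3\) and the natural ordering we get:

\(p(x_1, x_2, x_3) = p(x_1) \, p(x_2 | x_1) \, p(x_3 | x_1, x_2)\)

Choosing a different ordering, say \((x_3, x_1, x_2)\), gives \(p(x_3) \, p(x_1 | x_3) \, p(x_2 | x_3, x_1)\), which describes the same joint distribution with a different set of conditionals to parameterise.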

One of the simpler ways to do this is to take a sequence of binary values, and assume that the output at each timestep is just a linear combination of the previous values. We can then pass this weighted sum through a sigmoid to get the output probability for each timestep. This sort of model is called a fully-visible sigmoid belief network (FVSBN):


A fully visible sigmoid belief network. Figure taken from the NADE paper.

Here we have binary inputs \(v\) and generated binary outputs \(\hat{v}\). \(\hat{v_3}\) is produced from the inputs \(v_1\) and \(v_2\).
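
To make the FVSBN concrete, here is a minimal plain-Python sketch of its log-likelihood. The function name and the `weights[d][k]` layout are assumptions chosen for this example; it is meant to show the structure of the model, not an efficient implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fvsbn_log_likelihood(v, weights, biases):
    """Log-probability of a binary sequence v under an FVSBN.

    weights[d][k] is the weight from input k to output d; only k < d is used.
    """
    total = 0.0
    for d, v_d in enumerate(v):
        # Linear combination of the preceding values only.
        activation = biases[d] + sum(weights[d][k] * v[k] for k in range(d))
        p_one = sigmoid(activation)
        total += math.log(p_one if v_d == 1 else 1.0 - p_one)
    return total


# With all-zero parameters every conditional probability is 0.5.
v = [1, 0, 1]
zero_weights = [[0.0] * 3 for _ in range(3)]
ll = fvsbn_log_likelihood(v, zero_weights, [0.0] * 3)
```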

NADE can be seen as an extension of this, where instead of a linear parameterisation of each conditional, we pass the inputs through a feed-forward neural network:


Neural Autoregressive Distribution Estimator. Figure taken from the NADE paper.

Specifically, each conditional is parameterised as:

\(p(x_d | \mathbf{x_{<d}}) = \text{sigm}(b_d + \mathbf{V}_{d,:} \mathbf{h}_d)\)

\(\mathbf{h}_d = \text{sigm}(c + \mathbf{W}_{:,<d} \mathbf{x}_{<d})\)

where \(\mathbf{W}\), \(\mathbf{V}\), \(b\) and \(c\) are learnable parameters of the model. This can then be trained by minimising the negative log-likelihood of the data.

When compared to the FVSBN there is also additional weight sharing in the input layer of NADE: each input element uses the same parameters when computing the various output elements. This parameter sharing was inspired by the Restricted Boltzmann Machine, but also has some computational benefits – at each timestep we only need to compute the contribution of the new sequence element (we don’t need to recompute all of the preceding elements).
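
The computational benefit of the weight sharing can be seen in a small plain-Python sketch of the NADE log-likelihood. This is an illustrative sketch under our own conventions (nested lists, with `W` of shape [H][D] and `V` of shape [D][H]), not the reference implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def nade_log_likelihood(x, W, V, b, c):
    """Log-probability of a binary vector x under NADE.

    W has shape [H][D], V has shape [D][H], b has length D, c has length H.
    """
    H = len(c)
    a = list(c)  # running pre-activation: c + W[:, <d] x[<d]
    total = 0.0
    for d, x_d in enumerate(x):
        h = [sigmoid(a_j) for a_j in a]
        p_one = sigmoid(b[d] + sum(V[d][j] * h[j] for j in range(H)))
        total += math.log(p_one if x_d == 1 else 1.0 - p_one)
        # Weight sharing: only the new element's contribution is added,
        # so the preceding elements are never recomputed.
        for j in range(H):
            a[j] += W[j][d] * x_d
    return total
```

Because each step reuses the running pre-activation `a`, evaluating all \(D\) conditionals costs on the order of \(H \times D\) operations rather than \(H \times D^2\).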

Modeling documents with NADE

In the standard NADE model, the input and outputs are binary variables. In order to work with sequences of text, the DocNADE model extends NADE by considering each element in the input sequence to be a multinomial observation – or in other words one of a predefined set of tokens (from a fixed vocabulary). Likewise, the output must now also be multinomial, and so a softmax layer is used at the output instead of a sigmoid. The DocNADE conditionals are then given by:

\(p(x_d | \mathbf{x}_{<d}) =
\frac{\text{exp} (b_{x_d} + \mathbf{V}_{x_d,:} \mathbf{h}_d) }
{\sum_w \text{exp} (b_w + \mathbf{V}_{w,:} \mathbf{h}_d) }\)

\(\mathbf{h}_d = \text{sigm}\Big(c + \sum_{k<d} \mathbf{W}_{:,x_k} \Big)\)

An additional type of parameter sharing has been introduced in the input layer: each element will have the same weights no matter where it appears in the sequence (so if the word “cat” appears at input positions 2 and 10, it will use the same weights each time).

There is another way to look at this, however. We now have a single set of parameters for each word no matter where it appears in the sequence, and there is a common name for this architectural pattern: a word embedding. So we can view DocNADE as a way of constructing word embeddings, but with a different set of constraints than we might be used to from models like Word2Vec. For each input in the sequence, DocNADE uses the sum of the embeddings from the previous timesteps (passed through a sigmoid nonlinearity) to predict the word at the next timestep. The final representation of a document is just the value of the hidden layer at the final timestep (or in other words, the sum of the word embeddings passed through a nonlinearity).

There is one more constraint that we have not yet discussed – the sequence order. Instead of training on sequences of words in the order that they appear in the document, as we do when training a language model for example, DocNADE trains on random permutations of the words in a document. We therefore get embeddings that are useful for predicting what words we expect to see appearing together in a full document, rather than focusing on patterns that arise due to syntax and word order (or focusing on smaller contexts around each word).
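
The DocNADE conditionals and the random ordering can be sketched in plain Python as follows. This is a minimal illustration under our own assumptions: the function names are hypothetical, a full softmax is used, and `W` is stored as one embedding row per word (shape [vocab][H]), i.e. the transpose of \(\mathbf{W}\) in the equations above.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def docnade_log_likelihood(doc, W, V, b, c):
    """Log-probability of a list of word ids; W[w] is the embedding of word w."""
    H = len(c)
    acc = [0.0] * H  # sum of the embeddings of the words seen so far
    total = 0.0
    for word in doc:
        h = [sigmoid(c[j] + acc[j]) for j in range(H)]
        logits = [b[w] + sum(V[w][j] * h[j] for j in range(H))
                  for w in range(len(b))]
        log_norm = math.log(sum(math.exp(l) for l in logits))
        total += logits[word] - log_norm  # log-softmax of the observed word
        for j in range(H):
            acc[j] += W[word][j]
    return total


def training_order(doc, rng):
    """Return a random permutation of the document's words."""
    shuffled = list(doc)
    rng.shuffle(shuffled)
    return shuffled
```

During training, one would call `training_order(doc, rng)` before computing the likelihood, so that the model sees a different ordering of the same bag of words each time.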

An Overview of the TensorFlow code

The full source code for our TensorFlow implementation of DocNADE is available on GitHub; here we will just highlight some of the more interesting parts.

First we do an embedding lookup for each word in our input sequence (x). We initialise the embeddings to be uniform in the range [0, 1.0 / (vocab_size * hidden_size)], which is taken from the original DocNADE source code. I don’t think that this is mentioned anywhere else, but we did notice a slight performance bump when using this instead of the default TensorFlow initialisation.

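with tf.device('/cpu:0'):
    max_embed_init = 1.0 / (params.vocab_size * params.hidden_size)
    # The variable name 'embedding' is an assumption; the initialiser
    # follows the uniform range [0, max_embed_init] described above.
    W = tf.get_variable(
        'embedding',
        [params.vocab_size, params.hidden_size],
        initializer=tf.random_uniform_initializer(maxval=max_embed_init))
    self.embeddings = tf.nn.embedding_lookup(W, x)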

Next we compute the pre-activation for each input element in our sequence. We transpose the embedding sequence so that the sequence length elements are now the first dimension (instead of the batch), then we use the higher-order tf.scan function to apply sum_embeddings to each sequence element in turn. This replaces each embedding with the sum of that embedding and the previously summed embeddings.

def sum_embeddings(previous, current):
    return previous + current
h = tf.scan(sum_embeddings, tf.transpose(self.embeddings, [1, 2, 0]))
h = tf.transpose(h, [2, 0, 1])
h = tf.concat([
    tf.zeros([batch_size, 1, params.hidden_size], dtype=tf.float32), h
], axis=1)
h = h[:, :-1, :]
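
For a single sequence, the combined effect of the scan, the zero-padding, and the final slice is an exclusive running sum: position \(d\) receives the sum of the embeddings at positions before \(d\). The plain-Python sketch below illustrates this; the function name is our own and the snippet is independent of TensorFlow.

```python
from itertools import accumulate


def add_vectors(previous, current):
    return [p + q for p, q in zip(previous, current)]


def exclusive_prefix_sums(embeddings):
    """For each position d, return the sum of embeddings at positions < d."""
    zero = [0.0] * len(embeddings[0])
    inclusive = list(accumulate(embeddings, add_vectors))
    # Prepend a zero vector and drop the last element, as in the code above.
    return [zero] + inclusive[:-1]


sums = exclusive_prefix_sums([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
# sums == [[0.0, 0.0], [1.0, 2.0], [4.0, 6.0]]
```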

We then initialise the bias terms, prepend a zero vector to the input sequence (so that the first element is generated from just the bias term), and apply the nonlinearity.

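# The variable name and zero initialiser are assumptions (the original call was truncated).
bias = tf.get_variable('bias', [params.hidden_size], initializer=tf.constant_initializer(0))
h = tf.tanh(h + bias)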

Finally we compute the sequence loss, which is masked according to the length of each sequence in the batch. Note that for optimisation, we do not normalise this loss by the length of each document. This leads to slightly better results as mentioned in the paper, particularly for the document retrieval evaluation (discussed below).

h = tf.reshape(h, [-1, params.hidden_size])
logits = linear(h, params.vocab_size, 'softmax')
loss = masked_sequence_cross_entropy_loss(x, seq_lengths, logits)
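
The masking idea can be illustrated with a plain-Python sketch. This is not the body of masked_sequence_cross_entropy_loss from our repository: the function name below is hypothetical, and averaging over the batch is an assumption made for this example.

```python
def masked_sequence_loss(token_losses, seq_lengths):
    """Sum per-token losses over real positions, then average over the batch.

    token_losses[i][t] is the cross-entropy at position t of sequence i.
    """
    per_document = []
    for losses, length in zip(token_losses, seq_lengths):
        # Positions at or beyond `length` are padding and contribute nothing.
        per_document.append(sum(losses[:length]))
    # Averaged over documents, but deliberately not divided by document length.
    return sum(per_document) / len(per_document)


loss = masked_sequence_loss([[1.0, 2.0, 9.0], [0.5, 0.5, 0.5]], [2, 3])
# loss == 2.25: the padded 9.0 in the first sequence is ignored.
```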


As DocNADE computes the probability of the input sequence, we can measure how well it is able to generalise by computing the probability of a held-out test set. In the paper the actual metric that they use is the average perplexity per word, which for a test set of \(N\) documents, where \(x_t\) is the \(t\)-th document and \(|x_t|\) is its length in words, is given by:

\(\text{exp} \big(-\frac{1}{N} \sum_{t} \frac{1}{|x_t|} \log p(x_t) \big)\)
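
As a quick sketch of how this metric can be computed from per-document log-probabilities, consider the following plain-Python function. The function name and argument layout are our own choices for illustration.

```python
import math


def average_perplexity(log_probs, lengths):
    """exp(-(1/N) * sum_t log p(x_t) / |x_t|) over N test documents."""
    n_docs = len(log_probs)
    mean_per_word = sum(lp / n for lp, n in zip(log_probs, lengths)) / n_docs
    return math.exp(-mean_per_word)


# Sanity check: a model that assigns probability 1/2000 to every word
# should have a perplexity equal to the vocabulary size of 2000.
lengths = [10, 250]
log_probs = [n * math.log(1.0 / 2000) for n in lengths]
ppl = average_perplexity(log_probs, lengths)
```

The sanity check gives a useful reference point: a uniform guess over a 2000-word vocabulary scores a perplexity of about 2000, so any trained model should come in well below that.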

As in the paper, we evaluate DocNADE on the same (small) 20 Newsgroups dataset that we used in our previous post, which consists of a collection of around 19,000 postings to 20 different newsgroups. The published version of DocNADE uses a hierarchical softmax on this dataset, despite the fact that they use a small vocabulary size of 2000. There is not much need to approximate a softmax of this size when training relatively small models on modern GPUs, so here we just use a full softmax. This makes a large difference in the reported perplexity numbers: the published implementation achieves a test perplexity of 896, but with the full softmax we can get this down to 579. To show how big an improvement this is, the following table lists perplexity values on this task for models that have been published much more recently:

Model/paper                     Perplexity
DocNADE (original)              896
Neural Variational Inference    836
DeepDocNADE                     835
DocNADE with full softmax       579

One additional change from the evaluation in the paper is that we evaluate the average perplexity over the full test set (in the paper they just take a random sample of 50 documents).

We were expecting to see an improvement due to the use of the full softmax, but not an improvement of quite this magnitude. Even when using a sampled softmax on this task instead of the full softmax, we see some big improvements over the published results. This suggests that the hierarchical softmax formulation that was used in the original paper was a relatively poor approximation of the true softmax (although it’s possible that there is a bug somewhere in our implementation; if you find any issues, please let us know).

We also see an improvement on the document retrieval evaluation results with the full softmax:


For the retrieval evaluation, we first create vectors for every document in the dataset. We then use the held-out test set vectors as “queries”, and for each query we find the closest N documents in the training set (by cosine similarity). We then measure what percentage of these retrieved training documents have the same newsgroup label as the query document. We then plot a curve of the retrieval performance for different values of N.
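
The retrieval metric for a single query can be sketched in plain Python as below. The function names, the toy vectors, and the labels are made up for illustration; a real evaluation would average this value over all test queries for each value of N.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def precision_at_n(query_vec, query_label, train_vecs, train_labels, n):
    """Fraction of the n nearest training documents sharing the query's label."""
    ranked = sorted(
        range(len(train_vecs)),
        key=lambda i: cosine(query_vec, train_vecs[i]),
        reverse=True,
    )
    top = ranked[:n]
    return sum(train_labels[i] == query_label for i in top) / len(top)


# Toy document vectors and newsgroup labels (hypothetical values).
train_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
train_labels = ["sci", "sci", "rec"]
p2 = precision_at_n([1.0, 0.05], "sci", train_vecs, train_labels, 2)
p3 = precision_at_n([1.0, 0.05], "sci", train_vecs, train_labels, 3)
```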

Note: for working with larger vocabularies, the current implementation supports approximating the softmax using the sampled softmax.


We took another look at DocNADE, noting that it can be viewed as another way to train word embeddings. We also highlighted the potential for large performance boosts with older models due simply to modern computational improvements, in this case because it is no longer necessary to approximate the softmax for smaller vocabularies. The full source code for the model is available on GitHub.