
Twitter users around the world post around 350,000 new Tweets every minute, or roughly 6,000 140-character pieces of information every second. Twitter is now a hugely valuable resource from which you can extract insights by using text mining tools like sentiment analysis.

Within the social chatter being generated every second, there are vast amounts of hugely valuable insights waiting to be extracted. With sentiment analysis, we can generate insights about consumers’ reactions to announcements, opinions on products or brands, and even track opinion about events as they unfold. For this reason, you’ll often hear sentiment analysis referred to as “opinion mining”.

With this in mind, we decided to put together a useful tool built on a single Python script to help you get started mining public opinion on Twitter.

What the script does

Using this one script you can gather Tweets with the Twitter API, analyze their sentiment with the AYLIEN Text Analysis API, and visualize the results with matplotlib – all for free. The script also saves the results neatly in a CSV file to make the reporting and analysis a little bit smoother.

Here are some of the cool things you can do with this script:

  • Understand the public’s reaction to news or events on Twitter
  • Measure the voice of your customers and their opinions on you or your competitors
  • Generate sales leads by identifying negative mentions of your competitors

You can see the script running a sample analysis of 50 Tweets mentioning Tesla in our example GIF below – storing the results in a CSV file and showing a visualization. The beauty of the script is that you can search for whatever you like, and it will run your Tweets through the same analysis pipeline. 😉

Tesla Sentiment


Installing the dependencies & getting API keys

Since doing a sentiment analysis of Tweets with our API is so easy, installing the libraries and getting your API keys is by far the most time-consuming part of this process.

We’ve collected them here as a four-step to-do list:

  1. Make sure you have the libraries imported at the top of the script installed (which you can do with pip): tweepy, matplotlib, and the AYLIEN Text API client.
  2. Get API keys for Twitter:
  • Getting the API keys from Twitter Developer (which you can do here) is the most time-consuming part of this process, but this video can help you if you get lost.
  • What it costs & what you get: the free Twitter plan lets you download 100 Tweets per search, and you can search Tweets from the previous seven days. If you want to move beyond either of these limits, you’ll need to pay for the Enterprise plan.
  3. Get API keys for AYLIEN:
  • To do the sentiment analysis, you’ll need to sign up for our Text API’s free plan and grab your API keys, which you can do here.
  • What it costs & what you get: the free Text API plan lets you analyze 30,000 pieces of text per month (1,000 per day). If you want to make more than 1,000 calls per day, our Micro plan lets you analyze 80,000 pieces of text for $49/month.
  4. Copy, paste, and run the script below!
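For step 1 above, based on the imports at the top of the script, the install command would look something like this (the PyPI package name for the AYLIEN client is assumed to be aylien-apiclient; check PyPI if pip can't find it):

```shell
# Install the three libraries the script imports
# (aylien-apiclient provides the `aylienapiclient` module).
pip install tweepy matplotlib aylien-apiclient
```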


The Python script

When you run this script it will ask you to specify what term you want to search Tweets for, and then to specify how many Tweets you want to gather and analyze.

import sys
import csv
import tweepy
import matplotlib.pyplot as plt

from collections import Counter
from aylienapiclient import textapi

if sys.version_info[0] < 3:
   input = raw_input

## Twitter credentials
consumer_key = "Your consumer key here"
consumer_secret = "your secret consumer key here"
access_token = "your access token here"
access_token_secret = "your secret access token here"

## AYLIEN credentials
application_id = "Your app ID here"
application_key = "Your app key here"

## set up an instance of Tweepy
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

## set up an instance of the AYLIEN Text API
client = textapi.Client(application_id, application_key)

## search Twitter for something that interests you
query = input("What subject do you want to analyze for this example? \n")
number = input("How many Tweets do you want to analyze? \n")

results = api.search(
    q=query + " -rt",
    count=number,
    result_type="recent"
)

print("--- Gathered Tweets \n")

## open a csv file to store the Tweets and their sentiment 
file_name = 'Sentiment_Analysis_of_{}_Tweets_About_{}.csv'.format(number, query)

with open(file_name, 'w', newline='') as csvfile:
    csv_writer = csv.DictWriter(csvfile, fieldnames=["Tweet", "Sentiment"])
    csv_writer.writeheader()

    print("--- Opened a CSV file to store the results of your sentiment analysis... \n")

## tidy up the Tweets and send each to the AYLIEN Text API
    for c, result in enumerate(results, start=1):
        tweet = result.text
        tidy_tweet = tweet.strip().encode('ascii', 'ignore').decode('ascii')

        if len(tidy_tweet) == 0:
            print('Empty Tweet')
            continue

        response = client.Sentiment({'text': tidy_tweet})
        csv_writer.writerow({
            'Tweet': response['text'],
            'Sentiment': response['polarity']
        })

        print("Analyzed Tweet {}".format(c))

## count the data in the Sentiment column of the CSV file 
with open(file_name, 'r') as data:
   counter = Counter()
   for row in csv.DictReader(data):
       counter[row['Sentiment']] += 1

   positive = counter['positive']
   negative = counter['negative']
   neutral = counter['neutral']

## declare the variables for the pie chart, using the Counter variables for "sizes"
colors = ['green', 'red', 'grey']
sizes = [positive, negative, neutral]
labels = 'Positive', 'Negative', 'Neutral'

## use matplotlib to plot the chart
plt.pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%', startangle=90)
plt.title("Sentiment of {} Tweets about {}".format(number, query))
plt.show()

If you’re new to Python, text mining, or sentiment analysis, the next sections will walk through the main parts of the script.


The script in detail

Python 2 & 3

With the migration from Python 2 to Python 3, you can run into a ton of problems working with text data (if you’re interested, check out a great summary of why by Nick Coghlan). One relevant change is that Python 3’s input() returns a string, whereas Python 2 evaluates input() as a Python expression, so these lines alias input to raw_input() if you’re running Python 2.

if sys.version_info[0] < 3:
   input = raw_input

Input your search

The goal of this post is to make it as quick and easy as possible to analyze the sentiment of Tweets that interest you. This script does that by letting you easily change the search term and sample size every time you run the script from the shell using the input() method.

query = input("What subject do you want to analyze for this example? \n")
number = input("How many Tweets do you want to analyze? \n")

Run your Twitter query

We’re grabbing the most recent Tweets relevant to your query, but you can change the result_type parameter to ‘popular’ if you want to mine only the most popular Tweets, or ‘mixed’ for a bit of both. You can see we’ve also decided to exclude retweets, but you might decide that you want to include them. You can check the full list of parameters here. (In our experience, retaining Tweets that have been Retweeted adds a lot of noise.)

An important point to note here is that the Twitter API limits your results to 100 Tweets, and it doesn’t return an error message if you try to search for more than that. So if you request 500 Tweets, you’ll only have 100 Tweets to analyze, and the title of your visualization will still read ‘500 Tweets.’

results = api.search(
    q=query + " -rt",
    count=number,
    result_type="recent"
)
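Since the API silently caps results at 100, one defensive tweak (not in the original script; remember that `number` is the string returned by input()) is to clamp the requested count before searching, so the CSV file name and chart title reflect what was actually fetched:

```python
# Clamp the requested Tweet count to Twitter's 100-per-search limit.
MAX_TWEETS = 100

def clamp_tweet_count(raw_number):
    """Convert the input() string to an int and cap it at MAX_TWEETS."""
    try:
        requested = int(raw_number)
    except ValueError:
        requested = MAX_TWEETS
    return max(1, min(requested, MAX_TWEETS))

print(clamp_tweet_count("500"))  # 100
print(clamp_tweet_count("50"))   # 50
```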

Open a CSV file for the Tweets & Sentiment Analysis

Writing the Tweets and their sentiment to a CSV file allows you to review the API’s analysis of each Tweet. First, we open a new CSV file and write the headers.

with open(file_name, 'w', newline='') as csvfile:
    csv_writer = csv.DictWriter(csvfile, fieldnames=["Tweet", "Sentiment"])
    csv_writer.writeheader()

Tidy the Tweets

Dealing with text on Twitter can be messy, so we’ve included this snippet to tidy up the Tweets before you do the sentiment analysis. This means that your results are more accurate, and you also don’t waste your free AYLIEN credits on empty Tweets. 😉

for c, result in enumerate(results, start=1):
    tweet = result.text
    tidy_tweet = tweet.strip().encode('ascii', 'ignore').decode('ascii')

    if len(tidy_tweet) == 0:
        print('Empty Tweet')
        continue

Write the Tweets & their Sentiment to the CSV File

You can see that actually getting the sentiment of a piece of text only takes a couple of lines of code, and here we’re writing the Tweet itself and its sentiment (positive, negative, or neutral) to the CSV file under the headers we already wrote. You’ll notice that we’re writing the Tweet as returned by the AYLIEN Text API instead of the Tweet we got from the Twitter API. Even though they’re the same, writing the Tweet that the AYLIEN API returns just reduces the potential for mismatches.

We’re also going to print something every time the script analyzes a Tweet.

response = client.Sentiment({'text': tidy_tweet})
csv_writer.writerow({
    'Tweet': response['text'],
    'Sentiment': response['polarity']
})

print("Analyzed Tweet {}".format(c))


If you want to include results on how confident the API is in the sentiment it detects in each Tweet, just add response['polarity_confidence'] to the row you write, and add a corresponding header when you’re opening your CSV file.
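That change can be sketched like this; sample_response below is a hand-made stand-in for a real client.Sentiment(...) call, assuming the response carries 'text', 'polarity', and 'polarity_confidence' fields as described above:

```python
import csv

# Stand-in for client.Sentiment(...) output (assumed shape).
sample_response = {
    'text': 'Loving the new update!',
    'polarity': 'positive',
    'polarity_confidence': 0.97,
}

with open('sentiment_with_confidence.csv', 'w', newline='') as csvfile:
    # One extra header for the confidence score.
    csv_writer = csv.DictWriter(
        csvfile, fieldnames=["Tweet", "Sentiment", "Confidence"])
    csv_writer.writeheader()
    csv_writer.writerow({
        'Tweet': sample_response['text'],
        'Sentiment': sample_response['polarity'],
        'Confidence': sample_response['polarity_confidence'],
    })

# Read the row back to confirm the extra column was written.
with open('sentiment_with_confidence.csv', newline='') as csvfile:
    rows = list(csv.DictReader(csvfile))
print(rows[0]['Confidence'])  # 0.97
```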

Count the results of the Sentiment Analysis

Now that we’ve got a CSV file with the Tweets we’ve gathered and their predicted sentiment, it’s time to visualize these results so we can get an idea of the sentiment immediately. To do this, we’re just going to use the Counter class from Python’s standard library to count the number of times each sentiment polarity appears in the ‘Sentiment’ column.

with open(file_name, 'r') as data:
   counter = Counter()
   for row in csv.DictReader(data):
       counter[row['Sentiment']] += 1

   positive = counter['positive']
   negative = counter['negative']
   neutral = counter['neutral']

Visualize the Sentiment of the Tweets

Finally, we’re going to plot the results of the count above on a simple pie chart with matplotlib. This is just a case of declaring the variables and then using matplotlib to base the sizes, labels, and colors of the chart on these variables.

colors = ['green', 'red', 'grey']
sizes = [positive, negative, neutral]
labels = 'Positive', 'Negative', 'Neutral'

## use matplotlib to plot the chart
plt.pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%', startangle=90)
plt.title("Sentiment of {} Tweets about {}".format(number, query))
plt.show()


Go further with Sentiment Analysis

If you want to go further with sentiment analysis you can try two things with your AYLIEN API keys:

  • If you’re looking into reviews of restaurants, hotels, cars, or airlines, you can try our aspect-based sentiment analysis feature. This will tell you what sentiment is attached to each aspect of a Tweet – for example positive sentiment shown towards food but negative sentiment shown towards staff.
  • If you want sentiment analysis customized for the problem you’re trying to solve, take a look at TAP, which lets you train your own language model from your browser.

Building a Sentiment Analysis Workflow for your Organization

This script is built to give you a snapshot of sentiment at the time you run it, so to keep abreast of any change in sentiment towards an organization you’re interested in, you should try running this script every day.

In our next blog, we’ll have a couple of simple updates for this script that will set up a simple, fully automated process to keep an eye on the sentiment on Twitter for anything that you’re interested in.

Text Analysis API - Sign up


Last month, our News API gathered, analyzed, and indexed over 2.3 million news stories in near-real time, giving us the ability to spot trends in what the world’s media is talking about and dive into specific topics to understand how they developed over time.

In this blog, we’re going to look into two interesting events from September which gathered a lot of media attention:

  1. The launch of the iPhone X.
  2. Ryanair’s ongoing cancellations problem.


The iPhone X Launch

Looking into stories in the technology category, our News API detected a spike in stories published on the 12th of September. You can see that almost 5,000 stories about “Technology” were published that day – over 20% more than on an average weekday. A large portion of these stories covered the launch of the highly anticipated iPhone X.

How did the media react to the iPhone X launch?

Knowing that the launch of the iPhone X caused a spike in media interest is great because it lets us measure the hype associated with the event. But using the News API, we can go a step further and dig deeper into the content to better understand the media reaction. To do this, we used the Trends endpoint to analyze which entities were mentioned in all news stories.

Using the Trends endpoint, our users can make an unlimited number of queries about quantitative trends in news content, making it easy to spot trending topics in a collection of documents. For example, take a look at what entities were mentioned most in stories about the iPhone X.

On the chart above you can see that the media mentioned two types of entities the most: the most-mentioned entities are somewhat obvious and expected (Apple and iPhone). However, the articles also mention other entities like Tim Cook and Cupertino which help set stories in context, the “who, what, and where” of the story.

But after these most popular entities, we can group some slightly lesser-mentioned entities together; these are competitor- and product-focused (Samsung, S8, Apple Watch). The prominence of these entities shows that the media was very interested in talking about the iPhone X in the context of how it added to Apple’s product offering, and how this offering compared to its competitors.

How did the reaction on social media compare?

So when it came to the iPhone X, the media were talking about Apple and its competitors, but what were people online talking about? We decided to compare the media coverage from the News API with reaction on Twitter to try and gauge the customer reaction to the launch. To do this, we used the Twitter API to gather 10,000 Tweets and our Text API to extract the entities mentioned in every Tweet.
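The aggregation step behind that comparison can be sketched as follows; extracted_entities below is a hand-made stand-in for per-Tweet results from the Text API, so treat the exact response shape as an assumption:

```python
from collections import Counter

# Stand-in for per-Tweet entity-extraction results; each dict maps an
# entity type to the surface forms found in one Tweet (assumed shape).
extracted_entities = [
    {'keyword': ['RAM', 'iPhone X']},
    {'keyword': ['OLED', 'iPhone X']},
    {'keyword': ['RAM', 'iOS']},
]

# Count how often each entity is mentioned across all Tweets.
entity_counts = Counter()
for entities in extracted_entities:
    for mentions in entities.values():
        entity_counts.update(mentions)

print(entity_counts.most_common(2))  # [('RAM', 2), ('iPhone X', 2)]
```

Sorting the counts like this is what lets you plot the most-mentioned entities side by side with the News API results.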

On Twitter, the 140-character limit means people have to jump right into what they want to say, so you can see right away how the content of Tweets differed from the content of news stories.

You can see here that the conversation on Twitter was more focused on the product itself than on the business implications for Apple. You can see this in the fact that RAM is the most-mentioned entity (the iPhone X has no increase in RAM over previous iPhones), while OLED, iOS, and the Galileo GPS system are much more prevalent here than in the News API results.

So from this data, you can see that Twitter users were more focused on the product itself – making them a good insight into the voice of the customer, whereas the media were more focused on producing insights into the business implications of the phone.


Ryanair’s bad PR Month (even by Ryanair standards)

In a previous blog, we used our News API and our Text API to analyze how Ryanair handled the initial announcement of their flight cancellations disaster. As we wrote that blog only days after the announcement of the cancellations, we decided to check in with the airline again to see how they’ve fared since then.

Below, you can see the volume of stories published about Ryanair in September and their sentiment. The first spike in extremely negative press covers the weekend that the airline announced the cancellations, and the second spike is the coverage of the announcement of further cancellations.

Was it only the Cancellations that Brought Coverage for Ryanair?

In the chart above you can also see a spike in stories about Ryanair on the Thursday following the initial announcement. To find out what all of this coverage was about, we used the Trends endpoint of the News API to find out what the most-mentioned entities in Ryanair stories were in the week following that big spike caused by the cancellations announcement.

You can see above that all of the most-mentioned entities are aviation- or Ryanair-related. This is useful, as it gives us insight into what other people and places were talked about in the Ryanair coverage, but it doesn’t give us an insight into what the story spike was about. To do that, we’ll analyze the same time period with the keywords feature.

The keywords feature produces a different kind of insight into text data: whereas the entities feature returns specific things like people, locations, and products, the keywords feature returns mentions of things in general, like ‘flights,’ ‘week,’ and ‘pilots’. Take a look at the chart below to see how these insights differ from what the entities endpoint returned on the same data.

You can see in the chart above that the keyword ‘pilots’ is more prevalent than ‘cancellations,’ so we can guess that in the days after interest in the cancellations died down, the media became more interested in the story about pilots leaving Ryanair en masse than in the cancellations themselves.

To test this idea out, we used the News API’s Time Series endpoint to compare mentions of each of these keywords in stories with ‘Ryanair’ in the title. You can see that although the two issues were covered a lot together, coverage of Ryanair’s pilot trouble was more popular in between Ryanair’s two cancellation announcements. This shows that despite the huge coverage on passenger outrage, the media focused on the more troubling business news for Ryanair – its toxic relationship with its own pilots.

So that concludes our quick analysis of last month’s news with the News API. If you’re interested in using our AI-powered text analysis engine to analyze news content for your own solution, click on the image below and sign up for a free trial!

News API - Sign up


Just six months on from our last blog tracking growth here at AYLIEN, we have more updates to keep you posted on. Since then, we’ve added five more to the team – two scientists, an office manager, an engineer, and a business development manager.

We’re now a team of 17 comprised of eight nationalities, and we’re really excited to share this new step we’re taking. Meet the newest additions to the AYLIEN team!

Chris – Research Scientist


Chris has been with us part-time for a few months now while he finished his PhD, where he designed neural networks for machine translation, and now he’s joining us full-time to advance our understanding of entity linking. Starting as a Music and German undergrad, he dived into Computational Linguistics after his Master’s degree and hasn’t looked back since. Take a look at his research publications or follow him on Twitter.

A native of Denton, Texas, Chris is also a keen outdoorsman and quite an accomplished percussionist – he usually plays in the Grand Social on Mondays with the Polish folk collective The Supertonic Orchestra.


Damien – Business Development Manager


Joining us after lecturing in Music in the British and Irish Modern Music Institute, Damien comes on board to help our customers get the most out of AYLIEN’s products. Damien is going to use the experience he gathered over eight years in sales and customer support in tech companies.

Outside of AYLIEN, Damien is also a professional composer who scores films and video games. He plays alto sax in a Ska/Punk band called Bocs Social and writes for a blog he started, Audio Dexterous. Damien and Chris bring the number of serious musicians at AYLIEN to three (we’ve all heard @anotherjohng whistling in the lift).


Francesca – Office Manager


Francesca is taking over everything admin-related in AYLIEN. In the six weeks she’s been with us, she has become the go-to person for most things that happen in the office, and has earned the undying gratitude of some of the team by helping them deal with the Irish Naturalisation and Immigration Service, which can be a little… let’s say, tiring. But having been on the organisation board of PyCon Italy for ten years, keeping things running in a tech company is nothing new for Francesca.

Coming from Parma (yes, the home of the ham and the cheese), Francesca did a degree in Food Science and Technology, and she is an avid reader and board gamer.


Ian – Postdoctoral Fellow


The first Aussie on the team, Ian is also our first postdoc from academia to be based in the AYLIEN office, and the third member of the research team holding a PhD. He is our first Science Foundation Ireland Industry Fellow, and we covered his placement in a blog post last month. His research focuses on building neural networks that analyze emotion in text.

Outside of AYLIEN, Ian has done a little sailing (just across the Atlantic, no big deal), speaks quite a few languages (Czech, Spanish, and Italian, but can get by in French, Hindi, Turkish and German), and is a pretty decent cook.


Sven – Site Reliability Engineer


Sven joins us to take responsibility for the infrastructure of every part of AYLIEN – making everything as reliable and resilient as possible. When he’s not hunting down potential failure points and eliminating manual processes in AYLIEN’s tech, Sven is usually contributing to open-source software and coding interesting projects, and you can take a look at his code on GitHub.

A Madrid native of German origin, Sven cycles to work via the gym every morning, so by 9AM he’s done more exercise than the rest of us combined! Having been around the world quite a bit, he’s now getting around Ireland bit by bit – a winter trip to Connemara is up next!


So that’s the update for the summer. We’re growing at a quick rate across research, engineering, and sales, so if you think you’d work well on the team, drop us a line. If you’re an NLP or machine learning person with a research idea, check out our research page – maybe there’s something for us to work on together.


For years Ryanair and Michael O’Leary have handled the media with a deft touch. Think of all the free advertising Michael O’Leary has garnered for the company, and how they have dealt with extraordinarily negative opinion of their service offering while growing their business into one of the biggest airlines in Europe.

But last week, Ryanair were dealt an incredible blow when they had to immediately cancel around 1,900 flights due to an administrative problem on their part. This is one of the biggest PR challenges they have faced, so we wanted to see how effectively Ryanair handled the news. With our APIs, we can use text analysis to measure the reaction to the announcement, both in the news and on social media.




To do this, we collected just under 600 news stories about the cancellations and 30,000 Tweets mentioning Ryanair over the past week. We’ve isolated the following specific questions we’re going to answer:

  1. Ryanair announced the news late on Friday, an old PR trick to minimize negative coverage. How effective was this in the news cycle?
  2. Does this old-school PR trick have an effect on the social media reaction?
  3. Exactly how negative was the press coverage that Ryanair was trying to minimize?
  4. How many of the Tweets mentioning Ryanair were affected by the cancellations and how many were just jumping on the bandwagon?


1. Did Ryanair’s Friday evening press dump work?

Ryanair first started cancelling flights on Friday morning, but decided not to officially announce the news until later that evening, which is a common PR trick – announce bad news late on a Friday so the press coverage is affected by the weekend lull, hopefully resulting in less coverage.

Using the News API, we can see that this strategy worked pretty well for Ryanair. To try and understand how much coverage the news got, we decided to track the volume of news stories published over the weekend period using our Time Series endpoint. We also wanted to see how the news spread across social media, so we collected Tweets directed at Ryanair using the Twitter API over the same period. In total we collected just under 30,000 Tweets and about 600 news articles. As all of our data points were time-stamped, it was easy to plot them side by side on one chart and compare the volume from each channel. Take a look below:

You can see above that announcing the news late on Friday meant that while the conversation about Ryanair took off on Twitter over the weekend, it took the press until Monday to catch up. It’s important to note here that on Monday there was no new news to release except a detailed list of the flights affected – the press coverage on Monday was essentially working off old news. So from the comparatively small number of stories over the weekend, we can see that Ryanair successfully got out ahead of the story and minimized its immediate impact.

But if people are talking about the cancellations online, what does it matter if there were fewer stories in the press?


2. This old-school PR trick is important for PR on Facebook too

The previous chart shows that although there was a huge amount of chatter about the Ryanair cancellations on social media, at the same time there were fewer news stories being written. The obvious implication here is that there were fewer stories for social media users to share – more journalists off for the weekend meant fewer stories were appearing in people’s news feeds.

So to put this idea to the test, we used the Trends endpoint of the News API to gather the 10 most-shared stories about Ryanair on Facebook on each day from Friday through Monday, and counted how many times they were shared. Take a look at how many more shares the stories got on Monday than on Saturday, even though the news was three days old at that point! With this information, we think it’s a good bet that stories about Ryanair got many more shares on Monday because more people had stories in their news feeds.

To be exact, Monday’s top 10 most-shared stories were shared over 43,000 times more than Saturday’s 10 most-shared stories, simply because news publishers were back in full swing (remember – there was no new important information). This implies that however bad the coverage was on Monday (and it was bad, as we’ll see) the old PR trick of dumping stories on a Friday evening also has an effect on the spread of news on social media. It’s difficult to quantify the combined reach of these extra 43,000 shares, but that’s a lot of extra negative publicity that Ryanair avoided!

We thought this was interesting – although social media is a huge disruptor of the publishing industry, the fact that the old Friday evening press dump works on Facebook sharing tells us that traditional journalism still guides what people are talking about on social media.


3: How negative was the press coverage that Ryanair was trying to minimize?

So far we’ve assumed that the stories about Ryanair were largely negative, but we haven’t looked into how negative they were. Using the Trends endpoint of the News API, we can do just that, and we found that 85% of stories about Ryanair had a negative tone last week. Take a look at how the sentiment changed over each day last week. You can see it gets extremely negative from Friday onwards.


4: Did people ‘swarm’ the bad news on Twitter?

We pointed out that there was a massive spike in mentions of Ryanair on Twitter on Saturday, before the media could cover the story as extensively as they would on a weekday. This gave us an interesting opportunity – sometimes PR research can be hampered by swarming, which is when people unaffected by a problem jump on the bandwagon and add to the negative press.

So we came up with a way to separate affected customers from swarmers in this example. To identify those actually affected by the events and those who were jumping on the bandwagon, we narrowed our search to Twitter users who mentioned flight details, like locations of departure or arrival, and those who didn’t. We reasoned that those who hadn’t mentioned any specific details about their flight were more than likely swarmers, while those who gave specifics were actually affected by the cancellations.
Take a look at these two Tweets as an example: one is from a customer affected by the cancellations, and the other is just someone who wanted to say something negative about Ryanair after the news broke.
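That split can be sketched with a few lines of Python. The per-Tweet concept lists and the location set below are hypothetical stand-ins; in practice they would come from concept-extraction results rather than being hand-written:

```python
# Hypothetical Tweets with pre-extracted concepts (assumed shape).
tweets = [
    {'text': 'My flight from Stansted to Dublin was cancelled!',
     'concepts': ['Ryanair', 'London Stansted Airport', 'Dublin Airport']},
    {'text': 'Ryanair is a joke', 'concepts': ['Ryanair']},
]

# Assumed set of location-type concepts used for the split.
LOCATIONS = {'London Stansted Airport', 'Dublin Airport'}

def is_affected(tweet):
    """A Tweet mentioning a departure/arrival location counts as affected."""
    return any(concept in LOCATIONS for concept in tweet['concepts'])

affected = [t for t in tweets if is_affected(t)]
swarmers = [t for t in tweets if not is_affected(t)]
print(len(affected), len(swarmers))  # 1 1
```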

Our Text API’s Concept Extraction feature allows users to extract any mention of organizations, people, products and locations. Using this capability, we decided to see how many Tweets between Friday evening and Saturday afternoon (after the news broke) mentioned a location, how many only mentioned Ryanair, and how many were talking about any other concept. Take a look at the results:

The Concepts feature lets us dive further into the data by telling us exactly which concepts were mentioned. We’ll take a look at everything we’ve grouped into ‘other concepts’, but first let’s see which locations people were talking about in the Tweets with the @Ryanair handle – you can see they correlate with the airports affected by the cancellations:

We then compared the data on what people were talking about on Friday with what they were talking about on Monday, to see what changed when the press were covering the story in full. You can see that mentions of Ryanair alone dropped significantly, while mentions of all other concepts rose correspondingly. We’re guessing the mentions of locations stayed the same because the only new announcement on Monday was a detailed list of the flights that were cancelled. Take a look at the Twitter conversation about Ryanair on Monday:

In the chart above, you can see the rise of what we’ve labeled ‘other concepts’. We can use the Text API to see what these are. Take a look:

So that’s our AI-powered layman’s analysis of the coverage of the Ryanair cancellations PR disaster. We think that although the press is extremely negative, Ryanair did a pretty good job of mitigating the volume of it.

If you want to try out our APIs, take a look at the demos of our APIs or sign up for a trial – the News API has a two-week free trial and the Text API has a free plan that lets you analyze 1,000 pieces of text per month with no cost.


News API - Sign up


Four members of our research team spent the past week at the Conference on Empirical Methods in Natural Language Processing (EMNLP 2017) in Copenhagen, Denmark. The conference handbook can be found here and the proceedings can be found here.

The program consisted of two days of workshops and tutorials and three days of main conference. Videos of the conference talks and presentations can be found here. The conference was superbly organized, had a great venue, and featured a social event with fireworks.


Figure 1: Fireworks at the social event

With 225 long papers, 107 short papers, and 9 TACL papers accepted, there was a clear uptick in submissions compared to last year. The number of long and short paper submissions to EMNLP this year was even higher than those at ACL for the first time within the last 13 years, as can be seen in Figure 2.


Figure 2: Long and short paper submissions at ACL and EMNLP from 2004-2017

In the following, we will outline our highlights and list some research papers that caught our eye. We will first list overall themes and will then touch upon specific research topics that are in line with our areas of focus. Also, we’re proud to say that we had four papers accepted to the conference and workshops this year! If you want to see the AYLIEN team’s research, check out the research sections of our website and our blog. With that said, let’s jump in!


Exciting Datasets

Evaluating your approach on CoNLL-2003 or PTB is appropriate for comparing against previous state-of-the-art, but kind of boring. The two following papers introduce datasets that allow you to test your model in more exciting settings:

  • Durrett et al. release a new domain adaptation dataset. The dataset evaluates models on their ability to identify products being bought and sold in online cybercrime forums.  
  • Kutuzov et al. evaluate their word embedding model on a new dataset that focuses on predicting insurgent armed groups based on geographical locations.
  • While he did not introduce a new dataset, Nando de Freitas made the point during his keynote that the best environment for learning and evaluating language is simulation.


Figure 3: Nando de Freitas’ vision for AI research

Return of the Clusters

Brown clusters, an agglomerative, hierarchical clustering of word types based on their contexts that was introduced in 1992, seem to be coming back into vogue. They were found to be particularly helpful for cross-lingual applications, and clusters were key features in several approaches:

  • Mayhew et al. found that Brown cluster features were an important signal for cross-lingual NER.
  • Botha et al. use word clusters as a key feature in their small, efficient feed-forward neural networks.
  • Mekala et al.’s new document representations are built by clustering word embeddings, which gives them an edge for text classification.
  • In his talk at the SCLeM workshop, Noah Smith cites the benefits of using Brown clusters as features for tasks such as POS tagging and sentiment analysis.


Figure 4: Noah Smith on the benefits of clustering in his invited talk at the SCLeM workshop

Distant Supervision

Distant supervision can be leveraged to collect large amounts of noisy training data, which can be useful in many applications. Some papers used novel forms of distant supervision to create new corpora or to train a model more effectively:

  • Lan et al. use urls in tweets to collect a large corpus of paraphrase data. Paraphrase data is usually hard to create, so this approach facilitates the process significantly and enables a continuously expanding collection of paraphrases.
  • Felbo et al. show that training on fine-grained emoji detection is more effective for pre-training sentiment and emotion models. Previous approaches primarily pre-trained on positive and negative emoticons or emotion hashtags.

Data Selection

The current generation of deep learning models is excellent at learning from data. However, we often do not pay much attention to the actual data our model is using. In many settings, we can improve upon the model by selecting the most relevant data:

  • Fang et al. reframe active learning as reinforcement learning and explicitly learn a data selection policy. Active learning is one of the best ways to create a model with as few annotations as possible; any improvement to this process is beneficial.
  • Van der Wees et al. introduce dynamic data selection for NMT, which varies the selected subset of the training data between different training epochs. This approach has the potential to reduce the training time of NMT models at comparable or better performance.
  • Ruder and Plank use Bayesian Optimization to learn data selection policies for transfer learning and investigate how well these transfer across models, domains, and tasks. This approach brings us a step closer towards gaining a better understanding of what constitutes similarity between different tasks and domains.

Character-level Models

Characters are nowadays used as standard features in most sequence models. The Subword and Character-level Models in NLP workshop discussed approaches in more detail, with invited talks on subword language models and character-level NMT.

  • Schmaltz et al. find that character-based sequence-to-sequence models outperform word-based models and models with character convolutions for sentence correction.
  • Ryan Cotterell gave a great, movie-inspired tutorial on combining the best of FSTs (cowboys) and sequence-to-sequence models (aliens) for string-to-string transduction. While evaluated on morphological segmentation, the tutorial raised awareness, in an entertaining way, that a combination of traditional and neural approaches often performs best.


Figure 5: Ryan Cotterell on combining FSTs and seq2seq models for string-to-string transduction

Word Embeddings

Research in word embeddings has matured and now mainly tries to 1) address deficits of word2vec, such as its handling of out-of-vocabulary (OOV) words; 2) extend it to new settings, e.g. modelling the relations of words over time; and 3) understand the induced representations better:

  • Pinter et al. propose an approach for generating OOV word embeddings by training a character-based BiLSTM to generate embeddings that are close to pre-trained ones. This approach is promising as it provides us with a more sophisticated way to deal with out-of-vocabulary words than replacing them with an <UNK> token.
  • Herbelot and Baroni slightly modify word2vec to allow it to learn embeddings for OOV words from few data.
  • Rosin et al. propose a model for analyzing when two words relate to each other.
  • Kutuzov et al. propose another model that analyzes how two words relate to each other over time.
  • Hasan and Curry improve the performance of word embeddings on word similarity tasks by re-embedding them in a manifold.
  • Yang et al. introduce a simple approach to learning cross-domain word embeddings. Creating embeddings tuned on a small, in-domain corpus is still a challenge, so it is nice to see more approaches addressing this pain point.
  • Mimno and Thompson try to understand the geometry of word2vec better. They show that the learned word embeddings are positioned diametrically opposite of their context vectors in the embedding space.

Cross-lingual transfer

An increasing number of papers evaluate their methods on multiple languages. In addition, there was an excellent tutorial on cross-lingual word representations, which summarized and tried to unify much of the existing literature. Slides of the tutorial are available here.

  • Malaviya et al. train a many-to-one NMT to translate 1017 languages into English and use this model to predict information missing from typological databases.
  • Mayhew et al. introduce a cheap translation method for cross-lingual NER that only requires a bilingual dictionary. They even perform a case study on Uyghur, a truly low-resource language.
  • Kim et al. present a cross-lingual transfer learning model for POS tagging without parallel data. Parallel data is expensive to create and rarely available for low-resource languages, so this approach fills an important need.
  • Vulic et al. propose a new cross-lingual transfer method for inducing VerbNets for different languages. The method leverages vector space specialisation, an effective word embedding post-processing technique similar to retro-fitting.
  • Braud et al. propose a robust, cross-lingual discourse segmentation model that only relies on POS tags. They show that dependency information is less useful than expected; it is important to evaluate our models on multiple languages, so we do not overfit to features that are specific to analytic languages, such as English.


Figure 6: Anders Søgaard demonstrating the similarities between different cross-lingual embedding models at the cross-lingual representations tutorial


Summarization

The Workshop on New Frontiers of Summarization brought researchers together to discuss key issues related to automatic summarization. Much of the research on summarization sought to develop new datasets and tasks:

  • Katja Filippova (Google Research, Switzerland) gave an interesting talk on sentence compression and passage summarization for Q&A. She described how they went from syntax-based methods to Deep Learning.
  • Völske et al. created a new summarization corpus by looking for ‘TL;DR’ on Reddit. This is another example of a creative use of distant supervision, leveraging information that is already contained in the data in order to create a new corpus.
  • Falke and Gurevych won the best resource paper award for creating a new summary corpus that is based on concept maps rather than textual summaries. The concept map can be explored using a graph-based document exploration system, which is available as a demo here.
  • Pasunuru et al. use multi-task learning to improve abstractive summarization by leveraging entailment generation.
  • Isonuma et al. also use multi-task learning with document classification in conjunction with curriculum learning.
  • Li et al. propose a new task, reader-aware multi-document summarization, which uses comments of articles, along with a dataset for this task.
  • Narayan et al. propose another new task, split and rephrase, which aims to split a complex sentence into a sequence of shorter sentences with the same meaning, and also release a new dataset.
  • Ghalandari revisits the traditional centroid-based method and proposes a new strong baseline for multi-document summarization.


Bias

Data and model-inherent bias is an issue that is receiving more attention in the community. Some papers investigate and propose methods to address the bias in certain datasets and evaluations:

  • Chaganty et al. investigate bias in the evaluation of knowledge base population models and propose an importance sampling-based evaluation to mitigate the bias.
  • Dan Jurafsky gave a truly insightful keynote about his three-year-long study analyzing the body camera recordings his team obtained from the Oakland police department for racial bias. Besides describing the first contemporary linguistic study of officer-community member interaction, he also provided entertaining insights on the language of food (cheaper restaurants use terms related to addiction, more expensive venues use language related to indulgence) and the challenges of interdisciplinary publishing.
  • Dubossarsky et al. analyze the bias in word representation models and propose that recently proposed laws of semantic change must be revised.
  • Zhao et al. won the best paper award for an approach using Lagrangian relaxation to inject constraints based on corpus-level label statistics. An important finding of their work is bias amplification: While some bias is inherent in all datasets, they observed that models trained on the data amplified its bias. While a gendered dataset might only contain women in 30% of examples, the situation at prediction time might thus be even more dire.


Figure 7: Zhao et al.’s proposed method for reducing bias amplification

Argument mining & debate analysis

Argument mining is closely related to summarization. In order to summarize argumentative texts, we have to understand claims and their justifications. This research area had the 4th Workshop on Argument Mining dedicated to it:

  • Hidey et al. analyze the semantic types of claims (e.g. agreement, interpretation) and premises (ethos, logos, pathos) in the Change My View subreddit. This is another creative use of Reddit to create a dataset and analyze linguistic patterns.
  • Wachsmuth et al. presented an argument web search engine, which can be queried here.
  • Potash and Rumshisky predict the winner of debates, based on audience favorability.
  • Swamy et al. also forecast winners for the Oscars, the US presidential primaries, and many other contests based on user predictions on Twitter. They create a dataset to test their approach.
  • Zhang et al. analyze the rhetorical role of questions in discourse.
  • Liu et al. show that argument-based features are also helpful for predicting review helpfulness.

Multi-agent communication

Multi-agent communication is a niche topic, which has nevertheless received some recent interest, notably in the representation learning community. Most papers deal with a scenario where two agents play a communicative referential game. The task is interesting, as the agents are required to cooperate and have been observed to develop a common pseudo-language in the process.

  • Andreas and Klein investigate the structure encoded by RNN representations for messages in a communication game. They find that the mistakes are similar to the ones made by humans. In addition, they find that negation is encoded as a linear relationship in the vector space.
  • Kottur et al. show in their best short paper that language does not emerge naturally when two agents are cooperating, but that they can be coerced to develop compositional expressions.


Figure 8: The multi-agent setup in the paper of Kottur et al.

Relation extraction

Extracting relations from documents is more compelling than simply extracting entities or concepts. Some papers improve upon existing approaches using better distant supervision or adversarial training:

  • Liu et al. reduce the noise in distantly supervised relation extraction with a soft-label method.
  • Zhang et al. publish TACRED, a large supervised dataset for knowledge base population, as well as a new model.
  • Wu et al. improve the precision of relation extraction with adversarial training.

Document and sentence representations

Learning better sentence representations is closely related to learning more general word representations. While word embeddings still have to be contextualized, sentence representations are promising as they can be directly applied to many different tasks:

  • Mekala et al. propose a novel technique for building document vectors from word embeddings, with good results for text classification. They use a combination of adding and concatenating word embeddings to represent multiple topics of a document, based on word clusters.
  • Conneau et al. learn sentence representations from the SNLI dataset and evaluate them on 12 different tasks.

These were our highlights. Naturally, we weren’t able to attend every session and see every paper. What were your highlights from the conference or which papers from the proceedings did you like most? Let us know in the comments below.

Text Analysis API - Sign up


Welcome to the seventh in a series of blog posts in which we use the News API to look into the previous month’s news content. The News API collected and indexed over 2.5 million stories published last month, and in this blog we’re going to use its analytic capabilities to discover trends in what the media wrote about.

We’ve picked two of the biggest stories from last month, and using just three of the News API’s endpoints (Stories, Trends, and Time Series) we’re going to cover the two following topics:

  1. The conflict brewing between a nuclear-armed North Korea and the US
  2. The ‘fight of the century’ between Conor McGregor and Floyd Mayweather

In covering both of these topics we uncovered some interesting insights. First, apparently we’re much more interested in Donald Trump’s musings on nuclear war than the threat of nuclear war itself. Also, although the McGregor fight lived up to the hype, Conor failed to capitalize on the record-breaking press coverage to launch his ‘Notorious Whiskey’.

1. North Korea

Last month, North Korea detonated a hydrogen bomb over seven times more powerful than any of its previous tests, raising worries that conflict with a nuclear-armed nation is now likely. But using the News API, we can see that in the English-speaking world, even with such a threat looming, we still just can’t get enough of Donald Trump.

Take a look below at the daily volume of stories with ‘North Korea’ in the title last month, which we gathered with the News API’s Time Series endpoint. The graph shows the volume of stories with the term ‘North Korea’ in the title for every day in August. You can see that the English-speaking media were much more interested in Trump’s ‘fire and fury’ comment at the start of August than they were in North Korea actually detonating a hydrogen bomb at the start of September.
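As a rough sketch, a Time Series request like the one behind this graph can be built and its response flattened as follows. This is a minimal illustration, not production code: the parameter names (`title`, `published_at.start`, `period`) and the response shape are assumptions modelled on the News API's conventions, so check the official docs before relying on them.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint URL and parameter names -- verify against the docs.
BASE_URL = "https://api.aylien.com/news/time_series"

def build_time_series_query(title, start, end, period="+1DAY"):
    """Build the query string for a daily story-volume request."""
    return urlencode({
        "title": title,
        "published_at.start": start,
        "published_at.end": end,
        "period": period,
    })

def daily_volumes(response_json):
    """Flatten an assumed time-series response into (date, count) pairs."""
    series = json.loads(response_json)["time_series"]
    return [(point["published_at"], point["count"]) for point in series]

# A mocked response with the shape we assume the endpoint returns:
sample_response = json.dumps({"time_series": [
    {"published_at": "2017-08-08T00:00:00Z", "count": 2214},
    {"published_at": "2017-08-09T00:00:00Z", "count": 1530},
]})
```

The full request would be `GET BASE_URL + "?" + build_time_series_query(...)` with your API credentials in the headers; the (date, count) pairs can then be plotted directly.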

We guessed that this is largely due to publishers trying to keep up with the public’s insatiable appetite for any Donald Trump-related news. Using the News API, we can put this idea to the test, by analyzing what content about North Korea people shared the most over August.

We used the News API’s Stories endpoint to look at the stories containing ‘Korea’ in the title that had the highest engagement rates across social networks. This tells us what type of content people are most likely to recommend in their social circles, which gives a strong indication of readers’ opinions and interests. Take a look below at the most-shared stories across Facebook and Reddit. You can see that the popular content varies across the different networks.


Most-shared stories on Facebook:


  1. ‘Trump to North Korea: U.S. Ready to Respond With “Fire and Fury”,’ The Washington Post. 118,312 shares.
  2. ‘China warns North Korea: You’re on your own if you go after the U.S.,’ The Washington Post. 94,818 shares.
  3. ‘Trump threatens “fury” against N Korea,’ BBC. 69,098 shares.


Most-upvoted stories on Reddit:


  1. ‘Japanese government warns North Korea missile headed toward northern Japan,’ CNBC. 119,075 upvotes.
  2. ‘North Korea shaken by strong tremors in likely nuclear test,’ CNBC. 61,088 upvotes.
  3. ‘Japan, US look to cut off North Korea’s oil supply,’ Nikkei Asian Review. 59,725 upvotes.
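Once the Stories endpoint returns a list of story objects, ranking them by engagement is a simple sort. In this sketch the `social_shares_count` field layout is an assumption modelled on the News API's story schema, and the sample data echoes the Facebook figures above:

```python
def top_shared(stories, network="facebook", n=3):
    """Return the n stories with the highest share count on one network.

    Assumes each story dict carries a "social_shares_count" mapping of
    network name to count (an assumed field layout, not the verified schema).
    """
    return sorted(stories,
                  key=lambda s: s.get("social_shares_count", {}).get(network, 0),
                  reverse=True)[:n]

# Sample story objects mirroring the Facebook list above:
sample_stories = [
    {"title": "Trump to North Korea ...", "social_shares_count": {"facebook": 118312}},
    {"title": "China warns North Korea ...", "social_shares_count": {"facebook": 94818}},
    {"title": "Trump threatens 'fury' ...", "social_shares_count": {"facebook": 69098}},
    {"title": "Minor wire update", "social_shares_count": {"facebook": 12}},
]
```

The same function ranks Reddit results by passing `network="reddit"`, assuming the share counts for that network live under the same mapping.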

Comparing coverage across these two social networks, you can see that Trump features heavily in the most popular content about Korea on Facebook, while the most-upvoted content on Reddit tended to be breaking news with a more neutral tone. This is similar to the patterns we observed with the News API in a previous media review blog, which showed that Reddit was much more focused on breaking stories on popular topics than Facebook.

So now that we know the media focused its attention on Donald Trump, we can ask ourselves, what were all of these stories about? Were these stories talking the President down, like he always claims? Or were they positive? Using the sentiment feature of the News API’s Trends endpoint, we can dive into the stories that had both ‘Trump’ and ‘Korea’ in the title, and see what sentiment is expressed in the body of the text.

From the results below, you can see that over 50% of these articles contained negative sentiment, whereas a little over 30% had a positive tone. For all of the President’s – shall we say, questionable – claims, he’s right about one thing: popular content about how he responds to issues is overwhelmingly negative.
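The percentages above come from a simple aggregation step: a Trends query grouped by sentiment returns a count per label, and we convert those counts into shares. A minimal sketch, with illustrative numbers rather than the real response:

```python
def sentiment_breakdown(counts):
    """Convert raw sentiment counts into percentage shares (one decimal)."""
    total = sum(counts.values())
    return {label: round(100.0 * n / total, 1) for label, n in counts.items()}

# Illustrative counts of the shape a Trends query grouped by sentiment
# might return; the real field names and numbers will differ:
sample_sentiment = {"negative": 521, "positive": 312, "neutral": 167}
```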

2. The Superfight – how big was it?

We’re based in Ireland, so having Conor McGregor of all people taking part in the ‘fight of the century’ last month meant that we’ve heard about pretty much nothing else. We can use the News API to put some metrics on all of the hype, and see how the coverage compared to the coverage of other sporting events. Using the Time Series endpoint, we analyzed the impact of the fight on the volume of stories last month. Since it analyzes the content of every news story it gathers, the News API can show us how the volume of stories about every subject fluctuates over time.

Take a look at how the volume of stories about boxing skyrocketed in the build up to and on the weekend of the fight:

You can see that on the day of the fight itself, the volume of stories that the News API classified as being about boxing increased almost tenfold.

To figure out just how big this hype was in the boxing world, we compared the volume of stories published about boxing around the ‘fight of the century’ with the volume around another match that received a lot of hype at the time: the WBA/IBF world heavyweight title bout last April between Anthony Joshua and Wladimir Klitschko. To do this, we analyzed the story volume from the two weeks before and after each fight and plotted them side by side. This allows us to easily compare the media coverage on the day of each fight as well as in its build-up and aftermath. Take a look at the results below:

You can see that the McGregor-Mayweather fight totally eclipses the Joshua-Klitschko heavyweight title fight. But it’s important to give context to the data on this hype by comparing it with data from other sports.

It’s becoming almost a point of reference on these News API media review blogs to compare any trending stories to stories in the World Soccer category. This is because the daily volume of soccer stories tends to be consistently the largest of all categories, so it’s a nice baseline to use to compare story volumes. As you can see below, the hype surrounding the ‘fight of the century’ even prompted more boxing stories than soccer stories, which is quite a feat. Notice how only four days after the fight, when boxing was back to its normal level and soccer stories were increasing due to European transfer deadline day looming, there were 2,876 stories about soccer compared with 191 stories about boxing.

You might remember that Conor McGregor launched his ‘Notorious Whiskey’ in the press conference following the fight. This was the perfect time for McGregor to announce a new product – right at the pinnacle of the media coverage. If you’re wondering how well he leveraged this phenomenal level of publicity for his new distilling career, we used the News API to look into that too. Take a look below at the volume of stories that mentioned the new whiskey brand. It looks like mentions of ‘Notorious Whiskey’ have disappeared totally since the weekend of the fight, leaving us with this odd-looking bar chart. But we doubt that will bother Conor at the moment, considering the $100m payday!

That covers our quick look into the News API’s data on two of last month’s stories. The News API gathers over 100,000 stories per day, and indexes them in near-real time. This gives you a stream of enriched news data that you can query. So try out the demo or click on the link below for a free trial to use our APIs for two weeks.

News API - Sign up


Breakthroughs in NLP research are creating huge value for people every day, supercharging technologies from search engines to chatbots. The work that makes these breakthroughs possible is done in two silos – academia and industry. Researchers in both of these silos produce work that advances the field, and frequently collaborate to generate innovative research.

Contributing to this research is why we have such a heavy R&D focus at AYLIEN, with six full-time research scientists out of a total team of 16. The research team naturally has strong ties with academia – some are completing PhDs with the work they are carrying out here, while others already hold one. Academia is also represented on our advisory board and in the great people who have become our mentors.

To further deepen these ties with academia, we’re delighted to announce our first Industry Fellowship in association with Science Foundation Ireland, with Dr. Ian Wood of NUIG. Ian will be based in our Dublin office for one year starting in September. SFI’s goal with this fellowship is to allow industry and academia to cross-pollinate by exchanging ideas and collaborating on research. This placement will allow us to contribute to and learn from the fantastic work that Insight Centre in NUIG are doing, and we’re really excited to open up some new research windows where our team’s and Ian’s interests overlap.


Ian is a postdoctoral researcher at the Insight Centre for Data Analytics, with an incredibly interesting background – a mixture of pure Mathematics, Psychology, and Deep Learning. His research is focused on how the emotions of entire communities change over time, which he researches by creating language models that detect the emotions people express on social media. For his PhD, he analyzed Tweets produced by pro-anorexia communities over three years and tracked their emotions, and showed that an individual’s actions are much more driven by their surrounding community than is generally accepted. Continuing this research, Ian now specializes in finding new ways to build Machine Learning and Deep Learning models to analyze emotions in online communities.

Ian’s placement is mutually beneficial on two levels. First, Ian’s experience in building language models for emotion analysis is obviously beneficial to us, and we can offer Ian a cutting edge research infrastructure and the opportunity to learn from our team in turn. But we’re also really excited at the possibility of opening up new research areas based on common interests, for example by building on existing research between Ian and our PhD student, Sebastian. Ian’s research into reducing dimensionality in data sets crosses over with Sebastian’s work into Domain Adaptation in a really interesting way, and we’re excited that this could open up a new research area for us to work on.

Outside of AYLIEN, Ian also speaks four languages, he was a professional musician (but that was in a previous life, he tells us), and he’s also sailed across the Atlantic in a small boat, so he’ll hopefully have some input into the next AYLIEN team-building exercises…

Welcome to the team, Ian!

If you want to find out more about the Fellowship, check out the LinkedIn group, and if your research interests overlap with ours in any way, drop us a line at – we love hearing from other researchers!


In 2017, video content is becoming ever more central to how people consume media. According to research by HighQ, this year around 30% of smartphone users will watch video content on their device at least once a day. In addition to this, people will spend on average an extra two minutes browsing sites that feature video content compared with sites that do not. For this reason, video content is an important component of driving up revenues for online news publishers, since keeping your audience on your site allows you to sell more ads.

But even though we can find great market research on consumer behavior around video content, we couldn’t find an answer to the following question — what type of video content is the news industry publishing to capitalize on this? For example, how much video content is actually being published? Are some publishers dominating video content? And are some subjects being supplemented with videos more regularly than others? Knowing this would allow us to understand what areas of the online news industry are set to flourish in the coming years with the growing emphasis on video.

We decided to use the News API to look into this question. Last month, our API crawled, analyzed, and indexed 1,344,947 stories as they were published. One of the metadata points that it analyzed was how many images and videos were embedded on the page. So for this blog, we’ll analyze the 1.3 million stories our News API gathered in July to find answers to the following questions:

  1. How many of the stories published last month featured video content?
  2. What were the stories with video content about?
  3. Which news organizations published the most video content?

1. How many stories published last month contained video content?

To get an idea of how far the video medium has spread into the online news industry, we need to find how much video content was used by news publishers last month. To do this, we used the News API’s Time Series endpoint to sort the stories published in July according to how many videos they contained. We then visualized the results to show how many stories contained no videos, how many contained one video, and how many contained more than one. Take a look below at what we found:

As you can see, 96% of stories published last month did not contain any video content, whereas just under 4% contained one video or more. We found this interesting — while HighQ found that almost 30% of smartphone users will watch video content online at least once per day, we can see here that barely 3.5% of news content published last month contained a video. This isn’t really optimal for an industry that relies on clicks for ad revenue.
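The split above is a straightforward calculation once the Time Series endpoint has bucketed stories by how many videos they contain. A small sketch of that step (the bucket sizes here are illustrative, and the real query parameter for filtering by video count would need to be checked against the docs):

```python
def video_share(buckets):
    """Given {number_of_videos: story_count} buckets, return the
    percentage of stories that contain at least one video."""
    total = sum(buckets.values())
    with_video = sum(count for videos, count in buckets.items() if videos >= 1)
    return 100.0 * with_video / total

# Illustrative bucket sizes, not the real July counts:
sample_buckets = {0: 1297813, 1: 43134, 2: 4000}
```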

But let’s focus on the news stories that contained video content. If we knew what these stories were about, we would have a good idea about what areas of online news are likely to fare well, since these areas likely account for a large proportion of ad revenue, and are therefore likely to grow. To look into this, we decided to try to understand what the stories containing video content were about.

2. What were the stories containing video about?

Knowing that only around one out of every thirty stories contained video content last month is interesting, but it begs the question of what these stories were about. To answer this question, we used the Trends endpoint to analyze the 43,134 stories that contained one video and see what subjects each one was about.

One of the pieces of information our News API extracts is the topics discussed in each story, along with the categories the story fits into, based on two taxonomies. For this visualization, we’ll use the advertising industry’s IAB-QAG taxonomy. Take a look below at which categories contained the most video content:

You can see that the Entertainment category had the most stories with video content accompanying them. This isn’t surprising to us at first, as we have all seen articles about celebrities with annoying videos that play automatically. But if you remember last month’s media roundup, you’ll remember that the Sports and Law, Government, and Politics categories produced by far the highest volumes of content (the Sports category alone published over double the content of the Entertainment category). This means that not only are there more videos about entertainment, but also that stories about entertainment are much more likely to contain a video than stories about politics.

So now we know which subject categories video content appeared in the most. But with the News API, we can go one step further and see exactly what people were talking about in the stories that contained a video. To do this, we used the Trends endpoint again to extract the entities mentioned in the titles of these stories. Take a look at the chart below to see what people were talking about:

Here you can see exactly what the stories containing videos were about. The single biggest subject that was accompanied by a video was Love Island, a reality TV show. But you can also see that large soccer clubs are well represented on the chart. If you think back to last month’s roundup again, you’ll remember the huge reach and popularity of the top soccer clubs, even during their off-season. The chart above shows that these large soccer clubs are also being covered more with video content than other entities, with publishers obviously trying to leverage this reach to attract people to the stories they publish.
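Under the hood, ranking entities like this is a simple tally over story metadata. A rough sketch — the nested `entities` → `title` field layout is an assumption modelled on the News API's story objects, and the sample data is made up:

```python
from collections import Counter

def top_entities(stories, n=5):
    """Tally entity mentions in story titles and return the n most common.

    The entities -> title -> text layout is an assumed field layout,
    not the verified schema; check the News API docs for the real one.
    """
    counts = Counter(
        entity["text"]
        for story in stories
        for entity in story.get("entities", {}).get("title", [])
    )
    return counts.most_common(n)

# Made-up sample stories:
sample_video_stories = [
    {"entities": {"title": [{"text": "Love Island"}]}},
    {"entities": {"title": [{"text": "Love Island"}, {"text": "ITV"}]}},
    {"entities": {"title": [{"text": "Manchester United"}]}},
]
```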

With large soccer clubs dominating both regular news content and video news content, and with ad revenues for video content being so valuable, these soccer clubs look like they have a bright future in terms of media content. Since the clubs benefit financially from media coverage through things like player image rights and viewership of games, large transfer fees like the $263 million PSG are going to pay for Neymar don’t look so crazy.

3. Who were the biggest publishers of video content?

As we mentioned in the introduction, we want to find out which publishers are making the quickest transition to video-based content, as this has a knock-on effect on site viewership, and therefore ad revenues. Knowing which players are leading industry trends like this is a good indicator of which ones are going to survive in an industry that is under financial pressure while transitioning to digital.

With that in mind, we used the Trends endpoint to find out which publishers were leading the way in video content. You can see pretty clearly from the graph below that the Daily Mail dominates last month’s video content. To see the rest of the publishers more clearly, you can select the Daily Mail bubble below and click “exclude”.

The Daily Mail obviously dominate the chart here, which isn’t too surprising when you consider that they feature video as a central part of the content on their site. They produce a huge number of stories every month, and feature video even when it isn’t completely related to the story it appears with. Although the discontinuity can seem odd, even a loosely related video can increase click-through rate and revenues.

As you can see, many traditional news publishers are lagging behind in the amount of video they’re publishing, with The Guardian, Forbes, ABC, and the Daily Mail among the few recognizable print and television giants on the graph. Instead, the field is largely made up of publishers like Elite Daily, Uproxx, and Heavy: digital-native organizations that are publishing more online video content than most traditional publishers.

Well, that concludes our brief analysis of last month’s video content in news stories. If you’re an AYLIEN subscriber, we’d like to remind you that the two endpoints we used in this post (Trends and Time Series) do not return stories, so you can hit them as much as you like and they won’t contribute towards your monthly 10,000 stories. So dig in!

If you’re not a subscriber, you can try the News API free of charge for two weeks by clicking on the image below (free means free, there’s no card required or obligation to buy).

News API - Sign up


In Machine Learning, the traditional assumption is that the data our model is applied to comes from the same distribution as the data we used for training. This assumption breaks down as soon as we move into the real world: many of the data sources we encounter will be very different from our original training data. In practice, this causes the performance of our model to deteriorate significantly.

Domain adaptation is a prominent approach to transfer learning that can help to bridge this discrepancy between the training and test data. Domain adaptation methods typically seek to identify features that are shared between the domains, or to learn representations that are general enough to be useful for both. In this blog post, I will discuss the motivation for, and the findings of, the recent paper that I published with Barbara Plank. In it, we outline a complementary approach to domain adaptation: rather than learning a model that can adapt between the domains, we learn to select the data that is most useful for training our model.

Preventing Negative Transfer

The main motivation behind selecting data for transfer learning is to prevent negative transfer. Negative transfer occurs if the information from our source training data is not only unhelpful but actually counter-productive for doing well on our target domain. The classic example for negative transfer comes from sentiment analysis: if we train a model to predict the sentiment of book reviews, we can expect the model to do well on domains that are similar to book reviews. Transferring a model trained on book reviews to reviews of electronics, however, results in negative transfer, as many of the terms our model learned to associate with a certain sentiment for books, e.g. “page-turner”, “gripping”, or — worse — “dangerous” and “electrifying”, will be meaningless or have different connotations for electronics reviews.

In the classic scenario of adapting from one source to one target domain, the only thing we can do about this is to create a model that is capable of disentangling these shifts in meaning. However, adapting between two very dissimilar domains still fails often or leads to painfully poor performance.

In the real world, we typically have access to multiple data sources. In this case, we can train our model only on the data that is most helpful for our target domain. It is unclear, however, how best to determine the helpfulness of source data with respect to a target domain. Existing work generally relies on measures of similarity between the source and the target domain.
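To make “similarity between domains” concrete, one common family of measures compares the term distributions of the two domains, for example with Jensen–Shannon divergence. The sketch below is a minimal illustration with hypothetical review snippets, not the paper’s implementation:

```python
import math
from collections import Counter

def term_distribution(text, vocab):
    """Relative frequency of each vocabulary term in the text."""
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocab) or 1
    return [counts[w] / total for w in vocab]

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, so the result lies in [0, 1])."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical source (books) and target (electronics) review snippets
books = "a gripping page turner the plot was gripping"
electronics = "the battery life was poor but the screen was sharp"
vocab = sorted(set(books.split()) | set(electronics.split()))

# Higher similarity means the domains use words more alike
similarity = 1 - js_divergence(term_distribution(books, vocab),
                               term_distribution(electronics, vocab))
```

A domain pair that shares more vocabulary (say, book reviews and movie reviews) would score a higher similarity under this measure than books vs. electronics.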

Bayesian Optimization for Data Selection

Our hypothesis is that the best way to select training data for transfer learning depends on the task and the target domain. In addition, while existing measures only consider data in relation to the target domain, we also argue that some training examples are inherently more helpful than others.

For these reasons, we propose to learn a data selection measure for transfer learning. We do this using Bayesian Optimization, a framework that has been used successfully to optimize hyperparameters in neural networks and which can be used to optimize any black-box function. We learn this function by defining several features relating to the similarity of the training data to the target domain as well as to its diversity. Over the course of several iterations, the data selection model then learns the importance of each of those features for the relevant task.
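As a toy illustration of this idea (emphatically not the paper’s implementation): each candidate training example carries similarity and diversity features, a data selection measure is a weighted sum of those features, and we search for the feature weights that maximize a cheap proxy objective. Here the features are synthetic, the proxy objective is a stand-in for training a model and measuring dev-set accuracy, and plain random search stands in for Bayesian Optimization:

```python
import random

random.seed(0)

# Each candidate training example carries precomputed features:
# (similarity_to_target, diversity). Values here are synthetic.
examples = [(random.random(), random.random()) for _ in range(100)]

def select_top_k(weights, examples, k=20):
    # Score = weighted sum of features; keep the k highest-scoring examples.
    key = lambda feats: sum(w * x for w, x in zip(weights, feats))
    return sorted(examples, key=key, reverse=True)[:k]

def proxy_model_score(selected):
    # Stand-in for training a cheap proxy model and evaluating it on
    # target-domain dev data; rewards selections that are similar AND diverse.
    return sum(s * d for s, d in selected) / len(selected)

# Random search over feature weights, standing in for Bayesian Optimization:
# propose weights, select data, evaluate the proxy, keep the best weights.
best_weights, best_score = None, -1.0
for _ in range(200):
    weights = (random.uniform(-1, 1), random.uniform(-1, 1))
    score = proxy_model_score(select_top_k(weights, examples))
    if score > best_score:
        best_weights, best_score = weights, score
```

The learned `best_weights` play the role of the data selection measure: they encode how much each feature should count when ranking candidate training examples for a given task and target domain.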

Evaluation & Conclusion

We evaluate our approach on three tasks: sentiment analysis, part-of-speech tagging, and dependency parsing. We compare our approach to random selection as well as to existing methods that select either the most similar source domain or the most similar training examples.

For sentiment analysis on reviews, training on the most similar domain is a strong baseline as review categories are clearly delimited. We significantly improve upon this baseline and demonstrate that diversity complements similarity. We even achieve performance competitive with a state-of-the-art domain adaptation approach, despite not performing any adaptation.

We observe smaller but consistent improvements for part-of-speech tagging and dependency parsing. Lastly, we evaluate how well learned measures transfer across models, tasks, and domains. We find that a data selection measure can be learned with a simpler model that serves as a proxy for a state-of-the-art model. Transfer across domains is robust, while transfer across tasks holds, as one would expect, for related tasks such as POS tagging and parsing, but fails for dissimilar tasks, e.g. parsing and sentiment analysis.

In the paper, we demonstrate the importance of selecting relevant data for transfer learning. We show that taking into account task and domain-specific characteristics and learning an appropriate data selection measure outperforms off-the-shelf metrics. We find that diversity complements similarity in selecting appropriate training data and that learned measures can be transferred robustly across models, domains, and tasks.

This work will be presented at the 2017 Conference on Empirical Methods in Natural Language Processing. More details can be found in the paper here.

Text Analysis API - Sign up


Every day, we generate huge amounts of text online, creating vast quantities of data about what is happening in the world and what people think. All of this text data is an invaluable resource that can be mined to generate meaningful business insights for analysts and organizations. However, analyzing all of this content isn’t easy, since converting text produced by people into structured information that a machine can analyze is a complex task. In recent years, though, Natural Language Processing and Text Mining have become a lot more accessible for data scientists, analysts, and developers alike.

There is a massive amount of resources, code libraries, services, and APIs out there which can all help you embark on your first NLP project. For this how-to post, we thought we’d put together a three-step, end-to-end guide to your first introductory NLP project. We’ll start from scratch by showing you how to build a corpus of language data and how to analyze this text, and then we’ll finish by visualizing the results.

We’ve split this post into 3 steps. Each of these steps will do two things: show a core task that will get you familiar with NLP basics, and also introduce you to some common APIs and code libraries for each of the tasks. The tasks we’ve selected are:

  1. Building a corpus — using Tweepy to gather sample text data from Twitter’s API.
  2. Analyzing text — analyzing the sentiment of a piece of text with our own SDK.
  3. Visualizing results — how to use Pandas and matplotlib to see the results of your work.

Please note: This guide is aimed at developers who are new to NLP and anyone with a basic knowledge of how to run a script in Python. If you don’t want to write code, take a look at the blog posts we’ve put together on how to use our RapidMiner extension or our Google Sheets Add-on to analyze text.

Step 1. Build a Corpus

You can build your corpus from anywhere — maybe you have a large collection of emails you want to analyze, a collection of customer feedback in NPS surveys that you want to dive into, or maybe you want to focus on the voice of your customers online. There are lots of options open to you, but for the purpose of this post we’re going to use Twitter as our focus for building a corpus. Twitter is a very useful source of textual content: it’s easily accessible, it’s public, and it offers an insight into a huge volume of text that contains public opinion.

Accessing the Twitter Search API using Python is pretty easy. There are lots of libraries available, but our favourite option is Tweepy. In this step, we’re going to use Tweepy to ask the Twitter API for 500 of the most recent Tweets that contain our search term, and then we’ll write the Tweets to a text file, with each Tweet on its own line. This will make it easy for us to analyze each Tweet separately in the next step.

You can install Tweepy using pip:

pip install tweepy

Once completed, open a Python shell to double-check that it’s been installed correctly:

>>> import tweepy

First, we need permission from Twitter to gather Tweets from the Search API, so you need to sign up as a developer to get your consumer keys and access tokens, which should take you three or four minutes. Next, you need to build your search query by adding your search term to the q field. You can also add further parameters like the language, the number of results you want returned, and the time period to search in. You can get very specific about what you want to search for on Twitter; to build a more complicated query, take a look at the list of operators you can search with in the Search API introduction.

Fill your credentials and your query into this script:

## import the libraries
import tweepy, codecs

## fill in your Twitter credentials
consumer_key = 'your consumer key here'
consumer_secret = 'your consumer secret key here'
access_token = 'your access token here'
access_token_secret = 'your access token secret here'

## let Tweepy set up an instance of the REST API
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

## fill in your search query and store your results in a variable
results = = "your search term here", lang = "en",
                     result_type = "recent", count = 500)

## use the codecs library to write the text of the Tweets to a .txt file
file ="your text file name here.txt", "w", "utf-8")
for result in results:
    file.write(result.text + "\n")
file.close()

You can see in the script that we write result.text to the .txt file rather than the full result object that the API returns to us. APIs that return language data from social media or online journalism sites usually return lots of metadata along with your results. To do this, they format their output in JSON, which is easy for machines to read.

For example, in the script above, every “result” is its own JSON object, with “text” being just one field — the one that contains the Tweet text. Other fields in the JSON file contain metadata like the location or timestamp of the Tweet, which you can extract for a more detailed analysis.

To access the rest of the metadata, we’d need to write to a JSON file, but for this project we’re just going to analyze the text of people’s Tweets. So in this case, a .txt file is fine, and our script will just forget the rest of the metadata once it finishes. If you want to take a look at the full JSON results, print everything the API returns to you instead:
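One minimal way to do that is to pretty-print the raw JSON (Tweepy’s Status objects expose it via the `_json` attribute). A small stand-in dict with the same shape is used here so the snippet runs on its own:

```python
import json

# Stand-in for `result._json` on a Tweepy Status object: the raw dict
# returned by Twitter, with metadata fields alongside the Tweet text.
sample_result = {
    "created_at": "Mon Aug 07 12:00:00 +0000 2017",
    "text": "Sample tweet text",
    "user": {"screen_name": "example_user", "location": "Dublin"},
}

# Pretty-print every field, not just the text
print(json.dumps(sample_result, indent=2))
```

In the script above you would loop over `results` and print `result._json` for each one instead of writing `result.text` to the file.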

This is also why we used the codecs module: to avoid any formatting issues when the script reads the JSON results and writes utf-8 text.

Step 2. Analyze Sentiment

Once we’ve collected the text of the Tweets we want to analyze, we can use more advanced NLP tools to start extracting information from it. Sentiment analysis is a great example, since it tells us whether people were expressing positive, negative, or neutral sentiment in the text we have.

For sentiment analysis, we’re going to use our own AYLIEN Text API. Just like with the Twitter Search API, you’ll need to sign up for the free plan to grab your API key (don’t worry — free means free permanently. There’s no credit card required, and we don’t harass you with promotional stuff!). This plan gives you 1,000 calls to the API per month free of charge.

Again, you can install using pip:

pip install aylien-apiclient

Then make sure the SDK has installed correctly from your Python shell:

>>>from aylienapiclient import textapi

Once you’ve got your Application ID and application key, insert them into the code below to get started with your first call to the API from the Python shell (we also have extensive documentation in 7 popular languages). Our API lets you make your first call with just four lines of code:

>>>from aylienapiclient import textapi
>>>client = textapi.Client("Your_app_ID", "Your_application_key")
>>>sentiment = client.Sentiment({'text': 'enter some of your own text here'})

This will return JSON results to you with metadata, just like our results from the Twitter API.
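For reference, the Sentiment endpoint’s response is a JSON object along these lines (the field values shown are illustrative; text and polarity are the fields the next script extracts):

```python
# Illustrative shape of the Sentiment endpoint's JSON response
sentiment = {
    "text": "enter some of your own text here",
    "polarity": "positive",        # positive / negative / neutral
    "polarity_confidence": 0.97,   # illustrative confidence value
}

# The fields we care about for this project
print(sentiment["text"], sentiment["polarity"])
```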

So now we need to analyze our corpus from step 1. To do this, we need to analyze every Tweet separately. The script below uses the io module to open up a new .csv file and write the column headers “Tweet” and “Sentiment”, and then it opens and reads the .txt file containing our Tweets. Then, for each Tweet in the .txt file it sends the text to the AYLIEN API, extracts the sentiment prediction from the JSON that the AYLIEN API returns, and writes this to the .csv file beside the Tweet itself.

This will give us a .csv file with two columns — the text of a Tweet and the sentiment of the Tweet, as predicted by the AYLIEN API. We can look through this file to verify the results, and also visualize our results to see some metrics on how people felt about whatever our search query was.

from aylienapiclient import textapi
import csv, io

## Initialize a new client of AYLIEN Text API
client = textapi.Client("your_app_ID", "your_app_key")

with'Trump_Tweets.csv', 'w', encoding='utf8', newline='') as csvfile:
    csv_writer = csv.writer(csvfile)
    csv_writer.writerow(["Tweet", "Sentiment"])
    with"Trump.txt", 'r', encoding='utf8') as f:
        for tweet in f.readlines():
            ## Remove extra spaces or newlines around the text
            tweet = tweet.strip()

            ## Skip tweets which are empty so you don't waste your API credits
            if len(tweet) == 0:

            ## Make call to AYLIEN Text API
            sentiment = client.Sentiment({'text': tweet})

            ## Write the sentiment result into the csv file
            csv_writer.writerow([sentiment['text'], sentiment['polarity']])

You might notice on the final line of the script that when the script goes to write the Tweet text to the file, we’re actually writing the Tweet as it is returned by the AYLIEN API, rather than the Tweet from the .txt file. They are both identical pieces of text, but we’ve chosen to write the text from the API just to make sure we’re reading the exact text that the API analyzed. This is just to make it clearer if we’ve made an error somehow.

Step 3. Visualize your Results

So far we’ve used an API to gather text from Twitter, and used our Text Analysis API to analyze whether people were speaking positively or negatively in their Tweets. At this point, you have a couple of options for what you do with the results. You can feed this structured information about sentiment into whatever solution you’re building, which could be anything from a simple social listening app to an automated report on the public reaction to a campaign. You could also use the data to build informative visualizations, which is what we’ll do in this final step.

For this step, we’re going to use matplotlib to visualize our data and Pandas to read the .csv file, two Python libraries that are easy to get up and running. You’ll be able to create a visualization from the command line or save it as a .png file.

Install both using pip:

pip install matplotlib
pip install pandas

The script below opens up our .csv file and uses Pandas to read the column titled “Sentiment”. It uses Counter to count how many times each sentiment appears, and then matplotlib plots Counter’s results as a color-coded pie chart (you’ll need to enter your search query into the “yourtext” variable so it appears in the chart title).

## import the libraries
import matplotlib.pyplot as plt 
import pandas as pd
from collections import Counter
import csv 

## open up your csv file with the sentiment results
with open('your_csv_file_from_step_2.csv', 'r', encoding='utf8') as csvfile:
    ## use Pandas to read the "Sentiment" column
    df = pd.read_csv(csvfile)
    sent = df["Sentiment"]

## use Counter to count how many times each sentiment appears
## and save each count as a variable
counter = Counter(sent)
positive = counter['positive']
negative = counter['negative']
neutral = counter['neutral']

## declare the variables for the pie chart, using the Counter variables for "sizes"
labels = 'Positive', 'Negative', 'Neutral'
sizes = [positive, negative, neutral]
colors = ['green', 'red', 'grey']
yourtext = "Your Search Query from Step 2"

## use matplotlib to plot the chart
plt.pie(sizes, labels = labels, colors = colors, shadow = True, startangle = 90)
plt.title("Sentiment of 200 Tweets about " + yourtext)

If you want to save your chart to a .png file instead of displaying it, replace on the last line with plt.savefig('your_chart_name.png'). Below is the visualization we ended up with (we searched for “Trump” in step 1).

Screenshot (261)
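The save-to-file variant mentioned above might look like this (the counts are illustrative stand-ins for the Counter results, and the Agg backend is selected so the script runs without a display, e.g. on a server):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display window needed
import matplotlib.pyplot as plt

# Illustrative counts standing in for the Counter results from step 3
sizes = [12, 5, 8]
labels = ["Positive", "Negative", "Neutral"]
colors = ["green", "red", "grey"]

plt.pie(sizes, labels=labels, colors=colors, shadow=True, startangle=90)
plt.title("Sentiment of Tweets about Your Search Query")
plt.savefig("sentiment_chart.png")  # writes the chart to disk instead of showing it
```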

If you run into any issues with these scripts, big or small, please leave a comment below and we’ll look into it. We always try to anticipate any problems our own users might run into, so be sure to let us know!

That concludes our introductory Text Mining project with Python. We hope it gets you up and running with the libraries and APIs, and that it gives you some ideas about subjects that would interest you. With the world producing content on such a large scale, the only obstacle holding you back from an interesting project is your own imagination!

Happy coding!

Text Analysis API - Sign up