

At AYLIEN we like to use topical and interesting events like the FIFA World Cup and the Super Bowl to showcase our technology in a simple and interesting way. We mainly choose world events with a lot of hype around them so that we can dig into public opinion using data collected from social media and other sources. This time around we decided to focus on Super Bowl 50, which took place on the 7th of February 2016, to try and get a handle on the public reaction.

Super Bowl 50 saw the Denver Broncos line out against the Carolina Panthers in what was a battle between the strongest defensive outfit in the league, the Broncos, and an offense-focused Panthers team.

We set out to try and understand the public reaction to Super Bowl 50 by collecting and analyzing reactions online. We hoped to uncover interesting insights and correlations in the build up to and during the actual game. We focused our attention on the volume of chatter surrounding the event for each team, the battle of the quarterbacks and even the advertising battle which has become such a huge part of the whole Super Bowl event each year.

(Interested in the ads battle? Check out the recording of our recent Webinar with RapidMiner where we dive into who came out on top in the SB50 commercials battle here.)


Overall we collected about 1.8 million tweets using the Twitter Search and Streaming APIs. We also pulled team information like rosters and coaches' names from the SportRadar API, which we later used to segregate tweets. We analyzed all of the tweets gathered using the AYLIEN Text Analysis API and visualized our results in Tableau. We’ll talk more about the whole process later in the blog.

Tools used:

We focused our data collection on keywords, hashtags and handles that were related to Super Bowl 50. You can download the data set here.

Once we collected all of our tweets we spent a bit of time cleaning and prepping our data. We disregarded some of the metadata which we felt we didn’t need. We kept key indicators like time stamps, geolocation, tweet ID and the raw text of each tweet. We also removed any retweets and tweets that contained links. From previous experience, tweets that contain links are mostly objective and don’t hold any opinion towards the event.
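To make the cleaning step concrete, here is a minimal sketch of that kind of filter. It is not the code from our walkthrough: the field names (`created_at`, `coordinates`, `retweeted_status`) loosely follow Twitter's tweet JSON and are assumptions, as is the simple URL pattern.

```javascript
// Illustrative only: field names and the URL pattern are assumptions.
const URL_PATTERN = /https?:\/\/\S+/i;

function isRetweet(tweet) {
  // Either the API marked it as a retweet, or the text uses the manual "RT @user" form.
  return Boolean(tweet.retweeted_status) || /^RT @/.test(tweet.text);
}

function cleanTweets(tweets) {
  return tweets
    // Drop retweets and tweets that contain links.
    .filter(function (t) { return !isRetweet(t) && !URL_PATTERN.test(t.text); })
    // Keep only the fields needed for the analysis.
    .map(function (t) {
      return { id: t.id, timestamp: t.created_at, geo: t.coordinates || null, text: t.text };
    });
}

const sampleTweets = [
  { id: 1, created_at: '2016-02-07T23:30:00Z', text: 'Let\'s go Broncos!' },
  { id: 2, created_at: '2016-02-07T23:31:00Z', text: 'RT @fan: Panthers all the way' },
  { id: 3, created_at: '2016-02-07T23:32:00Z', text: 'Game preview http://example.com/preview' }
];

console.log(cleanTweets(sampleTweets));
```

In this sample only the first tweet survives: the second is a retweet and the third contains a link.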

You can read more about the technicalities of the process and even copy the code we used in our walkthrough available here.


We used Tableau to visualize our results and embedded some of the more interesting visualizations below.

We started off by analyzing the volume of chatter on Twitter in the build up to and during Super Bowl 50. You can see in the graph below how the chatter builds in the few days leading up to the event, with pretty obvious peaks in volume at the start and right at the end of the game. The highest number of tweets was published at the end, as people expressed their reaction to the result.
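A volume-over-time chart like this boils down to counting tweets per time bucket. We built ours in Tableau, so the snippet below is only a rough sketch of the idea; the `timestamp` field name and the hourly bucket size are assumptions.

```javascript
// Count tweets per hour (UTC). Assumes each tweet has an ISO 8601 `timestamp` string.
function countByHour(tweets) {
  const counts = {};
  tweets.forEach(function (t) {
    // Truncate the timestamp to the hour, e.g. "2016-02-07T23".
    const hour = new Date(t.timestamp).toISOString().slice(0, 13);
    counts[hour] = (counts[hour] || 0) + 1;
  });
  return counts;
}

console.log(countByHour([
  { timestamp: '2016-02-07T23:30:00Z' },
  { timestamp: '2016-02-07T23:45:10Z' },
  { timestamp: '2016-02-08T03:05:00Z' }
]));
```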


Volume of Tweets:

We looked at the overall volume, which was somewhat interesting, but we also wanted to know which team had the most vocal fans and who was tweeting the most. To understand the reaction towards each team we needed to separate the tweets in some way. The first approach we took was to use pre-identified hashtags, in this case #BroncosWin and #PanthersWin, which in the build up were touted as the official hashtags to use.

This didn’t prove too useful, however. For the most part, hashtags can have massive spikes in usage and popularity, but they usually fade away quite dramatically or get replaced by other hashtags that are trending at that time. This is nicely visualized in the graph below, which shows the decline in usage for each hashtag.

Team Specific Hashtags

Our second approach was a bit more technical and it focused around the idea of classifying tweets as Denver or Carolina focused based on what concepts – team, coach, players, cities – were mentioned in a tweet. We accomplished this using our Concept Extraction feature. For example if a tweet mentions Cam Newton it is most likely a tweet about Carolina.
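The idea can be sketched with plain keyword matching. This is a simplified stand-in, not our Concept Extraction feature: the term lists are short illustrative samples rather than the full rosters we pulled from SportRadar, and the `unclear` label for ties is an assumption of this sketch.

```javascript
// Illustrative term lists only; a real run would use full rosters and coaching staff.
const TEAM_TERMS = {
  carolina: ['panthers', 'carolina', 'charlotte', 'cam newton', 'ron rivera'],
  denver: ['broncos', 'denver', 'peyton manning', 'gary kubiak']
};

function classifyTeam(text) {
  const lower = text.toLowerCase();
  const hits = {};
  Object.keys(TEAM_TERMS).forEach(function (team) {
    // Count how many of the team's terms appear in the tweet.
    hits[team] = TEAM_TERMS[team].filter(function (term) {
      return lower.indexOf(term) !== -1;
    }).length;
  });
  if (hits.carolina === 0 && hits.denver === 0) return 'not relevant';
  if (hits.carolina === hits.denver) return 'unclear'; // mentions both sides equally
  return hits.carolina > hits.denver ? 'carolina' : 'denver';
}

console.log(classifyTeam('Cam Newton is unstoppable'));
console.log(classifyTeam('Peyton Manning and the Broncos defense'));
console.log(classifyTeam('Great halftime show'));
```

Substring matching like this is crude: it has no notion of context or disambiguation, which is exactly what concept extraction adds.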

Volume by Team:

We had a lot more success with this approach and were able to classify around 40% of the 1.8M tweets we collected as Carolina focused, Denver focused or not relevant. As you can see in the visualization above, the Panthers fans were far more vocal, tweeting about twice as much as the Broncos fans.

Were the Broncos fans quietly confident, or is it down to something simpler, like there being more Panthers fans than Broncos fans?

Location of Tweets

We also wanted to understand where these tweets were coming from. We assumed that for each team the majority of the activity would be centered around their home cities, Denver and Charlotte, and we were right. You can see a strong concentration of activity clustered around North Carolina for the Panthers tweets.

Carolina Tweets:

It was much the same for the Broncos tweets with most of the activity focused around Colorado.

While both teams seem to have pockets of fans based in other major cities on the east and west coasts, there seem to be a lot more Panthers fans spread along the east coast from Florida to New Hampshire.

The other obvious clusters were coming mainly from the San Francisco area where the game was held.

Denver Tweets:


While the volume of tweets and how it increases and decreases is interesting, it doesn’t tell us a whole lot about the opinion of the public, who they were going for, which players they like and who they thought would come out on top.

We used the results from the analysis we did using our Sentiment Analysis and Concept Extraction features to understand what people were actually tweeting about and what their sentiment was, towards teams and some players.

Overall Sentiment:

First off we looked at the overall polarity i.e. how many tweets were positive, how many were classified negative and how many we deemed neutral. The majority of tweets, as expected came back as neutral.
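As a rough sketch of this tally, the snippet below turns a list of polarity labels into shares of the total. It assumes each tweet has already been labelled `positive`, `negative` or `neutral` by a sentiment step; the function name and the choice to ignore unknown labels are ours for illustration.

```javascript
// Share of each polarity among labelled tweets. Unknown labels are ignored.
function polarityShare(labels) {
  const counts = { positive: 0, negative: 0, neutral: 0 };
  let total = 0;
  labels.forEach(function (label) {
    if (Object.prototype.hasOwnProperty.call(counts, label)) {
      counts[label] += 1;
      total += 1;
    }
  });
  const share = {};
  Object.keys(counts).forEach(function (label) {
    // Avoid dividing by zero when there are no labelled tweets.
    share[label] = total === 0 ? 0 : counts[label] / total;
  });
  return share;
}

console.log(polarityShare(['neutral', 'neutral', 'positive', 'negative']));
```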

While you can’t tell from this graph which team the positivity or negativity is directed at, there are still some interesting insights here. Notably, most of the opinionated tweets are positive in the build up to the game: everyone believes in their team, they’re excited for the big game and they’re showing it with an overall positive sentiment. The negativity does creep in once the game kicks off, however, reaching its peak at the end of the game, which you could assume was down to the disappointment of the Carolina fans.

We’ll talk more about this in the next section but that initial very severe spike in activity and positivity is also interesting.


Using the same approach as before with Concept Extraction to separate tweets we could split the positive and negative tweets into Broncos and Panthers related tweets.

The Carolina Panthers certainly had the most chatter about them from a volume point of view but they also had the most positive sentiment towards them in the build up and the beginning of the game.

Were the Panthers fans too cocky?

Sentiment towards each team (build up):

What happened on February 5th?

The first thing we noticed from the visualization above was the extreme spike on Feb 5th. It’s somewhat strange to see such a large spike in activity at that time, especially because it was so rich in positive sentiment. After some digging in the data we figured out that this was down to a campaign run by Sports Central, where they asked their followers to vote on who was going to win Super Bowl 50 using a Twitter poll. Once you voted, your account automatically tweeted a pre-written message. This certainly shows the effect a poll can have on Twitter, but its effectiveness is quite short lived.

As was the case with the hashtags #BroncosWin and #PanthersWin we discussed earlier, the Twitter poll gave a very strong indication of the public opinion at that time, but failed to deliver insight throughout the rest of the build up and during the game itself.

Game Day

The graph below focuses on game day. There was quite a significant spike in negative sentiment towards Carolina over the last two quarters, and at the end of the game we can see quite a significant amount of negativity present. The opposite effect can be seen on the Denver side, with a large positive spike right at the end of the game as fans expressed their delight with the result.

Sentiment towards each team (Game):


We also wanted to focus on some key individuals and how they performed in the eyes of the public. Anyone who watches football will tell you, a lot of the game focuses on one key position, the quarterback.

Below we analyzed both the positive and negative reactions towards both Cam Newton and Peyton Manning during the game. The overall reaction by fans to Newton’s performance was pretty poor. He only completed 18/41 passes and was sacked a total of 6 times, which was reflected pretty clearly in the dominance of negative sentiment towards Newton in tweets, especially as it became more evident that the Broncos had shut the Panthers' offense down.

Manning, having not thrown a single touchdown pass, was still praised for his performance and control of the game. After all, Denver are known for their defense-focused strategies and they closed out the Carolina attack and Newton’s offensive efforts in particular. So while Manning didn’t deliver a perfect quarterback performance, as usual he delivered the goods in the form of a win, much to the satisfaction of the fans.

Player Reaction on Game Day

From a volume of tweets point of view, other players of note included Greg Olsen and Demaryius Thomas who had the most mentions in tweets.  

Other Players Mentioned


While this was put together as a fun exercise there are some key takeaways that can be applied to more business and commercially-focused applications. From a data analytics point of view this use case could be classed as a “voice of the customer application” of Text Analytics with a focus on social listening. It’s pretty clear there is a wealth of information about customer opinion towards brands and events on social platforms like Twitter.

Key Takeaways:

  • Hashtags can provide insight but they are heavily influenced by flocking and are easily overshadowed or replaced
  • Twitter Polls are a great way to gain immediate traction and reaction from the twittersphere but they are extremely short lived
  • The ability to segment reactions based on concepts in tweets allows for a greater understanding of opinion towards entities and concepts, in this case teams and players, but the same approach can easily be applied to people and brands, for example

As we mentioned above we ran a similar analysis process on brand related tweets for the Super Bowl commercials. Check it out here.



As you know, from time to time at AYLIEN we like to share useful code snippets, apps and use cases that either we’ve put together or our users have built. For this blog we wanted to showcase a neat little script one of our engineers hacked together that can be used when the target webpage you’re analyzing is light on content.

Our standard Article Extraction feature extracts the main body of text, the primary image, the headline, author etc. It’s primarily designed for article type, or blog type pages where there is a significant chunk of text on the page.

But what happens when you want to categorize a home page or even a domain, or what if there just isn’t enough text on a page to analyze?

You have to look elsewhere.

While these types of pages (product pages, home pages, feature pages) may not have one large article-style piece of text, they often have other strong indicators and text hidden in other areas, such as:

  • Headers
  • Meta descriptions
  • Meta tags for images
  • Keyword Tags


These other text sources on webpages can be effectively leveraged when trying to classify “text light” webpages.

Here’s an example demonstrating and explaining how it’s done using our Node.js SDK:

var _ = require('underscore'),
    cheerio = require('cheerio'),
    request = require('request'),
    AYLIENTextAPI = require("aylien_textapi");

var textapi = new AYLIENTextAPI({
    application_id: "YOUR_APP_ID",
    application_key: "YOUR_APP_KEY"
});

// Set this to the page you want to classify
var url = '';
request(url, function(err, resp, body) {
  if (!err) {
    var text = extract(body);
    // Classify the collected text against the IAB QAG taxonomy
    textapi.classifyByTaxonomy({'text': text, 'taxonomy': 'iab-qag'}, function(err, result) {
      if (err) {
        console.error(err);
        return;
      }
      console.log(result);
    });
  }
});

// Collect the non-empty text content of every element matching tagName
function getText(tagName, $) {
  var texts = _.chain($(tagName)).map(function(e) {
    return $(e).text().trim();
  }).filter(function(t) {
    return t.length > 0;
  }).value();

  return texts.join(' ');
}

// Build a single string from meta tags, headers, link text and image alt text
function extract(body) {
  var $ = cheerio.load(body);
  var keywords = $('meta[name="keywords"]').attr('content');
  var description = $('meta[name="description"]').attr('content');
  var imgAlts = _($('img[alt]')).map(function(e) {
    return $(e).attr('alt').trim();
  }).join(' ');

  var h1 = getText('h1', $);
  var h2 = getText('h2', $);
  var links = getText('a', $);
  // Missing meta tags are undefined, which join() renders as an empty string
  var text = [keywords, description, h1, h2, links, imgAlts].join(' ');

  return text;
}

In a nutshell, what we’re doing here is extracting the text present in the meta tags, H1 and H2 tags, linked text and image alt attributes within the page’s HTML instead of relying on the main body of text.

Our ad tech focused customers currently use this or similar approaches to classify text light pages and domains against the IAB QAG taxonomy.

You can copy and paste the snippet for further use or test it out in our Sandbox Environment.



We’ve just returned from RapidMiner Wisdom 2016 in New York City. This fantastic event, organised by our partners RapidMiner, saw data scientists, academics and the newly coined citizen data scientists gather to discuss everything RapidMiner and predictive analytics related.

We saw some really great presentations on a wide range of topics. RapidMiner evangelists spoke about the different ways in which they’re all using RapidMiner’s predictive capabilities to ingest and mine a huge range of data sources to create actionable business outcomes. The talks ranged from Process Mining at Siemens to Risk Analysis at PWC, and we were even introduced to a super cool AYLIEN-powered stock recommender system called Mr.INV, built by the impressive guys at NYU.

We were primarily at RapidMiner Wisdom to showcase our new Text Analysis Extension for RapidMiner, a text analytics plugin that brings the smarts of Natural Language Processing and Text Analysis to any RapidMiner process.

Perhaps we’re somewhat biased, but the key thing we took away from the talks was how crucial mining unstructured data, particularly text, was in each speaker’s data analytics process.

There’s no doubt that the benefits of mining unstructured datasets are coming to the fore in the data mining world. There are 2 reasons for this: the sheer amount of unstructured content being created today and the realization that there is a wealth of information hidden in this user generated content (social conversations, email, news articles, blog posts, etc).

But mining this information can be quite the challenge…

“More content was uploaded yesterday than any one human could ever consume in their entire lifetime”

– Conde Nast, 2015

And it’s not just a problem online and on the web: unstructured data now accounts for nearly 90% of enterprise data.

So while this content we’re generating is rich in useful business and research insight, for the most part it’s unstructured and therefore extremely difficult to understand and mine, especially at the scale required. While it’s easy for a human to read some content, pull out the key points and extract what matters, it’s impossible for us humans to stay on top of the amount of content being created every second of every day. And while it’s getting easier, it’s still difficult for machines to make sense of this unstructured information in the same way they do with structured data.

Structured & Unstructured data

So what’s the difference between Unstructured and Structured data and why is it important to leverage unstructured sources in our Data strategies?

Unstructured data is primarily user generated, it’s not stored in a traditional table or database, it’s noisy and there is a hell of a lot of it. Structured data however differs in that it’s easily referenceable, it’s stored in a table or database, it’s often numbers-heavy and it’s easily ingested by machines or computers.

Our presentation at Wisdom focused on the voice of the customer and the benefits of analyzing unstructured content on social media to mine public opinion towards a certain event or even a brand.

To showcase what could be done we took a real world event, in this case Web Summit 2015, collected tweets about the event and visualized our findings, all within RapidMiner.

Some of the highlights of the talk:

The Web Summit saw 42,000 attendees from 134 countries descend on Dublin. There were 1,000 speakers including some pretty big names in the tech industry (Michael Dell, John Sculley, Ed Catmull, Benedict Evans). In total we collected 199,054 tweets over 3 days and 4 nights, analyzed them in RapidMiner with our Text Analysis extension and visualized our findings in Tableau.


We wanted to see how much chatter there was around the different speaker sessions. To do this we extracted the names of the speakers mentioned and graphed the volume over the 3 days. The graph below shows the volume of tweets with a mention of one of the speakers. You can clearly see spikes in volume when they hit the stage to speak. Tweets mentioning Paddy Cosgrave, Web Summit’s founder, stayed pretty constant throughout and, not too surprisingly, John Sculley had probably the biggest reaction of all the speakers extracted from the data set.

Volume of tweets:

While the activity was quite constant over the three days, you can see three major spikes in the volume of tweets, one for each day. Judging by the drop in volume as each day progressed, people were enjoying themselves too much at the Night Summit, out drinking Guinness in all the pubs in Dublin, to be tweeting. There was also a pretty evident dip in activity during lunch, which suggests everyone was too busy taking advantage of the networking opportunities the lunch break provided.

The other thing we did was analyze the polarity of the tweets we mined, i.e. whether they were positive or negative. We hoped to get a feel for people’s reactions to the event by mining and analyzing the voice of attendees through their expressions and activity on Twitter. Overall the sentiment towards the event was quite positive; however, there were some negative trends that crept in throughout the event.

Polarity of Tweets:

One of the key themes we noticed was a lot of negativity towards certain aspects of the event. Anyone who attended this year can probably guess what people were most vocal and frustrated about. In 2014 it was the Wi-Fi, and in 2015 the food on offer at Web Summit was certainly an interesting talking point online.

Negative theme (Food):

In the graph above we’ve mapped the number of tweets mentioning the food over time and charted the steep decline in the polarity of those tweets as people began to get frustrated. You can also see the reaction of the attendees to the Web Summit coming out, addressing the issue and taking it on the chin, with the spike back into the green zone.

The key message we wanted to get across in our talk wasn’t just how easy it is to do these complex content analysis processes in RapidMiner but also the wealth of information that is hidden in unstructured content online that is too often overlooked in data mining strategies.

We’ll be talking more about this topic on the 16th of February in a joint webinar with RapidMiner where we’re going to analyze the public reaction to the Super Bowl 50 ad wars to try and determine which brand comes out on top. Sign up here.
