

This is our third blog in the “Text Analysis 101; A basic understanding for Business Users” series. The series is aimed at non-technical readers, who would like to get a working understanding of the concepts behind Text Analysis. We try to keep the blogs as jargon free as possible and the formulas to a minimum.

This week’s blog will focus on Topic Modelling. Topic Modelling is an unsupervised Machine Learning (ML) technique. This means that it does not require a training dataset of manually tagged documents from which to learn. It is capable of working directly with the documents in question.

Our first two blogs in the series focused on document classification using both supervised and unsupervised (clustering) methods.

What Topic Modelling is and why it is useful.

As the name suggests, Topic Modelling discovers the abstract topics that occur in a collection of documents. For example, assume that you work for a legal firm and have a large number of documents to consider as part of an eDiscovery process.

As part of the eDiscovery process we attempt to identify certain topics that we are interested in and discard topics we have no interest in. However, for the most part we are dealing with large volumes of documents, and often we have no idea in advance which documents are relevant or irrelevant. Topic Modelling enables the discovery of the high-level topics that exist in the target documents, and also the degree to which each topic is referred to in each document, i.e. the composition of topics for each document. If the documents are ordered chronologically, Topic Modelling can also provide insight into how the topics evolve over time.

LDA – A model for “generating” documents

Latent Dirichlet Allocation (LDA) is the name given to a model commonly used for describing how documents are generated. There are a few basic things to understand about LDA:

  1. LDA views each document as a bag of words: imagine taking the words in a document and pouring them into a bag. All of the word order and grammar would be lost, but all of the words would still be present, i.e. if there are twelve occurrences of the word “the” in the document then there will be twelve “the”s in the bag.
  2. LDA also views documents as if they were “generated” by a mixture of topics, e.g. a document might be generated from 50% sports, 20% finance and 30% gossip.
  3. LDA considers that any given topic will have a high probability of generating certain words and a low probability of generating other words. For example, the “Sports” topic will have a high probability of generating words like “football”, “basketball” and “baseball”, and a low probability of producing words like “kitten”, “puppy” and “orangutan”. The presence of certain words within a document therefore gives an indication of the topics which make up the document.

So, in summary, from the LDA point of view documents are created by the following process:

  1. Choose the topics from which the document will be generated and the proportion of the document to come from each topic. For example, we could choose the three topics and proportions from above, i.e. 50% sports, 20% finance and 30% gossip.
  2. Generate appropriate words from the topics chosen in the proportions specified.

For example, if our document had 10 words and three topics in proportion 50% sports, 20% finance and 30% gossip, the LDA process might generate the following “bag of words” to make up the document.

baseball dollars fans playing Kardashian pays magazine chat stadium ball

Five of the words (baseball, fans, playing, stadium, ball) come from the sports topic, two (dollars, pays) from the finance topic and three (Kardashian, magazine, chat) from the gossip topic.
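In practice you never run this generative process forward yourself; instead, a library is used to work backwards from an existing collection and recover the topics. Purely as an illustration (not part of the original post), here is a minimal Python sketch using scikit-learn’s LatentDirichletAllocation; the three documents and the choice of three topics are made-up assumptions.

```python
# A minimal sketch of discovering topics with LDA in scikit-learn.
# The documents and the choice of 3 topics are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the fans watched the baseball game from the packed stadium",
    "the bank reported rising profits and strong dollar reserves",
    "the gossip magazine covered the celebrity red carpet chat",
]

# LDA works on bags of words, so start from a simple word-count matrix.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda.fit_transform(counts)   # one row of topic proportions per document

# The most probable words in each topic hint at what the topic is "about".
words = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top_words = [words[i] for i in topic.argsort()[-5:][::-1]]
    print(f"Topic {k}: {top_words}")

print(doc_topics.round(2))   # the mixture of topics that "generated" each document
```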

Collapsed Gibbs Sampling

We know that LDA assumes documents are bags of words, composed in proportion from the topics that generated those words. Collapsed Gibbs Sampling works backwards from the documents to figure out, firstly, which words belong to each topic and, secondly, the topic proportions that make up each document. Below is an attempt to describe this method in simple terms.

  1. Keep a copy of each document for reference.
  2. Pour all of the words from each document into a bag. The bag will then contain every word from every document; some words will appear multiple times.
  3. Decide the number of topics (K) that you will divide the documents into and have a bowl for each topic.
  4. Randomly pour the words from the bag into the topic bowls, putting an equal number in each bowl. At this point, we have a first guess at the makeup of words in each topic. It is a completely random guess, so it is not of any practical use yet; it needs to be improved. It is also a first guess at the topic makeup of each document, i.e. you can count the number of words in each document that come from each topic to figure out the proportions of topics that make up the document.

Improving on the first random guess to find the topics.

The Collapsed Gibbs Sampling algorithm works from this first random guess and, over many iterations, discovers the topics. Below is a simplified description of how this is achieved.

For each document in the document set, go through each word one by one and do the following:

  1. For each of our K topics
    1. Find the percentage of words in the document that were generated from this topic. This gives us an indication of how important the topic (as represented by our current guess of words in the bowl) is to the document, i.e. how much of the document came from the topic.
    2. Find the percentage of the topic that came from this word across all documents. This gives us an indication of how important the word is to the topic.
    3. Multiply the two percentages together; this gives an indication of how likely it is that the topic in question generated this word.
  2. Compare the answers from each topic and move the word to the bowl with the highest answer.
  3. Keep repeating this process over and over again until the words stop moving from bowl to bowl, i.e. until they have converged into K distinct topics (a toy code sketch of this procedure follows below).
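To make the loop above a little more concrete, here is a toy Python sketch (not from the original post) that follows the bowl-moving procedure literally: every word occurrence keeps a topic assignment, and on each pass it is moved to the topic that scores highest on the product of the two percentages described above. Real collapsed Gibbs sampling samples the new topic in proportion to those scores (and adds smoothing) rather than always taking the maximum, but the bookkeeping is the same.

```python
import random
from collections import Counter

# Toy corpus: each document is already just a bag of words.
docs = [
    "football baseball stadium fans football".split(),
    "dollars bank pays profit dollars".split(),
    "kitten puppy gossip magazine chat".split(),
]
K = 2  # number of topic "bowls", decided up front

# Random first guess: assign every word occurrence to a random topic.
assignments = [[random.randrange(K) for _ in doc] for doc in docs]

for _ in range(50):  # repeat; 50 passes is plenty for this toy example
    # Count words per topic and topics per document under the current guess.
    topic_words = [Counter() for _ in range(K)]
    doc_topics = [Counter() for _ in docs]
    for d, doc in enumerate(docs):
        for i, word in enumerate(doc):
            t = assignments[d][i]
            topic_words[t][word] += 1
            doc_topics[d][t] += 1

    for d, doc in enumerate(docs):
        for i, word in enumerate(doc):
            scores = []
            for t in range(K):
                # (a) how much of this document currently comes from topic t
                share_of_doc = doc_topics[d][t] / len(doc)
                # (b) how much of topic t is made up of this word
                topic_size = sum(topic_words[t].values())
                share_of_topic = topic_words[t][word] / topic_size if topic_size else 0.0
                scores.append(share_of_doc * share_of_topic)
            # Move the word to the highest-scoring bowl.
            assignments[d][i] = scores.index(max(scores))

# Inspect the words that ended up in each topic bowl.
for t in range(K):
    bowl = Counter(word for d, doc in enumerate(docs)
                   for i, word in enumerate(doc) if assignments[d][i] == t)
    print(f"Topic {t}:", bowl.most_common(5))
```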

At this point we have the words that make up each topic, so we can assign a label to each topic, e.g. if a topic contains the words dog, cat, tiger and buffalo, we might assign it the label “Animals”. Now that we have the words in each topic, we can analyse each document or “bag of words” to see what proportion of each topic it was generated from.

We now have the words which make up each topic, a label for each topic, and the topics and proportions within each document, and that’s pretty much it. There are two blog posts that we used as part of our research which you might want to take a look at: The LDA Buffet by Matthew L Jockers and An Introduction to LDA by Edwin Chen.

Keep an eye out for more in our “Text Analysis 101” series.





This is our second blog on harnessing Machine Learning (ML) in the form of Natural Language Processing (NLP) for the Automatic Classification of documents. By classifying text, we aim to assign a document or piece of text to one or more classes or categories, making it easier to manage or sort. A Document Classifier often returns or assigns a category “label” or “code” to a document or piece of text. Depending on the Classification Algorithm or strategy used, a classifier might also provide a confidence measure to indicate how confident it is that the result is correct.

In our first blog, we looked at a supervised method of Document Classification. In supervised methods, Document Categories are predefined by using a training dataset with manually tagged documents. A classifier is then trained on the manually tagged dataset so that it will be able to predict any given Document’s Category from then on.

In this blog, we will focus on Unsupervised Document Classification. Unsupervised ML techniques differ from supervised ones in that they do not require a training dataset and, in the case of documents, the categories are not known in advance. For example, let’s say we have a large number of emails that we want to analyze as part of an eDiscovery process. We may have no idea what the emails are about or what topics they deal with, and we want to automatically discover the most common topics present in the dataset. Unsupervised techniques such as Clustering can be used to automatically discover groups of similar documents within a collection of documents.

An Overview of Document Clustering

Document Clustering is a method for finding structure within a collection of documents, so that similar documents can be grouped into categories. The first step in the Clustering process is to create word vectors for the documents we wish to cluster. A vector is simply a numerical representation of the document, where each component of the vector refers to a word, and the value of that component indicates the presence or importance of that word in the document. The distance matrix between these vectors is then fed to algorithms, which group similar vectors together into clusters. A simple example will help to illustrate how documents might be transformed into vectors.

A simple example of transforming documents into vectors

Using the words within a document as the “features” that describe it, we need to find a way to represent these features numerically as a vector. As we did in our first blog in the series, we will consider three very short documents to illustrate the process.



We start by taking all of the words across the three documents in our document set and create a table or vector from these words.


Then for each of the documents, we create a vector by assigning a 1 if the word exists in the document and a 0 if it doesn’t. In the table below each row is a vector describing a single document.
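As a rough illustration of this step (the three documents below are placeholders, since the original example table is an image), the same 1/0 vectors can be built in a couple of lines with scikit-learn:

```python
# Building binary (1/0) document vectors: 1 if a word occurs in the document, 0 if not.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]

vectorizer = CountVectorizer(binary=True)
vectors = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())   # the combined vocabulary across all documents
print(vectors.toarray())                    # one row (one vector) per document
```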



Preprocessing the data

As we described in our blog on Supervised Methods of Classification, it is likely that some preprocessing of the data will be needed prior to creating the vectors. In our simple example, we have given equal importance (a value of 1) to each and every word when creating the document vectors, and no word appears more than once. To improve accuracy, we could give different weightings to words based on their importance to the document in question and their frequency within the document set as a whole. A common methodology used to do this is TF-IDF (term frequency – inverse document frequency). The TF-IDF weighting for a word increases with the number of times the word appears in the document but decreases based on how frequently the word appears across the entire document set. This has the effect of giving a lower overall weighting to words which occur frequently across the document set, such as “a”, “it”, etc.
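Sticking with the same placeholder documents, a sketch of the TF-IDF version looks almost identical; the only change is that the 1/0 values are replaced by weights:

```python
# TF-IDF weighting: words that appear in most documents get lower weights,
# words that are distinctive for a particular document get higher ones.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]

tfidf = TfidfVectorizer()
weighted = tfidf.fit_transform(docs)

print(tfidf.get_feature_names_out())
print(weighted.toarray().round(2))   # one row of weights per document
```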

Clustering Algorithms

In the graph below each “dot” is a vector which represents a document. The graph shows the output from a Clustering Algorithm, with an X marking the center of each cluster (known as a ‘centroid’). In this case the vectors only have two features (or dimensions) and so can easily be plotted on a two-dimensional graph as shown below.

K-Means Clustering Algorithm output example:


Two extreme cases to illustrate the concept of discovering the clusters

To understand how the vectors might be grouped into clusters, it helps to first look at two extreme cases. The first extreme is to assume that there is only one cluster and that all of the document vectors belong in it. This is a very simple approach which is not very useful when it comes to managing or sorting the documents effectively.

The second extreme case is to decide that each document is a cluster by itself, so that if we had N documents we would have N clusters. Again, this is a very simple solution with not much practical use.

Finding the K clusters from the N Document Vectors

Ideally, from N documents we want to find K distinct clusters that separate the documents into useful and meaningful categories. There are many Clustering Algorithms available to help us achieve this. For this blog, we will look at the K-means algorithm in more detail to illustrate the concept.

How many clusters (K)?

One simple rule of thumb for deciding the optimum number of clusters (K) to have is:

K = sqrt(N/2).

There are many more methods of finding K which you can read about here.

Finding the Clusters

Again, there are many ways we can find the clusters. To illustrate the concept, we’ll look at the steps used in one popular method, the K-means algorithm:

  1. Find the value of K using our simple rule of thumb above.
  2. Place each of the K cluster centroids at a random position within the dataset.
  3. Assign each data point to the cluster whose centroid is closest to it.
  4. Recompute the centroid location for each cluster as an average of the vector points within the cluster (this will find the new “center” of the cluster).
  5. Reassign each vector data point to the centroid closest to it, i.e. some will now switch from one cluster to another as the centroids’ positions have changed.
  6. Repeat steps 4 and 5 until none of the data points switch centroids, i.e. the clusters have “converged” (a code sketch of these steps follows below).
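As a minimal sketch of the whole pipeline (not from the original post), here is how TF-IDF document vectors could be clustered with scikit-learn’s KMeans, using the sqrt(N/2) rule of thumb for K; the six documents are made up for the example.

```python
# Clustering TF-IDF document vectors with K-means, a minimal sketch.
import math
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "stock markets fell sharply on interest rate fears",
    "the central bank raised interest rates again",
    "the home team won the cup final on penalties",
    "the star striker is injured ahead of the championship game",
    "new smartphone launch draws huge crowds",
    "chip maker unveils a faster mobile processor",
]

vectors = TfidfVectorizer(stop_words="english").fit_transform(docs)

K = max(1, round(math.sqrt(len(docs) / 2)))    # rule of thumb: K = sqrt(N/2)
kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(vectors)

for label, doc in zip(kmeans.labels_, docs):   # which cluster each document landed in
    print(label, doc)
```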

That’s pretty much it, you now have your N documents assigned to K clusters! If you have difficulty visualising the steps above, watch this excellent video tutorial by Viktor Lavrenko of the University of Edinburgh, which explains it in more depth.

Keep an eye out for more in our “Text Analysis 101” series. The next blog will look at how Topic Modelling is performed.



For anyone who doesn’t know, the WebSummit is an annual tech conference held in Dublin, Ireland. For three days in November, tech junkies, media personnel, startups, tech giants, founders and even school kids all descend on the RDS to learn from the best, secure investment or land that perfect job. It truly is a geek’s networking heaven.

Apart from WiFi issues and long queues, this year’s WebSummit was a huge success. Tipping over 22,000 attendees this year, its biggest yet, the WebSummit has gone from strength to strength since starting out as a 500 person meet up for the local technology community.

Before the WebSummit, which we attended and thoroughly enjoyed, we decided that, along with countless other analytics companies, we would put together a blog post on data gathered from Twitter over the course of the 3 or 4 days.

We tracked the official hashtags from the WebSummit (#WebSummit, #WebSummit14 and #WebSummit2014) and also gathered data on a dozen or so of the speakers, listed below (some chosen for different reasons than others):

12 Speakers

  • Paddy Cosgrave
  • Drew Houston
  • Bono
  • Mark Pincus
  • John Collison
  • Mikel Svane
  • Phil Libin
  • Eva Longoria
  • David Goldberg
  • David Karp
  • Lew Cirne
  • Peter Thiel

    Using Twitter’s Streaming API, we collected about 77,300 Tweets in total from 10am Nov 3 to 10am Nov 7 (times in GMT), monitoring the hashtags and users mentioned above.

    Once we had gathered the Tweets, we used the AYLIEN Text Analysis API from within RapidMiner to analyze the sentiment of the Tweets. Following the analysis, we visualized the data in Tableau. You can read more about RapidMiner and the AYLIEN Text Analysis API here.
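The original pipeline used RapidMiner for the analysis and Tableau for the charts; purely to show the shape of that pipeline in code, here is a rough Python sketch. The websummit_tweets.csv file, its column names and the analyze_sentiment helper are all assumptions, with the helper standing in for the real call to a sentiment analysis API.

```python
# A rough stand-in for the collect -> sentiment -> aggregate pipeline.
# "websummit_tweets.csv", its columns (created_at, text) and analyze_sentiment
# are illustrative assumptions, not the actual tools used in the post.
import pandas as pd

def analyze_sentiment(text: str) -> str:
    """Placeholder for a real sentiment analysis API call."""
    negative_hints = ("no wifi", "queue", "no signal")
    return "negative" if any(hint in text.lower() for hint in negative_hints) else "positive"

tweets = pd.read_csv("websummit_tweets.csv", parse_dates=["created_at"])
tweets["polarity"] = tweets["text"].map(analyze_sentiment)

# Tweet volume per hour split by polarity, roughly what was charted in Tableau.
volume = tweets.groupby([pd.Grouper(key="created_at", freq="1h"), "polarity"]).size()
print(volume.head(20))
```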


    While activity was quite constant over the three days, you can see three major spikes in the volume of Tweets, one for each day. The drop in volume as each evening progressed makes it pretty clear that people were enjoying themselves too much at the Night Summit to be Tweeting from the pub. There was also a pretty evident dip in activity during lunch, which suggests we all enjoyed the food or the networking opportunities the lunch break provided.

    The second graph below shows the volume of tweets that mention one of the speakers we were monitoring. You can clearly see spikes in volume when they hit the stage to speak. Tweets mentioning Paddy Cosgrave, WebSummit’s founder, stayed pretty constant throughout. Surprisingly, the most talked about speaker at this technology conference wasn’t the founder of Dropbox or even Peter Thiel; it was Eva Longoria, the star of Desperate Housewives! Bono came in second and Peter Thiel was the third most mentioned speaker. It turns out even tech geeks have a thing for celebrities.

    Geo-location and Language

    We utilized the location data returned via the Twitter API to map where most of the activity was coming from. Not surprisingly, chatter was mainly concentrated in Dublin. What was surprising is how little activity was coming from the US.

    Tweets from and about the Summit were predominantly written in English. Considering there were companies and attendees from all over the world we expected more multi-lingual data and were surprised by the lack of Tweets in other languages.


    We hoped to get a feel for people’s reactions to the event by mining and analyzing the voice of attendees through their expressions and activity on Twitter. Overall the sentiment of the event was quite positive, however there were some negative trends that crept in throughout the event. People were most vocal about the lack of a good connection and the queues for the food.

    We analyzed all the positive and negative Tweets and created a word cloud for each by extracting keywords mentioned in those tweets. This gave a pretty clear understanding of what people liked and disliked about the event.


    The Good

  • Attendees used words like “Great”, “Love” and “Amazing” to describe the conference
  • They also really enjoyed their time in “Dublin”
  • People loved the main stage
  • The event lived up to its networking promises as attendees had positive things to say about the “people” they met

    The Bad

  • “Bad” was a common word used in negative descriptions of the event, as was “critics” and surprisingly, “people”
  • The “wifi” certainly featured as a negative topic, as did the queues
  • The RDS (event holders) took a bit of a hit for not providing adequate wifi

    Some words and terms were evident in both positive and negative Tweets. The jury was out on Eva Longoria’s attendance and it’s pretty obvious the public is still undecided on what they make of Bono.

    The WiFi (The “Ugly”)

    Considering it was a tech event you would presume connectivity would be a given. That wasn’t the case. There was a strong reaction to the lack of a WiFi signal. At an event that gets 20,000+ tech heads into one room, each with a minimum of 2 devices, ensuring the ability to stay connected was always going to be a challenge.

    The initial reaction to the WiFi issues was evident in the sharp drop in polarity of Tweets. Each day it certainly had an effect on the overall sentiment of the event. However, at the close of the event the polarity had returned to where it started as people wrapped their WebSummit experience up in mostly positive Tweets. Perhaps the lack of connectivity also meant that a lot of the attendees didn’t even get the option to vent their frustrations online.

    We really enjoyed our time at the Summit, met some great people and companies and learned a lot from some of the excellent speakers. Looking forward to next year’s Summit already!




    The FIFA World Cup is without doubt the biggest sporting event in the world, with millions of fans and viewers from all around the globe using social media to share their thoughts and emotions about the games, teams and players, creating massive amounts of content in the process.

    Throughout the tournament, Facebook saw a record-breaking 3 billion interactions and Twitter saw a whopping 672 million tweets about the World Cup.

    That’s why at AYLIEN we decided to collect some of this data using Twitter’s Streaming API and analyze tweets related to the World Cup, looking for interesting insights and correlations.

    We are going to explore how you can use text analysis techniques to dig into some of this data in a series of blog posts.

    In Part 1 of the series, we’re going to get a high-level view of our data and look for some basic insights about the tournament.

    Data and Tools

    Data: datasets used in this blog post are as follows:

    • tweets.csv: Around 30 million Tweets (80 million including retweets – which are omitted) collected between June 6th and July 14th using the Twitter Streaming API, and filtered by some of the official World Cup hashtags (e.g. “#WorldCup” and “#Brazil2014”), as well as team code hashtags (e.g. “#ARG” and “#GER”) and Twitter usernames of teams and players. (Note: we’re assuming that Twitter samples the tweets in a uniform fashion and without any major side effect on their distribution)
    • matches.csv: Information about the 64 matches, such as match time and results, obtained using the World Cup json project.
    • events.csv: Information about match events such as goals, substitutions and cards, obtained using the World Cup json project.

    Tools: For these posts we will use AYLIEN Text Analysis API for Sentiment Analysis, RapidMiner for data processing and Tableau for interactive visualizations.
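Again the heavy lifting was done in RapidMiner and Tableau; just to show the shape of the exploration in code, here is a small pandas sketch that reproduces the language breakdown and the tweet-volume-over-time view described below. The column names (created_at, lang) are assumptions about how tweets.csv is laid out.

```python
# Exploring tweets.csv with pandas: language breakdown and volume over time.
# The column names (created_at, lang) are assumptions about the dataset's layout.
import pandas as pd

tweets = pd.read_csv("tweets.csv", parse_dates=["created_at"])

# Most used languages in the dataset (cf. the "Tweet languages" chart).
print(tweets["lang"].value_counts().head(10))

# Tweet volume per hour: spikes line up with matches and major events.
volume = tweets.groupby(pd.Grouper(key="created_at", freq="1h")).size()
print(volume.sort_values(ascending=False).head())
```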


    Let’s start our quest by taking a look at the matches and their events, such as goals, substitutions and red and yellow cards:

    Things to note:

    • The number of matches with 5 or more yellow cards tends to increase in the later stage games, possibly due to the higher stakes and intensity of these matches.

    Tweet languages

    Now let’s take a look at a breakdown of the most popular languages used in our tweets dataset:

    Things to note:

    • English, followed by Spanish and Portuguese, are the three most used languages in our tweets dataset.

    Tweet locations

    Next we’ll have a look at the distribution of geo-tagged tweets over different countries around the globe, along with their languages:

    Tweets and events

    Plotting the total volume of tweets over time shows a repeating pattern of spikes appearing at match times and also at times when a major event has occurred (such as elimination of a team, qualification for the next round, or shocking results). Let’s have a look at a few examples:

    1. Tweet volume by Language

    In these examples, we’re going to see how the volume of tweets in a language is affected by the matches and critical events related to teams from countries where that language is spoken (also note the trend lines in black):


    Teams: USA, England, Australia, Cameroon and Nigeria.

    Teams: Germany and Switzerland.

    Teams: France, Belgium, Algeria, Cameroon and Côte d’Ivoire.

    Teams: Spain, Argentina, Mexico, Uruguay, Chile, Costa Rica, Ecuador, Honduras and Colombia.

    Teams: Italy.

    Teams: Brazil and Portugal.

    2. Tweet volume during matches

    A similar pattern can be observed at a smaller scale during matches, with spikes appearing for each goal or major event. Let’s see an example from the Brazil – Germany match:

    3. Tweet volumes for different teams

    Finally, let’s take a look at how the volume of tweets that mention a team changes over time for the four teams that qualified for the semi-finals round (for each team we are counting mentions of the team’s full name e.g. “Germany” as well as its team code hashtag e.g. “#GER”):

    Subscribe to our blog and stay tuned for part 2, where we use Text Analytics to dig deep into the tweets’ contents.

    Got some cool use cases of text analysis? We would love to hear about them. Get in touch below.

    Update: here is the second part of the series.
