
Frequent and ongoing communication with customers and users is key to the success of any business. That’s where tools like Intercom and Zendesk excel by helping companies listen and talk to their customers in a seamless and channel-agnostic manner.

Last year we decided to move all our customer communication to Intercom, and both we and our customers have been extremely happy with the experience and the results so far. However, every now and then our support channels get abused by internet trolls (or extremely-angry-for-no-apparent-reason visitors) who have too much time on their hands and come all the way to our website to try and harass us:

[Screenshots: examples of abusive messages received via our Intercom widget]


This is not cool, and it’s exactly what our support team doesn’t need on a busy Monday morning (or at any other time)!

So like any responsible CEO, when I saw this, I decided to take action! Here was my plan:

  • Build an offensive speech detector that, given a message, determines whether or not it is offensive; and
  • Use this detector to analyze all incoming messages on Intercom, identify the offensive ones, and respond to the offenders with a random funny GIF meme.

I spent a few hours during my Christmas break building this. Here’s a glimpse of what the end result looks like:

[Screenshot: the detector replying to an offensive message with a meme]

EDIT: we deployed this detector only for the first few weeks after we posted this blog, so if you’re reading this message and decide to try it out on our Intercom, your message will reach us now. Instead of a personalized meme, your insults will be met with the hurt feelings of our sales team. 

Pretty cool, huh? For the rest of this blog I will explain how I went about building this, and how you can build your own Intercom troll police bot in 3 steps:

  • Step 1: Train an offensive speech detector model using AYLIEN’s Text Analysis Platform (TAP)
  • Step 2: Set up a callback mechanism using Intercom’s webhooks functionality and AWS Lambda to monitor and analyze incoming messages
  • Step 3: Connect all the pieces together and launch the bot

Before proceeding, please make sure you have the following requirements satisfied:

  • An active TAP account (TAP is currently in private beta, but you can get 2 months free as a beta user by signing up here)
  • An Amazon Web Services (AWS) account
  • An Intercom account

Step 1. Training an offensive speech detector model

First, we need to find a way to identify offensive messages. This can be framed as a text classification task, where we train a model that for any given message, predicts a label such as “offensive” or “not offensive” based on the contents of the message. It’s pretty similar to a spam detector that for each incoming email, tries to determine whether it’s spam or not, and classifies it accordingly.

We need to train our offensive speech detector model on labeled data that contains examples of both offensive and non-offensive messages. Luckily, there’s a great dataset available for this purpose that we can obtain from here and train our model on.

Great, so now we have the data to train the model. But how do we actually do it?

We’re going to use our newest product offering, AYLIEN Text Analysis Platform (TAP) for building this model. TAP allows users to upload their datasets as CSV files, and train custom NLP models for text classification and sentiment analysis tasks from within their browser. These models can then be exposed as APIs, and called from anywhere.

Our steps to follow are:

  • Uploading the dataset
  • Creating training and test data
  • Training the model
  • Deploying the model

Uploading the dataset

Let’s download the labeled_data.csv file from the “hate speech and offensive language” repository linked above, and once downloaded, head to the My Datasets section in TAP to create a new dataset.

Create a new dataset by clicking on Create Dataset and then click on Upload CSV:

[Screenshot: creating a new dataset in TAP]


Select and upload labeled_data.csv, and click on Convert which will take you to the Preview screen:

[Screenshot: the CSV Preview screen]


Assign the Document role to the tweet column, and the Label role to the class column, and click on Convert to Dataset in the bottom right corner, to convert the CSV file to a dataset:

[Screenshot: assigning the Document and Label roles to columns]


Please note that the original dataset uses numerical values for labels (0-2), which have the following meaning:

  • 0 – Hate speech
  • 1 – Offensive language
  • 2 – Neither

For clarity we have renamed the labels in our dataset to match the above.
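If you would rather relabel the data programmatically before uploading, a minimal sketch of the renaming step looks like this. The column names (“tweet”, “class”) follow the dataset’s CSV; the helper function is ours, purely for illustration.

```python
# Map the dataset's numeric class labels (0-2) to the readable names above.
LABELS = {"0": "Hate speech", "1": "Offensive language", "2": "Neither"}

def relabel_rows(rows):
    """Replace each row's numeric 'class' value with its readable label."""
    return [{**row, "class": LABELS[row["class"]]} for row in rows]

# Example with two in-memory rows shaped like the dataset:
sample = [{"tweet": "some message", "class": "1"},
          {"tweet": "another message", "class": "2"}]
print(relabel_rows(sample))
```

The same mapping can be applied while streaming the real labeled_data.csv through Python’s csv module before re-saving it for upload.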

Creating training and test data

In the dataset view click on the Split button to split the dataset into training and test collections:

[Screenshot: splitting the dataset into training and test collections]


Set the split ratio to 95% and hit Split Dataset to split this dataset. Once the split job is finished, click on the Train button to proceed to training.

Training the model

In the first step of the training process, click to expand the Parameters section, and click to enable “Whether or not to remove default stopwords” to ask the model to ignore common words such as “the” or “it”.

[Screenshot: the training Parameters section with stopword removal enabled]


Afterwards click on Next Step to start the training process.

Evaluating the model

Once the model is trained, we can have a quick look at the evaluation stats to see how well our model is performing on the test data:

[Screenshot: evaluation stats for the trained model]


Additionally, we can use Live Evaluation to interact with the newly trained model and get a better sense of how it works:

[Screenshot: Live Evaluation of the trained model]


Deploying the model

Now click on the Deploy tab and then click on Deploy Model:

[Screenshot: the Deploy tab]


Once the deployment process is finished, you will be provided with the details for the API:

[Screenshot: API details for the deployed model]


We have now trained an offensive speech detection model that we can access from AWS Lambda. Now let’s build the pipeline for retrieving and processing incoming Intercom messages.
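As a sanity check before wiring things up, the deployed model can be called over plain HTTP. The sketch below is a hedged illustration: the endpoint path and payload shape are assumptions, so substitute the exact URL and credentials shown on your Deploy screen.

```python
# Illustrative helper that assembles a classification request for the
# deployed TAP model. The URL pattern and JSON body here are assumptions;
# copy the real values from the Deploy screen in TAP.
import json

def build_classify_request(model_id, api_key, text):
    """Assemble URL, headers and body for a classification call."""
    return {
        "url": f"https://api.aylien.com/tap/v1/models/{model_id}/classify",
        "headers": {"Authorization": f"Bearer {api_key}",
                    "Content-Type": "application/json"},
        "body": json.dumps({"text": text}),
    }

req = build_classify_request("MODEL_ID", "API_KEY", "you are awful")
print(req["url"])
```

From there, any HTTP client (e.g. requests.post with these pieces) can exercise the model.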

Step 2. Monitoring and processing incoming Intercom messages

Now that we have built our offensive speech detection model, we need a way to run each incoming message on Intercom through the model to determine whether it’s offensive or not, and respond with a funny meme if it is.

We will use Intercom’s handy webhooks capability to achieve this. Webhooks act as web-wide callback functions, allowing one web service to notify another upon an event by emitting an HTTP request each time that event occurs.

To make this work, we need to point Intercom to a web service that it can ping upon every new message that is posted on Intercom. You can implement the web service in pretty much any programming language and host it somewhere on the internet. Given that our web service in this case is fairly minimal and light, we’re going to use AWS Lambda which makes it very easy to build, host and expose small microservices such as this one without managing any backend infrastructure.

The overall workflow is as follows:

  • User submits a message on Intercom
  • Intercom notifies our AWS Lambda web service by sending a webhook
  • Our Lambda service analyzes the incoming message using TAP, and if it’s deemed to be offensive, sends back a random funny meme to the Intercom chat (courtesy of Giphy!)
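The workflow above can be sketched as a single Lambda handler. This is a minimal Python illustration (the actual service in this post is written in Node.js); the TAP and Intercom calls are stubbed out, and the webhook payload paths are illustrative rather than exact.

```python
# Python sketch of the troll-bot Lambda handler. classify() and
# send_meme_reply() are stubs standing in for the TAP and Intercom calls;
# the payload field paths are illustrative.
import json
import random

THRESHOLD = 0.5  # minimum classifier confidence before we send a meme
MEMES = [
    "https://media.giphy.com/meme-1.gif",  # placeholder URLs
    "https://media.giphy.com/meme-2.gif",
]

def should_respond(label, confidence, threshold=THRESHOLD):
    """Reply only to confident 'Offensive language' predictions."""
    return label == "Offensive language" and confidence >= threshold

def classify(text):
    """Stub for the TAP classification call; returns (label, confidence)."""
    return "Neither", 0.0

def send_meme_reply(conversation_id, meme_url):
    """Stub for the Intercom conversation-reply call."""
    pass

def handler(event, context=None):
    payload = json.loads(event["body"])       # Intercom webhook notification
    item = payload["data"]["item"]
    text = item["conversation_message"]["body"]
    label, confidence = classify(text)
    if should_respond(label, confidence):
        send_meme_reply(item["id"], random.choice(MEMES))
    return {"statusCode": 200}
```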

Building the AWS Lambda microservice

To recap from above, we need to build a service that is accessible as an API for Intercom’s webhook to hit and notify us about new messages. Luckily, Lambda makes building services like this extremely easy.

Navigate to AWS Lambda’s dashboard and hit Create function or click here to create a new Lambda function.

[Screenshot: the AWS Lambda dashboard]


In the Create function form, choose the “Author from scratch” option to build your function from scratch:

[Screenshot: the Create function form with “Author from scratch” selected]


Next, enter a name and create a new role for the function. We’re going to use Node.js for implementing the service in this instance, but you can choose from any of the available Runtimes:

[Screenshot: function name, role and runtime settings]


Now that our function is created, we need to implement the logic for our service. Replace the contents of index.js with the following script:

Be sure to replace the four placeholders with real values:

  • You can retrieve your TAP_MODEL_ID and TAP_API_KEY from the Deploy screen in TAP
  • You can retrieve your INTERCOM_ACCESS_TOKEN by going to Authorization > Access token
  • Finally, you can retrieve your INTERCOM_ADMIN_ID either from the webhook payload (see Step 3) or by calling the List Admins endpoint in the Intercom API

Note that we must provide the two packages required by our script, “request-promise” and “striptags”, in a node_modules folder. Lambda allows us to upload a ZIP bundle that contains the scripts and their dependencies using the dropdown on the top left corner of the code editor view:

[Screenshot: uploading a ZIP bundle in the Lambda code editor]


You can download the entire ZIP bundle including the dependencies from here. Simply upload this as a ZIP bundle in Lambda, and you should have both the script and its dependencies ready to go. To create your own ZIP bundle you can create a new folder on your computer, put index.js there and install the two packages using npm, then zip the entire folder and upload it.

Finally, we need to expose this Lambda function as an API that is accessible by Intercom for sending a webhook. We can achieve this in Lambda using API Gateway. Let’s add a new trigger of type “API Gateway” to our Lambda function:

[Screenshot: adding an API Gateway trigger to the Lambda function]


You will notice that the API Gateway must be configured. Let’s click on it to view and set the configuration parameters:

[Screenshot: API Gateway configuration parameters]


Note that for simplicity we have set the Security policy to open, which means the API gateway is openly accessible by anyone who knows the URI. For a production application you will most likely want to secure this endpoint, for example by choosing Open with access key, which requires an API key for sending requests to the service.

Make sure you hit the Save button at the top right corner after each change:

[Screenshot: saving the API Gateway configuration]


Now that we have our API gateway created and configured, we need to retrieve its endpoint URI and provide it to the Intercom webhook. Copy the “Invoke URL” value from the API Gateway section in the Lambda function view:

[Screenshot: the Invoke URL in the API Gateway section]


Creating the Intercom webhook

Our Lambda microservice is created and exposed as an API. The next step is to instruct Intercom to hit the web service for every new message by sending a webhook.

In order to do this, head to the Intercom Developer Hub located here. From your dashboard navigate to Webhooks:

[Screenshot: the Webhooks section in the Intercom Developer Hub]


Then click on Create webhook in the top right corner to create a new webhook, and paste the Lambda service’s URI from the previous step into the “Webhook URL” field.

[Screenshot: creating a new webhook with the Lambda service’s URL]

The key events that we would like Intercom to notify our service upon are “New message from a user or lead” and “Reply from a user or lead”, both of which indicate a new message from a user has been posted.

Please note that for simplicity we are not encrypting the outgoing notifications in this example. In a real-world scenario you will most likely want to use Intercom’s encryption facility, since otherwise the recipient of the webhook receives all your Intercom messages unencrypted.

Step 3. Connecting the pieces and launching the bot

We have created our AWS Lambda service, and instructed Intercom to ping it every time a new message is posted by a user. Now each time a user posts a message on Intercom, a notification similar to the one below will be sent to our web service:

The Lambda service parses these notifications, invokes TAP to see if they are offensive, and if it finds a message offensive, hits the Intercom API to respond to the offender with a random funny meme.

With the webhook being active and the service being exposed and configured, we are ready to test our bot.

Head to your Intercom widget and, well, send an offensive message–and be prepared to get busted by the troll bot!

[Screenshot: the troll bot replying to an offensive message with a meme]


And that’s it. We can now sit back, relax, and enjoy roasting trolls! 🙂

Note: To stop the bot, all you need to do is disable the webhook from the Intercom developer hub dashboard, to prevent it from invoking the Lambda script.

Things to try next:

  • Adjust the minimum threshold for the confidence score (currently set to 0.5 in index.js) based on your preferences. A lower value will result in a higher number of meme responses and potentially more false positives, whereas a higher value will only trigger a meme response if the classifier is confident about a message being offensive.
  • Download a dump of your previous (non-offensive) messages from Intercom as explained here and add the cleaned up messages to the “Not offensive” label in your TAP dataset and train a new model. This should improve the accuracy of the model and enable it to distinguish offensive and non-offensive messages better.




With our News API, our goal is to make the world’s news content easier to query, just like a database. Additionally, we leverage Machine Learning to process, normalize and analyze this content, making it easier for our users to access rich, high-quality metadata and use powerful filtering capabilities that ultimately help them find the needle in the haystack more easily.

To this end, we have just launched two new handy features for filtering stories based on their image metadata and setting range queries for social media share counts. You can read more about these two features – which are now also available in our News API SDKs – below.

Image metadata filters

News content published online is increasingly multimodal, to the point that it is rare to find an article or blog post that doesn’t include an image or a video. Our News API stats show that 83% of all the articles in our index contain at least one image.

Therefore, it is important to be able to search and filter stories not just based on their textual content, but also based on their images.

To facilitate this, we now analyze each extracted image of each news article to capture its size (width and height), format and content length. Additionally, we have introduced 7 new parameters for filtering stories based on these attributes:

  • media.images.width.min: minimum image width (in pixels)
  • media.images.width.max: maximum image width (in pixels)
  • media.images.height.min: minimum image height (in pixels)
  • media.images.height.max: maximum image height (in pixels)
  • media.images.content_length.min: minimum image content size (in bytes)
  • media.images.content_length.max: maximum image content size (in bytes)
  • media.images.format[]: image format (possible values are: JPEG, PNG, GIF, SVG, ICO, TIFF, CUR, WEBP and BMP).

As an example, let’s use these parameters to retrieve stories about golf that have an image in JPEG or PNG format bigger than 80 KB in size:
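Translated into query parameters, the example might look like the sketch below. The parameter names come from the list above; the endpoint URL and auth header names are assumptions to verify against the News API documentation.

```python
# Query parameters for: golf stories with a JPEG or PNG image over 80 KB.
params = {
    "title": "golf",
    "media.images.format[]": ["JPEG", "PNG"],
    "media.images.content_length.min": 80 * 1024,  # 80 KB in bytes
}
headers = {
    "X-AYLIEN-NewsAPI-Application-ID": "YOUR_APP_ID",   # placeholder
    "X-AYLIEN-NewsAPI-Application-Key": "YOUR_APP_KEY",  # placeholder
}
url = "https://api.aylien.com/news/stories"
# e.g. requests.get(url, params=params, headers=headers)
print(params["media.images.content_length.min"])
```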


Here’s an image returned from the search query above:

[Screenshot: a sample image returned by the query]

Social range filters

One of the most popular features of our News API is its ability to sort stories by how many times they have been shared on social media. However, if you use this to retrieve popular stories over a long period of time, you will sometimes notice that a few extremely popular stories (those shared hundreds of thousands of times) sit at the top, preventing you from easily accessing the long tail of interesting and popular stories.

To address this, we have introduced the following 8 new parameters that allow you to set range (i.e. minimum and maximum) filters on social media share counts:

  • social_shares_count.facebook.min: minimum number of Facebook shares
  • social_shares_count.facebook.max: maximum number of Facebook shares
  • social_shares_count.google_plus.min: minimum number of Google+ shares
  • social_shares_count.google_plus.max: maximum number of Google+ shares
  • social_shares_count.linkedin.min: minimum number of LinkedIn shares
  • social_shares_count.linkedin.max: maximum number of LinkedIn shares
  • social_shares_count.reddit.min: minimum number of Reddit shares
  • social_shares_count.reddit.max: maximum number of Reddit shares

To retrieve all stories that mention Donald Trump, and have been shared between 50 and 500 times on Facebook, we can use the following query:
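A sketch of the corresponding query parameters follows; the endpoint URL and auth details are assumptions to check against the News API documentation.

```python
# Query parameters for: stories mentioning Donald Trump, shared between
# 50 and 500 times on Facebook.
params = {
    "text": "Donald Trump",
    "social_shares_count.facebook.min": 50,
    "social_shares_count.facebook.max": 500,
}
url = "https://api.aylien.com/news/stories"
# e.g. requests.get(url, params=params, headers=your_auth_headers)
print(sorted(params))
```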

These filters are now available across all our News API SDKs. We hope that you find these new updates useful, and we would love to hear any feedback you may have.

To start using our News API for free and query the world’s news content easily, click here.



Yesterday I was talking to a friend at a Starbucks by the River Liffey, and I explained to him how I, as a solo founder, approach taking advice from my team and advisors. I don’t think there is anything novel or special about my approach; I think it simply boils down to openness, trust and good communication.

He immediately came back to me and said: “So you’ve basically made a co-founder out of your team?” and I felt like that’s exactly what I’ve tried to do.

Having this ‘virtual’ co-founder has been a crucial step in our progress and development as a company, and in my personal development. I always encourage other founders to engage with advisors and mentors in areas where they lack expertise; done right, this can provide the level of help and support you would normally get from a second, third or fourth co-founder. As a founder, you have to make tough decisions and navigate diverse and challenging areas, so even if you have the strongest sense of intuition, you really need that second voice and that extra pair of eyes to validate your decisions, or at least to give you a better understanding of your intuition. That’s where an experienced advisor can be tremendously helpful.

In my case, in addition to having the chance to work with amazingly talented, supportive and caring people on a daily basis, I’ve been lucky enough to have two brilliant advisors – Shawn Broderick and John Breslin.

Over the past couple of years, Shawn and John have helped me with various issues, from fundraising to team building to product directions and building ties with academia.

AYLIEN advisory board update

Today I’m proud to announce the addition of two new advisors to our team: Prof. Barry Smyth of UCD and INSIGHT and Dr. James (Jimi) Shanahan of UC Berkeley, Xerox and NativeX.

Both individuals are highly accomplished, and I find it difficult to put them in a single category like “academic” or “entrepreneurial”: they are both well balanced between the academic/research world and the business world, and in addition to being distinctly successful academics, they have both started, grown and sold companies. So instead, I’m going to tell you a bit more about their backgrounds and how we plan to work together in future.

Barry Smyth


Barry is a Full Professor and Digital Chair of Computer Science at University College Dublin. To date, he has published in excess of 400 scientific articles and has contributed to dozens of patents.

In 1999, Barry co-founded ChangingWorlds, bringing advanced personalization tech to the mobile sector. ChangingWorlds grew to 120 people before being acquired by Amdocs Ltd in 2008, the same year in which Barry co-founded HeyStaks Technologies Ltd – a company focused on commercializing new social search technology.

Barry received a Ph.D. in Artificial Intelligence from Trinity College Dublin and holds a B.Sc. in Computer Science from University College Dublin.

James Shanahan


James has over 20 years’ experience developing and researching cutting-edge information management systems that harness information retrieval, linguistics, and machine learning in applications and domains such as web search and computational advertising at companies such as NativeX, Digg, AT&T, SearchMe and Turn Inc.

A frequent speaker at various academic and commercial conferences, James has published seven books and 45 refereed papers in machine learning and information systems. As you may have guessed from the image above, James is a keen kiteboarder. In fact, he represented Ireland at the Kiteboarding World Championships in both 2014 and 2015!

James received a Ph.D. in Engineering Mathematics from the University of Bristol, United Kingdom, and holds a B.Sc. in Computer Science from the University of Limerick, Ireland.

Going forward

Barry and Jimi have strong knowledge of, and ties to, academia, and together with our other brilliant academic advisor, John Breslin, we will be working on growing our partnerships with leading universities and research institutions in Ireland and abroad. This will hopefully result in wider academic collaborations between us and other organizations, and will ultimately lead to new publications, products and internship/fellowship opportunities with us for students and researchers working in the Machine Learning and Natural Language Processing space.

If you’re interested in collaborating with us in any of these areas, please feel free to get in touch with me directly:



It is a strong indicator of today’s globalized world and rapidly growing access to Internet platforms, that we have users from over 188 countries and 500 cities globally using our Text Analysis and News APIs. Our users need to be able to understand and analyze what’s being said out there, about them, their products, services, or their competitors, regardless of the locality and the language used.

Social media content on platforms like Twitter, Facebook and Instagram can provide unrivalled insights into customer opinion and experience to brands and organizations. However, as shown by the following stats, users post content in a multitude of languages on these platforms:

  • Only about 39% of tweets posted are in English;
  • Facebook recently reported that about 50% of its users speak a language other than English;
  • Native platforms such as Sina Weibo and WeChat, where most of the content is written in a native language, are on the rise;
  • 70% of active Instagram users are based outside the US.

A look at online review platforms such as Yelp and TripAdvisor, as well as various news outlets and blogs, reveals similar patterns regarding the variety of language used.

Therefore, whether you are a social media analyst, a hotel owner trying to gauge customer satisfaction, or a hedge fund analyst trying to analyze a foreign market, you need to be able to understand textual content in a multitude of languages.

The Challenge with Multilingual Text Analysis

Scaling Natural Language Processing (NLP) and Natural Language Understanding (NLU) applications – which form the basis of our Text Analysis and News APIs – to multiple human languages has traditionally proven to be difficult, mainly due to the language-dependent nature of preprocessing and feature engineering techniques employed in traditional approaches.

However, Deep Learning-based NLP methods, which have gained tremendous attention and popularity over the last couple of years, have proven to bring a great amount of invariance to NLP processes and pipelines, including invariance to the language used in a document or utterance.



At AYLIEN we have been following the rise and the evolution of Deep Learning-based NLP closely, and our research team have been leveraging Deep Learning to tackle a multitude of interesting and novel problems in Representation Learning, Sentiment Analysis, Named Entity Recognition, Entity Linking and Generative Document Models, with multiple publications to date.

Additionally, using technologies such as TensorFlow, Docker and Kubernetes, as well as software engineering best practices, our engineering team ensures this research is surfaced in our products by ensuring our proprietary models are performant and scalable, enabling us to serve millions of requests every day.

Multilingual Sentiment Analysis with AYLIEN

Today we’re excited to announce an early result of these efforts with the launch of the first version of our Deep Learning-based Sentiment Analysis models for short sentences, which are now available for English, Spanish and German.

Let’s explore a couple of examples and see these new capabilities in action:


A Spanish tweet:

“Vamos!! Se ganó, valio la pena levantarse temprano, bueno el futbol todo lo vale :D” (“Let’s go!! We won, it was worth getting up early; well, football is worth it all :D” – a positive tweet)


A German tweet:

“Lange wird es mein armes Handy nicht mehr machen 🙁 Nach 5 Jahren muss ich mein Samsung Galaxy S 2 wohl bald aufgeben” (“My poor phone won’t last much longer 🙁 After 5 years I’ll probably soon have to give up my Samsung Galaxy S 2” – a negative tweet)


Try it out for yourself on our demo, or grab a free API key and an SDK to leverage these new models in your application.
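If you would rather call the API directly over HTTP than use an SDK, a minimal sketch looks like the following. The endpoint path, header names and “tweet” mode parameter follow the Text Analysis API’s conventions as we understand them, but treat them as assumptions to verify against the official documentation.

```python
# Illustrative helper that assembles a Sentiment Analysis request.
def build_sentiment_request(app_id, app_key, text):
    """Assemble URL, headers and form data for a sentiment call."""
    return {
        "url": "https://api.aylien.com/api/v1/sentiment",
        "headers": {
            "X-AYLIEN-TextAPI-Application-ID": app_id,
            "X-AYLIEN-TextAPI-Application-Key": app_key,
        },
        "data": {"text": text, "mode": "tweet"},  # 'tweet' mode for short text
    }

req = build_sentiment_request("APP_ID", "APP_KEY",
                              "Vamos!! Se ganó, valio la pena levantarse temprano")
print(req["data"]["mode"])
```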

How it Works

Our new models leverage the power of word embeddings, transfer learning and Convolutional Neural Networks to provide a simple, yet powerful end-to-end Sentiment Analysis pipeline which is largely language agnostic.

Additionally, in contrast to more traditional machine learning models, this new model allows us to learn representations from large amounts of unlabeled data. This is particularly valuable for languages such as German where manually annotated data is scarce or expensive to generate, as it enables us to train sentiment models that leverage small amounts of annotated data in a language to great effect.


Source: Training Deep Convolutional Neural Network for Twitter Sentiment Classification by Severyn et al.


Next steps

Over the next couple of months, we will be continuing to work on improving these models as well as rolling out support for even more languages. Your feedback can be extremely helpful in shaping our roadmap, so if you have any thoughts, ideas or questions please feel free to reach out to us at

We are also excited about the new research that we’ve been doing on cross-lingual embeddings, which should make the process of multilingual Sentiment Analysis even easier.




In recent times deep learning techniques have become more and more prevalent in NLP tasks; just take a look at the list of accepted papers at this year’s NAACL conference, and you can’t miss it. We’ve now completely moved away from traditional NLP approaches to focus on deep learning and how it can be leveraged in language problems, as successfully as it has in both image and audio recognition tasks.

One of these approaches that has seen great success and is backed by a wave of research papers and funding is the concept of word embeddings.

Word embeddings

For those of you who aren’t familiar with them, word embeddings are essentially dense vector representations of words.

Similar to the way a painting might be a representation of a person, a word embedding is a representation of a word, using real-valued numbers. They are an arrangement of numbers representing the semantic and syntactic information of words and their context, in a format that computers can understand.

Here’s a nice little primer you should read if you’re looking for a more in depth description:

Word embeddings can be trained and used to derive similarities and relations between words. This means that by encoding each word as a small vector of real-valued numbers (say 100 or 200 dimensions, or even more), with one vector representing the word “mother” and another representing “father”, we can better capture the context and meaning of each word.



Word vectors created through this process manifest interesting characteristics that almost look and sound like magic at first. For instance, if we subtract the vector of Man from the vector of King, the result will be almost equal to the vector resulting from subtracting Woman from Queen. Even more surprisingly, the result of subtracting Walked from Walking almost equates to that of Swam minus Swimming. These examples show that the model has not only learnt the meaning and the semantics of these words, but also the syntax and the grammar to some degree.
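The analogy arithmetic described above can be illustrated with hand-made toy vectors (3 dimensions here purely for readability; real embeddings have hundreds):

```python
# Toy 3-dimensional "embeddings" chosen so that the gender/royalty analogy
# holds exactly; trained embeddings exhibit this only approximately.
import numpy as np

vec = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.1, 0.0]),
    "queen": np.array([0.9, 0.8, 0.9]),
    "woman": np.array([0.5, 0.1, 0.8]),
}

# king - man should be (almost) the same direction as queen - woman:
print(np.allclose(vec["king"] - vec["man"], vec["queen"] - vec["woman"]))
```

With real pre-trained vectors (e.g. from word2vec), the nearest neighbour of king - man + woman is typically queen rather than an exact match.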



Relations between words according to word embeddings


As our very own NLP Research Scientist, Sebastian Ruder, explains: “word embeddings are one of the few currently successful applications of unsupervised learning. Their main benefit arguably is that they don’t require expensive annotation, but can be derived from large unannotated corpora that are readily available. Pre-trained embeddings can then be used in downstream tasks that use small amounts of labeled data.”

Although word embeddings have almost become the de facto input layer in many NLP tasks, they do have some drawbacks. Let’s take a look at some of the challenges we face with word2vec, probably the most popular and commercialized model used today.

Word2vec Challenges

Inability to handle unknown or OOV words

Perhaps the biggest problem with word2vec is the inability to handle unknown or out-of-vocabulary (OOV) words.

If your model hasn’t encountered a word before, it will have no idea how to interpret it or how to build a vector for it. You are then forced to use a random vector, which is far from ideal. This can particularly be an issue in domains like Twitter where you have a lot of noisy and sparse data, with words that may only have been used once or twice in a very large corpus.

No shared representations at sub-word levels

There are no shared representations at sub-word levels with word2vec. For example, you and I might encounter a new word that ends in “less”, and from our knowledge of words that end similarly we can guess that it’s probably an adjective indicating a lack of something, like flawless or careless.

Word2vec represents every word as an independent vector, even though many words are morphologically similar, just like our two examples above.

This can also become a challenge in morphologically rich languages such as Arabic, German or Turkish.

Scaling to new languages requires new embedding matrices

Scaling to new languages requires new embedding matrices and does not allow for parameter sharing, meaning cross-lingual use of the same model isn’t an option.

Cannot be used to initialize state-of-the-art architectures

As explained earlier, pre-training word embeddings on weakly supervised or unsupervised data has become increasingly popular, as have various state-of-the-art architectures that take character sequences as input. If you have a model that takes character-based input, you normally can’t leverage the benefits of pre-training, which forces you to randomize embeddings.


So while the application of deep learning techniques like word embeddings and word2vec in particular have brought about great improvements and advancements in NLP, they are not without their flaws.

At AYLIEN, Representation Learning, the wider field that word embeddings fall under, is an active area of research for us. Our scientists are actively working on better embedding models and on approaches for overcoming some of the challenges mentioned above.

Stay tuned for some exciting updates over the next few weeks ;).





As you may know we recently launched a new service offering, our News API, and over the past week or so we’ve been using it to run some little experiments around analyzing news content.

We wanted to use the News API to collect and analyze popular news headlines. We set out to find both similarities and differences in the way two journalists write headlines for their respective news articles and blog posts. The two reporters we selected operate in, and write about, two very different industries/topics and have two very different writing styles:

  • Finance: Akin Oyedele of Business Insider, who covers market updates.
  • Celebrity: Carly Ledbetter of the Huffington Post, who mainly writes about celebrities.


Note: For a more technical, in-depth and interactive representation of this project, check out the Jupyter notebook we created. This includes sample code and more in-depth descriptions of our approach.

The Approach

We set out some clear steps to follow in comparing the writings of our two selected authors:

  1. Collect news headlines from both of our journalists
  2. Create parse trees from collected headlines (we explain parse trees below!)
  3. Extract information from each parse tree that is indicative of the overall headline structure
  4. Define a simple sequence similarity metric to quantitatively compare any pair of headlines
  5. Apply the same metric to all headlines collected for each author to find similarity
  6. Use K-Means and tSNE to produce a visual map of all the headlines so we can clearly see the differences between our two journalists


So what exactly are parse trees?

In linguistics, a parse tree is a rooted tree that represents the syntactic structure of a sentence, according to some pre-defined grammar. For example, with a simple sentence like “The cat sat on the mat”, a parse tree might look like this;


Thankfully, parsing our extracted headlines isn’t too difficult. We used the Pattern library for Python to parse the headlines and generate our parse trees.



In total we gathered about 700 article headlines for both journalists using the AYLIEN News API which we then analyzed using Python. If you’d like to give it a go yourself, you can grab the Pickled data files directly from the GitHub repository (link), or by using the data collection notebook we prepared for this project.

First we loaded all the headlines for Akin Oyedele, then we created parse trees for all 700 of them, and finally we stored them together with some basic information about the headline in the same Python object.
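As an illustration of step 3, here’s a minimal sketch in plain Python of extracting the chunk-type sequence that summarizes a headline’s structure. The tagged input below is a hand-made stand-in for Pattern-style chunked output, not actual Pattern API calls:

```python
# Each token is (word, part-of-speech tag, chunk label), in the style of
# a chunked parse for the headline "Viacom is crashing".
tagged_headline = [
    ("Viacom", "NNP", "NP"),
    ("is", "VBZ", "VP"),
    ("crashing", "VBG", "VP"),
]

def chunk_sequence(tokens):
    """Collapse consecutive tokens sharing a chunk label into one symbol,
    yielding the headline's structural 'shape', e.g. ['NP', 'VP']."""
    sequence = []
    for _, _, chunk in tokens:
        if not sequence or sequence[-1] != chunk:
            sequence.append(chunk)
    return sequence

print(chunk_sequence(tagged_headline))  # ['NP', 'VP']
```

These chunk-type sequences are what we compare across headlines in the next step.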

Then using a sequence similarity metric, we compared all of these headlines two by two, to build a similarity matrix.
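The notebook has the exact metric we used; as an illustrative stand-in, Python’s built-in difflib can score a pair of chunk-type sequences like so (the example sequences are hypothetical):

```python
from difflib import SequenceMatcher

def sequence_similarity(a, b):
    """Similarity score in [0, 1] between two chunk-type sequences."""
    return SequenceMatcher(None, a, b).ratio()

h1 = ["NP", "VP"]          # e.g. "Viacom is crashing"
h2 = ["NP", "VP", "ADVP"]  # a slightly longer but similar structure
h3 = ["ADVP", "VP", "NP"]  # a different structure

print(sequence_similarity(h1, h2))  # 0.8
print(sequence_similarity(h1, h3))  # 0.4
```

Computing this score for every pair of headlines yields the 700×700 similarity matrix.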



To visualize headline similarities for Akin, we generated a 2D scatter plot with the aim of grouping similarly structured headlines close together.

To achieve this, we first reduced the dimensionality of our similarity matrix using tSNE and then applied K-Means clustering to find groups of similar headlines. We also used a nice viz library, which we’ve outlined below;

  • tSNE to reduce the dimensionality of our similarity matrix from 700 down to 2
  • K-Means to identify 5 clusters of similar headlines and add some color
  • Plotted the actual chart using Bokeh
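The reduction and clustering steps can be sketched with scikit-learn. This is a minimal sketch with a small random matrix standing in for our 700×700 similarity matrix; the perplexity and cluster count here are illustrative:

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-in for the headline similarity matrix (ours was 700 x 700);
# each row describes one headline's similarity to all the others.
similarity = rng.random((50, 50))
similarity = (similarity + similarity.T) / 2  # make it symmetric

# Reduce each row to a 2D point for plotting.
coords = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(similarity)

# Group structurally similar headlines into 5 clusters (used for coloring).
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(coords)

print(coords.shape)  # (50, 2)
```

The `coords` points and `labels` cluster ids are what we fed to Bokeh for the interactive chart.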


The chart above shows a number of dense groups of headlines, as well as some sparse ones. Each dot on the graph represents a headline, as you can see when you hover over one in the interactive version. Similar titles are, as you can see, grouped together quite cleanly. Some of the groups that stand out are;

  • The circular group left of center typically consists of short, snappy stock update headlines such as “Viacom is crashing”
  • The large circular group on the top right are mostly announcement-style headlines such as “Here come the…. “ formats.
  • The small green circular group towards the bottom left contains similar headlines that use the same phrases, such as “Industrial production falls more than expected” or “ADP private payrolls rise more than expected”.


Comparing the two authors

By repeating the process for our second journalist, Carly Ledbetter, we were then able to compare both authors and see how many common patterns exist between the two in terms of how they write their headlines.

We observed that roughly 50% (347/700) of the headlines had a similar structure.

Here we can see the same dense and sparse patterns, as well as groups of points that are somewhat unique to each author, or shared by both authors. The yellow dots represent our Celebrity focused author and the blue our finance guy.

  • The bottom right cluster is almost exclusive to the first author, as it covers the short financial/stock report headlines such as “Here comes CPI”, but it also covers some of the headlines from the second author, such as “There’s Another Leonardo DiCaprio Doppelgänger”. The same could be said about the top middle cluster.
  • The top right cluster mostly contains single-verb headlines about celebrities doing things, such as “Kylie Jenner Graces Coachella With Her Peachy Presence” or “Kate Hudson Celebrated Her Birthday With A Few Shirtless Men” but it also includes market report headlines from the first author such as “Oil rig count plunges for 7th straight week”.


Conclusion and future work

In this project we’ve shown how you can retrieve and analyze news headlines, evaluate their structure and similarity, and visualize the results on an interactive map.

While we were quite happy with the results and found them quite interesting, there were some areas we thought could be improved. Some of the weaknesses of our approach, and ways to improve them, are:

– Using entire parse trees instead of just the chunk types

– Using a tree or graph similarity metric instead of a sequence similarity one (ideally a linguistic-aware one too)

– Better pre-processing to identify and normalize Named Entities, etc.

Next up..

In our next post, we’re going to study the correlations between various headline structures and some external metrics like number of Shares and Likes on Social Media platforms, and see if we can uncover any interesting patterns. We can hazard a guess already that the short, snappy Celebrity style headlines would probably get the most shares and reach on social media, but there’s only one way to find out.

If you’d like to access the data used or want to see the sample code we used head over to our Jupyter notebook.




Semantic Labeling is a very popular feature with our Text API users, so we’ve decided to roll it out as a fully functional Text Analysis Add-on feature too.

For this blog we’re going to walk you through what it does and use some examples to showcase how useful it can be for classifying or categorizing text.

So what exactly is Semantic Labeling?

It’s an intelligent way of tagging or categorizing text based on labels that you suggest. It’s a training-less approach to classification, which means it doesn’t rely on a predefined taxonomy to categorize or tag textual content.

With Semantic Labeling you can provide a piece of text, specify a set of labels and the add-on will automatically assign the most appropriate label to that text. This gives add-on users greater flexibility in deciding how they want to tag and categorize text in their spreadsheets.

Our customers are using this feature for a variety of different use cases. We’ll walk you through a couple of simple ones, to show you the feature in action.
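To give a feel for the idea, here’s a toy sketch of training-less labeling in Python. It scores each candidate label against the text by simple word overlap with a hand-made vocabulary; this is purely illustrative, not our actual implementation, which uses much richer semantic similarity:

```python
def semantic_label(text, labels, related_words):
    """Toy training-less labeler: pick the label whose related vocabulary
    overlaps most with the text. `related_words` is a hand-made stand-in
    for the semantic knowledge a real system would bring."""
    words = set(text.lower().split())
    scores = {
        label: len(words & related_words.get(label, set()))
        for label in labels
    }
    return max(scores, key=scores.get), scores

# Hypothetical vocabularies for three department labels:
related = {
    "Support": {"down", "access", "account", "error", "broken"},
    "Sales": {"purchase", "buy", "pricing", "demo"},
    "Finance": {"invoice", "refund", "billing", "charge"},
}

label, scores = semantic_label(
    "Who do I get in touch with if I want to purchase your software?",
    ["Sales", "Finance", "Support"],
    related,
)
print(label)  # Sales
```

The real feature needs no such hand-made vocabularies, which is exactly what makes it training-less for the user.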

Text Classification from a URL

Say, for example, I run a sports blog and I want to automatically curate and categorize lots of articles/URLs into the predefined categories I cover on my blog, listing them in a spreadsheet.

Any of the features in the add-on can be used to analyze a URL. Just choose the cell containing that URL in your spreadsheet and hit analyze. Using the Semantic Labeling feature is a little different because you need to also submit your candidate labels through the Text Analysis Add-on sidebar.

Once you choose Semantic Labeling, you’ll notice 5 label options will populate on the right. This is where you enter your categories or labels. In this case, we’re going to use the following URL and Labels.

Example URL:


  • Golf
  • Football
  • Soccer
  • Hockey
  • Cricket
Once you’ve selected the cell that you want to analyze and you’ve entered your labels, just hit analyze.

The add-on will then populate the results in the next 5-10 cells in that row, as in the example below.

In this case the add-on chose “Football” as the most closely related label to the article on that webpage. The add-on also displays a confidence score showing which label is “the winner”.

As you can see from the screenshot of the URL below, it did a pretty nice job of recognizing that the article had nothing to do with soccer or golf and was primarily about Football.

Article Screenshot:

Customer Query Routing

We’ve also seen our users analyze social interactions like Tweets, Facebook comments and even emails to try and intelligently understand and tag them without the need for manual reading.

So, let’s say we want to automatically determine whether a post on social media should be routed to, and dealt with by, our Sales, Support or Finance Departments.

We’ll use 2 different Tweets that could be handled by different teams within a business, and use the department titles as labels.

Labels:

  • Sales
  • Finance
  • Support

Tweets:

  • “Are you guys down? I can’t access my account?”
  • “Who do I get in touch with if I want to purchase your software?”

Again, choose the cells you want to analyze that contain your Tweets, add your candidate labels in the sidebar and hit analyze.

The add-on, as shown in the previous example, will populate its results in the next few cells, showing the most appropriate label first along with its score.

Again the add-on was pretty accurate in assigning the correct labels to each Tweet. The first Tweet was tagged as most relevant to Support, and the second one was most appropriately referred to the Sales department.

This feature allows you to analyze and categorize long and short form text based on your own labels or tags. You can submit between 2 and 5 labels to the add-on and it will return the most semantically relevant tag as well as a confidence score.



This blog is an adaptation of a talk, “Computer intelligence”, delivered by our founder Parsa Ghaffari (@parsaghaffari) and Kevin Koidl, Ph.D. (@koidl), research fellow at Trinity College Dublin and founder of Wripl.

The talk is a discussion of how computer, or artificial, intelligence works, its applications in industry and the challenges it presents. You can watch the original video here.


Artificial intelligence, or Computer intelligence [1], is hot in the tech scene right now: it’s a high priority for tech giants like Google and Facebook, journalists are writing about how it will take our jobs and our lives, and it’s even hot in Hollywood (although mostly in the technophobic fashion typical of 21st-century Hollywood).

In the industry, all of a sudden AI is everywhere and it almost looks like we’re ready to replace Marc Andreessen’s famous “software is eating the world” with “AI is eating the world”.

But what exactly are we talking about when we refer to Artificial or Computer intelligence?

AI could be defined as the science and engineering of making intelligent computers and computer programs. Since we don’t have a solid definition of intelligence that is not relative to human intelligence, we can define it as the ability to learn or understand things or to deal with new or difficult situations. We also know what computers are, they’re essentially machines that are programmed to carry out a specific task. So Computer Intelligence could be seen as a combination of these two concepts: an algorithmic approach to mimicking human intelligence.

Two branches of AI

Back in the 60s, AI got to a point where it could actually do things, and that created a new branch of AI that was more practical and more pragmatic, which, as a result, got adopted and pioneered by the industry eventually. The new branch (which we call Narrow AI in this article) had different optimization goals and success metrics compared to the original, now called, General AI.



General AI

If your goal was to predict what’s going to happen next in the room where you’re sitting, one option would be to consult with a physicist who would probably take an analytical approach and use well known equations from Thermodynamics, Electromagnetism and Newtonian Physics to predict the next state of the room.

A fundamentally different approach that doesn’t require a physicist’s involvement would be to set up as many sensors as possible (think video cameras, microphones, thermometers, etc) to capture and feed all the data from the room to a Super Computer, which then runs some form of probabilistic modelling to predict the next state.

The results you get from the second approach would be far more accurate than the ones produced by the physicist. However, with the second approach you don’t really understand why things are the way they are, and that’s what General AI is all about: understanding how certain things such as language, cognition and vision work, and how they can be replicated.

Narrow AI

Narrow AI is a more focused application of Computer Intelligence that aims to solve a specific problem and is driven by industry, economics and results. Common use cases you will certainly have heard of include Siri on your iPhone or self-driving cars for example.

While Siri can be seen as an AI application, that doesn’t mean that the intelligence behind Siri can also power a self-driving car. The AI behind both is very different, one can’t do the other.

It’s also true that with Narrow AI the intelligence works by crunching information under set conditions for economic outputs. As an example, Siri can only answer certain questions: questions she has the answer to, or can retrieve the answer to by referencing a database.

Challenges of AI

Technical Challenges

As human beings, understanding visual and lingual information comes to us naturally: we read a piece of text and we can extract meaning, intent, feelings and information; we look at a picture and we identify objects, colours, people and places.

However, for machines it’s not that easy. Take this sentence for instance; “I made her duck”. It’s a pretty straightforward sentence, but it has multiple meanings. There are actually 4 potential meanings for that short sentence.

  • I cooked her some duck
  • I forced her to duck
  • I made her duck (the duck belonged to her)
  • I made her duck (made her duck out of wood, for example)

When we interpret text we rely on prompts, either syntactic indicators or just context, that help us predict the meaning of a sentence, but teaching a machine to do this is a lot harder. There is a lot of ambiguity in language that makes it extremely hard for machines to understand text, or language in general.




The same can be said for an image or picture, or visual information in general. As humans we can pick up and recognise certain things in an image within a matter of seconds, we know that there are a man and a dog in a picture, we recognise colours and even brands, but it takes an intelligent machine to do the same.




Philosophical Challenges

One of the main arguments against AI’s success is that we don’t have a good understanding of human intelligence, and, therefore, are not able to fully replicate it. A convincing counter-argument, pioneered by the likes of Ray Kurzweil, is that intelligence or consciousness is an emergent property of the comparatively simpler building blocks of our brains (Neurons) and to replicate a brain or to create intelligence, all we need to do is to understand, decode and replicate these building blocks.

Ethical Challenges

Imagine you’re in a self-driving car and it’s taking you over a narrow bridge. Suddenly a person appears in front of the car (say, due to losing balance) and to avoid hitting that person the AI must take a sharp turn which will result in the car falling off the bridge. If you hit the person, they will die and if you fall off the bridge you will get killed.

One solution is for the AI to predict who’s more “valuable” and make a decision based on that. So it would factor in things like age, job status, family status and so on and boil it down to a numerical comparison between you and the other person’s “worth”. But how accurate would that be? Or would you ever buy a self-driving car that has a chance of killing you?


While some serious challenges in AI still remain open, industry and the enterprise have latched on to the benefits that AI, like Natural Language Processing, Image Recognition and Machine Learning, can bring to a variety of problems and applications.

One thing can be said for certain, and it’s that AI has left the science and research labs and is powering developments in health, business and media. Industry has recognised the potential of Narrow AI and how it can change, enhance and optimize the way we approach problems and tasks as human beings.

[1] The border between AI and human intelligence is getting blurred, therefore eventually we might get to a point where intelligent behaviour manifested by a machine can no longer be labeled as “artificial”. In that case, Computer Intelligence would be better suited. That said, we use the terms Computer Intelligence and Artificial Intelligence interchangeably in this article.



I have made this letter longer than usual, because I lack the time to make it short — Blaise Pascal

We live in the age of “TL;DR”s and 140-character texts: bite-sized content that is easy to consume and quick to digest. We’re so used to skimming through feeds of TL;DRs for acquiring information and knowledge about our friends and surroundings that we barely sit through reading a whole article unless we find it extremely interesting.

It’s not necessarily a “bad” thing though – we are getting an option to exchange breadth for depth, which gives us more control over how we acquire new information with a higher overall efficiency.

This is an option we previously did not have, as most content was produced in long form and often without considering readers’ time constraints. But in the age of the Internet, textual content must compete with other types of media, such as images and videos, that are inherently easier to consume.

Vision: The Brevity Knob

In an ideal world, every piece of content should come with a knob attached to it that lets you adjust its length and depth by just turning the knob in either direction, towards brevity or verbosity:

  • If it’s a movie, you would start with a trailer and based on how interesting you find it, you could turn the knob to watch the whole movie, or a 60 or 30-minute version of it.
  • For a Wikipedia article, you would start with the gist, and then gradually turn the knob to learn more and gain deeper knowledge about the subject.
  • When reading news, you would read one or two sentences that describe the event in short and if needed, you’d turn the knob to add a couple more paragraphs and some context to the story.

This is our simplistic vision for how summarization technology should work.

Text Summarization

At AYLIEN we’ve been working on a Text Summarization technology that works just like the knob we described above: you give it some text, a news article perhaps, specify the target length of your summary, and our Summarization API automatically summarizes your text for you. Using it you can turn an article like this:

Screen Shot 2017-01-25 at 14.27.59

Into a handful of key sentences:

  1. Designed to promote a healthier balance between our real lives and those lived through the small screens of our digital devices, Moment tracks how much you use your phone each day, helps you create daily limits on that usage, and offers “occasional nudges” when you’re approaching those limits.
  2. The app’s creator, Kevin Holesh, says he built Moment for himself after realizing how much his digital addictions were affecting his real-world relationships.
  3. “My main goal with Moment was to make me aware of how many minutes I’m burning on my phone each day, and it’s helped my testers do that, too.”
  4. The overall goal with Moment is not about getting you to “put down your phone forever and go live in the woods,” Holesh notes on the app’s website.
  5. There’s also a bonus function in the app related to whether or not we’re putting our phone down in favor of going out on the town, so to speak – Moment can also optionally track where you’ve been throughout the day.

See a Live Demo

A New Version

Today we’re happy to announce a new version of our Summarization API that has numerous advantages over the previous versions and gives you more control over the length of the generated summary.

Two new parameters sentences_number and sentences_percentage allow you to control the length of your summary. So to get a summary that is 10% of the original text in length, you would make the following request:

curl --get --include "" -H "X-Mashape-Key: YOUR_MASHAPE_KEY"
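The endpoint URL was lost from the snippet above, so here’s a sketch of how the same request might be composed in Python. The ENDPOINT value is a placeholder, not the documented URL, and ARTICLE_URL stands in for the page you want summarized:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder: substitute the real Summarization API endpoint from our docs.
ENDPOINT = "https://example.com/summarize"

params = {
    "url": "ARTICLE_URL",        # the article you want summarized
    "sentences_percentage": 10,  # or use sentences_number for a fixed count
}

request = Request(
    ENDPOINT + "?" + urlencode(params),
    headers={"X-Mashape-Key": "YOUR_MASHAPE_KEY"},
)
print(request.full_url)
```

Passing the request to `urllib.request.urlopen` (with your real key and endpoint) would return the summary.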

We hope you find this new technology useful. Please check it out on our website and let us know if you have any questions or feedback.

Happy TL;DRing!



Making API requests one by one can be inefficient when you have a large number of documents you wish to analyze. We’ve added a batch processing feature that makes it easier to process a large number of documents all at once using the Text Analysis API.

Steps to use this feature are as follows:

Step 1. Package all your documents in one file

Start by putting all your documents (or URLs) in one big text file – one document/URL per line. Example:

Don't panic.
Time is an illusion. Lunchtime doubly so.
For a moment, nothing happened. Then, after a second or so, nothing continued to happen.

Step 2. Make a batch request and obtain job identifier

Calling the /batch endpoint creates a new analysis job that will be processed eventually. There are a couple of parameters that you need to provide to /batch:

  • data: Data to be analyzed †
  • endpoints: Comma-separated list of Text Analysis API endpoints (possible values: classify, concepts, entities, extract, language, sentiment, summarize, hashtags)
  • entities_type: Type of entries in your file, whether they are URLs or texts (possible values: text, url)
  • output_format: The format you wish to download the batch results in (possible values: json, xml; default: json)

† Maximum file size is 5MB

All other parameters sent to /batch will be passed down to the endpoints you’ve specified in endpoints in an as-is manner. For example:

curl -v -H "X-Mashape-Authorization: YOUR_MASHAPE_KEY"
    -F data=@"/home/amir/42"
    -F "endpoints=sentiment"
    -F "entities_type=text"
    -F "output_format=xml"
    -F "mode=tweet"

This will upload the contents of the file /home/amir/42, and indicates that each line is a text (not a URL), that the desired operation is sentiment analysis, and that you wish to download the results in XML format.

A successful request will result in a 201 Created, with a Location header that indicates the URI you can poll to get the status of your submitted job. For your convenience, the URI is also included in the body of the response.

Step 3. Poll the job status information until it is finished

You can call the URI obtained in the last step to see the status of your job. Your job can be in one of these states: pending, in-progress, failed, or completed. If your job is completed you’ll receive a 303 See Other with a Location header indicating where you can download your results. It’s also in the body of your response. Example:

curl -H "X-Mashape-Authorization: YOUR_MASHAPE_KEY"
    -H "Accept: text/xml"

Sample response (XML):


And sample JSON response:

{
    "status": "completed",
    "location": ""
}

Step 4. Download your results

The location value obtained in the last step is a pre-signed S3 Object URL which you can easily download using curl or wget. Please note that results will be kept for only 7 days after the job is finished and will be deleted afterwards. If you fail to obtain the results during this period, you must re-submit your job.
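Putting steps 3 and 4 together, a polling loop might look like this sketch. The status-fetching function is injected, so the loop works with any wrapper around an HTTP GET of your job URI (the stub responses below are placeholders):

```python
import time

def poll_until_done(get_status, interval=5, max_attempts=60):
    """Poll a batch job until it leaves the pending/in-progress states.

    `get_status` is any callable returning a dict like
    {"status": "completed", "location": "..."} -- e.g. a small wrapper
    around an HTTP GET of the job URI with your X-Mashape-Authorization
    header. Returns the final status dict.
    """
    for _ in range(max_attempts):
        job = get_status()
        if job["status"] in ("completed", "failed"):
            return job
        time.sleep(interval)
    raise TimeoutError("job did not finish in time")

# Stub standing in for the real HTTP call, for illustration:
responses = iter([
    {"status": "pending"},
    {"status": "in-progress"},
    {"status": "completed", "location": "https://example.com/results"},
])
result = poll_until_done(lambda: next(responses), interval=0)
print(result["status"])  # completed
```

Once the status is completed, the `location` value is the pre-signed URL to download within the 7-day window.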

Happy crunching!
