## Intro

The 2016 US Presidential election was one of (if not the) most controversial in the nation’s history. With the end prize being arguably the most powerful job in the world, the two candidates were always going to find themselves coming under intense media scrutiny. With more media outlets covering this election than any that have come before it, an increase in media attention and influence was a given.

But how much of an influence does the media really have on an election? Does journalistic bias sway voter opinion, or does voter opinion (such as poll results) generate journalistic bias? Does the old adage “all publicity is good publicity” ring true at election time?

“My sense is that what we have here is a feedback loop. Does media attention increase a candidate’s standing in the polls? Yes. Does a candidate’s standing in the polls increase media attention? Also yes.” -Jonathan Stray @jonathanstray

Thanks to an ever-increasing volume of media content flooding the web, paired with advances in natural language processing and text analysis capabilities, we are in a position to delve deeper into these questions than ever before, and by analyzing the final sixty days of the 2016 US Presidential election, that’s exactly what we set out to do.

## So, where did we start?

We started by building a very simple search using our News API to scan thousands of monitored news sources for articles related to the election. These articles, 170,000 in total, were then indexed automatically using our text analysis capabilities in the News API.

This meant that key data points in those articles were identified and indexed to be used for further analysis:

• Keywords
• Entities
• Concepts
• Topics

With each of the articles or stories sourced comes granular metadata such as publication time, publication source, source location, journalist name and sentiment polarity of each article. Combined, these data points provided us with an opportunity to uncover and analyze trends in news stories relating to the two presidential candidates.

We started with a simple count of how many times each candidate was mentioned from our news sources in the sixty days leading up to election day, as well as the keywords that were mentioned most.

## Keywords

By extracting keywords from the news stories we sourced, we get a picture of the key players, topics, organizations and locations that were mentioned most. We generated the interactive chart below using the following steps;

1. We called the News API using the query below.
2. We called it again, but this time searched for “Trump NOT Clinton”.
3. Mentions of the two candidates naturally dominated in both sets of results so we removed them in order to get a better understanding of the keywords that were being used in articles written about them. We also removed some very obvious and/or repetitive words such as USA, America, White House, candidate, day, etc.
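
As a rough sketch, the de-noising in step 3 might look something like this. The keyword counts and stop list here are illustrative, not actual News API output:

```javascript
// Toy keyword counts, as might be aggregated from extracted article keywords.
// The values are illustrative, not real News API output.
const keywordCounts = {
  trump: 9200, clinton: 8100, fbi: 3100, emails: 2900,
  campaign: 2500, america: 1800, debate: 1400
};

// Candidate names and generic terms we want to drop from the chart.
const stopList = new Set(['trump', 'clinton', 'america']);

// Remove stop-listed keywords and sort the rest by frequency.
const topKeywords = Object.entries(keywordCounts)
  .filter(([word]) => !stopList.has(word))
  .sort((a, b) => b[1] - a[1]);
```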

Here’s the query;

#### Most mentioned keywords in articles about Hillary Clinton

Straight away, bang in the middle of these keywords, we can see FBI and right beside it, emails.

#### Most mentioned keywords in articles about Donald Trump

Similar to Clinton, Trump’s main controversies appear most prominently in his keywords, with terms like women, video, sexual and assault all featuring strongly.

## Most media mentions

If this election was decided by the number of times a candidate was mentioned in the media, who would win? We used the following search queries to total the number of mentions from all sources over the sixty days immediately prior to election day;

Note: We could also have performed this search with a single query, but we wanted to separate the candidates for further analysis, and in doing this, we removed overlapping stories with titles that mentioned both candidates.
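
The overlap removal can be sketched as follows, assuming each story object carries a title field. The stories here are made up for illustration:

```javascript
// Hypothetical story titles; a real result set would come from the News API.
const stories = [
  { title: 'Trump rallies in Florida' },
  { title: 'Clinton campaigns in Ohio' },
  { title: 'Trump and Clinton clash in final debate' }
];

const mentions = (title, name) => title.toLowerCase().includes(name);

// Keep only stories whose titles mention exactly one candidate.
const trumpOnly = stories.filter(s =>
  mentions(s.title, 'trump') && !mentions(s.title, 'clinton'));
const clintonOnly = stories.filter(s =>
  mentions(s.title, 'clinton') && !mentions(s.title, 'trump'));
```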

Here’s what we found, visualized;

#### Who was mentioned more in the media? Total mentions volume:

It may come as no surprise that Trump was mentioned considerably more than Clinton during this period, but was he consistently more prominent in the news over these sixty days, or was there perhaps a major story that has skewed the overall results? By using the Time Series endpoint, we can graph the volume of stories over time.

We generated the following chart using results from the two previous queries;

#### How media mentions for both candidates fluctuated in the final 60 days

As you would expect, the volume of mentions for each candidate fluctuates throughout the sixty day period, and to answer our previous question – yes, Donald Trump was consistently more prominent in terms of media mentions throughout this period. In fact, he was mentioned more than Hillary Clinton in 55 of the 60 days.
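
Counting the days on which one candidate out-mentioned the other is a simple element-wise comparison of the two time series. A sketch with made-up daily volumes:

```javascript
// Daily mention volumes (illustrative numbers, one entry per day).
const trumpDaily   = [120, 95, 140, 80, 200];
const clintonDaily = [100, 90, 150, 60, 180];

// Count the days on which one time series exceeds the other.
const daysAhead = trumpDaily
  .filter((v, i) => v > clintonDaily[i])
  .length;
```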

Let’s now take a look at some of the peak mention periods for each candidate to see if we can uncover the reasons for the spikes in media attention;

### Donald Trump

Trump’s peak period of media attention was October 10-13, as indicated by the highest red peak in the graph above. This period represented the four highest individual days of mention volume and can be attributed to the scandal that arose from sexual assault accusations and a leaked tape showing Trump making controversial comments about groping women.

The second highest peak, October 17-20, coincides with a more positive period for Trump, as a combination of a strong final presidential debate and a growing email scandal surrounding Hillary Clinton increased his media spotlight.

### Hillary Clinton

Excluding the sharp rise in mentions just before election day, Hillary’s highest volume days in terms of media mentions occurred from October 27-30 as news of the re-emergence of an FBI investigation surfaced.

So we’ve established the dates over the sixty days when each candidate was at their peak of media attention. Now we want to try to establish the sentiment polarity of the stories that were written about each candidate throughout this period. In other words, we want to know whether stories were written in a positive, negative or neutral way. To achieve this, we performed Sentiment Analysis.

## Sentiment analysis

Sentiment Analysis is used to detect positive or negative polarity in text. Also known as opinion mining, sentiment analysis is a feature of text analysis and natural language processing (NLP) research that is increasingly growing in popularity as a multitude of use-cases emerge. Put simply, we perform Sentiment Analysis to uncover whether a piece of text is written in a positive, negative or neutral manner.

Note: The vast majority of news articles about the election will undoubtedly contain mentions of both Trump and Clinton. We therefore decided to only count stories with titles that mentioned just one candidate. We believe this significantly increases the likelihood that the article was written about that candidate. To achieve this, we generated search queries that included one candidate while excluding the other. The News API supports boolean operators, making such search queries possible.
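
A helper for building these exclusive queries might look like the sketch below. The exact boolean syntax is illustrative only; the News API documentation defines the real query grammar:

```javascript
// Build exclusive title queries for each candidate. The query syntax here
// is a hypothetical illustration, not the News API's actual grammar.
function exclusiveTitleQuery(include, exclude) {
  return `title:"${include}" NOT title:"${exclude}"`;
}

const trumpQuery = exclusiveTitleQuery('Trump', 'Clinton');
const clintonQuery = exclusiveTitleQuery('Clinton', 'Trump');
```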

First of all, we wanted to compare the overall sentiment of all stories with titles that mentioned just one candidate. Here are the two queries we used;

And here are the visualized results;

What am I seeing here? Blue represents articles written in a neutral manner, red in a negative manner and green in a positive manner. Again, you can hover over the graph to view more information.

#### What was the overall media sentiment towards Donald Trump?

Those of you that followed the election, to any degree, will probably not be surprised by these results. We don’t really need data to back up the claim that Trump ran the more controversial campaign and therefore generated more negative press.

Again, similar to how we previously graphed mention volumes over time, we also wanted to see how sentiment in the media fluctuated throughout this sixty day period. First we’ll look at Clinton’s mention volume and see if there is any correlation between mention volume and sentiment levels.

## Hillary Clinton

How to read this graph: The top half (blue) represents fluctuations in the number of daily media mentions (‘000’s) for Hillary Clinton. The bottom half represents fluctuations in the average sentiment polarity of the stories in which she was mentioned. Green = positive and red = negative.

You can hover your cursor over the data points to view more in-depth information.

#### Mentions Volume (top) vs. Sentiment (bottom) for Hillary Clinton

From looking at this graph, one thing becomes immediately clear; as volume increases, polarity decreases, and vice versa. What does this tell us? It tells us that perhaps Hillary was in the news for the wrong reasons too often – there were very few occasions when both volume and polarity increased simultaneously.

Hillary’s average sentiment remained positive for the majority of this period. However, that sharp dip into the red circa October 30 came just a week before election day. We must also point out the black line that cuts through the bottom half of the graph. This is a trend line representing average sentiment polarity and as you can see, it gets consistently closer to negative as election day approaches.
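
The trend line is just a least-squares fit over the daily polarity averages. A minimal sketch with illustrative values:

```javascript
// Daily average sentiment polarity (illustrative values, range -1..1).
const dailyPolarity = [0.30, 0.25, 0.10, 0.05, -0.05, -0.15];

// Least-squares slope of polarity over time; a negative slope means
// sentiment trends downward as election day approaches.
function slope(ys) {
  const n = ys.length;
  const xs = ys.map((_, i) => i);
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;
  let num = 0, den = 0;
  for (let i = 0; i < n; i++) {
    num += (xs[i] - meanX) * (ys[i] - meanY);
    den += (xs[i] - meanX) ** 2;
  }
  return num / den;
}

const trend = slope(dailyPolarity);
```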

#### Mentions Volume (top) vs. Sentiment (bottom) for Donald Trump

Trump’s graph paints a different picture altogether. There was not a single day when his average polarity entered into the positive (green). What’s interesting to note here, however, is how little his mention volumes affected his average polarity. While there are peaks and troughs, there were no major swings in either direction, particularly in comparison to those seen on Hillary’s graph.

These results are of course open to interpretation, but what is becoming evident is that perhaps negative stories in the media did more damage to Clinton’s campaign than they did to Trump’s. While Clinton’s average sentiment polarity remained consistently more positive, Trump’s didn’t appear to be as badly affected when controversial stories emerged. He was consistently controversial!

Trump’s lowest point, in terms of negative press, came just after the first presidential debate at the end of September. What came after this point is the crucial detail, however. Trump’s average polarity recovered and mostly improved for the remainder of the campaign. Perhaps critically, we see his highest and most positive averages of this period in the final three weeks leading up to election day.

## Sentiment from sources

At the beginning of this post we mentioned the term media bias and questioned its effect on voter opinion. While we may not be able to prove this effect, we can certainly uncover any traces of bias from media content.

What we would like to uncover is whether certain sources (ie publications) write more or less favorably about either candidate.

To test this, we’ve analyzed the sentiment of articles written about both candidates from two publications: USA Today and Fox News.

### USA Today

Query:

Similar to the overall sentiment (from all sources) displayed previously, the sentiment polarity of articles from USA Today shows consistently higher levels of negative sentiment towards Donald Trump. The larger-than-average percentage of neutral results indicates that USA Today took a more objective approach in its coverage of the election.

### Fox News

Again, Trump dominates in relation to negative sentiment from Fox News. However, what’s interesting to note here is that Fox produced more than double the percentage of negative story titles about Hillary Clinton that USA Today did. We also found that, percentage-wise, they produced half as many positive stories about her. Also, 3.9% of Fox’s Trump coverage was positive, versus USA Today’s 2.5%.

### Media bias?

These figures raise the question: how can two major news publications write about the exact same news with such varied levels of sentiment? It certainly highlights the potential influence that the media can have on voter opinion, especially when you consider how many people see each article or headline. The figures below represent social shares for a single news article;

Bear in mind, these figures don’t represent the number of people who saw the article; they represent the number of people who shared it. The actual number of people who saw it in their social feeds will be a high multiple of these figures. In fact, we grabbed the average daily social shares per story and graphed them to compare;

#### Average social shares per story

Pretty even, and despite Trump being mentioned over twice as many times as Clinton during this sixty day period, he certainly didn’t outperform her when it came to social shares.

## Conclusion

Since the 2016 US election was decided there has been a sharp focus on the role played by news and media outlets in influencing public opinion. While we’re not here to join the debate, we are here to show you how you can deep-dive into news content at scale to uncover some fascinating and useful insights that can help you source highly targeted and precise content, uncover trends and assist in decision making.

## Introduction

It’s certainly an exciting time to be involved in Natural Language Processing (NLP), not only for those of us involved in the development and cutting-edge research that is powering its growth, but also for the multitude of organizations and innovators out there who are finding more and more ways to take advantage of it to gain a competitive edge within their respective industries.

With the global NLP market expected to grow to a value of $16 billion by 2021, it’s no surprise to see the tech giants of the world investing heavily and competing for a piece of the pie. More than 30 private companies working to advance artificial intelligence technologies have been acquired in the last 5 years by corporate giants competing in the space, including Google, Yahoo, Intel, Apple and Salesforce. [1] It’s not all about the big boys, however, as NLP, text analysis and text mining technologies are becoming more and more accessible to smaller organizations, innovative startups and even hobbyist programmers.

NLP is helping organizations make sense of vast amounts of unstructured data, at scale, giving them a level of insight and analysis that they could only have dreamed about even just a couple of years ago. Today we’re going to take a look at three industries on the cusp of disruption through the adoption of AI and NLP technologies:

1. The legal industry
2. The insurance industry
3. Customer service

## NLP & Text Analysis in the Legal industry

While we’re still a long way from robot lawyers, the current crop of legal professionals is already taking advantage of NLP, text mining and text analysis techniques to help them make better-informed decisions, in less time, by discovering key insights that can often be buried in large volumes of data, or that may seem irrelevant until analyzed at scale, uncovering strategy-boosting and often case-changing trends. Let’s take a look at three examples of how legal professionals are leveraging NLP and text analysis technologies to their advantage:

• Information retrieval in ediscovery
• Contract management
• Article summarization

### Information retrieval in ediscovery

Ediscovery refers to discovery in legal proceedings such as litigation, government investigations, or Freedom of Information Act requests, where the information sought is in electronic format. Electronic documents are often accompanied by metadata that is not found on paper documents, such as the date and time the document was written or shared. This level of detail can be crucial in legal proceedings. As far as NLP is concerned, ediscovery is mainly about information retrieval: aiding legal teams in their search for relevant and useful documents. In many cases, the amount of data requiring analysis can exceed 100GB, when often only 5% to 10% of it is actually relevant. With outside service bureaus charging $1,000 per GB to filter and reduce this volume, you can start to see how costs can quickly soar.

Data can be filtered and separated by extracting mentions of specific entities (people, places, currency amounts, etc), including/excluding specific timeframes and in the case of email threads, only include mails that contain mentions of the company, person or defendant in question.
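
A minimal sketch of this kind of filtering, assuming entity mentions and timestamps have already been extracted as metadata (the email records below are hypothetical):

```javascript
// Hypothetical pre-extracted metadata for a handful of emails.
const emails = [
  { sent: '2016-03-01', entities: ['Acme Corp', 'John Doe'] },
  { sent: '2015-11-20', entities: ['Acme Corp'] },
  { sent: '2016-06-15', entities: ['Jane Roe'] }
];

// Keep only mails inside the discovery window that mention the party
// in question. ISO date strings compare correctly as plain strings.
function relevant(mails, from, to, party) {
  return mails.filter(m =>
    m.sent >= from && m.sent <= to && m.entities.includes(party));
}

const hits = relevant(emails, '2016-01-01', '2016-12-31', 'Acme Corp');
```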

### Contract management

NLP enables contract management departments to extract key information, such as currency amounts and dates, to generate reports that summarize terms across contracts, allowing for comparisons among terms for risk assessment purposes, budgeting and planning.
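
As a toy illustration of that kind of extraction, naive regular expressions can pull currency amounts and dates out of a clause; a production pipeline would rely on trained entity recognition rather than hand-written patterns:

```javascript
// A toy contract clause; a real pipeline would run over full documents.
const clause =
  'Licensee shall pay $25,000.00 to Licensor no later than 2017-01-31.';

// Naive patterns for dollar amounts and ISO dates, for illustration only.
const amounts = clause.match(/\$[\d,]+(?:\.\d{2})?/g) || [];
const dates = clause.match(/\d{4}-\d{2}-\d{2}/g) || [];
```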

In cases relating to Intellectual Property disputes, attorneys are using NLP and text mining techniques to extract key information from sources such as patents and public court records to help give them an edge with their case.

### Article summarization

Legal documents can be notoriously long and tedious to read through in their entirety. Sometimes all that is required is a concise summary of the overall text to help gain an understanding of its content. Summarization of such documents is possible with NLP, where a defined number of sentences are selected from the main body of text to create, for example, a summary of the top 5 sentences that best reflect the content of the document as a whole.
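
A crude stand-in for this kind of extractive summarization scores each sentence by the frequency of the words it contains and keeps the top N:

```javascript
// Score sentences by the corpus frequency of their words and keep the
// top n; a toy sketch of extractive summarization, not a real summarizer.
function summarize(text, n) {
  const sentences = (text.match(/[^.!?]+[.!?]/g) || []).map(s => s.trim());
  const freq = {};
  for (const w of text.toLowerCase().match(/[a-z]+/g) || []) {
    freq[w] = (freq[w] || 0) + 1;
  }
  const score = s =>
    (s.toLowerCase().match(/[a-z]+/g) || [])
      .reduce((sum, w) => sum + freq[w], 0);
  return sentences
    .map(s => [s, score(s)])
    .sort((a, b) => b[1] - a[1])
    .slice(0, n)
    .map(([s]) => s);
}

const summary = summarize(
  'The contract covers payment. Payment terms matter. Birds fly.', 1);
```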

## NLP & Text Analysis in the Insurance industry

Insurance providers gather massive amounts of data each day from a variety of channels, such as their website, live chat, email, social networks, agents and customer care reps. Not only is this data coming in from multiple channels, it also relates to a wide variety of issues, such as claims, complaints, policies, health reports, incident reports, customer and potential customer interactions on social media, email, live chat, phone… the list goes on and on.
One of the biggest issues plaguing the insurance industry is fraud. Let’s take a look at how NLP, data mining and text analysis techniques can help insurance providers tackle these key challenges:

• Streamline the flow of data to the correct departments/agents
• Improve agent decision making by putting timely and accurate data in front of them
• Improve SLA response times and overall customer experience
• Assist in the detection of fraudulent claims and activity

### Streamlining the flow of data

That barrage of data and information that insurance companies are being hit by each and every day needs to be intricately managed, stored, analyzed and acted upon in a timely manner. A missed email or note may not only result in poor service and an upset customer, it could potentially cost the company financially if, for example, relevant evidence in a dispute or claim case fails to surface or reach the right person/department on time.

Natural Language Processing is helping insurance providers ensure the right data reaches the right set of eyeballs at the right time through automated grouping and routing of queries and documents. This goes beyond simple keyword-matching: text analysis techniques are used to ‘understand’ the context and category of a piece of text and classify it accordingly.

### Fraud detection

According to a recent report by Insurance Europe, detected and undetected fraudulent claims are estimated to represent 10% of all claims expenditure in Europe. Of note here, of course, is the fraud that goes undetected.

Insurance companies are using NLP and text analysis techniques to mine the data contained within unstructured sources such as applications, claims forms and adjuster notes to unearth certain red flags in submitted claims. For example, a regular indicator of organized fraudulent activity is the appearance of common phrases or descriptions of incidents from multiple claimants. The trained human eye may or may not be able to spot such instances, but regardless, it would be a time-consuming exercise, likely prone to subjectivity and inconsistency on the part of the handler.
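
One simple way to surface suspiciously similar wording is to compare the word sets of claim descriptions. The Jaccard similarity sketch below uses made-up claims:

```javascript
// Jaccard similarity between the word sets of two claim descriptions;
// a high score across unrelated claimants can flag scripted wording.
function jaccard(a, b) {
  const A = new Set(a.toLowerCase().split(/\W+/).filter(Boolean));
  const B = new Set(b.toLowerCase().split(/\W+/).filter(Boolean));
  const inter = [...A].filter(w => B.has(w)).length;
  const union = new Set([...A, ...B]).size;
  return inter / union;
}

// Made-up claim descriptions for illustration.
const claimA = 'The other car swerved suddenly and hit my rear bumper';
const claimB = 'The other car swerved suddenly and hit my rear door';
const claimC = 'I slipped on a wet supermarket floor';
```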

The solution for insurance providers is to develop NLP-powered analytical dashboards that support quick decision making, highlight potentially fraudulent activity and therefore enable their investigators to prioritize cases based on specifically defined KPIs.

## NLP, Text Analysis & Customer Service

In a world that is increasingly focused on SLAs, KPIs and ROIs, the role of Customer Support and Customer Success, particularly in technology companies, has never been more important to the overall performance of an organization. With the ever-increasing number of startups and innovative companies disrupting pretty much every industry out there, customer experience has become a key differentiator in markets flooded with consumer choice.

Let’s take a look at four ways that NLP and text analysis are helping to improve CX in particular;

• Chat bots
• Analyzing customer/agent interactions
• Sentiment analysis
• Automated routing of customer queries

### Chat bots

It’s safe to say that chat bots are a pretty big deal right now! These conversational agents are beginning to pop up everywhere as companies look to take advantage of the cutting-edge AI that powers them.

Chances are that you interact with multiple artificial agents on a daily basis, perhaps even without realizing it. They make recommendations as we shop online, answer our support queries in live chats, generate personalized fitness routines and communicate with us as virtual assistants to schedule meetings.

*A recent interaction I had with a personal assistant bot, Amy*
Chat bots are helping to bring a personalized experience to users. When done right, not only can this reduce spend in an organization, as they require less input from human agents, but it can also add significant value to the customer experience, with intelligent, targeted and round-the-clock assistance at hand.

### Analyzing customer/agent interactions

Interactions between support agents and customers can uncover interesting and actionable insights and trends. Many interactions are in text format by default (email, live chat, feedback forms) while voice-to-text technology can be used to convert phone conversations to text so they can be analyzed.

### Listening to your customers

The voice of the customer is more important today than ever before. Social media channels offer a gold mine of publicly available consumer opinion just waiting to be tapped. NLP and text analysis enables you to analyze huge volumes of social chatter to help you understand how people feel about specific events, products, brands, companies, and so on.

Analyzing the sentiment towards your brand, for example, can help you decrease churn and improve customer support by uncovering and proactively working on improving negative trends. It can help show you what you are doing wrong before too much damage has been done, but also quickly show you what you are doing right and should therefore continue doing.

Customer feedback containing significantly high levels of negative sentiment can be relayed to Product and Development teams to help them focus their time and efforts more accordingly.

Because of the multi-channel nature of customer support, you tend to have customer queries and requests coming in from a variety of sources – email, social media, feedback forms, live chat. Speed of response is a key performance metric for many organizations and so routing customer queries to the relevant department, in as few steps as possible, can be crucial.

NLP is being used to automatically route and categorize customer queries, without any human interaction. As mentioned earlier, this goes beyond simple keyword-matching with text analysis techniques being used to ‘understand’ the context and category of a piece of text and classify it accordingly.

## Conclusion

As the sheer amount of unstructured data out there grows and grows, so too does the need to gather, analyze and make sense of it. Regardless of the industry in which they operate, organizations that focus on benefitting from NLP and text analysis will no doubt gain a competitive advantage as they battle for market share.

Most of our users make 3 or more calls to our API for every piece of text or URL they analyze. For example, if you’re a publisher who wants to extract insight from an article or URL, it’s likely you’ll want to use more than one of our features to get a proper understanding of that particular article or URL.

With this in mind, we decided to make it faster, easier and more efficient for our users to run multiple analysis operations in one single call to the API.

Our Combined Calls endpoint allows you to run more than one type of analysis on a piece of text or URL without having to call each endpoint separately.

• Run multiple operations at once
• Speed up your analysis process
• Write cleaner, more efficient code

### Combined Calls

To showcase how useful the Combined Calls endpoint can be, we’ve run a typical process that a lot of our news and media-focused users follow when analyzing URLs or articles on news sites.

In this case, we’re going to classify the article in question and extract any entities and concepts present in the text. Running a process like this would typically involve passing the same URL to the API 3 times, once for each analysis operation, and then retrieving 3 separate results, one per operation. With Combined Calls, however, we make just 1 call to the API and retrieve 1 set of results, making the process cleaner and more efficient for the end user.

Code Snippet:

```javascript
var AYLIENTextAPI = require('aylien_textapi');
var textapi = new AYLIENTextAPI({
  application_id: "APP_ID",
  application_key: "APP_KEY"
});

textapi.combined({
  "url": "http://www.bbc.com/news/technology-33764155",
  "endpoint": ["entities", "concepts", "classify"]
}, function(err, result) {
  if (err === null) {
    console.log(JSON.stringify(result));
  } else {
    console.log(err);
  }
});
```


The code snippet above was written using our Node.js SDK. SDKs are available for a variety of languages on our SDKs page.

### Results

We’ve broken down the results below into three sections (Entities, Concepts and Classification) to help with readability, but with the Combined Calls endpoint all of these results are returned together in a single response.
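
If you’re working with the combined response programmatically, a small helper can index the results array by endpoint name. The sample object below mirrors the abridged shape of the response:

```javascript
// A combined response bundles one result object per requested endpoint;
// this helper indexes them by endpoint name for easier access.
function byEndpoint(combined) {
  const out = {};
  for (const r of combined.results) {
    out[r.endpoint] = r.result;
  }
  return out;
}

// Abridged sample mirroring the combined response structure.
const sample = {
  results: [
    { endpoint: 'entities', result: { entities: { organization: ['BBC'] } } },
    { endpoint: 'classify', result: { categories: [{ code: '04003005' }] } }
  ]
};

const indexed = byEndpoint(sample);
```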

#### Entities:

```json
{
  "results": [
    {
      "endpoint": "entities",
      "result": {
        "entities": {
          "keyword": [
            "internet servers",
            "flaw in the internet",
            "internet users",
            "server software",
            "exploits of the flaw",
            "internet",
            "System (DNS) software",
            "servers",
            "flaw",
            "expert",
            "vulnerability",
            "systems",
            "software",
            "exploits",
            "users",
            "websites",
            "offline",
            "URLs",
            "services"
          ],
          "organization": [
            "DNS",
            "BBC"
          ],
          "person": [
            "Daniel Cid",
            "Brian Honan"
          ]
        },
        "language": "en"
      }
    },
```


#### Concepts:

```json
{
  "endpoint": "concepts",
  "result": {
    "concepts": {
      "http://dbpedia.org/resource/Apache": {
        "support": 3082,
        "surfaceForms": [
          {
            "offset": 1261,
            "score": 0.9726336488480631,
            "string": "Apache"
          }
        ],
        "types": [
          "http://dbpedia.org/ontology/EthnicGroup"
        ]
      },
      "http://dbpedia.org/resource/BBC": {
        "support": 61289,
        "surfaceForms": [
          {
            "offset": 1108,
            "score": 0.9997923194235071,
            "string": "BBC"
          }
        ],
        "types": [
          "http://dbpedia.org/ontology/Agent",
          "http://schema.org/Organization",
          "http://dbpedia.org/ontology/Organisation",
          "http://dbpedia.org/ontology/Company"
        ]
      },
      "http://dbpedia.org/resource/Denial-of-service_attack": {
        "support": 503,
        "surfaceForms": [
          {
            "offset": 264,
            "score": 0.9999442627824017,
            "string": "denial-of-service attacks"
          }
        ],
        "types": [
          ""
        ]
      },
      "http://dbpedia.org/resource/Domain_Name_System": {
        "support": 1279,
        "surfaceForms": [
          {
            "offset": 442,
            "score": 1,
            "string": "Domain Name System"
          },
          {
            "offset": 462,
            "score": 0.9984593397878601,
            "string": "DNS"
          }
        ],
        "types": [
          ""
        ]
      },
      "http://dbpedia.org/resource/Hacker_(computer_security)": {
        "support": 1436,
        "surfaceForms": [
          {
            "offset": 0,
            "score": 0.7808308562314218,
            "string": "Hackers"
          },
          {
            "offset": 246,
            "score": 0.9326746054676964,
            "string": "hackers"
          }
        ],
        "types": [
          ""
        ]
      },
      "http://dbpedia.org/resource/Indian_School_Certificate": {
        "support": 161,
        "surfaceForms": [
          {
            "offset": 794,
            "score": 0.7811847159512098,
            "string": "ISC"
          }
        ],
        "types": [
          ""
        ]
      },
      "http://dbpedia.org/resource/Internet_Systems_Consortium": {
        "support": 35,
        "surfaceForms": [
          {
            "offset": 765,
            "score": 1,
            "string": "Internet Systems Consortium"
          }
        ],
        "types": [
          "http://dbpedia.org/ontology/Agent",
          "http://schema.org/Organization",
          "http://dbpedia.org/ontology/Organisation",
          "http://dbpedia.org/ontology/Non-ProfitOrganisation"
        ]
      },
      "http://dbpedia.org/resource/OpenSSL": {
        "support": 105,
        "surfaceForms": [
          {
            "offset": 1269,
            "score": 1,
            "string": "OpenSSL"
          }
        ],
        "types": [
          "http://schema.org/CreativeWork",
          "http://dbpedia.org/ontology/Work",
          "http://dbpedia.org/ontology/Software"
        ]
      }
    },
    "language": "en"
  }
},
```


#### Classification:

```json
{
  "endpoint": "classify",
  "result": {
    "categories": [
      {
        "code": "04003005",
        "confidence": 1,
        "label": "computing and information technology - software"
      }
    ],
    "language": "en"
  }
}
```


You can find more information on using Combined Calls in our Text Analysis Documentation.

We should also point out that the existing rate limits apply when using Combined Calls. You can read more about our rate limits here.

## Introduction

This is the second edition of our NLP terms explained blog posts. The first edition dealt with some simple terms and NLP tasks, while this edition gets a little more complicated. Again, we’ve chosen some common terms at random and tried to break them down in simple English to make them a bit easier to understand.

#### Part of Speech tagging (POS tagging)

Sometimes referred to as grammatical tagging or word-category disambiguation, part of speech tagging refers to the process of determining the part of speech for each word in a given sentence based on the definition of that word and its context. Many words, especially common ones, can serve as multiple parts of speech. For example, “book” can be a noun (“the book on the table”) or verb (“to book a flight”).
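
As a toy illustration of this disambiguation, here is a tiny rule-based tagger with a hand-written lexicon and one context rule for “book”. Real taggers learn these decisions statistically from annotated corpora:

```javascript
// A tiny hand-written lexicon for unambiguous words. Purely illustrative.
const lexicon = {
  the: 'DET', a: 'DET', to: 'PART', on: 'ADP',
  table: 'NOUN', flight: 'NOUN'
};

// Tag each word; "book" is disambiguated by the word before it:
// "to book" reads as a verb, "the book" as a noun.
function tag(sentence) {
  const words = sentence.toLowerCase().split(/\s+/);
  return words.map((w, i) => {
    if (w === 'book') {
      return words[i - 1] === 'to' ? 'VERB' : 'NOUN';
    }
    return lexicon[w] || 'NOUN';
  });
}
```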

#### Parsing

Parsing is a major task in NLP, focused on determining the grammatical structure, or parse tree, of a given sentence. There are two forms of parse trees: constituency-based and dependency-based.

#### Semantic Role Labeling

This is an important step towards making sense of the meaning of a sentence. It focuses on detecting the semantic arguments associated with a verb or verbs in a sentence and classifying those arguments into specific roles.

#### Machine Translation

A sub-field of computational linguistics, MT investigates the use of software to translate text or speech from one language to another.

#### Statistical Machine Translation

SMT is one of a few different approaches to Machine Translation. A common task in NLP, it relies on statistical methods based on bilingual corpora, such as the Canadian Hansard corpus. Other approaches to Machine Translation include Rule-Based Translation and Example-Based Translation.

#### Bayesian Classification

Bayesian classification is a classification method based on Bayes’ Theorem and is commonly used in Machine Learning and Natural Language Processing to classify text and documents. You can read more about it in Naive Bayes for Dummies.
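
To make the idea concrete, here is a minimal multinomial Naive Bayes classifier with add-one (Laplace) smoothing; a sketch of the technique, not a production implementation:

```javascript
// Train a multinomial Naive Bayes model: per-class document counts,
// word counts and a shared vocabulary.
function train(examples) {
  const model = { classes: {}, vocab: new Set(), total: examples.length };
  for (const { text, label } of examples) {
    const c = model.classes[label] ||
      (model.classes[label] = { docs: 0, counts: {}, words: 0 });
    c.docs += 1;
    for (const w of text.toLowerCase().split(/\W+/).filter(Boolean)) {
      c.counts[w] = (c.counts[w] || 0) + 1;
      c.words += 1;
      model.vocab.add(w);
    }
  }
  return model;
}

// Pick the class maximizing log P(class) + sum log P(word|class),
// with add-one smoothing so unseen words don't zero out the score.
function classify(model, text) {
  const words = text.toLowerCase().split(/\W+/).filter(Boolean);
  const V = model.vocab.size;
  let best = null, bestScore = -Infinity;
  for (const [label, c] of Object.entries(model.classes)) {
    let score = Math.log(c.docs / model.total);
    for (const w of words) {
      score += Math.log(((c.counts[w] || 0) + 1) / (c.words + V));
    }
    if (score > bestScore) { bestScore = score; best = label; }
  }
  return best;
}

const model = train([
  { text: 'win cash prize now', label: 'spam' },
  { text: 'meeting agenda attached', label: 'ham' }
]);
```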

#### Hidden Markov Model (HMM)

In order to understand an HMM, we first need to define a Markov model. This is used to model randomly changing systems where it is assumed that future states depend only on the present state, and not on the sequence of events that preceded it.

An HMM is a Markov model where the system being modeled is assumed to have unobserved, or hidden, states. There are a number of common algorithms used with hidden Markov models: the Viterbi algorithm computes the most likely sequence of hidden states for a given sequence of observations, while the forward algorithm computes the probability of the observation sequence itself. Both are often used in NLP applications.

In hidden Markov models, the state is not directly visible, but output, dependent on the state, is visible. Each state has a probability distribution over the possible output tokens. Therefore the sequence of tokens generated by an HMM gives some information about the sequence of states.
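The Viterbi algorithm can be sketched in a few lines. The two-state weather model below, with walk/shop/clean observations, uses the standard textbook toy probabilities, chosen purely for illustration:

```python
# A minimal Viterbi decoder: infer hidden weather states from
# visible activities, using hand-picked illustrative probabilities.
states = ["Rainy", "Sunny"]
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

def viterbi(obs):
    """Return the most likely hidden-state sequence for the observations."""
    # V[t][s] = (probability of the best path ending in s at time t, that path)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for o in obs[1:]:
        row = {}
        for s in states:
            prob, path = max(
                (V[-1][prev][0] * trans_p[prev][s] * emit_p[s][o],
                 V[-1][prev][1] + [s])
                for prev in states)
            row[s] = (prob, path)
        V.append(row)
    return max(V[-1].values())[1]

print(viterbi(["walk", "shop", "clean"]))
```

Given the observations walk, shop, clean, the decoder recovers Sunny, Rainy, Rainy as the most likely hidden sequence.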

#### Conditional Random Fields (CRFs)

CRFs are a class of statistical modeling methods often applied in pattern recognition and machine learning, where they are used for structured prediction. An ordinary classifier predicts a label for a sample without taking neighboring samples into account; a CRF model, by contrast, takes context into account. CRFs are commonly used in NLP (e.g. in Named Entity Extraction) and, more recently, in image recognition.

#### Affinity Propagation (AP)

AP is a clustering algorithm commonly used in Data Mining. Unlike other clustering algorithms such as k-means, AP does not require the number of clusters to be estimated before running the algorithm. A semi-supervised version of AP is commonly used in NLP.

#### Relationship extraction

Given a chunk of words or a piece of text, relationship extraction is the task of determining the relationships that hold between the named entities it mentions.
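As a rough illustration only, relations of the form "PERSON, ROLE of ORG" can be picked out with a hand-written pattern. Real relationship-extraction systems use parsed text and learned models rather than a single regex; the pattern and example sentence here are invented for this sketch.

```python
import re

# A crude pattern-based relation extractor for the
# "PERSON, ROLE of ORG" apposition pattern.
PATTERN = re.compile(
    r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+), "
    r"(?P<role>founder|CEO|CTO) of (?P<org>[A-Z][A-Za-z]+)")

def extract_relations(text):
    """Return (person, role, organization) triples found in the text."""
    return [(m.group("person"), m.group("role"), m.group("org"))
            for m in PATTERN.finditer(text)]

print(extract_relations("Dharmesh Shah, founder of HubSpot, spoke first."))
```

The brittleness of such patterns (they miss any phrasing they weren't written for) is precisely why statistical approaches dominate this task.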

### What’s Blockspring?

Blockspring is a really exciting, YC-backed startup that pitches itself as "The world's library of functions, accessible from everywhere you do work." Their platform allows you to interact with a library of various APIs through a spreadsheet, simple code snippets and, soon, a chat interface.

The platform lets you run 1000+ functions directly from your spreadsheet or through simple code snippets for the more technically inclined. Accessing APIs with Blockspring is done through the concept of functions and they certainly have some cool APIs available to interact with in their library.

Where Blockspring gets really interesting, though, is when you start to combine multiple functions. Your spreadsheet pretty much becomes a playpen where you can interact with one or multiple APIs and create powerful applications and "mashups". Examples of what can be done with Blockspring include automating social activity and monitoring, gathering marketing data about user segments and usage, accessing public datasets, scraping websites and now even analyzing text and unstructured data, all of which are nicely showcased on their getting started page.

### AYLIEN and Blockspring

Like Blockspring, we want to get the power of our API into the hands of anyone who can get value from it. We launched our own Text Analysis Add-on for Google Sheets last year. The add-on works in the same way as Blockspring, through simple functions, and acts as an interface for our Text Analysis API. Integrating with Blockspring, however, means our users can now open up their use cases by combining our functions with other complementary APIs to create powerful tools and integrations.

All of the AYLIEN end-points are available through Blockspring as simple snippets or spreadsheet functions and getting started with AYLIEN and Blockspring is really easy.

#### It’s simple to get up and running:

Step 1.

Step 2.

Grab your AYLIEN APP ID and API key and keep them handy. If you don't have an AYLIEN account, just sign up here.

Step 3.

Explore the getting started section to see examples of the functions and APIs available.

Step 4.

Try some of the different functions through their interactive docs to get a feel for how they work.

Step 5.

Go wild and start building and creating mashups of functions with code snippets or in Google Sheets.

PS: Don’t forget to add your AYLIEN keys to your Blockspring account in the Secrets section of your account settings. Once they’ve been added, you won’t have to do it again.

We're really excited to see what the Blockspring community starts to build with our various functions. Over the next couple of weeks we'll also be showcasing some cool mashups that we've put together in Blockspring, so keep your eyes peeled on the blog.

### Introduction

As you may already know, we like to feature interesting and useful examples of what the AYLIEN community is building with our API. Previously, we've showcased a PowerShell wrapper and a Laravel wrapper. For this edition of our blog, we're going to feature a data scientist who spent some time building an R binding for our Text Analysis API.

Arnab is a solutions architect and data analytics expert based in India. He recently developed an R binding for our Text Analysis API that we think is going to be very popular among our R users. This binding makes it really quick and easy for the R community to get up and running with our API, and we're glad he spent the time putting it together.

### Setup

If you’re new to AYLIEN and Text Analysis, the first thing you’ll need to do is sign up for free access to the API. Take a look at our getting started page, which will take you through the signup process. We have a free plan available which allows you to make up to 1,000 calls to the API per day, for free.

The second thing you need to do is install the following packages from your R console:

install.packages("XML")
install.packages("plyr")


Point to file:

source("/Users/parsa/Desktop/aylienAPI-R.R")

Note: You must have XQuartz installed to view the results; you can download it here.

### Utilisation

To show how easy it is to use the API with R, we're going to run a simple analysis using the binding: we'll analyze the following article (http://www.bbc.com/sport/0/football/25912393), extract the main body of text from the page and classify the article based on IPTC NewsCodes.

Code Snippet:

# Note: requires the RCurl package for the HTTP request
library(RCurl)

aylienAPI <- function(APPLICATION_ID, APPLICATION_KEY, endpoint, parameters, type)
{
  url <- paste0('https://api.aylien.com/api/v1/', endpoint)
  httpHeader <- c(Accept = "text/xml",
                  'X-AYLIEN-TextAPI-Application-ID' = APPLICATION_ID,
                  'X-AYLIEN-TextAPI-Application-Key' = APPLICATION_KEY,
                  'Content-Type' = "application/x-www-form-urlencoded; charset=UTF-8")

  paramPost <- paste0(type, "=", parameters)
  paramEncode <- URLencode(paramPost)

  # POST the encoded parameters to the endpoint
  resp <- getURL(url, httpheader = httpHeader,
                 postfields = paramEncode, verbose = FALSE)

  resp
}



APPLICATION_ID = 'YOUR_APPLICATION_ID'
APPLICATION_KEY = 'YOUR_APPLICATION_KEY'


Arnab has made it really easy to call each endpoint; all you need to do is specify the endpoint in the code. To call the classification endpoint, for example, we simply use "classify".

endpoint = "classify"
parameters = "http://www.bbc.com/sport/0/football/25912393"
type = "url"


### Results

It's up to you how you want to display your results, but the following commands display them nice and clearly in table format by converting the output to a data frame, as shown in the image below.

resultsdf<-ldply(xmlToList(results), data.frame)
View(resultsdf)

As you can see from the Results, the API returned an accurate two-tier classification of “Sport – Soccer”.

You can also choose to retrieve data using Xpath from the XML result with the following request.

PARSED<-xmlInternalTreeParse(results)
View(xpathSApply(PARSED, "//category",xmlValue))

If you have an app or a wrapper you've built with our APIs, we'd love to hear about it and feature it on our blog. Get in touch at hello@aylien.com and tell us what you're building.

### Introduction

There is a wealth of information in a LinkedIn profile. You can tell a lot about someone and how well they are suited to a role by analyzing their LinkedIn profile, and let's face it, LinkedIn is the number one platform for showcasing yourself to potential employers and recruiters.

However, a number of issues arise when relying on LinkedIn profiles to understand a candidate's suitability for a role and their current job function.

### The Experiment

We set out to discover which section of a LinkedIn profile contains the most insight into an individual's job function, by using Semantic Labeling to try to predict an individual's job title based on the information on their profile.

#### How did we do it?

We scraped and parsed a number of well-known LinkedIn profiles. Using the information we extracted from each profile, such as keywords, summaries, job titles and skills, we attempted to predict an individual's job function from each information section, to understand which best represents an individual's ability or function.

We started out by choosing 4 general tags or labels for an individual’s profile that would point towards their high-level job function:

• Information Technology
• Finance
• Marketing
• Operations

By using the Semantic Labeling feature to check how closely related a tag or label, like Marketing, was to the information in each section, we could essentially predict an individual's actual job function.
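For intuition only, here is a toy stand-in for that idea: it scores each candidate label against the skills text by bag-of-words cosine similarity with a hand-written description of the label. The descriptions, the example skills string and the resulting scores are all invented for this sketch and bear no relation to how the actual Semantic Labeling endpoint works internally.

```python
import math
from collections import Counter

# Hand-written descriptions standing in for each label's "meaning".
label_descriptions = {
    "Marketing": "seo content social media branding campaigns inbound",
    "Information Technology": "software programming servers networks code",
    "Finance": "accounting investment budgeting audit capital",
    "Operations": "logistics supply chain process management planning",
}

def cosine(a, b):
    """Bag-of-words cosine similarity between two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def predict_function(skills_text):
    """Return the best-matching label plus the full score table."""
    scores = {label: cosine(skills_text, desc)
              for label, desc in label_descriptions.items()}
    return max(scores, key=scores.get), scores

skills = "seo inbound marketing social media content strategy"
print(predict_function(skills)[0])
```

Even this crude similarity measure pushes a skills list full of marketing vocabulary towards the Marketing label, which is the shape of the prediction task described above.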

Our findings are displayed in the sheet embedded below. The first section of the sheet contains the profiles and the information extracted. The yellow section shows the prediction results based on the Skills section, the red section the Summary results, and the green section the Job Title results.

When a label/job function is assigned following our analysis it is also accompanied by a confidence score, which indicates how confident we are in the results. This is important to note as we dive into some of the results. The “winning” results with the highest scores are marked in green.

Note:

For this blog, we kept the functions quite general but you can get quite specific as we have with Gregory’s account below.

### But what section of a profile provides the most insight?

#### Content

When analyzing a LinkedIn profile, or even using the search feature, we primarily focus on keywords mentioned in the content of that profile: educational institutes, companies and technologies mentioned, for example.

Relying on keywords can often cause problems: there is a huge amount of latent information in a profile that is overlooked when scanning for keywords alone. A major problem with keyword search is that it misses related skills, e.g. someone might have "Zend Framework" on their profile but not PHP, which is implied, because Zend is a PHP framework. Good recruiters, or someone with programming knowledge, would know this; average recruiters may not.

The same could be said for someone who mentions Image Processing in their profile: there is no obvious connection to other inherent knowledge, such as Facial Recognition. A knowledge base such as Wikipedia, DBpedia or Freebase can be used to discover these latent connections.

#### Job Titles

Relying on job titles can also cause problems. They can be inaccurate, misleading or even made up, especially today, as people often create their own titles. Take Rand Fishkin's LinkedIn profile as an example: unless you know of Moz and Rand's wizardry, you would have no idea he is at the forefront of inbound marketing, social and SEO.

Another good example is the profile of Dharmesh Shah, founder of HubSpot. Dharmesh's title is Founder and CTO of HubSpot. Running our analysis on the extracted title returns Information Technology, with a score of .016, as Dharmesh's job function, which is somewhat accurate. However, running the same analysis on his skills section gives a far more accurate result, suggesting Dharmesh is actually a Marketer, with a "winning" score of .23.

#### Summaries

A profile summary can be quite insightful and can provide a strong understanding of someone's ability and function. The problem is that summaries aren't always present, or they often contain so little information that they are overlooked or disregarded, as was the case in many of the example profiles we used.

The profiles that do have a detailed summary provided some strong results, with Rand Fishkin's summary returning the accurate result of Marketing, with a score of .188.

There was one section that outperformed the others when providing relevant tags and confidence scores.

#### Skills

The Skills section on a LinkedIn profile is a gold mine of insight. Based on the information extracted from the skills section, we could more accurately predict an individual’s job function.

Comparing the results and labels assigned across all the information sections and on every profile we used, the Skills section produced the most accurate relationships and the highest confidence scores, which can be seen marked green in the sheets above.

### Conclusion

We don't have an exact science or formula for deciding whether a label is accurate or not. However, our experiment still does a good job of highlighting that far more information and insight can be gleaned from the skills section of a LinkedIn profile when deciding, at first glance or automatically, how well a candidate is suited to a particular job function. We will explore these ideas in future posts.

This is the sixth in our series of blogs on getting started with AYLIEN’s various SDKs.

If you’re new to AYLIEN and you don’t have an account yet, you can go directly to the Getting Started page on the website which will take you through the signup process. We have a free plan to get started with that allows you to make up to 1,000 calls per day for free.

All of our SDK repositories are hosted on Github. For the Java SDK, the Text Analysis API is published to Maven Central, so simply add the dependency to the POM:


<dependency>
  <groupId>com.aylien.textapi</groupId>
  <artifactId>client</artifactId>
  <version>0.1.0</version>
</dependency>


Once you've installed the SDK you're ready to start coding. For the remainder of this blog we'll walk you through making calls and show the output you should receive in each case. Using simple examples, we'll showcase some of the API's endpoints, like Language Detection, Sentiment Analysis and Hashtag Suggestion.

### Configuring the SDK with your AYLIEN credentials

Once you’ve received your AYLIEN APP_ID and APP_KEY from the signup process and have downloaded the SDK you can begin making calls with the following imports and configuration code.


import com.aylien.textapi.TextAPIClient;
import com.aylien.textapi.parameters.*;
import com.aylien.textapi.responses.*;

TextAPIClient client = new TextAPIClient(
"YourApplicationId", "YourApplicationKey");


When calling the various API endpoints you can specify a piece of text directly for analysis or you can pass a url linking to the text or article you wish to analyze.

### Language Detection

First off, let’s take a look at the language detection endpoint. As a simple example we’re going to detect the language of the following sentence: ‘What language is this sentence written in?’

To do this, you can call the endpoint using the following piece of code.


String text = "What language is this sentence written in?";
LanguageParams languageParams = new LanguageParams(text, null);
Language language = client.language(languageParams);
System.out.printf("\nText : %s", language.getText());
System.out.printf("\nLanguage : %s", language.getLanguage());
System.out.printf("\nConfidence %f", language.getConfidence());


You should receive an output very similar to the one shown below. This shows that the language detected was English and the confidence that it was detected correctly (a number between 0 and 1) is very close to 1, which means you can be pretty sure it is correct.

#### Language Detection Results


Text : What language is this sentence written in?
Language : en
Confidence 0.999997


### Sentiment Analysis

Next, we'll look at analyzing the sentence "John is a very good football player!" to determine its sentiment, i.e. whether it's positive, neutral or negative. The Sentiment Analysis endpoint will also determine whether the text is subjective or objective. You can call the endpoint with the following piece of code.


text = "John is a very good football player!";
SentimentParams sentimentParams = new SentimentParams(text, null, null);
Sentiment sentiment = client.sentiment(sentimentParams);
System.out.printf("\nText : %s", sentiment.getText());
System.out.printf("\nSentiment Polarity   : %s", sentiment.getPolarity());
System.out.printf("\nPolarity Confidence  : %f", sentiment.getPolarityConfidence());
System.out.printf("\nSubjectivity : %s", sentiment.getSubjectivity());
System.out.printf("\nSubjectivity Confidence: %f", sentiment.getSubjectivityConfidence());


You should receive an output similar to the one shown below which indicates that the sentence is objective and is positive, both with a high degree of confidence.

#### Sentiment Analysis Results


Text : John is a very good football player!
Sentiment Polarity   : positive
Polarity Confidence  : 0.999999
Subjectivity : objective
Subjectivity Confidence: 0.989682


### Hashtag Suggestion

Finally, we'll look at analyzing a BBC article to extract hashtag suggestions for it, using the following code.


// 'url' is assumed to hold the address of the BBC article being analyzed
HashTagsParams hashtagsParams = new HashTagsParams(null, url, null);
HashTags hashtags = client.hashtags(hashtagsParams);
System.out.print("Hashtags : \n");
System.out.print(hashtags + "\n");


You should receive the output shown below.

#### Hashtag Suggestion Results


Hashtags :
#Planet #JohannesKepler #Kepler #Birmingham #Earth #Astronomy #TheAstrophysicalJournal #Warwick #Venus #Orbit #Mercury #SolarSystem #Resonance #TerrestrialPlanet #Lightyear #Imagine


If Java's not your preferred language, then check out our other SDKs for node.js, Go, PHP, Python, Ruby and .NET (C#). For more information regarding the APIs, go to the documentation section of our website.

We’ve just added support for microformat parsing to our Text Analysis API through our Microformat Extraction endpoint.

Microformats are simple conventions, or entities, used on web pages to describe a specific type of information: contact info, reviews, products, people, events, etc.

Microformats are often included in the HTML of pages on the web to add semantic information about that page. They make it easier for machines and software to scan, process and understand webpages. AYLIEN Microformat Extraction allows users to detect, parse and extract embedded Microformats when they are present on a page.

Currently, the API supports the hCard format. We will be providing support for the other formats over the coming months. The quickest way to get up and running with this endpoint is to download an SDK and check out the documentation. We have gone through a simple example below to showcase the endpoint's capabilities.

### Microformat Extraction in Action

The following piece of code sets up the credentials for accessing our API. If you don’t have an AYLIEN account, you can sign up here.


var AYLIENTextAPI = require('aylien_textapi');
var textapi = new AYLIENTextAPI({
  application_id: 'YOUR_APP_ID',
  application_key: 'YOUR_APP_KEY'
});



The next piece of code accesses an HTML test page containing microformats that we have set up in CodePen to illustrate how the endpoint works (check out http://codepen.io/michaelo/pen/VYxxRR.html to see the raw HTML). The code consists of a call to the microformats endpoint and a forEach statement to display any hCards detected on the page.


textapi.microformats('http://codepen.io/michaelo/pen/VYxxRR.html',
  function(err, res) {
    if (err !== null) {
      console.log("Error: " + err);
    } else {
      res.hCards.forEach(function(hCard) {
        console.log(hCard);
        console.log("\n****************************************");
        console.log("End Of vCard");
        console.log("******************************************");
      });
    }
  });



As you can see from the results below, there are two hcards on the page, one for Sally Ride and the other for John Glenn. The documentation for the endpoint shows the structure of the data returned by the endpoint and lists the optional hCard fields that are currently supported. You can copy the code above and paste it into our sandbox environment to view the results for yourself and play around with the various fields.

#### Results


{ birthday: '1951-05-26',
  organization: 'Sally Ride Science',
  telephoneNumber: '+1.818.555.1212',
  location:
   { id: '9f15e27ff48eb28c57f49fb177a1ed0af78f93ab',
     latitude: '37.386013',
     longitude: '-122.082932' },
  photo: 'http://example.com/sk.jpg',
  email: 'sally@example.com',
  url: 'http://sally.example.com',
  fullName: 'Sally Ride',
  structuredName:
   { familyName: 'van der Harten',
     givenName: 'Sally',
     honorificSuffix: 'Ph.D.',
     honorificPrefix: 'Dr.' },
  logo: 'http://www.abc.com/pub/logos/abccorp.jpg',
  id: '7d021199b0d826eef60cd31279037270e38715cd',
  note: '1st American woman in space.',
  address:
   { countryName: 'U.S.A',
     postalCode: 'LWT12Z',
     id: '00cc73c1f9773a66613b04f11ce57317eecf636b',
     region: 'California',
     locality: 'Los Angeles' },
  category: 'physicist' }

****************************************
End Of vCard
****************************************

{ birthday: '1921-07-18',
  telephoneNumber: '+1.818.555.1313',
  location:
   { latitude: '30.386013',
     longitude: '-123.082932' },
  photo: 'http://example.com/jg.jpg',
  email: 'johnglenn@example.com',
  url: 'http://john.example.com',
  fullName: 'John Glenn',
  structuredName:
   { familyName: 'Glenn',
     givenName: 'John',
     id: 'a1146a5a67d236f340c5e906553f16d59113a417',
     honorificPrefix: 'Senator' },
  logo: 'http://www.example.com/pub/logos/abccorp.jpg',
  id: '18538282ee1ac00b28f8645dff758f2ce696f8e5',
  note: '1st American to orbit the Earth',
  address:
   { countryName: 'U.S.A',
     postalCode: 'PC123',
     id: '8cc940d376d3ddf77c6a5938cf731ee4ac01e128',
     region: 'Ohio',
     locality: 'Columbus' } }

****************************************
End Of vCard
****************************************



Microformat Extraction allows you to automatically scan and understand webpages by pulling relevant information from the HTML. This microformat information is easier for both humans and machines to understand than more complex formats such as XML.

Our development team has been working hard adding features to the API that allow our users to analyze, classify and tag text in more flexible ways. Unsupervised Classification is a feature we are really excited about, and we're happy to announce that, as of today, it is available as a fully functional and documented feature.

#### So what exactly is Unsupervised Classification?

It's a training-less approach to classification, which means that, unlike our standard classification, which is based on IPTC NewsCodes, it doesn't rely on a predefined taxonomy to categorize text. This method of classification enables automatic tagging of text that can be tailored to a user's needs, without the need for a pre-trained classifier.

#### Why are we so excited about it?

Our Unsupervised Classification endpoint allows users to specify a set of labels, analyze a piece of text and then assign the most appropriate label to that text. This gives our users greater flexibility in deciding how they want to tag and classify text.

There are a number of ways this endpoint can be used, and we'll walk you through a couple of simple examples: classifying text from a URL and routing customer service interactions on social media.

### Classification of Text

We'll start with a simple example to show how the feature works. The user passes a piece of text or a URL to the API, along with a number of labels. In the case below, we want to find out which label (football, baseball, hockey or basketball) best represents the following article: 'http://insider.espn.go.com/nfl/story/_/id/12300361/bold-move-new-england-patriots-miami-dolphins-new-york-jets-buffalo-bills-nfl'

#### Code Snippet:


var AYLIENTextAPI = require('aylien_textapi');
var textapi = new AYLIENTextAPI({
application_id: 'YourAppId',
application_key: 'YourAppKey'
});

var params = {
  url: 'http://insider.espn.go.com/nfl/story/_/id/12300361/bold-move-new-england-patriots-miami-dolphins-new-york-jets-buffalo-bills-nfl',
  'class': ['football', 'baseball', 'hockey', 'basketball']
};

textapi.unsupervisedClassify(params, function(error, response) {
  if (error !== null) {
    console.log(error, response);
  } else {
    console.log("\nThe text to classify is : \n\n",
      response.text, "\n");
    for (var i = 0; i < response.classes.length; i++) {
      console.log("label - ", response.classes[i].label,
        ", score -", response.classes[i].score, "\n");
    }
  }
});


#### Results:


The text to classify is:

"Each NFL team's offseason is filled with small moves and marginal personnel decisions... "

label -  football , score - 0.13

label -  baseball , score - 0.042

label -  hockey , score - 0.008

label -  basketball , score - 0.008


Based on the scores provided, we can confidently say that the article is about football and should be assigned a "Football" label.

### Customer Service Routing

As another example, let's say we want to automatically determine whether a post on social media should be routed to our Sales, Marketing or Support department. In this example, we'll take the comment "I'd like to place an order for 1000 units." and automatically determine which department should handle it. To do this, we pass the text to the API along with our pre-chosen labels, in this case 'Sales', 'Customer Support' and 'Marketing'.

#### Code Snippet:


var AYLIENTextAPI = require('aylien_textapi');
var textapi = new AYLIENTextAPI({
application_id: 'YourAppId',
application_key: 'YourAppKey'
});

var params = {
text: "I'd like to place an order for 1000 units.",
'class': ['Sales', 'Customer Support', 'Marketing']
};

textapi.unsupervisedClassify(params, function(error, response) {
  if (error !== null) {
    console.log(error, response);
  } else {
    console.log("\nThe text to classify is : \n\n",
      response.text, "\n");
    for (var i = 0; i < response.classes.length; i++) {
      console.log("label - ",
        response.classes[i].label,
        ", score -", response.classes[i].score, "\n");
    }
  }
});


#### Results:


The text to classify is:

I'd like to place an order for 1000 units.

label -  Sales , score - 0.032

label -  Customer Support , score - 0.008

label -  Marketing , score - 0.002


Similarly, based on the scores indicating how closely the text semantically matches each label, we can decide that this inquiry should be handled by a sales agent rather than by marketing or support.

### Divide and Conquer

Our next example deals with the idea of using the unsupervised classification feature, with a hierarchical taxonomy. When classifying text, it’s sometimes necessary to add a sub-label for finer grained classification, for example “Sports – Basketball” instead of just “sports”.

So, in this example we're going to analyze a simple piece of text, "The oboe is a woodwind musical instrument", and attempt to provide a more descriptive classification result based on the following taxonomy:

• ‘music’: [‘Instrument’, ‘composer’],
• ‘technology’: [‘computers’, ‘space’, ‘physics’],
• ‘health’: [‘disease’, ‘medicine’, ‘fitness’],

The taxonomy has a primary label and secondary labels, for example 'music' (primary) and 'Instrument, composer' (secondary).

#### Code Snippet:


var AYLIENTextAPI = require('aylien_textapi');
var textapi = new AYLIENTextAPI({
application_id: 'YourAppId',
application_key: 'YourAppKey'
});

var _ = require('underscore');
var taxonomy = {
'music':      ['Instrument', 'composer'],
'technology': ['computers', 'space', 'physics'],
'health':     ['disease', 'medicine', 'fitness'],
};

var topClasses = ['technology', 'music', 'health', 'sport'];
var queryText = "The oboe is a woodwind musical instrument.";
var params = {
text: queryText,
'class': topClasses
};

textapi.unsupervisedClassify(params, function(error, response) {
if (error !== null) {
console.log(error, response);
} else {
var classificationResult = '';
console.log("\nThe text to classify is : \n\n",
response.text, "\n");
classificationResult = response.classes[0].label +
" (" + response.classes[0].score + ") ";
params = {
text: queryText,
'class': _.values(
_.pick(taxonomy, response.classes[0].label)
)[0]
};
textapi.unsupervisedClassify(params,
function(error, response) {
if (error !== null) {
console.log(error, response);
} else {
classificationResult += " - " +
response.classes[0].label +
" (" + response.classes[0].score +
") ";
console.log("Label: ", classificationResult);
}
}
);
}

});


#### Results:


The text to classify is :

The oboe is a woodwind musical instrument.

Label    :     music (0.076)  - Instrument (0.342)


As you can see from the results, the piece of text has been assigned ‘music’ as its primary label and ‘instrument’ as its secondary label.

All the code snippets in our examples are fully functional and can be copied and pasted or tested in our sandbox. We’ll also be adding some of these and more interesting apps to our sandbox over the next week or so that will showcase some interesting use cases for Unsupervised Classification. We’d also love to hear more about how you would use this feature, so don’t hesitate to get in touch with comments or feedback.
