### Introduction

We recently added a feature to our API that allows users to classify text according to their own labels. This unsupervised method of classification relies on Explicit Semantic Analysis in order to determine how closely matched a piece of text and a label or tag are.

This method of classification provides greater flexibility when classifying text and doesn’t rely on a particular taxonomy to understand and categorize a piece of text.

Explicit Semantic Analysis (ESA) works at the level of meaning rather than on the surface-form vocabulary of a word or document. ESA represents the meaning of a piece of text as a combination of the concepts found in that text, and is used in document classification, semantic relatedness calculation (i.e. how similar in meaning two words or pieces of text are to each other) and information retrieval.

In document classification, for example, documents are tagged to make them easier to manage and sort. Tagging a document with keywords makes it easier to find. However, keyword tagging alone has its limitations: searches carried out using vocabulary with similar meaning, but different actual words, may not uncover relevant documents. Classifying text semantically, i.e. representing the document as concepts and lowering the dependence on specific keywords, can greatly improve a machine's understanding of text.

### How is Explicit Semantic Analysis achieved?

Wikipedia is a large and diverse knowledge base where each article can be considered a distinct concept. In Wikipedia based ESA, a concept is generated for each article. Each concept is then represented as a vector of the words which occur in the article, weighted by their tf-idf score.
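As a rough sketch of that weighting step, here is a toy three-article "Wikipedia" (the articles, words and concepts below are invented for illustration, not real data). Each article is one concept, and a word's association with a concept is its tf-idf weight in that concept's article:

```python
import math

# Toy stand-in for Wikipedia: three tiny articles, each treated as one concept.
articles = {
    "Planet":       "mars planet orbit sun planet".split(),
    "Solar System": "sun planet orbit asteroid".split(),
    "Jupiter":      "jupiter planet gas giant".split(),
}

def tf_idf(term, doc_terms, all_docs):
    """Term frequency in this article times inverse document frequency."""
    tf = doc_terms.count(term) / len(doc_terms)
    df = sum(1 for terms in all_docs.values() if term in terms)
    idf = math.log(len(all_docs) / df) if df else 0.0
    return tf * idf

# The word "mars" is associated with each concept by its tf-idf weight
# in that concept's article.
weights = {concept: tf_idf("mars", terms, articles)
           for concept, terms in articles.items()}
```

Here "mars" receives a non-zero weight only for the article it actually occurs in, which is exactly the kind of <concept, weight> association described next.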

The meaning of any given word can then be represented as a vector of that word’s relatedness, or “association weighting” to the Wikipedia based concepts.

“word” ---> <concept1, weight1>, <concept2, weight2>, <concept3, weight3> ...


A trivial example might be:

“Mars” ---> <planet, 0.90>, <Solar system, 0.85>, <Jupiter, 0.30> ...


Comparing two word vectors using cosine similarity, we can get a numerical value for the semantic relatedness of words, i.e. we can quantify how similar the words are to each other based on their association weighting to the various concepts.

Note: In Text Analysis a vector is simply a numerical representation of a word or document. It is easier for algorithms to work with numbers than with characters. Additionally, vectors can be plotted graphically and the “distance” between them is a visual representation of how closely related in terms of meaning words and documents are to each other.
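As a concrete sketch of that comparison (the concept axes and association weights below are invented for illustration, not real ESA output):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Invented association weights over three concept axes:
# <planet, Solar system, music>.
mars   = [0.90, 0.85, 0.05]
venus  = [0.92, 0.80, 0.03]
guitar = [0.03, 0.01, 0.95]

similar   = cosine(mars, venus)    # close in meaning -> close to 1
unrelated = cosine(mars, guitar)   # far apart in meaning -> close to 0
```

Because cosine similarity compares the direction of the vectors rather than their length, two words that weight the same concepts highly will score close to 1 regardless of the absolute size of their weights.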

### Explicit Semantic Analysis and Documents

Larger documents are represented as a combination of individual word vectors derived from the words within a document. The resultant document vectors are known as “concept” vectors. For example, a concept vector might look something like the following:

“Mars”     ---> <planet, 0.90>, <Solar system, 0.85>, <Jupiter, 0.30> ...
“explorer” ---> <adventurer, 0.89>, <pioneer, 0.70>, <vehicle, 0.20> ...
...
“wordn”    ---> <conceptb, weightb>, <conceptd, weightd>, <conceptp, weightp> ...


Graphically, we can represent a concept vector as the centroid of the word vectors it is composed of. The image below illustrates the centroid of a set of vectors, i.e. the center or average position of those vectors.

So, to compare how similar two phrases are we can create their concept vectors from their constituent word vectors and then compare the two, again using cosine similarity.
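That two-step recipe, average the word vectors into a concept vector, then compare with cosine similarity, can be sketched as follows (the word vectors and the two concept axes are invented for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Element-wise average of a set of word vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

# Invented word vectors over two concept axes: <planet, music>.
mars     = [0.90, 0.05]
explorer = [0.60, 0.10]
guitar   = [0.05, 0.90]
solo     = [0.10, 0.70]

space_phrase = centroid([mars, explorer])  # concept vector for "mars explorer"
music_phrase = centroid([guitar, solo])    # concept vector for "guitar solo"
phrase_similarity = cosine(space_phrase, music_phrase)
```

The two phrases lean on different concepts, so their concept vectors point in different directions and the similarity score is low.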

### ESA and Dataless Classification

This functionality is particularly useful when you want to classify a document but don't want to use a known taxonomy. It allows you to specify, on the fly, a proprietary taxonomy on which to base the classification. You provide the text to be classified along with potential labels, and through ESA the API determines which label is most closely related to your piece of text.
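In essence, dataless classification reduces to picking the label whose concept vector is most similar to the text's concept vector. A minimal sketch of the idea with invented vectors (an illustration of the principle, not the API's actual implementation):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Invented concept vectors over two axes: <sport, finance>.
text_vector = [0.85, 0.10]          # e.g. an article about a football match
label_vectors = {
    "football": [0.90, 0.05],
    "banking":  [0.08, 0.92],
}

# Score every candidate label against the text and pick the best match.
scores = {label: cosine(text_vector, vec) for label, vec in label_vectors.items()}
best_label = max(scores, key=scores.get)
```

The labels are supplied at query time, which is what makes the approach "dataless": no classifier is trained on the taxonomy in advance.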

### Summary

ESA operates at the level of concepts and meaning rather than just the surface form vocabulary. As such, it can improve the accuracy of document classification, information retrieval and semantic relatedness.

If you would like to know more about this topic check out this excellent blog from Christopher Olah and this very accessible research paper from Egozi, Markovitch and Gabrilovich, both of which I referred to heavily when researching this blog post.

Keep an eye out for more in our “Text Analysis 101” series.

Our development team have been working hard adding additional features to the API which allow our users to analyze, classify and tag text in more flexible ways. Unsupervised Classification is a feature we are really excited about and we’re happy to announce that it is available as a fully functional and documented feature, as of today.

#### So what exactly is Unsupervised Classification?

It’s a training-less approach to classification, which means that, unlike our standard classification, which is based on IPTC News Codes, it doesn’t rely on a predefined taxonomy to categorize text. This method of classification allows automatic tagging of text that can be tailored to a user’s needs, without the need for a pre-trained classifier.

#### Why are we so excited about it?

Our Unsupervised Classification endpoint allows users to specify a set of labels, analyze a piece of text and then assign the most appropriate label to that text. This gives our users greater flexibility to decide how they want to tag and classify text.

There are a number of ways this endpoint can be used, and we’ll walk you through a couple of simple examples: Text Classification from a URL and Customer Service Routing of social interactions.

### Classification of Text

We’ll start with a simple example to show how the feature works. The user passes a piece of text or a URL to the API, along with a number of labels. In the case below we want to find out which label, Football, Baseball, Hockey or Basketball, best represents the following article: ‘http://insider.espn.go.com/nfl/story/_/id/12300361/bold-move-new-england-patriots-miami-dolphins-new-york-jets-buffalo-bills-nfl’

#### Code Snippet:


var AYLIENTextAPI = require('aylien_textapi');
var textapi = new AYLIENTextAPI({
  application_id: 'YourAppId',
  application_key: 'YourAppKey'
});

var params = {
  url: 'http://insider.espn.go.com/nfl/story/_/id/12300361/bold-move-new-england-patriots-miami-dolphins-new-york-jets-buffalo-bills-nfl',
  'class': ['football', 'baseball', 'hockey', 'basketball']
};

textapi.unsupervisedClassify(params, function(error, response) {
  if (error !== null) {
    console.log(error, response);
  } else {
    console.log("\nThe text to classify is:\n\n", response.text, "\n");
    for (var i = 0; i < response.classes.length; i++) {
      console.log("label -", response.classes[i].label,
        ", score -", response.classes[i].score, "\n");
    }
  }
});


#### Results:


The text to classify is:

"Each NFL team's offseason is filled with small moves and marginal personnel decisions... "

label -  football , score - 0.13

label -  baseball , score - 0.042

label -  hockey , score - 0.008

label -  basketball , score - 0.008


Based on the scores provided, we can confidently say that the article is about football and should be assigned a “Football” label.

### Customer Service Routing

As another example, let’s say we want to automatically determine whether a post on social media should be routed to our Sales, Marketing or Support Departments. In this example, we’ll take the following comment: “I’d like to place an order for 1000 units.” and automatically determine whether it should be dealt with by Sales, Marketing or Support. To do this, we pass the text to the API as well as our pre-chosen labels, in this case: ‘Sales’, ‘Customer Support’, ‘Marketing’.

#### Code Snippet:


var AYLIENTextAPI = require('aylien_textapi');
var textapi = new AYLIENTextAPI({
  application_id: 'YourAppId',
  application_key: 'YourAppKey'
});

var params = {
  text: "I'd like to place an order for 1000 units.",
  'class': ['Sales', 'Customer Support', 'Marketing']
};

textapi.unsupervisedClassify(params, function(error, response) {
  if (error !== null) {
    console.log(error, response);
  } else {
    console.log("\nThe text to classify is:\n\n", response.text, "\n");
    for (var i = 0; i < response.classes.length; i++) {
      console.log("label -", response.classes[i].label,
        ", score -", response.classes[i].score, "\n");
    }
  }
});


#### Results:


The text to classify is:

I'd like to place an order for 1000 units.

label -  Sales , score - 0.032

label -  Customer Support , score - 0.008

label -  Marketing , score - 0.002


Similarly, based on the scores indicating how closely the text is semantically matched to each label, we can decide that this inquiry should be handled by a sales agent rather than by marketing or support.

### Divide and Conquer

Our next example deals with using the Unsupervised Classification feature with a hierarchical taxonomy. When classifying text, it’s sometimes necessary to add a sub-label for finer-grained classification, for example “Sports – Basketball” instead of just “Sports”.

So, in this example we’re going to analyze a simple piece of text, “The oboe is a woodwind musical instrument”, and we’ll attempt to provide a more descriptive classification result based on the following taxonomy:

• ‘music’: [‘Instrument’, ‘composer’],
• ‘technology’: [‘computers’, ‘space’, ‘physics’],
• ‘health’: [‘disease’, ‘medicine’, ‘fitness’],

The taxonomy has a primary label and secondary labels, for example ‘music’ (primary) and ‘Instrument’, ‘composer’ (secondary).

#### Code Snippet:


var AYLIENTextAPI = require('aylien_textapi');
var textapi = new AYLIENTextAPI({
  application_id: 'YourAppId',
  application_key: 'YourAppKey'
});

var _ = require('underscore');
var taxonomy = {
  'music':      ['Instrument', 'composer'],
  'technology': ['computers', 'space', 'physics'],
  'health':     ['disease', 'medicine', 'fitness']
};

var topClasses = ['technology', 'music', 'health', 'sport'];
var queryText = "The oboe is a woodwind musical instrument.";
var params = {
  text: queryText,
  'class': topClasses
};

textapi.unsupervisedClassify(params, function(error, response) {
  if (error !== null) {
    console.log(error, response);
  } else {
    var classificationResult = '';
    console.log("\nThe text to classify is:\n\n", response.text, "\n");
    classificationResult = response.classes[0].label +
      " (" + response.classes[0].score + ")";

    // Classify again, this time against the sub-labels of the winning primary label.
    params = {
      text: queryText,
      'class': _.values(_.pick(taxonomy, response.classes[0].label))[0]
    };
    textapi.unsupervisedClassify(params, function(error, response) {
      if (error !== null) {
        console.log(error, response);
      } else {
        classificationResult += " - " + response.classes[0].label +
          " (" + response.classes[0].score + ")";
        console.log("Label: ", classificationResult);
      }
    });
  }
});


#### Results:


The text to classify is :

The oboe is a woodwind musical instrument.

Label    :     music (0.076)  - Instrument (0.342)


As you can see from the results, the piece of text has been assigned ‘music’ as its primary label and ‘instrument’ as its secondary label.

All the code snippets in our examples are fully functional and can be copied and pasted or tested in our sandbox. We’ll also be adding some of these and more interesting apps to our sandbox over the next week or so that will showcase some interesting use cases for Unsupervised Classification. We’d also love to hear more about how you would use this feature, so don’t hesitate to get in touch with comments or feedback.

Brands are investing more in digital marketing today than ever before. Digital ad revenues hit a record high in the first half of 2014, surging to $23.1 billion, suggesting ad spend is growing steadily and not slowing down anytime soon. The Ad-Tech sector is definitely “hot right now”; innovation and disruption within the industry are primarily focused on Programmatic Advertising, Cross Platform Advertising, Social Advertising and Mobile. But are we missing something? Are we forgetting that web users don’t like ads and that ad targeting, in general, is less than satisfactory?

The digital ad space today is made up of display ads, banners, pop-ups, search ads and so on that are targeted on the actions or behaviours of an internet user, all of which can be placed under the Behavioural Advertising umbrella.

### Behavioural vs Semantic Advertising

Behavioural Advertising is what most of us are used to and, quite frankly, are sick of. It’s a form of display advertising that takes into account the habits of a web user. It focuses on the behaviours of a web user and usually relies on cookies to track the actions or activity of a user, whether that’s a search made, a visit to a site or even the geolocation of a user.

Semantic advertising is different. It applies semantic technologies to online advertising solutions. The function of semantic advertising technology is to contextually analyze, properly understand and classify the meaning of a web page to ensure the most appropriate advert is displayed on that page. Semantic advertising increases the chance that the viewer will click through, because only ads that are relevant to what the user is viewing, and is therefore interested in, will be displayed.

### Why is Semantic Advertising important?

Unfortunately, digital ads online are pretty poor; often they’re just annoying, and sometimes they can be damaging for a brand. In a nutshell, as the amount of content published and consumed online increases, ads are becoming less targeted, more intrusive and a lot less effective. Don’t believe me? Check out some of these sound bites and stats gathered by HubSpot CMO Mike Volpe. (My favourite has to be this one: “You are more likely to summit Mount Everest than click a banner ad.”)

### Who should care?

#### Publishers

The publisher, for the most part, will own the web pages the ads are displayed on. Their goal is to maximize advertising revenue while providing a positive user experience. Today more than ever, there are more people browsing privately, privacy laws are tougher than ever and advertisers are looking for more bang for their buck. Publishers need to embrace moves away from traditional cookie-based targeting to more semantic-based approaches.

#### Advertisers

The advertiser provides the ads. These ads are usually organized around campaigns, which are defined by a set of ads with a particular goal or theme (e.g. car insurance) in mind. The goal of the advertiser is to promote their given product or service in the best way possible, to the right audience at the right time. Reaching a target audience online is harder than ever before. The main issue limiting the display market is context, or the disconnect between ad placement and content on the web page: misplaced ads can often impact negatively on an ad campaign and its effectiveness. Advertisers need to focus on more strategic targeting methods in order to meet their target customers where and when is most effective.

#### Ad Networks

An ad network acts as a middleman between advertisers and publishers. They select which ads are displayed on which pages. Ad networks need to care about keeping their advertisers happy and providing ROI. Brands and advertisers do their best to avoid wasting ad spend on misplaced or poorly targeted ad campaigns. That’s why ad networks need to differentiate themselves by offering alternative targeting options, like semantic targeting, to improve ad matching and the overall return they provide.

In our next blog, we’ll focus on the basics of how semantic advertising works. For more on the benefits of semantic ads, check out our previous blog post, Semantic Advertising and Text Analysis gives more targeted ad campaigns.

This is the second in our series of blogs on getting started with AYLIEN’s various SDKs. There are SDKs available for Node.js, Python, Ruby, PHP, Go, Java and .NET (C#). Last week’s blog focused on the Node.js SDK. This week we will focus on Python.

If you are new to AYLIEN and don’t have an account, you can take a look at our blog on getting started with the API, or alternatively you can go directly to the Getting Started page on the website, which will take you through the signup process. We have a free plan available which allows you to make up to 1,000 calls to the API per day for free.

### Downloading and Installing the Python SDK

All of our SDK repositories are hosted on GitHub. You can find the Python repository here. The simplest way to install it is with the Python package installer, pip. Simply run the following from a command line tool:

$ pip install --upgrade aylien-apiclient


The following libraries will be installed when you install the client library:

• httplib2

Once you have installed the SDK you’re ready to start coding! The Sandbox area of the website has a number of sample applications in node.js which help to demonstrate what the API can do. For the remainder of this blog we will walk you through making calls to three of the API endpoints using the Python SDK.

### Utilizing the SDK with your AYLIEN credentials


from aylienapiclient import textapi
c = textapi.Client("YourApplicationID", "YourApplicationKey")


When calling the various endpoints you can specify a piece of text directly for analysis or you can pass a URL linking to the text or article you wish to analyze.

### Language Detection

First let’s take a look at the language detection endpoint. We’re going to detect the language of the sentence “What language is this sentence written in?”

You can do this by simply running the following piece of code:


language = c.Language({'text': 'What language is this sentence written in?'})
print("Language Detection Results:\n\n")
print("Language: %s\n" % (language["lang"]))
print("Confidence: %f\n" % (language["confidence"]))


You should receive an output very similar to the one shown below. It shows that the language detected was English, and that the confidence (a number between 0 and 1) is very close to 1, indicating that you can be pretty sure the detection is correct.

#### Language Detection Results:


Language	: en
Confidence	: 0.9999984486883192


### Sentiment Analysis

Next we will look at analyzing the sentence “John is a very good football player” to determine its sentiment, i.e. whether it’s positive, neutral or negative. The API will also determine whether the text is subjective or objective. You can call the endpoint with the following piece of code:


sentiment = c.Sentiment({'text': 'John is a very good football player!'})
print("\n\nSentiment Analysis Results:\n\n")
print("Polarity: %s\n" % sentiment["polarity"])
print("Polarity Confidence: %s\n" % (sentiment["polarity_confidence"]))
print("Subjectivity: %s\n" % sentiment["subjectivity"])
print("Subjectivity Confidence: %s\n" % (sentiment["subjectivity_confidence"]))


You should receive an output similar to the one shown below which indicates that the sentence is positive and objective, both with a high degree of confidence.

#### Sentiment Analysis Results:


Polarity: positive
Polarity Confidence: 0.9999988272764874
Subjectivity: objective
Subjectivity Confidence: 0.9896821594138254


### Article Classification

Next we will take a look at the classification endpoint. The Classification Endpoint automatically assigns an article or piece of text to one or more categories making it easier to manage and sort. The classification is based on IPTC International Subject News Codes and can identify up to 500 categories. The code below analyzes a BBC news article about the first known picture of an oceanic shark giving birth.


category = c.Classify({'url': 'http://www.bbc.com/news/science-environment-30747971'})
print("\nArticle Classification Results:\n\n")
print("Label: %s\n" % category["categories"][0]["label"])
print("Code: %s\n" % category["categories"][0]["code"])
print("Confidence: %s\n" % category["categories"][0]["confidence"])



When you run this code you should receive an output similar to that shown below, which assigns the article an IPTC label of “science and technology – animal science” with an IPTC code of 13012000.

#### Article Classification Results:


Label   : science and technology - animal science
Code    : 13012000
Confidence      : 0.9999999999824132


If Python is not your preferred language, check out our SDKs for Node.js, Ruby, PHP, Go, Java and .NET (C#). For more information regarding the APIs, go to the documentation section of our website.

We will be publishing ‘getting started’ blogs for the remaining languages over the coming weeks so keep an eye out for them. If you haven’t already done so, you can get free access to our API on our sign up page.

Happy Hacking!

The internet has had a massive impact on marketing and advertising in general. It has provided an effective way for businesses to access target prospects with branded and targeted marketing material at scale. However, how effective are traditional digital advertising techniques? Have we become immune to flashy banner ads and keyword focused promotional material? Apparently not! But it seems things can improve.

Spending on ads served to internet enabled devices, desktops, laptops, mobiles and tablets will reach $137.53 billion this year and will continue to grow, according to eMarketer’s latest estimates of worldwide paid media spending.

Advertising online is based around matching ads or promotional material (banner ads, links, video and interactive ads) with appropriate web pages where the right audience will see them. Traditionally ad targeting is done by manual classification of pages or by using information retrieval techniques to find keywords from the page, and match these to keywords associated with ads.

While this has proven to be a pretty effective promotion channel thus far, it does have its problems. It is true that a lot of an ad’s effectiveness comes down to the creative (the look and feel, the text used, etc.), but if it’s showing up in the wrong place in front of the wrong people, it isn’t going to be effective.

### Relevance is key!

Ads today are often intrusive, robotic and just not relevant! Well placed and effective ads all have particular attributes that stand out from the rest. They are relevant and they promote a product or service that the visitor is likely to be interested in.

In the case below I visited a few pages to see how they fared by way of “targeted” advertising. The ad served to me on Mashable was for Eukanuba dog food. Is this relevant? I don’t have a dog, and I have never bought or researched dog food online. It also isn’t relevant to anything else on the page, and therefore there is very little chance I would click on that ad.

So what can we do to get more effective ads in front of the right people? By incorporating text analysis and semantic capabilities into ad placement strategies, we can focus on more than just keyword matching and serve relevant ads in the right place at the right time.

### What is semantic targeting?

Semantic advertising aims to analyze web pages to properly understand and classify the meaning of the page in order to ensure that viewers of the page are shown the most appropriate ads.

Semantically targeted ads increase the chance that the viewer will “click-through” because only ads that are relevant to what the user is viewing or the page they are on will be displayed. For example, say you visit a mountain biking blog, you are far more likely to click on an ad for cycling gear or bike helmets than one for car insurance as it is far more relevant at that time.

### Focused on more than keywords

Words can have multiple meanings, and scanning web pages for keywords in order to serve certain ads isn’t always effective. For example, the word “apple” may result in ads being displayed for Apple accessories or for an organic fruit delivery service; depending on which meaning fits the surrounding content, the ad could be very well or very poorly targeted. A better approach is to analyze the rest of the page to understand the context: if there are mentions of other fruits, organic farming and so on, it is probably safe to say the delivery service ad would be more appropriate.
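The disambiguation idea above can be sketched as a simple overlap test. The sense profiles and page words here are invented for illustration; a real system would use much richer semantic representations:

```python
# Hypothetical bags of words that characterise each sense of "apple".
sense_profiles = {
    "apple (company)": {"iphone", "mac", "accessories", "software", "laptop"},
    "apple (fruit)":   {"fruit", "organic", "farming", "delivery", "orchard"},
}

# Words found elsewhere on the page being analyzed.
page_words = {"organic", "farming", "fruit", "delivery", "seasonal"}

def best_sense(context, profiles):
    """Pick the sense whose profile shares the most words with the page."""
    return max(profiles, key=lambda sense: len(context & profiles[sense]))

chosen = best_sense(page_words, sense_profiles)
```

With a page full of farming and fruit vocabulary, the fruit sense wins, so the fruit delivery ad would be the better match.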

### Brand Protection

Ads can often turn up in some pretty inappropriate places if they are targeted by one factor and one factor only, such as keywords.

In the example above, the ad was most likely targeted to the web page visitor based on keyword matching: “grilling” as a keyword in the article title was matched with the grilling competition advertisement. Is this ad relevant? No. Does it promote the company’s brand in a positive light? No. Is it an effective advertisement? Certainly not.

### Not behaviour based

These days we have become a lot more private in our web use. People are conscious of their behaviour being tracked for advertising purposes, whether for display ads or retargeting, and have become a lot more savvy, choosing to block ads completely where possible, using private search engines or clearing their cookies on a regular basis. This poses a significant problem, as it takes away a particularly effective form of targeting based on behavioural tracking. Semantic ad targeting allows advertisers to move away from behaviour tracking and think on their feet by serving relevant ads in real time, based on analysis of the text on pages.

### Conclusion

Using NLP and Text Analysis techniques, advertisers can analyze web pages to understand the context of keywords, extract entities and concepts mentioned in a text and classify web pages automatically. This allows them to look beyond keywords and search terms and automatically match ads with relevant content on web pages. Being smarter and more strategic about how we target prospects, and embracing new technology, could one day mean that ads become so relevant that we actually find them useful and don’t feel the need to block them.

Human beings are remarkably adept at understanding each other, given that we speak in languages of our own construction which are merely symbols of the information we’re trying to convey.

We’re skilled at understanding for two reasons. First, we’ve had, literally, millions of years to acquire the necessary skills. Second, we speak in, generally, the same terms, the same languages. Still, it’s an incredible feat, to extract understanding and meaning from such an avalanche of signal.

Consider this: researchers in Japan used the K Computer, currently the fourth most powerful supercomputer in the world, to process a single second of human brain activity.

It took the computer 40 minutes to process that single second of brain activity.

For machines to reach the level of understanding that’s required for today’s applications and news organizations, then, would require those machines to sift through astronomical amounts of data, separating the meaningful from the meaningless. Much like our brains consciously process only a fraction of the information they store, a machine that could separate the wheat from the chaff would be capable of extracting remarkable insights.

We live in the dawn of the computer age, but in the thirty years since personal computing went mainstream, we’ve seen little progress in how computers work on a fundamental level. They’ve gotten faster, smaller, and more powerful, but they still require huge amounts of human input to function. We tell them what to do, and they do it. But what if what we’re truly after is understanding? To endow machines with the ability to learn from us, to interact with us, to understand what we want? That’s the next phase in the evolution of computers.

### Enter NLP

Natural Language Processing (NLP) is the catalyst that will spark that phase. NLP is a branch of Artificial Intelligence that allows computers to not just process, but to understand human language, thus eliminating the language barrier.

Chances are, you already use applications that employ NLP:

• Google Translate: human language translation is already changing the way humans communicate, by breaking down language barriers.
• Siri and Google Now: contextual services built into your smartphone rely heavily on NLP. NLP is why Google knows to show you directions when you say “How do I get home?”.

There are many other examples of NLP in products you already use, of course. The technology driving NLP, however, is not quite where it needs to be (which is why you get so frustrated when Siri or Google Now misunderstands you). In order to truly reach its potential, this technology, too, has a next step: understand you. It’s not enough to recognize generic human traits or tendencies; NLP has to be smart enough to adapt to your needs.

Most startups and developers simply don’t have the time or the resources to tackle these issues themselves. That’s where we come in. AYLIEN (that’s us) has combined three years of our own research with emerging academic studies on NLP to provide a set of common NLP functionalities in the form of an easy-to-use API bundle.

### Announcing the AYLIEN Text Analysis API

The Text API consists of eight distinct Natural Language Processing, Information Retrieval, and Machine Learning APIs which, when combined, allow developers to extract meaning and insight from any document with ease.

Here’s how we do it.

### Article Extraction

This tool extracts the main body of an article, removing all extraneous clutter, but leaving intact vital elements like embedded images and video.

### Article Summarization

This one does what it says on the tin: summarizes a given article in just a few sentences.

### Classification

The Classification feature uses a database of more than 500 categories to properly tag an article according to IPTC NewsCode standards.

### Entity Extraction

This tool can extract any entities (people, locations, organizations) or values (URLs, emails, phone numbers, currency amounts and percentages) mentioned in a given text.

### Concept Extraction

Concept Extraction continues the work of Entity Extraction, linking the entities mentioned to the relevant DBPedia and Linked Data entries, including their semantic types (such as DBPedia and schema.org types).

### Language Detection

Language Detection, of course, detects the language of a document from a database of 62 languages, returning that information in ISO 639-1 format.

### Sentiment Analysis

Sentiment Analysis detects the tone, or sentiment, of a text in terms of polarity (positive or negative) and subjectivity (subjective or objective).

### Hashtag Suggestion

Because discoverability is crucial to social media, Hashtag Suggestion automatically suggests ultra-relevant hashtags to engage audiences across social media.

This suite of tools is the result of years of research mixed in with good, old-fashioned hard work. We’re excited about the future of the Semantic Web, and we’re proud to offer news organizations and developers an easy-to-use API bundle that gets us one step closer to realizing our vision.

We’re happy to announce that you can start using the Text API from today for free. Happy hacking, and let us know what you think.