
Brands are investing more in digital marketing today than ever before. Digital ad revenues hit a record high in the first half of 2014, surging to $23.1 billion, which suggests ad spend is growing steadily and isn't slowing down anytime soon. The ad-tech sector is definitely "hot right now", and innovation and disruption within the industry are primarily focused on Programmatic Advertising, Cross-Platform Advertising, Social Advertising and Mobile.

But are we missing something? Are we forgetting that web users don’t like ads and that ad targeting, in general, is less than satisfactory?

The digital ad space today is made up of display ads, banners, pop-ups, search ads and so on, targeted based on the actions or behaviours of an internet user. All of these can be placed under the Behavioural Advertising umbrella.


Behavioural vs Semantic Advertising

Behavioural Advertising is what most of us are used to and, quite frankly, are sick of. It's a form of display advertising that takes into account the habits of a web user. It focuses on a web user's behaviour and usually relies on cookies to track their actions or activity, whether that's a search made, a visit to a site or even the user's geolocation.

Semantic advertising is different. It applies semantic technologies to online advertising solutions. The function of semantic advertising technology is to contextually analyze, properly understand and classify the meaning of a web page, ensuring the most appropriate advert is displayed on that page. Semantic advertising increases the chance that the viewer will click through, because only ads that are relevant to what the user is viewing, and is therefore interested in, will be displayed.

Why is Semantic Advertising important?

Unfortunately, digital ads online are pretty poor; oftentimes they're just annoying, and sometimes they can be damaging for a brand. In a nutshell, as the amount of content published and consumed online increases, ads are becoming less targeted, more intrusive and a lot less effective. Don't believe me? Check out some of these sound bites and stats gathered by HubSpot CMO Mike Volpe. (My favourite has to be this one: "You are more likely to summit Mount Everest than click a banner ad.")

Who should care?

Publishers

The publisher, for the most part, will own the web pages the ads are displayed on. Their goal is to maximize advertising revenue while providing a positive user experience.

Today, more people than ever are browsing privately, privacy laws are tougher than before and advertisers are looking for more bang for their buck. Publishers need to embrace the move away from traditional cookie-based targeting towards more semantic-based approaches.

Advertisers

The advertiser provides the ads. These ads are usually organized around campaigns which are defined by a set of ads with a particular goal or theme (e.g. car insurance) in mind. The goal of the advertiser is to promote their given product or service in the best way possible to the right audience at the right time.

Reaching a target audience online is harder for advertisers than ever before. The main issue limiting the display market is context, or rather the disconnect between ad placement and the content on the webpage: misplaced ads can negatively impact an ad campaign and its effectiveness. Advertisers need to focus on more strategic targeting methods in order to reach their target customers where and when it is most effective.

Ad Networks

An ad network acts as a middleman between the advertiser and publishers. They select which ads are displayed on which pages. Ad networks need to care about keeping their advertisers happy and providing ROI. Brands and advertisers do their best to avoid wasting ad spend on misplaced or poorly targeted ad campaigns. That's why ad networks need to differentiate themselves by offering alternative targeting options, like semantic targeting, to improve ad matching and the overall return they provide.

In our next blog, we'll focus on the basics of how semantic advertising works. For more on the benefits of semantic ads, check out our previous blog post, "Semantic Advertising and Text Analysis gives more targeted ad campaigns".


This is the fourth edition in our series of blogs on getting started with AYLIEN's various SDKs. There are SDKs available for Node.js, Python, Ruby, PHP, Go, Java and .NET (C#). For this week's instalment we're going to focus on C#.

If you are new to AYLIEN and Text Analysis and you do not have an account yet, you can take a look at our blog on how to get started with the API, or alternatively go directly to our Getting Started page, which will take you through the signup process. We provide a free plan to get started, which allows users to make up to 1,000 calls per day to the API for free.

Downloading and Installing the C# SDK

All of our SDK repositories are hosted on GitHub. You can find the C# repository here. The simplest way to install it is with the NuGet package manager. Simply type the following from a command line tool.


nuget install Aylien.TextApi

Alternatively, from Visual Studio, choose "Manage NuGet Packages" under the "Project" menu and search for the AYLIEN package under online packages.

Once you have installed the SDK you're ready to start coding. The Sandbox area of the website has a number of sample applications in Node.js which help to demonstrate what the APIs can do. In the remainder of this blog we will walk you through making calls using the C# SDK and show the output you should receive in each case.

Configuring the SDK with your AYLIEN credentials

Once you have received your AYLIEN APP_ID and APP_KEY from the signup process and you have downloaded the SDK, you can start making calls by adding the AYLIEN namespace to your C# code.


using Aylien.TextApi;
using System;

And initialising a client with your AYLIEN credentials


Client client = new Client(
                "YOUR_APP_ID", "YOUR_APP_KEY");

When calling the various API endpoints you can specify whether you want to analyze a piece of text directly or a URL linking to the text or article you wish to analyze.

Language Detection

First let’s take a look at the language detection endpoint by analyzing the following sentence: ‘What language is this sentence written in?’

You can call this endpoint using the following piece of code.


Language language = client.Language(text: "What language is this sentence written in?");
Console.WriteLine("Text: {0}", language.Text);
Console.WriteLine("Language: {0}", language.Lang);
Console.WriteLine("Confidence: {0}", language.Confidence);

You should receive an output very similar to the one shown below, which shows the language detected as English along with a confidence score. The confidence score is very close to 1, so you can be pretty sure it's correct.

Language Detection Results


Text: What language is this sentence written in?
Language: en
Confidence: 0.9999982

Sentiment Analysis

Next, we'll look at analyzing the sentence "John is a very good football player" to determine its sentiment, i.e. whether it's positive, neutral or negative. The endpoint will also determine if the text is subjective or objective. You can call the endpoint with the following piece of code.


Sentiment sentiment = client.Sentiment(text: "John is a very good football player!");
Console.WriteLine("Text: {0}", sentiment.Text);
Console.WriteLine("Sentiment Polarity  : {0}", sentiment.Polarity);
Console.WriteLine("Polarity Confidence  : {0}", sentiment.PolarityConfidence);
Console.WriteLine("Subjectivity  : {0}", sentiment.Subjectivity);
Console.WriteLine("Subjectivity Confidence  : {0}", sentiment.SubjectivityConfidence);

You should receive an output similar to the one shown below. This indicates that the sentence is objective and is positive, both with a high degree of confidence.

Sentiment Analysis Results


Text: John is a very good football player!
Sentiment Polarity  : positive
Polarity Confidence  : 0.999998827276487
Subjectivity  : objective
Subjectivity Confidence  : 0.989682159413825

Article Classification

We’re now going to take a look at the Classification endpoint. The Classification endpoint automatically assigns an article or piece of text to one or more categories making it easier to manage and sort. Our classification is based on IPTC International Subject News Codes and can identify up to 500 categories. The code below analyses a BBC news article about scientists who have managed to slow down the speed of light.


Classify classify = client.Classify(url: "http://www.bbc.com/news/uk-scotland-glasgow-west-30944584");
Console.Write("\nClassification: \n");
foreach (var item in classify.Categories)
{
    Console.WriteLine("Label        :   {0}", item.Label.ToString());
    Console.WriteLine("IPTC code    :   {0}", item.Code.ToString());
    Console.WriteLine("Confidence   :   {0}", item.Confidence.ToString());
}

When you run this code you should receive an output similar to that shown below which assigns the article an IPTC label of “applied science – particle physics” with an IPTC code of 13001004.

Article Classification Results


Classification:
Label        :   applied science - particle physics
IPTC code    :   13001004
Confidence   :   0.9877892

Hashtag Analysis

Next, we'll analyze the same BBC article and extract hashtag suggestions for sharing the article on social media.


Hashtags hashtags = client.Hashtags(url: "http://www.bbc.com/news/uk-scotland-glasgow-west-30944584");
Console.Write("\nHashtags: \n");
foreach (var item in hashtags.HashtagsMember)
{
    Console.WriteLine(item.ToString());
}

You should receive the output shown below.

Hashtag Suggestion Results


Hashtags:
#Glasgow
#HeriotWattUniversity
#Scotland
#Moon
#QuantumRealm
#LiquidCrystal
#Tie
#Bicycle
#Wave-particleDuality
#Earth
#Physics

Check out our SDKs for node.js, Go, PHP, Python, Java and Ruby if C# isn’t your preferred language. For more information regarding the APIs go to the documentation section of our website.


Last week's getting started blog focused on the Python SDK. This week we're going to focus on using the API with Go. This is the third in our series of blogs on getting started with AYLIEN's various SDKs. You can access all of our SDK repositories on GitHub.

If you are new to our API and Text Analysis in general and you don't have an account, you can go directly to the Getting Started page on the website, which will take you through how to open an account. You can choose a free plan to get started, which allows you to make up to 1,000 calls per day to the API for free.

Downloading and Installing the Go SDK

The simplest way to install the repository is with “go get”. Simply type the following from a command line tool.


$ go get github.com/AYLIEN/aylien_textapi_go

Utilizing the SDK with your AYLIEN credentials

Once you've subscribed to our API and have downloaded the SDK you can start making calls by adding the following code to your Go program.


import (
    "fmt"

    textapi "github.com/AYLIEN/aylien_textapi_go"
)

auth := textapi.Auth{"YOUR_APP_ID", "YOUR_APP_KEY"}
client, err := textapi.NewClient(auth, true)
if err != nil {
    panic(err)
}

When calling the API you can specify whether you wish to analyze a piece of text directly or a URL linking to the text or article you wish to analyze.

Language Detection

We're going to first showcase the Language Detection endpoint by analyzing the sentence "What language is this sentence written in?" using the following piece of code.


languageParams := &textapi.LanguageParams{Text: "What language is this sentence written in?"}
lang, err := client.Language(languageParams)
if err != nil {
    panic(err)
}
fmt.Printf("\nLanguage Detection Results\n")
fmt.Printf("Text            :   %s\n", lang.Text)
fmt.Printf("Language        :   %s\n", lang.Language)
fmt.Printf("Confidence      :   %f\n\n", lang.Confidence)

You should receive an output very similar to the one shown below, which shows that the language detected was English and that the confidence it was detected correctly (a number between 0 and 1) is very close to 1, meaning you can be pretty sure it is correct.

Language Detection Results


Text            :   What language is this sentence written in?
Language        :   en
Confidence      :   0.999997

Sentiment Analysis

Next we'll look at analyzing the following short piece of text, "John is a very good football player", to determine its sentiment, i.e. whether it's positive, neutral or negative.


sentimentParams := &textapi.SentimentParams{Text: "John is a very good football player!"}
sentiment, err := client.Sentiment(sentimentParams)
if err != nil {
    panic(err)
}
fmt.Printf("Sentiment Analysis Results\n")
fmt.Printf("Text                     :   %s\n", sentiment.Text)
fmt.Printf("Sentiment Polarity       :   %s\n", sentiment.Polarity)
fmt.Printf("Polarity Confidence      :   %f\n", sentiment.PolarityConfidence)
fmt.Printf("Subjectivity             :   %s\n", sentiment.Subjectivity)
fmt.Printf("Subjectivity Confidence  :   %f\n\n", sentiment.SubjectivityConfidence)

You should receive an output similar to the one shown below which indicates that the sentence is objective and is positive, both with a high degree of confidence.

Sentiment Analysis Results


Text            :   John is a very good football player!
Sentiment Polarity  :   positive
Polarity Confidence  :   0.999999
Subjectivity  : objective
Subjectivity Confidence  :   0.989682

Article Classification

AYLIEN's Classification endpoint automatically assigns an article or piece of text to one or more categories, making it easier to manage and sort. The classification is based on IPTC International Subject News Codes and can identify up to 500 categories. The code below analyses a BBC news article about a one-ton pumpkin ;).


classifyParams := &textapi.ClassifyParams{URL: "http://www.bbc.com/earth/story/20150114-the-biggest-fruit-in-the-world"}
class, err := client.Classify(classifyParams)
if err != nil {
    panic(err)
}
fmt.Printf("Classification Analysis Results\n")
for _, v := range class.Categories {
    fmt.Printf("Classification Label        :   %s\n", v.Label)
    fmt.Printf("Classification Code         :   %s\n", v.Code)
    fmt.Printf("Classification Confidence   :   %f\n\n", v.Confidence)
}

When you run this code you should receive an output similar to that shown below which assigns the article an IPTC label of “natural science – biology” with an IPTC code of 13004008.

Classification Results


Classification Label        :   natural science - biology
Classification Code         :   13004008
Classification Confidence   :   0.929754

Hashtag Suggestion

Next, we’ll have a look at analyzing the same BBC article and extracting hashtag suggestions for it.


hashtagsParams := &textapi.HashtagsParams{URL: "http://www.bbc.com/earth/story/20150114-the-biggest-fruit-in-the-world"}
hashtags, err := client.Hashtags(hashtagsParams)
if err != nil {
    panic(err)
}
fmt.Printf("Hashtag Suggestion Results\n")
for _, v := range hashtags.Hashtags {
    fmt.Printf("%s\n", v)
}

You should receive an output similar to the one below.

Hashtags


Hashtag Suggestion Results
#Carbon
#Sugar
#Squash
#Agriculture
#Juicer
#BBCEarth
#TopsfieldMassachusetts
#ArnoldArboretum
#AtlanticGiant
#HarvardUniversity
#Massachusetts

If Go isn't your weapon of choice then check out our SDKs for Node.js, Ruby, PHP, Python, Java and .NET (C#). For more information regarding our API go to the documentation section of our website.

We will be publishing ‘getting started’ blogs for the remaining languages over the coming weeks so keep an eye out for them.


Introduction

This is our third blog in the “Text Analysis 101; A basic understanding for Business Users” series. The series is aimed at non-technical readers, who would like to get a working understanding of the concepts behind Text Analysis. We try to keep the blogs as jargon free as possible and the formulas to a minimum.

This week’s blog will focus on Topic Modelling. Topic Modelling is an unsupervised Machine Learning (ML) technique. This means that it does not require a training dataset of manually tagged documents from which to learn. It is capable of working directly with the documents in question.

Our first two blogs in the series focused on document classification using both supervised and unsupervised (clustering) methods.

What Topic Modelling is and why it is useful.

As the name suggests, Topic Modelling discovers the abstract topics that occur in a collection of documents. For example, assume that you work for a legal firm and have a large number of documents to consider as part of an eDiscovery process.

As part of the eDiscovery process, we attempt to identify certain topics that we may be interested in and discard topics we have no interest in. However, for the most part we are talking about large volumes of documents, and oftentimes we have no idea which documents are relevant or irrelevant. Topic modelling enables the discovery of the high-level topics that exist in the target documents, and also the degree to which each topic is referred to in each document, i.e. the composition of topics for each document. If the documents are ordered chronologically, topic modelling can also provide insight into how the topics evolve over time.

LDA – A model for "generating" documents

Latent Dirichlet Allocation (LDA) is the name given to a model commonly used for describing the generation of documents. There are a few basic things to understand about LDA:

  1. LDA views each document as a bag of words: imagine taking the words in a document and pouring them into a bag. All of the word order and grammar would be lost, but all of the words would still be present, i.e. if there are twelve occurrences of the word "the" in the document then there will be twelve "the"s in the bag.
  2. LDA also views documents as if they were "generated" by a mixture of topics, i.e. a document might be generated from 50% sports, 20% finance and 30% gossip.
  3. LDA considers that any given topic will have a high probability of generating certain words and a low probability of generating other words. For example, the "Sports" topic will have a high probability of generating words like "football", "basketball" and "baseball", and a low probability of producing words like "kitten", "puppy" and "orangutan". The presence of certain words within a document will, therefore, give an indication of the topics which make up the document.

So in summary, from the LDA view, documents are created by the following process.

  1. Choose the topics from which the document will be generated and the proportion of the document to come from each topic. For example, we could choose the three topics and proportions from above, i.e. 50% sports, 20% finance and 30% gossip.
  2. Generate appropriate words from the topics chosen in the proportions specified.

For example, if our document had 10 words and three topics in proportion 50% sports, 20% finance and 30% gossip, the LDA process might generate the following “bag of words” to make up the document.

baseball dollars fans playing Kardashian pays magazine chat stadium ball

Five of the words (baseball, fans, playing, stadium, ball) are from the sports topic, two (dollars, pays) are from the finance topic and three (Kardashian, magazine, chat) are from the gossip topic.
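
To make this generative view concrete, here is a small Python sketch (a toy illustration with made-up word lists and uniform word probabilities, not a trained model) that "generates" a ten-word bag of words from the three topics and proportions above.

import random

# Made-up word lists for each topic (illustrative only)
topics = {
    "sports":  ["baseball", "fans", "playing", "stadium", "ball"],
    "finance": ["dollars", "pays", "market", "shares"],
    "gossip":  ["Kardashian", "magazine", "chat", "celebrity"],
}
proportions = {"sports": 0.5, "finance": 0.2, "gossip": 0.3}

def generate_document(n_words=10):
    """Generate a 'bag of words' the way LDA assumes documents are produced."""
    words = []
    for _ in range(n_words):
        # 1. pick a topic according to the document's topic proportions
        topic = random.choices(list(proportions), weights=list(proportions.values()))[0]
        # 2. pick a word from that topic (uniformly here, for simplicity)
        words.append(random.choice(topics[topic]))
    return words

print(" ".join(generate_document()))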

Collapsed Gibbs Sampling

We know that LDA assumes documents are bags of words, composed in proportion from the topics that generated the words. Collapsed Gibbs Sampling tries to work backwards to figure out, firstly, the words that belong to each topic and, secondly, the topic proportions that make up each document. Below is an attempt to describe this method in simple terms.

  1. Keep a copy of each document for reference.
  2. Pour all of the words from each document into a bag. The bag will then contain every word from every document, and some words will appear multiple times.
  3. Decide the number of topics (K) that you will divide the documents into and have a bowl for each topic.
  4. Randomly pour the words from the bag into the topic bowls, putting an equal number in each bowl. At this point, we have a first guess at the makeup of words in each topic. It is a completely random guess, so it is not of any practical use yet; it needs to be improved. It is also a first guess at the topic makeup of each document, i.e. you can count the number of words in each document that are from each topic to figure out the proportions of topics that make up the document.

Improving on the first random guess to find the topics.

The Collapsed Gibbs Sampling algorithm works from this first random guess and, over many iterations, discovers the topics. Below is a simplified description of how this is achieved.

For each document in the document set, go through each word one by one and do the following:

  1. For each of our K topics:
    1. Find the percentage of words in the document that were generated from this topic. This gives us an indication of how important the topic (as represented by our current guess of words in the bowl) is to the document, i.e. how much of the document came from the topic.
    2. Find the percentage of the topic that came from this word across all documents. This gives us an indication of how important the word is to the topic.
    3. Multiply the two percentages together; this gives an indication of how likely it is that the topic in question generated this word (a short numeric sketch of this calculation follows the list).
  2. Compare the answers to the multiplication from each topic and move the word to the bowl with the highest answer.
  3. Keep repeating this process over and over again until the words stop moving from bowl to bowl i.e. the topics will have converged into K distinct topics.
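
As a quick numeric sketch of steps 1–3 above (with made-up counts, purely for illustration), here is the score calculation for a single word and a single topic in Python.

# Toy score for the word "ball" in one document, against the "sports" topic.
# All counts below are made up for illustration.
words_in_doc = 10            # total words in the document
doc_words_in_sports = 5      # words of this document currently in the "sports" bowl
ball_in_sports = 30          # occurrences of "ball" currently in the "sports" bowl
words_in_sports_total = 600  # total words currently in the "sports" bowl

pct_of_doc_from_topic = doc_words_in_sports / words_in_doc        # 0.5
pct_of_topic_from_word = ball_in_sports / words_in_sports_total   # 0.05
score = pct_of_doc_from_topic * pct_of_topic_from_word            # 0.025

# Compute the same score for every topic and move "ball" to the bowl
# with the highest score.
print(score)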

At this point we have the words that make up each topic so we can assign a label to the topic i.e. if the topic contains the words dog, cat, tiger, buffalo we would assign the label “Animals” to the topic. Now that we have the words in each topic we can analyse each document or “bag of words” to see what proportion of each topic it was generated from.

We now have the words which make up each topic, a label for each topic, and the topics and proportions within each document, and that's pretty much it. There are two blogs that we used as part of our research which you might want to take a look at: The LDA Buffet by Matthew L. Jockers and An Introduction to LDA by Edwin Chen.
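
If you would like to experiment with topic modelling yourself, the snippet below is a minimal sketch using the open-source gensim library (our choice here purely for illustration; it is not the AYLIEN API) to discover two topics in a handful of made-up documents.

from gensim import corpora, models

# A few toy documents (made up for illustration)
docs = [
    "the fans cheered as the baseball flew out of the stadium",
    "the stock market paid out dollars as shares rose sharply",
    "celebrity gossip magazines chat about the latest scandal",
    "the baseball player signed a contract worth millions of dollars",
]
texts = [d.lower().split() for d in docs]

dictionary = corpora.Dictionary(texts)           # word <-> id mapping
corpus = [dictionary.doc2bow(t) for t in texts]  # each document as a bag of words

# Ask LDA to discover 2 topics in the corpus
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=50, random_state=1)

for topic_id, words in lda.print_topics():
    print(topic_id, words)                 # the words that make up each topic
print(lda.get_document_topics(corpus[0]))  # topic proportions for the first document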

Keep an eye out for more in our “Text Analysis 101” series.


This is the second in our series of blogs on getting started with AYLIEN's various SDKs. There are SDKs available for Node.js, Python, Ruby, PHP, Go, Java and .NET (C#). Last week's blog focused on the Node.js SDK. This week we will focus on Python.

If you are new to AYLIEN and don’t have an account you can take a look at our blog on getting started with the API or alternatively you can go directly to the Getting Started page on the website which will take you through the signup process. We have a free plan available which allows you to make up to 1,000 calls to the API per day for free.

Downloading and Installing the Python SDK

All of our SDK repositories are hosted on GitHub. You can find the Python repository here. The simplest way to install it is with the Python package installer pip. Simply run the following from a command line tool.


$ pip install --upgrade aylien-apiclient

The following libraries will be installed when you install the client library:

  • httplib2

Once you have installed the SDK you're ready to start coding! The Sandbox area of the website has a number of sample applications in Node.js which help to demonstrate what the API can do. For the remainder of this blog we will walk you through making calls to three of the API endpoints using the Python SDK.

    Utilizing the SDK with your AYLIEN credentials

    Once you have received your AYLIEN credentials and have downloaded the SDK you can start making calls by adding the following code to your python script.

    
    from aylienapiclient import textapi
    c = textapi.Client("YourApplicationID", "YourApplicationKey")
    

    When calling the various endpoints you can specify a piece of text directly for analysis or you can pass a URL linking to the text or article you wish to analyze.

    Language Detection

    First let's take a look at the language detection endpoint. We're going to detect the language of the sentence "What language is this sentence written in?"

    You can do this by simply running the following piece of code:

    
    language = c.Language({'text': 'What language is this sentence written in?'})
    print("Language Detection Results:\n\n")
    print("Language: %s\n" % (language["lang"]))
    print("Confidence: %f\n" % (language["confidence"]))
    

    You should receive an output very similar to the one shown below. It shows that the language detected was English, and that the confidence it was detected correctly (a number between 0 and 1) is very close to 1, indicating that you can be pretty sure it is correct.

    Language Detection Results:

    
    Language	: en
    Confidence	: 0.9999984486883192
    

    Sentiment Analysis

    Next we will look at analyzing the sentence "John is a very good football player" to determine its sentiment, i.e. whether it's positive, neutral or negative. The API will also determine if the text is subjective or objective. You can call the endpoint with the following piece of code.

    
    sentiment = c.Sentiment({'text': 'John is a very good football player!'})
    print("\n\nSentiment Analysis Results:\n\n")
    print("Polarity: %s\n" % sentiment["polarity"])
    print("Polarity Confidence: %s\n" % (sentiment["polarity_confidence"]))
    print("Subjectivity: %s\n" % sentiment["subjectivity"])
    print("Subjectivity Confidence: %s\n" % (sentiment["subjectivity_confidence"]))
    

    You should receive an output similar to the one shown below which indicates that the sentence is positive and objective, both with a high degree of confidence.

    Sentiment Analysis Results:

    
    Polarity: positive
    Polarity Confidence: 0.9999988272764874
    Subjectivity: objective
    Subjectivity Confidence: 0.9896821594138254
    

    Article Classification

    Next we will take a look at the classification endpoint. The Classification Endpoint automatically assigns an article or piece of text to one or more categories making it easier to manage and sort. The classification is based on IPTC International Subject News Codes and can identify up to 500 categories. The code below analyzes a BBC news article about the first known picture of an oceanic shark giving birth.

    
    category = c.Classify({'url': 'http://www.bbc.com/news/science-environment-30747971'})
    print("\nArticle Classification Results:\n\n")
    print("Label	: %s\n" % category["categories"][0]["label"])
    print("Code	: %s\n" % category["categories"][0]["code"])
    print("Confidence	: %s\n" % category["categories"][0]["confidence"])
    
    

    When you run this code you should receive an output similar to that shown below, which assigns the article an IPTC label of “science and technology – animal science” with an IPTC code of 13012000.

    Article Classification Results:

    
    Label   : science and technology - animal science
    Code    : 13012000
    Confidence      : 0.9999999999824132
    

    If Python is not your preferred language then check out our SDKs for Node.js, Ruby, PHP, Go, Java and .NET (C#). For more information regarding the APIs go to the documentation section of our website.

    We will be publishing ‘getting started’ blogs for the remaining languages over the coming weeks so keep an eye out for them. If you haven’t already done so, you can get free access to our API on our sign up page.

    Happy Hacking!


    If you're a regular reader of our blog, you will have heard us mention Time to First Hello World (TTFHW) quite a bit. It's all part of our focus on Developer Experience and our efforts to make our APIs as easy as possible to use and integrate with.

    In line with this initiative, in late 2014 we launched Software Development Kits for Ruby, PHP, Node.js and Python and we promised to add more SDKs for other popular languages early in the new year. The idea behind the SDKs was to make it as easy as possible for our users to get up and running with the API and making calls as quickly as possible.

    We're happy to announce that AYLIEN Text Analysis SDKs are now available for Java, C# and Go. We know there were a few of our users waiting on the Java SDK, so we're particularly happy to now offer it among the others. You can download them directly from our AYLIEN GitHub repository.

    If you have requests for features or improvements to our API, or our other product offerings, make sure you let us know about them. Also, if you haven't played with it yet, check out our Developer Sandbox. It's a Text Analysis playground for developers: a place you can go to test the API, fiddle with ideas and build the foundations of your Text Analysis service. Happy Hacking!


    Introduction

    This is our second blog on harnessing Machine Learning (ML) in the form of Natural Language Processing (NLP) for the Automatic Classification of documents. By classifying text, we aim to assign a document or piece of text to one or more classes or categories making it easier to manage or sort. A Document Classifier often returns or assigns a category “label” or “code” to a document or piece of text. Depending on the Classification Algorithm or strategy used, a classifier might also provide a confidence measure to indicate how confident it is that the result is correct.

    In our first blog, we looked at a supervised method of Document Classification. In supervised methods, Document Categories are predefined by using a training dataset with manually tagged documents. A classifier is then trained on the manually tagged dataset so that it will be able to predict any given Document’s Category from then on.

    In this blog, we will focus on Unsupervised Document Classification. Unsupervised ML techniques differ from supervised ones in that they do not require a training dataset and, in the case of documents, the categories are not known in advance. For example, let's say we have a large number of emails that we want to analyze as part of an eDiscovery process. We may have no idea what the emails are about or what topics they deal with, and we want to automatically discover the most common topics present in the dataset. Unsupervised techniques such as clustering can be used to automatically discover groups of similar documents within a collection of documents.

     An Overview of Document Clustering

    Document Clustering is a method for finding structure within a collection of documents, so that similar documents can be grouped into categories. The first step in the Clustering process is to create word vectors for the documents we wish to cluster. A vector is simply a numerical representation of the document, where each component of the vector refers to a word, and the value of that component indicates the presence or importance of that word in the document. The distance matrix between these vectors is then fed to algorithms, which group similar vectors together into clusters. A simple example will help to illustrate how documents might be transformed into vectors.

    A simple example of transforming documents into vectors

    Using the words within a document as the “features” that describe a document, we need to find a way to represent these features numerically as a vector. As we did in our first blog in the series we will consider three very short documents to illustrate the process.

    Document 1: "Some tigers live in the zoo"
    Document 2: "Green is a color"
    Document 3: "He has gone to New York"

    We start by taking all of the words across the three documents in our training set and create a table or vector from these words.

    <some,tigers,live,in,the,zoo,green,is,a,color,he,has,gone,to,New,York>

    Then for each of the documents, we create a vector by assigning a 1 if the word exists in the document and a 0 if it doesn’t. In the table below each row is a vector describing a single document.

    "Some tigers live in the zoo"  ->  <1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0>
    "Green is a color"             ->  <0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0>
    "He has gone to New York"      ->  <0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1>
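
    If you would like to reproduce vectors like these in code, the sketch below uses scikit-learn's CountVectorizer (an illustration on our part; it is not part of the AYLIEN API). Note that scikit-learn's default tokenizer drops single-character words such as "a", so the vocabulary will differ slightly from the hand-built vector above.

    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "Some tigers live in the zoo",
        "Green is a color",
        "He has gone to New York",
    ]

    # binary=True records presence/absence (1/0) rather than raw word counts
    vectorizer = CountVectorizer(binary=True)
    vectors = vectorizer.fit_transform(docs)

    print(vectorizer.get_feature_names_out())  # the vocabulary across all documents
    print(vectors.toarray())                   # one row (vector) per document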

    Preprocessing the data

    As we described in our blog on Supervised Methods of Classification, it is likely that some preprocessing of the data would be needed prior to creating the vectors. In our simple example, we have given equal importance (a value of 1) to each and every word when creating Document Vectors, and no word appears more than once. To improve the accuracy, we could give different weightings to words based on their importance to the document in question and their frequency within the document set as a whole. A common methodology used to do this is TF-IDF (term frequency – inverse document frequency). The TF-IDF weighting for a word increases with the number of times the word appears in the document but decreases based on how frequently the word appears across the entire document set. This has the effect of giving a lower overall weighting to words which occur more frequently in the document set, such as "a", "it", etc.
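
    As a rough sketch of what TF-IDF weighting looks like in practice, the same three documents can be run through scikit-learn's TfidfVectorizer (again an illustration on our part, not the AYLIEN implementation). With such a tiny document set the effect is small, but on a real corpus common words receive noticeably lower weights.

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "Some tigers live in the zoo",
        "Green is a color",
        "He has gone to New York",
    ]

    # Each word gets a weight that grows with its frequency in the document
    # and shrinks with how many documents in the set contain it.
    tfidf = TfidfVectorizer()
    weighted = tfidf.fit_transform(docs)

    print(tfidf.get_feature_names_out())
    print(weighted.toarray().round(2))  # one weighted vector per document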

    Clustering Algorithms

    In the graph below each "dot" is a vector which represents a document. The graph shows the output from a clustering algorithm, with an X marking the center of each cluster (known as a 'centroid'). In this case the vectors have only two features (or dimensions), so they can easily be plotted on a two-dimensional graph as shown below.

    K-Means Clustering Algorithm output example:

    Source: http://blog.mpacula.com/2011/04/27/k-means-clustering-example-python/

    Two extreme cases to illustrate the concept of discovering the clusters

    If we want to group the vectors together into clusters, we first need to look at the two extreme cases to illustrate how it can be done. Firstly, we assume that there is only one cluster and that all of the document vectors belong in this cluster. This is a very simple approach which is not very useful when it comes to managing or sorting the documents effectively.

    The second extreme case is to decide that each document is a cluster by itself, so that if we had N documents we would have N clusters. Again, this is a very simple solution with not much practical use.

    Finding the K clusters from the N Document Vectors

    Ideally, from N documents we want to find K distinct clusters that separate the documents into useful and meaningful categories. There are many clustering algorithms available to help us achieve this. For this blog, we will look at the k-means algorithm in more detail to illustrate the concept.

    How many clusters (K)?

    One simple rule of thumb for deciding the optimum number of clusters (K) to have is:

    K = sqrt(N/2).

    There are many more methods of finding K which you can read about here.

    Finding the Clusters

    Again, there are many ways we can find clusters. To illustrate the concept, we'll look at the steps used in one popular method, the k-means algorithm:

    1. Find the value of K using our simple rule of thumb above.
    2. Randomly assign each of the K cluster centroids throughout the dataset.
    3. Assign each data point to the cluster whose centroid is closest to it.
    4. Recompute the centroid location for each cluster as an average of the vector points within the cluster (this will find the new “center” of the cluster).
    5. Reassign each vector data point to the centroid closest to it, i.e. some will now switch from one cluster to another as the centroids' positions have changed.
    6. Repeat steps 4 and 5 until none of the data points switch centroids, i.e. the clusters have "converged".

    That's pretty much it, you now have your N documents assigned to K clusters! If you have difficulty visualising the steps above, watch this excellent video tutorial by Viktor Lavrenko of the University of Edinburgh, which explains it in more depth.
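
    Putting the pieces together, here is a minimal end-to-end sketch (toy documents and scikit-learn, used purely for illustration rather than the AYLIEN API) that vectorizes a handful of documents with TF-IDF, picks K using the rule of thumb above and clusters them with k-means.

    from math import sqrt

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Toy documents (made up for illustration)
    docs = [
        "Some tigers live in the zoo",
        "Tigers and lions are big cats",
        "Green is a color",
        "Red and green are colors",
        "He has gone to New York",
        "New York is a big city",
    ]

    vectors = TfidfVectorizer().fit_transform(docs)

    # Rule of thumb: K = sqrt(N/2); with 6 documents this gives K = 2
    K = max(2, round(sqrt(len(docs) / 2)))

    labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(vectors)
    for label, doc in zip(labels, docs):
        print(label, doc)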

    Keep an eye out for more in our “Text Analysis 101” series. The next blog will look at how Topic Modelling is performed.


    We recently developed and released SDKs for Node.js, Python, Ruby and PHP, with Java, Go and C# coming out next week. This is the first blog in a series on using AYLIEN's various Software Development Kits (SDKs), and for today's blog we are going to focus on the Node.js SDK.

    If you are new to AYLIEN and do not yet have an account you can take a look at our blog on getting started with the API or alternatively you can go directly to the Getting Started page on the website which will take you through the signup process. We have a free plan to get you started which allows you to make up to 1,000 calls per day to the API for free.

    Downloading and Installing the Node.js SDK

    All of our SDK repositories are hosted on GitHub. The simplest way to install the repository is with the node package manager "npm", by typing the following from the command line.

    $ npm install aylien_textapi

    Once you have installed the SDK you are ready to start coding! The Sandbox area of the website has a number of sample applications available which you can use to get things moving.

    For this guide in particular we are going to walk you through some of the basic functionality of the API incorporating the “Basic Functions” sample app from the Sandbox. We’ll illustrate making a call to three of the endpoints individually and interpret the output that you should receive in each case.

    Accessing the SDK with your AYLIEN credentials

    Once you have received your AYLIEN credentials from the signup process and have downloaded the SDK you can begin making calls by adding the following code to your node.js application.

    var AYLIENTextAPI = require('aylien_textapi');
    var textapi = new AYLIENTextAPI({
      application_id:"YOUR_APP_ID",
      application_key: "YOUR_APP_KEY"
    });

    When calling the various endpoints you can specify whether your input is a piece of text or you can pass a URL linking to the text or article you wish to analyze.

    Language Detection

    First let's take a look at the language detection endpoint. The Language Detection endpoint is pretty straightforward: it is used to detect the language of a sentence. In this case we are analyzing the following sentence: "What language is this sentence written in?"

    You can call the endpoint using the following piece of code.

    textapi.language('What language is this sentence written in?', function(err, resp) {
      console.log("\nLanguage Detection Results:\n");
      if (err !== null) {
        console.log("Error: " + err);
      } else {
        console.log("Language	:", resp.lang);
        console.log("Confidence	:", resp.confidence);
      }
    });
    

    You should receive an output very similar to the one below which shows that the language detected was English. It also shows a confidence score (a number between 0 and 1) of close to 1, indicating that you can be pretty sure English is correct.

    Language Detection Results:
    
    Language	: en
    Confidence	: 0.9999984486883192

    Sentiment Analysis

    Next we'll look at analyzing the sentence "John is a very good football player" to determine its sentiment, i.e. whether it's positive, neutral or negative. The endpoint will also determine if the text is subjective or objective. You can call the endpoint with the following piece of code.

    textapi.sentiment('John is a very good football player!', function(err, resp) {
      console.log("\nSentiment Analysis Results:\n");
      if (err !== null) {
        console.log("Error: " + err);
      } else {
        console.log("Subjectivity	:", resp.subjectivity);
        console.log("Subjectivity Confidence	:", resp.subjectivity_confidence);
        console.log("Polarity	:", resp.polarity);
        console.log("Polarity Confidence	:", resp.polarity_confidence);
      }
    });

    You should receive an output similar to the one shown below which indicates that the sentence is objective and is positive, both with a high degree of confidence.

    Sentiment Analysis Results:
    
    Subjectivity	: objective
    Subjectivity Confidence	: 0.9896821594138254
    Polarity	: positive
    Polarity Confidence	: 0.9999988272764874

    Article Classification

    Next we will take a look at the classification endpoint. The Classification endpoint automatically assigns an article or piece of text to one or more categories making it easier to manage and sort. The classification is based on IPTC International Subject News Codes and can identify up to 500 categories. The code below analyzes a BBC news article about the Philae Lander which found organic molecules on the surface of a comet.

    textapi.classify('http://www.bbc.com/news/science-environment-30097648', function(err, resp) {
      console.log("\nArticle Classification Results:\n");
      if (err !== null) {
        console.log("Error: " + err);
      } else {
        for (var i = 0; i < resp.categories.length; i++) {
          console.log("Label	:", resp.categories[i].label);
          console.log("Code	:", resp.categories[i].code);
          console.log("Confidence	:", resp.categories[i].confidence);
        }
      }
    });
    

    When you run this code you should receive an output similar to that shown below which assigns the article an IPTC label of “science and technology – space program” with an IPTC code of 13008000.

    Article Classification Results:
    
    Label	: science and technology - space programme
    Code	: 13008000
    Confidence	: 0.9999999983009931

    Now that you have seen how simple it is to access the power of Text Analysis through the SDK, jump over to our Sandbox and start playing with the sample apps. If Node.js is not your preferred language then check out our SDKs for Python, Ruby and PHP on our website. We will be publishing 'getting started' blogs for each of these languages over the coming weeks so keep an eye out for those too. If you haven't already done so, you can get free access to our API on our sign up page.
