
Here at AYLIEN we have a team of researchers who like to keep abreast of, and regularly contribute to, the latest developments in the field of Natural Language Processing. Recently, one of our research scientists, Sebastian Ruder, attended EMNLP 2016 in Austin, Texas. In this post, Sebastian has highlighted some of the stand-out papers and trends from the conference.


I spent the past week in Austin, Texas at EMNLP 2016, the Conference on Empirical Methods in Natural Language Processing.

There were a lot of papers at the conference (179 long papers, 87 short papers, and 9 TACL papers in all) – too many to read every single one. The entire program can be found here. In the following, I will highlight some trends and papers that caught my eye:

#### Reinforcement learning

One thing that stood out was that RL seems to be slowly finding its footing in NLP, with more and more people using it to solve complex problems.

#### Dialogue

Dialogue was a focus of the conference with all of the three keynote speakers dealing with different aspects of dialogue: Christopher Potts talked about pragmatics and how to reason about the intentions of the conversation partner; Stefanie Tellex concentrated on how to use dialogue for human-robot collaboration; finally, Andreas Stolcke focused on the problem of addressee detection in his talk.

Among the papers, a few that dealt with dialogue stood out:

• Andreas and Klein model pragmatics in dialogue with neural speakers and listeners;
• Liu et al. show how not to evaluate your dialogue system;
• Ouchi and Tsuboi select addressees and responses in multi-party conversations;
• Wen et al. study diverse architectures for dialogue modelling.

#### Sequence-to-sequence

Seq2seq models were again front and center. It is rare for a method to have its own session just two years after its introduction (Sutskever et al., 2014). While in past years many papers employed seq2seq, e.g. for Neural Machine Translation, several papers this year focused on improving the seq2seq framework itself:

#### Semantic parsing

While seq2seq's use for dialogue modelling was popularised by Vinyals and Le, it is harder to get it to work with goal-oriented tasks that require an intermediate representation on which to act. Semantic parsing converts a message into a more meaningful representation that can be used by another component of the system. As this technique is useful for sophisticated dialogue systems, it is great to see progress in this area:

#### X-to-text (or natural language generation)

While mapping from text-to-text with the seq2seq paradigm is still prevalent, EMNLP featured some cool papers on natural language generation from other inputs.

#### Parsing

Parsing and syntax are a mainstay of every NLP conference and the community seems to particularly appreciate innovative models that push the state-of-the-art in parsing: The ACL ’16 outstanding paper by Andor et al. introduced a globally normalized model for parsing, while the best EMNLP ‘16 paper by Lee et al. combines a global parsing model with a local search over subtrees.

#### Word embeddings

There were still papers on word embeddings, but it felt less overwhelming than at past EMNLP or ACL conferences, with most methods trying to fix a particular flaw rather than training embeddings for embeddings' sake. Pilehvar and Collier de-conflate senses in word embeddings, while Wieting et al. achieve state-of-the-art results for character-based embeddings.

#### Sentiment analysis

Sentiment analysis has been popular in recent years (as attested by the introductions of many recent papers on sentiment analysis). Sadly, many of the conference papers on sentiment analysis reduce to leveraging the latest deep neural network for the task to beat the previous state-of-the-art without providing additional insights. There are, however, some that break the mold: Teng et al. find an effective way to incorporate sentiment lexicons into a neural network, while Hu et al. incorporate structured knowledge into their sentiment analysis model.

#### Deep Learning

By now, it is clear to everyone: Deep Learning is here to stay. In fact, deep learning and neural networks claimed the two top spots of keywords that were used to describe the submitted papers. The majority of papers used at least an LSTM; using no neural network seems almost contrarian now and is something that needs to be justified. However, there are still many things that need to be improved — which leads us to…

#### Uphill Battles

While making incremental progress is important to secure grants and publish papers, we should not lose track of the long-term goals. In this spirit, one of the best workshops that I’ve attended was the Uphill Battles in Language Processing workshop, which featured 12 talks and not one, but four all-star panels on text understanding, natural language generation, dialogue and speech, and grounded language. Summaries of the panel discussions should be available soon at the workshop website.

This was my brief review of some of the trends of EMNLP 2016. I hope it was helpful.

### Introduction

Deep Learning is a new area of Machine Learning research that has been gaining significant media interest owing to the role it is playing in artificial intelligence applications like image recognition, self-driving cars and most recently the AlphaGo vs. Lee Sedol matches. Recently, Deep Learning techniques have become popular in solving traditional Natural Language Processing problems like Sentiment Analysis.

For those of you that are new to the topic of Deep Learning, we have put together a list of ten common terms and concepts explained in simple English, which will hopefully make them a bit easier to understand. We’ve done the same in the past for Machine Learning and NLP terms, which you might also find interesting.

### Perceptron

In the human brain, a neuron is a cell that processes and transmits information. A perceptron can be considered as a super-simplified version of a biological neuron.

A perceptron will take several inputs and weigh them up to produce a single output. Each input is weighted according to its importance in the output decision.
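To make this concrete, here is a minimal perceptron sketch in JavaScript; the function name, weights, bias and the AND-gate example are our own illustrative choices, not learned values:

```javascript
// A minimal perceptron: a weighted sum of the inputs plus a bias,
// passed through a step function to produce a single binary output.
function perceptron(inputs, weights, bias) {
  // Weigh each input according to its importance and sum the results.
  const sum = inputs.reduce((acc, x, i) => acc + x * weights[i], bias);
  // "Fire" (output 1) only if the weighted sum crosses zero.
  return sum > 0 ? 1 : 0;
}

// Illustrative values that make the perceptron behave like a logical AND gate.
const weights = [1, 1];
const bias = -1.5;
console.log(perceptron([1, 1], weights, bias)); // 1
console.log(perceptron([1, 0], weights, bias)); // 0
```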

### Artificial Neural Networks

Artificial Neural Networks (ANN) are models influenced by biological neural networks such as the central nervous systems of living creatures and most distinctly, the brain.

ANNs are processing devices, such as algorithms or physical hardware, and are loosely modeled on the cerebral cortex of mammals, albeit on a considerably smaller scale.

Let’s call them a simplified computational model of the human brain.

### Backpropagation

A neural network learns by training, using an algorithm called backpropagation. To train a neural network, it is first given an input, which produces an output. The first step is to teach the network what the correct, or ideal, output should have been for that input. The ANN can then compare its actual output with this ideal output and begin adapting the weights (based on how much each contributed to the overall prediction) to yield a more precise output the next time it receives a similar input.

This process is repeated many times, until the margin of error between the actual output and the ideal output is considered acceptable.
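As a rough illustration of that loop (a single sigmoid neuron, so the error flows back through just one step rather than many layers), here is a toy JavaScript sketch; the learning rate, iteration count and training pair are arbitrary values of our own:

```javascript
// Toy training loop: repeatedly produce an output, compare it with the
// ideal output, and nudge the weight and bias to reduce the error.
const sigmoid = z => 1 / (1 + Math.exp(-z));

let w = 0.5, b = 0.0;           // initial guesses for weight and bias
const input = 1.0, ideal = 0.0; // training pair: input and ideal output
const rate = 0.5;               // learning rate

for (let i = 0; i < 1000; i++) {
  const out = sigmoid(w * input + b);   // forward pass: produce an output
  const error = out - ideal;            // how far from the ideal output?
  const grad = error * out * (1 - out); // how much this weight contributed
  w -= rate * grad * input;             // adapt the weight...
  b -= rate * grad;                     // ...and bias for next time
}

console.log(sigmoid(w * input + b)); // now much closer to the ideal output of 0
```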

### Convolutional Neural Networks

A convolutional neural network (CNN) can be considered as a neural network that utilizes numerous identical replicas of the same neuron. The benefit of this is that it enables a network to learn a neuron once and use it in numerous places, simplifying the model learning process and thus reducing error. This has made CNNs particularly useful in the area of object recognition and image tagging.

CNNs learn more and more abstract representations of the input with each convolution. In the case of object recognition, a CNN might start with raw pixel data, then learn highly discriminative features such as edges, followed by basic shapes, complex shapes, patterns and textures.


### Recurrent Neural Network

Recurrent Neural Networks (RNN) make use of sequential information. Unlike traditional neural networks, where it is assumed that all inputs and outputs are independent of one another, RNNs are reliant on preceding computations and what has previously been calculated. RNNs can be conceptualized as a neural network unrolled over time. Where you would have different layers in a regular neural network, you apply the same layer to the input at each timestep in an RNN, using the output (i.e. the state of the previous timestep) as input. Connections between units in an RNN form a directed cycle, creating a sort of internal memory that helps the model leverage long chains of dependencies.
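A tiny sketch of that "same layer at every timestep" idea in JavaScript; the scalar weights and the input sequence are made-up toy values:

```javascript
// One recurrent step: combine the current input with the previous state,
// using the SAME weights at every timestep.
function rnnStep(x, prevState, Wx, Wh, b) {
  // The new state depends on the current input and what was previously calculated.
  return Math.tanh(Wx * x + Wh * prevState + b);
}

// Unroll the network over a sequence, carrying the state forward in time.
function runRNN(sequence, Wx, Wh, b) {
  let state = 0; // initial internal memory
  for (const x of sequence) {
    state = rnnStep(x, state, Wx, Wh, b);
  }
  return state;
}

console.log(runRNN([1, 0, 1], 0.8, 0.5, 0)); // final state after the whole sequence
```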

### Recursive Neural Network

A Recursive Neural Network is a generalization of a Recurrent Neural Network and is generated by applying a fixed and consistent set of weights repetitively, or recursively, over the structure. Recursive Neural Networks take the form of a tree, while Recurrent is a chain. Recursive Neural Nets have been utilized in Natural Language Processing for tasks such as Sentiment Analysis.

### Supervised Neural Network

For a supervised neural network to produce an ideal output, it must have been previously given this output. It is 'trained' on a pre-defined dataset and, based on this dataset, can produce accurate outputs depending on the input it has received. You could therefore say that it has been supervised in its learning, having for example been given both the question and the ideal answer.

### Unsupervised Neural Network

This involves providing a program or machine with an unlabeled dataset that it has not previously been trained on, with the goal of automatically discovering patterns and trends through clustering.

### Gradient Descent

Gradient Descent is an algorithm used to find the local minimum of a function. By initially guessing the solution and using the function gradient at that point, we guide the solution in the negative direction of the gradient and repeat this technique until the algorithm eventually converges at the point where the gradient is zero – the local minimum. We essentially descend the error surface until we arrive at a valley.
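For example, here is a minimal JavaScript sketch that descends a simple one-dimensional function, f(x) = (x - 3)^2, whose gradient is 2(x - 3); the starting point, step size and iteration count are illustrative choices:

```javascript
// Gradient of f(x) = (x - 3)^2, which has its minimum at x = 3.
const gradient = x => 2 * (x - 3);

let x = 0;        // initial guess of the solution
const rate = 0.1; // step size

for (let i = 0; i < 100; i++) {
  x -= rate * gradient(x); // step in the negative direction of the gradient
}

console.log(x); // converges to 3, where the gradient is zero (the valley)
```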

### Word Embedding

Similar to the way a painting might be a representation of a person, a word embedding is a representation of a word, using real-valued numbers. Word embeddings can be trained and used to derive similarities and relations between words. They are an arrangement of numbers representing the semantic and syntactic information of words in a format that computers can understand.

Word vectors created through this process manifest interesting characteristics that almost look and sound like magic at first. For instance, if we subtract the vector of Man from the vector of King, the result will be almost equal to the vector resulting from subtracting Woman from Queen. Even more surprisingly, the result of subtracting Run from Running almost equates to that of Seeing minus See. These examples show that the model has not only learnt the meaning and the semantics of these words, but also the syntax and the grammar to some degree.
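Here is a toy JavaScript illustration of that vector arithmetic; the 3-dimensional "embeddings" below are hand-crafted by us so the analogy holds exactly, whereas real embeddings have hundreds of dimensions and are learned from text:

```javascript
// Hand-crafted toy word vectors (NOT learned embeddings).
const vec = {
  king:  [0.9, 0.8, 0.1],
  man:   [0.5, 0.1, 0.1],
  woman: [0.5, 0.1, 0.9],
  queen: [0.9, 0.8, 0.9],
};

// Element-wise vector subtraction.
const subtract = (a, b) => a.map((v, i) => v - b[i]);

// King - Man comes out identical to Queen - Woman:
console.log(subtract(vec.king, vec.man));
console.log(subtract(vec.queen, vec.woman));
```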

So there you have it – some pretty technical deep learning terms explained in simple English. We hope this helps you get your head around some of the tricky terms you might come across as you begin to explore deep learning.

We help our users understand and classify content so they can extract insight from it. Being able to classify and tag content like news articles, blogs and web pages allows our users to manage and categorize content effectively and more importantly at scale. Up until now we’ve offered two forms of content classification/categorization, one based on IPTC Subject Codes specifically useful for our news and media customers and the second, a flexible tagging feature based on Semantic Labeling for those who wish to apply custom labels to text.

From today, however, we'll offer a third classification feature, one focused on providing an advertising-oriented classification method. This allows our Ad Tech users to tag and categorize text based on Interactive Advertising Bureau (IAB) standards.

We're super excited about our IAB classification feature, which categorizes content based on the IAB Quality Assurance Guidelines. It automatically categorizes text into hierarchical groups based on the IAB quality assurance guideline taxonomy, providing easily referenceable and usable tags, examples of which you can see below.

IAB QAG Taxonomy

The IAB QAG contextual taxonomy was developed by the IAB in conjunction with taxonomy experts from academia, ad measurement companies, and members of the IAB Networks & Exchanges Committee in order to define content categories on at least two different tiers, making content classification a lot more consistent across the advertising industry. The first tier is a broad-level category and the second a more detailed description – a root-and-leaf structure.

Example Article:

Results:

"categories": [
  {
    "leaf": {
      "confidence": 0.07787707145827048,
      "id": "IAB2-10",
      "label": "Automotive>Electric Vehicle"
    },
    "root": {
      "confidence": 0.6789603849779564,
      "id": "IAB2",
      "label": "Automotive"
    }
  }
]

IAB classification was the most requested feature addition we’ve had over the last 6 months. As more and more companies invest in online advertising, publishers, agencies, ad networks and brands all want to be sure their ads are being displayed in the best place possible.

More Accurate Content Tagging = Better Ad Targeting

Using automatically generated IAB-certified labels means our users can intelligently categorize large amounts of content, retrospectively or in near real time. These tags can then be used to improve how content is managed and where ads are placed, using semantic/contextual targeting powered by the IAB-approved taxonomy, ensuring ad impressions are displayed in the right place at the right time.

Building our classifier on the IAB QAG taxonomy means it is a lot easier for our users to build solutions and applications that conform to industry standards and integrate well with Ad Tech solutions like Open RTB, ad exchanges and platforms.

We’ve also updated our SDKs to make it quick and easy to get up and running. Check out our live IAB demo or visit our documentation to see how easy it is to start classifying text according to the IAB guidelines.

Most of our users make 3 or more calls to our API for every piece of text or URL they analyze. For example, if you're a publisher who wants to extract insight from an article or URL, it's likely you'll want to use more than one of our features to get a proper understanding of that particular article or URL.

With this in mind, we decided to make it faster, easier and more efficient for our users to run multiple analysis operations in one single call to the API.

Our Combined Calls endpoint, allows you to run more than one type of analysis on a piece of text or URL without having to call each endpoint separately.

• Run multiple operations at once
• Speed up your analysis process
• Write cleaner, more efficient code

### Combined Calls

To showcase how useful the Combined Calls endpoint can be, we've run a typical process that a lot of our news and media focused users would use when analyzing URLs or articles on news sites.

In this case, we’re going to Classify the article in question and extract any Entities and Concepts present in the text. To run a process like this would typically involve passing the same URL to the API 3 times, once for each analysis operation and following that, retrieving 3 separate results relevant to each operation. However, with Combined Calls, we’re only making 1 call to the API and retrieving 1 set of results, making it a lot more efficient and cleaner for the end user.

Code Snippet:

var AYLIENTextAPI = require('aylien_textapi');
var textapi = new AYLIENTextAPI({
  application_id: "APP_ID",
  application_key: "APP_KEY"
});

textapi.combined({
  "url": "http://www.bbc.com/news/technology-33764155",
  "endpoint": ["entities", "concepts", "classify"]
}, function(err, result) {
  if (err === null) {
    console.log(JSON.stringify(result));
  } else {
    console.log(err);
  }
});


The code snippet above was written using our Node.js SDK. SDKs are available for a variety of languages on our SDKs page.

### Results

We've broken down the results below into three sections: Entities, Concepts and Classification, to help with readability. Using the Combined Calls endpoint, all of these results would be returned together.

#### Entities:

{
"results": [
{
"endpoint": "entities",
"result": {
"entities": {
"keyword": [
"internet servers",
"flaw in the internet",
"internet users",
"server software",
"exploits of the flaw",
"internet",
"System (DNS) software",
"servers",
"flaw",
"expert",
"vulnerability",
"systems",
"software",
"exploits",
"users",
"websites",
"offline",
"URLs",
"services"
],
"organization": [
"DNS",
"BBC"
],
"person": [
"Daniel Cid",
"Brian Honan"
]
},
"language": "en"
}
},


#### Concepts:

{
"endpoint": "concepts",
"result": {
"concepts": {
"http://dbpedia.org/resource/Apache": {
"support": 3082,
"surfaceForms": [
{
"offset": 1261,
"score": 0.9726336488480631,
"string": "Apache"
}
],
"types": [
"http://dbpedia.org/ontology/EthnicGroup"
]
},
"http://dbpedia.org/resource/BBC": {
"support": 61289,
"surfaceForms": [
{
"offset": 1108,
"score": 0.9997923194235071,
"string": "BBC"
}
],
"types": [
"http://dbpedia.org/ontology/Agent",
"http://schema.org/Organization",
"http://dbpedia.org/ontology/Organisation",
"http://dbpedia.org/ontology/Company"
]
},
"http://dbpedia.org/resource/Denial-of-service_attack": {
"support": 503,
"surfaceForms": [
{
"offset": 264,
"score": 0.9999442627824017,
"string": "denial-of-service attacks"
}
],
"types": [
""
]
},
"http://dbpedia.org/resource/Domain_Name_System": {
"support": 1279,
"surfaceForms": [
{
"offset": 442,
"score": 1,
"string": "Domain Name System"
},
{
"offset": 462,
"score": 0.9984593397878601,
"string": "DNS"
}
],
"types": [
""
]
},
"http://dbpedia.org/resource/Hacker_(computer_security)": {
"support": 1436,
"surfaceForms": [
{
"offset": 0,
"score": 0.7808308562314218,
"string": "Hackers"
},
{
"offset": 246,
"score": 0.9326746054676964,
"string": "hackers"
}
],
"types": [
""
]
},
"http://dbpedia.org/resource/Indian_School_Certificate": {
"support": 161,
"surfaceForms": [
{
"offset": 794,
"score": 0.7811847159512098,
"string": "ISC"
}
],
"types": [
""
]
},
"http://dbpedia.org/resource/Internet_Systems_Consortium": {
"support": 35,
"surfaceForms": [
{
"offset": 765,
"score": 1,
"string": "Internet Systems Consortium"
}
],
"types": [
"http://dbpedia.org/ontology/Agent",
"http://schema.org/Organization",
"http://dbpedia.org/ontology/Organisation",
"http://dbpedia.org/ontology/Non-ProfitOrganisation"
]
},
"http://dbpedia.org/resource/OpenSSL": {
"support": 105,
"surfaceForms": [
{
"offset": 1269,
"score": 1,
"string": "OpenSSL"
}
],
"types": [
"http://schema.org/CreativeWork",
"http://dbpedia.org/ontology/Work",
"http://dbpedia.org/ontology/Software"
]
}
},
"language": "en"
}
},


#### Classification:

{
"endpoint": "classify",
"result": {
"categories": [
{
"code": "04003005",
"confidence": 1,
"label": "computing and information technology - software"
}
],
"language": "en"
}
}


You can find more information on using Combined Calls in our Text Analysis Documentation.

We should also point out that the existing rate limits will also apply when using Combined Calls. You can read more about our rate limits here.

## Introduction

This is the second edition of our NLP terms explained blog posts. The first edition deals with some simple terms and NLP tasks, while this edition gets a little bit more complicated. Again, we've just chosen some common terms at random and tried to break them down in simple English to make them a bit easier to understand.

#### Part of Speech tagging (POS tagging)

Sometimes referred to as grammatical tagging or word-category disambiguation, part of speech tagging refers to the process of determining the part of speech for each word in a given sentence based on the definition of that word and its context. Many words, especially common ones, can serve as multiple parts of speech. For example, “book” can be a noun (“the book on the table”) or verb (“to book a flight”).

#### Parsing

Parsing is a major task of NLP. It's focused on determining the grammatical analysis, or parse tree, of a given sentence. There are two forms of parse trees: constituency-based and dependency-based.

#### Semantic Role Labeling

This is an important step towards making sense of the meaning of a sentence. It focuses on detecting the semantic arguments associated with a verb or verbs in a sentence and classifying those arguments into specific roles.

#### Machine Translation

A sub-field of computational linguistics, MT investigates the use of software to translate text or speech from one language to another.

#### Statistical Machine Translation

SMT is one of a few different approaches to Machine Translation. A common task in NLP, it relies on statistical methods built on bilingual corpora, such as the Canadian Hansard corpus. Other approaches to Machine Translation include Rule-Based Translation and Example-Based Translation.

#### Bayesian Classification

Bayesian classification is a classification method based on Bayes' Theorem and is commonly used in Machine Learning and Natural Language Processing to classify text and documents. You can read more about it in Naive Bayes for Dummies.

#### Hidden Markov Model (HMM)

In order to understand a HMM we need to define a Markov Model. This is used to model randomly changing systems where it is assumed that future states only depend on the present state and not on the sequence of events that happened before it.

A HMM is a Markov model where the system being modeled is assumed to have unobserved or hidden states. There are a number of common algorithms used for hidden Markov models. The Viterbi algorithm, for example, will compute the most likely corresponding sequence of states, while the forward algorithm will compute the probability of the sequence of observations; both are often used in NLP applications.

In hidden Markov models, the state is not directly visible, but output, dependent on the state, is visible. Each state has a probability distribution over the possible output tokens. Therefore the sequence of tokens generated by an HMM gives some information about the sequence of states.
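To make the Viterbi algorithm a little more concrete, here is a sketch for a tiny two-state HMM in JavaScript; the states, observations and all probabilities below are invented for illustration:

```javascript
// A tiny HMM: hidden weather states, with observable activities as output.
const states = ['Rainy', 'Sunny'];
const start = { Rainy: 0.6, Sunny: 0.4 };
const trans = {
  Rainy: { Rainy: 0.7, Sunny: 0.3 },
  Sunny: { Rainy: 0.4, Sunny: 0.6 },
};
const emit = {
  Rainy: { walk: 0.1, shop: 0.4, clean: 0.5 },
  Sunny: { walk: 0.6, shop: 0.3, clean: 0.1 },
};

// Viterbi: compute the most likely sequence of hidden states
// for a sequence of observed outputs.
function viterbi(observations) {
  // probs[s]: probability of the best path so far ending in state s
  // paths[s]: that best path itself
  let probs = {}, paths = {};
  for (const s of states) {
    probs[s] = start[s] * emit[s][observations[0]];
    paths[s] = [s];
  }
  for (const obs of observations.slice(1)) {
    const nextProbs = {}, nextPaths = {};
    for (const s of states) {
      // Pick the predecessor that makes reaching state s most likely.
      let best = null, bestProb = -1;
      for (const prev of states) {
        const p = probs[prev] * trans[prev][s] * emit[s][obs];
        if (p > bestProb) { bestProb = p; best = prev; }
      }
      nextProbs[s] = bestProb;
      nextPaths[s] = paths[best].concat(s);
    }
    probs = nextProbs; paths = nextPaths;
  }
  const bestFinal = states.reduce((a, b) => (probs[a] > probs[b] ? a : b));
  return paths[bestFinal];
}

console.log(viterbi(['walk', 'shop', 'clean'])); // ['Sunny', 'Rainy', 'Rainy']
```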

#### Conditional Random Fields (CRFs)

A class of statistical modeling methods that are often applied in pattern recognition and machine learning, where they are used for structured prediction. Ordinary classifiers will predict labels for a sample without taking neighboring samples into account, a CRF model however, will take context into account. CRF is commonly used in NLP (e.g. in Named Entity Extraction) and more recently in image recognition.

#### Affinity Propagation (AP)

AP is a clustering algorithm commonly used in Data Mining. Unlike other clustering algorithms, such as k-means, AP does not require the number of clusters to be estimated before running the algorithm. A semi-supervised version of AP is commonly used in NLP.

#### Relationship extraction

Given a chunk of words or a piece of text, relationship extraction determines the relationship between the named entities it contains.

### What’s Blockspring?

Blockspring is a really exciting, YC backed startup, who pitch themselves as “The world’s library of functions, accessible from everywhere you do work.” Their platform allows you to interact with a library of various APIs through a spreadsheet, simple code snippets and soon a chat interface.

The platform lets you run 1000+ functions directly from your spreadsheet or through simple code snippets for the more technically inclined. Accessing APIs with Blockspring is done through the concept of functions and they certainly have some cool APIs available to interact with in their library.

Where Blockspring gets really interesting, though, is when you start to combine multiple functions. Your spreadsheet pretty much becomes a playpen where you can interact with one or multiple APIs and create powerful applications and "mashups". Examples of what can be done with Blockspring include automating social activity and monitoring, gathering marketing data about user segments and usage, accessing public datasets, scraping websites and now even analyzing text and unstructured data, all of which are really nicely showcased on their getting started page.

### AYLIEN and Blockspring

Like Blockspring, we want to get the power of our API into the hands of anyone that can get value from it. We launched our own Text Analysis Add-on for Google Sheets last year. The add-on works in the same way as Blockspring, through simple functions, and acts as an interface for our Text Analysis API. Integrating with Blockspring, however, means our users can now open up their use cases by combining our functions with other complementary APIs to create powerful tools and integrations.

All of the AYLIEN end-points are available through Blockspring as simple snippets or spreadsheet functions and getting started with AYLIEN and Blockspring is really easy.

#### It’s simple to get up and running:

Step 1.

Step 2.

Grab your AYLIEN APP ID and API key and keep it handy. If you don’t have an AYLIEN account just sign up here.

Step 3.

Explore the getting started section to see examples of the functions and APIs available.

Step 4.

Try some of the different functions through their interactive docs to get a feel for how they work.

Step 5.

Go wild and start building and creating mashups of functions with code snippets or in Google Sheets.

PS: Don’t forget to add your AYLIEN keys to your Blockspring account in the Secrets section of your account settings. Once they’ve been added, you won’t have to do it again.

We’re really excited to see what the Blockspring community start to build with our various functions. Over the next couple of weeks, we’ll also be showcasing some cool mashups that we’ve put together in Blockspring so keep your eyes peeled on the blog.

We’ve just added support for microformat parsing to our Text Analysis API through our Microformat Extraction endpoint.

Microformats are simple conventions or entities that are used on web pages, to describe a specific type of information, for example, Contact info, Reviews, Products, People, Events, etc.

Microformats are often included in the HTML of pages on the web to add semantic information about that page. They make it easier for machines and software to scan, process and understand webpages. AYLIEN Microformat Extraction allows users to detect, parse and extract embedded Microformats when they are present on a page.

Currently, the API supports the hCard format. We will be providing support for the other formats over the coming months. The quickest way to get up and running with this endpoint is to download an SDK and check out the documentation. We have gone through a simple example below to showcase the endpoint's capabilities.

### Microformat Extraction in Action

The following piece of code sets up the credentials for accessing our API. If you don’t have an AYLIEN account, you can sign up here.


var AYLIENTextAPI = require('aylien_textapi');
var textapi = new AYLIENTextAPI({
  application_id: 'YOUR_APP_ID',
  application_key: 'YOUR_APP_KEY'
});



The next piece of code accesses an HTML test page containing microformats that we have set up in CodePen to illustrate how the endpoint works (check out http://codepen.io/michaelo/pen/VYxxRR.html to see the raw HTML). The code consists of a call to the microformats endpoint and a forEach statement to display any hCards detected on the page.


textapi.microformats('http://codepen.io/michaelo/pen/VYxxRR.html',
  function(err, res) {
    if (err !== null) {
      console.log("Error: " + err);
    } else {
      res.hCards.forEach(function(hCard) {
        console.log(hCard);
        console.log("\n****************************************");
        console.log("End Of vCard");
        console.log("******************************************");
      });
    }
  });



As you can see from the results below, there are two hCards on the page, one for Sally Ride and the other for John Glenn. The documentation for the endpoint shows the structure of the data returned by the endpoint and lists the optional hCard fields that are currently supported. You can copy the code above and paste it into our sandbox environment to view the results for yourself and play around with the various fields.

#### Results


{ birthday: '1951-05-26',
organization: 'Sally Ride Science',
telephoneNumber: '+1.818.555.1212',
location:
{ id: '9f15e27ff48eb28c57f49fb177a1ed0af78f93ab',
latitude: '37.386013',
longitude: '-122.082932' },
photo: 'http://example.com/sk.jpg',
email: 'sally@example.com',
url: 'http://sally.example.com',
fullName: 'Sally Ride',
structuredName:
{ familyName: 'van der Harten',
givenName: 'Sally',
honorificSuffix: 'Ph.D.',
honorificPrefix: 'Dr.' },
logo: 'http://www.abc.com/pub/logos/abccorp.jpg',
id: '7d021199b0d826eef60cd31279037270e38715cd',
note: '1st American woman in space.',
countryName: 'U.S.A',
postalCode: 'LWT12Z',
id: '00cc73c1f9773a66613b04f11ce57317eecf636b',
region: 'California',
locality: 'Los Angeles',
category: 'physicist' }

****************************************
End Of vCard
****************************************

{ birthday: '1921-07-18',
telephoneNumber: '+1.818.555.1313',
location:
{ latitude: '30.386013',
longitude: '-123.082932' },
photo: 'http://example.com/jg.jpg',
email: 'johnglenn@example.com',
url: 'http://john.example.com',
fullName: 'John Glenn',
structuredName:
{ familyName: 'Glenn',
givenName: 'John',
id: 'a1146a5a67d236f340c5e906553f16d59113a417',
honorificPrefix: 'Senator' },
logo: 'http://www.example.com/pub/logos/abccorp.jpg',
id: '18538282ee1ac00b28f8645dff758f2ce696f8e5',
note: '1st American to orbit the Earth',
countryName: 'U.S.A',
postalCode: 'PC123',
id: '8cc940d376d3ddf77c6a5938cf731ee4ac01e128',
region: 'Ohio',
locality: 'Columbus' }

****************************************
End Of vCard
****************************************



Microformats Extraction allows you to automatically scan and understand webpages by pulling relevant information from HTML. This microformat information is easier for both humans and machines to understand than more complex formats such as XML.

Our development team have been working hard adding additional features to the API which allow our users to analyze, classify and tag text in more flexible ways. Unsupervised Classification is a feature we are really excited about and we’re happy to announce that it is available as a fully functional and documented feature, as of today.

#### So what exactly is Unsupervised Classification?

It's a training-less approach to classification, which means that, unlike our standard classification, which is based on IPTC News Codes, it doesn't rely on a predefined taxonomy to categorize text. This method of classification allows automatic tagging of text that can be tailored to a user's needs, without the need for a pre-trained classifier.

#### Why are we so excited about it?

Our Unsupervised Classification endpoint allows users to specify a set of labels, analyze a piece of text and then assign the most appropriate label to that text. This gives our users greater flexibility to decide how they want to tag and classify text.

There are a number of ways this endpoint can be used, and we'll walk you through a couple of simple examples: text classification from a URL and customer service routing of social interactions.

### Classification of Text

We’ll start with a simple example to show how the feature works. The user passes a piece of text or a URL to the API, along with a number of labels. In the case below we want to find out which label, Football, Baseball, Hockey or Basketball, best represents the following article: ‘http://insider.espn.go.com/nfl/story/_/id/12300361/bold-move-new-england-patriots-miami-dolphins-new-york-jets-buffalo-bills-nfl’

#### Code Snippet:


var AYLIENTextAPI = require('aylien_textapi');
var textapi = new AYLIENTextAPI({
  application_id: 'YourAppId',
  application_key: 'YourAppKey'
});

var params = {
  url: 'http://insider.espn.go.com/nfl/story/_/id/12300361/bold-move-new-england-patriots-miami-dolphins-new-york-jets-buffalo-bills-nfl',
  'class': ['football', 'baseball', 'hockey', 'basketball']
};

textapi.unsupervisedClassify(params, function(error, response) {
  if (error !== null) {
    console.log(error, response);
  } else {
    console.log("\nThe text to classify is:\n\n", response.text, "\n");
    for (var i = 0; i < response.classes.length; i++) {
      console.log("label - ", response.classes[i].label,
        ", score -", response.classes[i].score, "\n");
    }
  }
});


#### Results:


The text to classify is:

"Each NFL team's offseason is filled with small moves and marginal personnel decisions... "

label -  football , score - 0.13

label -  baseball , score - 0.042

label -  hockey , score - 0.008

label -  basketball , score - 0.008


Based on the scores provided, we can confidently say that the article is about football and should be assigned a “Football” label.

### Customer Service Routing

As another example, let’s say we want to automatically determine whether a post on social media should be routed to our Sales, Marketing or Support department. We’ll take the comment “I’d like to place an order for 1000 units.” and pass it to the API along with our pre-chosen labels, in this case ‘Sales’, ‘Customer Support’ and ‘Marketing’.

#### Code Snippet:


var AYLIENTextAPI = require('aylien_textapi');
var textapi = new AYLIENTextAPI({
  application_id: 'YourAppId',
  application_key: 'YourAppKey'
});

var params = {
  text: "I'd like to place an order for 1000 units.",
  'class': ['Sales', 'Customer Support', 'Marketing']
};

textapi.unsupervisedClassify(params, function(error, response) {
  if (error !== null) {
    console.log(error, response);
  } else {
    console.log("\nThe text to classify is:\n\n", response.text, "\n");
    for (var i = 0; i < response.classes.length; i++) {
      console.log("label - ", response.classes[i].label,
        ", score -", response.classes[i].score, "\n");
    }
  }
});


#### Results:


The text to classify is:

I'd like to place an order for 1000 units.

label -  Sales , score - 0.032

label -  Customer Support , score - 0.008

label -  Marketing , score - 0.002


Similarly, based on the scores for how closely the text semantically matches each label, we can decide that this inquiry should be handled by a sales agent rather than by marketing or support.

### Divide and Conquer

Our next example deals with using the Unsupervised Classification feature with a hierarchical taxonomy. When classifying text, it’s sometimes necessary to add a sub-label for finer-grained classification, for example “Sports – Basketball” instead of just “Sports”.

So, in this example we’re going to analyze a simple piece of text, “The oboe is a woodwind musical instrument”, and we’ll attempt to provide a more descriptive classification result, based on the following taxonomy:

• ‘music’: [‘Instrument’, ‘composer’],
• ‘technology’: [‘computers’, ‘space’, ‘physics’],
• ‘health’: [‘disease’, ‘medicine’, ‘fitness’],

The taxonomy has a primary label and a set of secondary labels, for example ‘music’ (primary) and ‘Instrument’, ‘composer’ (secondary).

#### Code Snippet:


var AYLIENTextAPI = require('aylien_textapi');
var textapi = new AYLIENTextAPI({
  application_id: 'YourAppId',
  application_key: 'YourAppKey'
});

var _ = require('underscore');
var taxonomy = {
  'music':      ['Instrument', 'composer'],
  'technology': ['computers', 'space', 'physics'],
  'health':     ['disease', 'medicine', 'fitness']
};

// Note: 'sport' has no sub-labels in the taxonomy above, so a
// 'sport' result would not get a second, finer-grained pass.
var topClasses = ['technology', 'music', 'health', 'sport'];
var queryText = "The oboe is a woodwind musical instrument.";
var params = {
  text: queryText,
  'class': topClasses
};

// First pass: pick the best primary label.
textapi.unsupervisedClassify(params, function(error, response) {
  if (error !== null) {
    console.log(error, response);
  } else {
    var classificationResult = '';
    console.log("\nThe text to classify is:\n\n", response.text, "\n");
    classificationResult = response.classes[0].label +
      " (" + response.classes[0].score + ") ";
    // Second pass: classify again against that label's sub-labels.
    params = {
      text: queryText,
      'class': _.values(_.pick(taxonomy, response.classes[0].label))[0]
    };
    textapi.unsupervisedClassify(params, function(error, response) {
      if (error !== null) {
        console.log(error, response);
      } else {
        classificationResult += " - " + response.classes[0].label +
          " (" + response.classes[0].score + ") ";
        console.log("Label: ", classificationResult);
      }
    });
  }
});


#### Results:


The text to classify is:

The oboe is a woodwind musical instrument.

Label:  music (0.076) - Instrument (0.342)


As you can see from the results, the piece of text has been assigned ‘music’ as its primary label and ‘instrument’ as its secondary label.

All the code snippets in our examples are fully functional and can be copied and pasted or tested in our sandbox. We’ll also be adding some of these and more interesting apps to our sandbox over the next week or so that will showcase some interesting use cases for Unsupervised Classification. We’d also love to hear more about how you would use this feature, so don’t hesitate to get in touch with comments or feedback.

### Introduction

This is our second blog on harnessing Machine Learning (ML) in the form of Natural Language Processing (NLP) for the Automatic Classification of documents. By classifying text, we aim to assign a document or piece of text to one or more classes or categories making it easier to manage or sort. A Document Classifier often returns or assigns a category “label” or “code” to a document or piece of text. Depending on the Classification Algorithm or strategy used, a classifier might also provide a confidence measure to indicate how confident it is that the result is correct.

In our first blog, we looked at a supervised method of Document Classification. In supervised methods, Document Categories are predefined by using a training dataset with manually tagged documents. A classifier is then trained on the manually tagged dataset so that it will be able to predict any given Document’s Category from then on.

In this blog, we will focus on Unsupervised Document Classification. Unsupervised ML techniques differ from supervised ones in that they do not require a training dataset, and in the case of documents, the categories are not known in advance. For example, let’s say we have a large number of emails that we want to analyze as part of an eDiscovery process. We may have no idea what the emails are about or what topics they deal with, and we want to automatically discover the most common topics present in the dataset. Unsupervised techniques such as Clustering can be used to automatically discover groups of similar documents within a collection.

### An Overview of Document Clustering

Document Clustering is a method for finding structure within a collection of documents, so that similar documents can be grouped into categories. The first step in the clustering process is to create word vectors for the documents we wish to cluster. A vector is simply a numerical representation of a document, where each component of the vector refers to a word, and the value of that component indicates the presence or importance of that word in the document. The distances between these vectors are then fed to a clustering algorithm, which groups similar vectors together into clusters. A simple example will help to illustrate how documents might be transformed into vectors.
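To make “similar” concrete: one common measure of similarity between two document vectors is cosine similarity (a clustering algorithm can then treat 1 minus the similarity as a distance). A minimal sketch in plain JavaScript, independent of any particular library:

```javascript
// Cosine similarity between two equal-length document vectors:
// dot(a, b) / (|a| * |b|). Identical vectors score 1; vectors
// with no words in common score 0.
function cosineSimilarity(a, b) {
  var dot = 0, normA = 0, normB = 0;
  for (var i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([1, 0, 0], [1, 0, 0])); // 1
console.log(cosineSimilarity([1, 0, 0], [0, 1, 0])); // 0
```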

### A simple example of transforming documents into vectors

Using the words within a document as the “features” that describe it, we need to find a way to represent these features numerically as a vector. As we did in the first blog in this series, we will consider three very short documents to illustrate the process.

We start by taking all of the words across the three documents in our training set and create a table or vector from these words.

<some,tigers,live,in,the,zoo,green,is,a,color,he,has,gone,to,New,York>

Then for each of the documents, we create a vector by assigning a 1 if the word exists in the document and a 0 if it doesn’t. In the table below each row is a vector describing a single document.

“Some tigers live in the zoo”  → <1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0>
“Green is a color”             → <0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0>
“He has gone to New York”      → <0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1>
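The same construction can be sketched in a few lines of JavaScript. The vocabulary is the one listed above; `toBinaryVector` is just an illustrative helper name:

```javascript
// Build binary bag-of-words vectors over a shared vocabulary.
var vocabulary = ['some', 'tigers', 'live', 'in', 'the', 'zoo',
                  'green', 'is', 'a', 'color',
                  'he', 'has', 'gone', 'to', 'new', 'york'];

// 1 if the vocabulary word occurs in the document, 0 otherwise.
function toBinaryVector(document) {
  var words = document.toLowerCase().split(/\s+/);
  return vocabulary.map(function(term) {
    return words.indexOf(term) !== -1 ? 1 : 0;
  });
}

console.log(toBinaryVector('Some tigers live in the zoo'));
// [ 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
```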

### Preprocessing the data

As we described in our blog on Supervised Methods of Classification, it is likely that some preprocessing of the data would be needed prior to creating the vectors. In our simple example, we have given equal importance (a value of 1) to each and every word when creating Document Vectors, and no word appears more than once. To improve the accuracy, we could give different weightings to words based on their importance to the document in question and their frequency within the document set as a whole. A common methodology used to do this is TF-IDF (term frequency – inverse document frequency). The TF-IDF weighting for a word increases with the number of times the word appears in the document but decreases based on how frequently the word appears across the entire document set. This has the effect of giving a lower overall weighting to words which occur more frequently in the document set, such as “a”, “it”, etc.
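As a rough illustration of the idea (not the exact weighting scheme our API uses), TF-IDF can be sketched as follows, reusing the three example documents from above:

```javascript
// TF-IDF sketch: weight(word, doc) = tf * log(N / df), where
// tf is the word's frequency within the document, N the number
// of documents, and df the number of documents containing it.
var docs = [
  ['some', 'tigers', 'live', 'in', 'the', 'zoo'],
  ['green', 'is', 'a', 'color'],
  ['he', 'has', 'gone', 'to', 'new', 'york']
];

function tfidf(word, doc, corpus) {
  var tf = doc.filter(function(w) { return w === word; }).length / doc.length;
  var df = corpus.filter(function(d) { return d.indexOf(word) !== -1; }).length;
  return tf * Math.log(corpus.length / df); // assumes df > 0
}

// 'tigers' appears in only one of the three documents, so it
// keeps a relatively high weight in that document.
console.log(tfidf('tigers', docs[0], docs));
```

A word that appeared in every document would get a weight of log(N/N) = 0, which is exactly the down-weighting of common words described above.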

### Clustering Algorithms

In the graph below, each “dot” is a vector which represents a document. The graph shows the output from a Clustering Algorithm, with an X marking the center of each cluster (known as a ‘centroid’). In this case the vectors have only two features (or dimensions) and can easily be plotted on a two-dimensional graph as shown below.

K-Means Clustering Algorithm output example:

Source: http://blog.mpacula.com/2011/04/27/k-means-clustering-example-python/

### Two extreme cases to illustrate the concept of discovering the clusters

To see how the vectors might be grouped into clusters, it helps to first consider two extreme cases. In the first, we assume that there is only one cluster and that all of the document vectors belong to it. This is a very simple approach which is not very useful when it comes to managing or sorting the documents effectively.

The second extreme case is to decide that each document is a cluster by itself, so that if we had N documents we would have N clusters. Again, this is a very simple solution with not much practical use.

### Finding the K clusters from the N Document Vectors

Ideally, from N documents we want to find K distinct clusters that separate the documents into useful and meaningful categories. There are many Clustering Algorithms available to help us achieve this. For this blog, we will look at the k-means algorithm in more detail to illustrate the concept.

#### How many clusters (K)?

One simple rule of thumb for deciding the optimum number of clusters (K) is:

K = sqrt(N/2).

For example, a collection of N = 200 documents would suggest K = sqrt(100) = 10 clusters.

There are many more methods of finding K which you can read about here.

#### Finding the Clusters

Again, there are many ways we can find clusters. To illustrate the concept we’ll look at one popular method, the K-means algorithm, which follows these steps:

1. Find the value of K using our simple rule of thumb above.
2. Randomly assign each of the K cluster centroids throughout the dataset.
3. Assign each data point to the cluster whose centroid is closest to it.
4. Recompute the centroid location for each cluster as an average of the vector points within the cluster (this will find the new “center” of the cluster).
5. Reassign each vector data point to the centroid closest to it, i.e. some points will now switch from one cluster to another as the centroid positions have changed.
6. Repeat steps 4 and 5 until none of the data points switch centroids, i.e. the clusters have “converged”.
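The steps above can be sketched as a toy k-means implementation in plain JavaScript. This is purely illustrative: it seeds centroids from the first K points rather than randomly, and works on small 2-D points like those in the graph above:

```javascript
// Toy k-means over 2-D points, following the numbered steps above.
function kMeans(points, k) {
  // Step 2: seed centroids from the first k points (random in practice).
  var centroids = points.slice(0, k).map(function(p) { return p.slice(); });
  var assignment = new Array(points.length).fill(-1);
  var changed = true;

  while (changed) {                 // Step 6: repeat until converged.
    changed = false;
    // Steps 3/5: assign each point to its nearest centroid.
    points.forEach(function(p, i) {
      var best = 0, bestDist = Infinity;
      centroids.forEach(function(c, j) {
        var dist = Math.pow(p[0] - c[0], 2) + Math.pow(p[1] - c[1], 2);
        if (dist < bestDist) { bestDist = dist; best = j; }
      });
      if (assignment[i] !== best) { assignment[i] = best; changed = true; }
    });
    // Step 4: move each centroid to the mean of its assigned points.
    for (var j = 0; j < k; j++) {
      var members = points.filter(function(_, i) { return assignment[i] === j; });
      if (members.length > 0) {
        centroids[j] = [
          members.reduce(function(s, p) { return s + p[0]; }, 0) / members.length,
          members.reduce(function(s, p) { return s + p[1]; }, 0) / members.length
        ];
      }
    }
  }
  return assignment;
}

// Two obvious groups: points near (0,0) and points near (10,10).
var points = [[0, 0], [1, 0], [0, 1], [10, 10], [10, 11], [11, 10]];
console.log(kMeans(points, 2)); // [ 0, 0, 0, 1, 1, 1 ]
```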

That’s pretty much it: you now have your N documents assigned to K clusters! If you have difficulty visualising the steps above, watch this excellent video tutorial by Viktor Lavrenko of the University of Edinburgh, which explains it in more depth.

Keep an eye out for more in our “Text Analysis 101” series. The next blog will look at how Topic Modelling is performed.

We recently developed and released SDKs for Node.js, Python, Ruby and PHP, with Java, Go and C# coming out next week. This is the first blog in a series on using AYLIEN’s various Software Development Kits (SDKs), and for today’s blog we are going to focus on the Node.js SDK.

If you are new to AYLIEN and do not yet have an account you can take a look at our blog on getting started with the API or alternatively you can go directly to the Getting Started page on the website which will take you through the signup process. We have a free plan to get you started which allows you to make up to 1,000 calls per day to the API for free.

All of our SDK repositories are hosted on GitHub. The simplest way to install the SDK is with the node package manager “npm”, by typing the following from the command line:

$ npm install aylien_textapi

Once you have installed the SDK you are ready to start coding! The Sandbox area of the website has a number of sample applications available which you can use to get things moving.

For this guide in particular we are going to walk you through some of the basic functionality of the API incorporating the “Basic Functions” sample app from the Sandbox. We’ll illustrate making a call to three of the endpoints individually and interpret the output that you should receive in each case.

### Accessing the SDK with your AYLIEN credentials

var AYLIENTextAPI = require('aylien_textapi');
var textapi = new AYLIENTextAPI({
  application_id: 'YOUR_APP_ID',
  application_key: 'YOUR_APP_KEY'
});

When calling the various endpoints you can specify whether your input is a piece of text or you can pass a URL linking to the text or article you wish to analyze.

### Language Detection

First let’s take a look at the Language Detection endpoint. It is pretty straightforward: it detects the language of a piece of text. In this case we are analyzing the following sentence: “What language is this sentence written in?”

You can call the endpoint using the following piece of code.

textapi.language('What language is this sentence written in?', function(err, resp) {
  console.log("\nLanguage Detection Results:\n");
  if (err !== null) {
    console.log("Error: " + err);
  } else {
    console.log("Language\t:", resp.lang);
    console.log("Confidence\t:", resp.confidence);
  }
});


You should receive an output very similar to the one below which shows that the language detected was English. It also shows a confidence score (a number between 0 and 1) of close to 1, indicating that you can be pretty sure English is correct.

Language Detection Results:

Language	: en
Confidence	: 0.9999984486883192

### Sentiment Analysis

Next we’ll look at analyzing the sentence “John is a very good football player!” to determine its sentiment, i.e. whether it’s positive, neutral or negative. The endpoint will also determine whether the text is subjective or objective. You can call the endpoint with the following piece of code.

textapi.sentiment('John is a very good football player!', function(err, resp) {
  console.log("\nSentiment Analysis Results:\n");
  if (err !== null) {
    console.log("Error: " + err);
  } else {
    console.log("Subjectivity\t:", resp.subjectivity);
    console.log("Subjectivity Confidence\t:", resp.subjectivity_confidence);
    console.log("Polarity\t:", resp.polarity);
    console.log("Polarity Confidence\t:", resp.polarity_confidence);
  }
});

You should receive an output similar to the one shown below which indicates that the sentence is objective and is positive, both with a high degree of confidence.

Sentiment Analysis Results:

Subjectivity	: objective
Subjectivity Confidence	: 0.9896821594138254
Polarity	: positive
Polarity Confidence	: 0.9999988272764874

### Article Classification

Next we will take a look at the classification endpoint. The Classification endpoint automatically assigns an article or piece of text to one or more categories making it easier to manage and sort. The classification is based on IPTC International Subject News Codes and can identify up to 500 categories. The code below analyzes a BBC news article about the Philae Lander which found organic molecules on the surface of a comet.

textapi.classify('http://www.bbc.com/news/science-environment-30097648', function(err, resp) {
  console.log("\nArticle Classification Results:\n");
  if (err !== null) {
    console.log("Error: " + err);
  } else {
    for (var i = 0; i < resp.categories.length; i++) {
      console.log("Label\t:", resp.categories[i].label);
      console.log("Code\t:", resp.categories[i].code);
      console.log("Confidence\t:", resp.categories[i].confidence);
    }
  }
});


When you run this code you should receive an output similar to that shown below, which assigns the article an IPTC label of “science and technology - space programme” with an IPTC code of 13008000.

Article Classification Results:

Label	: science and technology - space programme
Code	: 13008000
Confidence	: 0.9999999983009931

Now that you have seen how simple it is to access the power of Text Analysis through the SDK, jump over to our Sandbox and start playing with the sample apps. If Node.js is not your preferred language, check out our SDKs for Python, Ruby and PHP on our website. We will be publishing ‘getting started’ blogs for each of these languages over the coming weeks, so keep an eye out for those too. If you haven’t already done so, you can get free access to our API on our sign up page.
