Feature Bundle – Text Analysis for Publishers


As a publisher, you generate a huge amount of content, and keeping it organized and accessible within your network can be a real challenge. You also know that the work doesn’t stop when you click Publish: what you do with the content after its creation can be just as important as the content creation itself.

We understand the challenges that you face in the ever-changing publishing industry. We hear about them quite a lot from our customers who are facing increased competition amid a constant battle to stay relevant and at the forefront of their readers’ minds (and screens!).

Content is at the heart of reader engagement, page views, visits, time on site, CTRs and social shares: all of the things that matter to a publisher. Given the rate at which you need to curate and produce content today, it’s becoming even more difficult to stay on top of what you’re writing about, what’s popular and what’s driving revenue.

How can Text Analysis help?

Using Machine Learning and Natural Language Processing techniques, it’s easier than ever to understand content at scale.

To make things easy for you, we’ve put together a list of Text Analysis features (or endpoints, as we call them) that are widely used by our publishing industry customers. So if you’re new to using our API, this will help you get up to speed quickly on what we can help you with:

  • Article Extraction
  • Classification (Categorization)
  • Entity & Concept Extraction
  • Summarization

Note: We have a really cool and easy-to-use online Demo showcasing all of these features – check it out 🙂

1. Article Extraction

Webpages can be noisy and cluttered. They’re often awash with ads, images, videos and pop-ups that appear alongside the informative textual content. As a publisher, separating the wheat from the chaff to extract what really matters – the story itself – can be quite the challenge, whether that’s on your own network of sites or external outlets.

Our Article Extraction endpoint is used to extract the main body of text from articles, web pages and RSS feeds. In doing this, it provides us with the ‘clean’ text data and ignores other media such as images, videos or ads.

Here’s a quick example. On the left we have a web page containing text, images, video ads and links to other stories. On the right we have the results of this same web page after we ran it through Article Extraction.



Using our Article Extraction feature allows you to easily break down a webpage and extract what matters. We extract the main body of text from a web page, the published date, the author and also any image or video present.

This means you can automatically extract what matters from an article while disregarding what doesn’t, giving you a much cleaner, indexed data source for further analysis.
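
To give a rough feel for what "extracting the main body of text" involves, here is a deliberately naive sketch in Python using only the standard library. It simply keeps paragraph text and skips a few obviously noisy containers. It is not a description of how the endpoint works internally; the class name `ParagraphExtractor`, the list of skipped tags and the sample HTML are all assumptions made up for this illustration.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects text inside <p> tags, ignoring <p> tags nested in noisy containers."""

    # Hypothetical list of containers treated as clutter in this sketch.
    SKIP_TAGS = {"script", "style", "aside", "nav", "footer"}

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._skip_depth = 0
        self._in_paragraph = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "p" and self._skip_depth == 0:
            self._in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "p" and self._in_paragraph:
            # Collapse runs of whitespace before storing the paragraph.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph and self._skip_depth == 0:
            self._buffer.append(data)


def extract_main_text(html):
    """Return the paragraph text of `html`, one blank line between paragraphs."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.paragraphs)


SAMPLE_HTML = """
<html><body>
<nav><p>Home | Sport | Tech</p></nav>
<h1>Headline</h1>
<p>First paragraph of the <b>story</b>.</p>
<aside><p>Advertisement: buy now!</p></aside>
<p>Second paragraph.</p>
<script>trackPageView();</script>
</body></html>
"""

if __name__ == "__main__":
    print(extract_main_text(SAMPLE_HTML))
```

Real pages are far messier than this sample, which is exactly why a dedicated extraction service is useful.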

2. Classification / Categorization

Categorizing mass amounts of content is an arduous task. For the most part we rely on human input to categorize articles: upon submitting an article for publication, the author selects some category tags. This approach can work, but it often results in further problems: it can take too long, and the potential for human error and less-than-perfect classification is very high.

Wouldn’t it be easier to automatically add tags based on a taxonomy?

Our Classification by Taxonomy endpoint classifies, or categorizes, a piece of text according to your choice of taxonomy, either IPTC Subject Codes or IAB QAG.

  • IPTC Subject Codes – Part of the IPTC NewsCodes, an international standard for categorizing news content
  • IAB QAG – The Interactive Advertising Bureau’s Quality Assurance Guidelines taxonomy, used to classify content for advertising

We ran this article about the 2016 Masters golf tournament from the BBC website through the endpoint and received the following results:

What you can see in the image above is the IPTC ID code and label for the golf category and our confidence that it is a correct classification. A score of 1 reflects complete confidence in the results.
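
In practice, you would typically keep only the categories whose confidence clears a threshold before tagging an article. Below is a minimal sketch, assuming a response shaped like the hypothetical `sample_response` dictionary. The field names, the placeholder codes and the scores are illustrative assumptions, not actual output from the endpoint.

```python
def confident_labels(response, threshold=0.5):
    """Return (code, label) pairs whose confidence is at least `threshold`."""
    return [
        (category["code"], category["label"])
        for category in response.get("categories", [])
        if category.get("confidence", 0.0) >= threshold
    ]


# Hypothetical response shape; codes are placeholders, not real IPTC IDs.
sample_response = {
    "taxonomy": "IPTC Subject Codes",
    "categories": [
        {"code": "placeholder-1", "label": "sport - golf", "confidence": 1.0},
        {"code": "placeholder-2", "label": "sport", "confidence": 0.3},
    ],
}

if __name__ == "__main__":
    print(confident_labels(sample_response))
```

A higher threshold gives fewer but more reliable tags; a lower one gives broader coverage at the cost of some noise.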


Automatically categorizing content based on a standard taxonomy means the content you produce can be easily segmented by topic. It also means you can eliminate the all-too-prevalent problem of authors over-using tags.

User Spotlight: Scredible

Using advanced NLP powered by AYLIEN, the team at Scredible have created a fantastic tool for content curation, sharing and publishing. Scredible categorizes content by topic and category so it can deliver focused, relevant and personalized content to its users.

3. Entity & Concept Extraction

Your content contains a wealth of mentioned entities and values that can provide some really interesting and important information on a piece of text. The challenge is mining this information, particularly at scale. To understand content, you want to know the who, the what and the how much from each and every article.

Entity Extraction extracts named entities (people, organizations, products and locations) and values (URLs, emails, telephone numbers, currency amounts and percentages) mentioned in a body of text or web pages.
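
The "values" half of this is the easier part to picture. As a rough illustration only, the sketch below pulls URLs, email addresses and percentages out of a string with simplified regular expressions. The patterns are our own assumptions for this example; they will miss many real-world formats and are not the endpoint’s implementation. Named entities such as people and organizations need much more than pattern matching.

```python
import re

# Simplified, illustrative patterns; real-world values are far more varied.
VALUE_PATTERNS = {
    "url": re.compile(r"https?://[^\s,;]+"),
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "percentage": re.compile(r"\d+(?:\.\d+)?%"),
}


def extract_values(text):
    """Return a dict mapping each value type to the matches found in `text`."""
    return {name: pattern.findall(text) for name, pattern in VALUE_PATTERNS.items()}


sample = (
    "Sales rose 12.5% last year. Contact press@example.com "
    "or see https://example.com/report for details."
)

if __name__ == "__main__":
    print(extract_values(sample))
```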

Concept Extraction extracts named entities mentioned in a document, disambiguates and cross-links them to DBpedia and Linked Data entities, along with their semantic types (including DBpedia and schema.org types).

Concept extraction disambiguates similarly named entities. Take Apple, for example. If it is mentioned in an article, is it referring to the company or the fruit? The last thing you want is to recommend an article about fruit to your tech readers who have just read about the latest iOS release!

Concept extraction analyzes the content around the word and, through Machine Learning and Natural Language Processing (NLP) techniques, performs the disambiguation.
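
To make the idea of "analyzing the content around the word" concrete, here is a toy sketch that picks a sense for Apple by counting how many hand-picked clue words appear in the surrounding text. Real disambiguation relies on trained models and a knowledge base such as DBpedia; the sense labels, clue words and function name below are invented purely for illustration.

```python
import re

# Hypothetical clue words for each sense of "Apple", hand-picked for this example.
SENSES = {
    "Apple Inc. (company)": {"ios", "iphone", "release", "software", "mac", "app"},
    "apple (fruit)": {"fruit", "orchard", "pie", "juice", "tree", "harvest"},
}


def disambiguate(text, senses=SENSES):
    """Return the sense whose clue words overlap most with `text`, or None."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {sense: len(words & clues) for sense, clues in senses.items()}
    best = max(scores, key=scores.get)
    # With no supporting context at all, decline to guess.
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    print(disambiguate("Apple announced the latest iOS release for iPhone"))
    print(disambiguate("She baked an apple pie with fruit from the orchard"))
```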


By extracting entities and concepts you can produce a rich tagging system to assist with your own archiving needs, content recommendation and even ad targeting. You can easily understand which people, places, organizations and brands, for example, are mentioned in the articles you publish.

User Spotlight: Complex Media

Complex Media is a New York-based media platform for youth culture that has a monthly audience of over 120 million people. Complex use entities and concepts to match video content to published articles to improve ad targeting and CTRs. By identifying mentions of celebrities, brands, etc., they can place relevant videos on the article pages containing them. Then, within that video, they can display ads for mentioned brands. For example, an article that mentions both Drake and Adidas will include a video of a Drake song with an Adidas advertisement shown before it plays.

4. Summarization

Have you ever begun to read a long article or story and wished you could just grab a quick summary that would give you a good overview of the text? Of course you have! And so have your readers. As content consumers in general, sometimes we want it all and sometimes we just want it fast. Either way, we have you covered.

The Summarization endpoint enables you to generate a summary of an article by automatically selecting key sentences to give a cut-down but reflective overview of the main body of text. You can choose to summarize a piece of text in 1-10 sentences.
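
Selecting key sentences from the original text is usually called extractive summarization. The sketch below shows one very simple variant, scoring each sentence by the average frequency of its words. It is a stand-in for illustration, not a description of how the endpoint works; the stop-word list, the function name and the `sentences_number` parameter are our own assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
    "on", "for", "was", "with", "that", "this",
}


def _tokens(text):
    """Lowercase word tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def summarize(text, sentences_number=3):
    """Return up to `sentences_number` top-scoring sentences in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    frequencies = Counter(_tokens(text))

    def score(sentence):
        tokens = _tokens(sentence)
        # Average word frequency, so long sentences are not favoured automatically.
        return sum(frequencies[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:sentences_number])
    return [sentences[i] for i in keep]


sample_text = (
    "The golf tournament began on Thursday. Golf fans filled the course early. "
    "The weather was mild. A young golf champion led the tournament after round one."
)

if __name__ == "__main__":
    for sentence in summarize(sample_text, sentences_number=2):
        print(sentence)
```

Returning the chosen sentences in their original order keeps the summary readable, even though they were ranked by score.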

As an example, we have taken a URL from The Guardian online and produced the following summary:

Without reading the actual full article or seeing the headline you can probably establish what this article is about after reading the 5 key sentences above, which only takes around 30 seconds.

Here’s the original article from The Guardian.


The Summarization endpoint provides an intelligent summary of the content you’ve analyzed. This is particularly useful when, for example, providing snapshot teasers to your readers or curating both internal and external content.

User Spotlight: The Magazine Channel

The Magazine Channel use our Summarization endpoint within their flagship app, Inkworthy.


We hope this post gives you some inspiration and helps you to understand the various Text Analysis features available to you and how they can help you as a publisher.
