Feature Bundle – Text Analysis for News Aggregators

News aggregation apps and services are changing the way news is discovered and consumed. As a provider or developer of these services, you know that competition is increasing immensely as each app promises to deliver a more personalized and streamlined experience than those that have come before. Ultimately, the winners in this battle for market share will be those who best understand the content they are sharing and use that knowledge to deliver cutting-edge personalization and reader satisfaction.

Enter Text Analysis – a component of Machine Learning and Natural Language Processing that is playing an increasingly important role in news aggregation.

How can Text Analysis help?

Using Machine Learning and Natural Language Processing techniques, it’s easier than ever before to understand, analyze and segment news content at scale.

To make things easy for you, we’ve put together a list of Text Analysis features (or endpoints as we call them) that are being widely used by our customers for news aggregation:

  1. Article Extraction
  2. Classification / Categorization
  3. Entity & Concept Extraction
  4. Summarization

Note: We have a really cool and easy-to-use online Demo showcasing all of these features – check it out 🙂

You can also check out the case study we put together for one of our News Aggregation customers – Streem

Let’s begin by taking a look at the first feature on our list, Article Extraction.

1. Article Extraction

Webpages can often be cluttered and noisy, awash with an overwhelming amount of images, ads, videos and pop-ups that appear alongside the informative textual content. As a news aggregator, you often need to be able to extract the elements that matter most from a web page – the story itself, the title, the author, publication date – and ignore those that don’t matter to you.

Extracting this information manually from the thousands of stories you analyze every day is simply not sustainable or efficient.

Our Article Extraction endpoint is used to extract the main body of text from articles, web pages and RSS feeds. In doing this, it provides us with the ‘clean’ text data and ignores other media such as images, videos or ads. You get what matters, minus the noise.

Here’s a quick example. On the left we have a web page containing text, images, videos, ads and links to other stories. On the right we have the results of this same web page after we ran it through Article Extraction.


Using our Article Extraction feature allows you to easily break down a webpage and extract what matters. We extract the main body of text from a web page, the published date, the author and also any image or video present.

This means you can automatically extract what matters from an article while disregarding what doesn’t, leaving you with a much cleaner, indexed data source for further analysis.
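To illustrate how you might work with an Article Extraction response, here’s a minimal Python sketch. The field names (`title`, `author`, `publishDate`, `article`, `image`) are illustrative assumptions, not the endpoint’s exact response schema:

```python
# A hypothetical extraction response, shaped for illustration only.
sample_response = {
    "title": "Example Headline",
    "author": "Jane Doe",
    "publishDate": "2016-03-01T09:00:00Z",
    "article": "The clean main body of the story...",
    "image": "https://example.com/lead.jpg",
}

def to_index_record(response):
    """Keep only the fields worth indexing, dropping empty or missing ones."""
    wanted = ("title", "author", "publishDate", "article", "image")
    return {k: response[k] for k in wanted if response.get(k)}

record = to_index_record(sample_response)
```

A small routine like this is typically the first step in building a clean, searchable index of aggregated stories.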

2. Classification / Categorization

Categorizing huge amounts of content every day can be a laborious task. Relying on human input is an option, but it is inefficient and prone to error. With the sheer amount of content produced daily, you would also need a small army of staff to pull this off!

Wouldn’t it be easier to automatically tag stories based on a taxonomy?

Our Classification by Taxonomy endpoint classifies, or categorizes, a piece of text according to your choice of taxonomy: either IPTC Subject Codes or IAB QAG.

  • IPTC Subject Codes – the international standard for categorizing news content
  • IAB QAG – the Interactive Advertising Bureau’s Quality Assurance Guidelines taxonomy for categorizing content for advertising

We took this article about the Tesla Model S from the TechCrunch website and received the following classification results:

What you can see in the image above is the IPTC ID code and label for the Automotive and Electric Vehicle categories, along with our confidence that it is a correct classification. A score of 1 reflects complete confidence in the results. We provide this score so you can set your own confidence threshold. For example, you may want to flag results below a certain score for human analysis.
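That thresholding step can be sketched in a few lines of Python. The label/score structure below is assumed for illustration; only the scores drive the routing:

```python
def triage(labels, threshold=0.7):
    """Split classification results into auto-accepted and flagged-for-review."""
    accepted = [l for l in labels if l["score"] >= threshold]
    review = [l for l in labels if l["score"] < threshold]
    return accepted, review

# Example results, with made-up codes and scores.
results = [
    {"label": "automotive", "code": "04011000", "score": 0.92},
    {"label": "electric vehicle", "code": "04011002", "score": 0.55},
]
accepted, review = triage(results)
# "automotive" is accepted automatically; "electric vehicle" goes to review.
```

The right threshold depends on how costly a misclassification is for your product versus the cost of human review.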


Automatically categorizing content based on a standard taxonomy means the content you aggregate can be easily segmented by topic and specific areas of interest. It also helps you avoid the all-too-common problem of over-tagging, where stories end up lost in the wrong category or area of your site/app.

3. Entity & Concept Extraction

News stories contain a wealth of mentioned entities and values that can provide some really interesting and important information on a piece of text. The challenge is mining this information, particularly at scale. To best aggregate and segment news content, you want to know the who, the what and the how much from each and every article.

Entity Extraction extracts named entities (people, organizations, products and locations) and values (URLs, emails, telephone numbers, currency amounts and percentages) mentioned in a body of text or web pages.

Concept Extraction extracts named entities mentioned in a document, disambiguates and cross-links them to DBpedia and Linked Data entities, along with their semantic types (including DBpedia and schema.org types).

Concept extraction disambiguates similarly named entities. Take Amazon for example. If it is mentioned in an article, is it referring to the commerce giant or the rainforest? The last thing you want is to recommend an article about the environment to your tech readers who have just read about Amazon’s latest tablet release!

Concept extraction analyzes the context around the mention and, using Machine Learning and Natural Language Processing (NLP) techniques, performs the disambiguation.


By extracting entities and concepts you can produce a rich tagging system to assist with content aggregation, recommendation and even ad targeting. You can easily understand which people, places, organizations and brands, for example, are mentioned in the articles you share.
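As a sketch of how concept-extraction output might feed such a tagging system, here’s a short Python example. The response shape (a surface form plus a linked DBpedia URI and semantic types) is an illustrative assumption rather than the endpoint’s exact schema:

```python
# Hypothetical concept-extraction output: disambiguated, linked entities.
concepts = [
    {"surface": "Amazon", "uri": "http://dbpedia.org/resource/Amazon.com",
     "types": ["http://schema.org/Organization"]},
    {"surface": "Seattle", "uri": "http://dbpedia.org/resource/Seattle",
     "types": ["http://schema.org/Place"]},
]

def tags_by_type(concepts):
    """Group linked concepts by the short name of their first semantic type."""
    grouped = {}
    for c in concepts:
        key = c["types"][0].rsplit("/", 1)[-1]  # e.g. "Organization"
        grouped.setdefault(key, []).append(c["uri"])
    return grouped
```

Because the tags are canonical URIs rather than raw strings, “Amazon” the company and “Amazon” the rainforest can never be conflated in your recommendation logic.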

4. Summarization

As a news aggregator, you strive to provide and recommend the best and most relevant content to your readers. You also want to provide them with a snapshot teaser of an article so they can decide whether or not to read it. As content consumers in general, sometimes we want it all and sometimes we just want it fast. Either way, we have you covered.

The Summarization endpoint enables you to generate a summary of an article by automatically selecting key sentences to give a cut-down but reflective overview of the main body of text. You can choose to summarize a piece of text in 1-10 sentences. Depending on your method of distribution you may choose a smaller number of sentences, perhaps for an RSS feed, or a larger number for stories above a certain word count that would take a considerable amount of time to read fully.
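The endpoint’s own algorithm is more sophisticated, but purely to illustrate the idea of selecting key sentences, here’s a toy frequency-based extractive summarizer in Python:

```python
import re
from collections import Counter

def summarize(text, n_sentences=2):
    """Pick the n highest-scoring sentences, scored by word frequency."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(s):
        return sum(freq[w] for w in re.findall(r"\w+", s.lower()))

    # Keep the top n sentences, restored to their original order.
    top = sorted(sorted(sentences, key=score, reverse=True)[:n_sentences],
                 key=sentences.index)
    return " ".join(top)
```

Sentences that repeat the document’s most frequent words are treated as most representative, which is a common baseline for extractive summarization.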

As an example, we have taken a story about MacBook chargers from The Business Insider and produced the following summary:

Without reading the full article or seeing the headline, you can probably establish what this article is about after reading the 5 key sentences above, which only takes around 30 seconds.

Here’s the original article from The Business Insider.


The Summarization endpoint provides an intelligent summary of the content you share. This is particularly useful when, for example, providing snapshot teasers to your readers or providing a reflective overview of stories with larger word counts.


We hope this post gives you some inspiration and helps you understand the various Text Analysis features available to you and how they can support your news aggregation efforts.

News API - Sign up

Let's Talk