Getting started with Real-Time Topic Clustering

We recently announced our biggest feature rollout to date. In this release we introduced the ability to group news articles together by similarity with Real-time Clustering.

This new feature enables News API users to:

  • Determine what topics of interest are developing in the news landscape
  • Identify events of interest that are developing in the world’s news content
  • Deduplicate content in news streams

You can read more about the full feature release here.

What is Real-time Clustering?

Real-Time Clustering is a proprietary clustering model that utilizes a combination of enrichment data and other signals to group articles covering the same event or topic together in real-time.

This ability enables News API users to greatly improve the efficiency and accuracy of their applications by: 

  • Identifying significant events/topics of interest in the news
  • Discovering trending events/topics as they unfold in the news
  • Tracking breaking news events or topics
  • Summarizing event/topic details
  • Interrogating and investigating these events/topics
  • Deduplicating news articles
  • Blocking out noise in feeds that are shown to users/analysts

What is a cluster?

Clusters are collections of stories that are grouped together based on how similar they are. In the News API a cluster is a JSON object that provides a cluster ID along with metadata about the stories in that cluster.

A cluster has the following properties:

  • A unique ID in the News API
  • One or more stories associated with it
  • A story will only ever belong to one cluster
  • The predicted location of the story
  • Timestamps of the earliest and latest stories
  • A representative “Hero” story that best summarizes the event the cluster refers to
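
To make the list above concrete, here is a minimal sketch of how a cluster could be modelled in Python. It is a hypothetical stand-in, not the SDK's own class: `id`, `story_count`, `time`, and `representative_story` mirror the attribute names used in the examples later in this post, while `earliest_story`, `latest_story`, and `location` are assumed names for the timestamps and predicted location. The sample values are made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ClusterSketch:
    """Hypothetical stand-in for a cluster object; not the SDK's own model."""
    id: int                      # unique cluster ID in the News API
    story_count: int             # number of stories in the cluster (one or more)
    time: datetime               # time value attached to the cluster
    earliest_story: datetime     # assumed name: timestamp of the earliest story
    latest_story: datetime       # assumed name: timestamp of the latest story
    representative_story: dict   # the "Hero" story that summarizes the event
    location: Optional[dict] = None  # assumed name: predicted location

    def span_hours(self) -> float:
        """Hours between the earliest and latest story in the cluster."""
        delta = self.latest_story - self.earliest_story
        return delta.total_seconds() / 3600


# Illustrative values only
example = ClusterSketch(
    id=1,
    story_count=12,
    time=datetime(2020, 1, 1, 9, 0),
    earliest_story=datetime(2020, 1, 1, 9, 0),
    latest_story=datetime(2020, 1, 1, 15, 0),
    representative_story={"id": 101, "title": "Example headline"},
)
print(example.span_hours())  # 6.0
```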

Getting Started with Real-Time Clustering

Getting up and running with Real-Time Clustering is straightforward. There are several ways to retrieve clusters in the News API, and we’ll guide you through each of them below. For more technical descriptions and code, please refer to our Clusters documentation.

Clustering requires an Advanced or Enterprise license key. Start a free trial to get your API credentials or contact sales to upgrade your account.

Retrieving clusters using the Clusters endpoint

The Clusters endpoint allows you to search for clusters made up of articles that were published within a specific timeframe – particularly useful for monitoring “breaking” news events or emerging topics that are receiving a certain level of coverage.

Each cluster comes with metadata on the number of stories that make up that cluster, making it easy to discover and identify new or growing clusters. Typically the size of a cluster can be a strong indicator of a new or important event as it unfolds. Other filter options include source location, allowing you to build localized clustering searches with ease.

Our Clusters endpoint is used as a first step in the discovery process to identify the clusters that matter to you. Once you have retrieved the cluster objects, you can then query the Stories endpoint with their IDs to gather the stories (articles) associated with each cluster. Below is a sample query to the Clusters endpoint using our Python SDK.

In this query, we’re making a simple call to the Clusters endpoint, looking for clusters of 10 or more stories published within the last 6 hours.

import os
import aylien_news_api
from aylien_news_api.rest import ApiException
from pprint import pprint

configuration = aylien_news_api.Configuration()
configuration.api_key['X-AYLIEN-NewsAPI-Application-ID'] = os.environ.get('NEWSAPI_APP_ID')
configuration.api_key['X-AYLIEN-NewsAPI-Application-Key'] = os.environ.get('NEWSAPI_APP_KEY')

client = aylien_news_api.ApiClient(configuration)
api_instance = aylien_news_api.DefaultApi(client)

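try:
    # Clusters of 10 or more stories from the last 6 hours:
    # time_start sets the beginning of the window, so 'NOW-6HOURS' means "6 hours ago until now"
    api_response = api_instance.list_clusters(
        time_start='NOW-6HOURS',
        story_count_min=10
    )
    pprint(api_response)
except ApiException as e:
    print("Exception when calling DefaultApi->list_clusters: %s\n" % e)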
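
Because cluster size is such a useful signal, a natural next step is to rank the clusters you get back by their story count. The sketch below uses plain stand-in objects rather than a live API response. It assumes each cluster exposes `id` and `story_count` attributes, as the cluster objects in the later examples do, and `largest_clusters` is a hypothetical helper, not part of the SDK.

```python
from collections import namedtuple

# Stand-in for the cluster objects in an API response (illustrative data only)
Cluster = namedtuple("Cluster", ["id", "story_count"])


def largest_clusters(clusters, top_n=3):
    """Return the top_n clusters ordered from most to fewest stories."""
    return sorted(clusters, key=lambda c: c.story_count, reverse=True)[:top_n]


sample = [Cluster(1, 14), Cluster(2, 52), Cluster(3, 10), Cluster(4, 31)]

for cluster in largest_clusters(sample):
    print(cluster.id, cluster.story_count)
```

With a real response you would pass in the list of clusters it contains instead of `sample`.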

Retrieving clusters using the Trends endpoint

You can also use the Trends endpoint to retrieve cluster information. The Trends endpoint allows you to filter clusters based on the stories they contain. For example, you can filter for clusters containing stories that have a specific category label, mention a specific entity, or even have a specific sentiment score.

This method is useful for identifying events about a specific topic or entity, in real-time.

The Trends endpoint returns the IDs of clusters, sorted by the count of stories associated with each cluster. Once you have the cluster IDs, you can get the metadata for each cluster from the Clusters endpoint and then get the stories for each cluster from the Stories endpoint.

Two things to note about how stories relate to clusters:

  • A story will only ever belong to one cluster
  • The relationship between a story and its cluster does not change: the story will not be reassigned to another cluster at a later time

Keep in mind that with this method you are restricted to the top 100 clusters (by cluster size) for your query. If you are running a process that requires real-time monitoring, make sure your query is very specific and covers a small enough time interval to retrieve all of the relevant clusters.

from __future__ import print_function
import aylien_news_api
from aylien_news_api.rest import ApiException
from pprint import pprint

configuration = aylien_news_api.Configuration()

# Configure API key authorization: app_id
configuration.api_key['X-AYLIEN-NewsAPI-Application-ID'] = 'YOUR_APP_ID'

# Configure API key authorization: app_key
configuration.api_key['X-AYLIEN-NewsAPI-Application-Key'] = 'YOUR_API_KEY'
configuration.host = "https://api.aylien.com/news"

# Create an instance of the API class
api_instance = aylien_news_api.DefaultApi(aylien_news_api.ApiClient(configuration))


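def get_cluster_from_trends():
    """
    Returns the IDs of the clusters whose stories match the query
    """
    response = api_instance.list_trends(
        field='clusters',
        # IPTC subject code 11000000 is the "politics" category
        categories_taxonomy='iptc-subjectcode',
        categories_id=['11000000'],
        published_at_end='NOW-12HOURS',
        # Only stories that mention the United States Congress
        entities_body_links_dbpedia=[
            'http://dbpedia.org/resource/United_States_Congress']
    )

    # Each trend's value is a cluster ID
    return [item.value for item in response.trends]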


def get_cluster_metadata(cluster_id):
    """
    Returns the representative story, number of stories, and time value for a given cluster
    """
    response = api_instance.list_clusters(
        id=[cluster_id]
    )

    clusters = response.clusters

    if clusters is None or len(clusters) == 0:
        return None

    first_cluster = clusters[0]

    return {
        "cluster": first_cluster.id,
        "representative_story": first_cluster.representative_story,
        "story_count": first_cluster.story_count,
        "time": first_cluster.time
    }


def get_top_stories(cluster_id):
    """
    Returns 3 stories associated with the cluster from the highest-ranking publishers
    """
    response = api_instance.list_stories(
        clusters=[cluster_id],
        sort_by="source.rankings.alexa.rank.US",
        per_page=3
    )

    return response.stories


cluster_ids = get_cluster_from_trends()

for cluster_id in cluster_ids:
    metadata = get_cluster_metadata(cluster_id)
    if metadata is not None:
        stories = get_top_stories(cluster_id)
        metadata["stories"] = stories
        pprint(metadata)
    else:
        print("{} empty".format(cluster_id))

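
One way to work within the top-100 limit is to split a long monitoring period into shorter windows and run the Trends query once per window. The helper below is a minimal sketch of that idea. `time_windows` is a hypothetical function, and the sketch assumes that the `published_at_start` and `published_at_end` parameters accept ISO 8601 timestamps, which you should confirm against the documentation.

```python
from datetime import datetime, timedelta, timezone


def time_windows(start, end, step):
    """Yield consecutive (window_start, window_end) pairs covering start..end."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    cursor = start
    while cursor < end:
        # The final window is cut short so it never runs past `end`
        window_end = min(cursor + step, end)
        yield cursor, window_end
        cursor = window_end


# Illustrative 12-hour period split into 3-hour windows
end = datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc)
start = end - timedelta(hours=12)

for window_start, window_end in time_windows(start, end, timedelta(hours=3)):
    # Each pair could be passed to the Trends query as
    # published_at_start / published_at_end (assumed to accept ISO 8601 strings)
    print(window_start.isoformat(), window_end.isoformat())
```

Shorter windows mean more requests, so the window size is a trade-off between coverage and the number of calls you make.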
Retrieving clusters using the Stories endpoint

With the Stories endpoint, you can gather a stream or list of news articles that match your query and group them by cluster using the cluster ID that’s attached to the story. This is useful for deduplication of news articles when working with a news stream or ticker, because you can “collapse” stories in a real-time news stream that you are monitoring.

The snippet below, for example, retrieves stories published in the last 6 hours that mention Donald Trump in their title and groups them by the cluster each has been assigned to.

from __future__ import print_function
import aylien_news_api
from aylien_news_api.rest import ApiException
from pprint import pprint

configuration = aylien_news_api.Configuration()

# Configure API key authorization: app_id
configuration.api_key['X-AYLIEN-NewsAPI-Application-ID'] = 'YOUR_APP_ID'

# Configure API key authorization: app_key
configuration.api_key['X-AYLIEN-NewsAPI-Application-Key'] = 'YOUR_API_KEY'
configuration.host = "https://api.aylien.com/news"

# Create an instance of the API class
api_instance = aylien_news_api.DefaultApi(aylien_news_api.ApiClient(configuration))


def get_stories():
    """
    Returns a list of story objects
    """
    response = api_instance.list_stories(
        title='Donald Trump',
        published_at_start='NOW-6HOURS',
        per_page=100
    )

    return response.stories


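stories = get_stories()
clustered_stories = {}

for story in stories:
    # Skip stories that have not been assigned to a cluster
    if len(story.clusters) > 0:
        cluster = story.clusters[0]
        # Start a new list the first time a cluster is seen, otherwise append to it
        if cluster not in clustered_stories:
            clustered_stories[cluster] = [story.title]
        else:
            clustered_stories[cluster].append(story.title)

# Print each cluster ID, its story count, and the title of its first story
for cluster in clustered_stories:
    print(cluster, len(
        clustered_stories[cluster]), clustered_stories[cluster][0])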
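
The same cluster IDs can be used to “collapse” a stream so that only the first story from each cluster is shown. The sketch below works on stand-in story objects rather than a live response. It assumes each story carries a `clusters` list, as in the snippet above, and `collapse_by_cluster` is a hypothetical helper name. Stories with no cluster are passed through unchanged.

```python
from collections import namedtuple

# Stand-in for story objects from the Stories endpoint (illustrative data only)
Story = namedtuple("Story", ["title", "clusters"])


def collapse_by_cluster(stories):
    """Keep the first story seen for each cluster, preserving stream order."""
    seen_clusters = set()
    collapsed = []
    for story in stories:
        if not story.clusters:
            # No cluster assigned: nothing to deduplicate against
            collapsed.append(story)
            continue
        cluster_id = story.clusters[0]
        if cluster_id not in seen_clusters:
            seen_clusters.add(cluster_id)
            collapsed.append(story)
    return collapsed


stream = [
    Story("Senate passes bill", [7]),
    Story("Bill clears the Senate", [7]),
    Story("Unclustered story", []),
    Story("Markets rally", [9]),
]

for story in collapse_by_cluster(stream):
    print(story.title)
```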

Start using the real-time clustering features

Real-Time Clustering is available for Advanced and Enterprise users or users on a 14-day free trial.

If you’re an existing News API user and would like to get started with Real-time Clustering, feel free to get in touch with your account manager by filling out our contact sales form.

