Product

Introduction

Last week we showed you how to search and sort news stories by video and image volume and today we are going to introduce you to another recently added News API feature – Clustering.


What is Clustering?

Clustering is a data mining technique used to group similar objects together. It’s an unsupervised classification method, meaning that data is classified without any pre-trained labels or categories, and it is used in exploratory data analysis to find hidden patterns or groupings in data. It’s a common technique in text mining.

In relation to our News API, clustering allows you to group similar news stories that are returned from your specific search or query, without the need for pre-trained classifiers or labels.

As Donald Hebb’s theory is often summarized, “cells that fire together wire together.” The principle is relevant to clustering because it describes how the brain uses coincidence for association. In the same way, similar or coincidental News API stories are clustered using a measure of similarity and the semantic importance of words and phrases within the content.

source: http://sherrytowers.com/2013/10/24/k-means-clustering/


Clustering with the News API

Clustering is now available for the following News API endpoints:

  • /stories
  • /related_stories
  • /coverages

We’ve also added three algorithms that you can choose from when clustering, depending on the type of data and format of results that you require.

1. STC (Suffix Tree Clustering)

STC is a linear time clustering algorithm (linear in the size of the document set), which is based on identifying phrases that are common to groups of documents. A phrase is an ordered sequence of one or more words. This algorithm treats documents as a string of words rather than a collection of words and thus operates using the proximity of information between words. Learn more

2. K-means

K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. Learn more

3. Lingo

Lingo is an algorithm for clustering search results that emphasizes cluster description quality. It uses algebraic transformations of the term-document matrix and frequent phrase extraction using suffix arrays. Learn more
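Of the three algorithms, K-means is the easiest to sketch in a few lines. The snippet below is a minimal pure-Python illustration of the “nearest mean” idea on a handful of 2D points. It is not the News API’s implementation; the points, the starting centroids and the function name are made up for this example.

```python
from math import dist


def kmeans(points, centroids, iterations=10):
    """Tiny k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of the points assigned to it."""
    centroids = list(centroids)
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every point.
        labels = [
            min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            for p in points
        ]
        # Update step: recompute each centroid as the mean of its cluster.
        for i in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Made-up data: two obvious groups of 2D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)     # cluster index for each point
print(centroids)  # final cluster means
```

With real news stories the “points” would be numeric representations of the text rather than hand-picked coordinates, but the assign-then-update loop is the same.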


Why is clustering news stories useful?

Topical Analysis

Clustering enables you to group specific topic areas that may otherwise not be available as entities, concepts or classification labels.

Deduplication

Because Clustering groups stories with semantic similarities, it enables users to extract the most relevant story from each cluster, thus performing a deduplication of sorts. To do this, you can take one story ID from each cluster according to your own search requirements. For example, you could take the story ID with the most social shares, the most recently published, or the ID with the highest volume of images/videos.
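As a rough sketch of this deduplication step, the snippet below keeps one story ID per cluster, chosen by social shares. The structure of `clusters` and the field names `id` and `shares` are hypothetical and only stand in for whatever your own clustered results look like.

```python
# Hypothetical clustered results: each cluster has a label and a list of stories.
clusters = [
    {
        "label": "Applying for Irish Passports",
        "stories": [{"id": 101, "shares": 40}, {"id": 102, "shares": 310}],
    },
    {
        "label": "Financial Services",
        "stories": [{"id": 201, "shares": 75}, {"id": 202, "shares": 12}],
    },
]


def pick_representatives(clusters, key):
    """Return one story ID per cluster label, chosen by the given ranking key."""
    return {c["label"]: max(c["stories"], key=key)["id"] for c in clusters}


# Keep the most-shared story from each cluster.
representatives = pick_representatives(clusters, key=lambda story: story["shares"])
print(representatives)
```

Swapping the `key` function for publication date or image/video count gives the other selection strategies mentioned above.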

Narrowing your search by cluster

Clustering is a great way to narrow News API search results and dive even deeper into specific areas or topics of interest. By taking the returned clusters of stories, you can use their labels (by appending them to your query) to further narrow your search. This will lead to results being returned from the API that only belong to a specific cluster.

As a use case example, consider a news application. Grouping stories into clusters by similarity makes it easier to present an end-user with intuitively grouped news stories. For example, a user can choose a specific group (cluster) that interests them in their news app and drill down into it, seeing only results from their chosen cluster/topic.

As an example, let’s take the search query;

“Brexit” AND “Ireland”

Here is a sample of returned stories;

Perhaps now we are interested in exploring further the topic of Irish passport applications relevant to our original search. We then create a new search;

“Brexit” AND “Ireland” AND “Irish Passports”

This produces more specific results and clusters relating to our chosen subject area – the spike in Irish passport applications relevant to Brexit and Ireland and relevant stories.


Examples

To show you examples of clusters that are generated from a variety of specific searches, we’ve chosen two topics that are currently popular in world news and selected the top clusters from each search query.

Brexit AND Ireland

As one of Britain’s biggest trading partners, the potential impact of Brexit on the Irish people and economy has been a hot topic of conversation since the poll results were released last week.

Below we have embedded the API call and JSON results for the “Brexit AND Ireland” search query, using the lingo algorithm and only returning English language stories;

Query:

Results:

We’ve then listed 5 of the top clusters returned and used a simple column chart to visualize the grouping of the different stories and the labels they were automatically assigned.

  1. Investment in Ireland
  2. Peace Deal
  3. Risk
  4. Applying for Irish Passports
  5. Financial Services

Cluster Story Volumes – Brexit AND Ireland

Perhaps unsurprisingly, the concept of ‘Investment in Ireland’ represents the largest cluster. With the highest percentage of Irish exports going to Britain, the concern here is certainly warranted.

Two other stand-out clusters are ‘Peace Deal’ and ‘Applying for Irish Passports’. Without getting too deep into the politics, Brexit would result in the island of Ireland containing the EU-member Republic and the non-EU North. As for the passport cluster, many Britons have been scrambling to see if they are entitled to an Irish passport, in the hope that they can continue to travel freely within the EU. Google search trends saw a sharp spike in Irish passport-related searches immediately after the Brexit results.

Olympics AND Golf

For the first time since 1904, golf will be contested at the Olympic Games in Brazil this year. However, the future of the sport at the games is already in doubt as a number of high-profile golfers are withdrawing or threatening to do so, citing Zika Virus fears.

Below we have embedded the API call and JSON results for the “Olympics AND Golf” search query, using the lingo algorithm and only returning English language stories;

Query:

Results:

For this example we’ve taken the top six clusters from the “Olympics AND Golf” search query;

  1. Jordan Spieth
  2. Louis Oosthuizen of South Africa
  3. Zika Virus Fears
  4. Day’s Withdrawal
  5. Wins on the PGA Tour
  6. Olympic Success

Cluster Story Volumes – Olympics AND Golf

For the non-golfers among you, the three mentioned players (Day, Spieth, Oosthuizen) are among a large group of golfers that are unlikely to travel to Brazil this summer. Day and Spieth are ranked #1 and #2 in the world respectively, so you can really see the gravity of the situation. What should be a celebration of the return of golf to the Olympics has become a bit of a media circus, as we can see from our clusters above, where the topics of player withdrawals and the Zika Virus have taken center stage.


Usage Tips

Clustering is disabled by default, so you will need to pass the cluster parameter to one of the three endpoints and set it to true.

Lingo is the default algorithm so you will need to modify cluster.algorithm to either ‘stc’ or ‘kmeans’ should you wish to use either one.

For the best quality clustering results and to avoid irrelevant labels, we recommend that you set your language of choice.
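Putting these tips together, a clustering request URL might be assembled as in the sketch below. The `cluster` and `cluster.algorithm` parameters come from this post; the `text` and `language` parameter names and the `clustering_url` helper are assumptions made for illustration, so check the News API documentation for the exact names. Credentials are left out, and no request is actually sent.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.newsapi.aylien.com/api/v1/stories"


def clustering_url(query, algorithm="lingo", language="en"):
    """Build a /stories URL with clustering switched on."""
    params = {
        "text": query,                   # assumed name for the search text parameter
        "language": language,            # assumed name for the language parameter
        "cluster": "true",               # clustering is disabled by default
        "cluster.algorithm": algorithm,  # 'lingo' (default), 'stc' or 'kmeans'
    }
    return f"{BASE_URL}?{urlencode(params)}"


url = clustering_url("Brexit AND Ireland", algorithm="stc")
print(url)
```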


Conclusion

Using the News API’s clustering features, you can easily group or cluster stories that are semantically related, a task that would usually only be possible with a significant amount of code and data mining knowledge.






Introduction

In this post, we are going to introduce you to the Support Vector Machine (SVM) machine learning algorithm. We will follow a similar process to our recent post Naive Bayes for Dummies; A Simple Explanation by keeping it short and not overly-technical. The aim is to give those of you who are new to machine learning a basic understanding of the key concepts of this algorithm.


Support Vector Machines – What are they?

A Support Vector Machine (SVM) is a supervised machine learning algorithm that can be employed for both classification and regression purposes. SVMs are more commonly used in classification problems and as such, this is what we will focus on in this post.

SVMs are based on the idea of finding a hyperplane that best divides a dataset into two classes, as shown in the image below.


 

Support Vectors

Support vectors are the data points nearest to the hyperplane, the points of a data set that, if removed, would alter the position of the dividing hyperplane. Because of this, they can be considered the critical elements of a data set.


What is a hyperplane?

As a simple example, for a classification task with only two features (like the image above), you can think of a hyperplane as a line that linearly separates and classifies a set of data.

Intuitively, the further from the hyperplane our data points lie, the more confident we are that they have been correctly classified. We therefore want our data points to be as far away from the hyperplane as possible, while still being on the correct side of it.

So when new testing data is added, whatever side of the hyperplane it lands will decide the class that we assign to it.

How do we find the right hyperplane?

Or, in other words, how do we best segregate the two classes within the data?

The distance between the hyperplane and the nearest data point from either set is known as the margin. The goal is to choose a hyperplane with the greatest possible margin between the hyperplane and any point within the training set, giving a greater chance of new data being classified correctly.
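To make the margin concrete, the sketch below measures it for two candidate separating lines over a tiny made-up 2D dataset. It only compares two hand-picked lines; it does not search for the best one, which is what training an SVM does. The points and line coefficients are invented for illustration.

```python
from math import hypot


def distance_to_line(point, w, b):
    """Perpendicular distance from a 2D point to the line w[0]*x + w[1]*y + b = 0."""
    return abs(w[0] * point[0] + w[1] * point[1] + b) / hypot(w[0], w[1])


def margin(points, w, b):
    """The margin is the distance from the line to its nearest data point."""
    return min(distance_to_line(p, w, b) for p in points)


# Made-up data: the first two points are one class, the last two the other.
points = [(1, 1), (2, 1), (4, 4), (5, 3)]

# Both candidate lines separate the two classes, but with different margins.
margin_a = margin(points, w=(1, 1), b=-5.5)  # the line x + y = 5.5
margin_b = margin(points, w=(1, 0), b=-2.5)  # the line x = 2.5

print(margin_a, margin_b)
```

Line A leaves more room between itself and the nearest point than line B does, so it is the better choice of the two by the criterion described above.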

 


But what happens when there is no clear hyperplane?

This is where it can get tricky. Data is rarely ever as clean as our simple example above. A dataset will often look more like the jumbled balls below which represent a linearly non separable dataset.

 


 

In order to classify a dataset like the one above, it’s necessary to move away from a 2D view of the data to a 3D view. Explaining this is easiest with another simplified example. Imagine that our two sets of colored balls above are sitting on a sheet and this sheet is lifted suddenly, launching the balls into the air. While the balls are up in the air, you use the sheet to separate them. This ‘lifting’ of the balls represents the mapping of data into a higher dimension. This is known as kernelling. You can read more on kernelling here.


 

Because we are now in three dimensions, our hyperplane can no longer be a line. It must now be a plane as shown in the example above. The idea is that the data will continue to be mapped into higher and higher dimensions until a hyperplane can be formed to segregate it.
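The same ‘lifting’ idea can be shown one dimension lower, which keeps the example short. In the sketch below, points on a line cannot be split by a single cut, but after adding a second coordinate (the square of the first) a simple threshold separates them. The data and the explicit mapping are made up for illustration; real kernel methods achieve this effect without computing the new coordinates explicitly.

```python
# Made-up 1D data: one class sits in the middle, the other on both sides of it.
inner = [-1.0, 0.0, 1.0]
outer = [-3.0, -2.5, 3.0, 4.0]


def separable_by_threshold(a, b):
    """True if a single cut point puts all of a on one side and all of b on the other."""
    return max(a) < min(b) or max(b) < min(a)


def lift(x):
    """Map a 1D point into 2D by adding its square as a second coordinate."""
    return (x, x * x)


before = separable_by_threshold(inner, outer)
after = separable_by_threshold(
    [lift(x)[1] for x in inner],  # compare on the new, lifted coordinate
    [lift(x)[1] for x in outer],
)
print(before, after)  # prints: False True
```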


Pros & Cons of Support Vector Machines

Pros

  • Accuracy
  • Works well on smaller cleaner datasets
  • It can be more efficient because it uses a subset of training points

Cons

  • Isn’t suited to larger datasets as the training time with SVMs can be high
  • Less effective on noisier datasets with overlapping classes

SVM Uses

SVM is used for text classification tasks such as category assignment, detecting spam and sentiment analysis. It is also commonly used for image recognition challenges, performing particularly well in aspect-based recognition and color-based classification. SVM also plays a vital role in many areas of handwritten digit recognition, such as postal automation services.

There you have it, a very high level introduction to Support Vector Machines. If you’d like to dive deeper into SVM we recommend checking out (need to find a link to a video or a more in depth blog).

 





Introduction

Today we want to show you a simple yet effective feature that we have just added based on a request from one of our brilliant users – finding and sorting stories by the number of images and videos they contain.

Finding stories by image and video count

No matter what news related project you’re working on, be it a particular research project or a news aggregation solution, the ability to search and sort stories based on the media (images and videos) contained in them is extremely useful. This capability means you can provide more flexibility and insight on the content you curate and push to stakeholders or app users.

We’ve added a number of new parameters that make it easier to understand the makeup of stories you collect or search for. Whether you are searching for content that is image and video-rich or content that is void of either media type, you now have complete control with the ability to find stories with specific image and video counts.

For example, you may wish to only retrieve stories that contain at least one video. Or perhaps you want stories that contain both multiple videos and images. We could go on and on with examples, but we’re sure you get the gist!

Here are the new search parameters added to the /stories, /related_stories, /coverages, /time_series, /trends and /histograms endpoints:

  • media.images.count.min
  • media.images.count.max
  • media.videos.count.min
  • media.videos.count.max

This latest update gives you increased flexibility and control over your News API searches. It also increases your analysis options, as the number of images/videos can now be combined with other search parameters and trend analyses to give you greater insights than ever before.

Simple Examples

Note: you need to add your AYLIEN News API credentials for the sample calls to work. If you don’t have an account, you can sign up here.

Finding stories with media

Find stories containing a minimum of 2 images –

https://api.newsapi.aylien.com/api/v1/stories?media.images.count.min=2

Find stories containing exactly 2 images and minimum of 1 video –

https://api.newsapi.aylien.com/api/v1/stories?media.images.count.min=2&media.images.count.max=2&media.videos.count.min=1

Sorting results by image and video count

Sorting your results by image or video count is also possible and it couldn’t be easier.

We’ve added the following options for the `sort_by` parameter in the /stories endpoint:

  • media.images.count
  • media.videos.count

Sample query;

Find stories that contain ‘Obama’ in the title, published in the past 24 hours and sorted by image count –

https://api.newsapi.aylien.com/api/v1/stories?title=obama&published_at.start=NOW-1DAY&sort_by=media.images.count
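If you would rather build these URLs in code than by hand, a minimal sketch is shown below. The endpoint and parameter names are the ones used in the sample calls in this post; the `stories_url` helper is just a name chosen for this example, and credentials still need to be added before sending a request.

```python
from urllib.parse import urlencode

STORIES_URL = "https://api.newsapi.aylien.com/api/v1/stories"


def stories_url(params):
    """Build a /stories URL from a dict of query parameters."""
    return f"{STORIES_URL}?{urlencode(params)}"


# Exactly 2 images and at least 1 video.
exactly_two_images_with_video = stories_url({
    "media.images.count.min": 2,
    "media.images.count.max": 2,
    "media.videos.count.min": 1,
})

# 'Obama' in the title, past 24 hours, sorted by image count.
obama_by_image_count = stories_url({
    "title": "obama",
    "published_at.start": "NOW-1DAY",
    "sort_by": "media.images.count",
})

print(exactly_two_images_with_video)
print(obama_by_image_count)
```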

Interesting use case

Comparing image and video presence by news category

We thought it would be interesting to take three news categories and compare the image and video volume for each, to see if there was much variance in how often each media type is used in each category. The three categories we chose were Health & Fitness, Technology and Personal Finance.

We collected a total of about 3M stories across 3 categories from the past 30 days. Using the new parameters for image and video counts we produced the following results, starting with images;

% of stories containing 0, 1 or 2+ images

 

The charts above tell us that the majority of news content contains at least one image. On average, across these three categories, 84% of stories contain at least 1 image, while just 16% contain no image at all.

Videos:

With the recent sharp rise in video content popularity it is perhaps surprising to see the large number of stories published with no videos;

  • Technology 98.09%
  • Health & Fitness 99.47%
  • Personal Finance 99.9%

The chart below shows the percentage of stories from each category that did contain videos.

 

 

A few thoughts spring to mind immediately;

  1. As we mentioned, the number of news stories containing videos is surprisingly low.
  2. Stories from the Technology category contain considerably more video content than those from the Personal Finance and Health & Fitness categories.
  3. Video content in Personal Finance-related content is almost non-existent. We saw 0% of our results in this category containing 2 or more videos. In fact, out of just under 70,000 stories, only 73 contained at least 1 video.

Conclusion

Finding, sorting and analyzing news content by the media contained in stories adds an interesting and powerful new dimension to our News API. We hope this post has given you a good introduction to these new parameters and sparked some ideas in your mind as to how you can take advantage of them in your own projects and apps.

We always take feedback and feature requests from our users on board, and we were thrilled to deliver this update so soon after receiving the request (kudos to our awesome Engineers and team!). We have plenty more to come, but if you have any requests we would love to hear them.

Stay tuned as we will be taking a more in-depth look at each of these latest parameters soon.

 





Product

Introduction

In a world that is increasingly focused on SLAs, KPIs and ROIs, the role of Customer Support and Customer Success, particularly in SaaS companies, has never been more important to the overall performance of an organization.

These ‘departments’ are no longer siloed entities within organizations and a support team is no longer a nice-to-have addition to a product or service offering. Rather, organizations are now focused on building customer success-centric processes throughout, as the shift from nice-to-have to need-to-have becomes apparent. A customer success-centric company focuses on every detail of, and interaction with, their customers to ultimately increase brand loyalty and reduce churn.

How can Text Analysis help?

Using Machine Learning and Natural Language Processing techniques, it’s easier than ever to understand and analyze all your customer interactions, whether it is direct via email, feedback forms, NPS surveys, live chat or mentions on social media channels.

To make things easy for you, we’ve put together a list of Text Analysis features (or endpoints as we call them) that are being widely used by our customers to boost their support offerings. So if you’re new to using our API this will help you get up to speed quickly;

  1. Semantic Labelling
  2. Entity & Concept Extraction
  3. Sentiment Analysis

Note: We have a really cool and easy-to-use Demo showcasing all of these features. You can test them out with your own text data (articles, tweets, URLs, etc.) or try some of our sample queries.

Let’s begin by taking a look at the first feature on our list, Semantic Labelling.

 

1. Semantic Labelling

Because of the multi-channel nature of customer support, you tend to have customer queries and requests coming in from a variety of sources – email, social media, feedback forms, live chat. Speed of response is a key performance metric for many organizations and so routing customer queries to the relevant department, in as few steps as possible, can be crucial.

An obvious solution is to employ a gatekeeper of sorts, who manually alerts each individual department when a customer request is received from one of the aforementioned channels. But it’s 2016, and we believe the role of human-switchboard should be a defunct one!

Wouldn’t it be easier if customer queries were automatically routed to the correct department or individual, without any human interaction?

Of course it would! Semantic Labelling selects which label best represents a piece of text based on semantic similarity. By providing specific labels to the various departments within your organization, support queries can be analyzed, tagged and routed accordingly.

Here’s a quick example. A user of one of your products is unable to find documentation on your website. They open up your feedback form and ask the question;

 

 

You have 3 main departments within your organization that respond to customer queries; Support, Marketing and Sales. The Semantic Labelling feature will automatically analyze this question and order your labels (departments) accordingly;

 

 

In this case, Sales or Marketing do not need to be notified of such a request, so it can be automatically routed to the Support team.
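As a rough sketch of the routing step, the snippet below takes scored labels for a query and picks a destination queue. The scores, queue names, threshold and fallback queue are all hypothetical and are not taken from the Semantic Labelling response format; they only illustrate the idea of routing on the best-matching label.

```python
# Hypothetical scored labels for one customer query: (label, similarity score).
scored_labels = [("Sales", 0.12), ("Support", 0.71), ("Marketing", 0.17)]

# Hypothetical mapping from department label to a team queue.
queues = {
    "Support": "support-queue",
    "Marketing": "marketing-queue",
    "Sales": "sales-queue",
}


def route(scored_labels, queues, min_score=0.5, fallback="triage-queue"):
    """Send the query to the best-matching department, or to a fallback
    queue when no label scores highly enough to be trusted."""
    label, score = max(scored_labels, key=lambda pair: pair[1])
    return queues[label] if score >= min_score else fallback


destination = route(scored_labels, queues)
print(destination)
```

The fallback queue is a design choice for this sketch: it keeps low-confidence matches in front of a person instead of sending them to the wrong team.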

Benefits

By getting customer queries to the right person within your organization ASAP, you significantly reduce the risk of it being lost or sent to the wrong department or individual. The main benefit here is streamlining and automating the support process to ensure your customer gets a response in the shortest time possible.

 

2. Entity & Concept Extraction

Customer support request content can contain a wealth of mentioned entities and values that can provide some really interesting and important information. The challenge is mining this information, particularly at scale. To best understand this content, you want to know the who, the what and the how much from each and every customer query or request.

Entity Extraction extracts named entities (people, organizations, products and locations) and values (URLs, emails, telephone numbers, currency amounts and percentages) mentioned in any body of text.

 

 

Concept Extraction extracts named entities mentioned in your support requests and cross-links them to DBpedia and Linked Data entities, for greater understanding.

Concept extraction also disambiguates similarly named entities. Take Apple, for example. If it is mentioned, is it a reference to the company or the fruit? Concept extraction analyzes the content around the word and, through Machine Learning and Natural Language Processing (NLP) techniques, performs the disambiguation.

 

Benefits

By extracting entities and concepts from your support requests and analyzing at scale you can gain some incredibly useful insights and trends and get answers to key questions, such as;

  • What words, phrases, brands, products are mentioned most?
  • Has there been a spike in mentions of a particular product or service?
  • Are they mentioning or comparing us to other brands?
  • Are they mentioning specific people or places in relation to our brand or product? Perhaps our service is failing in certain areas.
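Once entities have been extracted, answering the first of these questions comes down to counting. The sketch below tallies mentions across a few support requests; the entity lists are made up and stand in for whatever your extraction step returns.

```python
from collections import Counter

# Hypothetical extracted entities, one list per support request.
extracted = [
    ["Acme Router", "Dublin"],
    ["Acme Router", "Acme App"],
    ["Acme App", "Acme Router", "London"],
]

# Count every mention across all requests.
mentions = Counter(entity for request in extracted for entity in request)
top_two = mentions.most_common(2)
print(top_two)
```

Running the same count per day or per week is one simple way to spot a spike in mentions of a particular product.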

 

3. Sentiment Analysis

At the heart of our customer support efforts is the happiness of our customers. Naturally, we want them to love our product or service, and to be cared for when something goes wrong or doesn’t quite work as intended. This, of course, is the overall goal and it is one that will affect all areas of your organization as it directly impacts brand loyalty, customer retention and churn levels. But how exactly do we measure happiness and understand how our customers feel about us, our products, our brands etc? Sentiment Analysis.

Sentiment Analysis detects the sentiment of a body of text in terms of polarity (positive or negative) and subjectivity (subjective or objective). When used, particularly at scale, it can show you how people feel towards the topics that are important to you – particularly your brand and product offerings.

Whether you are analyzing feedback forms, chat transcripts, emails or social media, Sentiment Analysis will help you hear the true voice of the customer and how they feel about specific topics or areas of interest to you.

 

Benefits

Analyzing the sentiment towards your brand can help you decrease churn and improve customer support by uncovering and proactively working on improving negative trends. It can help show you what you are doing wrong before too much damage has been done, but also quickly show you what you are doing right and should therefore continue doing.

Customer feedback containing significantly high levels of negative sentiment can be relayed to Product and Dev teams to help them focus their time and efforts more accordingly.

Aspect-Based Sentiment Analysis

While Sentiment Analysis provides fantastic insights, the overall sentiment of customer feedback or comments on social media won’t always pinpoint the root cause of the author’s frustrations, or praise.

This is where Aspect-Based Sentiment Analysis (ABSA) comes in. With ABSA, you can dive deeper and analyze the sentiment toward industry-specific aspects.

 

 

Customer requests, feedback forms, NPS surveys and social media posts, such as tweets, Facebook posts and reviews, may contain fine-grained sentiment about different aspects (e.g. a product or service) that are mentioned in the content. For instance, a review about an airline may contain opinionated sentences about its staff, food, punctuality and value. This information can be highly valuable for understanding customers’ opinions about a particular service or product.

Let’s put this example to the test. We grabbed an airline review and ran it through our ABSA endpoint. Here’s the review:

As a backpacking student Ryanair was really my only option (being the cheapest!). I think their bad reputation is built mostly on people who simply don’t read the terms of their flight. It’s a cheap A-B service that saves you a tonne of cash if you stick to their rules. I actually found the staff to be very friendly and helpful. No complaints there! What I will complain about is the food on offer. I understand plane food is bad in general but this is another level of bad. The food is disgusting a overpriced unfortunately. Also, the seats are a uncomfortable for people like me who are over 6 feet tall. It was difficult to relax at times. My flight arrived on time which was great. This meant I made my connecting train. Happy days! Overall I was very pleased with Ryanair. The price I paid was really cheap in comparison to the others on offer and as someone looking to save money, this was really all I was after – a cheap flight.

And here are the results;

As you can see from the results above, the ABSA endpoint automatically pulls airline industry-specific aspects (such as food, staff, punctuality, comfort and value), performs Sentiment Analysis on each aspect and gives sample sentences to indicate where the score was derived from.

Benefits

ABSA helps you locate aspect-specific issues and proactively resolve them before they snowball into a more serious issue. It enables you to pinpoint failings in support offerings or product/service communications by monitoring the levels of customer sentiment and generating trend reports.

ABSA is currently available for four industries – Airlines, Hotels, Cars and Restaurants – and we are already working on expanding this list. Here are the aspects covered for our initial four industries;

 

Conclusion

We hope this post gives you some inspiration and helps you to understand the various Text Analysis features available to you and how they can help you with your customer support efforts.

 





In recent times deep learning techniques have become more and more prevalent in NLP tasks; just take a look at the list of accepted papers at this year’s NAACL conference, and you can’t miss it. We’ve now completely moved away from traditional NLP approaches to focus on deep learning and how it can be leveraged in language problems, as successfully as it has in both image and audio recognition tasks.

One of these approaches that has seen great success and is backed by a wave of research papers and funding is the concept of word embeddings.

Word embeddings

For those of you who aren’t familiar with them, word embeddings are essentially dense vector representations of words.

Similar to the way a painting might be a representation of a person, a word embedding is a representation of a word, using real-valued numbers. They are an arrangement of numbers representing the semantic and syntactic information of words and their context, in a format that computers can understand.

Here’s a nice little primer you should read if you’re looking for a more in depth description: http://sebastianruder.com/word-embeddings-1/index.html

Word embeddings can be trained and used to derive similarities and relations between words. This means that by encoding each word as a vector of real numbers, say 100 or 200 values or even more, so that one set of numbers represents the word “mother” and another represents “father”, we can better understand the context of each word.

 

Source: https://www.tensorflow.org/versions/r0.9/tutorials/word2vec/index.html

Word vectors created through this process manifest interesting characteristics that almost look and sound like magic at first. For instance, if we subtract the vector of Man from the vector of King, the result will be almost equal to the vector resulting from subtracting Woman from Queen. Even more surprisingly, the result of subtracting Walked from Walking almost equates to that of subtracting Swam from Swimming. These examples show that the model has not only learnt the meaning and the semantics of these words, but also the syntax and the grammar to some degree.
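The King/Man and Queen/Woman relationship can be checked with a little vector arithmetic. The three-dimensional vectors below are hand-made toy values, not output from a trained model (real embeddings have hundreds of dimensions), so treat this purely as an illustration of the comparison.

```python
from math import sqrt

# Hand-made toy vectors; real embeddings are learned from large corpora.
vectors = {
    "king":  (0.9, 0.9, 0.30),
    "man":   (0.1, 0.9, 0.20),
    "queen": (0.9, 0.1, 0.35),
    "woman": (0.1, 0.1, 0.20),
}


def subtract(a, b):
    """Element-wise difference of two vectors."""
    return tuple(x - y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


king_minus_man = subtract(vectors["king"], vectors["man"])
queen_minus_woman = subtract(vectors["queen"], vectors["woman"])
similarity = cosine(king_minus_man, queen_minus_woman)
print(similarity)  # close to 1.0, so the two differences are almost the same direction
```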

 

 

Relations between words according to word embeddings

Source: https://www.tensorflow.org/versions/r0.9/tutorials/word2vec/index.html

As our very own NLP Research Scientist, Sebastian Ruder, explains, “word embeddings are one of the few currently successful applications of unsupervised learning. Their main benefit arguably is that they don’t require expensive annotation, but can be derived from large unannotated corpora that are readily available. Pre-trained embeddings can then be used in downstream tasks that use small amounts of labeled data.”

Although word embeddings have almost become the de facto input layer in many NLP tasks, they do have some drawbacks. Let’s take a look at some of the challenges we face with word2vec, probably the most popular and commercialized model used today.

Word2vec Challenges

Inability to handle unknown or OOV words

Perhaps the biggest problem with word2vec is the inability to handle unknown or out-of-vocabulary (OOV) words.

If your model hasn’t encountered a word before, it will have no idea how to interpret it or how to build a vector for it. You are then forced to use a random vector, which is far from ideal. This can particularly be an issue in domains like Twitter where you have a lot of noisy and sparse data, with words that may only have been used once or twice in a very large corpus.
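In code, the fallback described above looks something like the sketch below: a known word returns its trained vector, while an unknown one gets a random vector that carries no meaning. The tiny vocabulary, the vector values and the `lookup` helper are made up for illustration.

```python
import random

# Made-up toy embeddings; a real vocabulary would hold many thousands of words.
embeddings = {
    "good": [0.2, 0.7, 0.1],
    "bad": [-0.3, -0.6, 0.2],
}
DIMENSIONS = 3


def lookup(word, rng):
    """Return (vector, known). Out-of-vocabulary words get a random vector."""
    if word in embeddings:
        return embeddings[word], True
    return [rng.uniform(-1.0, 1.0) for _ in range(DIMENSIONS)], False


rng = random.Random(0)  # seeded so the example is repeatable
vector, known = lookup("gr8", rng)  # a Twitter-style spelling the model never saw
print(known, vector)
```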

No shared representations at sub-word levels

There are no shared representations at sub-word levels with word2vec. For example, you and I might encounter a new word that ends in “less”, and from our knowledge of words that end similarly we can guess that it’s probably an adjective indicating a lack of something, like flawless or careless.

Word2vec represents every word as an independent vector, even though many words are morphologically similar, just like our two examples above.

This can also become a challenge in morphologically rich languages such as Arabic, German or Turkish.
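To see what a shared sub-word representation could draw on, the sketch below splits two words into character n-grams and prints the pieces they have in common. This is the kind of unit that character n-gram models work with; it is not something word2vec does, and the `char_ngrams` helper and the choice of 4-character pieces are assumptions made for this example.

```python
def char_ngrams(word, n=4):
    """Return the set of character n-grams of a word, with < and > marking
    the start and end of the word."""
    padded = f"<{word}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


shared = char_ngrams("flawless") & char_ngrams("careless")
print(sorted(shared))  # prints: ['ess>', 'less']
```

A model that learns vectors for pieces like these can reuse what it knows about “less” when it meets a new word with the same ending.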

Scaling to new languages requires new embedding matrices

Scaling to new languages requires new embedding matrices and does not allow for parameter sharing, meaning cross-lingual use of the same model isn’t an option.

Cannot be used to initialize state-of-the-art architectures

As explained earlier, pre-training word embeddings on weakly supervised or unsupervised data has become increasingly popular, as have various state-of-the-art architectures that take character sequences as input. If you have a model that takes character-based input, you normally can’t leverage the benefits of pre-training, which forces you to randomize embeddings.

Summary

So while the application of deep learning techniques like word embeddings, and word2vec in particular, has brought about great improvements and advancements in NLP, these techniques are not without their flaws.

At AYLIEN, Representation Learning, the wider field that word embeddings come under, is an active area of research. Our scientists are working on better embedding models and approaches for overcoming some of the challenges we mentioned.

Stay tuned for some exciting updates over the next few weeks ;).

 




Text Analysis API - Sign up





Introduction

Our News API has a variety of analysis focused endpoints that allow our users to make sense of the data extracted from news content in a meaningful way. In this post we look at one in particular, /time_series, and how it can be used to uncover trends in news publication times relating to specific topics and keywords.

We will show you how to draw meaningful information from these trends that can help with a number of tasks, including planning of resources and deciding the best times to both publish content and distribute it on social media.

Most popular days to publish news content – Overall

To begin with, we thought it would be interesting to uncover the most popular days for posting news content by analyzing all of our News API sources over a 7 day period. This query returned all results without any search criteria. We grabbed everything that was available.

Looking at the graph below, we can see that publication volumes peak midweek, with Tuesday, Thursday and Wednesday being the most popular days. Monday comes next, only slightly behind.

We begin to see a clear decline in published stories as we approach the weekend, with a 17% drop from Thursday to Friday. The weekend is clearly the least popular time to publish news content, with a decrease in content publication in the region of 60-70% versus midweek days (Monday – Thursday).
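If you want to reproduce this kind of comparison on your own data, a minimal sketch might look like the following. The timestamps are made-up sample data and the helper names are hypothetical; the figures in this post came from our own analysis, not from this snippet.

```python
from collections import Counter
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Made-up publication timestamps: 4 on a Thursday, 3 on a Friday, 1 on a Saturday.
published = (
    ["2016-11-10T10:00:00"] * 4
    + ["2016-11-11T10:00:00"] * 3
    + ["2016-11-12T10:00:00"]
)

def stories_per_weekday(timestamps):
    """Count stories by the weekday on which they were published."""
    return Counter(DAYS[datetime.fromisoformat(ts).weekday()] for ts in timestamps)

def percent_change(before, after):
    """Percentage change from one volume to another (negative means a drop)."""
    return (after - before) / before * 100

counts = stories_per_weekday(published)
drop = percent_change(counts["Thursday"], counts["Friday"])
```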

Why?

You can probably think of many reasons why the majority of news content is published midweek, the most obvious being the traditional Monday-Friday work week which places many of us in front of computer screens during these days. However, the next graph hints at a possible reason.

The graph below represents social media shares by days of the week. We analyzed Facebook shares of our News API content for one week and produced the following graph.

As you can see, our two charts are almost identical, with the trend in Facebook shares almost matching the trend in content publication.

We believe this is not a coincidence. Publishers want to give their content the best possible chance of being shared and circulated on social media, so perhaps they are aligning their publication times with peaks in Internet user activity, which naturally corresponds with peaks in social media activity.

Now that we know which days are most popular, let’s drill down a level and find out which times of the day are most popular.

Most popular time of day to publish content – Overall

Please note: For this analysis we used the GMT timezone.

The graph below represents a 24-hour period starting at 0 (00:00). As you can see, the peak time for content publication is 15:00, followed by the two preceding hours, 14:00 and 13:00.

Understandably, content publication tails off towards midnight and into the AM but it is worth noting the peaks at 20:00 and 22:00 and the sharp rise from 06:00 – 09:00.

By putting these peaks together, we can perhaps see a trend emerging:

  • 06:00 – 09:00 Breakfast/Commute
  • 13:00 – 15:00 Lunch
  • 19:00 – 20:00 Dinner
  • 22:00 Before bed

Think about the most common times that people check social media and this all starts to make sense!
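For anyone curious how an hourly breakdown like this could be computed, here is a rough sketch. It assumes each timestamp carries a UTC offset, and the sample timestamps and the `stories_per_hour` helper are hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone

def stories_per_hour(timestamps):
    """Count stories per hour of day (0-23) after normalising each timestamp to GMT/UTC."""
    hours = Counter()
    for ts in timestamps:
        dt = datetime.fromisoformat(ts).astimezone(timezone.utc)
        hours[dt.hour] += 1
    return hours

# Made-up sample: the second story was stamped in a UTC-5 zone, i.e. 15:30 GMT.
sample = [
    "2016-11-10T15:05:00+00:00",
    "2016-11-10T10:30:00-05:00",
    "2016-11-10T14:45:00+00:00",
]
hourly = stories_per_hour(sample)
peak_hour = hourly.most_common(1)[0][0]
```

Normalising to a single timezone first matters: without it, the same story could land in different hourly buckets depending on where it was published.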

 

But what if we want to be industry-specific with our analysis?

While overall publication patterns give interesting insights, we want to know when content is being published relating to specific people, organizations and keywords. Let’s take a look at three popular topics to see if we can spot any interesting trends emerging on a daily, weekly or monthly basis.

 

Applying specific search criteria

In the hope of producing some varied and interesting insights into publication patterns, we took three very different topics of interest and produced daily, weekly and monthly results using the Time Series endpoint. Here are our three keywords:

  • Nasdaq (finance)
  • Samsung (technology)
  • NFL (sport)

Daily trends

Nasdaq and Samsung follow a very similar pattern with a peak on Tuesday followed by a consistent decline towards the weekend. Perhaps the most notable observation here is the comparatively high volumes of content published on Sundays versus the analysis we did on overall publications.

The NFL search gives us an interesting spike on Thursdays, which bucks the trend from what we have seen so far in our analysis. You don’t have to be a football fan to guess why this spike occurs – Thursday Night Football!

Although the majority of NFL games are played on Sunday (hence the high content volume on Monday for post-game analysis), publishers and marketers know how engaged their readers are with social media midweek and they certainly take advantage of this. Thursday Night Football involves just one game. There are usually 14 games played on a Sunday. That’s 4% more content published on a Thursday even though 86% of the weekly games are played on a Sunday. That’s the power of social media!

Hourly trends

We can now take a look at each of our three keywords and produce graphs to show trends in their publication times. We want to see:

  • Publication time, by keyword
  • EST timezone
  • with previous trends displayed

As you can see from the graph below, we have a blue line representing the average time and volume of publications relating to Nasdaq on a per day basis. The grey lines represent previous trends – these are individual days’ data that, when combined, produce our blue line average.
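The relationship between the grey lines and the blue average can be sketched in a few lines. The numbers here are invented and use only three time slots per day for brevity; `average_and_peak` is a hypothetical helper, not part of the News API.

```python
from statistics import mean

def average_and_peak(days):
    """Given per-day lists of volumes per time slot, return (average line, per-slot maximum)."""
    by_slot = list(zip(*days))  # regroup so each tuple holds one slot across all days
    return [mean(slot) for slot in by_slot], [max(slot) for slot in by_slot]

# Three made-up days (the grey lines), three time slots each.
days = [[2, 10, 4], [4, 14, 6], [3, 12, 20]]
avg, peak = average_and_peak(days)
```

Keeping the per-slot maximum alongside the average is what lets you see a one-off spike (the 20 in the last slot) that the average alone would smooth over.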

Why display previous trends?

Displaying previous trends enables us to see peaks and troughs in content publication relating to specific topics or keywords. This can be particularly useful when planning or budgeting for your API usage. For example, our News API enables you to cap your monthly usage at a volume that suits your needs and budget. Previous trends can help you decide what this cap should be, to ensure you do not miss important story content when a spike in publication occurs. You don’t have to hit your cap each month (you still only get charged for what you use) but it is always a good idea to have that ‘buffer’ in place should a spike occur.
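As a rough illustration of choosing a cap with a buffer, you could start from the highest volume you have seen and add a margin. The volumes, the 20% margin and the `suggest_monthly_cap` helper are all assumptions made for this sketch; pick values that match your own history and budget.

```python
import math

def suggest_monthly_cap(monthly_volumes, buffer_percent=20):
    """Return the highest observed monthly volume plus a percentage buffer, rounded up."""
    peak = max(monthly_volumes)
    return math.ceil(peak * (100 + buffer_percent) / 100)

# Made-up story volumes for three previous months.
cap = suggest_monthly_cap([8200, 9100, 12500], buffer_percent=20)
```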

 

Samsung

While the Nasdaq stock market operates in consistent patterns (as seen by our graph above) our graph for Samsung is a little less clear-cut! OK, it’s clearly a mess, but we can definitely see a trend towards publishing during the morning and into lunch, with a clear tail-off in volume after 1pm.

Again, if I were analyzing Samsung-related content it would be extremely useful to see not only the average trend spikes but also the maximum volumes that have been published during my specified time period.

 

NFL

Our NFL hourly analysis shows two very distinct peaks at roughly 11:00 and 15:00. These peaks represent a 70-80% increase in content publication, versus average, in a single day. Sure, you won’t always know when a superstar quarterback is about to announce his retirement, but at least you can gauge the amount of content that will require analysis the next time such an event occurs.

Monthly trends

And finally, a quick look at monthly trends. We thought it was interesting to note the consistency with Nasdaq compared to the NFL, which appears to be more prone to irregular fluctuations in publication volumes. We can only speculate as to why this is the case, but perhaps some NFL games attract more media attention than others, while Nasdaq is consistently big news in the stock market world.

Conclusion

We hope this analysis has given you an insight into the level of useful data that can be obtained with the Time Series endpoint within the News API. Producing these trend graphs is step one in our overall analysis.

While the graphs and data mentioned above can be extremely useful in establishing volume trends over time, further research and analysis is required to uncover why these trends are occurring and how best to take advantage of them. We’ll cover this in a follow-on post next week.

 




News API - Sign up





Product

Introduction

News aggregation apps and services are changing the way news is discovered and consumed. As a provider or developer of these services, you know that competition is increasing rapidly as each app promises to deliver a more personalized and streamlined experience than those that have come before. Ultimately, the winners and losers in this battle for market share will be decided by who best understands the content they are sharing, and uses this knowledge to provide cutting-edge personalization and reader satisfaction.

Enter Text Analysis – a component of Machine Learning and Natural Language Processing that is playing an increasingly important role in news aggregation.

How can Text Analysis help?

Using Machine Learning and Natural Language Processing techniques, it’s easier than ever before to understand, analyze and segment news content at scale.

To make things easy for you, we’ve put together a list of Text Analysis features (or endpoints as we call them) that are being widely used by our customers for news aggregation:

  1. Article Extraction
  2. Classification / Categorization
  3. Entity & Concept Extraction
  4. Summarization

Note: We have a really cool and easy-to-use online Demo showcasing all of these features – check it out 🙂

You can also check out the case study we put together for one of our News Aggregation customers – Streem

Let’s begin by taking a look at the first feature on our list, Article Extraction.

1. Article Extraction

Webpages can often be cluttered and noisy, awash with an overwhelming amount of images, ads, videos and pop-ups that appear alongside the informative textual content. As a news aggregator, you often need to be able to extract the elements that matter most from a web page – the story itself, the title, the author, publication date – and ignore those that don’t matter to you.

Extracting this information manually from the thousands of stories you analyze every day is simply not sustainable or in any way efficient.

Our Article Extraction endpoint is used to extract the main body of text from articles, web pages and RSS feeds. In doing this, it provides us with the ‘clean’ text data and ignores other media such as images, videos or ads. You get what matters, minus the noise.
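To give a feel for the problem, here is a deliberately naive sketch that keeps only paragraph text from a page using Python's standard `html.parser` module. It is a simplified stand-in for illustration and not how our Article Extraction endpoint works; the `ParagraphExtractor` class and the sample HTML are our own inventions.

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collect the text found inside <p> tags and ignore everything else."""

    def __init__(self):
        super().__init__()
        self._in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> still arrives here while _in_p is set.
        if self._in_p:
            self.paragraphs[-1] += data

def extract_paragraphs(html):
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs

page = (
    '<html><body><div class="ad">Buy now!</div><h1>Title</h1>'
    "<p>First <b>paragraph</b>.</p><script>var x=1;</script>"
    "<p>Second paragraph.</p></body></html>"
)
clean = extract_paragraphs(page)
```

Real pages are much messier than this (text outside paragraph tags, comment sections, boilerplate inside paragraphs), which is why a purpose-built extractor is worth having.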

 

 

Here’s a quick example. On the left we have a web page containing text, images, video ads and links to other stories. On the right we have the results of this same web page after we ran it through Article Extraction.

 

 

Benefits

Using our Article Extraction feature allows you to easily break down a webpage and extract what matters. We extract the main body of text from a web page, the published date, the author and also any image or video present.

This means you can automatically extract what matters from an article while disregarding what doesn’t, meaning you now have a much cleaner, indexed datasource available for further analysis.

2. Classification / Categorization

Categorizing huge amounts of content every day can be a laborious task. Relying on human input to perform this work is an option, but it is an inefficient approach that is prone to error and the potential for less-than-perfect classification is very high. With the sheer amount of content produced daily, you may also need a small army of staff to pull this off!

Wouldn’t it be easier to automatically tag stories based on a taxonomy?

Our Classification by Taxonomy endpoint classifies, or categorizes, a piece of text according to your choice of taxonomy, either IPTC Subject Codes or IAB QAG.

  • IPTC News Codes – International standard for categorizing news content
  • IAB QAG – The Interactive Advertising Bureau’s quality guidelines for classifying ads

 

We took this article about the Tesla Model S from the TechCrunch website and received the following classification results:

 

 

What you can see in the image above is the IPTC ID code and label for the Automotive and Electric Vehicle categories, along with our confidence that it is a correct classification. A score of 1 reflects complete confidence in the results. We provide this score so you can set your own confidence threshold. For example, you may want to flag results below a certain score for human analysis.
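Applying your own threshold could look something like this sketch. The `label` and `score` field names, the example scores and the 0.7 threshold are all hypothetical; check the actual response format in the documentation before relying on them.

```python
def split_by_confidence(categories, threshold=0.7):
    """Separate classification results into accepted ones and ones flagged for human review."""
    accepted = [c for c in categories if c["score"] >= threshold]
    review = [c for c in categories if c["score"] < threshold]
    return accepted, review

# Made-up results shaped like a list of category dictionaries.
results = [
    {"label": "automotive", "score": 0.95},
    {"label": "electric vehicle", "score": 0.42},
]
accepted, review = split_by_confidence(results, threshold=0.7)
```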

Benefits

Automatically categorizing content based on a standard taxonomy means the content you aggregate can be easily segmented by topic and specific area of interest. It also means you can eliminate the all-too-prevalent problem of over-used tags causing stories to get lost in an incorrect category or area of your site/app.

3. Entity & Concept Extraction

News stories contain a wealth of mentioned entities and values that can provide some really interesting and important information on a piece of text. The challenge is mining this information, particularly at scale. To best aggregate and segment news content, you want to know the who, the what and the how much from each and every article.

Entity Extraction extracts named entities (people, organizations, products and locations) and values (URLs, emails, telephone numbers, currency amounts and percentages) mentioned in a body of text or web pages.

 

Concept Extraction extracts named entities mentioned in a document, disambiguates and cross-links them to DBpedia and Linked Data entities, along with their semantic types (including DBpedia and schema.org types).

Concept extraction disambiguates similarly named entities. Take Amazon for example. If it is mentioned in an article, is it referring to the commerce giant or the rainforest? The last thing you want is to recommend an article about the environment to your tech readers who have just read about Amazon’s latest tablet release!

Concept extraction analyzes the content around the word and through Machine Learning and Natural Language Processing (NLP) techniques, performs the disambiguation.
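A toy version of context-based disambiguation helps show the intuition. This sketch simply counts how many context words overlap with a hand-written word list for each sense; it is a much cruder technique than the Machine Learning approach used by the endpoint, and the `SENSES` lists are invented for the example.

```python
import re

# Hand-written context words for each sense (purely illustrative).
SENSES = {
    "Amazon (company)": {"tablet", "retail", "kindle", "shipping", "commerce"},
    "Amazon (rainforest)": {"rainforest", "river", "deforestation", "brazil", "species"},
}

def disambiguate(text, senses):
    """Pick the sense whose context words overlap most with the words in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return max(senses, key=lambda sense: len(senses[sense] & words))

tech = disambiguate("Amazon unveils a new Kindle tablet for retail customers", SENSES)
nature = disambiguate("Deforestation in the Amazon threatens river species", SENSES)
```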

 

Benefits

By extracting entities and concepts you can produce a rich tagging system to assist with content aggregation, recommendation and even ad targeting. You can easily understand which people, places, organizations and brands, for example, are mentioned in the articles you share.

4. Summarization

As a news aggregator, you strive to provide and recommend the best and most relevant content to your readers. You also want to provide them with a snapshot teaser of an article so they can decide whether or not to read it. As content consumers in general, sometimes we want it all and sometimes we just want it fast. Either way, we have you covered.

The Summarization endpoint enables you to generate a summary of an article by automatically selecting key sentences to give a cut-down but reflective overview of the main body of text. You can choose to summarize a piece of text in 1-10 sentences. Depending on your method of distribution you may choose a smaller number of sentences, perhaps for an RSS feed, or a larger number for stories above a certain word count that would take a considerable amount of time to read fully.
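To illustrate the general idea of summarizing by selecting key sentences, here is a simple word-frequency sketch. It is a basic stand-in and not the method behind our Summarization endpoint; the scoring rule, the `summarize` helper and the sample text are assumptions for this example. The sentence count is clamped to the 1-10 range mentioned above.

```python
import re
from collections import Counter

def summarize(text, num_sentences=2):
    """Return the highest-scoring sentences, in their original order."""
    num_sentences = max(1, min(10, num_sentences))
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        # Average frequency of the sentence's words across the whole text.
        words = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[w] for w in words) / max(len(words), 1)

    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    return [s for s in sentences if s in top]

text = (
    "The charger is small. The charger cable frays easily near the charger plug. "
    "Replacement costs money. Many users wrap the cable carefully."
)
summary = summarize(text, num_sentences=2)
```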

As an example, we have taken a story about MacBook chargers from The Business Insider and produced the following summary:

 

 

Without reading the actual full article or seeing the headline you can probably establish what this article is about after reading the 5 key sentences above, which only takes around 30 seconds.

Here’s the original article from The Business Insider.

Benefits

The Summarization endpoint provides an intelligent summary of the content you share. This is particularly useful when, for example, providing snapshot teasers to your readers or providing a reflective overview of stories with larger word counts.

Conclusion

We hope this post gives you some inspiration and helps you to understand the various Text Analysis features available to you and how they can help you with your news aggregation efforts.

 









Product

Introduction

Our News API is much more than just a news aggregator. It collects news based on a variety of different search criteria: keywords, entities, categories, sentiment and so on. However, it’s the ability to index and analyze the content sourced that makes the News API extremely powerful.

Whether you’re pushing data from the News API into an app, resurfacing the analysis in a news feed or building intuitive dashboards with the data extracted, the News API allows you to get a deep understanding of what is happening in the news on a near-real-time basis.

Our News API has a variety of analysis-focused endpoints that allow our users to make sense of the data extracted from news content in a meaningful way. In this blog we’ll talk you through some of the features/endpoints that can help you dive into the stories you collect with the News API and the data extracted. We’ll introduce the endpoints that our News API users leverage to:

  • Track events over time
  • Spot trends in news content
  • Compare sources’ and authors’ opinions
  • Monitor coverage of the same or similar stories across the web

Leveraging time stamped data: /time_series

Simply put, a Time Series is a sequence of data points plotted over a period of time. Our Time Series endpoint is used when you want to analyze a set of data points relative to a certain timeframe. This makes it easier to analyze timestamped data and means the data can be easily visualized in a meaningful way.

We’ve included a simple example below where we’ve used the Time Series endpoint to understand the volume of stories over a given time period. You’ll also notice in the example that we’ve used the Time Series endpoint to track how the polarity or sentiment of stories changes over time, not just the volume.
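Conceptually, a time series like this boils down to grouping timestamped stories into periods. The sketch below does this per day for both volume and average polarity. The `published_at` and `polarity` field names and the numeric polarity values are hypothetical and chosen only for illustration, so they may not match the real response format.

```python
from collections import defaultdict

def daily_series(stories):
    """Return a sorted list of (day, story count, average polarity) tuples."""
    buckets = defaultdict(list)
    for story in stories:
        day = story["published_at"][:10]  # keep the YYYY-MM-DD part of the timestamp
        buckets[day].append(story["polarity"])
    return [(day, len(vals), sum(vals) / len(vals)) for day, vals in sorted(buckets.items())]

# Made-up stories with a polarity score between -1 (negative) and 1 (positive).
stories = [
    {"published_at": "2016-11-10T09:00:00Z", "polarity": 0.5},
    {"published_at": "2016-11-10T17:30:00Z", "polarity": -0.5},
    {"published_at": "2016-11-11T08:15:00Z", "polarity": 1.0},
]
series = daily_series(stories)
```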

 

 

Our customers are using this to monitor how a topic, an entity or even a particular article is being talked about on social media over a certain time period. It helps them spot stories or topics that might be quickly gathering pace online, hinting at their popularity, importance and virality. We even have users combining the Time Series endpoint with others to identify the best time to post articles of a particular nature, for example, posting Personal Finance articles at the end of a fiscal quarter versus mid-quarter.

Visit our documentation for more info on the Time Series endpoint.

Working with numerical data points and metrics: /histograms

A histogram is a graph which represents the distribution of numerical data. Our histogram endpoint allows you to get an aggregated profile of a certain metric. It’s up to you which metric you use.
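As a quick illustration of what an aggregated profile of a metric means, this sketch buckets article word counts into fixed-width intervals. The word counts, the interval width and the `histogram` helper are made up for the example and do not reflect the endpoint's actual parameters.

```python
from collections import Counter

def histogram(values, width):
    """Count how many values fall into each interval, keyed by the interval's lower bound."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))

# Made-up article lengths in words, bucketed into 250-word intervals.
word_counts = [120, 180, 450, 520, 560, 990]
profile = histogram(word_counts, width=250)
```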

 

 

Above we’ve shared two sample graphs we built with data extracted using the Histogram endpoint. We’ve graphed the number of social shares per story and the length of articles in words sourced through the API.

Our News API customers use this feature to uncover insights like which content categories are most popular across social media or how many words an author tends to use when writing about a certain topic, which is useful if you’re sending them a news tip, for instance.

Visit our documentation for more info on the Histograms endpoint.

Uncovering useful insights: /trends

The Trends endpoint is designed to make it easier to identify the most frequently appearing entities, keywords and topical or sentiment-related categories, meaning you can analyze how often something occurs in the content you source through the API.

You can use the Trends endpoint to monitor the frequency of Categories, Entities and Keywords in the results sourced from the API. For example, you would use the Trends endpoint if you wanted to identify the most mentioned entities in a collection of news articles or the frequency of content categories in a collection of articles.
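The kind of frequency count described here can be sketched locally with a simple counter. The `entities` field and the sample stories are hypothetical; each entity is counted once per story so that a single article repeating a name does not skew the result.

```python
from collections import Counter

def top_entities(stories, n=3):
    """Return the n most frequently mentioned entities as (entity, story count) pairs."""
    counts = Counter(entity for story in stories for entity in set(story["entities"]))
    return counts.most_common(n)

# Made-up stories, each with a list of extracted entities.
stories = [
    {"entities": ["Samsung", "Apple"]},
    {"entities": ["Samsung", "NFL"]},
    {"entities": ["Samsung", "Apple", "Apple"]},
]
most_mentioned = top_entities(stories, n=2)
```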

 

 

This endpoint enables our News API users to get a better understanding of topics and mentions of people, organizations, and places in news stories, as shown in the example above. It also means you can conduct distribution or quantitative focused analyses on say, the breakdown of one topic or category to another, or the overall sentiment of articles, as shown in the pie charts below.

Visit our documentation for more info on the Trends endpoint.

Story Coverages and finding related articles: /coverages & /related_stories

The Coverages and Related Stories endpoints provide a 360-degree view of the news reactions to a story. They are designed to provide an easy way for our News API users to source news articles covering the same story. Our users utilize both endpoints in different ways and for a variety of different reasons.

Coverages: Coverages allows you to understand the reach an article has from a news coverage point of view. Our PR-focused users utilize this endpoint to get an understanding of how well a press release is performing, based on the number of Coverages it’s had.

Visit our documentation for more info on the Coverages endpoint.

Related Stories: Related Stories looks for semantically similar articles, which are those articles that might be covering the same story or dealing with the same topic. It provides an understanding of how a news story is breaking and the overall reach of a story.

When combined with other parameters like location, source or author, it can be used to compare how coverage of the same story might differ between geographical regions or authors’ angles.
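To illustrate what ranking articles by similarity can mean in its simplest form, here is a bag-of-words cosine similarity sketch. It only matches on shared words, so it is far cruder than the semantic similarity used by the Related Stories endpoint; the helper names and sample headlines are invented.

```python
import math
import re
from collections import Counter

def bow(text):
    """Bag-of-words representation: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 when either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def related(query, candidates, k=1):
    """Return the k candidates most similar to the query text."""
    q = bow(query)
    return sorted(candidates, key=lambda c: cosine(q, bow(c)), reverse=True)[:k]

headlines = ["Samsung announces phone recall", "NFL quarterback retires"]
best = related("Samsung recalls phone", headlines, k=1)
```

Note that "recalls" and "recall" do not match here, which is one reason a semantic approach is preferable to plain word overlap.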

Note: Both the Related Stories and Coverages endpoints support GET and POST methods. This means you can provide a URL or raw text as your input to the News API.

Our API users use this feature to identify related articles to those sourced in their search or from existing text or news articles by passing a URL or the raw text of an article.

As you can see below, we’ve used the Related Stories endpoint to source related news content based on the text of a tweet. The API provides a number of semantically similar stories in its results.

Input:

 

Output:

 

Visit our documentation for more info on the Related Stories endpoint.

We hope this gives you a better understanding of how our News API can provide you with an intelligent way of sourcing and analyzing news content at scale.

 







