
Categorizing Media Output Using the IAB-QAG Taxonomy – June Media Roundup

Extracting insights from millions of articles at once can create a lot of value, since it lets us understand what information thousands of journalists are producing about what’s happening in the world. But extracting accurate insights depends on filtering out noise and finding relevant content. To give our users access to relevant content, our News API analyzes thousands of news articles in near real-time and categorizes them according to what their content is about.

Having content at web-scale arranged into categories provides accurate information about what the media are publishing as the stories emerge. This allows us to do two things, depending on what we want to use the API for: we can either look at a broad picture of what is being covered in the press, or we can carry out a detailed analysis of the coverage about a specific industry, organization, or event.

For this month’s roundup, we decided to do both. First we’ll look at which news categories the media covered the most and what the content in those categories was about, and then we’ll pick one category for a more detailed look. We’ll start with a high-level look at sports content, because it’s what the world’s media wrote the most about, and then we’ll dive into stories about finance to see what insights the News API can produce for us in a business field.

The 100 categories with the highest volume of stories

The range of subject matter in the content published every day is staggering, which makes understanding all of this content at scale particularly difficult. However, classifying new content according to well-known, industry-standard taxonomies means it can be easily categorized and understood.

Our News API categorizes every article it analyzes according to two taxonomies: the Interactive Advertising Bureau’s QAG taxonomy and IPTC’s Newscodes. We chose to use the IAB-QAG taxonomy, which contains just under 400 categories and subcategories, and decided to look into the top 100 categories and subcategories that the media published the most about in June. This left us with just over 1.75 million stories that our News API had gathered and analyzed.
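As a rough illustration of the ranking step described above, the sketch below counts stories per category and keeps the most common ones. The record shape (a dict with a `categories` list of labels) and the sample stories are hypothetical stand-ins for illustration, not the News API's actual response format.

```python
from collections import Counter


def top_categories(stories, n=100):
    """Return the n most frequent category labels with their story counts."""
    counts = Counter()
    for story in stories:
        # set() so a story tagged twice with the same label is counted once
        counts.update(set(story.get("categories", [])))
    return counts.most_common(n)


# Hypothetical sample records, for illustration only
sample_stories = [
    {"title": "Cup final preview", "categories": ["Sports", "Soccer"]},
    {"title": "Transfer rumours", "categories": ["Sports", "Soccer"]},
    {"title": "Draft night recap", "categories": ["Sports", "Baseball"]},
    {"title": "Festival line-up", "categories": ["Music"]},
]

print(top_categories(sample_stories, n=2))
```

With real data, the same counting would run over the stories returned for the month rather than a hand-written list.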

Take a look at the most popular ones in the visualization below.

Note: you can interact with all of the visualizations on this blog – click on each data point for more information, and exclude the larger data points if you want to see more detail on the smaller ones.



As you can see, sport accounted for the most stories published in June. It might not surprise people to see that the media publish a lot about sport, but the details you can pick out here are pretty interesting – like the fact that there were more stories about soccer than about food, religion, or fashion last month.

The chart below puts the volume of stories about sports into perspective – news outlets published almost 13 times more stories about sports than they did about music.


What people wrote about sports

Knowing that people wrote so much about sport is great, but we still don’t know what people were talking about in all of this content. To find this out, we decided to dive into the stories about sports and see what the content was about – take a look at the chart below showing the most-mentioned sports sub-categories last month.

In this blog we’re only looking into stories that were in the top 100 sub-categories overall, so if your favourite sport isn’t listed below, that means it wasn’t popular enough and you’ll need to query our API for yourself to look into it (sorry, shovel racers).



You can see how soccer dominates the content about sport, even though it’s off-season for every major soccer league. To put this volume in perspective, there were more stories published about soccer than about baseball and basketball combined. Bear in mind, last month saw the MLB Draft and the NBA finals, so it wasn’t exactly a quiet month for either of these sports.

We then analyzed the stories about soccer with the News API’s entities feature to see what people, countries, and organisations people were talking about.
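The entity analysis described above boils down to filtering stories by category and counting the entities they mention. Here is a minimal sketch, assuming each story has already been annotated with `categories` and `entities` lists; those field names and the placeholder entities are hypothetical and chosen only for illustration.

```python
from collections import Counter


def entity_counts(stories, category):
    """Count entity mentions across stories tagged with the given category."""
    counts = Counter()
    for story in stories:
        if category in story.get("categories", []):
            counts.update(story.get("entities", []))
    return counts


# Hypothetical sample records with placeholder entity names
sample_stories = [
    {"categories": ["Soccer"], "entities": ["Club A", "Country X"]},
    {"categories": ["Soccer"], "entities": ["Club A"]},
    {"categories": ["Baseball"], "entities": ["Club B"]},
]

soccer_entities = entity_counts(sample_stories, "Soccer")
print(soccer_entities.most_common(1))
```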



If you check the soccer schedules for June, you’ll see that the only major tournament taking place was the Confederations Cup, which is a competition between international teams. However, you can see above that the soccer coverage was still dominated by stories about the clubs with the largest fan bases. The most-mentioned clubs above also top the table in a Forbes analysis of clubs with the greatest social media reach among fans.

Finance

So we’ve just taken a look at which people and organizations dominated the coverage in the news category that the media published the most about. But even though the sports category is the single most popular one, online content is so wide-ranging that sports barely accounted for 10% of the 1.75 million stories our News API crawled last month.

We thought it would be interesting to show you how to use the API to look into business fields and spot a high-level trend in the news content last month. Using the same analysis that we used on sports stories above, we decided to look at stories about finance. Below is a graph of the most-mentioned entities in stories published in June that fell into the finance category.



You can see that the US and American institutions dominate the coverage of the financial news. This is hardly surprising, considering America’s role as the main financial powerhouse in the world. But what sticks out a little here is that the Yen is the only currency entity mentioned, even though Japan isn’t mentioned as much as other countries.

To find out what kind of coverage the Yen was garnering last month, we analyzed the sentiment of the stories with “Yen” in the title to see how many contained positive, negative, or neutral sentiment.
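A sketch of that tally is below, assuming each story already carries a sentiment label of "positive", "negative", or "neutral". The field names and sample titles are made up for illustration and are not real stories or real API output.

```python
from collections import Counter


def title_sentiment_breakdown(stories, term):
    """Tally sentiment labels for stories whose title contains term (case-insensitive)."""
    needle = term.lower()
    return Counter(
        story["sentiment"]
        for story in stories
        if needle in story["title"].lower()
    )


# Hypothetical sample records, for illustration only
sample_stories = [
    {"title": "Yen slides after weak data", "sentiment": "negative"},
    {"title": "Yen steady ahead of meeting", "sentiment": "neutral"},
    {"title": "Dollar gains against the yen", "sentiment": "negative"},
    {"title": "Stocks close higher", "sentiment": "positive"},
]

breakdown = title_sentiment_breakdown(sample_stories, "Yen")
print(breakdown)
```

Note that this is a plain substring match, so a real query would probably want word-boundary matching to avoid catching unrelated words that happen to contain "yen".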



We can see that there is much more negative coverage here than positive coverage, so we can presume that Japan’s currency had some bad news last month. But that leaves us with a big question: why was there so much negative press about the Yen last month?

To find out, we used the keywords feature. Analyzing the keywords in stories returns more detailed information than the entities feature we used on the soccer content above. Whereas the entities feature returns accurate information about the places, people, and organisations mentioned in stories, the keywords feature also includes the most important nouns and verbs in these stories, so we can see a more detailed picture of the things that happened. That extra detail makes it best suited to diving into a specific topic. If you use it to get an overview of a broad set of news content, you’ll get a lot of noise.
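To make the difference between the two features concrete, the sketch below counts keywords across a set of stories and then keeps only those that never appear as an entity, which is the extra detail that keywords add. The `keywords` and `entities` fields and the sample values are hypothetical, for illustration only.

```python
from collections import Counter


def keyword_counts(stories):
    """Count keyword mentions across stories, ignoring case."""
    counts = Counter()
    for story in stories:
        counts.update(kw.lower() for kw in story.get("keywords", []))
    return counts


def non_entity_keywords(stories):
    """Return counts for keywords that never appear as an entity in these stories."""
    entities = {e.lower() for s in stories for e in s.get("entities", [])}
    return {kw: n for kw, n in keyword_counts(stories).items() if kw not in entities}


# Hypothetical sample records, for illustration only
sample_stories = [
    {"entities": ["Japan", "Yen"], "keywords": ["Japan", "Yen", "growth", "data"]},
    {"entities": ["Yen"], "keywords": ["Yen", "GDP", "growth"]},
]

print(non_entity_keywords(sample_stories))
```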

Take a look below at the most-mentioned keywords from stories that were talking about the Yen last month.



You can see that the keywords feature returns a different kind of result than entities: words like “year,” “week,” and “investor,” for example. If we looked at the keywords from all of the news content published in June, it would be hard to get insights because the keywords would be so general. But since we’re diving into a defined topic, we can extract some detailed insights about what actually happened.

Looking at the chart above, you can probably guess for yourself what the main stories about the Yen last month involved. The most-mentioned keywords include “data,” “growth,” “GDP,” and “economy,” which tells us that Japan released some negative data about economic growth, and that explains the high volume of negative stories about the Yen. You can see below how the Yen started a sustained drop in value after June 15th, the day this economic data was announced, and our News API has tracked the continued negative sentiment.

[Chart: Yen to USD]
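Tracking sentiment over time, as mentioned above, can be sketched by grouping stories by publication day and computing the share labelled negative. The `published` and `sentiment` fields and the sample values below are hypothetical and do not reflect the actual June figures.

```python
from collections import defaultdict


def negative_share_by_day(stories):
    """Map each publication day to the fraction of that day's stories labelled negative."""
    totals = defaultdict(int)
    negatives = defaultdict(int)
    for story in stories:
        day = story["published"]  # month-day string such as "06-15"
        totals[day] += 1
        if story["sentiment"] == "negative":
            negatives[day] += 1
    # sorted() keeps the days in chronological order for month-day strings
    return {day: negatives[day] / totals[day] for day in sorted(totals)}


# Hypothetical sample records, for illustration only
sample_stories = [
    {"published": "06-15", "sentiment": "negative"},
    {"published": "06-14", "sentiment": "negative"},
    {"published": "06-14", "sentiment": "neutral"},
    {"published": "06-15", "sentiment": "negative"},
]

print(negative_share_by_day(sample_stories))
```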

These are just a couple of examples of steps our users take to automatically extract insights from content on subjects that interest them, whether it is for media monitoring, content aggregation, or any of the thousands of use cases our News API facilitates.

If you can think of any categories you’d like to extract information from using the News API, sign up for a free 14-day trial by clicking on the link below (free means free – you don’t need a credit card and there’s no obligation to purchase).




News API - Sign up




Author



Will Gannon

Content Marketing @ AYLIEN

A Classics graduate from UCD, Will is on our Content Marketing Team here at AYLIEN. Before joining us, Will worked in research before completing a Master’s in Digital Humanities at Trinity College, where he used NLP methods to index where Latin terms appear in English Literature.