Analyze News Content in RapidMiner with the AYLIEN News API

Access to news content at scale is an extremely useful resource for anyone analyzing business, social, or economic trends. But to extract valuable insights from this content, we often need to build analysis tools that help us understand it.

For everyone who needs a simple, end-to-end solution for this complex task, we’ve put together a fully functional example of a RapidMiner process that sources data from the AYLIEN News API and analyzes it using some of RapidMiner’s operators.

What can you do with the News API in RapidMiner?

With news content now accessible at web scale, data scientists are constantly creating new ways to generate value with insights from news content that were previously almost impossible to extract. Every month, our News API gathers millions of stories in near-real time, analyzes every news story as it is published, and stores each of them along with dozens of extracted data points and metadata.

Equipped with this structured data about what the world’s media is talking about, RapidMiner users can leverage the extensive range of tools the studio has to offer, including:

  • 1,500+ built-in operations & Extensions to dive into your data
  • 100+ data modelling & machine learning operators
  • Advanced visualization tools.

Using a 3-D scatter plot to visualize news data with four variables in RapidMiner Studio

How do I get started with the News API process?

For this blog post, we’re going to showcase an example of how you can use our News API within RapidMiner to build useful content analysis processes that aggregate and analyze news content with ease. We’ve picked a fun little example that analyzes articles from TechCrunch and builds a classification model to predict which reporter wrote any new article it is shown (protip: you can use the same model to pick which TechCrunch journalist you should target for your pitch!). We hope this post sparks some creative ideas and use cases around combining RapidMiner and our News API.

This sample process consists of two main steps:

  1. Gathering enriched news content from the News API using the Web Mining extension
  2. Building a classification model by using RapidMiner’s Naive Bayes operator.

If you are unfamiliar with RapidMiner, there are some great introductory videos and walkthroughs for beginners on their YouTube channel.

So let’s get started!

We’ve made it really easy to get started with the News API and RapidMiner: download this pre-built process and open it with RapidMiner. Next, grab your credentials for the News API by signing up for our free two-week trial.

Once you’ve downloaded the process and opened it up with RapidMiner, you’ll see the main operators outlined in the Process tab. There are seven operators in total: the first three gather data from the News API, while the last four train the classifier.

Screenshot (853)

To make your first calls to the News API, you first need to build your search criteria. To build your News API query, click on the Set Macros operator in the top left of your console. Once you’ve selected the operator, clicking the Edit List button in the Parameters tab will show you the list of parameters for your News API query. Enter the API credentials (your API key and application ID) that you obtained from the News API developer portal when you signed up, then configure your search parameters. Check out the full list of query parameters in our News API documentation to build your search query.

Screenshot (865)

The purpose of this post is to build a classifier that will predict which TechCrunch journalist wrote an article. To do this, we first need to teach the model by gathering relevant training data from TechCrunch. To get this data, we built a query that searched for every article published on the site in the past 30 days and returned the author of each one, along with the contents of the articles they wrote. The News API can return up to 100 results at a time, but since we wanted more than 100 articles, we used pagination to iterate over the search results for five pages, giving us 500 results. You can see what query we used in the screenshot above.
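If it helps to see the pagination idea outside of RapidMiner, here is a minimal Python sketch of the loop described above. It is not the process itself: the parameter names in `BASE_QUERY`, the cursor-style paging, and the `fetch_page` callable are all assumptions made for illustration, so check the News API documentation for the exact names before using anything like this.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Assumed parameter names for illustration only; confirm them in the
# News API documentation before relying on them.
BASE_QUERY: Dict[str, object] = {
    "source.domains[]": "techcrunch.com",
    "published_at.start": "NOW-30DAYS",
    "per_page": 100,  # the post states 100 results is the per-request maximum
}

# A page is (stories, cursor for the next page or None when exhausted).
Page = Tuple[List[dict], Optional[str]]


def collect_stories(fetch_page: Callable[[Dict[str, object]], Page],
                    max_pages: int = 5) -> List[dict]:
    """Request up to max_pages pages and return all stories in one list.

    fetch_page is a hypothetical callable that performs one API request
    for the given parameters and returns (stories, next_cursor).
    """
    stories: List[dict] = []
    cursor: Optional[str] = "*"  # assumed value meaning "start at the first page"
    for _ in range(max_pages):
        params = dict(BASE_QUERY, cursor=cursor)
        page, cursor = fetch_page(params)
        stories.extend(page)
        if not page or cursor is None:
            break  # nothing left to fetch
    return stories
```

With five pages of 100 results each, `collect_stories` returns 500 stories, which matches the size of the training set used in this post. Passing the request function in as an argument keeps the loop easy to test without touching the network.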

Importantly, after you have defined these parameters in the Set Macros operator, you’ll need to make the same changes by editing the query list in the Get Page operator within the Process Loop. To do this, double-click on the Loop icon in the Process tab, then double-click the Get Page icon, and select the Edit List button next to Query Parameters.

Screenshot (855)

When you’re entering the parameters, be sure to enter every parameter you entered in the previous window and follow the convention already set in the list (entering the parameter in the “%{___}” format).
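To make the “%{___}” convention a little more concrete, the sketch below imitates the substitution in Python. RapidMiner does this for you internally, so this is only an illustration of the idea; the macro names `domain` and `per_page` and the query keys are hypothetical.

```python
import re

# Matches placeholders written as %{name}.
MACRO = re.compile(r"%\{([^}]+)\}")


def expand_macros(template: str, macros: dict) -> str:
    """Replace each %{name} with macros[name]; unknown names are left untouched."""
    return MACRO.sub(lambda m: str(macros.get(m.group(1), m.group(0))), template)


# Hypothetical macro names, standing in for the values set in Set Macros.
macros = {"domain": "techcrunch.com", "per_page": 100}

# Hypothetical query list, standing in for the one edited in Get Page.
query = {"source.domains[]": "%{domain}", "per_page": "%{per_page}"}

expanded = {key: expand_macros(value, macros) for key, value in query.items()}
```

The point of the convention is that each value is defined once in Set Macros and only referenced by name in Get Page, so a parameter missing from either list is simply never sent with the request.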

News API Results

Once you have defined your parameters in both lists, hit the Run (play) button at the top of the console and let RapidMiner run your News API query. Once it has finished running, you can view the results in the Results window. Below you can see a screenshot of the enriched results with the dozens of data points that the News API returns.

Screenshot (877)

Having access to this enriched news content in RapidMiner allows you to extract useful insights from this unstructured data. After running the analysis, you can browse the search results using simple visualizations of data points such as sentiment or, as in the graph below, authorship, which shows us which authors published the most articles in the time period we set.

Screenshot (883)
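The authorship chart above is essentially a count of articles per author. As a rough sketch of the same tally in Python, assuming each story is a dictionary with an `author` object that has a `name` field (an assumption about the result structure, not something shown in this post):

```python
from collections import Counter


def articles_per_author(stories):
    """Count stories per author name, skipping stories with no author."""
    # `or {}` guards against stories whose author field is missing or None.
    names = ((story.get("author") or {}).get("name") for story in stories)
    return Counter(name for name in names if name)
```

Calling `articles_per_author(stories).most_common(10)` would then list the ten most prolific authors in the result set.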

 

Training a Classifier

For the sample analysis in this blog, we’re building a classifier using RapidMiner’s Naive Bayes operator.

Naive Bayes is a common Machine Learning algorithm for data classification. You can read more about it in the explainer for novices we wrote on this blog, which talks you through how the algorithm works. Essentially, this classifier will guess which author a new article belongs to by learning from features in the training data, which here is the news content we retrieved from the News API. By analyzing the most common features in each author’s articles, the model learns which words and phrases are more likely to appear in articles by each author.
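For readers who like to see the idea in code, here is a small multinomial Naive Bayes sketch in plain Python. It is not RapidMiner’s operator, and details such as the tokenizer and the add-one smoothing are choices made for this illustration only.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """A minimal multinomial Naive Bayes text classifier."""

    def fit(self, texts, authors):
        self.word_counts = defaultdict(Counter)  # author -> word -> count
        self.doc_counts = Counter(authors)       # author -> number of articles
        for text, author in zip(texts, authors):
            self.word_counts[author].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_scores(self, text):
        """Return log P(author) + sum of log P(word | author) for each author."""
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for author, n_docs in self.doc_counts.items():
            counts = self.word_counts[author]
            # Add-one smoothing so unseen words never give a zero probability.
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            scores[author] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)
```

Trained on a handful of articles per author, a model like this leans towards whichever author has used the article’s words most often, which is the behaviour described above.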

For example, take a look below at how our classifier has learned which writers are most likely to talk about ‘cryptocurrency’. You can test your classifier by selecting the Attribute button in the top left corner.

Screenshot (879)

Results

Once the process has run in full, it will have retrieved and processed the news content and trained a Naive Bayes classifier that, given the body of an article, tries to predict its likely author from among all TechCrunch journalists.

RapidMiner will also evaluate this classifier for us on a held-out subset of the data we retrieved from the News API. It compares the true labels (known authors) to the model’s predictions (predicted authors) on the test set and gives us an accuracy score and a confusion matrix:


Screen Shot 2017-12-18 at 5.29.56 PM
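The two evaluation outputs mentioned above are simple to compute by hand. The sketch below is an illustration in Python rather than RapidMiner’s implementation: accuracy is the share of test articles whose predicted author matches the known author, and the confusion matrix counts every (known author, predicted author) pair.

```python
from collections import Counter


def evaluate(true_labels, predicted_labels):
    """Return (accuracy, confusion), where confusion[(true, predicted)] is a count."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("true_labels and predicted_labels must have the same length")
    confusion = Counter(zip(true_labels, predicted_labels))
    # Correct predictions sit on the diagonal, where true == predicted.
    correct = sum(n for (true, pred), n in confusion.items() if true == pred)
    accuracy = correct / len(true_labels) if true_labels else 0.0
    return accuracy, confusion
```

Off-diagonal entries are the interesting ones: a large count for a pair of different authors suggests the model confuses the two, often because they cover similar topics.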

There are many ways to improve the performance of this classifier, for example by using a more advanced classification algorithm like SVM instead of Naive Bayes. In this post, our goal was to show you how easy it is to retrieve news content from our News API and load it into RapidMiner for further analysis and processing.

Things to try next:

  • Try changing your News API query to repeat this process for journalists from a different news outlet
  • Try using a more powerful algorithm such as SVM or Logistic Regression (RapidMiner includes implementations for many different classifiers, and you can easily replace them with one another)
  • Try to apply a minimum threshold on the number of articles that must exist for each author that the model is trained on
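As a starting point for the last suggestion, a minimum-article threshold could look something like the sketch below. It assumes the training data is a list of (article text, author) pairs, and the default of 10 articles is an arbitrary value chosen for illustration.

```python
from collections import Counter


def filter_by_min_articles(examples, min_articles=10):
    """Keep only (text, author) pairs whose author has at least min_articles examples."""
    counts = Counter(author for _, author in examples)
    return [(text, author) for text, author in examples if counts[author] >= min_articles]
```

Dropping authors with only a few articles shrinks the training set, but it also removes classes the model has too little data to learn.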

This process is just one simple example of what RapidMiner’s analytic capabilities can perform on enriched news content. By running the first three operators on their own, you can take a look at the enriched content that the News API generates and begin to leverage RapidMiner’s advanced capabilities on an ever-growing dataset of structured news data.

To get started with a free trial of the News API, click on the link below, and be sure to check back on our blog over the coming weeks to see some sample analyses and walkthroughs.




News API - Sign up




 
