In this tutorial we’re going to walk you through using the “Text Analysis by AYLIEN” Extension for RapidMiner to build a “News Analyzer” that monitors and analyzes articles from one or more RSS feeds.
If you’re new to RapidMiner, or it’s your first time using the Text Analysis Extension, you should read Part 1 of our Getting Started Blog, which takes you through the installation process. Also, if you haven’t got an AYLIEN account, which you’ll need to use the Extension, you can grab one here.
Prefer to watch a video tutorial? Click here.
So, here’s what we’re going to do:
- Monitor an RSS feed and collect article updates using the Read RSS Feed Operator
- Extract the main body of text and the title from each article with the Extract Article Operator
- Analyze and categorize these articles using the Categorize Operator
- Extract Entities from the articles, such as mentions of People, Places and Organizations, using the Extract Entities Operator
- Visualize our results and make them more consumable and understandable
Please note: This tutorial assumes you have the Web Mining Extension for RapidMiner installed. You can download and install the Extension through the RapidMiner Marketplace.
Collecting your articles
The first step in building our News Analyzer is to add a Read RSS Feed Operator to our Process. When you add the Read RSS Feed Operator, you need to specify which RSS feed you want to monitor by adding its URL in the RSS feed input, and set your timeout counters. We’ve kept the default values.
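If you’re curious what this step amounts to outside RapidMiner, here is a minimal Python sketch. It is not how the Read RSS Feed Operator is implemented; it only shows the idea of turning a feed into one row per article. The feed content, the example.com URLs and the function name `read_rss_items` are made up for illustration, and the feed is an inline string so the example runs without a network connection.

```python
import xml.etree.ElementTree as ET

# A tiny made-up RSS 2.0 document standing in for a real feed response.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First story</title>
      <link>http://example.com/first</link>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/second</link>
    </item>
  </channel>
</rss>"""


def read_rss_items(rss_text):
    """Return one dict per <item> element, holding its title and link."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        })
    return items


for row in read_rss_items(SAMPLE_RSS):
    print(row["title"], "->", row["link"])
```

Each dict plays the role of one row in the ExampleSet that the Operator produces, with the link being the URL passed on to the next step.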
Extracting the articles and titles
To extract the relevant pieces of text from the URLs collected we can use the Extract Article Operator. This will pull the main body of text, the title and any image present directly from the URL.
To prepare our extracted text for analysis we use a Data to Document Operator. This will transform the dataset of text into a collection of documents, making it easier to categorize.
As part of this process, you need to specify which column(s) in the ExampleSet contain the text you want to create a Document from.
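Conceptually, this conversion just picks the text columns you name and turns each row into a single document. The Python sketch below illustrates that idea under our own assumptions: the column names (`title`, `article`, `image`), the sample rows and the helper `rows_to_documents` are hypothetical, and joining the columns with a newline is our choice, not necessarily what the Operator does internally.

```python
def rows_to_documents(rows, text_columns):
    """Join the chosen text columns of each row into one document string."""
    documents = []
    for row in rows:
        # Skip columns that are missing or empty for this row.
        parts = [str(row[col]) for col in text_columns if row.get(col)]
        documents.append("\n".join(parts))
    return documents


# Made-up rows, shaped like the output of the extraction step.
rows = [
    {"title": "Rates rise", "article": "The central bank raised rates.", "image": "a.png"},
    {"title": "Cup final", "article": "", "image": None},
]

docs = rows_to_documents(rows, ["title", "article"])
for doc in docs:
    print(repr(doc))
```

Columns you don’t list, such as the image column here, are simply left out of the documents.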
The first thing we’re going to do with the extracted text is try to get a high-level understanding of what it’s about by categorizing it based on a particular taxonomy, in this case the IAB QAG taxonomy.
We’ll then add a Document to Data Operator, which transforms our documents back into a dataset, making it a lot more manageable.
The last piece of analysis we’ll do on the text is to extract any mention of an Entity (Keywords, People, Places, Organizations, % values, $ values etc.) using an Extract Entities Operator.
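To give a feel for what “% values” and “$ values” look like as extracted Entities, here is a deliberately simple Python sketch based on regular expressions. This is a stand-in for illustration only and is not the method the Extract Entities Operator uses; People, Places and Organizations need far more than pattern matching. The patterns, the function name `extract_value_entities` and the sample sentence are all our own assumptions.

```python
import re

# Simple illustrative patterns: "4.5%" and "$1,200,000" or "$3.99".
PERCENT_RE = re.compile(r"\d+(?:\.\d+)?%")
MONEY_RE = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?")


def extract_value_entities(text):
    """Return the percentage and dollar values found in the text."""
    return {
        "percentages": PERCENT_RE.findall(text),
        "money": MONEY_RE.findall(text),
    }


text = "Shares rose 4.5% after the firm reported revenue of $1,200,000."
print(extract_value_entities(text))
```

Each key in the returned dict corresponds roughly to one Entity column in the results.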
The entire Process will look like the one below when it’s fully built:
Running the Process is simple: just hit the play button. Your results will be displayed in an ExampleSet tab like the one below. Each row will contain the extracted text and title, its appropriate categories, and any Entities that were extracted, separated out into columns.
Visualizing your results
RapidMiner lets you display and visualize the results of your Process really easily using simple charts and visualizations like the ones below, which can all be created using the Charts widget on the left-hand side of your results display.
We put together a simple pie-chart below visualizing the categories of the articles extracted with our News Analyzer.
Pie-chart showing IAB categories of Articles analyzed:
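The numbers behind a pie chart like this are just the share of articles that fall into each category. A minimal Python sketch of that calculation follows; the category labels are placeholders rather than real results from our Process, and `category_shares` is a hypothetical helper name.

```python
from collections import Counter


def category_shares(categories):
    """Map each category label to its fraction of all articles."""
    counts = Counter(categories)
    total = sum(counts.values())
    # most_common() orders the slices from largest to smallest.
    return {label: count / total for label, count in counts.most_common()}


# Placeholder labels, one per analyzed article.
labels = ["Sports", "Business", "Sports", "Technology"]
shares = category_shares(labels)
for label, share in shares.items():
    print(f"{label}: {share:.0%}")
```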
For the Data Junkies among you, however, you may want to export your results and visualize them with another tool such as Tableau. An integration for Tableau is on the way, by the way.
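If you go down the export route, CSV is a format most visualization tools can read. As a rough sketch of what such an export contains, the Python below serializes result rows to CSV text. The column names, the sample row and the helper `results_to_csv` are assumptions for illustration and do not describe RapidMiner’s own export feature.

```python
import csv
import io


def results_to_csv(rows, columns):
    """Serialize result rows to CSV text, keeping only the chosen columns."""
    buffer = io.StringIO()
    # extrasaction="ignore" drops any keys that are not in `columns`.
    writer = csv.DictWriter(buffer, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# A made-up result row with one column we do not want to export.
rows = [{"title": "Cup final", "category": "Sports", "image": "a.png"}]
csv_text = results_to_csv(rows, ["title", "category"])
print(csv_text)
```

To save the output instead of printing it, write `csv_text` to a file opened with `open(path, "w", newline="")`.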
So, there you have it: that’s how you build a News Analyzer Process in RapidMiner using “Text Analysis by AYLIEN” and the Read RSS Feed Operator. We’ve also put together a tutorial guide to analyzing the Sentiment of tweets, which you may also find interesting.
We’ve also created a repository for sample Processes that we’ll be adding to on a regular basis. It will be a collection of use-case-focused RapidMiner Processes that can be downloaded and imported directly into RapidMiner. You can find more info in the documentation section of our website.