Analyzing Text in RapidMiner – Part 2: Rotten Tomatoes movie reviews
Welcome to the second part of our blog series, Analyzing Text in RapidMiner. In the first part, we built a basic setup for analyzing the sentiment of any arbitrary text to find out whether it’s positive, negative or neutral. In this post, we’re going to build a slightly more sophisticated process, which we can use to scrape movie reviews from Rotten Tomatoes and analyze them in RapidMiner.
In a nutshell, we’re going to:
- Scrape movie reviews for The Dark Knight using the Web Mining extension for RapidMiner.
- Run the reviews through AYLIEN Text Analysis API to extract their sentiment.
- Compare the extracted sentiment values with the Fresh or Rotten ratings from reviewers to see if they follow the same pattern.
Step 1: Extract Reviews from Rotten Tomatoes
The Web Mining extension comes with a set of useful tools designed to crawl the web and scrape web pages for information. Here we’re using the Process Documents from Web operator to scrape a review page, and we use XPath queries to extract the text of the reviews:
- First, drag and drop a Process Control > Loop > Loop operator into your process. We will use the Loop operator to scrape multiple pages of reviews, which gives us more reviews to analyze.
- Next, configure the Loop operator to run 5 times, which means we’re going to scrape 5 pages of The Dark Knight reviews – a total of 100 reviews.
- Now double click the Loop operator and add the Web Mining > Process Documents from Web operator, which will fetch the contents of each review page and provide its HTML for further analysis.
- Configure the newly added operator to fetch the Nth page of reviews, where N is the current iteration of the Loop operator. The url parameter should look like this:
- Process Documents from Web exposes a sub-process for further analysis of the page contents. Double click the operator to access this sub-process and add a Text Processing > Transformation > Cut Document operator, which will extract individual reviews from a single review page.
- Configure the Cut Document operator to segment the page using the following XPath query:
- The Cut Document operator will expose each extracted segment in a sub-process, so let’s add a Text Processing > Extraction > Extract Information operator to extract the actual text of the review.
- Now let’s connect everything and run the process to get our 100 reviews.
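For readers who want to see the logic of Step 1 outside RapidMiner, here is a minimal Python sketch of the same idea: build one URL per loop iteration, cut each page into review segments with an XPath query, then extract the text from each segment. The URL template, the class names (`review_table_row`, `the_review`) and the sample markup are hypothetical stand-ins, not the actual Rotten Tomatoes page structure, and the sketch assumes pages are numbered from 1.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample of a review page. The class names are
# illustrative assumptions, not the real Rotten Tomatoes markup.
SAMPLE_PAGE = """
<html><body>
  <div class="review_table_row">
    <div class="the_review">A stunning, brooding crime epic.</div>
  </div>
  <div class="review_table_row">
    <div class="the_review">Overlong and relentlessly grim.</div>
  </div>
</body></html>
"""


def page_urls(template, pages):
    """Mimic the Loop operator: one URL per iteration, numbered from 1."""
    return [template.format(page=n) for n in range(1, pages + 1)]


def extract_reviews(page_markup):
    """Mimic Cut Document + Extract Information on a single page."""
    root = ET.fromstring(page_markup.strip())
    reviews = []
    # "Cut Document": one segment per review row.
    for row in root.findall(".//div[@class='review_table_row']"):
        # "Extract Information": pull the review text out of the segment.
        node = row.find(".//div[@class='the_review']")
        if node is not None and node.text:
            reviews.append(node.text.strip())
    return reviews


if __name__ == "__main__":
    # Placeholder template; substitute the real review-page URL pattern.
    for url in page_urls("https://example.com/reviews?page={page}", 5):
        print(url)
    print(extract_reviews(SAMPLE_PAGE))
```

Real-world HTML is rarely well-formed XML, so a tolerant HTML parser would likely be needed in place of `xml.etree.ElementTree` when working with live pages.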
Step 2: Analyze Reviews using Sentiment Analysis API
Now that we have run the process and we have our reviews, it’s time to send them to Text API’s /sentiment endpoint and see if they are positive, negative or neutral.
- Let’s URL-encode the reviews first. To do that, we’re going to use the Web Mining > Utility > Encode URLs operator.
- Next we’ll send the encoded text to the Text API using the Web Mining > Services > Enrich Data by Webservice operator.
- Now that our reviews are being sent to the Text API, it’s time to run the entire process and analyze these 100 reviews!
As you can see, we get a polarity column that tells us whether each review is positive, negative or neutral.
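The two operators in Step 2 amount to form-encoding the review text and posting it to the /sentiment endpoint. The sketch below shows roughly what that looks like in Python using only the standard library. The endpoint URL and the two header names are assumptions based on how the Text API was commonly called, so check them against your own account documentation; the function only builds the request and does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint and header names; verify against the Text API docs.
SENTIMENT_ENDPOINT = "https://api.aylien.com/api/v1/sentiment"


def build_sentiment_request(text, app_id, app_key, endpoint=SENTIMENT_ENDPOINT):
    """URL-encode a review and wrap it in a POST request (not sent here)."""
    # urlencode escapes spaces, '?', '&' and other reserved characters,
    # which is the job of the Encode URLs operator in the RapidMiner process.
    body = urlencode({"text": text}).encode("utf-8")
    headers = {
        "X-AYLIEN-TextAPI-Application-ID": app_id,    # assumed header name
        "X-AYLIEN-TextAPI-Application-Key": app_key,  # assumed header name
    }
    return Request(endpoint, data=body, headers=headers, method="POST")


if __name__ == "__main__":
    req = build_sentiment_request(
        "Why so serious? A dark & brilliant film.", "YOUR_APP_ID", "YOUR_APP_KEY"
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Passing the returned object to `urllib.request.urlopen` would perform the call. Encoding matters here because an unescaped `&` or `?` in a review would otherwise be read as part of the request syntax rather than as review text.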
Step 3: Extract Freshness scores and compare them to Sentiment values
What we accomplished in Step 2 is cool, but let’s evaluate the results by checking if the sentiment polarity scores match the “Freshness” scores given by Rotten Tomatoes reviewers.
For anyone who doesn’t know, the “Freshness” score on Rotten Tomatoes tells us whether a review is positive (Fresh) or negative (Rotten).
- First things first: add a second XPath query to extract the Freshness score as a boolean value (Fresh = true, Rotten = false).
- Before we can check the data for correlations, we must do a bit of cleanup and pre-processing:
  - Remove the text column after Sentiment Analysis is done, using the Select Attributes operator.
  - Convert the nominal columns, such as fresh, to numerical columns so that, for instance, Polarity=true becomes Polarity_true=1. For that, we’ll use the Nominal to Numerical operator.
- Then we need to add a Modeling > Correlation and Dependency Computation > Correlation Matrix operator, which computes the pairwise statistical correlations between the attributes (columns) of our data.
- Finally, run the process again to produce a table similar to the one below.
What we see in the Correlation Matrix is that polarity_positive has a positive correlation with fresh_true, and polarity_negative has a positive correlation with fresh_false, which means we’ve predicted most of the polarity scores correctly.
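To make the pre-processing and the correlation step concrete, here is a small Python sketch of what Nominal to Numerical (dummy coding) and Correlation Matrix do conceptually. The six rows of data are invented purely for illustration and are not results from the process above; the helper names `one_hot` and `pearson` are our own.

```python
from math import sqrt


def one_hot(values):
    """Dummy-code a nominal column: one 0/1 list per distinct value."""
    return {
        str(v).lower(): [1 if x == v else 0 for x in values]
        for v in sorted(set(values), key=str)
    }


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Note: a constant column has zero variance and would divide by zero.
    return cov / sqrt(var_x * var_y)


# Invented sample data, for illustration only.
polarity = ["positive", "positive", "negative", "positive", "negative", "neutral"]
fresh = [True, True, False, True, False, True]

columns = {f"polarity_{k}": v for k, v in one_hot(polarity).items()}
columns.update({f"fresh_{k}": v for k, v in one_hot(fresh).items()})

if __name__ == "__main__":
    names = sorted(columns)
    for a in names:
        row = "  ".join(f"{pearson(columns[a], columns[b]):+.2f}" for b in names)
        print(f"{a:18s} {row}")
```

A coefficient near +1 between polarity_positive and fresh_true (or between polarity_negative and fresh_false) indicates that the extracted sentiment tends to agree with the reviewers’ ratings; a value near 0 would indicate no linear relationship.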
That’s it: 100 reviews scraped and analyzed using RapidMiner and AYLIEN Text Analysis API. Pat yourself on the back, good job!